A Grammatical Inference Sequential Mining Algorithm for Protein Fold Recognition

Protein fold recognition plays an important role in computational protein analysis, since it can help determine the function of a protein whose structure is unknown. In this paper, a Classified Sequential Pattern mining technique for Protein Fold Recognition (CSPF) is proposed. The CSPF technique consists of two main phases: the sequential pattern mining phase and the fold recognition phase. In the sequential pattern mining phase, which serves as the training phase, the Mix & Test algorithm is developed based on Grammatical Inference. The Mix & Test algorithm minimizes I/O costs by scanning the database only once, discovers subsequence combinations directly from sequences in memory without searching the whole sequence file, requires no database projection, handles gaps, and works with variable-length sequences without having to align them. In addition, a parallelized version of the Mix & Test algorithm is applied to speed up its performance. In the fold recognition phase, unknown protein folds are predicted via a proposed testing function. To test the performance, 36 SCOP protein folds are used, where the accuracy rate is 75.84% for training data and 59.7% for testing data.


I. INTRODUCTION
Protein fold recognition is an important step towards understanding protein three-dimensional structures and their biological functions. Fold recognition techniques do not require similar sequences in the protein databank, just similar folds. Several successful approaches have been applied to protein fold recognition [1]. For example, researchers have predicted protein folds using neural networks, such as GeneThreader [2], TUNE (Threading Using Neural nEtwork) [3], and neural networks with tailored early stopping [4], as well as Bayesian networks [5], structural-pattern-based methods [6], and genetic algorithms [7,8]. Support Vector Machines (SVM) have been used to directly predict the alignment accuracy of a sequence-template alignment [9] and, in a combined technique, as an SVM classifier with Regularized Discriminant Analysis (RDA) [10].
Other research has been performed using Monte Carlo methods [11]. In addition, many researchers have used parallel evolutionary algorithms for protein fold recognition, such as parallel EST, probabilistic roadmap for motion planning, and pRNAPredict for RNA secondary structure [12][13][14][15][16]. However, although significant improvement has been made, the accuracy of the existing methods remains low, and there is a need for new methods contributing to the field of fold recognition.
Sequential mining algorithms have been proposed to predict protein folds. The objective of sequential pattern mining is to discover interesting sequential patterns in a sequence database. It is one of the essential data mining tasks, widely used in many applications, including customer purchase pattern analysis and biological data sequences [17][18][19][20][21][22]. Much research has been devoted to efficient sequential pattern mining [23][24][25], closed and maximal sequential pattern mining [26][27][28][29], constraint-based sequential pattern mining [30][31][32], approximate sequential pattern mining [33], sequential pattern mining in multiple data sources [34], sequential pattern mining in noisy data [35], incremental mining of sequential patterns [36], and time-interval weighted sequential pattern mining [37]. Two of the general sequential mining algorithms are SPADE [24] and PrefixSpan [23], which are more efficient than others in terms of processing time. SPADE is a vertical-format-based algorithm that uses equivalence classes in the mining process. PrefixSpan is a pattern-growth approach: it recursively projects a sequence database into a set of smaller projected sequence databases and grows sequential patterns in each projected database by exploring only the locally frequent fragments. The cSPADE algorithm [38] is a straightforward extension of the SPADE algorithm; the only difference is the involvement of constraints in cSPADE. These constraints include length, width, and duration limitations on the sequences, item constraints, event constraints, and the incorporation of class information. In addition, a SPADE-based algorithm called SPAM (Sequential PAttern Mining) [39] has been proposed. It integrates the ideas of GSP, SPADE, and FreeSpan and combines a vertical bitmap representation of the database with efficient support counting.
One of the promising areas is Formal Language Theory and Grammatical Inference (GI), which plays an important role in the development of new methods to process biological data [40].
Many works propose GI techniques to tackle bioinformatics tasks, such as secondary structure identification [41], protein motif detection [42], and optimal consensus sequence discovery [43]. In this paper, GI is used as the backbone of the sequential pattern mining algorithm, which has achieved faster processing and higher accuracy than other sequential pattern mining algorithms for protein fold recognition.
In this paper, we introduce a Classified Sequential Pattern mining technique for Protein Fold Recognition (CSPF). CSPF consists of two main phases: 1) sequential pattern mining and 2) fold recognition. It handles gap constraints, uses data parallelization, and performs incremental updating. CSPF has shown efficient results when applied to 36 SCOP protein folds. This paper is organized as follows: Section II explains the proposed CSPF technique. Section III describes the datasets used and the performance study. Finally, Section IV gives the conclusions and future work.

II. METHODS
The CSPF technique consists of two main phases: the sequential pattern mining phase and the fold recognition phase. In the sequential pattern mining phase, which serves as the training phase, the Mix & Test algorithm is developed. In the fold recognition phase, unknown protein folds are predicted via a proposed testing function. Our work is close to the sequential pattern mining suggested in [13]. However, this work depends on a new algorithm for sequential pattern mining, based on grammatical inference. In addition, it employs parallel sequential pattern mining and incremental updating.

A. Phase I: Sequential Pattern Mining:
During this phase, the Mix & Test algorithm is developed in order to mine sequential patterns for each fold, based on Grammatical Inference. The key advantages of the Mix & Test algorithm are minimizing I/O costs via one database scan, discovering combinations directly from sequences in memory without searching the whole sequences file, requiring no database projection, handling gaps, and working with variable-length sequences without having to align them. In addition, the Mix & Test algorithm supports incremental updating, since it does not prune infrequent patterns and keeps counting their support during the mining steps.
Mix & Test algorithm acts iteratively.
First, it generates a list of no-gap sequential combinations, which will serve as the seed for the next generation if a gap value is specified. If no gap is specified, this list is evaluated by the testing strategy with the specified minimum support threshold, yielding the frequent and infrequent lists. If a gap value is specified, Mix & Test loops back to the combinations generation step and uses the combinations list obtained from the previous step to construct a new combinations list with a gap, following the steps of the Mix & Test algorithm's grammar.
The steps of the algorithm are shown in Fig. 1.

1) Mix Strategy:
Problem Definition: Given a sequences file S that contains a set of sequences $S = \{s_1, s_2, \ldots, s_m\}$ and a set of items $I = \{i_1, i_2, \ldots, i_n\}$ that may appear in any sequence (here, a set of amino acids), where m is the number of sequences in the file and n is the number of amino acids. A sequence is $s_j = \langle i_1, \ldots, i_n \rangle$, where $i_1$ is the first item in the sequence and $i_n$ is the last item in the sequence. Let P be a subsequence that is derived from $s_j$, let $P_t$ be the current generated subsequence, and let $P_{t-1}$ be the previous generated subsequence. The first generated subsequence will be: The generated subsequence will be:

Sequential combinations Generation "No-Gap combinations"
The Mix strategy first generates the complete "no gap combinations" list. It starts by reading the first sequence of the protein sequences file and generates all possible sequential combinations of it. The Mix strategy inserts each generated combination into the "no gap combinations" list with support equal to 1. It then loops through each newly generated P to generate all possible combinations of it, using a removing procedure. This procedure removes the last item of the last generated combination to get a new combination from the current P. It stops generating $P_t$ when t equals the number of items n in the sequence. An example of the generated "no gap combinations" is illustrated in Table I, given the original sequence MAKNNGCDP. After generating all possible sequential combinations from the first sequence of the protein sequences file, it reads the second sequence, goes through the previous steps, and generates all new combinations. If a newly generated combination was previously composed, its support is incremented by one; otherwise, it is inserted into the "no gap combinations" list with support equal to 1, as clarified in Fig. 1.
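The no-gap generation step described above can be illustrated with a short Java sketch. It is not the authors' implementation: the class and method names are hypothetical, it assumes that a "no gap combination" is a contiguous substring of a sequence, and it assumes that support counts the number of sequences containing a combination (each sequence contributes at most 1).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the no-gap part of the Mix strategy.
final class NoGapMixSketch {

    // Assumption: a "no gap combination" is a contiguous substring.
    // Each combination is listed once per sequence.
    static Set<String> combinationsOf(String sequence) {
        Set<String> combos = new HashSet<>();
        for (int start = 0; start < sequence.length(); start++) {
            // "Removing procedure": start from the longest candidate at this
            // position and drop the last item until a single item remains.
            for (int end = sequence.length(); end > start; end--) {
                combos.add(sequence.substring(start, end));
            }
        }
        return combos;
    }

    // Single pass over the sequences (one "database scan").
    // Assumption: support = number of sequences containing the combination.
    static Map<String, Integer> mine(List<String> sequences) {
        Map<String, Integer> support = new HashMap<>();
        for (String sequence : sequences) {
            for (String combo : combinationsOf(sequence)) {
                support.merge(combo, 1, Integer::sum); // insert with 1 or increment
            }
        }
        return support;
    }

    public static void main(String[] args) {
        Map<String, Integer> support = mine(List.of("MAKNNGCDP", "MAKV"));
        System.out.println("MAK -> " + support.get("MAK"));
        System.out.println("NNG -> " + support.get("NNG"));
    }
}
```

Infrequent combinations stay in the map, which matches the statement that nothing is pruned before the Test strategy runs.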

Gapped Sequential combinations Generation
If a gap value is specified, the "no gap combinations" list is used to generate the "one gap combinations" list, which is in turn used to generate the "two gaps combinations" list, and so on. The Mix strategy uses two procedures to generate all possible gapped sequential combinations: the Ladder and Crisscross procedures.
First, the Ladder procedure reads each combination in the "no gap combinations" list and loops through it, inserting one gap at a time, starting from the second character position and shifting right in each loop until reaching the last character of the combination. Then, it reads the next no-gap combination and applies the same steps to it.
Definition 1: Given C as a "no gap combinations" list, $C_i$ is a no-gap combination. Let L be the gapped combination list generated by the Ladder procedure, as follows: Where $L_y(C_i(S_j))$ is the y-th combination generated by the Ladder procedure from the no-gap combination $C_i$, and $i_{y+1}$ is the item i at position y+1 in the combination $C_i$.
Suppose the first combination in the "no gap combinations" list is MAKNNGCDP. Applying this procedure, we obtain these one-gap combinations: M_KNNGCDP, MA_NNGCDP, MAK_NGCDP, MAKN_GCDP, MAKNN_CDP, MAKNNG_DP, and MAKNNGC_P. Note that MAK_NGCDP is equivalent to MAKN_GCDP, so they are treated as one combination and inserted only once in the "one gap combinations" list as MAKNGCDP.
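The Ladder example above can be reproduced with the following Java sketch. It is only an illustration under stated assumptions: the names are hypothetical, a gap is modeled as skipping exactly one item, and gapped combinations are stored without the underscore so that equivalent results (such as MAK_NGCDP and MAKN_GCDP) collapse into one entry.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the Ladder procedure for one-gap combinations.
final class LadderSketch {

    // Skips one item at a time, from the second item to the second-to-last
    // item. The set removes duplicates while keeping generation order.
    static Set<String> oneGap(String combination) {
        Set<String> gapped = new LinkedHashSet<>();
        for (int gap = 1; gap < combination.length() - 1; gap++) {
            gapped.add(combination.substring(0, gap) + combination.substring(gap + 1));
        }
        return gapped;
    }

    public static void main(String[] args) {
        // Seven gap positions are tried, but two of them give MAKNGCDP.
        System.out.println(oneGap("MAKNNGCDP"));
    }
}
```

Using a set keyed by the collapsed string is one simple way to get the "inserted only once" behavior; the source does not say how the authors detect the equivalence.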
Second, the Crisscross procedure generates the rest of the possible gapped sequential combinations of the "one gap combinations" list. It reads each combination in the "no gap combinations" list, loops through it, and inserts one gap between each pair of the combination's characters. It starts from the second character's position and shifts right by one character position in each loop.
Definition 2: Given C as a "no gap combinations" list, $C_i$ is a no-gap combination. Let Q be the gapped combination list generated by the Crisscross procedure, as follows: Where $Q_r(C_i(S_j))$ is the r-th combination generated by the Crisscross procedure from the no-gap combination $C_i$, and $i_{r+1}$ is the item i at position r+1 in the combination $C_i$. The concatenation part of the function stops when n is equal to or greater than the number of items in $C_i$.
Applying this procedure to the last example, the no-gap combination MAKNNGCDP produces: M_K_N_C_P, MA_N_G_D, MAK_N_C_P, MAKN_G_D, MAKNN_C_P, MAKNNG_D, and MAKNNGC_P. Notice that all combinations derived by the two procedures take the same support as the parent no-gap combination from which they are derived. The Mix strategy stops generating new combinations once all sequences in the protein sequences file have been processed. The final result of applying the Mix strategy is a list of all combinations derived from all combinations lists.
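The Crisscross example above is consistent with the following Java sketch. The rule it encodes is inferred from the seven listed outputs rather than from the (missing) formal definition, so it should be read as an assumption: keep a prefix of growing length, then alternately skip one item and keep the next. The names are hypothetical, and the underscore is kept only for display.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Crisscross procedure, inferred from the example.
final class CrissCrossSketch {

    static List<String> crissCross(String combination) {
        List<String> result = new ArrayList<>();
        int n = combination.length();
        for (int prefixLen = 1; prefixLen <= n - 2; prefixLen++) {
            StringBuilder sb = new StringBuilder(combination.substring(0, prefixLen));
            // Skip the item right after the prefix, keep the next one, and repeat.
            for (int j = prefixLen + 1; j < n; j += 2) {
                sb.append('_').append(combination.charAt(j));
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(crissCross("MAKNNGCDP"));
    }
}
```

A trailing skipped item produces no underscore, which is why MA_N_G_D ends in D rather than in a gap.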

2) Test strategy:
The Test strategy filters the final combinations list, which contains all no-gap and gapped combinations, to distinguish frequent from infrequent patterns according to the user-specified support. However, infrequent patterns are not discarded, because incremental updating will be performed later on.
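A minimal Java sketch of this filtering step follows. The names are hypothetical, and it assumes the minimum support is given as a fraction of the number of sequences; both the frequent and the infrequent lists are kept, as the text requires.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the Test strategy: split, never discard.
final class TestStrategySketch {

    static final class Split {
        final Map<String, Integer> frequent = new TreeMap<>();
        final Map<String, Integer> infrequent = new TreeMap<>();
    }

    // Assumption: minSupport is a fraction in [0, 1] of sequenceCount.
    static Split test(Map<String, Integer> support, int sequenceCount, double minSupport) {
        Split split = new Split();
        double threshold = minSupport * sequenceCount;
        for (Map.Entry<String, Integer> entry : support.entrySet()) {
            Map<String, Integer> target =
                    entry.getValue() >= threshold ? split.frequent : split.infrequent;
            target.put(entry.getKey(), entry.getValue());
        }
        return split;
    }

    public static void main(String[] args) {
        Split split = test(Map.of("MAK", 3, "NNG", 1), 4, 0.5);
        System.out.println("frequent: " + split.frequent);
        System.out.println("infrequent: " + split.infrequent);
    }
}
```

Because the split is a single pass over stored supports, changing the threshold does not require regenerating combinations.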
The most time-consuming step in the Mix & Test algorithm is updating the combinations list, where a search is required to determine whether a generated combination is a new one to be inserted or an existing one whose support must be updated. Moreover, the combinations list may become very large. Therefore, a lexicographic prefix tree of lists is suggested, where each list contains all combinations with the same prefix. For example, let $P = \{p_1, p_2, \ldots, p_n\}$ be a set of lists (here n = 20 amino acids). Each $p_i$ represents a list of all combinations with prefix i. For example, if i = M, the list $p_M$ can contain combinations such as MV, MVV, MTV, and MNKLSV. After the Mix strategy generates a new combination, the first character of this combination is checked to determine the list into which it is inserted. So, instead of having one big list, we have n lists, which shrinks the time T needed to find or insert a combination to T/n.
In order to increase the speed of computing and minimize the time required to generate the combinations in the Mix strategy, especially with a large number of files and rapid incoming rates, a Parallel Mix strategy (PMix) is proposed. PMix uses horizontal data parallelization, where the data are split into chunks in memory. These data chunks are distributed over the PMix threads. Each thread applies the Mix strategy to generate the combinations of candidate patterns of its data chunk. After all threads finish their work, a combination integrator module integrates all combinations generated by the threads into one final combinations list. The final combinations list is used by the combinations evaluator module, which applies the Test strategy to obtain the frequent and infrequent patterns.

3) Incremental updating
CSPF saves and records the sequential patterns of each fold, which are generated in the training phase. However, increasing the speed of processing, especially with large volumes of data and high data rates, is highly required. Existing incremental updating algorithms depend heavily on the availability of main memory. As a result, the use of in-memory relational databases is proposed, where the Oracle TimesTen database management system is applied. TimesTen is an in-memory DBMS technology, which provides very fast data access time because all its data reside in physical memory (RAM) at run time. TimesTen provides applications with the short, consistent response times and very high throughput required by applications with database-intensive workloads.
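Looking back at the prefix-bucketed lists and the PMix strategy described under the Test strategy, the two ideas can be combined in a short Java sketch. This is a sketch under assumptions, not the authors' code: names are hypothetical, only no-gap (contiguous) combinations are generated, support is counted once per sequence, and a fixed thread pool stands in for the PMix threads.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: prefix buckets plus horizontal data parallelization.
final class PMixSketch {

    // One bucket per first character; each bucket maps combination -> support.
    static Map<Character, Map<String, Integer>> mineChunk(List<String> chunk) {
        Map<Character, Map<String, Integer>> buckets = new HashMap<>();
        for (String s : chunk) {
            Set<String> seen = new HashSet<>(); // count each combination once per sequence
            for (int start = 0; start < s.length(); start++) {
                for (int end = start + 1; end <= s.length(); end++) {
                    seen.add(s.substring(start, end));
                }
            }
            for (String combo : seen) {
                buckets.computeIfAbsent(combo.charAt(0), k -> new HashMap<>())
                       .merge(combo, 1, Integer::sum);
            }
        }
        return buckets;
    }

    // Splits the sequences into chunks, mines each chunk on its own thread,
    // then merges the partial results (the "combination integrator" step).
    static Map<Character, Map<String, Integer>> mineParallel(List<String> sequences, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunkSize = Math.max(1, (sequences.size() + threads - 1) / threads);
            List<Future<Map<Character, Map<String, Integer>>>> futures = new ArrayList<>();
            for (int from = 0; from < sequences.size(); from += chunkSize) {
                List<String> chunk =
                        sequences.subList(from, Math.min(from + chunkSize, sequences.size()));
                futures.add(pool.submit(() -> mineChunk(chunk)));
            }
            Map<Character, Map<String, Integer>> merged = new HashMap<>();
            for (Future<Map<Character, Map<String, Integer>>> future : futures) {
                for (Map.Entry<Character, Map<String, Integer>> bucket : future.get().entrySet()) {
                    Map<String, Integer> target =
                            merged.computeIfAbsent(bucket.getKey(), k -> new HashMap<>());
                    bucket.getValue().forEach((combo, count) -> target.merge(combo, count, Integer::sum));
                }
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while mining", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("chunk mining failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Each thread works on private maps and merging happens afterwards on the calling thread, so no synchronization is needed during generation. This is one plausible design; the source does not describe how the integrator module is implemented.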
Incremental updating handles two cases: inserting new data and deleting old data. First, the Insert module, as shown in Fig. 2, deals with adding new protein files to an existing fold trial: the Mix strategy is applied to obtain the combination patterns of these files. These patterns are sent to the database and added to the previously obtained frequent sequential patterns. Updated patterns can be classified into four cases: 1) patterns that were frequent in the old database and become infrequent in the new database, 2) patterns that were frequent in the old database and are still frequent in the new database, 3) patterns that were infrequent in the old database and become frequent in the new database, and 4) patterns that were infrequent in the old database and are still infrequent in the new database. Second, the Delete module deals with sequences deleted from the original database, which yields an inconsistent state with respect to the same specified minimum support threshold. The Delete procedure is similar to the Insert procedure. When some protein sequences are deleted from an existing fold trial, the obtained lists of frequent and infrequent patterns are affected. The Delete module provides two ways of deleting files: directly, by specifying their names, or by specifying a time range so that the files within it are deleted.
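The four update cases listed above can be expressed as a small Java sketch. The names are hypothetical, and the sketch assumes that the stored support of every pattern (frequent or not) is available and that the minimum support is a fraction of the current number of sequences. Deletion is modeled with negative deltas.

```java
// Hypothetical sketch of classifying a pattern after an insert or delete.
final class IncrementalUpdateSketch {

    enum Transition {
        FREQUENT_TO_INFREQUENT, STILL_FREQUENT, INFREQUENT_TO_FREQUENT, STILL_INFREQUENT
    }

    // oldSupport / oldCount describe the database before the update;
    // deltaSupport / deltaCount are the changes (negative for deletions).
    static Transition classify(int oldSupport, int oldCount,
                               int deltaSupport, int deltaCount, double minSupport) {
        boolean wasFrequent = oldSupport >= minSupport * oldCount;
        boolean isFrequent =
                (oldSupport + deltaSupport) >= minSupport * (oldCount + deltaCount);
        if (wasFrequent) {
            return isFrequent ? Transition.STILL_FREQUENT : Transition.FREQUENT_TO_INFREQUENT;
        }
        return isFrequent ? Transition.INFREQUENT_TO_FREQUENT : Transition.STILL_INFREQUENT;
    }

    public static void main(String[] args) {
        // Ten new sequences arrive and none of them contains the pattern.
        System.out.println(classify(5, 10, 0, 10, 0.5));
    }
}
```

Cases 3 and 1 are the reason infrequent patterns are retained: without their stored supports, a pattern could not be promoted after an insert without rescanning the old data.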

B. Phase II: Protein Fold Recognition
The objective of the fold recognition phase is to classify unknown protein folds. In addition, an incremental updating module is used for maintaining the underlying database.

1) Weight Function for Protein Fold Recognition
The proposed weight function classifies the unknown protein by matching the extracted sequential patterns of each fold against the incoming protein sequence. A weight for each fold with respect to the unknown protein is calculated. The more matched patterns are found, the higher the weight of the fold and the higher its probability of being selected as the recognized fold. However, two important aspects have to be considered: 1) The length of the matched sequential patterns: the more long frequent patterns are matched, the higher the accuracy of the fold classification. 2) Two folds having the same number of sequential patterns. The proposed Weight Fold Function is: Where N is the number of matched patterns, M is the maximum length of the extracted patterns for the fold, L is the length of a pattern, K is the number of patterns with the same length, S is the number of extracted sequential patterns for a fold, and W is the weight of the fold.

III. APPLICATION
The CSPF technique is evaluated using different parameters, such as different support thresholds, number of sequences, memory consumption, and number of items per sequence. CSPF is trained and tested on a specific set of selected folds from the Structural Classification of Proteins (SCOP) database. The ASTRAL SCOP 1.75B dataset, updated on 25-4-2013, is selected, where no proteins with more than 40% identity between them are included. The ASTRAL SCOP 1.75B dataset release has 49,757 PDB entries and 136,776 domains. For each fold in this set, a corresponding set of at least 30 protein members is obtained from the Protein Data Bank (PDB) [44], which is a worldwide archive of structural data of biological macromolecules. The protein sequences extracted from this release are used to validate the results of the proposed model. Two thirds of this dataset are used in the training phase to establish the feature set for each fold, and one third is used as test data to check the validity of the proposed model. The algorithms are developed in the Java language with NetBeans IDE 7.2 as the Java execution environment. The algorithms are tested on an Intel Core™ i5 2.50 GHz with 6 GB of main memory. The operating system used is Windows 7.
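The following performance evaluation tests are performed: 1) For the no-gap Mix strategy: a) comparison of Mix & Test, PMix, and SPAM in terms of a varied number of sequences, b) comparison of Mix & Test, PMix, SPAM, and PrefixSpan in the case of a varied support threshold, and c) comparison of Mix & Test, PMix, SPAM, and PrefixSpan in the case of a changing number of items per sequence. 2) For the gapped Mix strategy: comparison of the Mix & Test and cSPADE algorithms according to the changes in the maximum gap value. 3) Incremental updating. 4) Memory consumption. 5) Fold recognition phase: a comparison between the proposed method and SAM, which is widely used as a benchmark in fold recognition [39,45]. However, SAM requires higher computational effort during training, since it employs the Baum-Welch algorithm for training the model, which is an iterative procedure.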

A. Performance analysis of no gap mix strategy

1) Number of sequences Test:
In this study, we measure the performance of the Mix & Test, PMix, and SPAM algorithms according to the change in the number of sequences. Fig. 3 shows the performance results of Mix & Test, PMix, and SPAM on data ranging from 100,000 to 900,000 sequences. Fig. 4 illustrates the performance results of Mix & Test, PMix, and SPAM on data ranging from 1,000,000 to 5,000,000 sequences. In both figures, Mix & Test and PMix outperform SPAM: the time they take is much smaller than the time taken by SPAM. In addition, PMix outperforms both the Mix & Test and SPAM algorithms because of the parallelization step.
Fig. 3. M&T, PMix vs. SPAM having data ranges from 100,000 to 900,000 sequences
Fig. 4. M&T, PMix vs. SPAM having data ranges from 1,000,000 to 5,000,000 sequences

2) Minimum Support Threshold test:
Fig. 5 and Fig. 6 show the processing time of Mix & Test and PMix versus PrefixSpan and SPAM at different values of the support threshold, with the number of sequences equal to 25,000 and 50,000, respectively. For protein sequence data and a very low minimum support threshold, PrefixSpan and SPAM take hours to process. On the other hand, Mix & Test and PMix take seconds and are not affected by changes in the minimum support threshold value.

3) Number of Items per Sequence
Four tests are applied, having 180 and 300 items per sequence (ips) and a varying support threshold, as shown in Fig. 7(a,b), respectively. Each trial in each test of the experiment adds 5% to the support threshold value of the previous trial. Thus, the first trial has a support threshold of 5% and the last one has a support threshold of 50%. The execution time is measured in each trial. The results of these tests show the relationship between the value of the support threshold and the processing time in seconds for the four algorithms: Mix & Test, PMix, PrefixSpan, and SPAM. As shown in Fig. 7(a,b), Mix & Test and PMix are much faster than PrefixSpan and SPAM.

B. Performance analysis of gapped mix strategy
In this case, the performance of Mix & Test and PMix versus the cSPADE algorithm is tested according to the changes in the maximum gap value, as illustrated in Fig. 8. The minimum support threshold is set to 35%. One can observe that the higher the gap value, the longer the processing time, and that the Mix & Test and PMix algorithms outperform cSPADE for small gap values. In addition, PMix outperforms both Mix & Test and cSPADE.

C. Performance analysis of Incremental Updating Process
The incremental updating module is implemented via two different database management systems. The first is the MySQL DBMS with a conventional disk-resident database, and the other is the Oracle TimesTen database, as explained previously. The performance of Mix&Test(TimesTen) and Mix&Test(MySql) is tested according to the change in the number of sequences (in this case from 10,000 to 50,000 sequences).
In this case, a support threshold of 20% with no gap value is applied, as illustrated in Fig. 11. Mix&Test(TT) outperforms Mix&Test(MySql): Mix&Test(TT) takes around 30 seconds to process a file of 10,000 sequences, whereas Mix&Test(MySql) takes around 200 seconds to process it. This is because the TimesTen database is more efficient than the MySQL DBMS, as it offers a small, fast, multithreaded, and transactional database engine with in-memory and disk-based tables.

D. Performance Analysis of Memory Consumption
The memory consumption of Mix & Test and PMix is evaluated versus cSPADE under two aspects: different gap values and a varying number of sequences. For changing gap values, Mix & Test and PMix are tested versus the cSPADE algorithm using a sequences file with 30,000 sequences and a minimum support threshold of 30%, as illustrated in Fig. 10. PMix consumes more memory than Mix & Test because it runs multiple threads at the same time. Also, cSPADE consumes much more memory than both Mix & Test and PMix.

E. Performance Analysis of Fold recognition Phase:
The fold recognition phase of the CSPF technique is trained and tested on the dataset described previously [13]. In Table II, we compare the sensitivity of CSPF to the sensitivity of SPM for fold recognition. The sensitivity of each model represents the number of proteins that are classified successfully out of all proteins under evaluation.
A set of 804 protein experiments (the test data set) is used to measure the accuracy of the model on the test set. CSPF reported an overall accuracy of 34.32% on the testing data, as shown in Table III. Using the same test datasets, and in order to compare the efficiency of the proposed model, the SAM model [16] is also employed. A comparison of the results obtained by CSPF, "SPM for FR", and SAM (E-values ranking) is presented in Table IV.
CSPF outperforms the other two models: it reports an overall accuracy of 34.32% on the testing data, while the overall accuracy of the "SPM for FR" model was 24.9% and SAM's overall accuracy was 29.4%.
In terms of space complexity, for a sequence file with n sequences and m items per sequence, where the number of items equals 20 (the 20 amino acids), the space complexity of the Mix & Test algorithm is $O(20m+n)$. In terms of time complexity, the complexity of generating all the candidate patterns of Mix & Test with no gap is $O(n^2)$. The complexity of generating all the candidate patterns of Mix & Test with a gap m is $O(n^2) \cdot m$. The complexity of discovering the frequent patterns is $O(N)$.

IV. CONCLUSIONS
In this work, we proposed the CSPF technique for protein fold recognition. This technique consists of two main phases: sequential pattern extraction and protein fold recognition. The sequential pattern extraction phase introduced the Mix & Test algorithm. Several experiments were conducted to assess the performance of Mix & Test and PMix. The performance of the M&T and PMix algorithms was compared with the PrefixSpan, SPAM, and cSPADE algorithms. In addition, the performance of CSPF fold recognition was compared with the "SPM for FR" and SAM (E-values) models. CSPF outperformed the "SPM for FR" and SAM (E-values) models, with an overall accuracy for training data equal to 75.84%, and the "SPM for FR" model was 59.7% for testing data. Future work on CSPF can proceed in several directions: utilizing optimization techniques to enhance the prediction results and applying high performance computing to provide very fast processing of protein sequence databases. In addition, more protein sequences will be used.

Fig. 10. The memory consumption of M&T and PMix vs. cSPADE under different gap values

TABLE II. SENSITIVITY FOR ALL FOLDS AND OVERALL ACCURACY OF THE PROPOSED CSPF TECHNIQUE AND "SPM FOR FOLD RECOGNITION (FR)"

TABLE III. DETAILED SENSITIVITY RESULTS FOR ALL FOLDS UNDER EVALUATION AND OVERALL ACCURACY OF THE PROPOSED CSPF MODEL IN THE TEST SET

TABLE IV. CLASSIFICATION RESULTS OF THE PROPOSED METHOD CSPF, "SPM FOR FR" ALGORITHM AND SAM (E-VALUES) IN THE TEST SET