Protein Sequence Matching Using Parametric Spectral Estimate Scheme

Putative protein sequences decoded from the messenger ribonucleic acid (mRNA) sequences are composed of twenty amino acids with different physical-chemical properties, such as hydrophobicity and hydrophilicity (uncharged, positively charged or negatively charged amino acids). In this paper, the power spectral estimate (PSE) technique for random processes is applied to the protein sequence matching framework. First, the twenty kinds of amino acids are classified based on their hydrophobicity and hydrophilicity. Then each amino acid in the protein sequence is mapped to a corresponding complex value. Consider the various Hidden Markov chain orders in the complex valued sequences. The PSE method can explore the implicit statistical relations among protein sequences. The mean squared error between the power spectra of two sequences is determined and then used to measure their similarity. The experimental results verify that the proposed PSE method provides the consistent similarity measurement with the wellknown ClustalW and BLASTp schemes. Moreover, the proposed PSE can show better similarity relevance than ClustalW and BLASTp schemes. Keywords—protein sequence; amino acids; digital signal processing; parametric spectral estimate; hydrophilicity; hydrophobicity; Markov chain


INTRODUCTION
In the past two decades, deoxyribonucleic acid (DNA) and protein sequences in various organisms have been massively obtained with the help of high-throughput sequencing technologies [1].Biologists unravel the functionality and capability of numerous protein sequence domains by understanding their 3-D structures obtained by the x-ray diffraction technique or NMR technology.These procedures require laborious preparations of protein crystals and are extremely time-consuming.Therefore, alternative methods based on digital signal processing (DSP) technique were developed to circumvent the extremely complicated crystallographic tasks.Generally speaking, two types of methods are commonly used to analyze the protein sequences and predict their functions : (1) Statistical methods [2], which apply the well-known mathematical models in stochastic processes to analyze the sequences.(2) Geometrical methods [3], which apply graphs to represent the sequences and then analyze them.Both types of methods first transform the symbolic amino acids to numerical values.The global or local similarity of any two sequences can then be measured according to the differences between the extracted sequence features.High similarity between two sequences may infer two meanings: (1) the two sequences could be homologous; (2) the protein structures and/or their biological functions are similar.
Recently, various methods are proposed to study the DNA and protein sequences.Among them are the DSP-based methods [4]- [7].Some of the related studies put the focus on the visualization of sequences in various graphic forms [8]- [15].In DSP techniques, each character in the DNA sequences or each amino acid in the protein sequence is mapped to a numerical value.According to the characteristics of the organisms, the different values used in the numerical mapping can be designed to accordingly fit the physical-chemical properties [16]- [21].Thus, the comparison method especially utilizing certain properties of the residence in the DNA or protein sequence must be specifically designed.There are two well-known character-based tools for DNA and protein sequence comparison; ClustalW [22] and BLAST [23].ClustalW is designed by using multiple-sequence alignment based on the meta-heuristics methods.The feature that the arrangement of each amino acid is similar in the evolution of the same species is utilized.On the other hand, BLAST alignment has four components: query, database, program, and search purpose/goal.BLAST is designed to locate the homologous sequence sites between two sequences using a heuristic approach.It compares partial sequences progressively, such that the local alignment results are obtained.Standard protein-protein BLAST (BLASTp) compares an amino acid query sequence against a protein sequence database.It is used for both identifying a query amino acid sequence and for finding similar sequences in protein databases.
In this paper, a parametric spectral estimate (PSE) method based on stochastic signal processing is proposed for protein sequence comparison.The numerical signals are used to represent the protein sequences and then analyzed in the frequency domain.First, a new model of mapping complex values to amino acids according to the physical-chemical characteristics is proposed.Next, the PSE method is used to determine the power spectrum density (PSD) of each numerical protein sequence.Finally, the mean squared error (MSE) values between two power spectra under various Markov orders are determined and served as a metric for sequence similarity measurement.As compared to the ClustalW and Blastp methods, our experimental results show that the proposed method provides an alternative way to efficiently www.ijacsa.thesai.orgdistinguish the differences between two protein sequences.The remaining part of this paper is organized as follows: Section 2 describes the proposed PSE method.The experimental results under different perspectives and their discussions are provided in Section 3. Finally, the conclusion is drawn in Section 4.

II. METHODS
Figure 1 shows the block diagram of the proposed method for protein sequence comparison.There are three parts in this method: (1)  A. Numerical mapping Table 1 shows the standard one-letter abbreviation of the 20 amino acids and their properties on charge, hydrophilicity or hydrophobicity, which are important for protein structure and protein-protein interaction.[24,25].
According to the physical-chemical properties, twenty kinds of amino acids can be represented by twenty complex values which locate at the different positions on the unit circle in the complex plane.Figures 2(a)-2(d) show the four arrangements, Methods 1-4, of mapping the symbolic amino acids to the numerical ones.The twenty complex values are distributed on the unit circle whose center is at the origin of the complex plane.As shown in Fig. 2(a) (Method 1), the hydrophilic amino acids are distributed on the upper part of the circle.The first amino acid H is assigned to be 45。 with respect to the real axis of the circle.The other amino acids {R, K, E, D, Q, N, Y, C, T, and S}, which have 9。 separation from each other, are assigned after the amino acid H.The hydrophobic amino acids are distributed on the lower part of the circle.The first amino acid G is assigned to 234。 with respect to the real axis on the circle.The other amino acids {A, V, L, I, P, M, F, and W}, which have 9。 separation from each other, are assigned after the amino acid G.In addition to the hydrophobic and hydrophilic properties, we assign the positions according to their basic structures and the general chemical characteristics in their side chain (R) groups [26].According to the position of the amino acids on the circle, the mapping is established so that every amino acid is adequately separated.
In addition to Method 1 shown in Fig. 2(a), three other mapping methods shown in Figs.2(b)-2(d) are also proposed for performance comparison and evaluation.Table 2 shows the complex values corresponding to the coordinates of the 20 amino acids on the circle.In addition to Method 1, Methods 2, 3, and 4 are proposed to verify the effects of amino acid properties in this study by changing the positions on the circle.In Methods 2 and 3, the 20 amino acids still conform the rules of the characteristics of hydrophilicity and hydrophobicity, but the position can be exchanged in the random and horizontally reversed ways, respectively.In Method 4, the mappings are random, and thus none of the rules of the characteristics of hydrophilicity and hydrophobicity is obeyed.

B. Parameter spectrum estimation (PSE) method
Consider a stochastic process with the random variable (RV) {X n , n = 1, 2, 3, … , m}, which describes a protein sequence composed of twenty kinds of amino acids and m is the sequence length.Let a sequence be denoted in Eq. ( 1), where represents that the status is i when an amino acid is at the nth position in the sequence.
 Let P ij denote a transition probability given that the current status is i and the next status is j.
Note that Eq. ( 2) denotes a Markov chain process, in which the probability of the current amino acid X n is only dependent on the previous amino acid X n-1 .A one-step transition probability matrix of order 20×20 is obtained by letting the first-order Markov chain model corresponding to the possible transitions between two amino acids in a protein sequence and is shown in Eq. (3). .


Let the amino acids in a protein sequence be denoted as a discrete signal source where the occurrence probability of each element is p i , i=1, 2,…, m.Equation (4) defines the information amount I(x i ) of xi for an event which occurs with a probability The average information or entropy of X is defined in Eq. ( 5):


The conditional entropy of the current RV, X, given the m previous RVs: x 1 , x 2 ,…,x m , is defined in Eq. ( 6): The entropy of a first-order Markov process is defined in Eq. ( 7): www.ijacsa.thesai.org where p(x i1 ,x i2 ) is the probability when the two values x i1 and x i2 occur together.The entropy of a second-order Markov process is defined in Eq. ( 8): The homologous gene sequences have similar entropies when a higher-order Markov process is used [27].In a lower order, however, each set of homologous gene sequences have various entropy values.The q th -order Markov model is shown in Eq. ( 9) where both RVs X and W are zero-mean and the variance Var{W  .In the estimation of random variables [28], the q-dimensional vector is denoted as provides linear prediction estimate of the


are determined as the solution of the orthogonal equation, and the optimum value of a, denoted as a o is shown in Eq. ( 10).11) and ( 12) define the cross covariance vector and the covariance matrix of X[n], respectively.
To obtain a simple PSD estimate, the covariance function k XX [ ] is replaced as shown in [28], and the solution yields parameter estimates q a a a , , , ˆ2 1  .Finally, the PSD estimate is defined in Eq. ( 13): where 2  ˆW  denotes the variance and k a ˆ is a parameter of RV.The covariance matching property when the PSD function is regarded to as an auto-regressive model is defined in Eq. (14).
where IFT{} denotes the inverse Fourier transform,  is the power spectral order, q' is the highest order, and x R ˆ is the auto-correlation function of ) ( ˆ S .

C. MSE determination
When the PSD of each sequence in the q th Markov order is determined by the use of PSE, the PSD values are normalized within the range [0, 1].Next, the MSE defined in Eq. ( 15), is used to compare the similarity between two protein sequences y i and y j under the q th Markov order. , where the parameter N is the length of protein sequence, q i y and q j y are the normalized PSD values under the q th order transformed from one and the other protein sequences, respectively.If the lengths of two sequences are different, their PSDs are not of the same length, either.Thus the MSE cannot be directly computed.To solve this problem, two methods are used to make two sequences the same length.First, the shorter sequence is interpolated to be of the same length with the longer one and is denoted as y sl .Second, the longer sequence is down-sampled to be of the same length with the shorter one and is denoted as y ls .These two methods are determined by Bilinear interpolation shown in Eqs. ( 16) and (17)., , , Here N s and N l are the lengths of the shorter and longer sequences, respectively, j' n is between j and j+1, and ) ' ( j j n n    .In the proposed methods, the MSE based on www.ijacsa.thesai.orgshort sequences, MSE s , and the MSE based on long sequences, MSE l , under a certain Markov order q, are determined, respectively, to obtain the average value MSE final shown in Eq. ( 18).
Amino acids with polar residue Amino acids with non-polar residue Amino acids with polar residue Amino acids with non-polar residue Amino acids with polar residue Amino acids with non-polar residue

III. EXPERIMENTAL RESULTS
Eighteen proteins sequences extracted from NCBI databases shown in Table 3 are considered in our experiments.According to their amino acid (aa) numbers, these sequences are categorized into six groups: Group 1 (<200 aa), Group 2 (300-400 aa), Group 3 (401-800 aa), Group 4 (801-1200 aa), Group 5 (1201-2000 aa), and Group 6 (2001-3000 aa) sequences.In each group, two sets of different homologous sequences are tested.The nine MSE values corresponding to the Markov orders from one to nine are determined in each sequence pairs.Then, the average value is regarded the final result in each experiment.Four experiments according to the four mapping schemes (Methods 1-4) shown in Table 2 are designed.Method 1 encounters the properties of hydrophilic and hydrophobic amino acids sorted by the physical-chemical properties.Method 2 only encounters the properties of hydrophilic and hydrophobic.The amino acids are randomly distributed at hydrophilic part with the property of hydrophilic and hydrophobic part with the property of hydrophobic.Method 3 is a similar mapping with Method 1, but the amino acids are distributed in a horizontally reversed way.Finally, the mapping in Method 4 encounters none of the physical-chemical properties of the amino acids.The comparison results with the well-known ClustalW (Pairwise Alignment: SLOW/ACCURATE, Weight Matrix: BLOSUM, Gap Open Penalty: 10, Gap Extension Penalty: 0.1) and BLASTp (Weight Matrix: BLOSUM62) schemes are provided to validate the proposed methods.
Based on Method 1 shown in Table 2, the PSDs of the protein sequences in these six groups are determined.Figure 3 www.ijacsa.thesai.orgshows the power spectra of the various Markov orders obtained by using the PSE method for the sequences in Group 1.The blue and red lines represent the normalized power spectral values of two sequences.The x-axis denotes the angular frequency (  2 -0 ), while the y-axis denotes the normalized PSD.For each pair of protein sequences, five power spectra corresponding to five Markov orders q=1-5 are determined.If these two lines are close to each other in the same Markov order, we may infer that the two sequences are significantly related.For example, the PSDs in Figs.3(a), 3(b), 3(f), 3(m), 3(n), and 3(o) are similar in each Markov order because all the sequence pairs are homologous protein sequences, while in Figs.3(c), 3(d), 3(e), 3(g), 3(h), 3(i), 3(j), and 3(l) are obviously different because the sequence pairs are nonhomologous.
Based on Methods 1-4 shown in Table 2, Tables 4 to 9 show the comparison results of the MSE values between various pairs of the sequences in the six groups, respectively.A smaller MSE value represents less difference between two sequences.In these six tables, all the MSE values of the homologous sequence pairs are smaller than 0.2.On the contrary, the MSE values of the non-homologous sequence pairs are greater than 0.2.This can be observed in the other parts in these tables as well.The proposed method contributes the classification of protein sequences and thus can serve as an alternative for the sequence comparison task.4 to 9. The horizontal axis denotes the set of experiments (sequence pair), while the vertical axis denotes the normalized difference value.In order to compare the sequence similarly in each method accordingly, the BLASTp scores are replaced by the value 1-BLASTp to correspond to the same numerical characteristics with that in the proposed methods and ClustalW.In Table 3, some results of BLASTp method are shown as NF, which means that the two protein sequences have no similarity found.Here, we set the NF value as 0 and then the value of 1-BLASTp is 1, which denotes the maximal difference.In Fig. 4(a), the depicted lines of MSE1, MSE2, MSE3, MSE4, and ClustalW have the similar rising and descending trends for the short sequences.The scores in all the six methods basically can be used to distinguish the homologous and non-homologous sequences.However, in Figs.4(b) and 4(c), MSE2 and MSE4 are not consistent to MSE1, ClustalW, and 1-BLASTp for the medium and long sequence pairs.In Figs.4(b) and 4(f), the 1-BLASTp scores are quite different from other scores and the variations among these methods are larger than that in the other figures.Note that the MSE4 values are higher than other values for Sequence pairs 14, and 15 in Figs.4(d) to 4(f).The experimental results show that the mapping methods while encountering characteristics of hydrophilic amino acids and the general physical-chemical properties of amino acids can affect the comparison results.In Figs.4(a) to 4(f), the MSE1 and MSE3 are nearly the same because the separation between each two amino acids is the same even if the mapping positions are horizontally reversed.
According to the results of the proposed methods, ClustalW, and BLASTp, the following observations are obtained: (1) The numerical mapping according to the physical-chemical characteristics of amino acids affect the results of comparison.In the experimental results, the more characteristics of protein are considered and arranged, the more correct results are obtained.(2) The MSE1 results are similar to the ClustalW scores, which means that the mapping scheme is consistent with the ClustalW method.(3) BLASTp and ClustalW are different methods, especially designed for the global and local sequence comparisons, respectively.The differences shown in the experimental results are especially obvious for the sequences in Groups 2 and 6.

IV. CONCLUSION
We proposed a new comparative tool for protein sequence comparison utilizing the parametric spectral estimate in stochastic processes to analyze protein sequences.The concepts of hydrophobicity in the amino acid physical-chemical properties are used to transform amino acids to numerical values.The experimental results show that the proposed methods effectively achieved the consistent comparison results with the well-known ClustalW and BLASTp.This research provides a new insight for the biologists as to how protein sequences can be analyzed.In our future work, more protein sequences will be tested by the proposed method.The problems encountered by two protein sequences with large difference in length will be tackled as well.4; (c) Table 5;

Fig. 2 .
Fig. 2. Graphs of the mapping between the 20 amino acids and their corresponding positions on the circle in the proposed four mapping methods: (a) Method 1, (b) Method 2, (c) Method 3, and (d) Mehtod 4

Figures 4 (
Figures 4(a)-4(f) illustrate the comparison results of the proposed method, ClustalW, and BLASTp in Tables4 to 9. The horizontal axis denotes the set of experiments (sequence pair), while the vertical axis denotes the normalized difference value.In order to compare the sequence similarly in each method accordingly, the BLASTp scores are replaced by the value 1-BLASTp to correspond to the same numerical characteristics with that in the proposed methods and ClustalW.In Table3, some results of BLASTp method are shown as NF, which means that the two protein sequences have no similarity found.Here, we set the NF value as 0 and then the value of 1-BLASTp is 1, which denotes the maximal difference.In Fig.4(a), the depicted lines of MSE1, MSE2, MSE3, MSE4, and ClustalW have the similar rising and descending trends for the short sequences.The scores in all the six methods basically can be used to distinguish the homologous and non-homologous sequences.However, in Figs.4(b) and 4(c), MSE2 and MSE4 are not consistent to MSE1, ClustalW, and 1-BLASTp for the This research is partially supported by Ministry of Science and Technology (MOST), Taiwan under the contract number MOST 103-2221-E-224-035-MY2.

TABLE I .
THE STANDARD ONE-LETTER ABBREVIATION OF THE 20 AMINO ACIDS AND THEIR PROPERTIES ON POLARITY, CHARGE, AND HYDROPHILICITY OR HYDROPHOBICITY

TABLE III .
THE THREE GROUPS OF THE TEST PROTEIN SEQUENCES IN OUR EXPERIMENTS

TABLE IV .
THE COMPARISON RESULTS OF THE PROPOSED, CLUSTALW, AND BLASTP METHODS FOR GROUP 1 (<200 AA) SEQUENCES

TABLE VI .
THE COMPARISON RESULTS OF THE PROPOSED, CLUSTALW, AND BLASTP METHODS FOR GROUP 3 (401 -800 AA) SEQUENCES

TABLE VIII .
THE COMPARISON RESULTS OF THE PROPOSED, CLUSTALW, AND BLASTP METHODS FOR GROUP 5 (1201 -2000 AA) SEQUENCES

TABLE IX .
THE COMPARISON RESULTS OF THE PROPOSED, CLUSTALW, AND BLASTP METHODS FOR GROUP 6 (2001 -3000 AA) SEQUENCES