DNA Sequence Representation and Comparison Based on Quaternion Number System

Conventional schemes for DNA sequence representation, storage, and processing are usually based on character formats. We propose the quaternion number system for the numerical representation and further processing of DNA sequences. In the proposed method, the quaternion cross-correlation operation yields both the global and the local matching/mismatching information between two DNA sequences, read from a one-dimensional curve and a two-dimensional pattern, respectively. Simulation results on various DNA sequences and a comparison with the well-known BLAST method verify the effectiveness of the proposed method.

Keywords: Bioinformatics; genomic signal processing; DNA sequence; quaternion number; data visualization.


I. INTRODUCTION
Recent progress in biotechnology has made deoxyribonucleic acid (DNA) sequencing more efficient. A huge number of DNA sequences from various organisms have been sequenced with high accuracy. Analyzing DNA sequences helps reveal biological relationships such as homology and phylogeny among different species. However, analysis by purely biological methods is too slow for such large amounts of sequence data. The assistance of computers is therefore necessary, and bioinformatics has developed extensively as a result. Efficient algorithms are needed to handle the large and tedious volume of biomolecular data.
Computer-based algorithms have solved various problems in bioinformatics, such as sequence matching (pairwise and multiple sequences, global and local alignments), fragment assembly of DNA pieces, and physical mapping of DNA sequences. Most of these algorithms treat the data structure of a DNA sequence as a string, a tree, or a graph. Artificial intelligence techniques such as genetic algorithms [1], artificial neural networks [2], and data mining [3] have been used intensively in this research area. In [4], genomic signals are studied mainly at scales of $10^4$ to $10^8$ bp, both to detect general trends that may reveal their basic properties and to search for specific genomic signals with possible control functions. In addition, many distributed databases have been constructed on the Internet and can be easily accessed through the World Wide Web [5]-[7]. Most of the techniques treat DNA sequences as symbolic data composed of the four characters A, G, C, and T, which correspond to the four nucleotide bases Adenine, Guanine, Cytosine, and Thymine, respectively. However, genomic sequences can be represented not only as symbolic data but also in numeric form.
Genomic signal processing has recently received a great deal of attention [8], [9]. Well-known digital signal processing (DSP) techniques have been developed to analyze numeric signals in many applications [10]. If symbolic DNA sequences can be transformed into numeric ones, DSP-based algorithms can provide alternative solutions to bioinformatics problems defined in the symbolic domain. Hsieh et al. proposed DNA-based schemes to efficiently solve graph isomorphism problems [11], [12]. Previous studies have presented various methods of mapping symbolic DNA sequences to numeric ones for further processing, such as the discrete Fourier transform or the wavelet transform [13]-[15]. The periodic patterns that exist in DNA sequences can then be observed from the resulting scalograms or spectrograms. When the four bases are assigned real or complex numbers, the subsequent mathematical operations are straightforward and simple. However, it is difficult to attach a biological meaning to such a mapping between bases and numbers.
Magarshak proposed a quaternion representation of RNA sequences [16]. Four bases with eight biological states form the group of quaternions, and the tertiary structure of an RNA sequence can be analyzed through the quaternion formalism. Hypercomplex signals can be considered the general form of quaternion signals [17]. Shu et al. proposed a hypercomplex number representation for pairwise alignment and for determining the cross-correlation of DNA sequences [18]-[20]. In that work, the DNA sequences are aligned with fuzzy composition, and a new scoring system was proposed to suit the hypercomplex number representation. Building on these ideas, we adopt the quaternion number system [21] for DNA sequences. Quaternion numbers are hypercomplex numbers with one real part and three imaginary parts. Here, the four bases in a DNA sequence are assigned four different quaternion numbers, so that a DNA sequence is transformed into a quaternion-number sequence. Instead of finding the local frequency information of DNA sequences, a cross-correlation algorithm based on quaternion numbers is proposed for both the global and the local matching between two DNA sequences. Since the real parts of the four quaternion numbers are all zero and the imaginary parts have certain properties, the matching information can be observed from the real part of the result of the quaternion cross-correlation operation. In addition to the global cross-correlation result, the local matching information can be extracted from the product of each multiplication in the correlation operation. The global and local comparisons can be represented by a 1-D curve and a 2-D pattern, respectively. The simulation results show that the proposed quaternion number system can efficiently represent DNA sequences and can then be used to determine the global and local sequence alignment with the help of the cross-correlation operation.
The organization of this paper is as follows. Section II introduces the basics of the quaternion number system and the quaternion number representation of DNA sequences, and describes the global and local sequence matching based on quaternion correlation. Section III provides simulation results for certain DNA sequences, which verify the effectiveness of the proposed method. Finally, the conclusion is drawn in Section IV.

A. Quaternion Number Sequence Representation
Quaternion numbers [21], also called hypercomplex numbers, are a generalization of complex numbers. They have been applied to color image filtering [22] and segmentation [23], and to the design of 3-D infinite impulse response filters [24]. Since the quaternion number system is not well known in all signal processing areas, its properties are briefly reviewed here.
A quaternion number has four components: one real part and three imaginary parts. A quaternion number q is written as
$$q = q_a + q_b\hat{i} + q_c\hat{j} + q_d\hat{k}, \tag{1}$$
where $q_a$, $q_b$, $q_c$, and $q_d$ are real numbers, and $\hat{i}$, $\hat{j}$, and $\hat{k}$ are the operators for the three imaginary parts. The conjugate of a quaternion number, $q^{*}$, is defined as
$$q^{*} = q_a - q_b\hat{i} - q_c\hat{j} - q_d\hat{k}. \tag{2}$$
A more detailed description of quaternion operations can be found in [25].
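The definition and the conjugate reviewed above can be made concrete with a few lines of Python. This is a minimal sketch, not code from the paper: quaternions are stored as plain `(a, b, c, d)` tuples, and the helper names `qmul` and `qconj` are hypothetical. The product follows the standard Hamilton rules ($\hat{i}\hat{j} = \hat{k}$, $\hat{j}\hat{i} = -\hat{k}$, and so on).

```python
def qmul(p, q):
    """Hamilton product of two quaternions stored as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    )


def qconj(q):
    """Conjugate: keep the real part, negate the three imaginary parts."""
    a, b, c, d = q
    return (a, -b, -c, -d)


I = (0, 1, 0, 0)
J = (0, 0, 1, 0)
K = (0, 0, 0, 1)
```

Multiplication is not commutative: `qmul(I, J)` gives `K` while `qmul(J, I)` gives the negative of `K`, which is why the order of the two factors matters in the correlation discussed later.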
Table I shows the proposed mapping between the four characters and the corresponding quaternion numbers. Note that all the real parts are zero, while the imaginary parts can be considered the coordinates of the four vertices of a regular tetrahedron in 3-D space.
That is, the 3-D coordinates (x, y, z) of the vertices of a regular tetrahedron are assigned to the imaginary parts b, c, and d of the quaternion numbers, and the real part a is set to zero. A character sequence is then mapped to a quaternion number sequence. For example, the character sequence $s_c[n] = \{A, A, T, A, G, C, G, T\}$ is mapped to the quaternion number sequence $s_q[n] = \{q_A, q_A, q_T, q_A, q_G, q_C, q_G, q_T\}$. After this mapping, a quaternion number sequence standing for a DNA sequence is obtained:
$$s_q[n] = \{q_1, q_2, q_3, \ldots, q_n\}. \tag{3}$$
Then, an accumulating process is applied to obtain another quaternion number series
$$\tilde{s}_q[n] = \{\tilde{q}_1, \tilde{q}_2, \tilde{q}_3, \ldots, \tilde{q}_n\}, \tag{4}$$
where
$$\tilde{q}_k = \sum_{i=1}^{k} q_i. \tag{5}$$
By extracting the imaginary part of the series $\tilde{s}_q[n]$, the 3-D coordinates of each accumulated quaternion number in the series can be obtained. Line segments are used to connect these points in order, and a 3-D trajectory is thus obtained for DNA sequence visualization [26]-[28].

B. Quaternion Correlation
Once the quaternion number sequences of two DNA sequences have been obtained, the cross-correlation operation is performed for the global comparison. The cross-correlation of two quaternion number sequences $s_{q1}[n]$ and $s_{q2}[n]$ of lengths M and N, respectively, is defined as
$$r[\tau] = \sum_{n} s_{q1}[n]\, s_{q2}^{*}[n-\tau], \tag{6}$$
where $\tau$ is the index of the correlation function. In the correlation operation, the conjugate operation is applied to one of the two quaternion numbers being multiplied. Therefore, the product of two identical symbols contributes +1 to the real part of the result, and the product of two different symbols contributes -1/3. If two sequences of lengths N and M (N > M) are correlated, the real part of the correlation result, $v_{\mathrm{Re}}$, for a z-bp overlap is
$$v_{\mathrm{Re}} = p - \frac{1}{3}(z - p), \tag{7}$$
where p is the number of matches and z - p is the number of mismatches.
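A direct evaluation of the cross-correlation can be sketched as below. Two points are assumptions rather than statements from the paper: the tetrahedron coordinates (Table I is not reproduced here) and the lag convention, in which the second sequence is shifted by `tau` and conjugated. All function names are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(3.0)
# ASSUMPTION: illustrative unit-tetrahedron vertices; Table I may use others.
BASE_TO_Q = {
    "A": (0.0, _S, _S, _S),
    "C": (0.0, _S, -_S, -_S),
    "G": (0.0, -_S, _S, -_S),
    "T": (0.0, -_S, -_S, _S),
}


def qmul(p, q):
    """Hamilton product of two (a, b, c, d) quaternions."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def qconj(q):
    a, b, c, d = q
    return (a, -b, -c, -d)


def cross_correlation(seq1, seq2):
    """Return {tau: quaternion} with r[tau] = sum_n s1[n] * conj(s2[n - tau])."""
    s1 = [BASE_TO_Q[b] for b in seq1]
    s2 = [BASE_TO_Q[b] for b in seq2]
    result = {}
    for tau in range(-(len(s2) - 1), len(s1)):  # N + M - 1 lags in total
        acc = (0.0, 0.0, 0.0, 0.0)
        for n in range(len(s1)):
            m = n - tau
            if 0 <= m < len(s2):  # only overlapping positions contribute
                prod = qmul(s1[n], qconj(s2[m]))
                acc = tuple(u + v for u, v in zip(acc, prod))
        result[tau] = acc
    return result


r = cross_correlation("AATAGCGT", "AATACCGT")  # 7 matches, 1 mismatch at lag 0
```

At lag 0 the two 8-bp sequences overlap completely with seven matches and one mismatch, so the real part `r[0][0]` comes out as 7 - 1/3, in line with the +1 and -1/3 contributions described above.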
Equation (6) shows that the cross-correlation result for a specific $\tau$ value is the sum of the products of the original sequence and the shifted, conjugated sequence. The product at a given n is non-zero where the two sequences overlap and zero where they do not. Where the two sequences overlap, the product of two quaternion numbers reflects whether they are the same or not. That is, during the cross-correlation operation, a local alignment is carried out for each given $\tau$ value. For example, if there are z overlapping numbers between the two sequences, Eq. (6) becomes
$$r[\tau] = \sum_{n=m}^{m+z-1} s_{q1}[n]\, s_{q2}^{*}[n-\tau], \tag{8}$$
where m denotes the starting position of the overlap region under the given $\tau$ value. Since there are (N+M-1) possible $\tau$ values in the cross-correlation result, all the possible local alignments between the two sequences can be obtained. Therefore, the local alignment results can be shown in a 2-D array of size (N+M-1) × (N+M-1), in which each entry denotes the matching status of two nucleotides from the two DNA sequences. A grayscale image $f_{\mathrm{Re}}(x,y)$ of size (N+M-1) × (N+M-1) corresponding to this 2-D array can be generated based on the following rule:
$$f_{\mathrm{Re}}(x,y) = \begin{cases} 0\ (\text{black}), & \mathrm{Re}\{ s_{q1}[x]\, s_{q2}^{*}[x-y] \} = 1, \\ 255\ (\text{white}), & \text{otherwise}. \end{cases} \tag{9}$$
Here Re{ } denotes the real part of the quaternion value in the braces and $0 < x, y \le N+M-1$. If connected pixels in the horizontal direction form a black line in the image, a local match between the two sequences exists. Therefore, the local matching information between two DNA sequences can be obtained from the quaternion correlation result in addition to the global matching information.
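The 2-D local-matching pattern can be sketched as a table with one row per lag. This is a simplified illustration under stated assumptions: the tetrahedron coordinates are illustrative, each row has only as many columns as the first sequence (rather than the padded N+M-1 columns of the image described above), and a pixel is drawn black only when the real part of the product indicates a match. The names `match_pattern` and `longest_black_run` are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(3.0)
# ASSUMPTION: illustrative unit-tetrahedron vertices; Table I may use others.
BASE_TO_VEC = {
    "A": (_S, _S, _S),
    "C": (_S, -_S, -_S),
    "G": (-_S, _S, -_S),
    "T": (-_S, -_S, _S),
}
BLACK, WHITE = 0, 255


def real_product(b1, b2):
    """Real part of q1 * conj(q2) for pure quaternions: the dot product of
    their imaginary parts (+1 for equal bases, -1/3 otherwise)."""
    return sum(u * v for u, v in zip(BASE_TO_VEC[b1], BASE_TO_VEC[b2]))


def match_pattern(seq1, seq2):
    """Return {tau: row}; row[n] is BLACK where seq1[n] matches seq2[n - tau]."""
    pattern = {}
    for tau in range(-(len(seq2) - 1), len(seq1)):
        row = [WHITE] * len(seq1)
        for n in range(len(seq1)):
            m = n - tau
            if 0 <= m < len(seq2) and real_product(seq1[n], seq2[m]) > 0.5:
                row[n] = BLACK
        pattern[tau] = row
    return pattern


def longest_black_run(row):
    """Length of the longest horizontal run of matching pixels in one row."""
    best = run = 0
    for pixel in row:
        run = run + 1 if pixel == BLACK else 0
        best = max(best, run)
    return best


pattern = match_pattern("GGACGTAA", "ACGT")
best_lag = max(pattern, key=lambda tau: longest_black_run(pattern[tau]))
```

Here the short sequence occurs inside the long one starting at position 2, so the row for lag 2 contains a black run of length four while the other rows contain only isolated black pixels.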

C. Mismatching Analysis
The real part of the correlation result reflects the matching information, while the imaginary part reflects the mismatching information in sequence comparison. Therefore, the numbers of matches and mismatches can be investigated from the real and imaginary parts of the correlation result. Let $q_3$ and $q_4$ denote the products of two quaternion numbers $q_1$ and $q_2$, that is, $q_3 = q_1 q_2$ and $q_4 = q_2 q_1$, where $q_n = a_n + b_n\hat{i} + c_n\hat{j} + d_n\hat{k}$, n = 1, 2, 3, 4. According to the rules of quaternion multiplication, the following results can be obtained:
$$a_3 = a_1 a_2 - b_1 b_2 - c_1 c_2 - d_1 d_2,$$
$$b_3 = a_1 b_2 + b_1 a_2 + c_1 d_2 - d_1 c_2,$$
$$c_3 = a_1 c_2 - b_1 d_2 + c_1 a_2 + d_1 b_2,$$
$$d_3 = a_1 d_2 + b_1 c_2 - c_1 b_2 + d_1 a_2,$$
and the components of $q_4$ follow by interchanging the subscripts 1 and 2. When $a_1 = a_2 = 0$, as for the four base quaternions, it follows that $a_4 = a_3$ and $(b_4, c_4, d_4) = -(b_3, c_3, d_3)$, so the imaginary parts also record the order of the two bases. In addition to the real part of the calculated quaternion number, the imaginary parts can therefore provide further mismatching information between the two sequences. A color image can be generated in which the R, G, and B components correspond to the three imaginary parts of the quaternion number.
First, the possible values of the three imaginary parts are determined and listed in Table II. Second, for each component, the finite set of values is normalized to grayscale pixel values between 0 and 255.
There are six, five, and three possible values for the $\hat{i}$, $\hat{j}$, and $\hat{k}$ imaginary parts, respectively. Each value of each imaginary part can be assigned a grayscale value. Therefore, three grayscale images for the R, G, and B components can be generated. By combining the three components, different colors corresponding to the various mismatching conditions between two nucleotides can be obtained. Figure 1 shows the different colors and the corresponding mismatching conditions. According to the color assignments in Fig. 1, a 2-D pattern $f_{\mathrm{Im}}(x,y)$ that reflects the imaginary parts of the correlation results can be generated. The image $f_{\mathrm{Im}}(x,y)$ is similar to the image $f_{\mathrm{Re}}(x,y)$ that reflects the real parts of the correlation results. However, each pixel represents one of the different mismatching conditions.
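The color assignment can be illustrated with a small sketch. It rests on assumptions that are not taken from the paper: the tetrahedron coordinates are illustrative (with them each imaginary component takes three distinct values rather than the six, five, and three reported for the actual table), and the normalization to 0..255 is done by evenly spacing the sorted distinct values. The function name `mismatch_colors` is hypothetical.

```python
import math
from itertools import permutations

_S = 1.0 / math.sqrt(3.0)
# ASSUMPTION: illustrative unit-tetrahedron vertices; Table I may use others.
BASE_TO_Q = {
    "A": (0.0, _S, _S, _S),
    "C": (0.0, _S, -_S, -_S),
    "G": (0.0, -_S, _S, -_S),
    "T": (0.0, -_S, -_S, _S),
}


def qmul(p, q):
    """Hamilton product of two (a, b, c, d) quaternions."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def qconj(q):
    a, b, c, d = q
    return (a, -b, -c, -d)


def mismatch_colors():
    """Map each ordered pair of different bases to an (R, G, B) color built
    from the three imaginary parts of q_x * conj(q_y)."""
    products = {
        (x, y): qmul(BASE_TO_Q[x], qconj(BASE_TO_Q[y]))[1:]
        for x, y in permutations("ACGT", 2)
    }
    colors = {pair: [] for pair in products}
    for axis in range(3):  # axis 0, 1, 2 -> R, G, B
        levels = sorted({round(v[axis], 9) for v in products.values()})
        for pair, v in products.items():
            rank = levels.index(round(v[axis], 9))
            # ASSUMPTION: evenly spaced gray levels over the distinct values.
            colors[pair].append(round(255 * rank / (len(levels) - 1)))
    return {pair: tuple(rgb) for pair, rgb in colors.items()}


colors = mismatch_colors()
```

Because quaternion multiplication is not commutative, the pair (A, T) and the pair (T, A) receive different colors, which is what allows the pattern to distinguish the direction of a mismatch.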

D. Complexity Analysis
The general performance measures of an algorithm are its time (computation) complexity and its space complexity. By definition, the correlation of two sequences $x_1[n]$ and $x_2[n]$ of length N requires about $N^2$ multiplications together with the associated additions, so the time complexity of direct correlation is $O(N^2)$. In DSP theory, the discrete Fourier transform (FT) is used to accelerate correlation through the correlation theorem. That is,
$$r[\tau] = \mathrm{FT}^{-1}\{ X_1[k]\, X_2^{*}[k] \}, \tag{15}$$
where $X_1[k]$ and $X_2[k]$ are the FTs of $x_1[n]$ and $x_2[n]$. Let $\mathrm{FT}_Q$ and $\mathrm{FT}_Q^{-1}$ denote the quaternion Fourier transform (QFT) and the inverse QFT, respectively. By the hypercomplex Wiener-Khintchine theorem [23], Eq. (15) extends to quaternion sequences with the FT replaced by the QFT. The QFT is defined with respect to a unit pure quaternion $\hat{\mu}$, referred to as the eigen-axis, which represents a direction in the 3-D space of the imaginary part of a quaternion. A QFT can be implemented by two ordinary FTs [29]. Therefore, the time complexity of quaternion correlation can be significantly reduced to $O(N \log_2 N)$.
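The speed-up from the correlation theorem can be illustrated for the real part of the quaternion correlation. The sketch below does not implement the QFT. It swaps in a simpler decomposition: for pure quaternions the real part of $q_1 q_2^{*}$ is the dot product of the imaginary vectors, so the real part of the correlation is the sum of three ordinary real cross-correlations, each computed with a radix-2 FFT. The tetrahedron coordinates, the lag convention, and all function names are assumptions made for this illustration.

```python
import cmath
import math

_S = 1.0 / math.sqrt(3.0)
# ASSUMPTION: illustrative unit-tetrahedron vertices; Table I may use others.
BASE_TO_VEC = {
    "A": (_S, _S, _S),
    "C": (_S, -_S, -_S),
    "G": (-_S, _S, -_S),
    "T": (-_S, -_S, _S),
}


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def real_correlation_fft(seq1, seq2):
    """Real part of r[tau] = sum_n s1[n] conj(s2[n - tau]) via three FFTs."""
    n, m = len(seq1), len(seq2)
    size = 1
    while size < n + m - 1:  # zero-pad so circular correlation equals linear
        size *= 2
    total = [0.0] * size
    for axis in range(3):
        x = [complex(BASE_TO_VEC[b][axis]) for b in seq1] + [0j] * (size - n)
        y = [complex(BASE_TO_VEC[b][axis]) for b in seq2] + [0j] * (size - m)
        prod = [a * b.conjugate() for a, b in zip(fft(x), fft(y))]
        corr = fft(prod, inverse=True)
        for k in range(size):
            total[k] += corr[k].real / size  # 1/size normalizes the inverse
    return {tau: total[tau % size] for tau in range(-(m - 1), n)}


def real_correlation_direct(seq1, seq2):
    """Reference O(N*M) evaluation of the same quantity."""
    out = {}
    for tau in range(-(len(seq2) - 1), len(seq1)):
        acc = 0.0
        for i, base in enumerate(seq1):
            j = i - tau
            if 0 <= j < len(seq2):
                acc += sum(u * v for u, v in
                           zip(BASE_TO_VEC[base], BASE_TO_VEC[seq2[j]]))
        out[tau] = acc
    return out


fast = real_correlation_fft("AATAGCGTTACG", "TAGCG")
slow = real_correlation_direct("AATAGCGTTACG", "TAGCG")
```

Both routines return the same values up to rounding. Recovering the imaginary parts as well would require the full quaternion treatment referred to in the text, which this sketch deliberately leaves out.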
Regarding space complexity, the largest memory cost is the storage of the numerical data for the DNA sequences. If the correlation is calculated using the QFT, additional memory is needed to store the frequency-domain data during computation. This method therefore trades memory space for speed.

III. EXPERIMENTAL RESULTS
In the computer simulation, computer-generated random sequences and real DNA sequences are used to perform the quaternion correlation. Consider a random quaternion sequence $s_{x1}[n]$ and let $s_{x2}[n]$ denote the prefix (the first eight quaternion numbers) of the sequence $s_{x1}[n]$. The result of the cross-correlation operation is also a quaternion number sequence. Figure 2 shows the real part of the correlation results, which are distributed over eight discrete levels. In fact, nine cases are possible in the correlation results, with the matching count ranging from zero to eight. The coefficients show that adjacent levels differ by 4/3. Therefore, the real parts of the cross-correlation coefficients are directly related to the matching counts between the two sequences. The maximum correlation occurs at position zero because this is the exact matching position. Three DNA sequences retrieved from the web-based databases of the National Center for Biotechnology Information (NCBI) [30] are then used to test the proposed method. Each sequence has an accession number for identification. First, consider two sequences from highly similar genes: the human TGFA sequence (Accession: K03222, 867 bp) and the mouse TGFA sequence (Accession: BC003895, 1024 bp). The dihydrofolate reductase (DHFR) gene (Accession: L26316, 1042 bp) from a mouse is also considered.
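The random-sequence experiment can be imitated with a short simulation. This is a sketch under assumptions: the tetrahedron coordinates are illustrative, the sequence length of 64 and the seed are arbitrary choices, and only lags at which the 8-element prefix overlaps the long sequence completely are kept. The names `real_correlation` and `simulate` are hypothetical.

```python
import math
import random

_S = 1.0 / math.sqrt(3.0)
# ASSUMPTION: illustrative unit-tetrahedron vertices; Table I may use others.
BASE_TO_VEC = {
    "A": (_S, _S, _S),
    "C": (_S, -_S, -_S),
    "G": (-_S, _S, -_S),
    "T": (-_S, -_S, _S),
}


def real_correlation(seq1, seq2):
    """Real part of r[tau] = sum_n s1[n] conj(s2[n - tau]) for every lag."""
    out = {}
    for tau in range(-(len(seq2) - 1), len(seq1)):
        acc = 0.0
        for i, base in enumerate(seq1):
            j = i - tau
            if 0 <= j < len(seq2):
                acc += sum(u * v for u, v in
                           zip(BASE_TO_VEC[base], BASE_TO_VEC[seq2[j]]))
        out[tau] = acc
    return out


def simulate(length=64, prefix=8, seed=0):
    """Correlate a random sequence with its own prefix; keep full overlaps."""
    rng = random.Random(seed)
    s_x1 = "".join(rng.choice("ACGT") for _ in range(length))
    s_x2 = s_x1[:prefix]
    r = real_correlation(s_x1, s_x2)
    return {tau: v for tau, v in r.items() if 0 <= tau <= length - prefix}


# The nine possible levels for an 8-element overlap with p = 0..8 matches.
LEVELS = [(4 * p - 8) / 3 for p in range(9)]
result = simulate()
```

Every value in `result` falls on one of the nine entries of `LEVELS`, adjacent levels are 4/3 apart, and the value at lag 0 equals 8 because the prefix matches itself exactly.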
Figure 3(a) shows the quaternion cross-correlation result of the two TGFA genes. A correlation peak with value 807 at position $\tau$ = 867 indicates a large similarity (822 identical base pairs) and gives the best matching position of the two sequences. Figure 3(b) shows a detailed region from $\tau$ = 840 to 900. The correlation values at other positions are much smaller than the peak value. This verifies that the proposed method can determine the best global matching position of two sequences. Next, consider base-pair deletion and insertion, which commonly occur in DNA sequences. To investigate their effects, the sequence of the mouse TGFA gene is modified and then used to determine the correlation result. From the peak value in the correlation result, the number of matched base pairs can be estimated using Eq. (7). As the number of insertions/deletions increases, the corresponding correlation peak values decrease. The partial matching positions can still be detected from the decreased correlation peaks. Therefore, deletions or insertions in sequences can be estimated from the correlation result. Figure 5 shows the correlation result of two quite different genes (human TGFA and mouse DHFR). There is no significant peak among the real-part coefficients, which verifies that the similarity between these two gene sequences is small. The sequence matching results obtained from quaternion correlation are compared with the well-known BLAST method [31]. Figure 6 shows the top 34 sequences retrieved by BLAST when the human TGFA gene (Accession: K03222, 867 bp) is used as the query sequence. The sequences with the ten highest scores (excluding the query sequence itself) are used to perform the quaternion cross-correlation with the query sequence. Each cross-correlation result shows a correlation peak at the best-matching position. Table III summarizes the correlation results and BLAST scores for comparison. The peak values almost follow the trend of the BLAST scores. Since BLAST is designed specifically for local alignment of sequences, the small difference between the scores and the peak values is reasonable. Finally, Fig. 8 shows the 2-D color pattern in which the mismatching information can be directly observed. According to Fig. 8, four kinds of mismatching (AT, TA, CG, and GC) account for the major part. More information can be observed by examining the detailed parts of this pattern.
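The relation between a peak value and the number of matched base pairs follows from the +1 and -1/3 contributions of matching and mismatching symbols, and can be checked with a few lines of arithmetic. The function names are hypothetical, and taking the overlap to be the full 867 bp of the shorter sequence is an assumption, although it reproduces the figures quoted above.

```python
def real_peak(matches, overlap):
    """Real part of the correlation for `matches` identical bases in an
    overlap of `overlap` bases: +1 per match, -1/3 per mismatch."""
    return matches - (overlap - matches) / 3


def matches_from_peak(peak, overlap):
    """Invert the relation above to estimate the number of matches."""
    return (3 * peak + overlap) / 4


estimated = matches_from_peak(807, 867)  # peak and overlap quoted in the text
```

With a peak of 807 and an overlap of 867 bp the estimate is 822 matched base pairs, and `real_peak(822, 867)` gives back 807.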

IV. CONCLUSION
In this study, quaternion correlation based on the quaternion number system is proposed for DNA sequence representation and alignment. From the cross-correlation result of quaternion-number sequences, two DNA sequences can be compared in a pairwise mode. The peak value of the real part of the correlation result corresponds to the globally best-matching position of two similar sequences. In addition, the 2-D image obtained from the product terms in the cross-correlation operation provides more information on the local alignment of two DNA sequences. Deletions or insertions in the sequences can be identified by analyzing the correlation results. Moreover, a color 2-D image can be generated to visualize the mismatching conditions of two DNA sequences. The simulation results show that the proposed method has promising potential in bioinformatics. Future work will focus on extracting more information and relationships between two sequences from the generated 2-D pattern.

Figure 1. The colors corresponding to the different combinations of the multiplied quaternion numbers.
In Eq. (15), the time complexity depends on the FT of the two sequences $x_1[n]$ and $x_2[n]$. Because the time complexity of the direct FT is $O(N^2)$ and that of the element-wise multiplication is $O(N)$, the time complexity of the correlation operation is $O(N^2)$. With the fast FT algorithm, the time complexity improves to $O(N \log_2 N)$.

Figure 2. The real-part correlation coefficients of the two quaternion sequences $s_{x1}[n]$ and $s_{x2}[n]$.

Figure 3. The real-part correlation coefficients of the sequences for the TGFA genes of Human and Mouse. (a) The highest correlation peak appears at the best-matching position; (b) the detailed center region.
Figures 4(a) and 4(c) show the correlation results when a 1-bp deletion and a 1-bp insertion occur at the center position of the TGFA gene, respectively. As shown in detail in Figs. 4(b) and 4(d), two peaks, each about half of the original correlation peak value, appear at the deletion/insertion positions.
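The splitting of the peak under a single-base deletion can be reproduced on synthetic data. The sketch below uses a random 400-bp sequence instead of the TGFA gene, illustrative tetrahedron coordinates, and an assumed lag convention in which the second sequence is shifted by `tau`; the names `real_correlation` and `deletion_demo` are hypothetical.

```python
import math
import random

_S = 1.0 / math.sqrt(3.0)
# ASSUMPTION: illustrative unit-tetrahedron vertices; Table I may use others.
BASE_TO_VEC = {
    "A": (_S, _S, _S),
    "C": (_S, -_S, -_S),
    "G": (-_S, _S, -_S),
    "T": (-_S, -_S, _S),
}


def real_correlation(seq1, seq2):
    """Real part of r[tau] = sum_n s1[n] conj(s2[n - tau]) for every lag."""
    out = {}
    for tau in range(-(len(seq2) - 1), len(seq1)):
        acc = 0.0
        for i, base in enumerate(seq1):
            j = i - tau
            if 0 <= j < len(seq2):
                acc += sum(u * v for u, v in
                           zip(BASE_TO_VEC[base], BASE_TO_VEC[seq2[j]]))
        out[tau] = acc
    return out


def deletion_demo(length=400, seed=1):
    """Delete the center base of a random sequence and correlate with it."""
    rng = random.Random(seed)
    original = "".join(rng.choice("ACGT") for _ in range(length))
    center = length // 2
    deleted = original[:center] + original[center + 1:]
    r = real_correlation(original, deleted)
    top_two = sorted(sorted(r, key=r.get, reverse=True)[:2])
    return r, top_two


r, top_two = deletion_demo()
```

The two largest values occur at adjacent lags 0 and 1: the half before the deletion aligns at one lag and the half after it at the next, so each peak is roughly half the length of the sequence instead of one peak near the full length.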

Figure 4. The correlation results for (a) 1-bp deletion in the center position of the Human TGFA sequence; (b) the detailed center region in (a); (c) 1-bp insertion in the center position of the Human TGFA sequence; (d) the detailed center region in (c).

Figure 5. The real part of the cross-correlation coefficients of the sequences for the Human TGFA and Mouse DHFR genes.

Figure 6. The query result of the Human TGFA gene using the nucleotide-nucleotide BLAST (blastn) tool.

TABLE III. THE CORRELATION PEAK VALUES AND POSITIONS AND THE CORRESPONDING TOP TEN SCORES WHEN USING BLASTN TO QUERY THE HUMAN TGFA SEQUENCE IN THE NCBI DATABASE.
Figure 7(a) shows the local alignment result for the human and mouse TGFA genes. During the cross-correlation operation, the product of two quaternion numbers is non-zero only where the two sequences overlap. The non-zero products form a parallelogram in the rectangular pattern. A horizontal line appears at the correlation index $\tau$ = 867, which corresponds to the peak correlation result shown in Fig. 7(b). In addition to the cross-correlation values, which represent the global matching information of the two sequences, the local matching information can be observed and measured in the parallelogram. To demonstrate the capability of the proposed quaternion correlation method, Figs. 7(c), 7(e), and 7(g) show the alignment results when a 70-bp deletion, insertion, and substitution occur in one of the two sequences. For deletion and insertion, the horizontal line breaks into two parts, which are shifted horizontally by 70 bp. For substitution, the part of the line corresponding to the substituted nucleotides disappears. Figs. 7(d), 7(f), and 7(h) show that the corresponding correlation peaks appear accordingly. In addition to the peak values, the local matching positions can be directly observed from the 2-D pattern. Compared with the 1-D correlation result, the 2-D pattern clearly provides more information on the local matching result.

Figure 7. (a) Local matching results obtained from the cross-correlation operation of two quaternion sequences; (b) corresponding 1-D cross-correlation result of the 2-D pattern shown in (a); (c) 70 bp deletion in one of the sequences; (d) corresponding 1-D cross-correlation result of the 2-D

Figure 8. The 2-D pattern corresponding to the imaginary parts of the quaternion correlation results of the two sequences for the Human TGFA and Mouse DHFR genes.

TABLE I. THE MAPPING TABLE FROM A, C, T, AND G TO THE CORRESPONDING QUATERNION NUMBERS.

TABLE II. THE IMAGINARY PARTS OF THE MULTIPLIED VALUES OF EVERY TWO DIFFERENT QUATERNION NUMBERS.