An Effective Identification of Species from DNA Sequence: A Classification Technique by Integrating

Species classification from DNA sequences remains an open challenge in bioinformatics, which deals with the collection, processing and analysis of DNA and proteomic sequences. Although data mining can guide the process, the poor definition and heterogeneous nature of gene sequences remain a barrier. In this paper, an effective classification technique to identify an organism from its gene sequence is proposed. The proposed integrated technique is based mainly on pattern mining and neural network-based classification. In pattern mining, the technique mines nucleotide patterns and their support from a selected DNA sequence. The high dimension of the mined dataset is reduced using Multilinear Principal Component Analysis (MPCA). In classification, a well-trained neural network classifies the selected gene sequence, so the organism can be identified even from a part of the sequence. The proposed technique is evaluated by 10-fold cross validation, a statistical validation measure, and the obtained results indicate the efficacy of the technique.


I. INTRODUCTION
Bioinformatics is a rapidly growing area of computer science [19] that deals with the collection, organization, and analysis of Deoxyribonucleic acid (DNA) and protein sequences [18]. Today it addresses the formal and practical issues that arise in the management and analysis of genomic and proteomic data, encompassing the creation and development of databases, algorithms, computational and statistical techniques, and hypotheses [1].
Genomic signal processing (GSP) is a relatively new area in bioinformatics that uses traditional digital signal processing techniques to deal with digital signal representations and analysis of genomic data [2] [12]. GSP gains biological knowledge through the analysis, processing, and use of genomic signals and translates that knowledge into systems-based applications [3]. The main objective of GSP is to integrate signal processing theories and methods with a global understanding of functional genomics, with significant emphasis on genomic regulation [4].
The whole DNA of a living organism is known as its genome [5]. Genomic signals carry genomic information to all the processes that take place in an organism [6]. Essentially, DNA is a nucleic acid that has two long strands of nucleotides twisted in the form of a double helix, and its external backbone is made up of alternating deoxyribose sugar and phosphate molecules. The nitrogenous bases Adenine, Guanine, Cytosine and Thymine are present in pairs in the interior portion of the DNA [13] [9]. DNA and proteins can be represented mathematically as character strings, where each character is a letter of the alphabet [6] [10] [11].
One of the vital tasks in the study of genomes is gene identification [7]. DNA analysis utilizes methods such as clustering [20], data mining [21] [22] [23], gene identification [24] and gene regulatory network modeling [25] [26]. These methods present cutting-edge research topics and methodologies that facilitate collaboration between researchers and bioinformaticians. Mining bioinformatics data is a rising field at the intersection of bioinformatics and data mining [14]. Some of these methods belong to the category of data mining that decides whether a previously unseen example belongs to a predefined type. The increased availability of huge amounts of biomedical data, and the growing need to turn such data into useful information and knowledge, are the main reasons for the recent attention to data mining in the biomedical industry.
A large number of research works that incorporate data mining in bioinformatics for different purposes are available in the literature [15] [16] [17]. A few important works of this kind are reviewed in Section 2. One important line of research of this type is the identification of the species, or name, of an organism from its gene sequence. Characterizing unknown environmental isolates against genomic species is not easy because genomic species are especially heterogeneous and poorly defined [8]. Identifying the species or the organism from its gene sequence is therefore a challenging task. In this paper, we propose a classification technique to effectively classify the species or name of an organism from its DNA sequence. This technique is detailed with mathematical formulations and illustrations in Section 3. Section 4 discusses the implementation results and Section 5 concludes the paper.

www.ijacsa.thesai.org

II. RELATED WORKS
Plenty of research works deal with mining knowledge from genomic sequences. Some of the recent research works are briefly reviewed here.
Riccardo Bellazzi et al. [27] discussed how, in past years, gene expression data analysis aimed at complementing microarray analysis with data and knowledge from various existing sources has grown from being purely data-centric to integrative. Focusing on the evolution of gene expression data mining techniques toward knowledge-based data analysis approaches, they reported on the abundance of such techniques. In particular, the latest developments in gene expression-based analysis methods used in association and classification studies, phenotyping and reverse engineering of gene networks were discussed.
The gene expression data sets for ovarian, prostate, and lung cancer were examined by Shital Shah et al. [28]. An integrated gene-search algorithm was presented for genetic expression data analysis. Their integrated algorithm included a genetic algorithm and correlation-based heuristics for data preprocessing (on partitioned data sets), and data mining (decision tree and support vector machine algorithms) for making predictions. The knowledge obtained by the algorithm had high classification accuracy, with the capability to recognize the most important genes. To further improve the classification accuracy, bagging and stacking algorithms were employed. The results were compared with those in the literature. The mapping of genotype information to phenotype parameters ultimately reduced the cost and complexity of cancer detection and classification.
Locating motifs in bio-sequences, a very significant primitive operation in computational biology, was discussed by Hemalatha et al. [29]. Memory space and computational complexity are among the computational requirements of a motif discovery algorithm. To overcome the intricacy of motif discovery, an alternative solution integrating a genetic algorithm and Fuzzy ART machine learning approaches was proposed, eliminating the multiple sequence alignment process. The results attained by their model in discovering motifs, in terms of speed and length, were compared with the existing technique. Their technique found a motif of length 11 in 18 sec and of length 15 in 24 sec, whereas the existing technique found a motif of length 11 in 34 sec. Their technique thus outperformed the accepted existing technique. The algorithm was implemented in MATLAB and tested with large DNA sequence data sets and synthetic data sets.
A web-based interactive framework for the analysis and visualization of gene expressions and protein structures was described by Ashraf S. Hussein [30]. The formulation of the framework encountered various challenges because of the variety of relevant analysis and visualization techniques, in addition to the diversity of biological data types on which these techniques operate. The two main challenges that directed the formulation of the framework were the integration of data from heterogeneous resources (for instance, expert-driven data from text, public domain databases and various large-scale experimental data) and the lack of standard I/O, which makes it difficult to integrate the most recent analysis and visualization tools. Hence, the basic novelty of their framework was the integration of state-of-the-art techniques of both analysis and visualization for gene expressions and protein structures through a unified workflow. Moreover, it supports a wide range of input data types and can export three-dimensional interactive outputs, using Virtual Reality Modeling Language (VRML), ready for exploration on off-the-shelf monitors and immersive 3D stereo display environments.
A stomach cancer detection system based on an Artificial Neural Network (ANN) and the Discrete Cosine Transform (DCT) was developed by Ahmad M. Sarhan [31]. The system extracted classification features from stomach microarrays using the DCT. The features extracted from the DCT coefficients were applied to an ANN for classification (tumor or non-tumor). The microarray images were acquired from the Stanford Medical Database (SMD). Simulation results illustrated that the proposed system produced a very high success rate.
The challenging issue in microarray techniques, namely analyzing and interpreting the large volume of data, was discussed by Valarmathie et al. [32]. Clustering techniques in data mining make this possible. In hierarchical and k-means clustering, which are hard clustering techniques, the data is split into definite clusters in which each data element belongs to exactly one cluster, so the result of the clustering may often be wrong. The problems of hard clustering can be resolved by fuzzy clustering. Among fuzzy clustering methods, fuzzy C-means (FCM) is best suited for microarray gene expression data. The problem with fuzzy C-means is that the number of clusters to be generated for the given dataset must be specified in advance. This can be solved by combining the technique with the popular probability-based Expectation Maximization (EM) algorithm, which models the cluster structure of gene expression data and offers a statistical framework. Determining the accurate number of clusters and interpreting them efficiently is the main purpose of the hybrid fuzzy C-means technique.
Explorative studies in support of solutions that facilitate the analysis and interpretation of mining results were described by Belmamoune et al. [33]. They described a solution located in the extension of the Gene Expression Management System (GEMS), i.e. an integrative framework for spatio-temporal organization of gene expression patterns of zebrafish, to a framework that supports data mining, data analysis and pattern interpretation. As a proof of principle, GEMS was provided with data mining functionality appropriate for spatio-temporal monitoring, thus adding value to the submission of data for data mining and analysis. Based on the availability of domain ontologies, the genetic networks were analyzed, which dynamically gives meaning to the discovered patterns of gene expression data. Combining data mining with the already available capabilities of GEMS considerably augments the existing data processing and functional analysis strategies.

III. THE INTEGRATED TECHNIQUE FOR SPECIES CLASSIFICATION
The proposed species classification technique classifies species based on a given DNA sequence. A DNA sequence is composed of four basic nucleotides, Adenine (A), Guanine (G), Cytosine (C) and Thymine (T); every species has a long DNA sequence formed by these four nucleotides.
The DNA sequence defines the attributes, nature and type of the species. The proposed technique integrates data mining and artificial intelligence. First, nucleotide patterns are mined from the sequence; the mined patterns form a high-dimensional nucleotide pattern database. Second, the dimension of the pattern database is reduced by MPCA. Finally, the dimensionality-reduced pattern database is used to train a neural network. The technique is described in the following subsections.

A. Mining Nucleotide Patterns from DNA sequence
The first stage of the proposed technique mines nucleotide patterns from the DNA sequence. At this stage, patterns formed by different combinations of nucleotides are mined using a novel mining algorithm. Let $g$ be the DNA sequence, which is a combination of the four nucleotides A, G, C and T. For instance, a sample DNA sequence is CGTCGTGGAA.
From the sequence, the mining algorithm extracts different nucleotide patterns and their support. The algorithm comprises two stages, namely pattern generation and support finding. In pattern generation, patterns of different lengths are generated, whereas in support finding, the support value of every generated pattern is determined from the DNA sequence. The basic structure of the algorithm is given as a block diagram in Fig. 1. Pattern generation is governed by Eq. (1), whose result is a set of different combinations of nucleotide bases; Eq. (1) operates under a set of criteria.

2) Determination of Pattern Support
The support, which has to be determined for every extracted pattern, describes the DNA attribute. The support can be determined by performing a window-based operation over the sequence $g$. Windows of sequences are determined for different lengths as given in Eq. (2). Once the windows of sequences are extracted, the support is determined for the mined patterns. The pseudo code in Fig. 2 describes the procedure to determine the support for every pattern. The obtained $C$ for each different-length pattern has the support for all the elements present in the corresponding pattern set. From the mined patterns and their corresponding support, a constructive dataset is generated.
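The two stages described above, pattern generation and window-based support finding, can be illustrated with a minimal Python sketch. It assumes that the support of a pattern is its raw occurrence count over overlapping windows; the paper's own definition in Eq. (1) and Eq. (2) may differ (for example, by normalizing the counts). The function names `generate_patterns` and `pattern_support` are hypothetical.

```python
from itertools import product

BASES = "ACGT"  # base set B with |B| = 4


def generate_patterns(length):
    """Return all 4**length nucleotide patterns of the given length."""
    return ["".join(p) for p in product(BASES, repeat=length)]


def pattern_support(sequence, length):
    """Count each pattern over overlapping windows of the sequence.

    Assumption: support = raw occurrence count (not normalized).
    """
    support = {pattern: 0 for pattern in generate_patterns(length)}
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        if window in support:  # ignore windows with symbols outside A, C, G, T
            support[window] += 1
    return support


# Sample sequence taken from the text
support_2 = pattern_support("CGTCGTGGAA", 2)
support_3 = pattern_support("CGTCGTGGAA", 3)
```

For the sample sequence, the length-2 windows are CG, GT, TC, CG, GT, TG, GG, GA and AA, so CG and GT each receive a support of 2 under this counting assumption.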

3) Constructive dataset generation
A raw dataset is generated using the aforesaid mining algorithm, but it is not directly suitable for further processing. In this stage, a constructive dataset is generated from the mined dataset, which comprises patterns of different lengths and their support.
To accomplish this, the patterns of length $l = 2$ are taken first. From the pattern set, the modified, constructive dataset is generated as given in Table 3.
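Since the layout of the constructive dataset is not reproduced here, the following is only a sketch of one plausible arrangement: the supports of the length-2 patterns are placed in a 4 x 4 matrix whose rows are indexed by the first nucleotide and whose columns are indexed by the second. This layout, the base ordering and the name `support_matrix` are assumptions made for illustration, not the paper's definition.

```python
BASES = "ACGT"  # assumed row/column ordering


def support_matrix(sequence):
    """Arrange length-2 pattern supports as a 4 x 4 matrix.

    Entry [i][j] counts occurrences of BASES[i] followed by BASES[j].
    Hypothetical layout; the paper's Table 3 may be organized differently.
    """
    index = {base: k for k, base in enumerate(BASES)}
    matrix = [[0] * len(BASES) for _ in BASES]
    for first, second in zip(sequence, sequence[1:]):
        if first in index and second in index:
            matrix[index[first]][index[second]] += 1
    return matrix


matrix = support_matrix("CGTCGTGGAA")
```

A matrix-valued representation of this kind is convenient because MPCA operates on matrices (second-order tensors) rather than on flattened vectors.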

B. MPCA-based Dimensionality reduction
MPCA is a multilinear algorithm that performs dimensionality reduction in all tensor modes by seeking, in each mode, those bases that allow the projected tensors to capture most of the variation present in the original tensors [35]. Initially, in the process of dimensionality reduction, the distance matrix for every $z$-th matrix is determined using Eq. (3), with the mean matrix for $G^{(z)}$ determined using Eq. (4). The modified eigenvectors $E'$ are obtained by transposing the arranged eigenvector.
The cumulative distribution of the sorted eigenvalues is determined using Eq. (6). From the sorted eigenvalues and their cumulative distribution, the dimensionality-reduced dataset is obtained using the MPCA projections [35], where $N_R$ is the resulting reduced dimension.
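The general shape of this procedure (centering by the mean matrix, a per-mode eigen-decomposition, sorting the eigenvalues in descending order, and truncating by their cumulative distribution) can be sketched with NumPy as follows. This is a simplified, single-pass variant that omits the iterative refinement of full MPCA [35]; the threshold `keep`, the function names and the use of mode-wise scatter matrices are assumptions, not the paper's Eq. (3) to Eq. (6).

```python
import numpy as np


def mpca_fit(X, keep=0.97):
    """Fit per-mode projections for samples X of shape (Z, I1, I2).

    Simplified sketch: one pass, no alternating optimization.
    `keep` is an assumed cumulative-eigenvalue threshold.
    """
    mean = X.mean(axis=0)
    D = X - mean  # centered samples
    projections = []
    for mode in (1, 2):
        # Unfold each centered sample along this mode: shape (Z, I_mode, rest)
        unfolded = np.moveaxis(D, mode, 1).reshape(D.shape[0], D.shape[mode], -1)
        # Mode-wise scatter matrix, summed over all samples
        scatter = np.einsum("zik,zjk->ij", unfolded, unfolded)
        eigvals, eigvecs = np.linalg.eigh(scatter)  # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]           # sort in descending order
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        cdf = np.cumsum(eigvals) / eigvals.sum()    # cumulative distribution
        n_keep = min(int(np.searchsorted(cdf, keep)) + 1, len(eigvals))
        projections.append(eigvecs[:, :n_keep].T)   # rows = kept eigenvectors
    return mean, projections


def mpca_transform(X, mean, projections):
    """Project centered samples onto the kept eigenvectors of both modes."""
    U1, U2 = projections
    return np.einsum("ai,zij,bj->zab", U1, X - mean, U2)
```

Under these assumptions, flattening each projected sample yields a feature vector whose length plays the role of $N_R$.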

C. Classification using ANN
For $N_G$ gene sequences, MPCA provides the dimensionality-reduced gene patterns and their support. Using an ANN, the class of the original sequence can be identified from this dataset. The classification involves two classical operations, training and testing. The neural network is trained using the $N_G$ pattern dataset. Here, the process is performed using a multilayer feed forward neural network, depicted in Fig. 3. The network has $N_R$ input nodes, $N_H$ hidden nodes and one output node.
Before performing any task, the ANN must be trained. Once trained, the ANN identifies the species by finding the class of the gene sequence. The training phase and classification phase of the ANN are described below.

1) Training Phase
The Back Propagation (BP) algorithm is used to train the constructed feed forward network. The step-by-step training procedure is given below.

1. Assign arbitrary weights, generated within a given interval, to the links between the input layer and hidden layer as well as between the hidden layer and output layer.
2. Using Eq. (8), (9) and (10), determine the output of the input layer, hidden layer and output layer, respectively, by inputting the constructive dataset to the network. Here, Eq. (8) is the basis function for the input layer, and Eq. (9) and (10) are the activation functions for the hidden and output layers, respectively.
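As an illustration of BP training for a network with one hidden layer and a single output node, a minimal NumPy sketch follows. Several details are assumptions rather than the paper's choices: sigmoid activations in both layers, bias terms, initial weights drawn uniformly from $[-0.5, 0.5]$, batch updates, and the hypothetical parameter names `n_hidden`, `eta` and `epochs`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def add_bias(A):
    """Append a constant column of ones so each layer has a bias weight."""
    return np.hstack([A, np.ones((A.shape[0], 1))])


def train_bp(X, targets, n_hidden=5, eta=0.5, epochs=5000, seed=0):
    """Train a one-hidden-layer network with back propagation.

    Assumptions: sigmoid activations, weights initialized in [-0.5, 0.5],
    batch gradient updates with learning rate eta.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, size=(X.shape[1] + 1, n_hidden))
    W2 = rng.uniform(-0.5, 0.5, size=(n_hidden + 1, 1))
    Xb = add_bias(X)
    t = targets.reshape(-1, 1)
    for _ in range(epochs):
        hidden = sigmoid(Xb @ W1)        # hidden layer output
        Hb = add_bias(hidden)
        out = sigmoid(Hb @ W2)           # network output
        error = t - out                  # BP error: target minus output
        delta_out = error * out * (1.0 - out)
        delta_hidden = (delta_out @ W2[:-1].T) * hidden * (1.0 - hidden)
        W2 += eta * Hb.T @ delta_out     # weight change scaled by learning rate
        W1 += eta * Xb.T @ delta_hidden
    return W1, W2


def predict(X, W1, W2):
    """Forward pass; outputs above 0.5 can be read as the second class."""
    hidden = sigmoid(add_bias(X) @ W1)
    return sigmoid(add_bias(hidden) @ W2).ravel()
```

With two species, the single output can be thresholded at 0.5 to select a class; this decision rule is likewise an assumption.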

2) Classification Phase
In the classification phase, the network finds the class of a given (test) gene sequence and determines the species to which it belongs. The same processes performed on the training sequence are repeated for the test sequence. Using the mined patterns and their support, the constructive nucleotide dataset is generated. After dimensionality reduction, the generated dataset is presented to the trained neural network, which decides the class of the species to which the gene sequence belongs.

IV. IMPLEMENTATION RESULTS
The proposed technique is implemented in MATLAB (version 7.10) and evaluated using the DNA sequences of two different organisms, Brucella suis and Caenorhabditis elegans (C. elegans). The evaluation is performed using a 10-fold cross validation test. The mined nucleotide patterns and their corresponding support are given in Table III. In Fig. 5, patterns of different lengths and their support are depicted, and the constructive dataset generated from the pattern set is given in Table IV. The pattern data and constructive dataset given in Tables III and IV are generated from one of the ten folds of the gene sequence of Brucella suis. In the same way, the pattern data have been mined and constructive datasets generated from all ten folds of the gene sequences of both Brucella suis and C. elegans. The generated ten folds of data are used to train the neural network. The results obtained from network training are given in Fig. 5. Once training is complete, the technique is validated using the test sequence. The results obtained from 10-fold cross validation are given in Table VI. The results show that when a gene sequence is given to the proposed technique, it identifies the corresponding species. Here, the technique is evaluated with the DNA sequences of only two organisms, although it is developed in such a way that it can be applied to any kind of DNA sequence. The test results indicate that the performance of the technique reaches a satisfactory level.
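The fold-based evaluation can be sketched in Python as follows. The sketch assumes that each organism's sequence is cut into ten contiguous, near-equal parts and that, in each round, the fold with the same index from every organism is held out for testing; the paper does not state these details, and the function names and labels are hypothetical.

```python
def split_into_folds(sequence, k=10):
    """Split a sequence into k contiguous, near-equal parts."""
    base, extra = divmod(len(sequence), k)
    folds, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        folds.append(sequence[start:end])
        start = end
    return folds


def cross_validation_rounds(folds_by_species):
    """Yield (train, test) lists of (fold, label) pairs, one round per fold index.

    Assumption: every species has the same number of folds.
    """
    k = len(next(iter(folds_by_species.values())))
    for held_out in range(k):
        train, test = [], []
        for label, folds in folds_by_species.items():
            for i, fold in enumerate(folds):
                (test if i == held_out else train).append((fold, label))
        yield train, test


# Placeholder sequences for illustration only (not real genome data)
example = {
    "species_a": split_into_folds("ACGT" * 25),
    "species_b": split_into_folds("TTGCA" * 20),
}
rounds = list(cross_validation_rounds(example))
```

Each round would then run pattern mining, dimensionality reduction and network training on the training folds, and classify the held-out folds.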

V. CONCLUSION
In this paper, we have proposed a species identification technique that integrates data mining with artificial intelligence. Initially, the nucleotide patterns are mined. The resulting dataset is subjected to MPCA-based dimensionality reduction and then classified using a well-trained neural network. The implementation results show that the proposed technique identifies the organism, and hence the species, from its gene sequence. Moreover, the results obtained from 10-fold cross validation indicate that the organism can be identified even from a part of the DNA sequence.
Although the technique has been tested with the DNA sequences of only two organisms, the 10-fold cross validation results reached a remarkable performance level. From these results, it can be hypothesized that a technique which identifies an organism from only a part of its gene sequence has the potential to classify any kind of organism, and hence its species.

Figure 1. Block diagram of the pattern mining algorithm

1) Pattern generation
In pattern generation, different possible combinations of nucleotide bases are generated. As a reference, a base set $B$ of the four nucleotides is generated, with cardinality $|B| = 4$.

Figure 2. Pseudo code to determine support for every mined pattern

Then, for modes 1 and 2, the corresponding eigenvectors $E_1$ and $E_2$ and the corresponding eigenvalues are determined from the obtained distance matrix. A projection matrix is determined using the generalized form of calculation given in Eq. (5), and the eigenvectors and eigenvalues are determined by subjecting the projection matrix to a generalized eigenvector problem. The rows of the eigenvector are arranged based on the index of the eigenvalues sorted in descending order [35]. The process followed for the projection matrix is repeated for the tensor matrices.

Figure 4. The multilayer feed forward neural network used in the proposed technique

Next, the BP error is determined, where $S_T$ is the target output. By adjusting the weights of all the neurons based on the determined BP error, new weights are obtained using Eq. (12). The weight change $\Delta W$ depends on the learning rate of the network and on the network output $S_p$ obtained for the $p$-th gene sequence. The process is repeated from step 2 until the BP error is minimized. In practical cases, a termination criterion is applied.

Figure 6. Performance of training and test results from ANN: (a) Network performance, (b) Training evaluation and (c) Regression analysis.

TABLE IV. CONSTRUCTIVE DATASET GENERATED FROM THE MINED NUCLEOTIDE PATTERNS (A) $l = 2$ AND (B) $l = 3$ (A PART OF THE PATTERN IS GIVEN).

TABLE V. PERFORMANCE EVALUATION USING 10-FOLD CROSS VALIDATION RESULTS