Genetic Algorithm Based Approach for Obtaining Alignment of Multiple Sequences

This paper presents a genetic algorithm based solution for determining the alignment of multiple molecular sequences. Two datasets of DNA families, Canis_familiaris and a galaxy dataset, were considered for experimental work and analysis. Genetic parameters such as the crossover rate and mutation rate can be defined by the user. Experiments and observations were recorded with respect to variable parameters: a fixed population size versus a variable number of generations and vice versa, and variable crossover and mutation rates. A comparative evaluation in terms of fitness accuracy is also carried out against existing MSA tools such as MAFFT and Kalign. Experimental results show that the proposed solution offers better fitness accuracy.

Keywords: DNA Sequences; Alignment; Genetic Algorithm; Crossover; Mutation; Selection; Multiple Sequence Alignment


I. INTRODUCTION
Simultaneous alignment of several sequences is among the most important problems in computational molecular biology. Multiple sequence alignment (MSA) can be seen as a generalization of pairwise sequence alignment: instead of aligning two sequences, n sequences are aligned simultaneously, where n > 2. Multiple sequence alignment can discover biologically significant sequence patterns that may be widely dispersed or hidden in molecular sequence databases. MSA gives insight into the basis of the similarities between homologous sequences [1]. An example of an alignment of four hypothetical DNA sequences is shown in Figure 1.

Figure 1: An example of an alignment
The basic idea is that the sequences are aligned on top of each other so that a coordinate system is set up, where each row is the sequence for one protein and each column is the 'same' position in each sequence. Each column corresponds to a specific residue in the 'prototypical' protein.
Multiple Sequence Alignment (MSA) is considered an important tool for computational biologists. It finds application in phylogenetic analysis, identification of conserved motifs and domains, and structure prediction [3]. MSA is a computationally difficult problem, known to be NP-hard [2]. Considering both the importance and the complexity of the MSA problem, researchers have proposed many different heuristic methods that provide approximate solutions.
Genetic Algorithms (GAs) have shown considerable potential as a computational means of solving the MSA problem. A GA can search through the solution space effectively and generate good alignment results. The main advantage of genetic algorithms over other optimization methods is that there is no need to provide a particular algorithm to solve a given problem; a GA only needs a fitness function to evaluate the quality of different solutions. Also, since it is an implicitly parallel technique, it can be implemented very effectively on powerful parallel computers to solve exceptionally demanding large-scale problems.
The method works by breaking a series of possible MSAs into fragments and repeatedly rearranging those fragments, introducing gaps at varying positions. This paper explores the possibility of applying a GA based solution to the MSA problem, and presents one such proposed and developed solution.

II. RELATED STUDY
Genetic algorithms are one of the useful tools for determining the alignment of multiple sequences. Iterative methods may be implemented through an evolutionary approach that uses computational heuristics based on natural biological phenomena, such as selection, crossover and mutation, to evolve a population of candidate solutions according to an objective function. They work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA [3].
Several iterative methods have been proposed to improve solutions to the MSA problem. For example, the evolutionary approach SAGA [5], based on a genetic algorithm, has been successfully applied to the MSA problem. It is used to optimize two different objective functions and shows that genetic algorithms can search a large solution space efficiently. However, its repeated use of the fitness function may increase its time complexity.
Zhang C et al. [7] proposed an algorithm based on a genetic algorithm and dynamic programming. It was used with two different distance matrices and is characterized by great complexity in processing time. It also has some limitations in performing crossover and mutation operations.
One of the most appropriate GA approaches to the MSA problem was presented by Nguyen et al. [8]; however, it still has some limitations with respect to its scoring scheme.
Another useful algorithm for multiple DNA sequence alignment, using genetic algorithms and divide-and-conquer techniques, was proposed in [9]; it selects optimal cut points of multiple DNA sequences. According to the authors, the experimental results were quite significant. The approach cuts the sequences in order to decrease the space complexity of sequence alignment. However, alignment was possible only for multiple deoxyribonucleic acid sequences, not for protein or other nucleic acid sequences.
Other new genetic algorithms [10] were used for solving the MSA problem; various datasets were tested and the experimental results were compared with other methods. The comparison showed that this approach obtains good performance on data sets with high similarity and long sequences.
After that, the GARS approach [11], based on a Genetic Algorithm with Reverse Selection, was proposed. However, it suffers from premature convergence, in which the search settles on a local optimum. Furthermore, a new approach, AlineaGA [12], was proposed, which uses a genetic algorithm with local search optimization embedded in its mutation operators for performing multiple sequence alignment. Its results indicate that the mutation probability leads to better solutions in fewer generations and that the mutation operators have a dramatic effect in this particular domain. Recently, a new Cyclic Genetic Approach (CGA) [13] was developed with complete knowledge of the problem and its parameters. In CGA, the values of the various parameters are decided based on the problem and the fitness value obtained in each generation. However, the column score varies between executions, so it may not give a relatively better alignment.
In this paper, we propose an evolutionary approach using genetic algorithms to obtain alignments of multiple sequences. Experimental results show that the proposed solution offers better fitness accuracy than some existing tools.

Methodology
The remainder of this paper is organized as follows. Section III presents the genetic algorithm based approach (GAMS) for solving the problem of aligning multiple sequences. Section IV shows the experimental results on the various datasets used to test the performance of our method. Section V gives the discussion and conclusion.

III. GENETIC ALGORITHM BASED APPROACH
In this section we present our algorithm for solving the MSA problem. The genetic algorithm based approach (GAMS) is applied with a new selection and crossover scheme, which helps to generate the best population on a local schema so that a better alignment can be discovered. The process flow is depicted in Figure 2.

A. Chromosome Representation
The chromosome should in some way contain information about the solution it represents. The most common encoding is a binary string, in which each chromosome has one binary string. Each bit in this string can represent some characteristic of the solution, or the whole string can represent a number. There are many other ways of encoding, and the choice depends mainly on the problem being solved. For example, one can encode integers or real numbers directly; sometimes it is useful to encode permutations, and so on.

Each sequence has its own length. The number of gaps to be inserted in each sequence is calculated so that all sequences end up with the same length. We therefore generate the maximum alignment length by multiplying the length of the longest sequence by the scaling factor rsp = 1.2. Let us say we have a set of sequences S = {S1, S2, S3, …, Sn}. The maximum column length is found by multiplying rsp by the maximum sequence length in S. The value of the scaling factor rsp makes the alignment 20% longer than the longest sequence, based on the observation that solutions to common MSA problems rarely contain more than 20% gaps. The flow of chromosome representation is shown in Figure 3.
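The length rule above can be illustrated with a minimal sketch. It assumes a chromosome is stored as a list of gapped strings with "-" as the gap symbol, and that gaps are placed uniformly at random; the function names `alignment_length` and `random_chromosome` are hypothetical and are not taken from the paper.

```python
import math
import random

GAP = "-"   # assumed gap symbol
RSP = 1.2   # scaling factor stated in the text


def alignment_length(sequences, rsp=RSP):
    """Common row length: ceil(rsp * length of the longest sequence)."""
    return math.ceil(rsp * max(len(s) for s in sequences))


def random_chromosome(sequences, rng, rsp=RSP):
    """Pad every sequence with randomly placed gaps up to the common length.

    Random gap placement is an assumption; the text only fixes the length.
    """
    width = alignment_length(sequences, rsp)
    rows = []
    for seq in sequences:
        # Choose which columns hold residues; all remaining columns are gaps.
        residue_columns = sorted(rng.sample(range(width), len(seq)))
        row = [GAP] * width
        for column, residue in zip(residue_columns, seq):
            row[column] = residue
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    for row in random_chromosome(["ACGTACGTAC", "ACGTAC", "ACGGT"], random.Random(0)):
        print(row)
```

Every row keeps its residues in their original order, so removing the gaps from a row recovers the input sequence.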

B. Evaluation of Fitness Function
To evaluate their fitness, the chromosomes must be converted to alignment form so that the sum-of-pairs function [3] can be applied. We score each column by looking at the matches, mismatches and gaps in each pair of sequences. We assume that a match = 1, a mismatch = 0, and a gap = -1. The fitness score of each alignment is calculated by summing the individual scores of all columns in the alignment matrix. A scoring matrix is needed to determine the cost of aligning one residue with another. A gap penalty value must also be set to determine the cost of aligning a residue with a gap; this penalty is applied only when a residue is aligned with a gap. The fitness value calculation is represented in Figure 4.
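As an illustration, the column-wise sum-of-pairs score with the stated values (match = 1, mismatch = 0, gap = -1) could be sketched as follows. The alignment is assumed to be a list of equal-length gapped strings, and a gap aligned with a gap is assumed to score 0, since the text applies the penalty only to a residue aligned with a gap. The function names are hypothetical.

```python
from itertools import combinations

GAP = "-"
MATCH, MISMATCH, GAP_PENALTY = 1, 0, -1  # values stated in the text


def pair_score(a, b):
    """Score one pair of symbols taken from the same column."""
    if a == GAP and b == GAP:
        return 0  # assumption: gap-gap pairs are neither rewarded nor penalised
    if a == GAP or b == GAP:
        return GAP_PENALTY
    return MATCH if a == b else MISMATCH


def column_score(column):
    """Sum the pair scores over every pair of rows in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def fitness(rows):
    """Sum-of-pairs fitness: the sum of all column scores."""
    return sum(column_score(column) for column in zip(*rows))


if __name__ == "__main__":
    print(fitness(["AC-T", "ACGT", "A-GT"]))
```

For the three rows in the usage example, the first and last columns each contribute three matches, while the two middle columns each contribute one match and two gap penalties.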

C. Selection Procedure
After the fitness score of every individual in the population has been calculated, a larger tournament method is applied: n individuals are chosen at random, and the two with the highest and second highest fitness values are selected. The fitter individuals are chosen by the following procedure:

• Apply the larger tournament strategy to the current population based on the fitness function.

• From the randomly chosen chromosomes, select the two individuals with the highest fitness values, based on their column scores.
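The two steps above can be sketched as a single function. This is one possible reading of the procedure, in which the tournament draws contestants without replacement and returns the two fittest; the function name and argument names are hypothetical.

```python
import random


def tournament_select(population, fitness_values, tournament_size, rng):
    """Draw `tournament_size` individuals at random and return the two fittest."""
    if not 2 <= tournament_size <= len(population):
        raise ValueError("tournament_size must be between 2 and the population size")
    # Sample indices without replacement so the two winners are distinct.
    contestants = rng.sample(range(len(population)), tournament_size)
    contestants.sort(key=lambda i: fitness_values[i], reverse=True)
    best, second = contestants[:2]
    return population[best], population[second]


if __name__ == "__main__":
    parents = tournament_select(["a", "b", "c", "d"], [1, 5, 3, 4], 3, random.Random(1))
    print(parents)
```

A larger tournament size increases selection pressure, because the fittest individuals in the population are more likely to be among the contestants.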

D. Crossover
In single point crossover, the operator selects segments from the parent chromosomes and creates new offspring. The simplest way to do this is to choose a crossover point at random, copy everything before this point from the first parent, and copy everything after it from the second parent. We use a crossover rate of 0.5: we count all the gaps in each chromosome, multiply the count by the crossover rate, and take the ceiling of the product as the crossover point. After the point is selected, the chromosome of the first parent is copied up to the crossover point and the remainder is copied from the second parent, and vice versa [13]. Two offspring are generated by the crossover function. The fitness score of the current population is then calculated and the best individual is selected for the mutation operation. The flow of one point crossover is shown in Figure 5.
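A minimal sketch of this operator is given below, assuming each parent is a single gapped string of the same length and that the crossover point is computed from the gap count of the first parent. Both the representation and the function names are assumptions made for illustration. Note that a plain tail swap does not by itself guarantee that a child still spells the original sequence once its gaps are removed; any repair step is outside this sketch.

```python
import math

GAP = "-"


def crossover_point(chromosome, rate=0.5):
    """Crossover point as described in the text: ceil(rate * number of gaps)."""
    return math.ceil(rate * chromosome.count(GAP))


def single_point_crossover(parent1, parent2, point):
    """Swap the tails of two equal-length parents at `point`."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2


if __name__ == "__main__":
    p1, p2 = "AC--G-T-", "A-C-GT--"
    point = crossover_point(p1)
    print(point, single_point_crossover(p1, p2, point))
```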

E. Mutation
After crossover is performed, mutation takes place. Its purpose is to prevent all solutions in the population from falling into a local optimum of the problem. The system randomly chooses a gene of a chromosome from the mating pool and applies binary mutation, which randomly changes the new offspring. For binary encoding, we can switch a few randomly chosen bits from 1 to 0 or from 0 to 1 [9]. Here all gaps are represented by 0s and all base symbols by 1s, and mutation takes place separately in each sequence. A mutation rate of 0.2 [9] is initialized and the corresponding mutation point is selected. The mutation operator first converts the whole sequence into a bit string and then calculates the mutation point. After that, it picks a random residue from a randomly chosen row (sequence) in the alignment and checks whether one of its neighbors is a gap. If this is the case, the algorithm swaps the two symbols. The flow of the mutation operator is shown in Figure 6.
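The residue and gap swap at the end of this description can be sketched as follows. The sketch works directly on gapped strings rather than on the bit-string form, and it omits the mutation point calculation, whose formula is not given in the text; the function name is hypothetical.

```python
import random

GAP = "-"


def gap_swap_mutation(rows, rng):
    """Pick a random residue in a random row and swap it with a neighbouring gap.

    Returns a new list of rows; the input list is left unchanged.
    """
    rows = list(rows)
    r = rng.randrange(len(rows))
    row = list(rows[r])
    residue_positions = [i for i, symbol in enumerate(row) if symbol != GAP]
    if not residue_positions:
        return rows  # a row made only of gaps has nothing to move
    i = rng.choice(residue_positions)
    gap_neighbours = [j for j in (i - 1, i + 1)
                      if 0 <= j < len(row) and row[j] == GAP]
    if gap_neighbours:
        j = rng.choice(gap_neighbours)
        row[i], row[j] = row[j], row[i]
        rows[r] = "".join(row)
    return rows


if __name__ == "__main__":
    print(gap_swap_mutation(["AC-GT-", "A-CG-T"], random.Random(7)))
```

Because a residue only ever trades places with an adjacent gap, the row length and the order of residues are both preserved, so the mutated row still represents the same sequence.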

IV. IMPLEMENTATION AND RESULTS
The algorithm is implemented using Microsoft Visual Studio, and the machine used for this research is a personal computer with an Intel Pentium III processor. The main memory is 4 gigabytes, and Microsoft Windows XP was used as the platform for the implementation. The DNA query input sequences are taken from the Canis_familiaris family. The query input format of the DNA sequences is shown in Figure 7. During the course of the experiments, we tried various chromosome lengths in order to understand how they affect the performance of the GA.

In order to examine the validity of our algorithm, we ran a number of test series with DNA sequences. First, the GA was executed for 100 runs with the number of generations fixed and the population size varied. The running time increased accordingly, accompanied by a notable rise in fitness score of about 10% with each increase in population size. Then, with the population size fixed and the number of generations varied, the running time again increased accordingly, accompanied by a notable rise in fitness score of about 10%. Tables 2 and 3 and Figures 8 and 9 list the results of varying the population size and the number of generations.

In the proposed solution, we have also used two specific crossover and mutation operators. To determine the best crossover and mutation probabilities, we carried out three different experiments using ten randomly selected Canis_familiaris datasets obtained from [15]. For each of the ten datasets, the GA was executed for 200 runs, and the statistical outcome of the optimal fitness in each run was taken as the result. We measured the best fitness score and running time for each generation. The algorithm was run with three settings: 30% crossover and 10% mutation, 60% crossover and 30% mutation, and 80% crossover and 50% mutation. The algorithm obtained the best solutions with 80% crossover and 50% mutation. The solutions obtained with 60% crossover and 30% mutation on the same datasets are close to the best scores, whereas 30% crossover and 10% mutation did not achieve good quality solutions. We therefore conclude that the GA achieves the best overall performance on these test datasets when the crossover rate is 80% and the mutation rate is 50%. The results for these datasets are presented in Table 4 and the corresponding plots in Figure 10.

The last set of experiments compares our algorithm (GAMS) with two tools: MAFFT (a high speed multiple sequence alignment program) and Kalign (a fast and accurate multiple sequence alignment algorithm). The GA was run for a maximum of 200 generations, and the statistical outcome of the optimal fitness in each run was taken as the result. The sequence ID and specification of each dataset are given in Table 5. We found that our GA obtained the best fitness score and running time for each generation compared with the other tools. The GA typically found a good alignment within 200 generations. Table 6 and Figure 10 list the results of comparing our algorithm with MAFFT and Kalign.

V. CONCLUSION
Multiple sequence alignment is an extension of pairwise alignment that incorporates more than two sequences at a time. Our multiple alignment method tries to align all of the sequences in a given query set. An efficient fitness function and efficient crossover and mutation strategies are the outcome of this work. We expect that our method will contribute significantly to efficient solutions of multiple sequence alignment problems.

Figure 4: The flow for finding the best fitness score

Figure 6: Flow of space mutation

Table 1: GA parameters

Table 2: Operators applied to population sizes of 5, 10, 15 and 20 with a fixed number of generations, with calculated fitness score and running time

Figure 8: Fitness curve of the GA with varying population size

Table 3: Operators applied to 50, 100, 150 and 200 generations with a fixed population size, with calculated fitness score and running time

Figure 9: Fitness curve of the GA with varying number of generations

Table 4: Fitness scores with selected crossover and mutation rate options

Figure 10: Experimental results of the GA with selected crossover and mutation rate options

Table 6: Overall performance of all methods on the Sequence ID datasets

Figure 10: Overall performance of all methods on the Sequence ID datasets