Experimental Evaluation of Genetic Algorithms to Solve the DNA Assembly Optimization Problem



I. INTRODUCTION
The DNA fragment assembly (DNAFA) problem is the problem of reconstructing an original DNA sequence from a given set of DNA fragments. This is achieved by ordering and aligning the fragments such that the resulting DNA sequence is as short as possible. It is a complex combinatorial optimization problem belonging to the class of NP-hard problems, in which the right order of the DNA fragments must be found to assemble them. Several metaheuristic techniques have been developed to solve this problem [1], [2], [3]. This paper exploits the genetic algorithm platform (GAP) developed in the preliminary paper [4], which discussed the existence of a polynomial-time reduction of DNAFA to the Traveling Salesman Problem (TSP) and the Quadratic Assignment Problem (QAP). That paper then conceptually designed a GA platform for solving the DNAFA problem, inspired by the efficient GAs in the literature for solving the TSP and QAP. This platform gathers several GA operators designed to solve hard optimization problems such as TSP, QAP, and DNAFA, and it enables researchers to easily design an adequate GA variant for such problems. This work implements and experiments with GA variants judiciously built from the platform, aiming to identify the variant that deals most efficiently with the DNAFA problem. Using this platform, this work studies the individual effects of the GA components on two metrics: running time and overlap score. It examines the effects of population size, population generation method, selection type, and crossover type, and determines which component has the most impact on GA performance. Some of these GA components, such as the SCX crossover, have never been tested in the context of the DNAFA problem and are worth investigating experimentally.
Other components have been tested before, but retesting them produced different results, as with the greedy population initialization method. Because of the effectiveness of SCX for TSP and QAP, we expected SCX to be a smart crossover that outperforms the other crossovers. As a result of these comprehensive experiments, this work identifies the best-designed GA variant, which outperforms the existing GA algorithms in solving the DNAFA problem. This variant uses a population size of 200 individuals, the greedy method for initializing the population, tournament selection, and SCX crossover. It showed a significant increase in overlap score compared to what is reported in the literature. The results showed that SCX was the best of the studied crossovers. Furthermore, the greedy method improved the algorithm's performance by 37%, demonstrating that the population generation method has a greater impact on the results than the other GA components. The experimental results demonstrated the efficiency of the designed approach: for most data sets, its overlap score was 56.16% to 172.74% better than the previously recorded results. This work demonstrates experimentally that the best-designed GA variant outperforms existing GA algorithms in solving the DNAFA problem for some data sets.

A. The DNAFA Problem
The DNAFA problem is defined as follows: given a set of fragments $f_1, f_2, \ldots, f_n$ drawn from a finite alphabet $\Sigma = \{A, C, G, T\}$, the goal is to find the shortest superstring that contains all the input fragments. Such a superstring maximizes the total overlap between every pair of consecutive fragments and thus minimizes the length of the assembled sequence. This objective can be written as

$$\max \sum_{i=1}^{n-1} w[i, i+1] \qquad (1)$$

where $w[i, i+1]$ is the overlap between the fragments at positions $i$ and $i+1$ of the chosen order.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 14, No. 3, 2023, www.ijacsa.thesai.org

B. Assembly Process
To understand the assembly process, we define some key terms.
• Fragment: A short sequence of DNA bases, also called a read.
• Coverage: The number of fragments covering a specific position in the DNA.
• Prefix: A substring made of the first characters of a fragment.
• Suffix: A substring made of the last characters of a fragment.
• Overlap: The common sequence between the suffix of one fragment and the prefix of another.
• Layout: An arrangement of the collection of fragments based on their overlapping order.
• Scaffold: The overlapped contigs, which may contain gaps.
• Consensus: The reconstruction of the complete sequence.
In the assembly process, the input is a set of fragments. The traditional assembly approach works in the following order: overlap, layout, and consensus [5].
• Overlap stage: Finding the overlapping fragments and computing their similarity score (overlap score). This means finding the longest match between the suffix of one fragment and the prefix of another.
• Layout stage: Finding the order of the fragments based on the computed overlap scores.
• Consensus stage: Reconstructing the complete sequence from the layout.
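The three stages above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: overlaps are exact suffix-prefix matches (the experiments in this paper use Smith-Waterman scores instead), the layout is given rather than searched for, and the function names and toy reads are hypothetical.

```python
def exact_overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_matrix(fragments):
    """Overlap stage: w[i][j] is the suffix-prefix overlap of fragment i with fragment j."""
    n = len(fragments)
    return [[0 if i == j else exact_overlap(fragments[i], fragments[j])
             for j in range(n)] for i in range(n)]


def consensus(fragments, layout, w):
    """Consensus stage: merge fragments in layout order, skipping each overlapped prefix."""
    sequence = fragments[layout[0]]
    for prev, cur in zip(layout, layout[1:]):
        sequence += fragments[cur][w[prev][cur]:]
    return sequence


fragments = ["ACGTAC", "TACGGA", "GGATTC"]  # toy reads, not taken from the paper's data sets
w = overlap_matrix(fragments)
layout = [0, 1, 2]  # the layout stage is assumed to have produced this order
print(consensus(fragments, layout, w))
```

In this toy case each consecutive pair overlaps by three bases, so the 18 input bases collapse into a 12-base sequence; a larger total overlap yields a shorter assembled sequence.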
This paper is organized as follows: the second section discusses the related works, then the proposed design is discussed in detail in the third section. In the fourth section, the experiments and the method of conducting the investigations are detailed, and then the results are listed in the fifth section. The sixth section discusses these results by comparing them with previous works, and finally, the paper is concluded in the seventh section.

II. RELATED WORKS
This section presents the previous related works organized into two subsections: the first subsection summarizes the works solving the DNAFA with GA, and the second subsection introduces the works solving the TSP and QAP with GA.

A. Genetic Algorithm for DNAFA Problem
The basic genetic algorithm schema contains various concepts such as population encoding, population initialization, fitness function, selection, crossover, and mutation. Each concept has its own importance in the algorithm, and previous works show that each can be realized in different ways. For example, the population can be encoded through segmented permutation [6] or identity permutation [7]. Common strategies for generating the initial population are random generation, as in [8], and the greedy approach and the 2-opt heuristic, as proposed by Minetti et al. [8] and in [9].
The most commonly used fitness function maximizes the overlap score, where the Smith-Waterman algorithm is used to calculate the overlap between the fragments [7]. The Smith-Waterman algorithm is time-consuming, but it is the most precise algorithm for identifying regions of similarity between fragments. The overlap score is considered the best measure of solution quality and was used in most of the previous works [7], [8], [9].
The crossover operator is the main operator of GA, as it plays a crucial role in efficiently exploring the search space of the optimization problem. The parents' characteristics are mainly inherited by crossover operators.
GA can be combined with other metaheuristics to achieve good results. For this purpose, Minetti et al. [10] designed a hybrid method named SAX that combines the GA with a simulated annealing metaheuristic. Hughes et al. [7] combined different variations of GA in different ways. The authors in [5] applied multiple algorithms, such as simulated annealing and scatter search, together with the GA. More recently, Uzma and Halim [9] combined GA with Problem Aware Local Search (PALS).
The studies of Bucur [6], [11] focus on minimizing the total length of the scaffold (the sum of the lengths of the overlapped contigs). Unlike previous works, Bucur used simulated data sets in which the fragments were of uniform length; since the reference genome was available, the accuracy could be measured. However, as the authors mention, the main disadvantage of their method is its increased time complexity.

B. Genetic Algorithm for TSP and QAP
This section reviews GA algorithms designed to solve the TSP and QAP problems. Different types of encoding have been used for these problems. Most works in the literature use the identity permutation, such as [12], [13] for TSP and [14] for QAP. Other, more advanced types of population encoding have been used for TSP, such as value encoding [15] and real number encoding [16]. The common strategies for generating the initial population are random generation, as investigated for TSP in [16], and the greedy method, as in [17]. Recently, more advanced strategies have been developed: the author in [13] proposed Multi-Agent Reinforcement Learning (MARL) for solving TSP problems, and [14] implemented the sequential sampling method for solving QAP problems.
For the selection, the roulette wheel is the common selection operator used for optimization problems [15], [13], [18], [16]. Tournament selection was implemented for TSP [17], and stochastic remainder selection was used for QAP [19]. More recently, in [16], a greedy method was designed as a selection operator for TSP.
Several advanced crossover operators have been designed for solving TSP as well as QAP using GA algorithms. The Sequential Constructive Crossover (SCX) is an intelligent crossover designed by Ahmed [12] to solve the TSP. Recently, a modified version of sequential constructive crossover, named greedy SCX (GSCX), was proposed for solving TSP [20]. The reverse greedy sequential constructive crossover (RGSCX) and the comprehensive sequential constructive crossover (CSCX) are two new crossover operators that enhance SCX for solving TSP [21]. Other types of advanced crossover operators were designed in [18] to solve the QAP, relying on the idea of a frequency model. Three crossover operators were introduced for enhancing GA, namely, the Highest Frequency Crossover (HFX), the Greedy HFX (GHFX), and the Highest Frequency Minimum Cost Crossover (HFMCX).
Various types of mutations have been investigated for the TSP and QAP problems, including the exchange mutation [13], [16], [17], and the reciprocal exchange mutation [12]. More advanced mutation operators have been designed for the TSP and QAP problems, such as the interchange mutation in [15], and the inversion mutation in [17]. In [22], the adaptive and combined mutation operators were proposed for solving QAP.

C. Other Metaheuristic Algorithms for DNA Fragment Assembly
Particle swarm optimization (PSO) has been reported in the literature for the DNA fragment assembly problem. Verma and Kumar [2] used PSO with the smallest position value (SPV) rule. PSO can be enhanced when combined with other algorithms, as in Huang et al. [23], who proposed a hybrid particle swarm optimization algorithm (HPSO). The algorithm was divided into two parts: (1) tabu search (TS) combined with PSO to improve solution quality, and (2) simulated annealing (SA) combined with variable neighborhood local search (VNS). Additionally, a parallel approach can reduce the computation time: Mallén-Fullerton and Fernández-Anaya [2] presented a parallel heuristic based on PSO and differential evolution (DE). DE is similar to GA, but DE relies on the mutation operation, while GA relies on the crossover operation to assemble better solutions. Mallén-Fullerton and Fernández-Anaya used a TSP-based approach (the Lin-Kernighan algorithm [24]) with some modifications to apply it to DNA fragment assembly. Huang et al. [25] presented a memetic PSO algorithm with a VNS approach as well as TS and SA, using each of these algorithms in different ways and in different combinations. Indumathy and Maheswari [26] used a variant of the standard PSO called the constriction factor PSO (CPSO). Another metaheuristic proposed for the DNA fragment assembly problem is the problem aware local search (PALS) [27]. The main drawback of PALS is its quick convergence to local optima, but combining it with other algorithms can overcome this drawback. Minetti et al. [28] combined PALS with SA; the resulting method shows improved performance on the largest data sets when compared with SA and PALS separately. Bee colony algorithms have also been applied to DNA fragment assembly: Firoz et al. [29] presented the artificial bee colony (ABC) algorithm and the queen bee evolution based on the genetic algorithm (QEGA). Majid al-Rifaie [30] investigated a new algorithm, stochastic diffusion search (SDS), which follows a different strategy for calculating the overlaps: it picks a model from the given fragments and tries to find the same model in the rest of the fragments. Among the fragments containing the model, the one with the highest similarity is picked, assembled, and then removed from the search space.
The previous paper [4] showed that the DNAFA optimization problem is a special case of two well-known optimization problems: the traveling salesman problem and the quadratic assignment problem. Particularly, that paper theoretically demonstrated that all three optimization problems have a similar topological structure and that they need to explore a search space of solutions with the same complexity to find an optimal solution. For this reason, the GA platform designed to solve the DNAFA problem is inspired by the efficient GA approaches developed for the famous combinatorial optimization problems, TSP and QAP. The GA platform gathers several advanced GA operators and tools that have demonstrated their effectiveness in the context of TSP and QAP. Table I illustrates the GA parameters' settings from the literature for DNA, TSP, and QAP.

III. THE GA PLATFORM
The GA platform consists of the best and most advanced GA tools for the DNAFA problem (shown in Fig. 1.). One could build several variants of the GA to solve it by judiciously integrating the ingredients of this platform in different ways.

A. The GA Operations of the Designed Platform
This section describes the GA operations involved in the platform that was suggested earlier [4]. This paper studies all these operations, tests them experimentally, and combines different tools from the platform into different variants in order to find the best variant, which is then compared with other algorithms from the literature.

1) Encoding:
For the encoding, this work uses integer encoding, where each fragment is encoded as a number: fragment one is encoded as "1", fragment two as "2", and so on.

2) Initial Population:
The initial population involves the population size and the population generation method. For the population size, the work selects two sizes (200 and 500 individuals) and discusses how much time is saved with the smaller population and how accurate it is.
For the population generation method, the GA platform design includes the random, greedy, and 2-opt heuristic strategies, which have previously yielded good performance as shown in [7], [31], and [9]. Since the preliminary results showed that the greedy initialization method gave the best solutions, this paper displays its results. However, because the greedy method generates the population intelligently, further experiments investigate whether the greedy method alone can find the solution from the beginning, without relying on the rest of the GA operators.
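As an illustration of greedy population generation on the integer encoding, the sketch below builds each individual by starting from a randomly chosen fragment and repeatedly appending the unused fragment with the largest overlap with the last one. This nearest-neighbour construction is an assumption, since the text does not spell out the greedy procedure; fragments are indexed from 0 here, and the function names and the toy overlap matrix `w` are hypothetical.

```python
import random


def greedy_individual(w, start):
    """Build one permutation by always appending the unused fragment
    that has the largest overlap with the last fragment placed."""
    order = [start]
    unused = set(range(len(w))) - {start}
    while unused:
        last = order[-1]
        # Ties are broken in favour of the lowest fragment index.
        nxt = max(unused, key=lambda j: (w[last][j], -j))
        order.append(nxt)
        unused.remove(nxt)
    return order


def greedy_population(w, size, rng):
    """Generate `size` individuals, each grown greedily from a random start fragment."""
    return [greedy_individual(w, rng.randrange(len(w))) for _ in range(size)]


# Toy overlap matrix: w[i][j] is the overlap of the suffix of fragment i
# with the prefix of fragment j.
w = [[0, 3, 0],
     [0, 0, 3],
     [1, 0, 0]]
population = greedy_population(w, size=4, rng=random.Random(0))
print(population)
```

With this construction the number of distinct individuals is bounded by the number of fragments, which gives one plausible reason why a greedy start may leave little for the remaining GA operators to improve on small data sets.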

3) Fitness Functions:
As the fitness function is repeatedly applied to each individual of each generation, it should be relatively easy to compute and should also accurately evaluate the quality of each individual [12]. A simple fitness function aims to maximize the overlap score by summing the overlap of each pair of adjacent fragments, as expressed by (2), following [9]:

$$F = \sum_{i=1}^{n-1} w[i, i+1] \qquad (2)$$

where $w[i, i+1]$ is the overlap score between fragment $i$ and fragment $i+1$. $F$ is simple in complexity since it takes $O(n)$.
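A minimal Python sketch of this fitness function follows. It assumes the overlap scores have been precomputed into a matrix `w` (a hypothetical name), with `w[a][b]` the overlap of fragment `a` followed by fragment `b`, and that an individual is a list of 0-based fragment indices.

```python
def fitness(order, w):
    """Sum of the overlap scores of adjacent fragments in `order`; one pass, O(n)."""
    return sum(w[a][b] for a, b in zip(order, order[1:]))


# Toy overlap matrix, not taken from the paper's data sets.
w = [[0, 3, 0],
     [0, 0, 3],
     [1, 0, 0]]
print(fitness([0, 1, 2], w), fitness([2, 0, 1], w))
```

Because the matrix lookup is constant time, evaluating an individual costs one pass over its n - 1 adjacent pairs.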
To measure solution quality, this work uses the following formula, which is often used for the TSP and QAP problems:

Gap = ((ASV - BSV) / BSV) * 100

where ASV refers to the average solution value (average overlap) and BSV refers to the best-known solution value reported in the literature.
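The gap formula translates directly into code. The sketch below uses hypothetical values for ASV and BSV; with the formula as written and a maximization objective, a positive gap means the average overlap exceeds the best-known value.

```python
def gap_percent(asv, bsv):
    """Gap = ((ASV - BSV) / BSV) * 100, in percent."""
    if bsv == 0:
        raise ValueError("BSV must be non-zero")
    return (asv - bsv) / bsv * 100.0


# Hypothetical scores, for illustration only.
print(gap_percent(asv=110.0, bsv=100.0))
print(gap_percent(asv=90.0, bsv=100.0))
```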

4) Selection operators:
As roulette wheel selection is widely used and consumes the least amount of time, and tournament selection can maintain diversity by giving all individuals an equal chance to compete [32], both the roulette wheel and the tournament selections are included in the platform.

5) Crossover operators:
Several crossover operators, including SCX, OX, CX, PMX, and ERX, have been chosen for inclusion in the platform. Special attention is paid to SCX, as it is a smart crossover that was one of the best operators for the TSP and QAP problems and is expected to perform similarly in the DNAFA context. Moreover, SCX has never been used to solve the DNA assembly problem before; therefore, this paper presents the results related to this crossover.
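To make the idea behind SCX concrete, here is a hedged Python sketch of a sequential constructive crossover adapted to overlap maximization. In the TSP formulation, the candidate with the lower edge cost is kept; choosing the candidate with the higher overlap is the adaptation assumed here. The fallback to the lowest-numbered unused fragment follows the usual SCX description, and the function name, 0-based indexing, and toy matrix `w` are hypothetical.

```python
def scx(parent1, parent2, w):
    """Sequential constructive crossover, adapted to maximize overlap.

    From the current fragment, take the first unused fragment that follows it
    in each parent and keep the candidate with the larger overlap.
    """
    n = len(parent1)
    pos1 = {g: i for i, g in enumerate(parent1)}
    pos2 = {g: i for i, g in enumerate(parent2)}
    child = [parent1[0]]
    used = {parent1[0]}

    def next_unused(parent, pos, node):
        for g in parent[pos[node] + 1:]:
            if g not in used:
                return g
        # No unused fragment follows `node` in this parent:
        # fall back to the lowest-numbered unused fragment.
        return min(g for g in parent if g not in used)

    while len(child) < n:
        cur = child[-1]
        a = next_unused(parent1, pos1, cur)
        b = next_unused(parent2, pos2, cur)
        nxt = a if w[cur][a] >= w[cur][b] else b
        child.append(nxt)
        used.add(nxt)
    return child


# Toy overlap matrix, not taken from the paper's data sets.
w = [[0, 1, 5, 0],
     [0, 0, 0, 4],
     [0, 3, 0, 2],
     [0, 6, 0, 0]]
print(scx([0, 1, 2, 3], [0, 2, 3, 1], w))
```

Unlike position-based operators such as PMX or CX, each step consults the overlap matrix, which is why SCX is described as a smart crossover.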

6) Mutation operator:
The swap mutation operator and its variants have been widely used for DNAFA, TSP, and QAP [9], [17]. Combined and adaptive mutations were designed for the QAP problem [22]. These three mutation types are included in the platform.

7) Stopping condition:
The GA platform stops if the solution is not improved at all during a certain number of iterations or if a time limit is reached.
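The swap mutation and the stopping condition can be sketched as follows. The class name, the parameter names `max_stall` and `time_limit_s`, and the values in the usage lines are assumptions made for illustration; the paper's actual settings are those listed in Table III.

```python
import random
import time


def swap_mutation(order, rng):
    """Return a copy of `order` with two distinct positions exchanged."""
    child = list(order)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


class StoppingCondition:
    """Stop after `max_stall` generations without improvement or after `time_limit_s` seconds."""

    def __init__(self, max_stall, time_limit_s):
        self.max_stall = max_stall
        self.time_limit_s = time_limit_s
        self.best = float("-inf")
        self.stall = 0
        self.start = time.monotonic()

    def should_stop(self, generation_best):
        if generation_best > self.best:
            self.best = generation_best
            self.stall = 0
        else:
            self.stall += 1
        out_of_time = time.monotonic() - self.start >= self.time_limit_s
        return self.stall >= self.max_stall or out_of_time


rng = random.Random(1)
print(swap_mutation([0, 1, 2, 3, 4], rng))
stop = StoppingCondition(max_stall=2, time_limit_s=3600.0)
print([stop.should_stop(score) for score in (5, 5, 5)])
```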

IV. EXPERIMENTS
This section describes the data sets, the experimental setting, and the parameter values used in this study.

A. Data Sets
The GA platform is assessed on data sets produced by next-generation sequencing and obtained from the National Center for Biotechnology Information (NCBI). These data sets are the same benchmarks used in the previous works mentioned in the related works section. This work used 17 data sets with a varying number of fragments, from 25 to 352. The mean fragment length varies between 286 and 512 bp. The data sets are described in Table II.

B. Experimental Setting
The designed algorithm has been implemented in C++ on a Windows 10 computer with a 2 GHz CPU and 16 GB of RAM. The work maintained the parameter values used in TSP and QAP that led to the best results. Based on Table I, this work chose the values for the GA operators, which are described in detail in Table III. For each data set, 60 experiments were carried out using different GA parameters and operators: two population sizes, three types of initialization, two types of selection, and five types of crossover (2 * 3 * 2 * 5 = 60). Since the GA is a stochastic process, each experiment was run 30 times to ensure the stability of the results (30 * 60 = 1,800 runs per data set). With 17 data sets, the total number of runs exceeded 30 thousand (17 * 1,800 = 30,600).
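The experiment counts can be checked by enumerating the configuration grid. The sketch below is in Python rather than the C++ of the actual implementation, and the option labels are shorthand introduced here for the components named in the text.

```python
from itertools import product

POPULATION_SIZES = (200, 500)
INITIALIZATIONS = ("random", "greedy", "2-opt")
SELECTIONS = ("roulette", "tournament")
CROSSOVERS = ("SCX", "OX", "CX", "PMX", "ERX")
RUNS_PER_CONFIGURATION = 30
DATA_SETS = 17

# Every combination of population size, initialization, selection, and crossover.
configurations = list(product(POPULATION_SIZES, INITIALIZATIONS, SELECTIONS, CROSSOVERS))
runs_per_data_set = len(configurations) * RUNS_PER_CONFIGURATION
total_runs = runs_per_data_set * DATA_SETS
print(len(configurations), runs_per_data_set, total_runs)
```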

C. Evaluation Metrics
The performance of the GA is measured in terms of the following:
• Overlap score: should be high. The overlap score is measured by calculating the length of the overlap between each fragment and all the other fragments. The overlap scores were computed using the Smith-Waterman algorithm. Two forms of the overlap score are reported in the results: the best overlap score out of 30 runs and the average overlap score over 30 runs.
• Computational cost (running time): should be minimized. The time for the complete assembly process is divided into two stages: the time for calculating the overlap scores in the preprocessing stage and the time for the GA to find the best solution. This paper reports only the GA time, because this work studies changes in GA performance and because the Smith-Waterman (SW) time is constant for each data set regardless of the GA variant studied.
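For reference, a minimal Python sketch of the Smith-Waterman local alignment score is given below. The scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration; the text does not state the scoring scheme used in the experiments.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between `a` and `b` with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("GGACGT", "ACGTCC"))
```

The dynamic program fills a table of size len(a) x len(b) for every fragment pair, which is consistent with the overlap computation being treated as a separate preprocessing cost that does not depend on the GA variant.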

D. Aspects of Investigations
This work studies the effect of some algorithm components on two metrics: the GA running time and the overlap score. It discusses the effect of population generation (both the population size and the population generation method), the effect of the crossover types, and the effect of the selection types. Based on the comprehensive experiments, the most interesting aspects of investigation are the effects of:
• Crossover types on overlap score when varying the population generation method.
• Population size on overlap score and GA running time.
• Selection types on overlap score and GA running time when varying the population generation method.

V. RESULTS
This section presents and discusses the results obtained from the experiments conducted in this study. The results in this section are organized in subsections. Each subsection reports the results of a specific investigation, as mentioned earlier. In each subsection, the results will be illustrated with tables or pictures and discussed, in addition to summarizing the findings at the end of each subsection.

A. The Effect of Crossover Type on Overlap Score when Varying the Population Generation Method
This section studies the effect of the crossover types on the overlap score when varying the population generation methods while the other GA operators remain constant. For simplicity, the type of selection operation is the tournament, and the population size is 200.
Each of the figures (Fig. 2, Fig. 3, and Fig. 4) represents the effect of a specific population generation method on the best overlap score. From these figures, it can be seen that greedy initialization gives the highest overlap score for all the data sets. The 2-opt heuristic and the random method gave different results from each other, but both clearly showed that SCX is the best crossover in the majority of cases. Overall, SCX has the best accuracy across the population generation methods and is the least sensitive to the type of initialization: it gives good results in every case. The greedy approach also improves the performance of all crossover operators, and only with greedy initialization is it unclear whether SCX is still the best. This reveals the impact of the population generation method: creating the population in a smart way from the beginning is a strong factor in improving the results and in reducing the differences between the crossover types. In more detail, compared to the random and 2-opt heuristics, the greedy method improved the SCX overlap results by 11.7%, the OX and CX results by 25.9% and 31.01%, respectively, and the PMX and ERX results by 22.01% and 37.7%, respectively. This demonstrates that SCX was the crossover least sensitive to the population generation method and that it maintained the highest overlap score. Further investigation therefore checks the time cost of the greedy method with small and large population sizes.
The results of this study are summarized as follows:
• SCX is not sensitive to the type of initialization: it gives good results in every case.
• OX, CX, PMX, and ERX are sensitive to the population generation method: they perform better with greedy initialization than with the random and 2-opt heuristics, because the greedy method produces a good initial population.
• The population generation method has a strong impact on improving the results.

B. The Effect of the Population Size on Overlap Score and GA Running Time
This subsection investigates the effect of population size on the overlap score, including the best overlap score and the average overlap score, as well as its effect on the GA's running time.
1) The effect of population size on overlap score: Recall that this paper reports results only for SCX. Table IV shows the effect when the initialization method is greedy, the crossover type is SCX, and the selection type is tournament. The "gap" column represents the gap in overlap, calculated with the formula in Section III, and the "absolute difference" column represents the plain difference between the results for 500 and 200 individuals.
In this investigation, with SCX crossover, greedy generation, and tournament selection, the results show that when the initial population is 60% smaller (200 instead of 500 individuals), the GA still gives a high overlap score in all data sets, with an average difference of 0.14%. This is important since decreasing the initial population size decreases the computation significantly.
Moreover, for the f* data sets, this work shows a significant increase in accuracy compared to what is reported in the literature: the increase reaches 172.57% with the 200-individual population and 171.9% with the 500-individual population. In addition, the difference between the best overlap score and the average overlap score was not significant: the best overlap score is only 0.44% and 0.42% better than the average overlap score for populations of 200 and 500, respectively. Thus, only the best overlap score is reported in the rest of the paper. With regard to the gap, the table shows that our results are better in 9 of the 17 data sets. A negative absolute difference means that the population of 200 is better than the population of 500. This was the case for eight data sets, and the two sizes performed equally in four data sets, which makes the smaller size more suitable.
The results of this study are summarized as follows:
• Increasing the population size increases the computational time; however, it may give a chance of better results for the large data sets.
• The population sizes of 200 and 500 individuals show no noticeable difference in solution quality; therefore, the smaller size is preferable.
2) The effect of population size on GA running time: This section studies the effect of population size on the GA running time. The following tables reveal this effect when the type of crossover is SCX, the type of initialization is greedy, and the selection type is tournament. In addition, Table VI compares the use of the complex SCX crossover with a small population size to the use of the simpler PMX, ERX, OX, and CX crossovers with a large population size. The values 200 and 500 refer to the small and large population sizes, respectively. The table columns are defined as follows:
• The "GA time" column represents the time for the whole GA to find the result.
• The "greedy time" column represents the time for initializing the population with the greedy method.
• The "overlap after greedy" column represents the overlap score after creating the population.
• The "overlap after GA" column represents the overlap score when the algorithm is done.
• The "Increase in overlap" column represents the percentage increase from the overlap score after greedy to the overlap score at the end of the GA.
• The "Absolute difference in time" column shows the difference in GA time between 500 and 200 individuals.
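The two derived columns can be computed as in the following sketch. The numbers are hypothetical and serve only to show the arithmetic; the sign convention (value for 500 minus value for 200) is inferred from the statement that a negative difference favours the population of 200.

```python
def percent_increase(before, after):
    """Percentage increase from `before` to `after`."""
    return (after - before) / before * 100.0


def absolute_difference(value_500, value_200):
    """Difference between the results for 500 and 200 individuals."""
    return value_500 - value_200


# Hypothetical values, for illustration only.
overlap_after_greedy, overlap_after_ga = 1000.0, 1035.0
ga_time_200, ga_time_500 = 7.5, 12.0
print(percent_increase(overlap_after_greedy, overlap_after_ga))
print(absolute_difference(ga_time_500, ga_time_200))
```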
As indicated earlier, the question posed for discussion is whether the greedy method might override the algorithm's performance: does the solution come from the greedy method or does the GA find it, and is the time spent creating the population or finding the solution? To answer this, Table V reports the overlap score after creating the population with the greedy method and the overlap score when the algorithm is done, together with the time taken to create the population and the time taken by the algorithm to find the solution. The least time and the best solution for every data set are marked in bold. Table V shows that a small population size takes less time in the majority of the data sets and gives the algorithm more opportunity to improve the solution. The larger population size, in contrast, takes more time to generate but less time to find the solution. For small data sets, most of the time is spent creating the population, while the time taken to find the solution is very small. These results show that when a larger population size is used with the greedy method, the improvement in the solution is small and may be nonexistent for small data sets such as f25_305 and f25_400. Small population sizes are better suited to the greedy method because they allow the algorithm to improve the solution while also taking less time. When the data sets x60189_4 and f25_500 are excluded, the total times in Table V show a 49.21% reduction in time when using the 60% smaller population (200 instead of 500 individuals). Moreover, the table shows that the greedy method contributed 95% of the improvement in the solution, while the GA improved the solution by 3.51% for the population of 200 and by 2.99% for the population of 500. This supports the previous finding that SCX with a smaller population size is better.
A natural question is whether a larger population size with a simple crossover could give a higher overlap than SCX with the small population size in a reasonable time. To answer it, Table VI compares SCX with the smaller population size against the other types of crossover with the larger population size, in terms of time and overlap score.
The results in Table VI show that in 14 of the 17 data sets, using SCX with the 60% smaller population (200 individuals) leads to better results than the other crossovers with the larger population (500 individuals). In addition to reaching comparable accuracy with a smaller population, SCX is also significantly faster than the other crossovers. SCX is 26.92% faster than PMX in all data sets except x60189_6 and f25_500, and in some data sets it is 59.46% faster (as in f50_498). SCX is 38.38% faster than ERX in all data sets except x60189_4 and f25_500, and in some data sets it is 68.75% faster (as in f50_498). SCX is 34.64% faster than OX in all data sets except x60189_4, and 32.89% faster than CX in all data sets except x60189_4, x60189_6, and f25_500. All crossovers perform similarly on the small data sets, but SCX is still the dominant one. This confirms the previous result that SCX still gives the best solution with the smaller population size.
The results of this study are summarized as follows:
• The greedy method had a clear impact on the GA performance and contributed 95% of the improvement in the solution.
• Most of the time is spent creating the population, especially with the large population size.
• A smart crossover like SCX with a small population size is better than simple crossovers with a large population size.

C. The Effect of Selection Types on the Overlap Score when Varying the Population Generation Method
This section studies the effect of the initialization and selection types on the overlap score. The following figures show this effect when the type of crossover is SCX and the population size is 200, since the previous investigations show that a small population size is more suitable. Fig. 5, Fig. 6, and Fig. 7 show the effect of initialization types and selection types on the best overlap score. Clearly, greedy initialization gives a better overlap score than the random and 2-opt heuristics in 15 data sets out of 17.
Fig. 7. Effect of selection types with greedy initialization on best overlap score.
Fig. 8, Fig. 9, and Fig. 10 show the effect of initialization and selection types on the average overlap score. Greedy initialization also dominates the random and 2-opt heuristics for the average overlap score, giving a higher average in 16 out of 17 data sets. Additionally, roulette wheel selection is still better than tournament selection with the random and 2-opt heuristics. However, with greedy initialization, tournament selection is better.
The results of this study are summarized as follows:
• Greedy initialization is the best in the majority of the data sets for the best overlap score and in nearly all the data sets (16 of 17) for the average overlap score.
• Random initialization is better than 2-opt for the best overlap score, but their performance is almost the same for the average overlap score.
• Tournament selection is better than roulette wheel selection for the best and average overlap scores with greedy initialization.
Table VII illustrates the results obtained when studying the effect of initialization and selection types on the GA running time. It shows this effect when the type of crossover is SCX and the population size is 200. The GA time column represents the time it took for the GA to find the result. The least time for every data set is marked in bold green when the selection type is tournament and in bold blue when the selection type is roulette wheel. The gap column shows the difference in time between the generation methods (i.e., greedy, random, and 2-opt heuristic). Table VII shows that greedy initialization is the best in terms of running time. The random and 2-opt heuristics take more time than the greedy method; between the two, random records less time for 12 data sets out of 17, while 2-opt records less time for three data sets and equal time for two. As for the selection type, roulette wheel selection dominates tournament selection, recording the least time for 14 data sets out of 17. The gap confirms, as in the previous section, that the greedy generation method is better than random and 2-opt, and that random is better than 2-opt. In more detail, greedy initialization is faster than both random and the 2-opt heuristic in most of the data sets. Except for data sets M15421_7 and J02459_7, the greedy approach is 47.39% and 48.17% faster than the random and 2-opt heuristics, respectively, when the selection type is tournament, and 66.90% and 67.19% faster, respectively, when the selection type is roulette wheel.
For the other crossovers OX, CX, PMX, and ERX, the www.ijacsa.thesai.org tournament was the best in terms of solution quality, and the roulette was the fastest, because the tournament chose multiple parents every time and compared them to pick the better one.

D. The Effect of Selection Types Varying the Population Generation Method on GA Running Time
The results of this study are summarized as follows:
• The greedy initialization type takes less time than the random and 2-opt heuristics in almost all cases. The exception is the large data set (J02459_7): the greedy method takes time to create the population, and the larger the data, the more comparisons it makes and the longer it takes.
• The random initialization type is faster than the 2-opt heuristic in 12 data sets out of 17, which is an expected result.
• The roulette wheel selection type is faster than the tournament selection, but the tournament is better for solution quality.
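The growth in the number of comparisons made by the greedy method can be seen in a small sketch. The exact greedy rule is not restated in this section, so the sketch assumes a nearest-neighbour style construction: starting from a randomly chosen fragment, it repeatedly appends the unused fragment with the largest overlap with the last one. The names `greedy_individual` and `greedy_population` are hypothetical.

```python
import random
from typing import List, Sequence


def greedy_individual(overlap: Sequence[Sequence[float]],
                      start: int) -> List[int]:
    """Build one ordering by always taking the best-overlapping fragment.

    Assumption: overlap is an n x n matrix of pairwise overlap scores.
    Ties are broken in favour of the lower fragment index.
    """
    n = len(overlap)
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        last = order[-1]
        # One pass over all unused fragments per step: about n^2 / 2
        # comparisons per individual in total.
        nxt = max(remaining, key=lambda j: (overlap[last][j], -j))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def greedy_population(overlap: Sequence[Sequence[float]], size: int,
                      rng: random.Random) -> List[List[int]]:
    """Create `size` greedy individuals from random starting fragments."""
    n = len(overlap)
    return [greedy_individual(overlap, rng.randrange(n)) for _ in range(size)]
```

Under this assumption the work per individual grows quadratically with the number of fragments, which is consistent with the longer creation time observed for the largest data set.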

VI. DISCUSSION
This section discusses the findings that emerged from the results presented in the Results section. We conclude that a small population size (i.e., 200 individuals) is more suitable for most cases, and that the greedy type of initialization is the best in terms of both overlap score and time. Furthermore, the results show that the roulette wheel selection type is more suitable than the tournament selection in terms of time, but the tournament is better in terms of solution quality. This work also shows that the SCX crossover is the best in terms of both the best overlap score and the average overlap score.
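The trade-off between the two selection types can be made concrete with a minimal sketch. It assumes that fitness values are non-negative overlap scores with at least one positive value, and a tournament size of 3; both the tournament size and the function names are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def tournament_select(population: Sequence[List[int]],
                      fitness: Sequence[float],
                      k: int, rng: random.Random) -> List[int]:
    """Draw k distinct individuals and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    # The extra comparisons among the k contenders are the added cost
    # of the tournament relative to the roulette wheel.
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]


def roulette_select(population: Sequence[List[int]],
                    fitness: Sequence[float],
                    rng: random.Random) -> List[int]:
    """Draw one individual with probability proportional to its fitness."""
    return rng.choices(population, weights=fitness, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)
    pop = [[0, 1, 2], [2, 1, 0], [1, 0, 2]]
    fit = [10.0, 30.0, 20.0]
    print(tournament_select(pop, fit, k=3, rng=rng))
    print(roulette_select(pop, fit, rng=rng))
```

The tournament applies stronger selection pressure because a weak individual can only win if all of its contenders are weaker, whereas the roulette wheel can return any individual with non-zero fitness in a single draw.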
This study produced multiple GA versions; for the comparison with previous works, we selected the best version we obtained. The comparisons are divided as follows:
• Previous works that used the GA; the comparison is presented in Table VIII.
• Previous works that used other metaheuristic algorithms; the comparison is presented in Table IX.
Table VIII compares the results of the designed GA with those of the previous GA works in terms of the overlap score. The best results are marked in bold. The "difference in percentage" column shows the difference between our best results and those of the previous works. Clearly, our results for the F-series data sets (from F25_305 to F100_512) dominate all the previous works' results. This work scored lower than the best results of previous works in eight data sets out of 17; however, our results are still better than those of [23], [25], and [9] for these data sets.
Moreover, this work obtained better results than all the previous works in nine data sets out of 17.
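For readers unfamiliar with the metric being compared, the following sketch shows one common way to score a fragment ordering: the sum of the pairwise overlaps of adjacent fragments. This definition is an assumption based on the usual DNAFA fitness function; the exact scoring used to produce the tables is not restated in this section, and `overlap_score` is a hypothetical name.

```python
from typing import Sequence


def overlap_score(order: Sequence[int],
                  overlap: Sequence[Sequence[float]]) -> float:
    """Sum the overlap of every pair of adjacent fragments in the ordering.

    Assumption: overlap[i][j] is the precomputed overlap score between
    fragment i and fragment j; a higher total means a better ordering.
    """
    return sum(overlap[a][b] for a, b in zip(order, order[1:]))
```

Under this definition, two orderings of the same fragments are directly comparable, since only the adjacencies differ.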
With regard to time, the results were obtained in a reasonable time and there is no significant difference in time, because the dominant time is actually not the GA time but the assembly time (i.e., in our case, the Smith-Waterman algorithm). The GA is useful when the data set is large, which is expected because the GA avoids exploring the large search space exhaustively. The results show that the designed GA gives the results in less time for large data sets such as M15421_6, M15421_7, and J02459_7, which have several fragments that vary from 173 to 352 characters.
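To indicate why the alignment step can dominate the running time, the following is a minimal sketch of the Smith-Waterman local alignment score with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) and the helper `overlap_matrix` are assumptions chosen for illustration; the sketch also ignores fragment orientation and reverse complements, and it is not the alignment code used in the experiments.

```python
from typing import List


def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def overlap_matrix(fragments: List[str]) -> List[List[int]]:
    """Score every ordered pair of distinct fragments (0 on the diagonal)."""
    n = len(fragments)
    return [[0 if i == j else smith_waterman_score(fragments[i], fragments[j])
             for j in range(n)] for i in range(n)]
```

Each pairwise score costs time proportional to the product of the two fragment lengths, and the matrix requires a score for every pair of fragments, so this step grows much faster with the data set size than a single GA generation does.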

VII. CONCLUSION
This paper is a continuation of our previous work [4] on solving the DNA fragment assembly problem. As pointed out in the introduction, the DNAFA is an optimization problem that attempts to reconstruct an original DNA sequence by finding the shortest DNA sequence from a given set of fragments. We designed a platform for the genetic algorithm from which more than one version of the genetic algorithm can be derived to solve this problem. The design was inspired by the good designs that solved the TSP and QAP problems. To our knowledge, this study is the first to examine the genetic algorithm for the DNAFA problem from this perspective. In more detail, this study has gone a long way towards investigating the effect of genetic algorithm operators on the quality of the solution to the DNAFA problem. The study focused on the effect of the initial population, the population size, the selection types, and the crossover types. The most obvious finding to emerge from this study is that SCX, a smart crossover that has never been used before with DNAFA, gave better results than the rest of the studied crossover types. Furthermore, the results show that the population generation method has the greatest influence on GA performance in terms of time and solution quality. We also configured the best-designed GA variant, which outperforms the existing GA algorithms for the DNAFA problem. This GA variant uses a population size of 200 individuals along with the greedy method for initializing the population, tournament selection, and SCX crossover. This study has found that, generally, the size of the population does not significantly affect the quality of the solution, especially if the type of initialization is good. The results were good and competitive compared with those of previous works.
Our design produced better results than all previous results from the literature for some data sets.