Elitist Animal Migration Optimization for Protein Structure Prediction based on 3D Off-Lattice Model

Abstract: Predicting protein structure has long been a focus of attention for researchers. The aim is to make a reliable prediction of the protein structure by finding the minimum energy value of the interactions among amino acids. The functionality of a protein can then be determined from the resulting shape of its amino acid chain. However, this is known as one of the most challenging tasks in bioinformatics because of its high computational complexity. Metaheuristic algorithms are widely preferred by researchers from various fields, since they perform well on such complex problems. The Animal Migration Optimization (AMO) algorithm is a metaheuristic approach that mimics the behavior of animals during migration. In this research, to reach a high solution quality, an elitist version of Animal Migration Optimization (ELAMO) is considered and applied to the Protein Structure Prediction (PSP) problem. The performance of ELAMO is tested on well-studied artificial and real protein sequences and compared with powerful optimization algorithms that are specially designed for the PSP problem. The results show that ELAMO is capable of solving this problem. Hence, it can be used as an efficient optimizer for complex bioinformatics problems that require high solution quality.


I. INTRODUCTION
In molecular biology, understanding the structure of a protein sequence reveals the hidden functions of life [1]. Proteins fold in different ways, and the folded shape carries the information needed to understand their function. Proteins are formed by amino acids connected by peptide bonds [2]. According to Christian Anfinsen's pioneering work, a protein in its native three-dimensional state is at its lowest energy level, known as the Gibbs energy level [3]. The Protein Structure Prediction (PSP) problem consists of finding this state by searching for the minimum Gibbs energy level. As the amino acid sequence grows longer, predicting the structure of the protein becomes more complex.
Researchers developed an approach called the 'HP model' for protein folding prediction [4]. In the HP model, a protein is described as a chain of smaller units called monomers, which are placed on either a 2D or a 3D lattice. The letters 'H' and 'P' denote hydrophobic and polar monomers, respectively. The goal is to find the optimal structure of a given H-P chain, defined as the one with the maximum number of H-H contacts. Although the HP model is designed specifically for protein folding and is attractive for its simplicity, it does not provide satisfactory solutions for PSP. The problem has been proved to be NP-hard, and with the large number of amino acid sequences it requires very efficient algorithms [5][6][7].
One of the main difficulties of protein folding is that the free-energy space contains multiple local optima, with the global optimum located among them, which makes it hard to find [8]. To avoid a large computational cost, simplified models that omit some properties of protein folding have been preferred [8][9][10]. A representative example of such models is the off-lattice model presented by Stillinger et al. [8], which is employed to simplify protein folding.
Animal Migration Optimization (AMO) is a bio-inspired metaheuristic approach proposed by Li et al. [11]. It is founded on an animal's instinct to follow its close neighbors during migration and has shown validated performance on many optimization problems. Despite its notable properties, AMO may have some disadvantages, such as a low convergence rate because the next candidate solutions are chosen only from the current animal's neighborhood, or a lower chance of finding the global optimum because of following the wrong neighbors. To overcome these limitations and help the algorithm converge toward the global optimum in fewer iterations, Elitist Animal Migration Optimization (ELAMO) was proposed on the basis of an animal's instinct to follow its leaders, not only its closest neighbors [12]. ELAMO has validated performance on combinatorial NP-hard problems. However, to the best of our knowledge, neither AMO nor ELAMO has been applied to bioinformatics problems, and in particular not to the PSP problem. In this paper, the ELAMO algorithm is adapted to the Protein Structure Prediction (PSP) problem using the 3D AB off-lattice model.
The rest of the paper is organized as follows. Section II reviews important studies on solving PSP over the years. Section III presents the three-dimensional AB off-lattice model and the adaptation of the Elitist Animal Migration Optimization algorithm to the PSP problem, together with the model equations. In Section IV, the performance of ELAMO is compared with some powerful optimizers on both synthetic and real protein sequences; the results, the visual representations of the minimum energy configurations, and the discussion are given in detail in that section. Lastly, Section V gives the concluding remarks.
Many metaheuristic algorithms have been shown to be efficient in solving complex and even NP-hard problems. In particular, bee colony optimization algorithms and their variants were used to solve the PSP problem by Li et al. [13]. Kalegari and Lopes proposed an improved Differential Evolution algorithm that solves PSP efficiently using 2D and 3D off-lattice models [14]. Another Differential Evolution variant was proposed by Rakhshani et al. for complex protein structure prediction problems [15].
Deep learning has also been tested for PSP by Senior et al. [16], with promising results. Schauperl and Denny applied AI-based protein structure prediction in drug discovery [17], Chowdhury et al. [18] solved the protein prediction problem using a deep learning model, and Weißenow et al. [19] solved the PSP problem accurately using an AI model.
Multi-meme algorithms were adapted for PSP by Krasnogor et al. [20]. Lin and Zhang introduced a novel hybrid global optimization method that combines the Genetic Algorithm and Particle Swarm Optimization to solve PSP, aiming to produce conformations with lower energy levels [21]. Boiani and Parpinelli proposed a hybrid algorithm called cuHjDE-3D, which combines the self-adaptive Differential Evolution algorithm jDE with Hooke-Jeeves Direct Search (HJDS) [22].
The literature review reveals that standard metaheuristic algorithms have limited performance on the PSP problem. Before attempting to solve the problem, researchers either modify or hybridize the standard algorithms, aiming to bring the strengths of those algorithms to bear on PSP.
In this study, no machine learning method is implemented. Instead, a modified version of the AMO algorithm is studied to observe how it handles the problem through its parameters. One of the main contributions of this study is to adapt the animal migration algorithm to the protein structure prediction problem, enhancing its diversity with an elitist approach and achieving satisfactory results.

A. Three Dimensional AB Off-Lattice Model
The AB off-lattice model is inspired by the HP model and is considered one of the useful models for the Protein Structure Prediction problem. When it was introduced, the AB off-lattice model was designed for 2D protein structures; it was later extended to 3D structures as well [8,23].
A protein sequence is built from 20 types of amino acids, which are divided into two categories, hydrophobic and hydrophilic, according to their affinity to water. The amino acids are then translated into two specialized monomers, 'A' and 'B'. As the K-D method proposes, I, V, L, P, C, M, A, G are hydrophobic amino acids represented by the letter A, and D, E, F, H, K, N, Q, R, S, T, W, Y are hydrophilic amino acids represented by the letter B [24]. The AB off-lattice model takes its name from these specialized monomers A and B.
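As a minimal sketch of this translation step, the Python snippet below maps a one-letter amino acid string to its AB form using exactly the two groups listed above. The function name `to_ab` and the example input are illustrative assumptions, not part of the original study.

```python
# Hydrophobic / hydrophilic groups exactly as listed in the text (K-D method).
HYDROPHOBIC = set("IVLPCMAG")
HYDROPHILIC = set("DEFHKNQRSTWY")


def to_ab(sequence: str) -> str:
    """Translate a one-letter amino acid sequence into an AB string."""
    out = []
    for residue in sequence.upper():
        if residue in HYDROPHOBIC:
            out.append("A")
        elif residue in HYDROPHILIC:
            out.append("B")
        else:
            # Anything outside the 20 listed letters is rejected, not guessed.
            raise ValueError(f"unknown residue: {residue!r}")
    return "".join(out)


# Hypothetical five-residue fragment, used only to show the call.
print(to_ab("MKVLA"))
```

Rejecting unknown letters instead of silently mapping them keeps a malformed sequence from changing the energy landscape unnoticed.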
The amino acids are bonded to each other by chemical bonds and can be placed anywhere in 3D space. The model is called off-lattice because the positions of the amino acids are not restricted to a lattice. In the AB off-lattice model, bonds are described by two sets of angles: folding (θ) and rotation (ϕ). A protein sequence that contains n monomers has n-2 folding angles and n-3 rotation angles. The optimal structure of the AB off-lattice model yields the free energy, which gives general information about the physical and chemical character of the protein sequence. Fig. 1 shows the representation of an artificial protein sequence ABAA with folding (θ) and rotation (ϕ) angles, where 'A' and 'B' are hydrophobic and hydrophilic amino acids, respectively. The folding angles [θ1, θ2] and the rotation angle [ϕ1] need to be optimized to reach the minimum free energy level. The 3D AB off-lattice model turns protein structure prediction into a numerical optimization problem in which each amino acid i carries a characteristic value that marks it as either hydrophilic or hydrophobic. The folding angles (θ) are bounded in [-180°, 180°].
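To make the angle encoding concrete, the following Python sketch turns n-2 folding angles and n-3 rotation angles into 3D coordinates with unit bond lengths. The placement rule used here (first two monomers fixed on the y-axis, third monomer in the xy-plane, later bonds given as unit vectors in spherical form) is one parametrization seen in the off-lattice literature and is an assumption; the coordinate equations of this study may differ. The name `build_chain` is hypothetical, and angles are taken in radians.

```python
import math


def build_chain(theta, phi):
    """Return a list of (x, y, z) monomer positions for the given angles.

    theta: n-2 folding angles, phi: n-3 rotation angles (both in radians).
    """
    n = len(theta) + 2
    if n < 3 or len(phi) != n - 3:
        raise ValueError("need n-2 folding angles and n-3 rotation angles")
    # First two monomers are fixed; the third lies in the xy-plane.
    pts = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    pts.append((math.cos(theta[0]), 1.0 + math.sin(theta[0]), 0.0))
    # Each later bond is a unit vector defined by one folding and one rotation angle.
    for i in range(3, n):
        t, p = theta[i - 2], phi[i - 3]
        x, y, z = pts[-1]
        pts.append((x + math.cos(t) * math.cos(p),
                    y + math.sin(t) * math.cos(p),
                    z + math.sin(p)))
    return pts


# Four monomers (as in the ABAA example): two folding angles, one rotation angle.
chain = build_chain([math.radians(30), math.radians(-60)], [math.radians(45)])
for point in chain:
    print(point)
```

Under this rule every consecutive pair of monomers is exactly one unit apart, so the optimizer only has to search over the 2n-5 angles.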
To obtain the distance between amino acids i and j, the following equation is used.
The following equation shows the basic rotations for two amino acids.
In the AB off-lattice model, it is assumed that AA pairs have strong correlations, BB pairs have relatively weaker correlations, and mixed BA or AB pairs form a third case, each with its own coefficient value. Using these assumptions, the protein structure problem is converted into a numerical optimization problem that can be handled by evolutionary optimization techniques. Different combinations of distances and rotations give different energy levels of the amino acid chain, and as the algorithm iterates it approaches the optimum energy level.
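For reference, the energy function of the AB off-lattice model is commonly written in the literature in the form below. This is offered as a hedged restatement of the standard formulation, not necessarily the exact notation of this study: the symbols ξ_i (+1 for an A monomer, −1 for a B monomer), C(ξ_i, ξ_j) and r_ij (the distance between monomers i and j) are assumptions about notation, and θ_i is taken as the bend angle between successive bonds.

```latex
\begin{align}
E &= \sum_{i=1}^{n-2} \frac{1}{4}\left(1 - \cos\theta_i\right)
   + 4 \sum_{i=1}^{n-2} \sum_{j=i+2}^{n}
     \left[ r_{ij}^{-12} - C(\xi_i, \xi_j)\, r_{ij}^{-6} \right], \\
C(\xi_i, \xi_j) &= \frac{1}{8}\left(1 + \xi_i + \xi_j + 5\,\xi_i \xi_j\right).
\end{align}
```

With this coefficient, C = 1 for an AA pair, C = 1/2 for a BB pair, and C = −1/2 for an AB or BA pair, which corresponds to the strong, weaker, and mixed-pair cases of the model.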

B. Adaptation of ELAMO to PSP Problem
Animal migration is a common behavior of animal herds, used to discover better places to live and reproduce. The Animal Migration Optimization (AMO) algorithm is based on this behavior and has proved to be a validated optimizer for optimization problems [11]. Our approach follows the main steps of AMO and adds elitist behavior to it. In the Elitist Animal Migration approach, the neighborhood structure of standard AMO is reconstructed so that the animals in the herd follow their leaders, not only their close neighbors.
During the migration process, an animal's position depends on its neighbors. In standard AMO, each animal migrates by following its five closest neighbors. In our elitist approach, however, an animal's instinct to follow the leader of the herd is essential. In a typical animal herd, there are three kinds of animals: Alpha (α), Beta (β) and Omega (ω).
The alpha (α) is responsible for all animals in the herd, for example in finding prey or discovering new living areas. If the alpha dies, a new leader is selected among the beta (β) animals, who rank directly below the alpha (α). The rest of the animals are omegas (ω), who obey the rules of the herd. In ELAMO, only the α and β animals are in charge of migration. The number of α animals is 1 and the number of β animals is fixed to 4. Thus, all animals in the herd move toward new living areas by following these leaders. As the algorithm iterates, new α and β animals are selected according to their positions relative to the rest of the animals. Fig. 2 and Fig. 3 show the neighborhood structure in AMO and ELAMO, respectively, where each animal is represented schematically by a circle.
The elitist animal migration algorithm is built on two fundamental steps: animal migration and population updating. In the animal migration step, animals change their positions toward their α and β animals as given in (4). In the population updating step, displacement of animals is introduced: some animals may be eliminated due to death, or they may compete for their positions, in which case the losers are discarded from the population. The new positions are updated according to their fitness values, as shown in (5).
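The leader structure described above can be sketched in a few lines of Python. The sketch assumes that leaders are chosen by ranking the herd on the objective value (lower energy is better), which is an interpretation of "with respect to their positions"; the function name `select_leaders` and the sample values are hypothetical.

```python
def select_leaders(energies, n_beta=4):
    """Split herd indices into one alpha, n_beta betas and the omegas.

    energies: list with one objective value per animal (lower is better).
    """
    if len(energies) < 1 + n_beta:
        raise ValueError("herd is too small for one alpha and n_beta betas")
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    alpha = order[0]                  # best animal leads the herd
    betas = order[1:1 + n_beta]       # next best animals are second in charge
    omegas = order[1 + n_beta:]       # everyone else follows
    return alpha, betas, omegas


# Made-up objective values for a herd of seven animals.
alpha, betas, omegas = select_leaders([3.0, -1.0, 0.5, 2.0, -0.2, 7.0, 1.0])
print(alpha, betas, omegas)
```

Re-running the selection at every iteration lets a former omega become a leader once its position improves.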

Animal migration step (4): δ is a random number drawn from a Gaussian distribution, G ∈ [1, ..., D] is the dimension index of each animal, and the leader's position is selected randomly from the neighborhood structure of animal X_i.
Population updating step (5): X_betaRand is an animal selected randomly among the beta animals, the position of the alpha is also used, rand is a random number between 0 and 1, and a ≠ b.
The main control parameters of ELAMO influence the population by balancing diversity, that is, exploring new possible areas of the search space, against intensity, that is, focusing the search around the leaders α and β.
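The two steps can be illustrated with the hedged Python sketch below. Because the update rules are only described in words here, the formulas in the code follow the usual AMO pattern and are assumptions: migration moves each coordinate by a Gaussian factor toward a randomly chosen leader, and the population update builds a trial from one random beta plus random pulls toward the alpha and a second, different beta. The rank-based replacement probability, the greedy acceptance, and the names `elamo_step`, `bounds` and `n_beta` are likewise illustrative choices, not the authors' implementation.

```python
import random


def elamo_step(pop, energy_fn, rng, bounds, n_beta=4):
    """Run one ELAMO-style iteration and return the updated population."""
    lo, hi = bounds
    pop = [list(x) for x in pop]                  # work on a copy
    n, dim = len(pop), len(pop[0])
    fit = [energy_fn(x) for x in pop]

    def clip(v):
        return max(lo, min(hi, v))

    def ranked():
        return sorted(range(n), key=lambda i: fit[i])   # lowest energy first

    # Migration: move toward a randomly chosen leader (the alpha or a beta).
    leaders = ranked()[:1 + n_beta]
    for i in range(n):
        lead = pop[rng.choice(leaders)]
        trial = [clip(x + rng.gauss(0.0, 1.0) * (l - x))
                 for x, l in zip(pop[i], lead)]
        f = energy_fn(trial)
        if f < fit[i]:                            # greedy acceptance (assumption)
            pop[i], fit[i] = trial, f

    # Population update: worse-ranked animals are rewritten more often.
    order = ranked()
    alpha, betas = order[0], order[1:1 + n_beta]
    for rank, i in enumerate(order):
        pa = (n - rank) / n                       # best -> 1.0, worst -> 1/n
        a, b = rng.sample(betas, 2)               # two distinct betas, a != b
        trial = list(pop[i])
        for d in range(dim):
            if rng.random() > pa:
                trial[d] = clip(pop[a][d]
                                + rng.random() * (pop[alpha][d] - pop[i][d])
                                + rng.random() * (pop[b][d] - pop[i][d]))
        f = energy_fn(trial)
        if f < fit[i]:
            pop[i], fit[i] = trial, f
    return pop


# Toy run on a sphere function, a stand-in objective and not the PSP energy.
rng = random.Random(1)
sphere = lambda x: sum(v * v for v in x)
herd = [[rng.uniform(-3.0, 3.0) for _ in range(5)] for _ in range(20)]
for _ in range(50):
    herd = elamo_step(herd, sphere, rng, (-3.0, 3.0))
print(min(sphere(x) for x in herd))
```

With greedy acceptance in both steps, the best energy in the herd never gets worse from one iteration to the next, which makes the sketch easy to check on a toy objective before plugging in a protein energy function.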
The adaptation of ELAMO to the PSP problem is given in Fig. 4. One of the main motivations is to reach a higher level of optimization by changing the neighborhood structure. In standard AMO, a neighbor's position may be used simply because it is the closest, even if that position is relatively worse than others. In ELAMO, by contrast, the best positions are chosen at each iteration and are followed by all of the animals.
As the no-free-lunch theorem [25] explains, an algorithm that performs well in some respects may not reach the same level in others. In ELAMO, individuals are discarded from the population as the elitism feature requires, even at the beginning of the iterations. Therefore, the balance between diversification and intensification in ELAMO may be negatively affected by the loss of individuals in the early stages of optimization. A herd with α, β and ω animals corresponds to an AB off-lattice model for a set of sequences. All of the animals in the herd are potential solutions to PSP, each encoding distances, angles and interactions between particles. Energy levels are derived from these values. In PSP, the objective function is the energy function, and the optimal solution is the one with the lowest energy value.
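Since the objective function is the energy of a conformation, a candidate solution can be scored as sketched below in Python. The sketch assumes the commonly used AB off-lattice energy (a bending term plus a Lennard-Jones-like term with pair coefficient 1 for AA, 0.5 for BB and −0.5 for AB), which may differ in detail from the equations used in this study; `ab_energy` and its arguments are hypothetical names.

```python
import math


def ab_energy(coords, sequence):
    """Energy of an AB chain given 3D monomer coordinates (lower is better)."""
    n = len(sequence)
    if len(coords) != n or n < 3:
        raise ValueError("need one coordinate per monomer and at least 3 monomers")
    if set(sequence) - {"A", "B"}:
        raise ValueError("sequence must contain only 'A' and 'B'")
    xi = [1 if s == "A" else -1 for s in sequence]

    # Bending term: (1 - cos of the bend angle) / 4 at every interior monomer.
    bend = 0.0
    for i in range(1, n - 1):
        u = [b - a for a, b in zip(coords[i - 1], coords[i])]
        v = [b - a for a, b in zip(coords[i], coords[i + 1])]
        norm = math.dist(coords[i - 1], coords[i]) * math.dist(coords[i], coords[i + 1])
        cos_t = sum(p * q for p, q in zip(u, v)) / norm
        bend += (1.0 - cos_t) / 4.0

    # Pair term over all non-adjacent monomers.
    pair = 0.0
    for i in range(n - 2):
        for j in range(i + 2, n):
            r = math.dist(coords[i], coords[j])
            c = (1 + xi[i] + xi[j] + 5 * xi[i] * xi[j]) / 8.0
            pair += 4.0 * (r ** -12 - c * r ** -6)
    return bend + pair


# A straight three-monomer chain has no bending energy, only one pair term.
print(ab_energy([(0, 0, 0), (0, 1, 0), (0, 2, 0)], "AAA"))
```

Computing the bend angle from bond vectors keeps the function independent of how the coordinates were generated, so any angle-to-coordinate mapping can be used in front of it.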

IV. RESULTS AND DISCUSSION
In this section, two sets of analyses are carried out. First, a set of artificial Fibonacci protein sequences that are commonly used as benchmarks in the literature is studied [8,26], and then real protein sequences are analyzed. The data sets are experimentally examined structures used to test the efficiency of methods for the PSP problem.
The benchmark sequences, written with the 'A' and 'B' monomers, are listed in Table I, where N is the total number of monomers. A comparative study is given in Table II to observe the performance of ELAMO with respect to some powerful optimization algorithms: Improved Particle Swarm Optimization (EPSO) [27], Internal Feedback strategy based on Artificial Bee Colony Algorithm (IF-ABC) [28], Combination of Genetic Algorithm and Particle Swarm Optimization (GAPSO) [21], and the standard AMO from which ELAMO originates. Note that the results of the compared algorithms are included in the comparison table as they appear in their original studies.
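Artificial Fibonacci sequences of this kind are usually generated by the recurrence S_0 = A, S_1 = B, S_{i+1} = S_{i-1} S_i (string concatenation). The short Python sketch below reproduces that recurrence; it assumes the benchmark table uses this standard construction, and the function name `fibonacci_sequence` is hypothetical.

```python
def fibonacci_sequence(k):
    """Return the k-th Fibonacci AB string, with S_0 = 'A' and S_1 = 'B'."""
    if k < 0:
        raise ValueError("k must be non-negative")
    prev, cur = "A", "B"
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, cur = cur, prev + cur   # S_{i+1} = S_{i-1} followed by S_i
    return cur


# Lengths follow the Fibonacci numbers: k = 6, 7, 8 give 13, 21 and 34 monomers.
for k in (6, 7, 8):
    s = fibonacci_sequence(k)
    print(len(s), s)
```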
All simulations are implemented in C++ on an Intel Core i5 CPU running at 3.10 GHz with 4 GB RAM. All benchmark sequences are evaluated over 30 independent runs with random initial points. All protein sequences are optimized by AMO and ELAMO for 50,000 iterations. As shown in Fig. 5, almost all of the algorithms converge at a similar rate in the early iterations; as the iterations go on, however, the convergence rate of ELAMO stands out from the others on all benchmarks. When the original AMO is compared with the others, all of the selected algorithms converge better than AMO.
All of the algorithms selected for comparison are either hybrids of one or more algorithms or improved versions of the originals. AMO is the only one whose performance was not boosted, which may be the main reason for its lower convergence rate observed above. On the other hand, this is a good indication that the elitist modification has improved AMO so that it can handle such problems with high convergence rates.
To observe the effect of the added elitism feature, a detailed comparison between AMO and ELAMO is shown in Table III, where Best is the lowest free-energy value obtained after 50,000 iterations, Avg is the average free-energy value over 30 independent runs, and Stdev is the standard deviation. The results indicate the improved solution quality of ELAMO over AMO, as well as its robustness. For all of the artificial sequences, the Best and Avg values of ELAMO are superior to those of AMO. Only for the sequence of length 34 is the Stdev value less satisfactory than expected: ELAMO produces encouraging results for the lowest free-energy and average values, but does not improve the standard deviation for this sequence. When the Stdev values are compared for the sequence of length 21, AMO is quite close to ELAMO. A possible reason for this unexpected behavior is the pace at which the α and β animals converge in the search space while some of the ω animals are still in the optimization process. As the iterations progress, the majority of the ω animals find their way toward the optimum.
Researchers have used well-established real protein benchmarks to analyze their algorithms more thoroughly [15][28][29][30][31]. The same protein sequences are used in this study as benchmarks and rewritten according to the K-D method, where I, V, L, P, C, M, A, G are hydrophobic amino acids represented by the letter 'A' and D, E, F, H, K, N, Q, R, S, T, W, Y are hydrophilic amino acids represented by the letter 'B'. The amino acid sequences were selected from the widely used Protein Data Bank (PDB) database with different lengths to allow a more effective comparison with the other algorithms. The IDs, lengths and amino acid contents are given in Table IV.
In the study with real protein sequences, only one standard algorithm is chosen for the comparison; the others are specially designed hybrid algorithms with significant performance. Table V lists the lowest free-energy values obtained by ELAMO and the other competitive algorithms for an extensive comparison. The convergence behavior, including the earlier iterations, can be seen in Fig. 6. The figure shows that at the very first stage of optimization the convergence rates of all the algorithms are very similar, but in the later iterations the elitism feature of ELAMO takes effect and improves its convergence rate. The test parameters and results of the compared algorithms are taken as they appear in the references [15][28][29][30][31]. The best results for AMO and ELAMO are obtained over 30 runs. The stopping condition is set to 200,000 iterations.
Both Table V and Fig. 6 show that ELAMO can be distinguished from the others by its more precise solutions. Considering all of the results, the algorithms are very competitive, but in terms of solution quality the performance of ELAMO is noteworthy.
Compared with ABC, one of the standard algorithms, ELAMO produces better results. Among the other competitive algorithms, CMAES and L-SHADE perform similarly to each other, while E-MASA-PAMS is superior to both and produces results comparable with ELAMO. In both analyses, however, ELAMO always attains the lowest free-energy values, which is evidence that applying a set of modifications at the right steps of the AMO algorithm influences the performance remarkably. It is worth noting that the compared algorithms cannot be analyzed further, since no Stdev or Avg values are reported in the literature.
In light of these findings, the folded protein structures from the best run of ELAMO are visualized in Fig. 7 (a) to (d). In the figures, green dots represent the hydrophobic monomer 'A' and purple dots represent the hydrophilic monomer 'B'. As the figures reveal, the hydrophobic monomers are frequently enclosed by the hydrophilic monomers in the folded structures. This is a natural phenomenon that avoids contact with water molecules, and it is confirmed by the figures as well.

V. CONCLUSIONS
In this paper, Elitist Animal Migration Optimization (ELAMO) is adapted to optimize the structure of protein sequences with the 3D AB off-lattice model. In the reformed structure of ELAMO, only the group leaders are followed rather than the neighbors. This movement results in high solution quality, and the elimination of animals during the migration process helps keep the algorithm from becoming trapped in local optima. To enable a more accurate comparison, standard AMO and other effective algorithms are included in our experiments. Even though ELAMO was not specially designed for the PSP problem, it models real protein sequences effectively.
ELAMO eliminates the animals whose positions are not adequate even in the early stages of optimization, and the remaining animals in the herd can only follow their leaders. This elimination brings a faster convergence rate without the search becoming trapped in local optima. Optimization is built on consecutive iterations that continue until the termination criterion is met. During this process, some individuals may be eliminated because of undesired characteristics that might have improved in later iterations. In ELAMO, the elimination is performed even in the first iterations, so desirable characteristics may be eliminated as well. It is important to keep in mind that this may lower the solution quality as the length of the protein sequence increases.
Also, when the problem complexity increases with a large number of amino acids, the algorithm may not be as efficient as expected, because it was not designed specifically for the PSP problem. In the future, ELAMO may need to be strengthened with additional boosting steps to solve more complex protein sequences efficiently.