Parallel Improved Genetic Algorithm for the Quadratic Assignment Problem

The Quadratic Assignment Problem (QAP) is one of the most common combinatorial optimization problems and models many real-life problems. Many techniques have been applied to solve the QAP, including exact, heuristic, and metaheuristic methods. The Genetic Algorithm (GA) is a powerful heuristic approach for finding optimal or near-optimal solutions to Quadratic Assignment problems. In this paper, we develop a Genetic Algorithm with a new crossover operator, which works without a crossover point and is closer to what is found in nature, and a new intelligent mutation operator. We then develop a Parallel Genetic Algorithm that uses the same crossover and mutation operators. The sequential Genetic Algorithm is implemented on the Central Processing Unit (CPU), and the Parallel Genetic Algorithm is implemented on the Graphics Processing Unit (GPU). This paper presents two comparisons. The first measures the elapsed time of crossover, mutation, and selection on both the CPU and the GPU and compares the results. It shows that the computation time in the parallel environment is around half of that in the sequential environment. The second comparison iterates these operators over several generations, using twenty benchmark instances from the Quadratic Assignment Problem Library (QAPLIB) with sizes from 12 to 70, a population size of 600, 2000 generations, and a maximum of 600 parallel threads. The proposed crossover and mutation find the optimal solutions for ten benchmarks with problem sizes from 12 to 32 in both the sequential and the parallel Genetic Algorithm; the remaining ten benchmarks yield solutions close to the optimal solution with a small error rate.


I. INTRODUCTION
The Quadratic Assignment Problem (QAP) is one of the most common combinatorial optimization problems and models many real-life problems. The QAP assigns n facilities, which have flows (weights) among them, to n possible locations, which have distances among them, so as to minimize the sum of the distances multiplied by the flows. This minimum is approached by assigning facilities with high flows to nearby locations and facilities with low flows to distant locations. The problem was first introduced as a mathematical model for economic activities in 1957 [1]. It then became a fundamental problem that represents applications in many areas, such as computer backboard wiring, locating clinics within a hospital, placing machines and electronic components, and assigning buildings on a university campus.
The QAP consists of n facilities and n possible locations, with exactly one facility assigned to each location. For each pair of facilities, a flow matrix $F = [f_{ij}]$ is defined, where $f_{ij}$ is the flow that must be moved from facility i to facility j. Likewise, for each pair of locations, a distance matrix $D = [d_{kl}]$ is defined, where $d_{kl}$ is the distance between location k and location l. The assignment of facility i is not independent of the other assignments: when assigning facility i to location k, we must consider the assignments of all other facilities that have nonzero flows with facility i. Let $\pi$ be a permutation of $\{1, \dots, n\}$, where $\pi(i)$ is the location assigned to facility i. The QAP is to find the permutation that minimizes the total cost:
$$\min_{\pi} \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij}\, d_{\pi(i)\pi(j)} \qquad (1)$$
Since the solution is one of n! possible assignments, the problem cannot be solved by exhaustive search in reasonable time even for moderate problem sizes, even with modern computers.
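The cost of one candidate assignment can be evaluated directly from the flow and distance matrices. The following is a minimal Python sketch, assuming a 0-based permutation `perm` in which `perm[i]` is the location assigned to facility i; the function name `qap_cost` and the 3x3 matrices are hypothetical and are not taken from QAPLIB.

```python
def qap_cost(perm, flow, dist):
    """Sum of flow[i][j] * dist[perm[i]][perm[j]] over all facility pairs."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(n))


# Toy symmetric instance (illustrative values only).
flow = [[0, 5, 2],
        [5, 0, 3],
        [2, 3, 0]]
dist = [[0, 1, 4],
        [1, 0, 2],
        [4, 2, 0]]

print(qap_cost([0, 1, 2], flow, dist))  # 38: the high-flow pair sits on the short distance
print(qap_cost([2, 0, 1], flow, dist))  # 54: the high-flow pair sits on the long distance
```

The double sum costs O(n^2) per evaluation, which is why a population of several hundred individuals already makes fitness evaluation the dominant cost of each generation.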
QAP solution methods fall into three main classes: exact methods, heuristic methods, and metaheuristic methods. Exact methods give the exact optimal solution, but their drawback is a long computational time that makes them impractical for larger instances. Therefore, researchers have resorted to heuristic and metaheuristic methods, which overcome the problem of long computational time but have their own drawback: they do not guarantee the exact optimal solution and instead provide a good, near-optimal solution in reasonable computational time. Genetic algorithms, simulated annealing, tabu search, and artificial neural networks are some well-known heuristic methods, and the genetic algorithm is considered one of the best of them.
www.ijacsa.thesai.org
A Genetic Algorithm (GA) maintains individual candidate solutions that have no dependencies between them, so it is easy to implement in parallel and obtain a considerable speedup. This paper uses parallelism as an effective way to tackle difficult problems and reduce their computational time. Additionally, the GA is a popular heuristic approach that is effective in both computation time and solution quality. These properties motivated us to combine the GA and parallelism to solve this difficult problem.
This work exploits recent improvements in the graphics processing unit (GPU), which has expanded beyond graphics to general parallel computation. We propose a solution for the QAP using a genetic algorithm with enhanced crossover and mutation. The enhancements are a new crossover operator, which needs no crossover point and is closer to what is found in nature, and a new intelligent mutation operator; together they improve solution quality.

II. BACKGROUND AND RELATED WORK
Genetic algorithms were introduced by John Holland at the University of Michigan in 1975 [2]. The first application of a GA to the QAP was by Fleurent and Ferland in 1994 [3]. The GA is a stochastic search technique based on three natural operators: selection, crossover, and mutation. Many other efficient algorithms have been proposed recently; we briefly review them to explore the new techniques and take advantage of them.

Radomil Matousek et al. [4] presented a metaheuristic optimization method using the HC12 algorithm. It is a parallel algorithm implemented on the GPU. HC12 is a genetic algorithm with binary encoding in which the next population is drawn from the neighborhood of the current solution. The algorithm finds the optimal solutions for 8 problems with sizes from 12 to 32 in a short run time, 1.89 seconds on average.
Takeshi Okano et al. [5] proposed a variant k-opt local search (vKLS), a sequential algorithm for the CPU environment. vKLS uses a variable-depth approach that exchanges multiple nodes at a time rather than just two. It combines two strategies, the best-improvement move and the first-improvement move. vKLS was tested on 48 QAPLIB instances with sizes from 20 to 150 under a fixed time limit of 60 seconds.
Ensieh et al. improved the performance of the Fast Local Search algorithm (NIFLS) in the sequential environment by adding the temperature characteristic of simulated annealing, which drives the search to explore the search space more widely [6]. The algorithm achieves an average percent deviation (APD) of 0.26 with an average execution time of 1207 seconds.
Erdener et al. developed an Iterated Local Search (ILS) algorithm using GPU parallelism [7]. They implement the multi-start technique, use a delta function instead of recomputing the objective function for each neighbor, and design a mutation operator to escape local optima. The algorithm runs 6.31 to 11.93 times faster than the sequential one.
Omar Abdelkaf et al. [8] proposed Parallel Iterative Tabu Search (PITS), which parallelizes an existing TS algorithm called Ro-TS on a grid of 5000 CPUs. PITS runs with 350 iterations inside each process, 100 global iterations, and 40 processes. It gives an average standard deviation of 12.19 in an average time of 13.01 minutes on problems of size 343.
Emrullah et al. presented the Parallel Simulated Annealing method with multi-start technique (PMSA) using GPU parallelism [8]. PMSA starts the next SA run from the best previously generated value rather than from a random permutation; this technique is called the multi-start approach. It provides the optimal solution for 196 instances, except for 14 instances, in less than 60 seconds.
Lopez et al. presented the GA-CPLS algorithm, which uses CPU-level parallelism [9]. CPLS operates on a group of nodes called explorers. GA-CPLS runs the genetic algorithm as the main explorer, a head node that generates the population, while the other explorer nodes execute the Extremal Optimization algorithm and robust tabu search. GA-CPLS achieves an APD of 0.054 in an average time of 82.7 minutes.
Seyda et al. improved a sequential hybrid GA, called IHGA [10]. It combines the genetic algorithm, the simulated annealing algorithm, and the greedy algorithm. Its solutions are 13.33, 7.94, 2.50, and 0.29 percent better than those of the greedy algorithm, DA, classical GA, and SA, respectively.

Soukaina et al. developed Hybrid Chicken Swarm Optimization (HCSO) [11]. HCSO applies GPU-level parallelism and integrates Chicken Swarm Optimization (CSO) with the Greedy Randomized Adaptive Search Procedure (GRASP). GRASP runs with a 2-opt local search to construct the initial population. HCSO finds the optimal solution for 85% of 30 QAP instances.
Mohamed et al. enhanced the Whale Optimization Algorithm by integrating it with Tabu Search (WAITS) [12]. WAITS was applied in a sequential environment, and it improves the speed of convergence and the local search inside the Whale Algorithm (WA). WAITS finds the optimal solutions for 86 out of 122 instances.
Previous studies explored many recent heuristic and metaheuristic algorithms for solving the QAP in either a parallel or a sequential environment. Parallelism can be designed at the CPU level or at the GPU level. As the reviewed algorithms show, parallel algorithms designed for the GPU produce better computational times, and algorithms such as the GA provide high-quality solutions in reasonable time. This motivated us to design a new GA for the GPU environment with a new crossover operator that is closer to what is found in nature, arranging genes in a specific way without the need for a crossover point, and with an intelligent mutation operator. The proposed method is implemented and tested in a sequential environment and then in parallel, to compare the results and show the degree of parallel improvement using benchmark instances available in QAPLIB [13].
The remainder of this paper is organized as follows. The next section presents the methodology, Section IV illustrates the overall structure of the proposed algorithm, Section V analyzes and discusses the results, and Section VI concludes.

A. Population Initialization Method
The population is initialized randomly according to the problem size. Each individual must be a complete and valid solution: every node appears in it, and it contains no redundant or invalid nodes.
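Random initialization with a validity check can be sketched as follows. This is a minimal Python sketch under stated assumptions: nodes are numbered 1 to n, and the function names `init_population` and `is_valid` are hypothetical. Shuffling a full list of nodes yields a complete permutation by construction, so the check acts as a safeguard rather than a repair step.

```python
import random


def is_valid(individual, n):
    """True if the individual contains every node 1..n exactly once."""
    return len(individual) == n and sorted(individual) == list(range(1, n + 1))


def init_population(pop_size, n, rng=None):
    """Create pop_size random permutations of the nodes 1..n."""
    rng = rng or random.Random()
    population = []
    for _ in range(pop_size):
        individual = list(range(1, n + 1))
        rng.shuffle(individual)  # in-place uniform shuffle
        population.append(individual)
    return population


population = init_population(pop_size=600, n=12, rng=random.Random(0))
assert all(is_valid(ind, 12) for ind in population)
```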

B. Selection Method
The proposed GA applies selection in two places. The first is parent selection, which uses the stochastic remainder selection method. It assigns every individual a probability of being chosen as a parent. The method divides each individual's fitness by the average fitness; the integer part of the quotient is the number of times the individual appears as a parent, and the fractional part is used to stochastically fill the remaining parent slots.
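Stochastic remainder selection can be sketched as follows. This is a minimal Python sketch, not the implementation used in the experiments. It assumes fitness values where larger is better (for the QAP, a cost would first have to be transformed, for example by subtracting it from the worst cost in the population), it fills the remaining slots by weighted sampling with replacement on the fractional parts, which is one common variant, and the function name is hypothetical.

```python
import random


def stochastic_remainder_selection(fitness, rng=None):
    """Return len(fitness) parent indices chosen by stochastic remainder selection."""
    rng = rng or random.Random()
    n = len(fitness)
    mean = sum(fitness) / n
    expected = [f / mean for f in fitness]  # expected number of copies

    selected = []
    for idx, e in enumerate(expected):
        selected.extend([idx] * int(e))     # deterministic part: integer copies

    remaining = n - len(selected)
    if remaining > 0:
        fractions = [e - int(e) for e in expected]
        # Stochastic part: fractional remainders act as sampling weights.
        selected.extend(rng.choices(range(n), weights=fractions, k=remaining))

    rng.shuffle(selected)                   # place the parents in random order
    return selected


# Fitness 4, 2, 1, 1 has mean 2, so the expected copies are 2, 1, 0.5, 0.5.
parents = stochastic_remainder_selection([4.0, 2.0, 1.0, 1.0], random.Random(1))
print(sorted(parents))
```

With these values, individual 0 always appears twice and individual 1 once, while the last slot goes to individual 2 or 3 with equal probability.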
The second is survivor selection, applied after the crossover operation, which decides whether the current parent stays for the next generation or is replaced by its best offspring. This type of survivor selection is called the steady-state approach.

C. Crossover Operator
In this paper, we propose a new crossover method that produces an individual that inherits as many of its parents' characteristics as possible. The method preserves the order of the nodes inherited from both parents without using a crossover point.
The following example illustrates the proposed crossover method using the flow matrix and distance matrix of the "Had12" benchmark. Suppose we have two parents of size 12, parent1 with cost = 1956 and parent2 with cost = 1936, as shown in Fig. 1 and Fig. 2, and the offspring is as shown in Fig. 3. There are two indexes: index1 = 0, which points to the first position in parent1, and index2 = size - 1 = 11, which points to the last position in parent2. Step 1: fill the offspring from these two indexes at the same time, as shown in Fig. 4.
Step 2: index1 = 1, index2 = 10. Before inserting a node, check whether it already exists in the offspring: if not, insert it; if it does, move to the next node in the corresponding parent. The offspring is as shown in Fig. 5: 4 is the second node in parent1, and 7 is the second node from the last in parent2. Increment index1 and decrement index2, so index1 = 2 and index2 = 9.
Step 3: index1 = 2, index2 = 9. Before inserting a node, check whether it already exists in the offspring: if not, insert it; if it does, move to the next node in the corresponding parent. The offspring is as shown in Fig. 6: 12 is the third node in parent1, and 8 is the third node from the last in parent2. Increment index1 and decrement index2, so index1 = 3 and index2 = 8.
Step 4: index1 = 3, index2 = 8. Before inserting a node, check whether it already exists in the offspring: if not, insert it; if it does, move to the next node in the corresponding parent. The offspring is as shown in Fig. 7: 6 is the fourth node in parent1. 5 is the fourth node from the last in parent2, but 5 already exists in the offspring, so move to the fifth node from the last in parent2, which is 1; since 1 does not exist in the offspring, insert it. Increment index1 and decrement index2, so index1 = 4 and index2 = 7.
Step 5: index1 = 4, index2 = 7. Before inserting a node, check whether it already exists in the offspring: if not, insert it; if it does, move to the next node in the corresponding parent. The offspring is as shown in Fig. 8: 10 is the fifth node in parent1. 10 is also the sixth node from the last in parent2, but 10 now exists in the offspring, so move to the seventh node from the last in parent2, which is 11; since 11 does not exist in the offspring, insert it. Increment index1 and decrement index2, so index1 = 5 and index2 = 6.
Step 6: index1 = 5, index2 = 6. Before inserting a node, check whether it already exists in the offspring: if not, insert it; if it does, move to the next node in the corresponding parent. The offspring is as shown in Fig. 9 (5 4 12 6 10 9 2 11 1 8 7 3): 9 is the sixth node in parent1. 4 is the eighth node from the last in parent2, but 4 exists in the offspring, so move to the ninth node from the last in parent2, which is 2; since 2 does not exist in the offspring, insert it.
The cost of the generated offspring is 1868, which is better than the cost of either parent. The crossover must be as simple as possible to get the maximum benefit from the GPU. The offspring is produced by a simple crossover, but it inherits many features from parents chosen by a strong selection method.

D. Mutation Operators
The proposed GA uses a new mutation operator that scans the individual to find the maximum product (flow * distance) between facility (i) and facility (i+1). It then swaps facility (i+1) with a random node of the individual.
The proposed mutation is illustrated by the following example. The individual shown in Fig. 10 belongs to the "Had12" benchmark and has cost = 1902. After applying the mutation operator, the cost becomes 1834, and the individual is as shown in Fig. 11.
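The mutation operator can be sketched as follows. This is a minimal Python sketch under stated assumptions: facilities and locations are 0-based, `individual[k]` is the facility placed at location k, the product for neighbors i and i+1 is `flow[individual[i]][individual[i + 1]] * dist[i][i + 1]`, and the random swap partner is drawn from all other positions. The text does not fix these details, and the function name and the 4x4 matrices are hypothetical.

```python
import random


def targeted_swap_mutation(individual, flow, dist, rng=None):
    """Find the adjacent pair with the largest flow * distance product and
    swap the second member of that pair with a randomly chosen other node."""
    rng = rng or random.Random()
    n = len(individual)
    worst = max(
        range(n - 1),
        key=lambda i: flow[individual[i]][individual[i + 1]] * dist[i][i + 1],
    )
    target = worst + 1
    other = rng.choice([k for k in range(n) if k != target])
    mutant = list(individual)        # leave the input individual unchanged
    mutant[target], mutant[other] = mutant[other], mutant[target]
    return mutant


# Toy instance: facilities 1 and 2 exchange the largest flow (9).
flow = [[0, 1, 2, 2],
        [1, 0, 9, 2],
        [2, 9, 0, 1],
        [2, 2, 1, 0]]
dist = [[0, 1, 2, 3],
        [1, 0, 1, 2],
        [2, 1, 0, 1],
        [3, 2, 1, 0]]
print(targeted_swap_mutation([0, 1, 2, 3], flow, dist, random.Random(0)))
```

In this toy instance the pair at positions 1 and 2 has the largest product, so the node at position 2 is always the one moved; only its swap partner is random.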

IV. STRUCTURE OF THE PROPOSED PARALLEL GENETIC ALGORITHM
The following algorithm shows the general structure of the proposed Parallel Genetic Algorithm (PGA), followed by a system diagram of the PGA structure, as shown in Fig. 12.
The PGA exploits the graphics processing unit (GPU) for non-graphical parallel computation. The proposed algorithm uses a single large population of individuals that is distributed among several GPU threads. Each thread performs three GA operators, crossover, mutation, and survivor selection, because they are suitable for a parallel environment, as shown in Fig. 12. As a result, threads do not need to communicate with each other, lock other threads, or wait for other threads to unlock. This parallelism technique maintains data integrity and consistency, and the threads' waiting time is almost non-existent. The proposed algorithm is shown in Table I.
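The lock-free structure described above can be sketched as follows. This is a minimal Python sketch of the data-independence pattern only, not the CUDA kernel: Python threads stand in for GPU threads and do not provide a comparable speedup. Each worker reads the shared population and returns the survivor for its own slot, so no worker writes data that another worker reads. The names `evolve_slot` and `next_generation`, the pre-selected `mates` list, and the stand-in operators in the usage lines are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def evolve_slot(k, population, mates, cost, crossover, mutate):
    """Work of one thread: crossover, mutation, then survivor selection."""
    parent = population[k]
    child = mutate(crossover(parent, mates[k]))
    # Steady-state survivor selection: keep the parent unless the child is better.
    return child if cost(child) < cost(parent) else parent


def next_generation(population, mates, cost, crossover, mutate, workers=8):
    """Evolve every slot independently; results come back in slot order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda k: evolve_slot(k, population, mates, cost, crossover, mutate),
            range(len(population)),
        ))


# Stand-in operators, only to make the sketch runnable.
toy_cost = lambda p: sum(i * v for i, v in enumerate(p))
take_mate = lambda parent, mate: list(mate)
no_mutation = lambda p: p

population = [[2, 1, 0], [0, 1, 2]]
mates = [[0, 1, 2], [2, 1, 0]]
print(next_generation(population, mates, toy_cost, take_mate, no_mutation))
# [[2, 1, 0], [2, 1, 0]]
```

In the first slot the parent survives because its cost (1) is lower than the child's (5); in the second slot the child replaces the parent.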

V. RESULT AND DISCUSSION
This section analyzes and discusses the results of the proposed method in two ways. The first analysis presents six tables that show the elapsed time on the GPU and the CPU during the execution of the proposed crossover, mutation, and selection. The second analysis tests the proposed method on the GPU and the CPU after embedding it in several iterations (generations).
The proposed method was tested on an Intel® Core™ i7-8565U CPU @ 1.80 GHz (8 CPUs) and an NVIDIA GeForce MX250 GPU, using both CUDA (Compute Unified Device Architecture) and C++.
The following four figures compare the CPU and the GPU on common QAP benchmarks, where N denotes the population size and the CPU and GPU times are measured in milliseconds.

A. First Test Illustration
This test shows the elapsed time on the GPU and the CPU during the execution of the proposed crossover, mutation, and selection.
1) For the "lipa20a" benchmark: The problem size is 20. When the population size is 100, 200, 300, or 400, the CPU shows better results than the GPU. Here the problem size is small, and the GPU improvement only appears when the problem size and the population size increase. After increasing the population size to 600, the GPU improvement becomes visible, and the CPU time is approximately twice the GPU time. Fig. 13 shows a graphical representation of this problem.
2) For the "lipa30a" benchmark: The improvement on the GPU begins at N=300, after which the CPU time grows at an increasing rate as the population size increases. In comparison, the GPU time does not grow significantly as the population size increases; it grows at a small rate of ≈ 0.56 milliseconds, as shown in Figure IV.1. The CPU time keeps increasing until it reaches more than 2x the GPU time at N=600. Fig. 13 shows a graphical representation of this problem.
3) For the "lipa40a" benchmark: The improvement on the GPU starts at N=300, after which the CPU time increases with the population size until it reaches nearly 2x the GPU time at N=600. On the other hand, the GPU time grows at a small rate of ≈ 0.58 as the population size increases. Fig. 13 shows a graphical representation of this problem.
4) For the "lipa50a" benchmark: The result is the same as for "lipa40a". The CPU looks better than the GPU when N=300, but it becomes worse when N>300. The CPU time is around 2x the GPU time at N=600. As noted earlier, the GPU time is not affected much by the population increase, as shown in Fig. 13.
5) For the "lipa60a" and "lipa70a" benchmarks: The CPU looks worse than the GPU when the population size is greater than 200. The CPU time is around 2.3x the GPU time at problem size 60 and N=600, and around 2.5x the GPU time at problem size 70 and N=600, as shown in Fig. 13.
6) Fig. 14 shows the worst case for the CPU time, at population size 600: it is around 2x the GPU time, and it increases at a significant rate when the problem size increases.

B. Second Test
Table II presents a test set of common QAP benchmarks with sizes in the range 12 to 70 and population size N = 600. After embedding the proposed operators in the complete Genetic Algorithm program, we report the elapsed CPU and GPU time after 2000 generations and the best solution found on both the CPU and the GPU. We also report the Average Percent Deviation (APD), which measures how close the solution is to the best known solution (BKS), as shown in Equation 2.
$$\mathrm{APD} = \frac{\text{cost of the best solution found} - \mathrm{BKS}}{\mathrm{BKS}} \times 100 \qquad (2)$$
The proposed crossover and mutation find the optimal solution for the first ten benchmarks, with problem sizes from 12 to 32, in both the sequential Genetic Algorithm and the Parallel Genetic Algorithm. The next ten benchmarks yield a solution close to the optimal solution with a small error rate. The CPU and GPU times are measured in milliseconds.
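The deviation measure can be computed as in the following minimal Python sketch. It assumes the usual percent-deviation definition relative to the BKS and, for several runs, a plain average of the per-run deviations; the function names and the sample numbers are illustrative only and are not results from Table II.

```python
def percent_deviation(cost, bks):
    """Deviation of one solution cost from the best known solution, in percent."""
    return 100.0 * (cost - bks) / bks


def average_percent_deviation(costs, bks):
    """Mean percent deviation over several runs on the same instance."""
    return sum(percent_deviation(c, bks) for c in costs) / len(costs)


print(percent_deviation(1100, 1000))                  # 10.0
print(average_percent_deviation([1000, 1100], 1000))  # 5.0
```

A value of 0 means that the best known solution was reached.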

VI. CONCLUSION
This study found that, compared to the CPU time, the GPU time is not affected much by increasing either the population size or the problem size. The GPU time grows at a small rate of ≈ 0.61 when the population size increases and at a small rate of ≈ 4.5 when the problem size increases. The CPU time is at its worst when the population size is 600: it is around 2x the GPU time, and it increases at a significant rate when the problem size increases. This paper concentrates on applying the proposed GA to the QAP, where it succeeds in finding optimal or near-optimal solutions.
This paper also applies the proposed GA in a parallel environment, where it shows a good improvement in execution time. As mentioned before, the proposed solution uses a large population size; therefore, many synchronous threads are needed. As future work, the number of threads can be increased by using a larger NVIDIA graphics card. Furthermore, the proposed PGA can be generalized to other optimization problems such as the TSP (Traveling Salesman Problem) or the VRP (Vehicle Routing Problem).
As further future work, we will address the drawbacks that slow the algorithm down, such as the sequential parent selection, which needs to be converted to work in a parallel environment.