A Parallel Simulated Annealing Algorithm for Weapon-Target Assignment Problem

Weapon-target assignment (WTA) is a combinatorial optimization problem known to be NP-complete. The WTA aims to find the best assignment of weapons to targets so as to minimize the total expected value of the surviving targets. Exact methods can solve only small-size problems in a reasonable time. Although many heuristic methods have been studied for the WTA in the literature, few parallel methods have been proposed. This paper presents a parallel simulated annealing algorithm (PSA) to solve the WTA. The PSA runs on a GPU using the CUDA platform, and a multi-start technique is used in the PSA to improve the quality of solutions. 12 randomly generated problem instances (up to 200 weapons and 200 targets) are used to test the effectiveness of the PSA. Computational experiments show that the PSA outperforms the SA on average and runs up to 250x faster than a single-core CPU.

Keywords—Weapon-Target Assignment; Multi-start Simulated Annealing; Combinatorial Optimization; Parallel Algorithms; GPU


I. INTRODUCTION
The Weapon-Target Assignment (WTA) problem is an NP-complete combinatorial optimization problem in the field of military operations research [1]. The WTA problem aims to find the best assignment of weapons to targets in order to minimize the expected damage to the defended area and thus increase its chances of survival. Several exact methods have been studied in the literature [2][3][4], but these methods can solve only small-size problems. Thus, heuristic methods such as Simulated Annealing [5,6], Genetic Algorithms [6,7], Tabu Search [6], Variable Neighborhood Search [3,6], Ant Colony Optimization [7][8][9] and Particle Swarm Optimization [10] have been proposed for the WTA.
Simulated Annealing (SA) is an efficient algorithm for solving the WTA problem [5,6], and it is flexible enough to apply to many problems like the WTA. On the other hand, each iteration of the SA depends on the previous one, so the runtime of the SA is not as good as that of other heuristic methods. Parallelizing the SA is one way to overcome this problem.
Nowadays, GPUs are a very efficient hardware platform for developing parallel algorithms. Several parallel implementations of the SA on GPUs have been presented in the literature [11][12], and these methods have achieved good-quality results in their application areas. In this paper, a parallel SA algorithm (PSA) is developed to solve the WTA problem. The PSA runs on a GPU and uses the multi-start technique to obtain better results. This paper is organized as follows. In Section II, the definition and mathematical formulation of the WTA problem are introduced. The SA algorithm is described in Section III. Section IV gives details about the PSA. Computational experiments and results are presented in Section V, and finally Section VI states some conclusions.

II. THE WTA PROBLEM
In the WTA problem, the defense wants to destroy the targets directed at it by the offense, and has a finite number of weapons to defend against the incoming threats. There are two models of the WTA problem, static and dynamic. In this paper, the static WTA problem has been studied. In the static model, all inputs of the problem are fixed and the assignments of weapons to targets are performed in a single step. The expected damage is evaluated after all weapon-target engagements have been completed. The parameters and variables of the problem are defined as follows:

- n, the number of targets (1, 2, ..., n),
- m, the number of weapons (1, 2, ..., m),
- v_i, the value of target i,
- p_ij, the probability of destroying target i by assigning weapon j to it.

The survival probability of target i when weapon j attacks it is

q_ij = 1 - p_ij.   (1)

The problem can be formulated as follows:

min f = \sum_{i=1}^{n} v_i \prod_{j=1}^{m} (q_{ij})^{x_{ij}}   (2)

subject to

\sum_{i=1}^{n} x_{ij} = 1, j = 1, ..., m, x_{ij} \in \{0, 1\},   (3)

where x_ij = 1 if weapon j is assigned to target i; that is, all weapons must be assigned to targets. In this paper, it is assumed that the number of targets equals the number of weapons (n = m) and only one weapon can be assigned to each target, so a feasible solution is a permutation s of the weapons, where s(i) is the weapon assigned to target i.
III. SIMULATED ANNEALING ALGORITHM FOR THE WTA

Simulated Annealing (SA) is a heuristic algorithm for finding the optimum or a near-optimum of a given function over a large search space. Kirkpatrick et al. [13] introduced the SA in 1983, inspired by the annealing process in metallurgy. In the SA algorithm, each step generates a random solution from the current solution, which was obtained in the previous step. Acceptance of the new solution depends on the parameters of the method and on the difference between the neighboring solutions. The Metropolis criterion [14] and the Boltzmann distribution are used for the acceptance of the new solution; they also ensure that the SA does not get stuck in a local minimum or maximum. The acceptance probability function is

P(Δf) = 1 if Δf < 0, and P(Δf) = exp(-Δf / T) otherwise,   (4)

where T is the temperature at each step, decreased by the cooling factor α (0 < α < 1) at each step, and P is the acceptance probability of the candidate solution in the annealing process. A new candidate solution s_new is found by randomly selecting two positions q and r and swapping the weapon assignments between them. Δf is the difference between the two neighboring solutions in the given function,

Δf = f(s_new) - f(s),   (5)

and after the swapping operation the new candidate solution value is calculated as

f_new = f + Δf.   (6)

When T reaches a temperature T_final, which is set as a parameter by the user, the method terminates; time-dependent and iteration-count-dependent termination criteria are also used. In the pseudocode of the SA, the function rnd(s) returns a random neighbor of the permutation s generated by a simple local-search move such as a swap or 2-opt, and the function rnd(0,1) returns a random number between 0 and 1.
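The swap move and Metropolis acceptance described above can be sketched in C++; this is an illustrative CPU sketch (names are assumptions, not the paper's implementation) that recomputes only the two terms of the objective a swap actually changes:

```cpp
#include <cmath>
#include <random>
#include <vector>

// Change in the objective when targets q and r exchange their weapons.
// Only the two affected terms of f(s) need to be recomputed.
double swap_delta(const std::vector<double>& v,
                  const std::vector<std::vector<double>>& p,
                  const std::vector<int>& s, int q, int r) {
    double before = v[q] * (1.0 - p[q][s[q]]) + v[r] * (1.0 - p[r][s[r]]);
    double after  = v[q] * (1.0 - p[q][s[r]]) + v[r] * (1.0 - p[r][s[q]]);
    return after - before;
}

// Metropolis acceptance: always accept an improving move; accept a
// worsening move with probability exp(-delta / T).
bool accept(double delta, double T, std::mt19937& gen) {
    if (delta < 0.0) return true;
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return std::exp(-delta / T) > u(gen);
}
```

Computing the delta from the two affected targets only makes each SA step O(1) instead of O(n), which matters for the larger instances.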
The main steps that compose the SA algorithm for the WTA problem are described below.
Stage 1: Initialization
1) Inputs: the probability-of-destroying matrix p, the values of the targets v, and the permutation array s giving the weapon assigned to each target.
2) Set the SA parameters: T, T_final and α.
3) Solve the WTA using (2) to obtain an initial solution value f from a randomly generated permutation array s.

Stage 2: The SA Execution
4) Generate two different random indices q and r for s and swap s[q] and s[r] to obtain s_new.
5) Calculate Δf using (5).
6) Accept or reject s_new using P(Δf).
7) If s_new is accepted, set s = s_new and go to Step 8; otherwise go to Step 9.
8) Calculate f_new using (6) and set f = f_new.
9) Set f_best = f and s_best = s if f < f_best.
10) Set T = α · T.
11) Repeat Step 4 - Step 11 until T reaches T_final.
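The two stages above can be sketched end-to-end as a CPU reference implementation; this is an illustrative sketch under the paper's assumptions (swap neighborhood, geometric cooling), with hypothetical names:

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

struct SAResult { std::vector<int> s_best; double f_best; };

// Stage 1 + Stage 2 of the SA for the WTA: start from a random
// permutation, repeatedly swap two assignments, accept by the
// Metropolis criterion, and cool T geometrically until T_final.
SAResult simulated_annealing(const std::vector<double>& v,
                             const std::vector<std::vector<double>>& p,
                             double T, double T_final, double alpha,
                             unsigned seed) {
    std::mt19937 gen(seed);
    const int n = static_cast<int>(v.size());

    // Stage 1: random initial permutation and its objective value (2).
    std::vector<int> s(n);
    std::iota(s.begin(), s.end(), 0);
    std::shuffle(s.begin(), s.end(), gen);
    auto eval = [&](const std::vector<int>& a) {
        double f = 0.0;
        for (int i = 0; i < n; ++i) f += v[i] * (1.0 - p[i][a[i]]);
        return f;
    };
    double f = eval(s);
    SAResult best{s, f};

    // Stage 2: annealing loop (Steps 4-11).
    std::uniform_int_distribution<int> idx(0, n - 1);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    while (T > T_final) {
        int q = idx(gen), r = idx(gen);
        if (q == r) { T *= alpha; continue; }
        // Delta over the two affected targets only (5).
        double before = v[q] * (1.0 - p[q][s[q]]) + v[r] * (1.0 - p[r][s[r]]);
        double after  = v[q] * (1.0 - p[q][s[r]]) + v[r] * (1.0 - p[r][s[q]]);
        double delta = after - before;
        if (delta < 0.0 || std::exp(-delta / T) > u(gen)) {  // (4)
            std::swap(s[q], s[r]);
            f += delta;                                       // (6)
            if (f < best.f_best) best = {s, f};
        }
        T *= alpha;                                           // Step 10
    }
    return best;
}
```

On a 2x2 toy instance this sketch quickly settles on the assignment with the lowest surviving value; it is meant only to make the control flow of Stages 1-2 concrete.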
In the above stages, Stage 2 performs the SA algorithm after the initialization of the required variables and parameters in Stage 1. Stage 2 searches for a new solution by swapping the weapon assignments of two targets (see Fig. 1).

IV. PARALLELIZATION ON GPU

The implementation of the PSA has been performed using the Compute Unified Device Architecture (CUDA) on Graphics Processing Units (GPUs). CUDA is a C/C++ language extension and a parallel computing platform created by NVIDIA Corporation [15]. The CUDA platform is also a tool for General-Purpose Computing on Graphics Processing Units (GPGPU), a parallel processing methodology that uses GPUs for high-performance computing.
[Fig. 1. The swapping operation: before, Weapons 1-5 are assigned to Targets 1-5 in order; after, Weapon 2 and Weapon 4 are swapped between Target 2 and Target 4.]

The technique of restarting a heuristic algorithm with different configurations is called multi-start, and it is an effective method to improve the quality of solutions for optimization problems [16]. This technique has also been used with the SA and has proved its effectiveness [17,18]. On a single-core CPU, only one heuristic run can execute at a time, so the method must be restarted sequentially to apply the multi-start technique. On multi-core CPUs and many-core GPUs, on the other hand, several runs can execute at the same time. In this paper, the PSA has been implemented with a multi-start technique.
In the PSA, every thread on the GPU starts with a different s. cuRAND, a pseudorandom number generator library provided by the CUDA platform, is used to implement the multi-start technique: all threads have different seeds, and different seeds are guaranteed to produce different sequences. Each thread independently runs the SA following the steps given in Stage 2 of Section III. After each thread has run the SA, the threads in the same block communicate with each other through shared memory, and the best fitness value of each block is found in parallel using the reduction method. The flowchart of the PSA handled by each thread is shown in Fig. 2.
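The per-thread multi-start scheme (every worker seeded differently, best result kept) can be imitated on the CPU with std::thread as a rough analogue of the CUDA/cuRAND design; all names here are illustrative, and `search` stands in for one full SA run:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Multi-start pattern: each worker runs the same stochastic search with
// a distinct seed (as cuRAND provides one per GPU thread) and the best
// (minimum) result wins. `search(seed)` returns the run's best f value.
template <typename Search>
double multi_start(Search search, unsigned n_workers, unsigned base_seed) {
    std::vector<double> results(n_workers);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_workers; ++t)
        pool.emplace_back([&results, &search, base_seed, t] {
            results[t] = search(base_seed + t);   // distinct seed per worker
        });
    for (auto& th : pool) th.join();
    return *std::min_element(results.begin(), results.end());
}
```

On the GPU the "workers" are thousands of CUDA threads and the min-selection is done per block with a shared-memory reduction rather than a single pass over a vector.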
After the best fitness value of each block of the GPU has been found, it is transferred to the CPU; these operations are shown just before the Stop process in Fig. 2. Finding the best fitness value is performed in parallel using the reduction method, in which half of the threads of a block are active at each step. The transfer of the best fitness value of each block is performed by the first thread of the block; in other words, only the first thread of each block is active for the transfer. After that, the overall best fitness value is found on the CPU from the best values of all blocks.
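The reduction just described (half of the threads active at each step, until slot 0 holds the block's best value) can be sketched sequentially; this CPU sketch mirrors the access pattern of a shared-memory tree reduction, with illustrative names:

```cpp
#include <vector>

// Tree reduction as performed in shared memory on the GPU: at each step
// the first `stride` (logical) threads are active, each comparing its
// slot with the one `stride` positions away; after log2(n) steps slot 0
// holds the block minimum, which the first thread then writes out.
double block_min_reduce(std::vector<double> vals) {
    // Assumes vals.size() is a power of two (1024 threads per block here).
    for (std::size_t stride = vals.size() / 2; stride > 0; stride /= 2)
        for (std::size_t tid = 0; tid < stride; ++tid)   // "active threads"
            if (vals[tid + stride] < vals[tid])
                vals[tid] = vals[tid + stride];
    return vals[0];   // the first thread's slot
}
```

On the GPU the inner loop runs concurrently across the threads of a block, with a barrier (__syncthreads) between strides instead of the sequential loop shown here.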
In this paper, 1024 threads per block have been used on the GPU, which means all threads of a block are used. When the number of blocks is increased, the runtime of the PSA also increases; the reason for this increase is that many threads access global memory at the same time. Several configurations have been applied to optimize the runtime of the PSA. These are given below.
• short gives better performance than int, so short variables are used instead of int variables where suitable.
• --use_fast_math is a compiler parameter that can be chosen for faster mathematical functions.
• Accessing global memory is very slow on the CUDA platform. Shared memory is used for v (the values of the targets) to read/write much faster. Global memory must be used for p (the matrix storing the probabilities of destroying targets) because of the limited size of shared memory.
Each thread performs the swapping operation on its own permutation list; in this context, each thread requires its own copy of the assignment list s and of the current solution value f.

[Fig. 2. Flowchart of the SA handled by each thread: generate two random indices (q, r) for the assignment list s; calculate Δf(q,r); if Δf(q,r) < 0 or exp(-Δf(q,r)/T) > rnd(0,1), perform the swap between q and r and update f(π) by f(π) = f(π) + Δf; cool T and repeat until T < T_final.]

V. COMPUTATIONAL EXPERIMENTS

There are no benchmark datasets/instances in the literature for the WTA problem; that is why various scenarios have been created to test the performance of proposed methods. In this paper, computational tests have been carried out on 12 problem instances of different dimensions (available at: http://web.karabuk.edu.tr/emrullahsonuc/wta). The values of the targets are generated as random numbers from the uniform distribution in the range 25-100. The probabilities of destroying targets for weapon-target assignments are generated as random numbers from the uniform distribution in the range 0.60-0.90. Problem instances are generated with different dimensions in the range 5-200. The dimensions of the problem instances (WTA1-WTA12) are shown in Table I. The best, the worst, the mean, the median and the standard deviation (SD) of the results are listed in Table II for all problem instances. An NVIDIA GeForce GTX Titan X (3072 cores, 1.0 GHz) has been used for running the PSA. The results of the PSA for each problem instance are always the same, because the random number sequence generated by each thread is identical at every run. The PSA has been run on the GPU three times for each problem instance to obtain the average runtime, and 24 blocks have been used on the GPU for the performance comparison with the CPU. Table III presents the experimental results of the PSA on the problem instances. It shows that the SA has better accuracy on 4 of the 12 problems according to the best results; considering the mean results, however, the SA is never more accurate than the PSA. The PSA has better accuracy on 3 of the 12 problems, and the SA and the PSA have the same accuracy on 5 of the 12 problems; the results may be the optimum fitness values for these 5 problem instances.
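The instance-generation procedure described above can be sketched as follows; this is an assumed reconstruction for illustration (the actual instances are published at the URL given above), with hypothetical names:

```cpp
#include <random>
#include <vector>

struct Instance {
    std::vector<double> v;               // target values, uniform in [25, 100]
    std::vector<std::vector<double>> p;  // kill probabilities, uniform in [0.60, 0.90]
};

// Random WTA instance of dimension n (n targets, n weapons), following
// the uniform ranges described above; the seed makes runs reproducible.
Instance make_instance(int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uv(25.0, 100.0);
    std::uniform_real_distribution<double> up(0.60, 0.90);
    Instance inst;
    inst.v.resize(n);
    inst.p.assign(n, std::vector<double>(n));
    for (int i = 0; i < n; ++i) {
        inst.v[i] = uv(gen);
        for (int j = 0; j < n; ++j) inst.p[i][j] = up(gen);
    }
    return inst;
}
```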
For each problem instance, the runtime results in seconds and the speedups are given in Table IV. According to the runtime results of the SA, the runtimes are close to each other across instances, and the average time is 2985.92 seconds; the average runtime of the PSA is 19.28 seconds. The speedups are also shown in Fig. 3. The average speedup over the 12 problem instances is 155x, an acceleration that makes the SA algorithm considerably more efficient.
The reason why the speedup values do not increase linearly is the access pattern on global memory. Accessing global memory is efficient when the accesses are coalesced, and in the PSA there is no coalesced access in the process of writing the best result of each block to global memory. Thus, the speedups fluctuate across dimensions (see Fig. 3). The best speedup is 250x, for the WTA6 problem instance, and the worst speedup is 92x, for the WTA2 problem instance. In the PSA, increasing the number of blocks causes more accesses to global memory and, as mentioned before, also increases the runtime. On the other hand, running the PSA with more threads means more multi-start configurations, and more multi-starts increase the chance of improving the quality of the results. The results obtained using 24, 48, 96, 128, 512 and 1024 blocks are shown in Table V, with the best results in bold. When the number of blocks is increased, the quality of the results improves. The results of the first five problem instances (WTA1-WTA5) are the same for all runs; when 1024 blocks are used, all results except those of the first five problem instances are improved. Furthermore, with 1024 blocks the runtime of the PSA is still less than half that of the SA.

VI. CONCLUSIONS
In this paper, the PSA has been proposed to solve the WTA problem. The multi-start technique is used in the PSA to obtain better results on the problem instances, and the PSA runs on a GPU using the CUDA platform. The results are compared in terms of both quality and acceleration: the PSA is up to 250x faster than a single-core CPU and, in terms of solution quality, delivers better results than the SA on average over the problem instances. In the future, the PSA can be optimized for coalesced access to improve the runtime. Also, the PSA can be applied to the dynamic WTA problem, the other model of the WTA.