Hybrid Algorithm for the Optimization of Training Convolutional Neural Network

Abstract—Training optimization and fast, efficient classification are vital elements in the development of a convolutional neural network (CNN). Although stochastic gradient descent (SGD) is the prevalent algorithm used for optimizing CNN training, it has considerable limitations. In this paper, we endeavor to mitigate the drawbacks inherent in SGD by proposing an alternative algorithm for CNN training optimization. A hybrid of the genetic algorithm (GA) and particle swarm optimization (PSO) is deployed in this work. In addition to SGD, PSO and the genetic algorithm (PSO-GA) are incorporated as a combined, efficient mechanism for achieving nontrivial solutions. The proposed unified method achieves state-of-the-art classification results on challenging benchmark datasets such as MNIST, CIFAR-10, and SVHN. Experimental results show that it outperforms most contemporary approaches.


I. INTRODUCTION
The Convolutional Neural Network (CNN) algorithm has been widely applied in many applications, including face recognition [1,2], image classification and recognition [3][4][5][6], and object detection [7]. In supervised learning, the Back Propagation (BP) algorithm is the prevalent, constituent method used for CNN training and parameter tuning, and it is used in virtually all CNN training implementations.
However, there are a number of disadvantages to using the back propagation algorithm alone. For example, the BP algorithm is prone to getting stuck in local optima, making it hard to reach the global optimum, especially when a large search space must be explored for the optimal solution. The algorithm is also slow and hardly benefits from modern hardware such as Graphics Processing Units (GPUs), which run hundreds to thousands of threads simultaneously. The complex computations in the algorithm demand a hard and complicated series of steps to derive the equations for updating the weight parameters. Finally, the back propagation algorithm requires intermediate variables to preserve the validity of the data: both the forward activations and the backward gradients must be kept in memory for the update equations.
To tackle the aforementioned limitations of the BP algorithm, in this paper an alternative algorithm is proposed for CNN training. In particular, the Particle Swarm Optimization (PSO) algorithm is introduced for training, and it is combined with Stochastic Gradient Descent (SGD) to achieve better results. The proposed computational algorithm strives to avoid getting trapped in local optima, is fully parallel, and relies on simple equations for CNN training. It is completely adaptable because it does not require any changes in the CNN structure when network layers are added or removed. The PSO equations used for training the weights are fully parallelizable, as described in (1) and (2) and shown in Fig. 2. This means that the weights of any layer can be updated without the backward phase required by SGD, so GPUs can be fully utilized by this implementation. The proposed method also improves training by overcoming the premature saturation and sluggishness caused by SGD.
The rest of the paper is organized as follows: in Section II, related works are introduced. In Section III, a brief introduction to PSO is presented. In Section IV, the proposed approach is introduced in detail. In Section V, the model architecture of the CNN is illustrated. Sections VI and VII present the benchmark datasets used for model evaluation and the conclusion, respectively.

II. RELATED WORK
Recently, a vast number of studies have been proposed for image recognition using different methods. Generally, image recognition can be approached in several ways: Pedro F. Felzenszwalb et al. [8] proposed a method for image recognition using Deformable Part Models (DPM), and further works employ different DPM strategies, as demonstrated in [9,10,11]. A variety of other methods are used for image classification, such as SVM [12,13,14,15], boosting [16], and spatial pyramid matching [17]. However, the most dominant recent results have been achieved using Convolutional Neural Networks (CNN). The latter are widely used in a variety of applications such as image recognition [31,18,22,19,20], object detection [20,21,23,24], scene labeling [25], segmentation [26,27], and a variety of other tasks [28,29,30]. All of the above-mentioned works use Stochastic Gradient Descent (SGD). In this work, however, that algorithm is replaced by PSO; in addition, a hybrid training algorithm of both PSO and SGD is used.

III. PARTICLE SWARM OPTIMIZATION (PSO)
PSO is an evolutionary stochastic optimization algorithm introduced by Eberhart and Kennedy [32,33,34]. Particles are randomly initialized and periodically updated to produce a new population with new fitness values. Each particle updates its position contingent on its own history and the history of the best particle. Thus particle movement exploits two values: the first is the local best, which characterizes the best value found so far by the particle itself, and the second is the global best, which denotes the best value achieved so far by any particle within the swarm. At each time step, a particle moves toward its new best position by altering another parameter termed the velocity. The following equations are used to tune the CNN parameters:
$$v_i^{t+1} = v_i^t + c_1 r_1 \left(p_i - x_i^t\right) + c_2 r_2 \left(g - x_i^t\right) \qquad (1)$$

$$x_i^{t+1} = x_i^t + v_i^{t+1} \qquad (2)$$

where $v_i^t$ and $x_i^t$ denote the velocity and position of particle $i$ at time $t$, respectively; $c_1$ and $c_2$ are accelerating factors; $r_1$ and $r_2$ are random numbers in $[0,1]$; $p_i$ is the best position found so far by particle $i$; and $g$ is the best position achieved by any particle in the whole swarm. It is obvious that the PSO equations are unpretentious and have very few parameters to adjust.
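As an illustration, the following is a minimal NumPy sketch of the update in (1) and (2); the acceleration factors c1 = c2 = 2.0 and the function and variable names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, c1=2.0, c2=2.0, rng=np.random.default_rng()):
    """One PSO update per Eqs. (1) and (2).

    x, v, pbest: arrays of shape (n_particles, n_params);
    gbest: array of shape (n_params,), broadcast over the swarm.
    """
    r1 = rng.random(x.shape)  # fresh random factors in [0, 1] per element
    r2 = rng.random(x.shape)
    v = v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Eq. (1)
    x = x + v                                              # Eq. (2)
    return x, v
```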
PSO converges remarkably well in the initial stages, but it quickly becomes trapped in local optima. In addition, PSO has difficulty escaping a local optimum when the search space contains only a single optimal solution [35]. PSO predominantly experiences premature convergence and, as training progresses, searches only the region adjacent to the global minimum [36,37], leaving PSO permanently trapped in a local optimum region. Therefore, PSO is combined with the Genetic Algorithm (GA), an evolutionary algorithm widely used to solve problems in various fields [38][39][40][41]. The GA defines an initial generation that searches the domain space of the problem and generates a new population based on the mechanisms of reproduction, crossover, and mutation, which are applied repeatedly to produce new offspring. Usually, the new descendants have higher quality and better fitness than their ancestors. The GA enhances PSO by merging particles in an intelligent way to produce new generations. Combining GA and PSO crucially leverages the proposed hybrid training method by sharing information among particles, increasing the diversity of the search space, keeping the training active throughout the computation, and finally preventing PSO from falling into a local optimum. To sustain a smooth transition for the hybrid training along the computation steps, the Genetic Algorithm is applied to PSO whenever one of the following factors occurs: (i) premature convergence, (ii) no progress in the fitness function, or (iii) the error remaining steady for two to three consecutive steps. A sketch of such a GA refresh step is given below.
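This is a minimal sketch of one way the GA operators could regenerate a stagnant swarm; the half-swarm selection, uniform crossover, and Gaussian mutation with rate 0.01 are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def ga_refresh(positions, errors, mutation_rate=0.01, sigma=0.1,
               rng=np.random.default_rng()):
    """Breed a new swarm from the fitter half via crossover and mutation."""
    n, d = positions.shape
    order = np.argsort(errors)            # lower error = higher fitness
    parents = positions[order[: n // 2]]  # keep the fitter half as parents
    children = []
    while len(children) < n:
        a, b = parents[rng.integers(len(parents), size=2)]
        mask = rng.random(d) < 0.5        # uniform crossover of two parents
        child = np.where(mask, a, b)
        mutate = rng.random(d) < mutation_rate
        child = child + mutate * rng.normal(0.0, sigma, d)  # Gaussian mutation
        children.append(child)
    return np.stack(children[:n])
```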

IV. PROPOSED OPTIMIZATION ALGORITHM
Since SGD converges slowly and cannot be fully parallelized to take advantage of GPUs, in this paper a robust hybrid training algorithm is proposed for CNN training. The algorithm combines both PSO and SGD, is called PSO-SGD, and is a highly parallel method. In this approach, the unified PSO and SGD algorithm is expected to achieve superior results and surpass previous methods because it preserves the gains of using SGD while recruiting PSO as a revival component. For instance, instead of running one particle, which would characterize the whole set of CNN parameters, a plurality of particles is used and scattered over the search space. All particles collaborate with each other using the method elucidated in the next sections. The proposed training algorithm is divided into two phases. In the first phase, the CNN parameters are initialized and trained using PSO. Then, when PSO progress decelerates, the SGD algorithm is applied for a few iterations. After a few iterations, the process switches back to PSO, and so on. In addition, PSO is consolidated by the Genetic Algorithm (GA), which is exploited to stimulate the particles and overcome SGD lethargy. Moreover, unlike standard PSO, which requires a long time to reach the promising region, the hybrid PSO provides fast and enhanced optimization [33,34]. The algorithm thereby endeavors to keep the CNN training active for the whole training period. Algorithm 1 shows CNN training using the proposed hybrid method, and Algorithm 2 describes PSO alone. A minimal sketch of the alternating loop is shown below.
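The sketch below illustrates the alternating PSO/SGD/GA loop described above; the stagnation test, the `evaluate`, `sgd_refine`, and `ga_refresh` callables, and the patience of three steps are illustrative assumptions consistent with the triggers listed in Section III, not the paper's exact Algorithm 1.

```python
import numpy as np

def train_hybrid(swarm, velocities, evaluate, sgd_refine, ga_refresh,
                 pso_step, n_iters=100, patience=3, sgd_iters=5):
    """Alternate PSO with short SGD/GA bursts when progress stalls.

    evaluate(positions) -> per-particle error, shape (n_particles,);
    sgd_refine(params, k) -> params after k SGD iterations;
    ga_refresh(positions, errors) -> regenerated swarm;
    pso_step -> one velocity/position update per Eqs. (1) and (2).
    """
    pbest, pbest_err = swarm.copy(), evaluate(swarm)
    best_so_far, stall = pbest_err.min(), 0
    for _ in range(n_iters):
        gbest = pbest[np.argmin(pbest_err)]          # Eq. (8): best particle
        swarm, velocities = pso_step(swarm, velocities, pbest, gbest)
        err = evaluate(swarm)
        improved = err < pbest_err                   # update local bests
        pbest[improved], pbest_err[improved] = swarm[improved], err[improved]
        # Count iterations without a new global best (triggers (ii)/(iii)).
        if pbest_err.min() < best_so_far:
            best_so_far, stall = pbest_err.min(), 0
        else:
            stall += 1
        if stall >= patience:
            b = int(np.argmin(pbest_err))
            pbest[b] = sgd_refine(pbest[b], sgd_iters)  # short SGD burst
            pbest_err[b] = evaluate(pbest[b][None])[0]
            swarm = ga_refresh(swarm, err)              # GA revival step
            best_so_far, stall = min(best_so_far, pbest_err[b]), 0
    return pbest[np.argmin(pbest_err)]
```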

Algorithm 1. CNN training using the proposed hybrid method
A CNN generally consists of two alternating main layer types, called convolution and max-pooling layers, and ends with a fully connected layer. All these layers are connected to each other with weights. However, there are many other CNN architectures. In this study, the same structure proposed by Yann LeCun et al. [42] is used.
There are a number of ambiguous steps that need to be clarified: how can the CNN parameters be encapsulated into particles? How do the particles cooperate with each other? How is the best particle within the swarm identified? To answer these questions, consider how the parameters of a CNN are distributed. It is obvious that the weights and biases are the constituent parameters of a CNN. Therefore, in this work, the weights and biases are dismantled and encapsulated into vectors as shown below:

$$\theta_i^l = \left[W_i^l,\; b_i^l\right] \qquad (6)$$

where $l$ is the layer index, $L$ is the total number of layers, $i$ indexes the particle, $N$ is the total number of particles, $W_i^l$ holds the weight parameters of layer $l$, and $b_i^l$ holds the bias parameters of layer $l$. Finally, the full set of weight and bias parameters of particle $i$ is given by

$$P_i = \left[\theta_i^1,\; \theta_i^2,\; \dots,\; \theta_i^L\right] \qquad (7)$$

Fig. 2 shows the first convolution and max-pooling layers of the CNN. There is a set of $K$ filters, each with dimensions $m \times m$, which can be vectorized to length $m^2$. Thus, having $K$ filters in a layer, the total number of weight parameters is $K m^2$; in addition, the total number of bias parameters for the given layer is $K$. A sketch of this packing is given below.
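As an illustration of Eqs. (6) and (7), this minimal sketch packs per-layer weights and biases into one particle vector and unpacks them again for the forward pass; the function names are assumptions.

```python
import numpy as np

def encode(layers):
    """Flatten per-layer (W, b) pairs into one particle vector, Eqs. (6)-(7).

    `layers` is a list of (weights, biases) NumPy arrays; the shapes are
    recorded so the vector can be unpacked again.
    """
    shapes = [(w.shape, b.shape) for w, b in layers]
    flat = np.concatenate([np.concatenate([w.ravel(), b.ravel()])
                           for w, b in layers])
    return flat, shapes

def decode(flat, shapes):
    """Inverse of encode(): rebuild per-layer (W, b) pairs from the vector."""
    layers, i = [], 0
    for w_shape, b_shape in shapes:
        nw, nb = int(np.prod(w_shape)), int(np.prod(b_shape))
        w = flat[i:i + nw].reshape(w_shape)
        i += nw
        b = flat[i:i + nb].reshape(b_shape)
        i += nb
        layers.append((w, b))
    return layers
```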
Since there are $N$ particles to be trained, any one of them could become the best in the swarm and yield an optimal solution. To identify the best particle in the swarm, the following notion is used:

$$g = \arg\min_i E(P_i) \qquad (8)$$

where $g$ is the best particle in the swarm and $E$, described below, is the measured error between the reference and the model output.
$$E = \frac{1}{2N_s}\sum_{n=1}^{N_s}\sum_{k=1}^{K_o}\left(d_{nk} - y_{nk}\right)^2 \qquad (3)$$

where $N_s$ is the number of training samples, $K_o$ is the number of output-layer neurons, $d_{nk}$ is the reference, and $y_{nk}$ is the network output. For clarification, and to show the difference between updating parameters using BP and PSO, Fig. 2 shows the principle of how BP and PSO work. The figure contains circles holding activation functions $f_l$, where $l = 1, \dots, L$ and $L$ is the number of layers; the last layer holds the function $f_L$. A sketch of this fitness computation follows.
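This is a minimal sketch of the fitness in (3), assuming the outputs and targets are arranged as (samples, output neurons) arrays and using the 1/(2N) scaling reconstructed above; `forward` in the usage comment is a hypothetical user-supplied forward pass.

```python
import numpy as np

def mse_fitness(outputs, targets):
    """Eq. (3): mean squared error over N samples and K output neurons."""
    n_samples = outputs.shape[0]
    return np.sum((targets - outputs) ** 2) / (2.0 * n_samples)

# Eq. (8): pick the particle whose network output has the lowest error.
# best = min(range(len(particles)),
#            key=lambda i: mse_fitness(forward(particles[i], X), Y))
```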
It is noticeable that BP requires both forward and backward phases. In the forward phase, each activation function produces its response with respect to the input. In the backward phase, the derivative is required with respect to the network parameters. PSO does not require any backward phase, which saves a vast amount of work and computation time because the forward phase is less complicated than the backward phase.
The backward phase of the network is not compulsory because the PSO algorithm depends only on the positions and velocities of the particles, as described in Fig. 3. For instance, if there are $N$ particles and particle $g$ is the best particle, satisfying (8), then each particle can be updated according to (1) and (2).

B. MNIST Dataset
The MNIST [34] dataset consists of handwritten digits 0-9. It contains 70,000 samples; 60,000 are used for training and the remaining 10,000 for testing. All samples have the same size of 28x28 pixels, and the pixel values are scaled to [0, 1] before training. No preprocessing or data augmentation is used in this work. The CNN structure is 8C-8S-24C-24S-89C-90F-10F, where C stands for a convolution layer, S for a subsampling layer, and F for a fully connected layer. For this dataset, the mini-batch size is 128 images. The proposed hybrid of PSO and SGD is exploited for training. At the beginning, the particles are trained using PSO only, and the Mean Square Error (MSE) given in (3) is used as the fitness measure for the particles: the particle with the lowest MSE has the highest fitness. In these experiments, the MSE keeps dropping for a few iterations and saturates after that. To circumvent this limitation, SGD and GA are launched whenever no further error drop is observed; SGD-GA is usually applied when the error saturates for 5-8 iterations. The test accuracy is 99.57% on the MNIST dataset. To the best of our knowledge, this is the best reported result without preprocessing, augmentation, or dropout. A summary of the best published results on the MNIST dataset is shown in Table I.
When a large dataset is used, such as MNIST, which has 60000 gray images for training and 10000 for testing, or CIFAR-10, which has 50000 color images for training and 10000 for testing, with a mini-batch of 128 images, PSO performance suffers because the best particle is chosen based on only 128 images, so the swarm falls into local minima. To tackle this problem, the hybrid training algorithm of PSO and SGD is used. Rather than using a single algorithm, better performance is reached by making the two algorithms collaborate. Table I shows most of the state-of-the-art results on MNIST. The comparison is performed only with results that do not use preprocessing or that use the same CNN architecture. It is clear that this work surpasses other works that do not use distortions or any preprocessing.

C. CIFAR-10 Dataset
The CIFAR-10 dataset consists of 10 classes of natural 32x32 RGB images, with 50,000 for training and 10,000 for testing [19]. The CNN used for this dataset is described as 12C-12S-48C-48S-89C-90F-10F, which denotes a convolutional layer with 12 feature maps, a subsampling layer, a convolutional layer with 48 feature maps, a subsampling layer, a convolutional layer with 89 feature maps, a fully connected layer with 90 neurons, and a fully connected output layer with 10 outputs. A sketch of this architecture is given below.
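For concreteness, the following Keras sketch instantiates that layer sequence; the 5x5 kernels and sigmoid activations are assumptions, since the paper specifies only the feature-map counts and the 2x2 non-overlapping subsampling.

```python
from tensorflow.keras import layers, models

# 12C-12S-48C-48S-89C-90F-10F for 32x32 RGB inputs; kernel sizes and
# activations are illustrative assumptions, not taken from the paper.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(12, 5, activation="sigmoid"),  # 12C: 32 -> 28
    layers.MaxPooling2D(2),                      # 12S: 28 -> 14
    layers.Conv2D(48, 5, activation="sigmoid"),  # 48C: 14 -> 10
    layers.MaxPooling2D(2),                      # 48S: 10 -> 5
    layers.Conv2D(89, 5, activation="sigmoid"),  # 89C: 5 -> 1
    layers.Flatten(),
    layers.Dense(90, activation="sigmoid"),      # 90F
    layers.Dense(10, activation="softmax"),      # 10F: class scores
])
```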
The subsampling layers use filters over non-overlapping regions of size 2x2. The same training steps are followed as for MNIST. However, on this dataset the training falls into local optima faster than on the previous datasets, so SGD is applied more often. It follows that PSO-GA needs to be united with SGD when a complicated dataset such as CIFAR-10 is used, because MNIST is easier to classify than CIFAR-10. Nevertheless, the benefit of using the hybrid PSO-SGD is still obtained. The test accuracy achieved on this dataset is 82.41%.
From Table II, it is evident that the proposed method surpasses the other state-of-the-art works. It is worth mentioning that only comparisons with methods that use the same CNN structure are considered. Other techniques that can be very valuable for increasing accuracy, such as dropout or drop-connect, are not used. In this work, the same general structure proposed by Yann LeCun et al. [34] is used; only the training algorithm is replaced, while the same CNN configuration is kept. Note that Maxout Networks are used in very large CNN implementations because they are built on top of Krizhevsky et al.'s [31] code, whereas a conventional CNN is used here. In addition, the leveraged PSO algorithm used in this work is faster than SGD.

VII. CONCLUSION

In this work, a new hybrid training process, called the Particle Swarm Optimization-Stochastic Gradient Descent (PSO-SGD) algorithm, is proposed and demonstrated for training Convolutional Neural Networks (CNN). It is established that the algorithm is well suited to achieving nontrivial results on different datasets, and it achieves state-of-the-art performance on these datasets. The proposed algorithm is a proficient training method because it combines both PSO and SGD in an innovative fashion. The analysis also shows that the proposed method is superior on three different benchmark datasets. The hybrid training method avoids the trapping in local optima and the premature saturation caused by using a single algorithm. Additionally, it keeps the training active for the whole training period and restrains the lethargy inherent in a monolithic algorithm.

VIII. FUTURE WORK
In the future, more influential parameters will be explored: additional parameters that can influence model accuracy will be investigated. Deeper analysis and more challenging datasets, such as ImageNet, will also be part of the future work. Reporting the time consumption and execution speed for training and testing will also be considered, endeavoring to reach real-time execution.

Fig. 4. The CNN used in this work consists of alternating convolutional and max-pooling layers, with a fully connected layer implemented on top of the network. The architecture of the CNN used for each dataset is different. The number of particles is 25, and they are randomly initialized with different means and variances.

TABLE II. TEST ERROR RATES ON CIFAR-10 DATASET
D. SVHN Dataset

The SVHN dataset consists of 604,388 samples (training and extra sets) and 26,032 test images. Each sample in the dataset is a color image of size 32x32 pixels. Following [1,2], 400 samples per class from the training set and 200 images per class from the extra set are selected to form a validation set. The task in this dataset is to classify the digit in the center of each image. Local contrast normalization preprocessing is used, following Goodfellow et al. [6]. In addition, the same CNN configuration and parameter settings are used as for CIFAR-10. The test error obtained is 2.48%. The result is shown in Table III.

TABLE III. TEST ERROR RATES ON SVHN DATASET