Improved Accuracy of PSO and DE using Normalization: an Application to Stock Price Prediction

Data mining has been actively applied to the stock market since the 1980s. It has been used to predict stock prices and stock indexes, and for portfolio management, trend detection and the development of recommender systems. The algorithms used for these tasks include ANN, SVM, ARIMA and GARCH. Hybrid models have been developed by combining these algorithms with others such as rough sets, fuzzy logic, GA, PSO, DE and ACO to improve efficiency. This paper proposes a DE-SVM (Differential Evolution-Support Vector Machine) model for stock price prediction. DE is used to select the best combination of free parameters for SVM in order to improve results. The paper also compares the prediction results with the outputs of SVM alone and of a PSO-SVM (Particle Swarm Optimization-SVM) model. The effect of data normalization on prediction accuracy has also been studied.


I. INTRODUCTION
Stock market prediction is an attractive field for research due to its commercial applications and the benefits it offers. The stock market exhibits stochastic, non-parametric and nonlinear behavior. An important and much-debated hypothesis about the stock market is the EMH (Efficient Market Hypothesis). According to the EMH, the stock market immediately reflects all publicly available information. In reality, however, the stock market is not that efficient, so prediction of the stock market is possible. This paper proposes a hybrid DE-SVM (Differential Evolution-Support Vector Machine) model. The performance of SVM depends on the selection of the free parameters C (cost penalty), ϵ (width of the insensitive-loss function) and γ (kernel parameter). DE is used to find the best parameter combination for SVM. DE-SVM has already been used by Zhonghai Chen et al. [6] for air conditioning load prediction, Yong Sun et al. [7] for gas load prediction, José García-Nieto et al. [8] for feature selection, Shu Jun et al. [9] for rainstorm forecasting and Jiang An-nan et al. [10] for lithology identification from well logs. The paper also compares the results of DE-SVM with those of PSO-SVM and SVM. The effect of normalization of the datasets has also been studied.

II. LITERATURE REVIEW
Yohanes et al. [1] showed that ANN can outperform ARIMA (Autoregressive Integrated Moving Average): the ESS (Each sum square) is 284.95 with ARIMA and 170.40 with ANN [1]. Qiang Ye et al. [2] showed that stock price prediction with an amnestic NN is better than with a common ANN: the ratio of correctly classified stocks is 58.25% when the forgetting coefficient is 0.10, compared to 56.25% for a forgetting coefficient of 0.00 (common ANN) [2]. Ling-Feng Hsieh et al. [3] integrated DOE (Design of Experiment) with BPNN to show that experimental validation of the optimal parameter settings can effectively improve the forecasting rate to 84%. Mustafa E. Abdual-Salam et al. [4] showed that DE converges to the global minimum faster and gives better accuracy than PSO when both are used as training algorithms for ANN. Zhang Da-yong et al. [5] proposed a hybrid ARMA-SVM (Autoregressive Moving Average-SVM) model, which has an MSE of 1.1433 against 1.1494 for BPNN.

A. Support Vector Machines (SVM):
SVM was developed by Vapnik and Cortes in 1995. SVM is a promising method for the classification of both linear and nonlinear data [11], and it can be used for both classification and regression. SVMs can be trained with fewer input samples and are less prone to overfitting. The training of even the fastest SVMs can be extremely slow, but they are highly accurate, owing to their ability to model complex nonlinear decision boundaries [11]. SVM follows supervised learning. For classification, when the data is linearly separable, a straight line can be drawn to separate the tuples of one class from the other. Nonlinear data is mapped into a higher-dimensional space where the different classes can be separated using a hyperplane. A number of separating hyperplanes are possible, but SVM searches for the maximum marginal hyperplane (MMH). The vectors in the training set that have minimal distance to the maximum margin hyperplane are called support vectors [12].
SVM selects a minority of the observations (the support vectors) to represent the majority of the remaining observations [13]. Soft margins were introduced to penalize, but not prohibit, classification errors while finding the maximum margin hyperplane [11]. If the margin can be significantly increased, the better generalization can outweigh the penalty for a classification error on the training set [11]. To maximize the prediction ability of a model, both underfitting and overfitting need to be suppressed at the same time in data processing [25]. The training error is called the empirical risk, denoted by $R_{emp}$. Instead of ERM (Empirical Risk Minimization), SVM uses SRM (Structural Risk Minimization), which aims at minimizing the bound (1):
$$R_{emp} + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}} \tag{1}$$
Here, $l$ is the number of samples in the training set, $1-\eta$ is the probability that the inequality (1) $\geq R_{pred}$ holds ($R_{pred}$ is the total risk of prediction), and $h$ is the VC dimension, which is used to suppress overfitting in data processing [25].

SVM parameters:
The performance of SVM is based on three basic parameters C (cost penalty), ϵ (insensitive loss function parameter) and γ (kernel parameter).
Cost penalty: C determines the trade-off between minimizing the training error and minimizing the model's complexity [26]. Equivalently, C determines the trade-off between model complexity and the degree to which deviations larger than ϵ are tolerated [20].
ϵ-insensitive loss function: Parameter ϵ controls the width of the ϵ-insensitive zone used to fit the training data [27]. Larger ϵ-values result in fewer SVs being selected and in 'flatter' (less complex) regression estimates [20]. If the value of ϵ is too big, the separating error is high and the number of support vectors is small, and vice versa [26].
Kernel parameter: γ (2σ²) of the kernel function implicitly defines the nonlinear mapping from the input space to some high-dimensional feature space [28]. Of the kernels in common use (such as linear, polynomial, sigmoid and RBF), the RBF kernel is mostly used for stock price prediction because only one parameter needs to be determined, so it yields fewer SVR training parameters and they are easy to determine [18]. The kernel width parameter σ in RBF is selected to reflect the input range of the training/test data. For univariate problems, the RBF width parameter is set to σ ~ (0.1-0.5)·range(x) [20].
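As a minimal illustration of the roles of ϵ and γ described above, the following Python sketch evaluates the ϵ-insensitive loss and an RBF kernel. The function names are hypothetical, and the kernel follows the convention used in this paper, where the kernel parameter is 2σ² and divides the squared distance. Many SVM libraries instead multiply the squared distance by their gamma parameter, so the convention of the library in use should be checked.

```python
import math


def eps_insensitive_loss(actual, predicted, eps):
    """Loss is zero inside the eps-wide tube and grows linearly outside it."""
    return max(0.0, abs(actual - predicted) - eps)


def rbf_kernel(x, z, two_sigma_sq):
    """RBF kernel exp(-||x - z||^2 / (2*sigma^2)).

    Assumption: `two_sigma_sq` is the quantity 2*sigma^2 (called gamma in
    the text), so it divides the squared Euclidean distance.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / two_sigma_sq)
```

Deviations smaller than ϵ cost nothing, which is why a larger ϵ leaves fewer training points as support vectors. A smaller 2σ² makes the kernel value fall off faster with distance, giving a more local (more complex) model.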

B. Differential Evolution (DE):
Differential evolution (DE) was introduced by Kenneth Price and Rainer Storn in 1995 for global continuous optimization problems. It won third place at the 1st International Contest on Evolutionary Computation [14]. DE belongs to the family of Evolutionary Algorithms (EA). The DE algorithm is similar to genetic algorithms, with similar operations of crossover, mutation and selection. DE can find the true global minimum regardless of the initial parameter values. DE provides fast convergence and uses few control parameters. DE constructs better solutions than genetic algorithms because GA relies on crossover while DE relies on the mutation operation. It is a stochastic population-based search method that employs repeated cycles of recombination and selection to guide the population towards the vicinity of the global optimum. DE uses a differential mutation operation based on the distribution of parent solutions in the current population, coupled with recombination with a predetermined parent to generate a trial vector (offspring), followed by a one-to-one greedy selection scheme between the trial vector and the parent [15]. Depending on the way the trial vector is generated, there exist many trial vector generation strategies and consequently many DE variants. The high convergence characteristics and robustness of DE have made it one of the popular techniques for real-valued parameter optimization.
DE conventionally uses three parameters: the population size NP (written $N_p$ below), the scale factor F and the crossover probability CR. Conditions on these parameters include: NP > 4; F > 0 is a real-valued constant, often set to 0.5; CR ∈ (0, 1), often set to 0.9 [16]. The different stages in DE are:
1. Population structure: The current population, symbolized by $P_c$, is composed of D-dimensional vectors $X_i^g = \{x_{i,1}^g, x_{i,2}^g, \ldots, x_{i,D}^g\}$, where the index g indicates the generation to which a vector belongs [17]. In addition, each vector is assigned a population index i, which varies from 1 to $N_p$, where $N_p$ is the population size [17]. Once initialized, DE mutates randomly chosen vectors to produce an intermediary population $P_v$ of $N_p$ mutant vectors [22]. Each vector in the current population is then recombined with a mutant to produce a trial population $P_u$ of $N_p$ trial vectors [22].
2. Initialization: This stage consists of forming the initial population. For example, if the objective is the optimization of membership functions, the initialization step consists of arbitrarily choosing the interval of this function [17].
3. Mutation [17,22]: For each vector $X_i^g = \{x_{i,1}^g, x_{i,2}^g, \ldots, x_{i,D}^g\}$ (for example, a vector which represents the interval of the membership functions), a mutant vector $V_i^g$ is produced according to the following formulation [22]:
$$V_i^g = X_{r_0}^g + F\left(X_{r_1}^g - X_{r_2}^g\right) \tag{2}$$
where $r_0$, $r_1$ and $r_2$ are mutually distinct, randomly chosen population indices that also differ from i. The scale factor F is a positive real number that controls the rate at which the population evolves. While there is no upper limit on F, effective values are seldom greater than 1.
4. Crossing [17,22,4]: The parent vector is mixed with the mutant vector to produce a trial vector $U_i^g$:
$$u_{i,j}^g = \begin{cases} v_{i,j}^g & \text{if } rand_j(0,1) \le CR \text{ or } j = j_r \\ x_{i,j}^g & \text{otherwise} \end{cases} \tag{3}$$
The crossover probability CR ∈ [0,1] is a user-defined value that controls the fraction of parameter values that are copied from the mutant. To determine which source contributes a given parameter, uniform crossover compares CR to the output of a uniform random number generator $rand_j(0,1)$. If the random number is less than or equal to CR, the trial parameter is inherited from the mutant $V_i^g$; otherwise, the parameter is copied from the vector $X_i^g$. In addition, the trial parameter with randomly chosen index $j_r$ is taken from the mutant to ensure that the trial vector does not duplicate $X_i^g$. Because of this additional demand, CR only approximates the true probability.
5. Selection [17]: All the solutions in the population have the same chance of being selected as parents, regardless of their fitness function value. The child (new vector) produced after the crossing operation is evaluated. Then the performances of the child vector and its parent are compared and the better one is selected. If the parent is still better, it is retained in the population.
Once the new population is installed, the process of mutation, recombination and selection is repeated until the optimum is located, or a prespecified termination criterion is satisfied, e.g., the number of generations reaches a preset maximum, gmax [4].
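The stages described above can be sketched in a few lines of Python. This is a minimal illustration of the classic DE/rand/1/bin scheme (random base vector, one difference vector, binomial crossover), not the DE/local-to-best/1/bin strategy or the R code used for the experiments in this paper. The function name, the clipping of mutant values to the search bounds and the default settings are assumptions made for the example.

```python
import random


def differential_evolution(objective, bounds, pop_size=30, f=0.5, cr=0.9,
                           generations=200, seed=0):
    """Minimize `objective` over box `bounds` with DE/rand/1/bin."""
    if pop_size < 4:
        raise ValueError("pop_size must be at least 4")
    rng = random.Random(seed)
    dim = len(bounds)
    # Initialization: random vectors inside the search bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        next_pop, next_scores = [], []
        for i in range(pop_size):
            # Three distinct indices, all different from the parent index i.
            others = [k for k in range(pop_size) if k != i]
            r0, r1, r2 = rng.sample(others, 3)
            j_rand = rng.randrange(dim)  # this index always comes from the mutant
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() <= cr or j == j_rand:
                    # Mutation: base vector plus scaled difference vector.
                    value = pop[r0][j] + f * (pop[r1][j] - pop[r2][j])
                    value = min(max(value, lo), hi)  # assumption: clip to bounds
                else:
                    value = pop[i][j]  # crossover: keep the parent's parameter
                trial.append(value)
            trial_score = objective(trial)
            # Selection: one-to-one greedy comparison of trial and parent.
            if trial_score <= scores[i]:
                next_pop.append(trial)
                next_scores.append(trial_score)
            else:
                next_pop.append(pop[i])
                next_scores.append(scores[i])
        pop, scores = next_pop, next_scores
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

For SVM tuning, `objective` would map a vector (C, ϵ, γ) to an error measure such as the MSE of a model trained with those parameters, and `bounds` would hold the search range of each parameter.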

C. Particle Swarm Optimization (PSO):
PSO (Particle Swarm Optimization) was proposed by James Kennedy and Russell Eberhart in 1995. It is motivated by the social behavior of organisms such as bird flocking and fish schooling [29]. It can be used for nonlinear and mixed integer optimization. PSO differs from evolutionary computing in that potential solutions are flown through hyperspace and accelerate towards "better" solutions, whereas evolutionary computation schemes operate directly on potential solutions that are represented as locations in hyperspace [4]. The position of a particle is influenced by the best position visited by itself (i.e. its own experience) and the position of the best particle in its neighborhood (i.e. the experience of neighboring particles) [30]. The particle positions $x_i$ are adjusted using:
$$x_i(t+1) = x_i(t) + v_i(t+1) \tag{4}$$
where the velocity component $v_i$ represents the step size. For the basic PSO,
$$v_{i,j}(t+1) = w\,v_{i,j}(t) + c_1 r_{1,j}(t)\left(y_{i,j}(t) - x_{i,j}(t)\right) + c_2 r_{2,j}(t)\left(\hat{y}_{i,j}(t) - x_{i,j}(t)\right) \tag{5}$$
where w is the inertia weight [31], $c_1$ and $c_2$ are the acceleration coefficients, $r_{1,j}, r_{2,j} \sim U(0,1)$, $y_i$ is the personal best position of particle i, and $\hat{y}_i$ is the neighborhood best position of particle i [30]. The neighborhood best position $\hat{y}_i$ of particle i depends on the neighborhood topology used [32,33].
The main steps involved in PSO are [34]:
1) Initialize a population array of particles with random positions and velocities on D dimensions in the search space.
2) For each particle, evaluate the desired optimization fitness function in D variables.
3) Compare each particle's fitness evaluation with its previous best. If the current value is better than the previous best, then set the previous best equal to the current value, and the previous best position equal to the current location in D-dimensional space.
4) Identify the particle in the neighborhood with the best success so far.
5) Change the velocity and position of the particle according to (4) and (5).
6) If a criterion is met (usually a sufficiently good fitness or a maximum number of iterations), the optimal result is output; otherwise the optimization continues.
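The steps listed above can be sketched as follows. This is a minimal illustration that assumes a global-best topology (the neighborhood of every particle is the whole swarm) and minimization. The function name, the values of w, c1 and c2, the initial velocity range and the clipping of positions to the search bounds are assumptions made for the example, not settings reported in this paper.

```python
import random


def particle_swarm(objective, bounds, n_particles=30, w=0.7, c1=1.5, c2=1.5,
                   iterations=200, seed=0):
    """Minimize `objective` over box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    # Step 1: random positions and velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[rng.uniform(-0.1 * (hi - lo), 0.1 * (hi - lo)) for lo, hi in bounds]
           for _ in range(n_particles)]
    # Step 2: evaluate the fitness of every particle.
    pbest = [p[:] for p in pos]
    pbest_score = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_score.__getitem__)
    gbest, gbest_score = pbest[g][:], pbest_score[g]
    for _ in range(iterations):  # step 6: stop after a fixed number of iterations
        for i in range(n_particles):
            for j, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Step 5: velocity update followed by position update.
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                pos[i][j] = min(max(pos[i][j] + vel[i][j], lo), hi)
            score = objective(pos[i])
            # Step 3: update the personal best.
            if score < pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                # Step 4: update the best position found by the swarm.
                if score < gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score
```

With a different neighborhood topology, `gbest` would be replaced by the best position found within each particle's own neighborhood.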
DE and PSO have been used to optimize the parameters of SVM during training, and those parameters have then been used to create the best possible model for prediction. The data has been normalized, as inspired by [18], to:
1) Prevent data with a large range from "submerging" data with a small range, and balance their roles in training so that the data are comparable [18].
2) Enhance training efficiency and avoid problems in the inner product calculation when evaluating the kernel function [18].
The formula used for normalization is [18]:
$$x' = x_{low} + \frac{(x - x_{min})(x_{up} - x_{low})}{x_{max} - x_{min}} \tag{6}$$
Here, $x$ is the original data, $x'$ is the data after normalization, $x_{min}$ is the minimum of the original data, $x_{max}$ is the maximum of the original data, $x_{low}$ is the lower bound of the data after normalization and $x_{up}$ is the upper bound of the data after normalization. Here, we use $x_{low} = -1$ and $x_{up} = +1$.
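The normalization described above is a linear min-max rescaling to the interval [x_low, x_up]. A minimal Python sketch, with a hypothetical function name and the bounds -1 and +1 as defaults, is:

```python
def min_max_scale(values, low=-1.0, high=1.0):
    """Linearly rescale `values` so the minimum maps to `low` and the maximum to `high`."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot scale a constant series")
    span = x_max - x_min
    return [low + (x - x_min) * (high - low) / span for x in values]
```

For example, `min_max_scale([10, 15, 20])` returns `[-1.0, 0.0, 1.0]`. The predictions of a model trained on normalized data are themselves on the normalized scale, so errors computed on them are not directly comparable with errors on the original price scale.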

2) Performance indicator:
The performance measure used is MSE (Mean Square Error):
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(a_i - p_i)^2 \tag{7}$$
Here, a is the actual value, p is the predicted value, and i is the term index, which ranges from 1 to n, where n is the last term index. MSE helps to avoid NAs and
3) Methodology: The basic methodology for both the normalized and non-normalized approaches is the same.
To find the optimal range of each of the three parameters C, ϵ and γ, two parameters are first fixed and the remaining one is varied to see its effect on training MSE, testing MSE and the number of support vectors. Then a different parameter is varied while the other two are fixed, and so on. All the collected values are considered in order to find the optimal ranges to be used for stock price prediction. The general range of these parameters can span a large solution space, but the optimal range differs between applications and is also dataset dependent. For all three parameters, training MSE, testing MSE and the number of support vectors are checked for overfitting and underfitting to select the optimal range.
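The one-parameter-at-a-time procedure above can be sketched as a small helper. In this illustration `fit_and_score` is a hypothetical, user-supplied callable that trains an SVM regression model with the given parameters and returns the training MSE, the testing MSE and the number of support vectors. The keyword names `C`, `epsilon` and `gamma` are likewise assumptions. The `mse` function implements the mean of squared differences between actual and predicted values.

```python
def mse(actual, predicted):
    """Mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def sweep_parameter(fit_and_score, fixed, name, values):
    """Vary one parameter while the others stay fixed and collect diagnostics.

    `fit_and_score(**params)` is assumed to return a tuple
    (train_mse, test_mse, n_support_vectors).
    """
    rows = []
    for value in values:
        params = dict(fixed)
        params[name] = value
        train_mse, test_mse, n_sv = fit_and_score(**params)
        rows.append({name: value, "train_mse": train_mse,
                     "test_mse": test_mse, "n_sv": n_sv})
    return rows
```

A call such as `sweep_parameter(fit_and_score, {"C": 450.8346, "gamma": 0.0625}, "epsilon", [0.01, 0.02, 0.03])` would produce one row per ϵ value, from which a favorable range can be read off.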
The following points have been considered while selecting the values of C, γ and ϵ:
i) Selecting C: A 'good' value for C can be chosen equal to the range of the output (response) values of the training data [19]. However, such a selection of C is quite sensitive to possible outliers in the training data [20], so C has been fixed using the formula suggested in [20]:
$$C = \max\left(|\bar{y} + 3\sigma_y|,\ |\bar{y} - 3\sigma_y|\right) \tag{8}$$
Here, $\bar{y}$ and $\sigma_y$ are the mean and standard deviation of the y values of the training data. This C value coincides with the prescription suggested by Mattera and Haykin (1999) when the data has no outliers, but yields better C-values when the data contains outliers [20]. Based on the above formula, C is calculated as 69.167.
ii) Selecting γ: The RBF kernel has been used for the implementation of SVM, as inspired by [18]. The radial basis kernel expression is as follows:
$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j\rVert^2}{2\sigma^2}\right) \tag{9}$$
According to [20], for multivariate d-dimensional problems the RBF width parameter should be such that $\sigma^d \sim (0.1\text{-}0.5)$, so γ (2σ²) has been selected as 0.0625.
iii) Selecting ϵ: Mattera and Haykin (1999) propose choosing the ϵ-value so that the percentage of SVs in the SVM regression model is around 50% of the number of samples [19]. [20] suggests that optimal generalization performance can be achieved with the number of SVs more or less than 50%. The range of values for which the number of SVs is between 200 and 300 has been chosen for optimization in the implementation.
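The rule for C given above, the larger of |mean + 3·std| and |mean - 3·std| of the training targets, can be computed directly. In this minimal sketch the function name is hypothetical and the sample standard deviation is an assumption, since the text does not say whether the sample or the population form was used.

```python
import statistics


def cost_from_targets(y_train):
    """C = max(|mean + 3*std|, |mean - 3*std|) of the training targets."""
    mean_y = statistics.fmean(y_train)
    std_y = statistics.stdev(y_train)  # assumption: sample standard deviation
    return max(abs(mean_y + 3 * std_y), abs(mean_y - 3 * std_y))
```

Because the rule uses the mean and standard deviation instead of the full range of the targets, a single extreme target value changes C less than it would change the range.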

Dataset for Apple:
Finding range for ϵ: i) Selecting C: The value of C has been fixed at 450.8346 using (8).
ii) Selecting γ: Value of γ is fixed at 0.0625 according to [20] as explained above.

i) Normalized dataset parameters decision making:
Finding range for ϵ: After fixing the values of C and γ at 450.8346 and 0.0625 respectively, the number of support vectors, training MSE and testing MSE have been calculated for ϵ over the range [0.01,0.30]. The results for number of support vectors, training MSE and testing MSE are shown in Figure 1(a), 1(b) and 1(c) respectively. The favorable range for ϵ has been found to be [0.033,0.052], based on the required number of support vectors and the decrease in training and testing MSEs.
Finding range for C:
i) ϵ has been selected from the range [0.033,0.052] found above. It has been set as 0.039.
ii) γ is set as 0.0625.
The values of C are examined over [0.1,6000] while fixing ϵ and γ. The results for number of support vectors, training and testing errors are shown in Figure 2(a), 2(b) and 2(c). The range of C has been selected as [1,550]. Figure 2(a) shows that the number of support vectors never falls below 200. So, C has been selected such that training MSE decreases and there is no significant increase in testing MSE.
Finding range for γ:
i) ϵ has been selected from the range [0.033,0.052] found above. It has been set as 0.039.
ii) C has been selected from above found range of [1,550]. It has been set as 500.
The values of γ are examined over [0.0,0.4] after fixing ϵ and C. The results for number of support vectors, training and testing MSEs are shown in Figure 3.
The results are shown in Figures 4(a), 4(b) and 4(c). The range for ϵ has been selected as [0.033,0.052] after considering the appropriate number of support vectors and after verifying that training and testing errors do not increase significantly in this range.

iii) Finding range for C:
i) ϵ has been selected from above found range of [0.033,0.052]. It has been set as 0.035.
ii) γ is set as 0.0625.
The results for number of support vectors, training error and testing error over the range of C ~ [1,3000] are shown in Figures 5(a), 5(b) and 5(c).
The range for C has been selected as [1,300].
Finding range for γ:
i) ϵ has been selected from the range [0.033,0.052] found above. It has been set as 0.035.
ii) C has been selected from above found range of [1,300]. It has been set as 200.
γ has been examined over [0.0,0.4] and the results are shown in Figure 6(a), 6(b) and 6(c). The range of γ has been selected as [0.01,0.1]. At γ > 0.1 testing MSE decreases but training error increases.
Dataset for Honeywell: The above approach was also used with the Honeywell dataset, both normalized and non normalized.
i) Normalized dataset: To find the range for ϵ, C and γ were fixed at 69.167 and 0.0625, using (8) and according to [20] respectively, which gave a favorable range of [0.08,0.15]. Setting ϵ at 0.1 from the selected range and γ at 0.0625, C has a favorable range of [1,440]. With ϵ at 0.1 and C at 210 from the selected favorable range, γ has a favorable range of [0.02,0.08].
ii) Non normalized dataset: To find the range for ϵ, C and γ were fixed at 69.167 and 0.0625, using (8) and according to [20] respectively, which gave a favorable range of [0.05,0.07]. Setting ϵ at 0.05 from the selected range and γ at 0.0625, C has a favorable range of [1,60]. With ϵ at 0.05 and C at 30 from the selected favorable range, γ has a favorable range of [0.01,0.1].
All implementation has been done in R on a system with an AMD Turion-X2 2GHz dual-core processor, 2GB RAM and the Windows 7 Ultimate (32 bit) OS. The model used is shown in Figure 7.

Apple:
Normalized dataset: The range of parameters C,ϵ,γ are [1,550] Table 3. Table 4 shows the prediction results for SVM (with default parameters C=1, ϵ=0.1, γ=0.2), DE-SVM and PSO-SVM together. The large testing MSEs result from highly inaccurate predicted values, which are caused by the wide range of the output values.
[Table 1, Table 2]
For both DE and PSO, in the normalized and non normalized cases, the population size has been fixed at 30 and the number of iterations at 200. The prediction results of DE-SVM and PSO-SVM are better than those of SVM alone in both cases.
Honeywell:
Normalized dataset: The ranges of parameters C, ϵ, γ are [1,440], [0.08,0.15] and [0.02,0.08] respectively. Table 6 shows the prediction results. Table 7 shows the results. Predicted values for both normalized and non normalized datasets are shown in Table 8.
For DE, CR=0.7 and F=0.9. The DE/local-to-best/1/bin strategy has been used for DE in all the implementations in this paper.
[Table 7, Table 8]
V. CONCLUSION
The performance of SVM can be significantly affected by the choice of its free parameters: cost (C), insensitive loss function (ϵ) and kernel parameter (γ). The results show that the DE-SVM model's performance is comparable to that of PSO-SVM. The performance of these models can be improved by normalization of the datasets. Normalization helps to significantly improve the accuracy of the output when the range of values is vast. Normalization gives equal weightage to all the input variables by converting the values of all the variables to a pre-specified range. This helps to avoid dominance of one variable over the others in the created model, and so helps to improve its efficiency. SVM alone performs better when data is normalized because in hybrid models optimization techniques help to tune the model according to the requirements of the datasets. With normalization of data, the range for optimization of C, ϵ, γ improves.
[15]. Hence, it results in slower convergence. To alleviate this problem, a dynamic version of DE called Dynamic Differential Evolution (DDE) has been proposed by Anyong Qing [23]. The DEPSO algorithm proposed by Ying-Chih Wu [24], which offers more stability through dual evolution, can be used for optimization of SVM. The above mentioned methods will help to further improve the efficiency of SVM and hence improve results.