Learning on High Frequency Stock Market Data Using Misclassified Instances in Ensemble

Learning on non-stationary distributions has been shown to be a very challenging problem in machine learning and data mining, because the joint probability distribution between the data and the classes changes over time. Many real-time problems suffer from concept drift as they change with time. For example, in the stock market, customer behavior may change depending on the season of the year and on inflation. Concept drift can occur in the stock market for a number of reasons; for example, traders' preferences for stocks change over time, and increases in a stock's value may be followed by decreases. The objective of this paper is to develop an ensemble-based classification algorithm for non-stationary data streams that considers misclassified instances during the learning process. In addition, we present an exhaustive comparison of the proposed algorithms with state-of-the-art classification approaches using different evaluation measures such as recall, f-measure and g-mean.

Keywords—Classifiers; Concept drift; Data stream; Ensemble; Non-stationary Environment


I. INTRODUCTION
Nowadays most applications are online applications, in which a huge amount of data, generated from different sources, arrives at every time stamp. It is therefore very important to train classifiers incrementally over time so that they can learn the different concepts of non-stationary data streams.
Conventional data mining algorithms assume that each dataset is produced by a single, static and hidden function, i.e. the function (model/classifier) generating the data at training time is the same as that at testing time. In a non-stationary data stream, by contrast, data arrives continuously and the function generating instances at time t need not be the same as the function at time t+1. This difference in the underlying function is called concept drift [1].
The concept drift problem is studied in the literature under different terms, such as "concept shift", "concept drift", "dataset shift", "change of classification", "changing environments" and "non-stationary environment". Concept drift in a data stream happens when the relationship between the input and class variables changes over time, and this can happen because of a change in any of the following: 1) the class priors, P(c_i), i = 1, 2, 3, ..., k, where k is the number of classes; 2) the distribution of the classes, P(X|c_i), where i = 1, 2, 3, ..., k and X is a vector of labeled instances; and 3) the posterior distribution of the class membership, P(c_i|X), i = 1, 2, 3, ..., k.
To train classifiers incrementally over time so that they can learn the different concepts of non-stationary data streams, we use an ensemble-based approach [2], as shown in Fig. 1. In data mining, an ensemble is a pool of classifiers whose individual predictions are combined in some way to classify unseen examples. The strategy in ensemble systems [3] is to create subsets of the incoming data stream; a classifier is trained and tested on each subset, and these classifiers then collectively predict the label of unseen data.
The performance of learning algorithms depends on the size of the data chunks (blocks/batches). Bigger blocks [4] can result in more accurate classifiers because the classifiers get more data for training, but they can contain too many different concept drifts. Smaller blocks are better suited to drifting data streams, but usually lead to poorer classifiers because less training data is available.
In this paper, the state-of-the-art Learn++.NSE algorithm is evaluated against our proposed approach for handling non-stationary data. The paper is organized as follows: Section II offers an overview of related work. Section III presents the proposed algorithm, ENSDS_P, and Section IV provides the details of the proposed algorithm with its pseudo code. Section V provides a rigorous evaluation of the proposed algorithm against one of the existing algorithms. Section VI concludes the paper.
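The three sources of drift listed in this section are tied together by Bayes' rule. This is a standard identity rather than a result of this paper; it is written out here only to make explicit why a change in the class priors or in the class-conditional distributions can also change the posterior:

```latex
P(c_i \mid X) = \frac{P(X \mid c_i)\, P(c_i)}{\sum_{j=1}^{k} P(X \mid c_j)\, P(c_j)}, \qquad i = 1, 2, \ldots, k
```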

II. RELATED WORK
The first use of ensembles in data streams was proposed by Street and Kim with their Streaming Ensemble Algorithm [5] (SEA), in which a chunk of d instances is read from the data stream and used to build a classifier. Because the ensemble has a fixed size, the newly generated classifier is compared against a pool of previously trained classifiers (from previous chunks), and if it improves the quality of the ensemble it is included at the cost of the worst classifier. SEA uses a simple majority vote and may not perform well in recurring environments.
Wang et al. proposed the Accuracy Weighted Ensemble [6] (AWE), which trains a classifier on each incoming data chunk and uses that chunk to evaluate the performance of all existing classifiers in the ensemble. The weight of each classifier is the difference between the mean square error of a random classifier and the mean square error of the classifier on the current chunk. The mean square errors of old classifiers are high, and thus the weights of old classifiers are small.
Brzezinski and Stefanowski proposed the Accuracy Updated Ensemble [7] (AUE), which is derived from AWE. It uses the same principles of chunk-based ensembles, but with incremental base classifiers. It not only builds new classifiers, but also conditionally updates existing classifiers on new chunks rather than just adjusting their weights. Updating existing base classifiers makes AUE better than AWE in the case of gradual drift, but conditional updating of base classifiers is less accurate for sudden drift.
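The AWE weighting rule mentioned in this section can be illustrated with a minimal sketch. It assumes the usual AWE definitions: the random classifier predicts according to the class priors of the current chunk, and each ensemble member is scored by the probability it assigns to the true class of every instance. The function names and the sample values are hypothetical and are not taken from the paper.

```python
from collections import Counter


def mse_random(labels):
    """Mean square error of a classifier that guesses according to the class priors."""
    n = len(labels)
    priors = [count / n for count in Counter(labels).values()]
    return sum(p * (1.0 - p) ** 2 for p in priors)


def mse_classifier(true_class_probs):
    """Mean square error from the probability given to each instance's true class."""
    return sum((1.0 - p) ** 2 for p in true_class_probs) / len(true_class_probs)


def awe_weight(labels, true_class_probs):
    """AWE-style weight: error of the random classifier minus error of the member."""
    return mse_random(labels) - mse_classifier(true_class_probs)


# Hypothetical chunk: balanced labels and a member that is fairly confident.
chunk_labels = ["Up", "Down", "Up", "Down"]
member_probs = [0.9, 0.8, 0.7, 0.9]  # probability assigned to the true class
weight = awe_weight(chunk_labels, member_probs)
```

A member that is no better than random guessing gets a weight near zero, which is why classifiers trained on outdated concepts lose influence.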
Robi Polikar et al. proposed Learn++.NSE [8], [9], [10], [11], [12] (Non-Stationary Environment), which generates classifiers sequentially using batches of instances (it is not a true online learner, as it converts the online data stream into a series of chunks of a fixed size). At each time step, one new classifier is trained on the most recent distribution, using an instance weighting distribution. In Learn++.NSE, each classifier's weight is computed using a weighted average of its prediction error on old and current batches, and weighted majority voting is used to obtain the ensemble's output.
Most recently, Brzezinski and Stefanowski proposed AUE2 [13], which introduces a new weighting function, does not require cross-validation on the existing classifiers, does not keep a classifier buffer, prunes its base learners, and always unconditionally updates its components. Classifiers are updated after every chunk, so they can react to gradual drifts. AUE2 can react to sudden and gradual drifts but not to reoccurring concepts. Compared to Learn++.NSE, AUE2 incrementally trains existing component classifiers, retains only k of all the created components, and uses a different weighting mechanism which ensures that components have non-zero weights.

III. PROPOSED ALGORITHM
Fig. 2 depicts the flow diagram of the ensemble for non-stationary data streams with propagation (ENSDS_P).
ENSDS_P, our proposed algorithm, is an ensemble of classifiers in which the classifiers are generated from the data that arrived at time t and evaluated on recent data. All generated classifiers are combined using weighted majority voting to predict unseen data. ENSDS_P differs from existing approaches in two major ways. First, we do not update a set of weights for each instance; we consider all instances equally important during training, so a uniform weight is used. Second, we propagate the misclassified instances of a classifier to the subsequent classifier to improve performance.
In this system, data arrives continuously in a non-stationary manner. For learning, we take a dataset containing labeled instances and divide the incoming data into a number of batches, each containing an equal number of instances. First, any suitable classification scheme is applied to create a classifier. The performance of the classifier is then evaluated on the same batch of instances. If the error rate of the classifier is more than 50%, i.e. more than half of the predictions are wrong, the recently generated classifier is deleted and the classification process is repeated until we get a classifier with an error rate below 50%.
www.ijacsa.thesai.org
After the first classifier is created, a misclassified-instance buffer is used to store hard-to-classify instances. All hard-to-classify instances are propagated to the next classifier so that it can learn them together with its training chunk, improving overall system performance.
When the next batch of data becomes available, the incorrectly classified instances of the previous classifier are combined with the labeled instances of the current batch, and the classification scheme is applied to obtain the next classifier. This process continues until we have a classifier for every batch, and all these classifiers are combined using a weighted majority voting scheme. When unlabeled data arrives, its label is predicted by the ensemble using weighted majority voting.
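The propagation loop described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' Java implementation: a nearest-centroid learner stands in for the base classifier (the experiments in this paper use Naïve Bayes), instances are (features, label) pairs, and a batch whose classifier reaches an error of 50% or more is simply skipped, because retraining a deterministic stand-in learner would not change its result. All names are hypothetical.

```python
from collections import defaultdict


class CentroidClassifier:
    """Stand-in base learner: predicts the class whose centroid is nearest."""

    def fit(self, instances):
        sums, counts = {}, defaultdict(int)
        for features, label in instances:
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, features):
        def sq_dist(centroid):
            return sum((f - c) ** 2 for f, c in zip(features, centroid))

        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def misclassified(classifier, instances):
    """Return the instances that the classifier labels incorrectly."""
    return [(x, y) for x, y in instances if classifier.predict(x) != y]


def train_with_propagation(batches, make_classifier=CentroidClassifier):
    """Train one classifier per batch and carry its mistakes into the next batch."""
    ensemble, buffer = [], []
    for batch in batches:
        training_chunk = buffer + batch  # previous mistakes plus current labeled batch
        clf = make_classifier().fit(training_chunk)
        errors = misclassified(clf, training_chunk)
        error_rate = len(errors) / len(training_chunk)
        if error_rate >= 0.5:
            continue  # assumption of this sketch: skip instead of retraining
        ensemble.append((clf, error_rate))
        buffer = errors  # hard instances for the next classifier
    return ensemble, buffer
```

In this sketch the buffer always holds the mistakes of the most recently accepted classifier, so each new classifier sees the hard instances of its predecessor once.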
Two variations of ENSDS_P are developed and analyzed. In the first, ENSDS_P_F, we propagate misclassified instances but preserve a fixed batch size for all classifiers. In the second, ENSDS_P_D, we propagate misclassified instances and use a dynamic chunk size for training the classifiers.
IV. ALGORITHMIC DESCRIPTION
Fig. 3 presents the pseudo code of the proposed algorithm. For each t, a new classifier is generated on the current training chunk D_t, the performance of all previously generated classifiers is evaluated on the current data chunk through the error of each classifier k, and the misclassified instances are saved in a buffer. The misclassified instances are propagated to the next classifier along with its training chunk.
In step 1, a uniform weight is assigned to all instances of the current data chunk. Step 2 is the only step that differs between ENSDS_P_D and ENSDS_P_F; the rest of the algorithm is the same for both. ENSDS_P_D follows eq. 1, where the instances in the current data chunk are the union of D_t and the misclassified instances of the previous data chunk.
ENSDS_P_F follows eq. 2 and 3, where ND_t represents a new dataset whose size equals the size of the current data chunk minus the size of the misclassified-instance buffer.

$$\mathrm{Size}(ND_t) = \mathrm{Size}(D_t) - \mathrm{Size}(\text{misclassified-instance buffer})$$
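The two ways of building the training chunk in step 2 can be sketched as below. The function names are hypothetical, and one detail is an assumption of this sketch: the text does not say which instances of the current chunk form ND_t in the fixed-size variant, so the sketch simply keeps the first ones.

```python
def build_chunk_dynamic(current_chunk, buffer):
    """ENSDS_P_D: current chunk plus propagated instances, so the size grows."""
    return list(buffer) + list(current_chunk)


def build_chunk_fixed(current_chunk, buffer):
    """ENSDS_P_F: propagated instances plus a reduced chunk, so the size stays fixed."""
    keep = max(len(current_chunk) - len(buffer), 0)  # Size(ND_t)
    reduced_chunk = list(current_chunk)[:keep]  # assumption: keep the first instances
    return list(buffer) + reduced_chunk


# Hypothetical example: a chunk of 10 instances and 3 propagated instances.
chunk = list(range(10))
propagated = ["m1", "m2", "m3"]
dynamic_chunk = build_chunk_dynamic(chunk, propagated)
fixed_chunk = build_chunk_fixed(chunk, propagated)
```

With equal batch sizes and an accepted error below 50%, the buffer is always smaller than the next chunk, so the fixed variant never has to drop propagated instances.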
After the k-th classifier is formed in step 3, the performance of the existing classifiers is evaluated on the current training dataset, which yields the error of the k-th classifier on the current chunk. If the error of the current classifier is more than 0.5, i.e. more than half of the predictions are wrong, a new classifier is generated for the current distribution. If the error of one of the previous classifiers is more than 0.5, its error is set as in step 4. We do not normalize the error, as its value remains between 0 and 0.5, and the voting power of a classifier with a high error remains low. In step 5, we create a buffer parameter that holds the misclassified instances. These misclassified instances are propagated to the next classifier, together with its training chunk, before it is formed. A nonlinear sigmoid function is used to set the weight of a classifier. Because of this, if a classifier is evaluated more than once, its sigmoid weight increases.
The weight of a classifier is assigned based on its performance on previous distributions as well as on the recent distribution, so a weighted average of the classifier's errors is computed in step 6. A classifier receives an error when it is generated, and this error is updated after each evaluation on the recent environment. If a classifier does not perform well on the recent environment, its weighted error increases. In step 7, the weighted error average is computed to determine the voting weight of the classifiers.

V. COMPARATIVE EVALUATION AND ANALYSIS
In the following subsections, we describe the tested datasets, the experimental setup, and a comparative analysis of the experimental results.

A. Datasets
To compare ENSDS_P with the existing algorithm (Learn++.NSE), we use different datasets with different batch sizes. The proposed algorithm is tested on real-time datasets.

1) IBM_EOD_Direction:
Stock data is considered because stock market data is high-frequency data that is complex, non-stationary, chaotic and non-linear, and therefore suits our research topic. Concept drift can occur in the stock market for a number of reasons; for example, traders' preferences for stocks change over time, and increases in a stock's value may be followed by decreases. Stock market data can exhibit sudden, gradual and recurring drift at any moment.
The analysis of the IBM_EOD_Direction dataset would help a trader to know the position of the stock market index at the next moment in time, and the analysis of IBM_EOD_Trading would help a trader to decide whether it is the right time to sell or purchase the stock.

B. Experimental Setup
For the experimental analysis, the proposed algorithm is implemented in Java using the MOA and WEKA frameworks. The source code of Learn++.NSE is obtained from the MOA extensions for comparison purposes. The experiments were conducted on a machine equipped with an Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz with 2 cores and 4 logical processors, and 4 GB of RAM. We used different batch sizes for comparison, since the optimal batch size is different for each stream. For rigorous evaluation, we consider different evaluation measures [14]: precision (P), recall (R), accuracy (A), f-measure (F-M) and g-mean (G-M).
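For reference, the evaluation measures named above can be computed from a binary confusion matrix as in the sketch below. The paper cites [14] without restating the formulas, so the g-mean used here (geometric mean of sensitivity and specificity) is an assumption based on the common definition; the function name and the counts in the example are hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluation_measures(tp, fp, tn, fn):
    """Compute P, R, A, F-M and G-M from binary confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # also called sensitivity
    specificity = _ratio(tn, tn + fp)
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    f_measure = _ratio(2 * precision * recall, precision + recall)
    g_mean = math.sqrt(recall * specificity)  # assumed definition of g-mean
    return {
        "P": precision,
        "R": recall,
        "A": accuracy,
        "F-M": f_measure,
        "G-M": g_mean,
    }


# Hypothetical counts for an Up/Down task, with "Up" as the positive class.
scores = evaluation_measures(tp=40, fp=10, tn=40, fn=10)
```

Because g-mean multiplies the recall of both classes, it drops sharply when a classifier ignores the minority class, which accuracy alone would hide.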

C. Results
Table 1 reports the performance of Learn++.NSE and both versions of ENSDS_P in classifying the stock index movement (Up, Down) on the IBM_EOD dataset, where Naïve Bayes is used as the base classifier with different batch sizes and no pruning strategy. It is clear from Fig. 4 that for each batch size we obtain high true positives and low false positives; hence precision is higher. The results show that the precision of both versions of ENSDS_P is significantly high and recall is approximately equal.
Generally, there is always a tradeoff between precision and recall. F-measure is an appropriate evaluation measure that balances precision and recall. Compared to Learn++.NSE, we are able to maintain a good balance between precision and recall, so the proposed algorithm can also be used with imbalanced data. The values of the evaluation measures support the validity of the proposed work; the evaluation results show that the proposed algorithms effectively provide incremental learning over high-frequency stock market data.
Fig. 5 shows that, compared to Learn++.NSE, we have achieved higher precision, accuracy, f-measure and g-mean for all batches on the IBM_EOD_Trading dataset. After testing the proposed algorithm on different datasets, the evaluation measures confirm the validity of the proposed algorithm. Non-stationary data can have a class imbalance problem, so results can be biased toward the majority class; the classifier then tends to misclassify minority-class instances. The proposed algorithm can be used in imbalanced application areas and can provide a balance between majority and minority instances.

VI. CONCLUSION
From the implementation and analysis of ENSDS_P, we conclude that ENSDS_P performs better than Learn++.NSE on different datasets. The evaluation measures also confirm the validity of the proposed algorithm's scores. The optimal batch size varies from dataset to dataset. Non-stationary data can have a class imbalance problem, so results can be biased toward the majority class; the classifier then tends to misclassify minority-class instances. If a dataset is highly imbalanced, a balancing mechanism needs to be added to the proposed algorithm to achieve high performance.

Fig. 3. The pseudo code of the algorithm ENSDS_P.

8. Obtain the final hypothesis
The voting power of each classifier is computed using the logarithm of the inverse of its weighted error average. If the weighted error average is high, a classifier gets less voting power. In the time complexity expression ...*k*O(x*m)+k*t*m), O(x*m) is the time complexity of the Naïve Bayes classifier, x is the number of features, m is the number of instances in the training set, k is the number of classifiers, and t is the number of data chunks to be predicted.
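The weighting and voting steps (steps 6 to 8) can be sketched as below. The exact sigmoid is not given in the text, so this sketch assumes a logistic function with a hypothetical slope `a` and offset `b`, applied over a classifier's error history so that recent evaluations count more. Capping errors at 0.5 is likewise an assumption, based on the statement that the error stays between 0 and 0.5. Only the last rule, voting power as the logarithm of the inverse of the weighted error average, is taken directly from the description.

```python
import math
from collections import defaultdict


def sigmoid_weights(n, a=0.5, b=0.0):
    """Normalized logistic weights for n evaluations, oldest first (a, b assumed)."""
    raw = [1.0 / (1.0 + math.exp(-a * (j - b))) for j in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_error_average(error_history, a=0.5, b=0.0):
    """Sigmoid-weighted average of a classifier's errors, oldest first."""
    capped = [min(e, 0.5) for e in error_history]  # assumed cap at 0.5
    weights = sigmoid_weights(len(capped), a, b)
    return sum(w * e for w, e in zip(weights, capped))


def voting_weight(error_history, floor=1e-10):
    """Voting power: logarithm of the inverse of the weighted error average."""
    return math.log(1.0 / max(weighted_error_average(error_history), floor))


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total voting weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical ensemble of three classifiers with their error histories.
histories = [[0.1, 0.1], [0.4, 0.4], [0.45]]
votes = ["Up", "Down", "Down"]
final_label = weighted_majority_vote(votes, [voting_weight(h) for h in histories])
```

In this example the single accurate classifier outvotes two weak ones, which is the behavior that separates weighted majority voting from a simple majority vote.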
The IBM_EOD_Direction dataset contains stock data of the IBM company, where we consider open, high, low, close, volume and rate of change in closing price to find the stock index movement (Up, Down) for the classification task. For training, data from the period 2-Jan-2000 to 13-April-2016 (3999 examples) is fetched, and for testing, data from the period 2-Jan-2001 to 13-April-2016 (3841 examples) is fetched using Google Finance.
2) IBM_EOD_Trading:
The IBM_EOD_Trading dataset contains stock data of the IBM company, where we consider open, high, low, close, volume and rate of change in closing price to find the Buy or Sell class for the stock data. For training, data from the period 2-Jan-2000 to 13-April-2016 (3999 examples) is fetched, and for testing, data from the period 2-Jan-2001 to 13-April-2016 (3841 examples) is fetched using Google Finance.

Table 2 reports the performance of Learn++.NSE and the versions of ENSDS_P in classifying stock trading (Buy, Sell) on the IBM_EOD_Trading dataset, where Naïve Bayes is used as the base classifier with different batch sizes and no pruning strategy.