A Critical Review on Adverse Effects of Concept Drift over Machine Learning Classification Models

Big Data (BD) plays a major part in the current computing revolution. Industries and organizations use insights from BD for Business Intelligence through Machine Learning Models (ML-Models). Deep Learning Models (DL-Models) have proven to be a better choice than Shallow Learning Models (SL-Models). However, the dynamic characteristics of BD introduce many critical issues for DL-Models, and Concept Drift (CD) is one of them. CD frequently appears in Online Supervised Learning environments, in which data trends change over time. The problem may worsen in the BD environment due to the veracity and variability factors. Because of CD, the accuracy of classification results degrades, which may make ML-Models inapplicable. Therefore, ML-Models need to adapt quickly to changes to maintain the accuracy of their results. Current solutions still need a substantial improvement in accuracy and adaptability to make ML-Models robust in a non-stationary environment. Consolidated information on this issue is not available in the existing literature. Therefore, in this study, we carry out a systematic critical literature review to discuss the Concept Drift taxonomy and to identify its adverse effects and the existing approaches to mitigate CD.


I. INTRODUCTION
State-of-the-art Big Data (BD) and Machine Learning (ML) are among the fundamental pillars of the 4th Industrial Revolution (IR 4.0). BD is generated from a variety of sources, including scientific research, finance, government, internet search, sensors, documents, images, audio, video, and others. The nature of BD is very complex due to its nonstationary characteristics (volume, velocity, variety, veracity and variability). ML approaches (specifically Deep Learning) are considered the main drivers for extracting intelligence from BD, but only in offline learning scenarios. In online learning scenarios, these approaches fail to maintain their accuracy. One nonstationary condition under which ML models lose accuracy in online learning is Concept Drift (CD) [1].

A. Taxonomy of Concept Drift
Concept Drift refers to the dynamic behaviour of data, in which the features of the data change over time [2]. It has been recognized as one of the most critical problems in Machine Learning (ML) for decades, for both traditional data and big data. Many assumptions in ML presume static data [3]. However, Concept Drift frequently occurs in Online Machine Learning scenarios, where these conditions change often. When new features appear in the data, ML models lose accuracy or may fail to classify or predict the correct output.
Notably, in Supervised Online ML, the model learns from the input and output features of data from one time span and is then expected to predict or classify the output (class category) of data from another time span. The features change between the two time spans for various reasons: the data format (variety), distribution (variability), or sources (complexity) may change over time. Concept Drift is also described as the classification boundary or clustering centers changing continuously as time elapses [4]. These conditions adversely affect the classification performance of the model. In the literature, CD is modeled on Bayesian decision theory for class output 'c' and input data 'X', as shown in eq (1):
P(c/X) = P(c) P(X/c) / P(X)    (1)
where P(c/X), P(c), P(X/c), and P(X) are the posterior, prior, conditional, and feature-based probabilities, respectively [3]. When P(c/X) changes and causes a shift in the class boundary or in the conditional probabilities (for example, the number of classes increases), the Concept Drift is referred to as Real Drift [5]. When P(X) changes (the feature-wise distribution of the data changes) because the existing data distribution is insufficiently or only partially represented (a new feature is added or some features are updated), it is called Virtual Drift [5]. One study also introduces Hybrid Drift, a condition in which changes in P(c/X) and P(X) occur consecutively [2], as shown in Fig. 1. A few studies discuss the possible drift patterns based on the frequency of drift: gradual drift (when the variety of concepts changes gradually), consecutive drift (when previous concepts reoccur), and sudden drift (when a concept changes or is substituted abruptly) [6], [7], as shown in Fig. 2.
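As a minimal illustration of eq (1) and of the Real, Virtual and Hybrid Drift conditions described above, the following sketch labels the change between two toy concepts. It is not taken from the cited studies: the function names, the two feature values ("low", "high"), the two classes ("A", "B") and all probabilities are hypothetical, chosen only to make the taxonomy concrete.

```python
def marginal_x(joint):
    """P(X): sum the joint probability P(X, c) over all classes."""
    return {x: sum(row.values()) for x, row in joint.items()}

def posterior(joint):
    """P(c/X) = P(X, c) / P(X), which is equivalent to eq (1)."""
    p_x = marginal_x(joint)
    return {x: {c: p / p_x[x] for c, p in row.items()} for x, row in joint.items()}

def differs(a, b, tol=1e-9):
    """True if two (possibly nested) probability tables differ."""
    for key in a:
        if isinstance(a[key], dict):
            if differs(a[key], b[key], tol):
                return True
        elif abs(a[key] - b[key]) > tol:
            return True
    return False

def drift_type(old_joint, new_joint):
    """Label the change between two concepts with the taxonomy in the text."""
    x_changed = differs(marginal_x(old_joint), marginal_x(new_joint))
    post_changed = differs(posterior(old_joint), posterior(new_joint))
    if x_changed and post_changed:
        return "hybrid"
    if post_changed:
        return "real"
    if x_changed:
        return "virtual"
    return "none"

# joint[x][c] = P(X = x, c); every number below is a made-up example value.
baseline = {"low": {"A": 0.40, "B": 0.10}, "high": {"A": 0.10, "B": 0.40}}
# P(X) moves from 0.5/0.5 to 0.8/0.2 while P(c/X) stays the same.
virtual = {"low": {"A": 0.64, "B": 0.16}, "high": {"A": 0.04, "B": 0.16}}
# P(X) stays at 0.5/0.5 but the class boundary for "low" flips.
real = {"low": {"A": 0.10, "B": 0.40}, "high": {"A": 0.10, "B": 0.40}}
# Both P(X) and P(c/X) change.
hybrid = {"low": {"A": 0.16, "B": 0.64}, "high": {"A": 0.04, "B": 0.16}}

for name, concept in [("virtual", virtual), ("real", real), ("hybrid", hybrid)]:
    print(name, "->", drift_type(baseline, concept))
```

The sketch works on the joint table rather than on P(c) and P(X/c) separately, because that makes it easy to hold P(c/X) fixed while P(X) moves.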
ML models are trained to classify according to input and output features with a predefined number of classes. If the feature or class-wise distribution changes over time, ML models face a substantial degradation in performance, because they have no prior knowledge of these changes. However, if the models are retrained on newly arrived data, they cannot retain knowledge of the recurrent context (previous training knowledge), as shown in Fig. 3.
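The two effects described above (degradation after drift, and loss of the earlier concept after retraining) can be reproduced with a deliberately tiny example. This is a sketch under stated assumptions: the one-parameter threshold classifier, the integer feature values 0 to 9 and the two class boundaries (3 and 6) are hypothetical and stand in for a real model and dataset.

```python
def make_concept(boundary, xs=range(10)):
    """Label each x with 1 if it lies above the concept's class boundary."""
    return [(x, int(x > boundary)) for x in xs]

def accuracy(threshold, data):
    """Share of samples for which the rule 'x > threshold' gives the right label."""
    hits = sum(int(x > threshold) == y for x, y in data)
    return hits / len(data)

def train(data):
    """Fit the one-parameter classifier: keep the threshold with the best accuracy."""
    candidates = sorted({x for x, _ in data})
    return max(candidates, key=lambda t: accuracy(t, data))

concept_a = make_concept(boundary=3)   # data before the drift
concept_b = make_concept(boundary=6)   # data after the class boundary shifts

old_model = train(concept_a)
print("old model on old concept:", accuracy(old_model, concept_a))
print("old model after drift:   ", accuracy(old_model, concept_b))

new_model = train(concept_b)           # retrain on newly arrived data only
print("new model on new concept:", accuracy(new_model, concept_b))
print("new model on old concept:", accuracy(new_model, concept_a))
```

The static model loses accuracy once the boundary moves, and the retrained model recovers on the new concept but performs worse on the old one, which is the recurrent-context problem in miniature.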

B. Causes and Mitigation of Concept Drift in the Classification Problem
According to Zilobite I. [8], the classification problem is determined through the prior probabilities P(ci) and the class-conditional probabilities P(x/ci), without considering the Concept Drift scenario. Zilobite I. defines a fixed set of class prior probabilities and class-conditional probabilities, in which "S" represents the data source at a given time.
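One way to write the set described above is given below. The notation is an assumption, since the original expression is not reproduced here: k denotes a hypothetical number of classes, and P(X | c_i) is the class-conditional probability written as P(x/ci) in the text.

```latex
S = \bigl\{ \bigl(P(c_1),\, P(X \mid c_1)\bigr),\; \bigl(P(c_2),\, P(X \mid c_2)\bigr),\; \ldots,\; \bigl(P(c_k),\, P(X \mid c_k)\bigr) \bigr\}
```

Under this reading, a stationary source keeps every pair in S fixed over time, and Concept Drift corresponds to at least one pair changing.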
Concept Drift is then formulated on Bayesian decision theory, through the posterior P(c/X). The fundamental causes of the possible change in the source data (S) due to P(c/X) are presented in [3] [9]. However, the mitigation strategies are not identical for each type and frequency pattern of Concept Drift (Class, Virtual, Real, continuous, gradual, and sudden or abrupt). For example, we would probably like to reuse a previously trained classifier if changes reappear (continuous drift pattern), or we may want to stop the classifier immediately and retrain it on the newly detected changes (abrupt drift pattern). Thus, providing a simple approach that handles the various types of CD is critical.
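The reuse-or-retrain choice in the example above can be sketched as a small model pool. This is an illustrative assumption rather than a method from the surveyed studies: the threshold learner, the `reuse_at` cut-off of 0.9 and the three synthetic windows are hypothetical.

```python
def accuracy(model, window):
    """Share of (x, y) pairs in the window that the model labels correctly."""
    return sum(model(x) == y for x, y in window) / len(window)

def fit_threshold(window):
    """Hypothetical base learner: a rule 'x > t' whose t is found by search."""
    xs = sorted({x for x, _ in window})
    best = max(xs, key=lambda t: accuracy(lambda x: int(x > t), window))
    return lambda x: int(x > best)

def adapt(pool, window, reuse_at=0.9):
    """Reuse a stored classifier if one still fits the window, else retrain.

    Returns (model, action), where action is 'reuse' or 'retrain'.
    """
    if pool:
        candidate = max(pool, key=lambda m: accuracy(m, window))
        if accuracy(candidate, window) >= reuse_at:
            return candidate, "reuse"
    model = fit_threshold(window)
    pool.append(model)  # keep the model in case this concept reappears
    return model, "retrain"

# Three data windows: concept A, then concept B, then concept A again.
window_a = [(x, int(x > 3)) for x in range(10)]
window_b = [(x, int(x > 6)) for x in range(10)]

pool, actions = [], []
for window in (window_a, window_b, window_a):
    _, action = adapt(pool, window)
    actions.append(action)
print(actions, "models stored:", len(pool))
```

Keeping old models costs memory but avoids retraining when a concept reoccurs; a pool that never discards models would eventually need a pruning rule, which this sketch leaves out.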
In recent studies, researchers propose the term "adaptability" for avoiding performance degradation due to Concept Drift in ML models. Adaptability refers to the capability of an ML model to adjust itself dynamically as the data changes. This approach allows ML models to tune or self-regulate to accommodate a new concept, and it has the potential to eliminate performance degradation through its dynamic capabilities. However, because of recurrent context adjustment, the practical implementation of adaptability raises many critical and fundamental questions for researchers. For example, adaptation in Machine Learning can be categorized as context replacement and recurrent context. Context replacement means that the ML model learns a new concept and forgets the previous one. This is simple and can be incorporated easily through basic adaptability features. Recurrent context means that the ML model learns a new concept while keeping the previous one, which is very challenging. One of the challenges, for example, is how much accuracy the ML model can achieve on a new concept and how well it can retain the old one with minimum re-training on old data. In the literature, the adaptability factor is categorized as semi-adaptive (fundamental dynamic changes at the classifier level) and fully-adaptive (self-regulatory and more autonomous approaches), which are defined in detail in Section II. Only a few research studies provide a practical implementation of recurrent context through the adaptability feature, and these studies are specific to a type or frequency pattern of Concept Drift or to a type of data stream. In contrast, the studies that define a framework for Concept Drift adaptation in machine learning models present a generalized framework for classifiers.
This generalized framework encompasses making proper assumptions about future data sources, detecting all possible change patterns, tuning the classifier parameters or selecting the appropriate strategy (training, testing or feature manipulation) for the specific type of Concept Drift, and selecting the optimal model (the model most appropriate for the target function) with the minimum error rate. Through a dynamic mechanism in this framework, a classifier can evolve regularly and maintain its performance after any Concept Drift, as shown in Fig. 4 [10]. However, contributions from the research community to mitigate the adverse effects of CD on Big Data classification models are rarely reported [2]. Furthermore, most existing approaches are based on Shallow Learning models (ELM, SVM, and others) and hence cannot reasonably handle the non-stationary features of BD in the Online Supervised Learning (OSL) scenario [1]. Moreover, several studies urge adopting these dynamic changes in the classifier through self-regulatory mechanisms [11] [12] [13].
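The detect-then-adapt part of such a framework can be sketched as a test-then-train loop. The detector below is a deliberately simple stand-in (a fixed limit on the recent error rate) and is not one of the detectors used in the cited studies; the class name, the window sizes, the limit of 0.2 and the synthetic stream are all assumptions made for illustration.

```python
from collections import deque

class WindowDriftDetector:
    """Signals drift when the error rate over the last `window` predictions
    exceeds `limit`. A simplified placeholder for the change-detection stage."""

    def __init__(self, window=20, limit=0.2):
        self.errors = deque(maxlen=window)
        self.limit = limit

    def update(self, was_wrong):
        """Record one prediction outcome; return True if drift is signalled."""
        self.errors.append(1 if was_wrong else 0)
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > self.limit

    def reset(self):
        self.errors.clear()

def refit(samples):
    """Pick the threshold t for the rule 'x > t' that best fits the samples."""
    candidates = sorted({x for x, _ in samples})
    return max(candidates, key=lambda t: sum(int(x > t) == y for x, y in samples))

def run(stream, threshold, window=20, limit=0.2, refit_on=10):
    """Test each sample first, then adapt the classifier when drift is signalled."""
    detector = WindowDriftDetector(window, limit)
    recent = deque(maxlen=refit_on)       # newest samples, used for retraining
    drift_points = []
    for i, (x, y) in enumerate(stream):
        wrong = int(x > threshold) != y   # prediction with the current model
        recent.append((x, y))
        if detector.update(wrong):
            drift_points.append(i)
            threshold = refit(recent)     # tune the classifier parameter
            detector.reset()              # start monitoring the new concept
    return threshold, drift_points

# Synthetic stream: the class boundary moves from 3 to 6 at sample 100.
stream = [(i % 10, int(i % 10 > 3)) for i in range(100)]
stream += [(i % 10, int(i % 10 > 6)) for i in range(100, 200)]

final_threshold, drift_points = run(stream, threshold=3)
print("final threshold:", final_threshold, "drift signalled at:", drift_points)
```

Retraining on a shorter buffer than the detection window is a design choice in this sketch: it keeps samples from the old concept out of the refit, at the cost of training on less data.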
Shallow Learning approaches (for example, Extreme Learning Machine (ELM), Support Vector Machine (SVM), Multi-Layer Perceptron Neural Network (MLP NN), Hidden Markov Model, and others) handle classification and regression problems efficiently on structured data [14][15]. These approaches do not perform well on complex unstructured data (Big Data) [1]. Deep Learning algorithms, however, have been found to be a better choice for handling Big Data streams and extracting value with more accuracy than conventional approaches [3]. Besides, some studies have ascertained through comprehensive experiments that Deep Learning approaches are appropriate for learning from BD and have urged researchers to explore new means of handling the CD issue in OSL. One study argued that finding new means to handle Concept Drift in the context of Big Data and OSL is an essential task for the future of Machine Learning [10].
In the literature, CD is mostly handled through different configurations of the Extreme Learning Machine (ELM). These configurations are based on either a single classifier or an ensemble classifier [8] [11] [16] [17]. An ensemble classifier is considered a more effective solution than a single classifier for improving classification performance (in terms of accuracy) after CD. Nevertheless, the ensemble approach does not adapt to numerous drift cases [18] [19]; such drift may be handled through the adaptive nature of classifiers.
A few recent studies concentrate on adaptive learning techniques for CD mitigation using an ELM-based single classifier [2] [15] [19] or an ensemble classifier [20] [21] [22]. However, all these solutions fall into the semi-adaptive category (they do not implement fully autonomous learning behavior). For example, Incremental Data Stream ELM uses an incremental approach to train the classifier. In this approach, the number of neurons in the hidden layers and the selection of the activation layer are dynamic, which enhances the performance of the model. However, this approach handles stream data for the gradual drift scenario only [22].
A Dynamic-ELM model uses ELM as a first classifier and adopts an online learning approach to train the double-hidden-layer structure of the ELM. Adding more hidden layers improves the generalization characteristics of the classifier. This approach can mitigate CD in a short time; however, the performance of the model suffers due to the fast processing speed [21].
The Meta-Cognition Online Sequential Extreme Learning Model (MOSELM) was proposed to address class imbalance (binary and multiclass) and Concept Drift in online data classification. This model is the first to combine Meta-Cognition principles with the Online Sequential Extreme Learning Machine (OSELM), but it handles Real Drift only [23]. A new adaptive windowing approach has been proposed to improve adaptability, again for Real Drift only [15]. The Online Pseudo Inverse Update Method (OPIUM) is based on Greville's method, an incremental solution for computing the pseudo-inverse of a matrix. OPIUM tackles real Concept Drift with a discriminant function boundary shift, in streaming data only [19].
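To make the idea of an incremental pseudo-inverse solution more concrete, the sketch below shows the textbook recursive least-squares step, which is the form of update that online sequential ELM variants apply to their output weights one sample at a time. It is not the OPIUM or MOSELM code from the cited studies: the function name, the regularized starting matrix (1e6 times the identity) and the five noiseless samples are assumptions chosen for illustration.

```python
def rls_update(beta, P, h, t):
    """One sequential least-squares step for a single sample.

    beta: current output weights (list of n floats)
    P:    current inverse correlation matrix (n x n list of lists, symmetric)
    h:    feature row for the new sample, e.g. hidden-layer outputs (n floats)
    t:    target value for the new sample
    """
    n = len(h)
    Ph = [sum(P[i][j] * h[j] for j in range(n)) for i in range(n)]  # P h^T
    denom = 1.0 + sum(h[i] * Ph[i] for i in range(n))               # 1 + h P h^T
    # P <- P - (P h^T)(h P) / (1 + h P h^T); with symmetric P, h P = (P h^T)^T
    P_new = [[P[i][j] - Ph[i] * Ph[j] / denom for j in range(n)] for i in range(n)]
    error = t - sum(h[i] * beta[i] for i in range(n))               # t - h beta
    gain = [sum(P_new[i][j] * h[j] for j in range(n)) for i in range(n)]
    beta_new = [beta[i] + gain[i] * error for i in range(n)]        # beta + P h^T error
    return beta_new, P_new

# Noiseless samples generated from t = 2*h1 + 3*h2 (made-up example data).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 5.0),
           ([2.0, 1.0], 7.0), ([1.0, 3.0], 11.0)]

beta = [0.0, 0.0]
P = [[1e6, 0.0], [0.0, 1e6]]   # large initial P acts as a very weak regularizer
for h, t in samples:
    beta, P = rls_update(beta, P, h, t)
print("estimated weights:", [round(b, 4) for b in beta])
```

Each step costs O(n^2) and never revisits earlier samples, which is why updates of this kind suit streaming data; the weights recovered here approach the generating values 2 and 3.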
A recent study proposed an adaptive ML model (AOSELM) that uses a single-classifier approach based on the Online Sequential Extreme Learning Machine (OSELM) [23] and the Constructive Sequential Extreme Learning Machine (COSELM) [24] to handle Concept Drift in classification and regression problems. AOSELM is a simple solution that uses matrix adjustment. Its results were satisfactory for Real Drift but not for Virtual and Hybrid Drift, and it did not yield better output on real data. The results of a single classifier may not exceed those of an adaptable ensemble or full-batch approach because of its shared weight changes [2]. Table I presents the notable contributions (concept drift adaptation models) and highlights their pitfalls.

III. RESULT ANALYSIS AND DEDUCTION
Through the comprehensive literature analysis, we can safely state that the performance degradation (in terms of accuracy) of Big Data classification models due to Concept Drift is still a critical problem. The existing solutions can be categorized as follows: 1) Non-adaptive and semi-adaptive (single classifier based) SL approaches.
The existing solutions are either limited to a specific type of CD, or their results are biased towards specific CD conditions or datasets. In addition, classification accuracy is not reasonably retained after CD handling for complex datasets (CIFAR 10), as shown in Table II. Table II presents simulations of the most prominent Big Data classification models under certain CD conditions. The experiments were carried out to validate the problem formulation of the Concept Drift issue. In these experiments, we used the MNIST [29], Not-MNIST, and CIFAR 10 [30] datasets. The MNIST dataset is recognized as the benchmark dataset for the classification problem, whereas Not-MNIST is an extension of the MNIST dataset that contains some foolish images to provide a more challenging data environment. CIFAR 10 is a dataset of color images. The experiments were run in MATLAB R2018a with the Deep Learning Toolbox [31] on an NVIDIA GeForce GTX 950 with 768 GPU cores and 2 GB RAM. Cross-validation and the holdout method are used for evaluation, and the testing accuracy is measured after a specific type of CD. Interestingly, the results show that ACNNELM is better at handling CD for the MNIST and Not-MNIST datasets, whereas CNN gives promising testing accuracy on the CIFAR 10 dataset (color images). In our previous study [32], we also performed several experiments to validate the Concept Drift issue.

IV. CONCLUSION
The Concept Drift issue can be handled by improving ML model accuracy and enhancing the adaptability factor. Adaptability describes how capable an ML model is of retaining the knowledge from its previous training data. The ultimate goal of improving the ML model and its adaptability is to handle CD, and adaptability can also reduce computational processing and training time. According to the literature, adaptability can be classified into two types: semi-adaptive (less adaptability) and self-regulatory (a more general aspect of autonomous learning). Current solutions for Concept Drift handle either image data or stream data. However, these classification solutions are only non-adaptive or semi-adaptive, which restricts them from utilizing the complete essence of the adaptability factor and from handling Concept Drift in the Big Data environment. Some possible research directions to overcome CD are: to investigate and formulate the relationship between Concept Drift (Big Data) and existing Machine Learning models; to quantify and characterize Concept Drift for Big Data streams; and to propose a framework for fully adaptive models for Big Data streams.
Current solutions improved classification accuracy by working on only a few parameters. For example, the best ML model for Big Data stream classification, ACNNELM, worked on six parameters, which are:
1) Training data composition
2) Number of kernels
3) Number of layers
4) Type of activation function
5) Number of iterations
6) Variable learning rate
However, identifying the latest critical parameters (for example, hyperspectral features) of advanced ML models (Deep Learning) and modeling the matrix to measure the adaptability factor are potential research directions.