A Scalable Machine Learning-based Ensemble Approach to Enhance the Prediction Accuracy for Identifying Students at-Risk

Abstract: Among educational data mining problems, the early prediction of students' academic performance is one of the most important tasks, because it allows timely and appropriate support to be provided to the students who need it. Machine learning techniques are an important tool for predicting low performers in educational institutions. In the present paper, five single supervised machine learning techniques are used: Decision Tree, Naïve Bayes, k-Nearest-Neighbor, Support Vector Machine, and Logistic Regression. To analyze the effect of an imbalanced dataset, the performance of these algorithms is evaluated with and without resampling methods, namely Synthetic Minority Oversampling Technique (SMOTE), Borderline SMOTE, SVM-SMOTE, and Adaptive Synthetic (ADASYN). The random hold-out method was used for model validation and GridSearchCV for hyper-parameter tuning. The results indicate that Logistic Regression is the best performing classifier on every balanced dataset generated by the four resampling techniques, achieving its highest accuracy of 94.54% with SMOTE. Furthermore, to improve the prediction results and to make the model scalable, the most suitable classifier was integrated with the help of bagging, which achieved an accuracy of 95.45%.


I. INTRODUCTION
Due to digitization and the use of technology in education, a large amount of educational data is now available. Educational Data Mining helps to analyze this data and extract useful information from it, for example by selecting the factors that affect students' performance or by predicting that performance. As students are the future of any nation, predicting their academic success is an important and beneficial task. This may be achieved with the help of educational data mining, which utilizes various machine learning techniques.
Although the field of Educational Data Mining (EDM) is old and its definition was given by Fayyad et al. [1] in 1996, EDM emerged as a convincing research area after the establishment of the annual International Conference on Educational Data Mining and the Journal of Educational Data Mining in 2008 [2]. After that, Baker [3] identified the application of data mining in education to discover models for predicting students' performance by using the methods of prediction, clustering, relationship mining, and discovery with models. Among the applications of EDM, detecting student failure at an early stage has been an appealing research topic due to its social impact. Predicting which students are at risk of dropping out of an institute or school is difficult because of the large number of factors that may influence academic performance. Thus, it is important to predict low-performing students at an early stage and with high accuracy, and to identify the factors that may affect their performance.
To achieve this goal, the present study has three research objectives: (i) to identify the influential features by using a filter-based feature selection technique; (ii) to identify the best performing classifier by comparing various single supervised machine learning techniques, viz., Decision Tree, Naïve Bayes, k-Nearest Neighbor, Logistic Regression, and Support Vector Machine, with various resampling techniques such as random oversampling, SMOTE, Borderline SMOTE, SVM-SMOTE, and ADASYN; and (iii) to enhance the prediction rate of students at risk by using an ensemble model that integrates the most suitable data mining technique.
In the rest of the paper, the work related to the present study is reviewed in Section II. The methodology used in the present work is explained in Section III. In Section IV, the obtained results are analyzed and discussed. Finally, the conclusion and future work are given in Section V.

II. RELATED WORK
In the past, various review studies have been performed on educational data mining [4,5], and many researchers have worked on identifying the factors that deteriorate the academic performance of students. Ahmed et al. [6] selected nine attributes (department, attendance, high school degree, mid-term marks, student participation, lab test grades, assignment scores, seminar performance, and homework) to predict the final grade and to generate a rule set with a Decision Tree. Tomasevic et al. [7] compared the performance of several data mining algorithms using past student performance, student engagement, and student demographic data. They concluded that students' engagement and past performance data have a significant influence on students' performance, while demographic attributes have only a slight impact. Further, Verma and Yadav [8] used the cross-tabulation method and the chi-square test to analyze the effects of different attributes, such as background, academic, social, and psychological characteristics, on students' academic performance. They concluded that students' academic and background attributes were the most influential factors affecting students' grades.
With knowledge of the factors that influence students' performance, predictions can be made with the help of data mining algorithms to identify students at risk. To analyze students' performance, Asif et al. [9] applied decision tree and clustering techniques to a dataset of 210 students that contained pre-admission marks and all subjects' marks, and found that the pre-university marks and the subjects' marks in the first and second years had an impact on students' final-year marks. Hamoud et al. [10] applied Bayesian classifiers, namely Naïve Bayes and Bayes Net, to a dataset of 161 students and found that Naïve Bayes performed better in predicting the students' performance. Costa et al. [11] compared the effectiveness of different educational data mining techniques for predicting students' performance in introductory programming courses and concluded that the support vector machine outperformed the others. Moreover, Ha et al. [12] applied rule-based learners, neural-based learners, and statistical-based learners (Naïve Bayes and Support Vector Machine) to students' datasets consisting of personal and past academic information to predict students' performance. In their experiment, neural-based learners and Naïve Bayes achieved the highest accuracy of 86.19%.
A suitable approach to feature selection and to handling imbalanced classes may enhance the prediction accuracy of machine learning models. Thammasiri et al. [13] compared random oversampling and SMOTE balancing methods along with four popular data mining models (logistic regression, decision trees, neural network, and support vector machine) to assess students' performance. In their results, Support Vector Machine (SVM) achieved the highest accuracy of 90.24% with SMOTE. Mueen et al. [14] applied Naïve Bayes, Neural Network, and Decision Tree to students' data containing general, academic, and forum-related variables, together with feature selection and the SMOTE oversampling method to address the imbalanced data problem, and found that Naïve Bayes performed best with 86% accuracy. Ghorbani and Ghousi [15] compared different resampling methods, viz., Borderline SMOTE, Random Over Sampler, SMOTE, SMOTE-ENN, SVM-SMOTE, and SMOTE-Tomek, by evaluating the performance of various classifiers; Random Forest obtained the highest accuracy of 81.27% with SVM-SMOTE. Further, Ghavidel et al. [16] addressed the problem of imbalanced data by combining SVM-SMOTE (an over-sampling technique) and Edited-Nearest-Neighbor (an under-sampling technique) while predicting disease mortality. Recently, Desiani et al. [17] applied k-Nearest Neighbor (k-NN), Artificial Neural Network (ANN), and C4.5 to students' educational background records, using SMOTE to balance the dataset. The balanced dataset increased the prediction accuracy, and the maximum accuracy, achieved with k-NN, was 83.71%.
Another aspect that enhances the prediction accuracy is the appropriate use of ensemble models. Teoh et al. [18] used feature selection and SMOTE oversampling techniques and then applied various ensemble machine learning methods, namely stacking, boosting, and bagging. In their findings, AdaBoost has achieved a maximum accuracy of more than 90%.
Although there are several studies on predicting students' academic performance, a study that considers all categories of variables, i.e., background, academic, social, and psychological, and predicts students at risk at an early stage with adequate accuracy is lacking. Also, a prediction based on a single classifier does not transfer well from one context to another: a classifier that gives the highest prediction accuracy for a particular dataset may not do so for a different dataset. Thus, the aim of the present study is to identify low performers at an early stage with a higher prediction rate by using a scalable approach.

III. METHODOLOGY
The main objective of the present paper is to predict the academic performance of students with higher accuracy. To achieve this goal, different single supervised machine learning algorithms were applied with and without data balancing, and finally, by comparing the results, a model was constructed to enhance the prediction accuracy. The methodology applied in the present work consists of the following steps:
• Dataset preparation.
• Data preprocessing, including data transformation, feature selection, and data balancing.
• Identification of the best classification technique by comparing the results of classification models applied to the preprocessed data.
• Construction of a scalable ensemble model with the help of the best classification technique.
• Result evaluation of the proposed ensemble model.
The workflow of the proposed methodology is given in Fig. 1.

A. Dataset
To make the data versatile, it was collected from two engineering colleges situated in different regions of India (the north and the south). The sample comprises 550 engineering students from Bipin Tripathi Kumaon Institute of Technology, Dwarahat, Uttarakhand, and Cochin University of Science & Technology, Trivandrum, Kerala. The dataset covers background, past academic, social, and psychological factors through 30 different attributes, of which three (roll number, name, and branch) are used for identification purposes only and do not play any role in the prediction of low performers. So, only 27 attributes were used for the present work, with first-semester GPA as the output variable. For these attributes, data was collected online with the help of a multiple-choice questionnaire created with a third-party tool, Google Forms. As the aim of the paper is to identify the students with the highest risk of dropping out of college, the output attribute is divided into only two categories, i.e., low performers and high performers, based on the first-semester grade point of the students.

B. Data Preprocessing
Before any machine learning model is applied to the dataset, the data should be preprocessed so that the model can be trained efficiently. In the present study, the dataset is complete and free from noise, so there is no need to handle missing data or outliers. Preprocessing therefore consisted of data transformation, feature selection, and data balancing.

1) Data transformation: In the present study, all the features were categorical except the students' GPA, which was initially in numerical form. So, GPA was generalized into two categorical values, i.e., "class A (high performer)" and "class B (low performer)". Finally, the categorical variables were encoded into a format suitable for the machine learning models.
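The transformation described above can be sketched as follows. This is only an illustration: the GPA cut-off separating class A from class B and the encoding scheme are not stated in the text, so the `threshold` argument and the ordinal (integer) encoding are assumptions, as are the example category values.

```python
def binarize_gpa(gpa, threshold):
    """Map a numerical GPA to class 'A' (high performer) or 'B' (low performer).

    The threshold is a hypothetical parameter; the paper does not give its value.
    """
    return "A" if gpa >= threshold else "B"


def ordinal_encode(values):
    """Replace each category by an integer code (codes follow sorted category order)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


# Hypothetical records: GPA values and the medium of previous study.
labels = [binarize_gpa(g, threshold=6.0) for g in (7.5, 5.2, 6.0)]
codes, mapping = ordinal_encode(["Hindi", "English", "Hindi"])
```

Ordinal codes impose an artificial order on nominal categories; one-hot encoding is a common alternative when that order would mislead the classifier.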
2) Feature selection: Feature selection is an important part of a student performance prediction model for two main reasons:
• The main purpose of predicting students' academic performance is to provide timely support to low-performing students in the areas where they are lacking. Suitable corrective measures can be taken only after identifying the attributes that have a significant impact on the output variable, i.e., students' academic performance.
• With the help of feature selection, irrelevant attributes may be removed from the data without losing reliability in classification. The resulting dimensionality reduction raises the processing speed, and hence the classifier can learn faster.
There are three main feature selection techniques: manual selection based on pedagogical theories or expert experience; filter-based selection; and wrapper-based selection [19]. In the present study, as all the attributes were categorical, a filter-based feature selection technique, namely the chi-square test, was used to calculate a p-value for each attribute [8]. Attributes with a p-value of less than 0.01 were taken to have a highly significant association with the students' grades.
3) Data balancing: Data balancing is the preprocessing step in which the class distribution is made equal so that the classifier does not simply assign every new sample to the majority class. The distribution of "class A" and "class B" in the present study is shown in Fig. 2. The figure shows that the dataset contains more samples from "class A" (66%) than from "class B" (34%). A previous study [20] states that a dataset is called imbalanced if the minority class makes up less than 35% of it; hence the dataset of the present study is imbalanced to some extent. There are three main types of resampling techniques that may be used to balance a dataset: oversampling, under-sampling, and hybrid sampling [15]. Due to the limited size of the dataset, only oversampling techniques were used and compared in the present study: Synthetic Minority Oversampling Technique (SMOTE) [21], Borderline SMOTE [22], SVM-SMOTE [23], and ADASYN [24].
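As an illustration of the chi-square filter, the sketch below computes the textbook test of independence between one categorical attribute and the class label from a contingency table. The counts are hypothetical, and the closed-form p-value holds only for one degree of freedom (a 2x2 table). The computation performed by a particular library routine may differ in detail from this textbook form.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    Rows are attribute categories, columns are classes; cells hold observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_dof1(stat):
    """Upper-tail probability of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: rows = two categories of an attribute, columns = class A, class B.
stat, dof = chi_square_statistic([[30, 10], [10, 30]])
p = p_value_dof1(stat)
keep_attribute = p < 0.01  # selection rule used in the text
```

For tables with more than one degree of freedom, the p-value needs the general chi-square survival function, which a statistics library would normally supply.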
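The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration on already-encoded numeric vectors, not the implementation used in the study; the function name, the value of `k`, and the sample points are assumptions.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Indices of the other minority samples, sorted by squared distance to the base.
        others = [i for i in range(len(minority)) if i != base_idx]
        others.sort(key=lambda i: sum((a - b) ** 2 for a, b in zip(base, minority[i])))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical, already-encoded minority samples ("class B") with two features.
class_b = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = smote_sketch(class_b, n_new=4)
```

Borderline SMOTE, SVM-SMOTE, and ADASYN build on the same interpolation step and differ mainly in which minority samples are chosen as the base and how many synthetic samples each one receives.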

C. Machine Learning Techniques
There are different types of classification models that may be used to predict students' academic performance. In the present study, five single supervised machine learning models have been applied: Decision Tree [25], Naïve Bayes [9,26], k-Nearest-Neighbor [27], Support Vector Machine [28], and Logistic Regression [29]. To achieve the best performance of these models, their hyper-parameters were set with the help of "GridSearchCV", which evaluates the candidate parameter combinations and returns the best one [30]. These combinations of parameters are listed in Table I.
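The principle behind a cross-validated grid search can be sketched in plain Python: every combination in a parameter grid is scored by k-fold cross-validation and the best-scoring combination is kept. The small k-NN classifier, the grid, the fold assignment, and the toy data below are all hypothetical stand-ins chosen to keep the sketch self-contained; they are not the settings of Table I.

```python
from itertools import product


def knn_predict(X_train, y_train, x, k, p):
    """Majority label among the k training points closest to x (Minkowski order p)."""
    def dist(a, b):
        return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)
    nearest = sorted(range(len(X_train)), key=lambda i: dist(X_train[i], x))[:k]
    labels = [y_train[i] for i in nearest]
    return max(set(labels), key=labels.count)


def cv_accuracy(X, y, params, n_folds=3):
    """Mean accuracy over interleaved folds: sample i belongs to fold i % n_folds."""
    scores = []
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        test = [i for i in range(len(X)) if i % n_folds == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        hits = sum(knn_predict(X_tr, y_tr, X[i], **params) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds


def grid_search(X, y, param_grid, n_folds=3):
    """Score every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(X, y, params, n_folds)
        if score > best_score:  # ties keep the first combination tried
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical, well-separated two-class data.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0),
     (5, 5), (5, 6), (6, 5), (6, 6), (5, 7), (7, 5)]
y = [0] * 6 + [1] * 6
best_params, best_score = grid_search(X, y, {"k": [1, 3], "p": [1, 2]})
```

The cost grows with the product of the grid sizes times the number of folds, which is why grids are usually kept to a few values per parameter.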

D. Model Validation and Result Evaluation
Model validation is used to check the effectiveness of the model on independent data. In the present study, the random hold-out method was used for model validation, in which 80% of the data was used for training and the remaining 20% was reserved for testing.
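A random 80/20 hold-out split can be sketched as follows. The seed and the function name are illustrative assumptions, and the text does not say whether the split was stratified by class, so this sketch uses a plain shuffle.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and reserve the last test_fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]


# 550 record indices, matching the sample size reported for the dataset.
train_ids, test_ids = holdout_split(range(550))
```

Applied to 550 records, this split yields 440 training and 110 test records, so each test record corresponds to roughly 0.91 percentage points of accuracy.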
Furthermore, the performance of all the machine learning techniques was evaluated in terms of accuracy, precision, recall, and f1-score. These performance metrics are given as follows:
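In their standard form, these metrics are defined from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for the class being evaluated. The notation below is assumed here rather than taken from the text:

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{F1-score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```

Precision, recall, and f1-score are computed separately for class A and class B by treating each class in turn as the positive class.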

E. Construction of Ensemble based Classifier
Most of the previous studies [18,31-38] have shown that an ensemble model gives a higher prediction accuracy, so an ensemble model was constructed in the present study to enhance the prediction accuracy. For this, the best performing classifier was selected along with its most suitable resampling method, after comparing the results of the different single machine learning algorithms on the balanced datasets. Finally, to make the ensemble classifier, three instances of the best performing classifier were integrated with the help of bootstrap aggregation (bagging).
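Bootstrap aggregation can be sketched as follows: each base classifier is trained on a sample drawn with replacement from the training data, and the ensemble combines their predictions. The sketch uses a small gradient-descent logistic regression as the base learner. The aggregation rule is not stated in the text, so majority voting is an assumption here, as are the learning rate, the number of epochs, and the toy data.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit a logistic regression by batch gradient descent; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_one(model, x):
    w, b = model
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def fit_bagging(X, y, n_estimators=3, seed=0):
    """Train n_estimators base models, each on a bootstrap sample of (X, y)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sampling with replacement
        models.append(fit_logistic([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, x):
    """Majority vote over the base models (assumed aggregation rule)."""
    votes = sum(predict_one(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0


# Hypothetical one-feature data: 0 = low performer, 1 = high performer.
X = [(-3.0,), (-2.0,), (-1.0,), (1.0,), (2.0,), (3.0,)]
y = [0, 0, 0, 1, 1, 1]
ensemble = fit_bagging(X, y)
```

An odd number of base models, such as the three used here, avoids tied votes in a two-class problem.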

IV. RESULT AND DISCUSSION
In the present work, the whole experiment was carried out in the Python programming language, a powerful and user-friendly language for data scientists, with the help of libraries such as Pandas, Seaborn, and Scikit-learn. The first aspect of the present work is to find the influential attributes and to reduce the dimensionality with the help of a filter-based feature selection technique. For this purpose, the p-values were calculated for the different attributes using the chi2 function of the sklearn.feature_selection module and are shown in Table II. The table shows that, after applying the feature selection technique, the following 11 features are selected as influential features that affect students' academic performance: percentage in 10th standard, percentage in 12th standard, confidence, mathematics % in 12th standard, punctuality, curiosity, medium/language of previous study, category, father's highest qualification, mother's highest qualification, and mental stress.
After selecting the most influential attributes, the Decision Tree, Naïve Bayes, k-Nearest-Neighbor, Support Vector Machine, and Logistic Regression algorithms were applied to the dataset containing only the 11 selected attributes. The accuracy, precision, recall, and f1-score obtained for these algorithms are presented in Table III. From Table III, it may be observed that the highest accuracy, i.e., 92.72%, was achieved with Logistic Regression. In terms of recall and precision for classes A and B, no single algorithm can be declared best, because the highest precision and recall for the two classes are not achieved by the same algorithm. For example, Naïve Bayes has the highest recall for class B and the highest precision for class A, but the lowest recall for class A and the lowest precision for class B. In such situations, the f1-score may be taken as the evaluation criterion, as it is the harmonic mean of precision and recall. Logistic Regression achieved the highest accuracy and the highest f1-score for both classes "A" and "B", and hence it may be considered the best performing algorithm with the imbalanced dataset. As the dataset of the present study was imbalanced, four resampling techniques (SMOTE, Borderline SMOTE, SVM-SMOTE, and ADASYN) were then applied, and the performance of all the classifiers was evaluated with the balanced datasets.
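The per-class precision, recall, and f1-score discussed above can be computed from the true and predicted labels as in the following sketch. The label lists are hypothetical and are chosen so that one class has the higher precision while the other has the higher recall, which is the pattern that makes the f1-score the more useful criterion.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and f1-score with `label` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: one class A student is misclassified as class B.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "B", "B", "B"]
metrics_a = per_class_metrics(y_true, y_pred, "A")  # precision 1.0, recall 0.75
metrics_b = per_class_metrics(y_true, y_pred, "B")  # precision 2/3, recall 1.0
```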
The performance of the different models with the different resampling methods is shown in Table IV. From Table IV, it may be noted that the accuracy of the models, except for Logistic Regression, was not significantly improved when they were applied to the balanced dataset. This may be because, with balanced data, all the algorithms give equal weight to classes "A" and "B". So, although the accuracy of every classifier does not increase on the balanced datasets, the prediction accuracy may now be considered a trustworthy and sufficient measure of the model's performance. The performance of the various classifiers using the resampling methods SMOTE, Borderline SMOTE, SVM-SMOTE, and ADASYN is shown in Figs. 3-6, respectively. From these figures, it may be observed that Logistic Regression outperformed all the other classifiers on every balanced dataset generated with the four resampling techniques, and that the highest accuracy of 94.54% and the highest f1-score were achieved when SMOTE was used as the resampling method.
Finally, after evaluating the performance of all classifiers, the best performing classifier, namely Logistic Regression, was chosen to create the ensemble model in order to improve prediction accuracy. To make the ensemble model, three Logistic Regression classifiers were integrated with the help of bagging. The result of the proposed integrated model is shown in Table V. The proposed model achieved the highest accuracy of 95.45%, the highest prediction rate for low performers, and the highest f1-score for both classes while using SMOTE. It is pertinent to mention here that the accuracy of the proposed model increased by 1.82% after using the resampling technique SMOTE, while in the study of Desiani et al. [17], the average accuracy increased by 20.13%. A possible reason is that the dataset used in the present study has a small sample size and is not highly imbalanced. With a large sample size, the share of students at risk is likely to be significantly lower, and in such situations of highly imbalanced data, the present model may be quite useful.
The highest prediction accuracy achieved in the present study is 95.45%, which is greater than that of most of the previous studies [12-18]. Along with the enhanced prediction accuracy, the main advantage of the present work is that the proposed methodology is scalable from one context to another.

V. CONCLUSION AND FUTURE WORK
From the present work, it may be concluded that students' past academic performance (10th standard %, 12th standard %, and mathematics % in the 12th standard), their background (category, parents' qualification, and medium of previous study), and their psychological features (mental stress, confidence, curiosity, and punctuality) were the relevant attributes. Thus, to improve the academic performance of students, these factors may be considered as the focus points.
In the present study, all the classifiers used were able to predict students' outcomes with a reasonable accuracy of more than 80%. Among them, Logistic Regression was the best performing algorithm with the balanced as well as the imbalanced dataset. Further, the accuracy and the prediction rate for identifying both low performers and high performers improved when Logistic Regression was applied to the balanced dataset. The prediction accuracy was further enhanced with the use of an ensemble classifier in which three Logistic Regression classifiers (chosen because of their highest performance) were integrated with the help of bootstrap aggregation. The proposed integrated model achieved the highest accuracy of 95.45% and the highest precision and recall for low performers with the balanced dataset generated by the resampling technique SMOTE.
It should be noted that, with different datasets, different classifiers may give the highest prediction accuracy, and hence the methodology needs to be scalable to every situation. Thus, the main advantage of the present approach is its scalability to different datasets. Further, this approach may also be applied to other domains of data mining and machine learning to enhance prediction accuracy. The limitation of the present study is that the examined dataset has a small sample size and is only slightly imbalanced, so in the future, the proposed methodology should be applied to large and highly imbalanced datasets for the prediction of students' academic performance.