Hybrid Machine Learning Algorithms for Predicting Academic Performance

The large volume and complexity of data in educational institutions call for support from information technologies. To this end, many researchers have used machine learning to extract knowledge from educational databases and help students and instructors achieve better performance. In prediction models, the challenging task is to choose effective techniques that produce satisfactory predictive accuracy. Hence, in this work, we introduce a hybrid approach that combines principal component analysis (PCA) with four machine learning (ML) algorithms: random forest (RF), the C5.0 decision tree (DT), naïve Bayes (NB), and support vector machine (SVM), in order to improve classification performance by reducing misclassification. Three datasets were used to confirm the robustness of the proposed models. On these datasets, we used classification accuracy and root mean square error (RMSE) as the evaluation metrics of the proposed models. 10-fold cross-validation was used to evaluate the predictive performance. The proposed hybrid models produced very good prediction results, showing themselves to be strong prediction and classification algorithms.

Keywords: Student performance; machine learning algorithms; k-fold cross-validation; principal component analysis


I. INTRODUCTION
The poor performance of students in high school has become a concern for educators because it affects results in the national secondary exam and the transition to higher education. Mathematics is considered the basic background for many science subjects, and it strongly affects both the national exam and further study in higher education [1]. For example, students who are weak in mathematics are much more likely to fail the national diploma exams in Cambodia [2]. They later find it harder to choose a major and to persist through university. Early prediction and classification of student performance levels offers an early warning and a basis for improving poor performance, as well as for other managerial decisions. Hence, we aim to uncover the unknown behavior patterns that affect student performance. Various factors affect the performance of students in mathematics: school factors, domestic (home) factors, and personal (individual) factors. These factors were used as predictive features in predicting the achievement of students in mathematics.
In the age of the information revolution, the analysis of educational data through learning analytics, predictive analytics, educational data mining, and machine learning has become a hot area of research [3][4][5]. Supervised learning has been used to predict and classify students' performance and to analyze their learning behaviors in order to follow their progress in class. However, the challenging task is to find the algorithm that produces the most satisfactory results. Machine learning algorithms such as naïve Bayes, logistic regression, artificial neural networks, decision trees, random forest, support vector machines, and k-nearest neighbors have been widely used to analyze and predict academic performance [3][4][5][6][7][8][9][10][11][12][13][14]. The performance of each model varies from dataset to dataset, depending on the characteristics and quality of the data.
In classification problems, one cause of misclassification, and hence of reduced model performance, is poor data quality that disturbs the algorithms. Much of the literature has focused on dimensionality reduction (feature selection and feature extraction) to improve prediction and classification performance. In our work, we applied principal component analysis (PCA) as a feature extraction technique to transform the original dataset into a new, higher-quality dataset. We also used 10-fold cross-validation to evaluate the predictive performance of the models and to judge how they perform on new data, that is, the test samples. This paper proposes a novel hybrid machine learning approach for solving the classification problem. The proposed hybrid approach combines four baseline machine learning algorithms with 10-fold cross-validation and principal component analysis.

II. RELATED WORKS
Supervised machine learning requires an effective prediction model for solving prediction and classification problems. As mentioned in the Introduction, the educational data mining (EDM) field has studied different machine learning techniques to determine which of them predict the future performance of students with high accuracy [3][4][5]. Table I summarizes popular and state-of-the-art classification algorithms that have been used to predict student performance in educational datasets. Several works have sought the best algorithms for predicting future performance.
(ii) J48 was found to be the best prediction model.
[10] (i) NB, support vector machine (SVM), C4.5, and CART were used to build the learning model. (ii) SVM was the best model compared to NB, C4.5, and CART.
[11] (i) RF, multilayer perceptron (MLP), and ANN were used to classify student performance. (ii) The RF algorithm generated the highest accuracy.
[12] (i) J48, CART, and RF classifiers were proposed with principal component analysis (PCA). (ii) PCA-RF was found to generate the highest accuracy.
[13] (i) MLP, Radial Basis Function (RBF), SMO, J48, and NB were proposed in combination with PCA.
(ii) The C5.0 outperformed the other two boosting models.

III. MACHINE LEARNING ALGORITHMS
We propose hybrid models that combine machine learning algorithms with principal component analysis. We first build the baseline models. We then improve the performance of the baseline models with k-fold cross-validation. Lastly, we build the hybrid machine learning models by combining them with principal component analysis, as shown in Fig. 1.

A. The Baseline Models
Numerous effective machine learning approaches have been applied extensively in educational environments. Different purposes in educational settings call for different techniques, such as association rule mining, regression analysis, classification, and clustering [3]. Classification is a common machine learning technique used to predict the categories, or predefined classes, of a target variable. In this work, we examined several machine learning classifiers and selected four state-of-the-art methods that are widely used in predicting academic performance [3][4][5][6][7][8][9][10][11][12][13][14]. The four selected algorithms are support vector machine, naïve Bayes, the C5.0 decision tree, and random forest.

1) Support vector machine:
A support vector machine (SVM) is a classification algorithm that separates classes by means of a separating hyperplane [15]. The idea of SVM is to construct a line or hyperplane that separates the samples into classes, searching for the optimal hypersurface that separates each pair of classes. When the data are too complex to be separated linearly, they are mapped into a higher-dimensional space in which a linear separation is possible. For training samples $x_i$ with labels $y_i \in \{-1, +1\}$, the soft-margin SVM solves

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\left(w^{T}\phi(x_i)+b\right)\ge 1-\xi_i,\quad \xi_i\ge 0, \tag{1}$$

where $\phi(x)$ handles the nonlinear case by mapping $x$ into a higher-dimensional space. The parameters $w$, $b$, and $\xi_i$ represent the weight, bias, and slack variable, respectively, and $C > 0$ is the penalty constant.
The optimal hyperplane can be found using Lagrange multipliers, which transforms the problem into a quadratic problem in the function $W(\alpha)$ as in (2):

$$\max_{\alpha}\ W(\alpha)=\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i,x_j) \quad \text{subject to} \quad \sum_{i=1}^{n}\alpha_i y_i=0,\quad 0\le\alpha_i\le C, \tag{2}$$

where $K(x_i,x_j)=\phi(x_i)^{T}\phi(x_j)$ is the kernel function and $\alpha=(\alpha_1,\alpha_2,\ldots,\alpha_n)$ is the vector of Lagrange multipliers. The decision function can be written as:

$$f(x)=\operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i y_i K(x_i,x)+b\right). \tag{3}$$

Different kernel functions help SVM find maximum-margin hyperplanes and obtain the optimal solution. The most widely used kernels are the polynomial function, the sigmoid function, and the radial basis function. The radial basis function (RBF) kernel is one of the most commonly used kernels for multi-class classification because it requires fewer parameters than the polynomial kernel. Consequently, RBF is an appropriate choice of kernel, and this work applied RBF as the kernel function to obtain the optimal solution.

2) Naïve Bayes (NB):
NB is a simple but effective machine learning algorithm that is used in many classification problems, and it is a very attractive method for education research [16]. In the educational domain, the assumption of conditional independence is often violated because variables are interconnected; nevertheless, the NB classifier can tolerate strong dependence between the predictor variables. The NB classifier is a method based on Bayes' theorem that uses the posterior probability as its decision rule, and it has been especially popular in educational data mining. Suppose $D$ is a dataset of $n$-dimensional vectors $X=(x_1,x_2,x_3,\ldots,x_n)$ describing the attributes of each student, and suppose there are $k$ classes $C_1,C_2,\ldots,C_k$.
The NB classifier predicts that $X$ belongs to the class with the highest posterior probability. The NB classifier is founded on Bayes' conditional probability as in (4):

$$P(C_j\mid X)=\frac{P(X\mid C_j)\,P(C_j)}{P(X)},\qquad j=1,\ldots,k. \tag{4}$$

The probability $P(X)$ is a normalizing constant. Since $X=(x_1,x_2,\ldots,x_n)$ is the set of feature variables, under the strong assumption of independent predictors (4) can be rewritten as:

$$P(C_j\mid X)\propto P(C_j)\prod_{i=1}^{n}P(x_i\mid C_j). \tag{5}$$

The naïve Bayes classifier has many advantages: it is a very simple algorithm, has no parameters to optimize, is efficient for classification, and is easy to interpret.
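To make the decision rule concrete, the following minimal Python sketch estimates the class priors and per-feature likelihoods by frequency counts and then multiplies them under the independence assumption. The function names and the toy data are illustrative assumptions, not the implementation or data used in this study, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate priors P(C) and likelihoods P(x_i | C) by frequency counts."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # counts[c][i][v] = number of class-c rows whose i-th feature equals v
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            counts[c][i][v] += 1
    return priors, counts, class_counts

def posterior(row, priors, counts, class_counts):
    """Normalized posteriors: prior times the product of feature likelihoods."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(row):
            score *= counts[c][i][v] / class_counts[c]
        scores[c] = score
    total = sum(scores.values())  # plays the role of the normalizing constant P(X)
    return {c: (s / total if total > 0 else 0.0) for c, s in scores.items()}

def predict(row, priors, counts, class_counts):
    """Return the class with the highest posterior probability."""
    post = posterior(row, priors, counts, class_counts)
    return max(post, key=post.get)

# Hypothetical toy data: (study_time, attendance) -> outcome
rows = [("high", "good"), ("high", "good"), ("low", "poor"),
        ("low", "good"), ("high", "poor"), ("low", "poor")]
labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
model = train_nb(rows, labels)
print(predict(("high", "good"), *model))
```

In this toy data no failing student has `study_time == "high"`, so that single zero count drives the whole product for the class "fail" to zero.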
3) C5.0:
A decision tree is a non-parametric "white-box" model that is simple and effective for classification and regression tasks, and C5.0 is one of the best-known decision tree algorithms; it builds its structure in the form of a tree diagram [14]. The algorithm makes many of the decisions automatically using fairly reasonable defaults.
C5.0 is the successor of C4.5; it builds the tree structure from the training set using Shannon entropy. The algorithm purifies the subsets of samples via the concept of information entropy. The entropy, which measures the impurity of a sample set $S$ at a specific node $N$, is written as:

$$\mathrm{Entropy}(S)=-\sum_{i=1}^{c}P(c_i)\log_2 P(c_i). \tag{6}$$

The constant $c$ denotes the number of classes and $P(c_i)$ is the proportion of samples in class $i$. After obtaining this measure of purity, the algorithm needs to decide which feature to split on next. It calculates the change in homogeneity resulting from a split on each possible feature $F$; this quantity is called the information gain (IG), as shown in (7):

$$\mathrm{IG}(S,F)=\mathrm{Entropy}(S)-\sum_{v\in\mathrm{values}(F)}\frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v), \tag{7}$$

where $S_v$ is the subset of $S$ in which feature $F$ takes the value $v$. One complication is that a split can produce many partitions, so we also compute the split information:

$$\mathrm{SplitInfo}(S,F)=-\sum_{v\in\mathrm{values}(F)}\frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}. \tag{8}$$

Then, using the information gain in (7) and the split information in (8), we compute the information gain ratio:

$$\mathrm{GainRatio}(S,F)=\frac{\mathrm{IG}(S,F)}{\mathrm{SplitInfo}(S,F)}. \tag{9}$$

The C5.0 decision tree is one of the most popular machine learning algorithms and has been widely used in various applications.
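The splitting criteria described above can be sketched in a few lines of Python. This is only an illustration of entropy, information gain, split information, and gain ratio for one categorical feature; it is not the C5.0 implementation, and the function names and toy labels are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def partition(feature_values, labels):
    """Group the labels by the value the feature takes."""
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(feature_values, labels):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in partition(feature_values, labels))
    return entropy(labels) - after

def split_information(feature_values):
    """Entropy of the partition sizes themselves."""
    return entropy(feature_values)

def gain_ratio(feature_values, labels):
    si = split_information(feature_values)
    return information_gain(feature_values, labels) / si if si > 0 else 0.0

# Hypothetical toy node with four students
levels = ["good", "good", "poor", "poor"]
does_homework = ["yes", "yes", "no", "no"]   # separates the classes perfectly
seat_row = ["a", "b", "a", "b"]              # carries no information
print(gain_ratio(does_homework, levels), gain_ratio(seat_row, levels))
```

A feature that separates the classes perfectly reaches the maximum gain ratio for this node, while an uninformative feature scores zero, so the first would be chosen for the split.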

4) Random Forest (RF):
As its name indicates, the random forest is an algorithm that builds a forest of many trees. A random forest is a tree-based tool that grows many classification trees [12]. It is an ensemble classifier that combines several classification trees into a new classifier. Bootstrap aggregation, or bagging, is used to grow each tree. To classify a new example, each decision tree gives a classification for the input, which is called "voting for a class", and the RF algorithm chooses the class with the most votes. The process of the random forest algorithm is illustrated in Fig. 2.
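The bagging and voting steps can be sketched as follows. Tree induction is omitted: the three "trees" below are hypothetical stand-ins (simple threshold rules on a score), and all names are illustrative assumptions rather than the implementation used in this study.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples with replacement (the bagging step)."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def majority_vote(votes):
    """Return the class with the most votes."""
    return Counter(votes).most_common(1)[0][0]

def forest_predict(trees, x):
    """Each tree votes for a class; the forest returns the majority class."""
    return majority_vote([tree(x) for tree in trees])

# Hypothetical stand-ins for trees grown on different bootstrap samples
trees = [
    lambda score: "good" if score >= 50 else "poor",
    lambda score: "good" if score >= 60 else "poor",
    lambda score: "good" if score >= 40 else "poor",
]
print(forest_predict(trees, 55))  # two of the three trees vote "good"
```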

B. The k-fold Cross-Validation
Cross-validation is a statistical technique used to test the effectiveness of machine learning algorithms. Among the various cross-validation methods, k-fold cross-validation was chosen because it is popular, easy to understand, and generally yields a lower bias than other cross-validation methods. The process of k-fold cross-validation is summarized as follows:
1) Shuffle the entire sample set randomly.
2) Split the samples into k folds.
3) For each of the k folds:
• Take 1 fold as the holdout or test set.
• Take the remaining k-1 folds as the training set.
• Retain the evaluation score and discard the model.
4) Repeat the iteration until every fold has been used as the test set. Finally, compute the average of the recorded scores.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 1, 2020, www.ijacsa.thesai.org

In our study, we chose 10-fold cross-validation (hereafter 10-CV) to assess our proposed algorithms. This process is illustrated in Fig. 3.
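The procedure above can be sketched in Python as follows. The helper names and the `fit_and_score` callback are hypothetical: the callback stands for any routine that trains a model on the training indices and returns its score on the test indices.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Use each fold once as the test set and average the k scores."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Only the score is kept; the fitted model is discarded.
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k

# With 1204 samples (the size of ADS3) and k = 10, each fold holds 120 or 121 samples.
folds = k_fold_indices(1204, 10)
print(sorted({len(f) for f in folds}))
```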

C. The Proposed Hybrid Models
Classification is the principal task in supervised machine learning and an active topic in data mining and machine learning. We selected four of the most popular classifiers, each of which has many merits. However, the major problems for these classifiers are overfitting and noisy data, which lead to misclassification and reduce classification accuracy. To overcome this, we try to remove irrelevant and uncorrelated features that disturb the classification process. Moreover, data analysis requires more computational resources and time when the data volume is large. Hence, we use a feature extraction approach to remove noise from the data, reduce time and resource usage, and recover high-quality data. Combining dimensionality reduction with classification techniques can improve accuracy and boost performance. Using higher-quality data and feature reduction is an effective way to improve the performance of machine learning models. The four selected models, support vector machine with a radial basis function kernel (SVMRBF), naïve Bayes (NB), the C5.0 decision tree, and random forest (RF), are effective algorithms for the classification problem, yet no machine learning algorithm is perfect.
SVM is a classifier that uses hyperplanes, determined by support vectors, to separate data into classes. For a high-dimensional dataset, the input space is large and can be noisy, which tends to degrade the performance of the SVM algorithm. It therefore requires an effective feature extraction method that discards noisy, irrelevant, and redundant data while retaining the useful information. Removing such features can increase both the search speed and the accuracy rate.
NB has many advantages, yet its greatest weakness is that it relies on the often-faulty assumption of equally important and independent features. If the estimated conditional probability of any feature value is zero for some class $C_j$, the whole probability for that class becomes zero because of the product in (5), which leads to misclassification. Feature extraction helps to address this problem by reducing irrelevant features, and it also improves classification performance.
In the tree-based algorithms C5.0 and RF, the major problem in the splitting process of the decision tree is overfitting. Overfitting is caused by noisy data and irrelevant features; it produces misclassification and, in turn, lowers the accuracy of tree-based classifiers. To reduce high-dimensional data that contain noisy and irrelevant features, a commonly used technique is feature extraction, which yields a lower-dimensional input space containing relevant and informative features.
To improve the performance of the selected machine learning algorithms, we used a common feature extraction approach in this study, principal component analysis (PCA). PCA is a statistical method that transforms an original dataset into a new dataset of lower dimension: the original set of possibly correlated variables is converted into a set of linearly uncorrelated variables.
PCA is one of the most popular dimensionality reduction algorithms [17]. In the PCA procedure, the data are first standardized to zero mean. The covariance matrix is then computed in order to obtain its eigenvectors and eigenvalues. The eigenvector with the largest eigenvalue is taken as the first principal component of the new data; it captures the most significant relationship among the input features. PCA is less sensitive to different datasets than other holistic methods, which makes it one of the most widely used and effective feature reduction methods.
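The standardization step mentioned above can be illustrated with a short Python sketch. Scaling to unit sample standard deviation, in addition to centering at zero mean, is an assumption made here for illustration, and the function name is hypothetical.

```python
from statistics import mean, stdev

def standardize(column):
    """Center a feature at zero mean and scale it to unit sample standard deviation.

    Assumes the column is not constant (a zero standard deviation would divide by zero).
    """
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

# Hypothetical feature values for three students
z = standardize([2.0, 4.0, 6.0])
print(z)
```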
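2) Compute variance:
To measure the deviation of each feature in the dataset, we compute the variance using equation (11):

$$\mathrm{Var}(X)=\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^{2}, \tag{11}$$

where $\bar{x}$ is the mean of the feature and $N$ is the number of samples.

3) Compute covariance:
Given two variables, denoted $X$ and $Y$, the covariance is calculated using equation (12):

$$\mathrm{Cov}(X,Y)=\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right). \tag{12}$$

$\mathrm{Cov}(X,Y)=0$ means that the two attributes $X$ and $Y$ are linearly uncorrelated. Using equations (11) and (12), we obtain the covariance matrix $S$, whose entry $s_{ij}$, $i\ne j$, is the covariance between the $i$th and $j$th variables and whose diagonal entry $s_{ii}$ is the variance of the $i$th variable.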
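As a small worked illustration, the covariance matrix can be assembled from feature columns as below. The sketch assumes the sample ($N-1$) denominator, and the function names and toy columns are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance; covariance(x, x) is the sample variance of x."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def covariance_matrix(columns):
    """S[i][j] is the covariance of features i and j; the diagonal holds the variances."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]

# Two hypothetical, perfectly correlated features
S = covariance_matrix([[1, 2, 3, 4], [2, 4, 6, 8]])
for row in S:
    print(row)
```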

4) Compute Eigenvalues and Eigenvectors:
The features in the new dataset are characterized by means of eigenvectors and eigenvalues. The eigenvectors give the directions of the new feature space, while the eigenvalues give their magnitudes. The eigenvalues are obtained by solving the equation:

$$\det(S-\lambda I)=0, \tag{13}$$

where the covariance matrix $S$ is symmetric, $\lambda$ is an eigenvalue of the symmetric matrix $S$, and $I$ is the identity matrix. The eigenvector $v$ corresponding to each eigenvalue $\lambda$ can be computed via the equation:

$$(S-\lambda I)\,v=0. \tag{14}$$

The set of all eigenvectors forms the eigenspace. The proposed hybrid models, which combine machine learning models with PCA, are introduced for predicting and classifying academic performance. The main benefits of PCA are summarized as follows:
a) Removing heavy noise from the samples and uncorrelated features from the collected dataset in the preprocessing step.
b) Reducing high-dimensional data to low-dimensional data that retain the important characteristics of the data, which reduces overfitting problems.
c) Enhancing the quality of the features by removing correlated features, which effectively improves classification performance.
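For a small covariance matrix, the leading eigenvalue and eigenvector can be approximated numerically. The sketch below uses power iteration, which is a swapped-in technique chosen to keep the example short (the text above solves the characteristic equation instead); the function names and the 2-by-2 matrix are illustrative assumptions.

```python
import math

def mat_vec(S, v):
    """Multiply matrix S by vector v."""
    return [sum(s * x for s, x in zip(row, v)) for row in S]

def power_iteration(S, iters=500):
    """Approximate the largest eigenvalue of a nonzero symmetric covariance matrix.

    The start vector must not be orthogonal to the leading eigenvector.
    """
    v = [1.0 / (i + 1) for i in range(len(S))]
    for _ in range(iters):
        w = mat_vec(S, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, mat_vec(S, v)))  # Rayleigh quotient
    return lam, v

def project(rows, means, v):
    """Score of each centered sample along the direction v."""
    return [sum((x - m) * c for x, m, c in zip(row, means, v)) for row in rows]

# Hypothetical covariance matrix with eigenvalues 3 and 1
S = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(S)
scores = project([[1.0, 1.0], [3.0, 3.0]], [2.0, 2.0], v)
print(lam, v, scores)
```

The eigenvalue found this way satisfies the characteristic equation for this matrix, since (2 - 3)(2 - 3) - 1 = 0, and the projected scores are the values of the first new feature.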
In this research, we propose hybrid models that combine the four baseline models (SVMRBF, NB, C5.0, and RF) with 10-fold cross-validation (10-CV) and principal component analysis (PCA).

A. Datasets
In our study, we tried to collect all hidden features affecting student performance in mathematics. The datasets contain 43 features describing the learning behaviors of each student and one target variable describing the student's performance level based on their score. The predictive features are drawn from three main groups of factors. The forty-three variables and their descriptions are shown in Table II. Table III describes the predefined classes of the target variable.
To confirm the robustness and effectiveness of our proposed algorithms, we used three datasets. The first two are generated datasets, GDS1 (2000 samples) and GDS2 (4000 samples), constructed from proposed structures relating the predictive features to the output variable, as stated in [18][19][20]. The third dataset is an actual dataset collected from 22 high schools in Cambodia using questionnaires. Students were asked to provide demographic information related to external effects such as domestic factors, individual (student) factors, and school factors. The students' mathematics scores for semester I were obtained from the administrative office of each school. This dataset, named ADS3, consists of 1204 samples.

B. Preprocessing Tasks
Data preprocessing is an integral step in data mining that transforms the raw dataset into a clean, usable format. It not only makes the data ready for modeling but also improves the performance of the models. The preprocessing tasks in this study include data cleaning, data transformation, and data discretization. During data collection, some questionnaires were returned with unanswered questions or invalid values (outliers). Since the number of missing values in our datasets is low, we cleaned the data by imputation. We replaced missing values in categorical variables with the mode, that is, the most frequent category. The output variable had a few missing values and outliers, which we replaced with the mean value. For simplicity, we transformed some numerical features into ordinal types. We also discretized the output variable into four performance levels, as shown in Table III.
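The imputation and discretization steps can be sketched in Python as follows. The function names are hypothetical, and the cut points used in the example are placeholders chosen for illustration only; they are not the class boundaries used in this study.

```python
from collections import Counter
from statistics import mean

def impute_mode(values):
    """Replace missing entries (None) in a categorical column with the most frequent value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_mean(values):
    """Replace missing entries (None) in a numeric column with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def discretize(score, cut_points, levels):
    """Map a numeric score to an ordinal level; cut_points are ascending lower bounds."""
    level = levels[0]
    for cut, name in zip(cut_points, levels[1:]):
        if score >= cut:
            level = name
    return level

LEVELS = ["slow", "average", "good", "excellent"]
CUTS = [50, 65, 80]  # hypothetical boundaries, for illustration only

print(impute_mode(["a", None, "a", "b"]))
print(impute_mean([10, None, 20]))
print(discretize(72, CUTS, LEVELS))
```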

V. EVALUATION METRICS
The performance of each proposed model in analyzing and predicting student performance can be evaluated from the confusion matrix. Without loss of generality, our output variable is categorized into four ordinal categories, as mentioned in Table III. Table IV shows the confusion matrix for the four classes of student performance level in mathematics. Class 1 is the highest class, Class 2 the second highest, Class 3 the third, and Class 4 the lowest (poor) group of students. The following metrics are calculated.

A. Classification Accuracy
Accuracy quantifies the percentage of correct predictions. Here, we evaluate our prediction models by measuring the percentage of correctly predicted student performance levels, as in (15):

$$\text{Accuracy}=\frac{\text{number of correctly classified samples}}{\text{total number of samples}}\times 100\%. \tag{15}$$
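With a four-class confusion matrix, the correctly classified samples are those on the diagonal, so the accuracy can be computed as in the sketch below. The counts are hypothetical and are not results from this study.

```python
def accuracy(confusion):
    """Percentage of samples on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

# Hypothetical counts (rows: actual class, columns: predicted class)
cm = [
    [40, 5, 0, 0],
    [4, 30, 6, 0],
    [0, 5, 35, 5],
    [0, 0, 5, 65],
]
print(accuracy(cm))  # 170 of 200 samples lie on the diagonal
```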

B. Root Mean Square Error (RMSE)
We aim not only to predict students' performance levels but also to estimate how close our predictions are to their actual levels. We encoded the ordinal performance levels {slow, average, good, excellent} as {1, 2, 3, 4}, respectively. The RMSE can be computed as:

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^{2}}, \tag{16}$$

where $y_i$ and $\hat{y}_i$ are the actual and predicted encoded levels of the $i$th student and $N$ is the number of samples.

In our experiments, we proceed in three phases. Phase 1 obtains the results of the baseline models. Phase 2 improves the baseline models with 10-fold cross-validation (10-CV). Phase 3 runs the hybrid models, which combine the baseline models with 10-CV and PCA.

A. Result of Baseline Models
We selected four of the most popular machine learning techniques: random forest (RF), the C5.0 decision tree, support vector machine with a radial basis function kernel (SVMRBF), and naïve Bayes (NB). The two performance metrics, classification accuracy and RMSE, are reported in the tables.
From Tables V, VI, and VII, NB was the weakest model, while RF achieved the best performance in both classification accuracy and RMSE, showing itself to be a promising model.

B. Results of Baseline Models with k-fold Cross-Validation
k-fold cross-validation is a technique widely used in prediction and classification models; it splits the dataset into k folds, uses k-1 folds for training and 1 fold for testing, and then rotates the folds. In this experiment, we used 10-fold cross-validation, since it performed best with this split. In each iteration, 90% of the data was used for training and 10% for testing, as shown in Fig. 3. Finally, when all iterations were done, the average of each evaluation metric was computed.
From Table VIII, the accuracy of SVMRBF improved by 2%. The performance of the weak NB classifier improved considerably, to 68.03%. The 10-CV technique improved C5.0 and RF, with accuracy increases of 27% and 15%, respectively.
From Table IX, with 10-CV on dataset GDS2, the accuracy of the SVMRBF algorithm improved from 75.52% to 91.15%, which is a very good improvement. The accuracy of NB increased by 9%. The tree-based classifiers C5.0 and RF improved in accuracy by 9% and 6%, respectively.
C5.0 and RF are tree-based classifiers that carry a high risk of overfitting. With 10-CV, we can not only obtain better performance but also reduce the risk of overfitting. By means of 10-CV, the accuracies of C5.0 and RF improved to 94.82% and 98.22%, gains of 18% and 9%, respectively.

C. Results of Proposed Hybrid Models
Our proposed hybrid models were constructed by combining the baseline models with a feature reduction approach, PCA. Feature extraction is a powerful method in classification models for removing irrelevant or unrelated features. Dimensionality reduction via PCA [13] can serve as a form of regularization that helps prevent overfitting and improve model accuracy. A common mistake is to think that PCA selects some features from the dataset and discards the others; in fact, the algorithm constructs a new set of features from combinations of the old ones.
In this section, we build the hybrid models by combining the 10-CV of the previous section with PCA in order to reduce overfitting and further improve predictive performance. Tables XI, XII, and XIII report the results of the proposed models on the three datasets GDS1, GDS2, and ADS3, respectively, and Fig. 4, 5, and 6 visualize this performance for the same three datasets. In Fig. 4, for dataset GDS1, our proposed hybrid models raise the accuracy of SVMRBF from 75.01% to 83.88%, NB from 35.79% to 86.27%, C5.0 from 78.42% to 98.32%, and RF from 80.06% to 98.92%.
In Fig. 5, the hybrid models improved the accuracies of SVMRBF, NB, C5.0, and RF by 20%, 23%, 12%, and 9%, respectively. In Fig. 6, the proposed hybrid SVMRBF improved the classification accuracy from 86.44% to 97.01%. Classification with the hybrid NB was 30% better than with the baseline NB. The accuracies of C5.0 and RF improved to 99.25% and 99.72%, respectively.
Fig. 7, 8, and 9 show the accuracy of each model in each phase. We found that 10-CV combined with PCA gives the best result in predicting student performance. The figures also show the RMSE of the models in each phase. The proposed hybrid models generate a very small RMSE. The hybrid RF algorithm produced the smallest RMSE, which shows it to be the best predictive model for this prediction problem.
From the results, 10-CV improves the performance of our baseline models. Additionally, we observed that the proposed hybrid models boost classification performance further. The proposed hybrid models can therefore be regarded as highly effective prediction models for solving prediction and classification problems.

VII. CONCLUSION
This paper applied four popular machine learning classifiers to predict student performance: SVMRBF, NB, C5.0, and RF. The procedure had three phases. First, we observed the performance of the baseline methods. Second, we improved the performance with 10-CV. Lastly, we combined PCA with the baseline models and 10-CV to further improve classification performance. Based on classification accuracy and RMSE, the proposed hybrid models, which combine the baseline models with PCA and 10-CV, produced very satisfactory results. In conclusion, by combining the baseline models with principal component analysis and evaluating them with k-fold cross-validation, the proposed hybrid models achieved high performance, showing their potential for solving prediction and classification problems.