Deep Learning with Data Transformation and Factor Analysis for Student Performance Prediction

Student performance prediction is one of the most pressing issues in education and training, especially in educational data mining. Prediction helps students select courses and design appropriate study plans. Moreover, it enables lecturers and educational managers to identify which students should be monitored and supported so that they complete their programs with the best results. Such support can reduce formal warnings and expulsions from universities due to students' poor performance. This study proposes a method to predict student performance using various deep learning techniques. We also analyze and present several data pre-processing techniques (e.g., Quantile Transforms and MinMax Scaler) applied before feeding the data into well-known deep learning models such as Long Short Term Memory (LSTM) and Convolutional Neural Networks (CNN) for the prediction tasks. Experiments are conducted on 16 datasets covering many different majors, with approximately four million samples collected from the student information system of a Vietnamese multidisciplinary university. Results show that the proposed method provides good prediction results, especially when data transformation is used. The results are feasible for application to practical cases.
Keywords: Deep learning; student performance; mark prediction; Long Short Term Memory (LSTM); Convolutional Neural Networks (CNN); data pre-processing; multidisciplinary university


I. INTRODUCTION
Recently, the number of students who have been warned and forced to leave school has tended to increase. One reason may be that students cannot correctly evaluate and predict their own ability when selecting appropriate courses. Student performance is an important concern for higher educational institutions because an excellent record of academic achievement is one criterion for a high-quality university. There are several definitions of student performance. According to [1], student performance can be obtained by measuring the learning assessment and curriculum. However, most studies treat graduation as the measure of students' success [2], [3].
In recent years, the number of students receiving academic warnings has tended to rise. For example, at Can Tho University (www.ctu.edu.vn), in the first semester of the 2018-2019 academic year, 886 students were academically warned in one semester and 125 in two semesters; in the first semester of the 2019-2020 academic year, these numbers were 986 and 196, respectively. One of the main reasons for students' poor performance is that they have not selected courses appropriate to their competencies. This results in extended study time and increased costs for their families, higher educational institutions, and society as well. Therefore, predicting students' performance is an important research topic in exploiting educational data, and it is of interest to many researchers [4]-[8].
Currently, many approaches have been proposed to predict student performance, among which data mining is one of the most widely applied in the educational area. One of the most popular techniques for predicting student performance is classification. Several algorithms are used for the classification task, such as Decision Tree, Artificial Neural Networks, Naive Bayes, K-Nearest Neighbor, and Support Vector Machines [3]. However, existing research relies primarily on the learning results of previous semesters to predict student performance in the next semester or the current Grade Point Average (GPA), and does not analyze additional factors that affect performance, such as English entrance testing grades and activity incentive grades. Moreover, researchers have not sufficiently compared techniques, especially deep learning techniques against traditional machine learning techniques.
This study presents a deep learning approach [9] that uses a convolutional neural network on 1D data (CN1D) and the Long Short Term Memory (LSTM) to build a model that predicts student performance in upcoming semesters from the course results of previous semesters.
We analyze and introduce some data pre-processing techniques (including Quantile Transforms and MinMax Scaler) applied before feeding the data into well-known deep learning algorithms such as LSTM and Convolutional Neural Networks for the prediction tasks. In addition, to improve the predictive results, we also consider additional factors such as entrance English testing grades and activity incentive grades in the proposed model. Moreover, a comparison between deep learning techniques and traditional machine learning ones is conducted. Experimental results show that the proposed model provides rather accurate predictions and can be applied in other practical cases.
In the remainder of this study, we present a literature review of studies on student performance prediction. We introduce the considered 17 datasets in Section III.
According to [8], predicting student performance is an important task in exploiting educational data; a student's knowledge can be improved and accumulated over time. From this idea, the authors proposed an approach that uses Tensor Factorization (TF) to predict student performance. With this approach, the authors can personalize the prediction for a specific student. Experimental results on two large datasets showed that incorporating matrix factorization techniques into prediction is an effective and promising approach.
The authors in [10] investigated the effectiveness of transfer learning from deep neural networks for the task of students' performance prediction in higher education. Experiments were conducted based on data originating from five compulsory courses of two undergraduate programs. The experimental results demonstrate that the prognosis of students at risk of failure can be achieved with satisfactory accuracy in most cases.
The authors in [11] developed a student performance prediction system using the open source recommendation system MyMediaLite. For the grade databases collected from the academic management system of a university, the authors proposed using the Biased Matrix Factorization (BMF) technique to predict learning results. These results can help students choose more appropriate courses.
Researchers have also combined prediction techniques. The authors of [12] developed a model to predict student learning outcomes by combining the Taylor approximation method with Grey models, using repeated approximate calculation to improve the prediction accuracy of two grey models. The research results can help teachers and educational managers find appropriate solutions to improve the academic results of students whose learning process is unstable. In addition, [13] used Collaborative Filtering, Matrix Factorization, and Restricted Boltzmann Machines to systematically analyze data collected from a university. The results showed that the Restricted Boltzmann Machines technique predicts students' academic results better than the other techniques.
Collaborative filtering algorithms are commonly used in recommendation systems due to their simplicity and effectiveness. However, data sparsity limits the effectiveness of these algorithms and makes it difficult to further improve the prediction results. Therefore, models that combine collaborative filtering algorithms with deep learning techniques have attracted growing interest. The authors of [14] proposed a model based on quadratic polynomial regression to obtain more accurate latent features by improving the traditional matrix factorization algorithm. The latent features are then the input data of a deep neural network model. Experiments on three datasets showed that the proposed model improves prediction considerably compared to traditional ones. Other approaches combining a collaborative filtering model with deep learning are proposed by [15]. With this approach, during the prediction period, a feed-forward neural network is used to simulate the interaction between the user and the item, with the feature vectors from pre-processing used as input to the neural network. Experiments on two MovieLens datasets (one million and ten million samples) verified the effectiveness of this method and gave very feasible results. A review of machine learning based approaches for predicting student performance is presented in [16]. Other approaches can be found in [17]-[20].
The problem of student performance prediction has been addressed in numerous previous studies using machine learning, but factor analysis based on explanation models, and data transformation techniques, still leave room for improvement. This research aims to create a new approach that leverages deep learning. In particular, the Long Short Term Memory (LSTM) can be used with time-based features. This study makes the following contributions:
• Deep learning models (LSTM, CNN) with a shallow architecture are leveraged for student performance prediction tasks. As the results show, deep learning techniques can produce feasible prediction scores.
• Various optimizer functions are tested to choose an appropriate one for the considered regression problem.
• Data transformation techniques are also considered to enhance the performance of deep learning models. Feature values greater than 1 cause poor performance for deep learning models. By testing various data pre-processing techniques, we found that regression tasks with deep learning can converge sooner and also achieve better results.
• We investigate 17 datasets covering a wide range of majors and study fields at a multidisciplinary university for comparison. Because the training and testing data are divided by time, we obtain various ratios between the training set and test set, which lets us evaluate the difference in prediction performance.
• A variety of model explanations are used to analyze factors that can influence student performance. From the analysis results, educational managers can propose appropriate policies and strategies to support their students.
In order to evaluate the proposed model, we collected real data at a multidisciplinary university, namely Can Tho University, Vietnam. However, the model can be applied to other universities, schools, and colleges as well. The collected data relates to students, courses, marks, and other information from 2007 to 2019, with 3,828,879 records, 4,699 courses (subjects), and 83,993 students. Data distributions are described in Table I, with information on the number of samples and the training ratio for each educational unit (department or institute) at the considered university.
The set of datasets consists of student performance records from 16 academic units (faculties/colleges/schools) that belong to Can Tho University. For each unit, we separate the data into two parts, one for the training stage and the other for the test stage. Because the data is divided by period (2007 to 2016 for training, and 2017 to 2019 for testing), the sizes of the training and testing data differ for each unit. In addition to these 16 academic units, we also evaluate our proposed method on the full dataset, which includes all of them.
The distributions of mark levels of the full dataset for training and testing are described in Fig. 2 and Fig. 3, respectively. In these distributions, most of the marks are greater than or equal to the medium level of 2 (89.7% for the full training dataset and 88.6% for the full testing dataset). The distribution is similar for most units in the university; for instance, the mark level distributions of the Engineering Technology dataset are described in Fig. 4 and Fig. 5, respectively.

IV. A PROPOSED APPROACH BASED ON DEEP LEARNING FOR STUDENT PERFORMANCE PREDICTION
First, we collect real datasets from the Student Management System of a university; the data is then pre-processed to remove noise, redundant attributes, etc. Next, we divide the data for training and testing in "time order", which means that we use the "studied courses" to predict the "to be studied courses". For example, we use data from 2007 to 2016 for training, and data from 2017 to 2019 for testing. The purpose of this division is to use past course results (historical data) to predict future course results. To evaluate the efficiency of the prediction model, "the future" in this context refers to data from 2017 onward.
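The time-ordered split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the column names (`student_id`, `course_id`, `year`, `mark`) and the toy records are assumptions, and only the 2016/2017 cut-off comes from the text.

```python
import pandas as pd

# Toy records; the schema here is hypothetical, not the university's actual one.
records = pd.DataFrame({
    "student_id": [1, 1, 2, 2, 3],
    "course_id": ["C1", "C2", "C1", "C3", "C2"],
    "year": [2015, 2017, 2016, 2018, 2019],
    "mark": [3.0, 2.5, 4.0, 1.5, 3.5],
})


def split_by_year(df, last_train_year=2016):
    """Chronological split: years up to the cut-off train, later years test."""
    train = df[df["year"] <= last_train_year].reset_index(drop=True)
    test = df[df["year"] > last_train_year].reset_index(drop=True)
    return train, test


train, test = split_by_year(records)
print(len(train), len(test))
```

Because the split depends on dates rather than a fixed fraction, the train/test ratio differs from unit to unit, which matches the varying ratios reported for the datasets.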

A. Data Pre-Processing and Transformation
Since the datasets collected from the Student Management System contain a lot of information, we pre-processed them in the following steps:
• Step 1: Remove redundant attributes such as Student Name, Course Name, Lecturer Name, etc.
• Step 2: Remove redundant/noisy records such as courses that were registered by the student but for which no examination was taken (i.e., null marks), exempted courses, etc.
• Step 3: Remove courses that do not have enough registrations (in some universities, courses registered by fewer than 15 students are removed).
• Step 4: Transform text values to numeric values and other formats.
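The four steps above might look like the following pandas sketch. The column names, the toy data, and the choice of integer category codes for Step 4 are all assumptions made for illustration; the paper does not specify its encoding.

```python
import pandas as pd

# Hypothetical raw export; real attribute names may differ.
raw = pd.DataFrame({
    "student_name": ["An", "Binh", "Chi", "Dung"],
    "student_id": ["S1", "S2", "S3", "S4"],
    "course_id": ["C1", "C1", "C2", "C1"],
    "mark": [3.0, None, 2.0, 4.0],
    "gender": ["F", "M", "F", "M"],
})


def preprocess(df, min_registrations=15):
    # Step 1: drop attributes that are not used for learning.
    df = df.drop(columns=["student_name"])
    # Step 2: drop records without a mark (registered but no examination).
    df = df.dropna(subset=["mark"])
    # Step 3: drop courses with too few registrations.
    counts = df["course_id"].map(df["course_id"].value_counts())
    df = df[counts >= min_registrations].copy()
    # Step 4: encode text attributes as integer codes (one possible encoding).
    for col in ["student_id", "course_id", "gender"]:
        df[col] = df[col].astype("category").cat.codes
    return df.reset_index(drop=True)


# A threshold of 2 is used only so the toy data keeps some rows.
clean = preprocess(raw, min_registrations=2)
print(clean)
```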
After carefully analyzing the data, we selected the input attributes for the learning model as described in Table II. This selection is based on pre-experimental results and previous analyses of student performance prediction [11], [21].
Because the obtained attributes have varied data distributions, we suggest using Quantile Transformation (QTF) and MinMaxScaler (MMS) [25] to convert all values to a range in which deep learning algorithms can converge.
QTF, a non-linear transformation, is considered a strong preprocessing technique because it reduces the effect of outliers. Values in new/unseen data (for example, a test/validation set) that are lower or higher than the fitted range are set to the bounds of the output distribution. As shown in Fig. 6, with an example from the Rural Development samples, the ranges and distributions of the features differ greatly before the data is transformed. After transformation with QTF, all features range from 0 to 1 (Fig. 6b). Fig. 6a shows the result of the scaler for each feature, which also brings its distribution closer to a normal distribution.
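The clipping behaviour of QTF on unseen values can be illustrated with scikit-learn's `QuantileTransformer`. The synthetic skewed feature, `n_quantiles=100`, and the uniform output distribution are assumptions chosen for this sketch; the paper does not state the exact settings used.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# A skewed synthetic feature standing in for a real attribute.
X_train = rng.exponential(scale=10.0, size=(1000, 1))
# Two unseen values: one below and one above the fitted range.
X_test = np.array([[-5.0], [X_train.max() + 100.0]])

qtf = QuantileTransformer(n_quantiles=100, output_distribution="uniform",
                          random_state=0)
X_train_q = qtf.fit_transform(X_train)   # quantiles learned on training data only
X_test_q = qtf.transform(X_test)         # out-of-range values map to the bounds

print(X_train_q.min(), X_train_q.max())
print(X_test_q.ravel())
```

Fitting on the training set and only transforming the test set keeps test information out of the learned quantiles.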
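MMS is also used for creating bins for images. This algorithm scales each feature to a given range with formulas (1) and (2):
$$X_{std} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{1}$$
$$X_{scaled} = X_{std} \cdot (max - min) + min \tag{2}$$
where $X_{min}$ and $X_{max}$ are the minimum and maximum of the feature, and $(min, max)$ is the target range. These scalers were shown to be efficient for classification tasks in [22]. In this study, the experiments also reveal promising results compared to the original data for regression tasks. The scaler is learned from the training set and applied to the test set.
www.ijacsa.thesai.org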
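Min-max scaling that is learned on the training set and then applied to the test set can be sketched with scikit-learn's `MinMaxScaler`. The toy arrays are assumptions; the manual computation is included only to show what the scaler does per feature for the default (0, 1) range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
X_test = np.array([[2.5, 25.0], [12.0, 10.0]])

mms = MinMaxScaler(feature_range=(0, 1))
X_train_s = mms.fit_transform(X_train)  # per-feature min/max come from training data
X_test_s = mms.transform(X_test)        # the same min/max are reused for test data

# Equivalent manual computation for the (0, 1) target range.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_test - col_min) / (col_max - col_min)

print(X_test_s)
```

Unlike QTF, this scaler does not clip by default, so a test value outside the training range (12.0 in the first column here) lands outside [0, 1].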

B. Proposed Models
Two deep learning algorithms and a robust regression algorithm (Linear Regression) are used for the prediction tasks. The convolutional neural network (namely, CN1D) receives 1D data with 21 features and passes it through one convolutional layer with 64 kernels of size 3 (stride 1), followed by a ReLU activation function (shown in Fig. 7). The Long Short Term Memory (LSTM) includes 64 Tanh units and one time step (Fig. 8). Both the CN1D and the LSTM produce their output through a sigmoid function (Equation 3). The output of the sigmoid function ranges from 0 to 1, so it is then multiplied by 4 to match the grade scale of 0.0 to 4.0 for mark prediction.
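To make the data flow of CN1D concrete, the following NumPy sketch runs one forward pass with random weights. Only the 21 input features, the 64 kernels of size 3 with stride 1, the ReLU, and the sigmoid output scaled by 4 come from the text. The absence of padding, the flattening step, and the single dense output unit are assumptions, and this is not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cn1d_forward(x, kernels, bias, w_out, b_out):
    """x: (batch, 21); kernels: (64, 3); bias: (64,); w_out: (19 * 64,)."""
    # All width-3 windows with stride 1 and no padding: (batch, 19, 3).
    windows = np.lib.stride_tricks.sliding_window_view(x, 3, axis=1)
    conv = windows @ kernels.T + bias          # (batch, 19, 64)
    relu = np.maximum(conv, 0.0)               # ReLU after the convolution
    flat = relu.reshape(x.shape[0], -1)        # assumed flatten: (batch, 1216)
    z = flat @ w_out + b_out                   # assumed single dense output unit
    return 4.0 * sigmoid(z)                    # map (0, 1) onto the 0.0-4.0 scale


rng = np.random.default_rng(0)
x = rng.random((5, 21))                        # 5 samples, features already in [0, 1]
kernels = rng.normal(scale=0.1, size=(64, 3))
bias = np.zeros(64)
w_out = rng.normal(scale=0.01, size=19 * 64)
pred = cn1d_forward(x, kernels, bias, w_out, 0.0)
print(pred.shape)
```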

A. Learning Settings and Performance Metrics
All deep learning networks use either the Adam algorithm or the Root Mean Square Propagation (RMSprop) algorithm [23] as the optimization function, with a learning rate of 0.0001 and a batch size of 16000, running for up to 500 epochs. To reduce overfitting, we use early stopping with a patience of 5 epochs: if the loss is not reduced after 5 consecutive epochs, learning is stopped. The scaler algorithm learns from the training set and transforms both the training and test sets.
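The early stopping rule can be sketched in plain Python as below. The function name and the example loss sequence are hypothetical, and the text does not say which loss (training or validation) is monitored, so the sketch simply takes a list of per-epoch losses.

```python
def epochs_until_stop(losses, patience=5, max_epochs=500):
    """Return (last_epoch_run, best_loss) under a patience-based stopping rule."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best
    return min(len(losses), max_epochs), best


# The loss improves for three epochs, then plateaus.
history = [1.0, 0.8, 0.7] + [0.7] * 10
stopped_at, best_loss = epochs_until_stop(history)
print(stopped_at, best_loss)
```

With a patience of 5, training on this sequence ends at epoch 8, five epochs after the last improvement at epoch 3.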
The regression performance is measured using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), averaged over 5 runs on the test set. RMSE and MAE are calculated by equations (4) and (5), respectively:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{4}$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{5}$$
where $y_i$ is the true value (grades on the 0.0-4.0 scale), $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
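As a small worked example, the two metrics can be computed with NumPy as follows. The marks used here are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted marks."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted marks."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


y_true = [4.0, 3.0, 2.0, 1.0]
y_pred = [3.5, 3.0, 3.0, 1.5]
print(mae(y_true, y_pred))   # absolute errors 0.5, 0, 1, 0.5 -> 0.5
print(rmse(y_true, y_pred))  # sqrt(1.5 / 4), about 0.612
```

RMSE penalizes the single one-point miss more heavily than MAE does, which is why it is the larger of the two here.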
The experimental results are presented as follows. The results with various scalers are illustrated in Section V-B, where we show that QTF is an appropriate data pre-processing solution for regression tasks. Then, the selected scaler is run with two optimizer functions, Adam and RMSprop, for comparison in Section V-C. In Section V-D, we run the experiments using deep learning and linear regression with the best scaler and optimizer function on all 16 datasets, covering the departments and institutes of the considered university, and also carry out the prediction on the full dataset, which is merged from the 16 datasets of the educational units.

B. Scalers Enhance the Performance of Deep Learning
The results obtained with the various scalers are shown in Table III. The scalers clearly improve the performance of the deep learning algorithms. QTF appears to be the best choice among the considered scalers, giving the highest performance on 15 out of 16 datasets for CN1D and on all datasets for LSTM. Fig. 9 and Fig. 10 also give a clear view of the improvement. CN1D in particular benefits significantly from the scaler.

D. Student Performance Prediction with Shallow Deep Learning Architectures
The performance of the three learning models with the RMSprop optimizer and the QTF scaler is shown in Table V. Deep learning achieves the best performance on most datasets (13 out of 16 in MAE and 12 out of 16 in RMSE).
CN1D takes first place in both MAE and RMSE, achieving the best results on 9 datasets in MAE and 10 datasets in RMSE. The best MAE is obtained on the Foreign Languages dataset, while the worst prediction results are for the Engineering Technology dataset. With the CN1D model, 11 datasets have MAEs lower than 0.6, while two datasets have high MAEs greater than 0.7. The results are rather similar for RMSE. The best RMSE (0.64607) is achieved on marks from Foreign Languages. The poor prediction results on the Engineering Technology samples can be explained by the label distribution shown in Fig. 4 and Fig. 5, where the distributions of the training set and test set differ at the mark level of 3.5. The low results on the Rural Development dataset appear to arise because the training set contains even fewer samples than the test set, so the models may not obtain enough training data to capture the characteristics of the test set. Predicting marks of Physical Education students can be challenging because of the special characteristics of this department, where each student may have particular talents. However, many of them often have to concentrate on various sports competitions, and hence cannot spend enough time to perform well in other subjects.
[...] Fig. 11 and Fig. 12, respectively. The LSTM performs best in both RMSE and MAE, with values of 0.79522 and 0.63115, respectively. The other deep learning algorithm, CN1D, also obtains better performance than Linear Regression, with 0.64107 in MAE and 0.79918 in RMSE. The performance in the training phase is lower than in the validation phase, but the difference is trivial. In terms of MAE, the difference between the training and test phases is about 0.02846 for LSTM and 0.03031 for CN1D.
From these experimental results, we observe that the proposed models produce acceptable prediction results; thus, the system could help students select appropriate courses and design suitable study plans. Moreover, student performance prediction enables lecturers and educational managers to identify which students should be monitored and supported so that they complete their programs with the best results. Such support can reduce formal warnings and expulsions from universities due to students' poor performance.

E. Influence Factor Analysis for Student Performance Prediction
To find out which features are important for the model to learn, we calculate the Pearson product-moment correlation, which measures the strength of the linear relationship between two variables. This correlation is particularly helpful for regression problems. We first compute the covariance of the two variables and then each variable's standard deviation. The correlation coefficient is obtained by dividing the covariance by the product of the two variables' standard deviations (Equation 6):
$$\rho_{x,y} = \frac{Cov(x, y)}{\sigma_x \sigma_y} \tag{6}$$
where
• $Cov(x, y)$ is the covariance of variables x and y.
• $\sigma_x$ is the standard deviation of x.
• $\sigma_y$ is the standard deviation of y.
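The coefficient, covariance divided by the product of the two standard deviations, can be computed directly as in the sketch below. The CGPA and mark values are invented for illustration and are not taken from the datasets.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: Cov(x, y) / (std(x) * std(y))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return float(cov / (x.std() * y.std()))         # population standard deviations


# Hypothetical feature (CGPA) and target (mark) values.
cgpa = [2.0, 2.5, 3.0, 3.5]
mark = [1.5, 2.5, 3.0, 4.0]
r = pearson(cgpa, mark)
print(r)
```

The value agrees with `np.corrcoef(cgpa, mark)[0, 1]`, since the normalization factors cancel as long as covariance and standard deviations use the same convention.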
The results of factor analysis with the Pearson correlation coefficient show that the CGPA and the CourseID have the strongest correlation with the target attribute (the mark), while the StudentID has a negative effect. Other features are presented in Fig. 13.
Taking one bad prediction and one good prediction for the influence factor analysis, as shown in Fig. 14, we can see that CGPA (see Table II for details of the features) contributes a positive effect on the mark and produces a good prediction (Fig. 14a), while the result in Fig. 14b treats CGPA as a negative factor and so yields a prediction with a higher error. From Fig. 14a, we also note that, along with CGPA, the number of semesters the student has studied (No. Semester) and the course the student was taking contribute positive effects on the mark.

VI. CONCLUSION
In this study, we proposed deep learning models (Long Short Term Memory and Convolutional Neural Networks) to address the student performance prediction problem in educational data mining. We analyzed and proposed several data pre-processing techniques (e.g., Quantile Transforms, MinMax Scaler) applied before feeding the data into deep learning models and a robust machine learning model, Linear Regression, for the prediction tasks. Moreover, we adapted the models by using different optimizer functions, Adam and RMSprop, to improve prediction performance. Experimental results on the dataset collected from a Vietnamese multidisciplinary university's information system show that the proposed methods provide good prediction results and are expected to be applicable in practical cases.
Using these results, we can give both educational managers and students early warnings so that students can make better study plans. Moreover, evaluating various training courses can help managers propose appropriate policies.
In the future, we will continue to perform experiments on other published datasets, adjust the model settings for better performance, and compare with other approaches. Moreover, instead of using one model to predict for all students, future studies can separate students into groups according to their mark levels and build a group of models to enhance prediction performance. Further research should also consider more sophisticated models that have the potential to improve performance.