Drop-Out Prediction in Higher Education Among B40 Students

Malaysian citizens are categorized into three income groups: the Top 20 Percent (T20), Middle 40 Percent (M40), and Bottom 40 Percent (B40). One of the focus areas of the Eleventh Malaysia Plan (11MP) is to elevate B40 households towards a middle-income society. In 2018, an estimated 4.1 million households belonged to this group. The government of Malaysia has widened access to higher education for the B40 group in an effort to narrow socioeconomic gaps and improve living standards. Statistical data show that since 2013, the yearly intake of students into bachelor's degree programmes at Malaysia's public universities has exceeded 85,000. Despite this large enrolment, not all students were able to graduate, including students from low-income family backgrounds. Data mining with machine learning techniques has been widely and effectively used to predict students at risk of dropping out in education in general. However, machine learning work on student attrition in Malaysia's higher education is generally lacking. Therefore, in this research, three machine learning models were developed using the Decision Tree, Random Forest and Artificial Neural Network algorithms to classify attrition among B40 students in bachelor's degree programmes at Malaysia's public universities. Comparative performance analysis of the three models indicates that the Random Forest model is the best model for predicting student attrition in this study. The Random Forest model outperforms the other two models in terms of accuracy, precision, recall and F-measure, with values of 95.93%, 97.10%, 81.26% and 88.50%, respectively. Nevertheless, the difference in performance is statistically significant between the Random Forest and Decision Tree models but not between the Random Forest and Artificial Neural Network models.
Keywords—Machine learning; prediction; student attrition; student drop-out; B40; random forest; decision tree; artificial neural network


I. INTRODUCTION
Malaysia's household income is classified into three groups: Bottom 40% (B40), Middle 40% (M40) and Top 20% (T20). According to the Department of Statistics Malaysia (2017), B40 household income in Malaysia is generally not more than RM 4,630.00. Approximately 2.7 million households belonged to the B40 group in 2014. The figure increased in 2018, when the government announced that 4.1 million households would continue to benefit from Bantuan Sara Hidup (Household Living Aid) (BSH), which is specially allocated to the B40 group [1].
B40 was selected as a focus group in Rancangan Malaysia 11 2016-2020 (The Eleventh Malaysia Plan) (RMK-11). Through RMK-11, the government used education as one of its strategies to boost B40 household income and ultimately narrow the socioeconomic gap [1]-[2]. Higher education institutions and skills training institutes were encouraged to allocate more seats and to allow admission by special allocation for B40 students, in an effort to secure their access to higher education. As reported in the Higher Education Statistics, from 2011 to 2015 the total student intake for bachelor's degree programmes in Malaysia's public universities was more than 85,000 students yearly. The highest intake was recorded in 2011, with 99,862 students. Nonetheless, not all were able to graduate on time. In worse cases, some dropped out voluntarily or were expelled by the university.
Student attrition in university negatively affects B40 students financially. The family's financial burden increases because the student's education loan has to be repaid even if the student fails to graduate. Furthermore, attrition affects a student's chances of securing a high-income job. Student drop-out also leads to a large loss of human capital for the nation, as fewer professionals and skilled experts are produced by public universities.
Hence, a proactive approach is urgently needed to identify students who are at risk of dropping out. An effective prediction model using machine learning techniques can be implemented for that purpose. Thus, the aim of this paper is to conduct a comparative study of machine learning models in predicting attrition among B40 students, particularly in bachelor's degree programmes in Malaysia's public universities. The Decision Tree (DT), Random Forest (RF) and Artificial Neural Network (ANN) algorithms were adopted in constructing the models.
The remainder of this paper is organized as follows. Section 2 presents previous research related to classification techniques in education and student drop-out prediction in higher learning institutions. Section 3 describes the methodology used to predict student drop-out in this research. Results are discussed in Section 4, while the conclusion and further work are outlined in Section 5.
550 | P a g e www.ijacsa.thesai.org

II. LITERATURE REVIEW
Classification is a machine learning technique that can be used to predict student drop-out accurately and so help reduce the student attrition rate. The task is crucial because the ability to identify at-risk students as early as possible greatly helps to keep students from leaving their studies and to overcome attrition among B40 students. Classification techniques have been developed and applied successfully to a wide range of real-world domains [3]-[8]. Classification also plays an important role in the education domain, especially in predicting students' academic performance, whether in school or in higher education institutions [9]. The study in [9] reviewed 30 studies carried out between 2002 and early 2015 and found that Artificial Neural Network (ANN), Decision Tree (DT), Naïve Bayes (NB), k-Nearest Neighbour (k-NN) and Support Vector Machine (SVM) were often used in building prediction models. However, the findings showed that ANN and DT models produced higher accuracy than the others.
Over recent years, there has been significant growth in published research on predicting student performance, focusing on course drop-out and retention using classification (supervised learning). These studies concentrate on predicting the final grade or Cumulative Grade Point Average (CGPA) of students using classifier algorithms [10]-[13], predicting student performance in Massive Open Online Course (MOOC) environments [14], and predicting students at risk of not graduating from high school on time [15].
In Malaysia, classification techniques have been applied in the education domain, but the focus has been more on student performance than on attrition. The authors of [16] designed a model to identify key factors that influence the drop-out rate in a Computer Science course. They collected students' demographic information and transcript records, focusing on the core courses offered because these have more impact on drop-out cases. Four classification techniques, namely k-NN, DT, NN and Logistic Regression (LR), were used to classify the dataset. The results show that the LR classifier was the most accurate (91%) of the techniques used in that work. The outcome reveals five important courses in which a student must score higher to lower the chance of dropping out.
Bedregal Alpaca et al. [17] proposed classification models based on academic information provided by university to identify a student at risk of drop-out. The student's demographic, academic performance, admission test and course information data are considered for the evaluation. From the result, it is observed that the model is able to determine the most significant variable that affects academic performance, which is the abandoned subjects.
Gil, Delima, and Vilchez [18] adapted DT and NB to identify the underlying factors of student drop-out in a public school in the Philippines. They used the Weka toolkit to apply the classifier algorithms to the selected dataset and produced a comparison of each algorithm's performance in terms of recall, precision and accuracy. Meanwhile, [19] concentrated only on k-NN to perform an extensive evaluation and predict student drop-out at an early stage of study. The technique is versatile, simple and can handle different types of data. The results can help teachers to identify a student at risk of drop-out and check on their welfare.
Mardolkar and Kumaran [20] adapted data mining techniques to find comprehensive prediction models of student drop-out as early as possible. A model with sufficiently high accuracy would be used in an early warning system to detect students at high risk of drop-out as soon as possible. They explored academic variables (both at university and at the former school), sociodemographic data, behaviour and extracurricular activities that may influence student drop-out. However, only a subset of attributes had a very high predictive contribution to student drop-out.
Tomasevic, Gvozdenovic and Vranes [21] conducted research with the objective of providing a comprehensive analysis and comparison of supervised machine learning techniques for discovering students at high risk of dropping out of a course. For this, they used various classifiers, namely k-NN, SVM, ANN, DT, NB and LR. The overall highest precision was obtained with ANN by feeding the algorithm with student engagement data from online learning and past performance data.
Viloria and Padilla [22] in their study applied NN, DT and Bayesian Network to predict drop-out among engineering students in India. As a result, it was found that academic results and socioeconomic situation have an influence on students and managing these variables helps reduce the dropout rate.
Sangodiah et al. [23] used SVM to predict academic performance for students under probation in a private higher learning institution. The model achieved 89.84% accuracy. Likewise, [24] used a single classifier, Binary Linear Regression, to predict which postgraduate doctoral students would complete their study on time. The outcome revealed that only 6.8% of the students in the year 2014 were able to graduate on time.
Table I describes 16 studies on predicting student drop-out in higher learning institutions conducted from 2015 to 2020. The studies indicate that academic and sociodemographic data were important features in predicting student drop-out. Only one study used data from server logs containing students' activities in online courses offered by various universities. All of the studies reviewed here targeted students from only one course or major in one faculty or similar institution. In this research, however, the focus shifts to predicting drop-out among B40 students by using academic and sociodemographic data from various majors and various higher learning institutions (public universities).

Table I. Studies on predicting student drop-out in higher learning institutions (2015 to 2020)

Bedregal Alpaca et al. [17]
Objective: To generate classification models and implement them on academic information provided by the university.
Data: Demographic, academic, admission test, course information.
Algorithms: ANN, DT.
Findings: The generated model is able to determine the most significant variable that affects academic performance, which is the abandoned subjects.

Gil et al. (2020) [18]
Objective: To identify the underlying factors of student drop-out and apply different data mining algorithms.
Data: Academic, student attendance, sociodemographic.
Algorithms: DT, NB.
Findings: The DT model produces the best result. The model identified key factors that affect student drop-out.

Mardolkar & Kumaran (2020) [19]
Objective: To evaluate and propose a k-NN method to predict student drop-out.

[Author missing in source]
Objective: To develop a web-based system with the ability to predict students who are at risk of dropping out in an Information Technology major.
Data: Academic (first- and second-year students).
Algorithms: DT, RF.
Findings: RF accuracy higher than DT.

Chen, Johri & Rangwala (2018) [26]
Objective: Performance comparison between a survival analysis framework and a machine learning approach in predicting student attrition in Science, Technology, Engineering and Mathematics (STEM) majors.
Findings: SVM outperformed the other models based on F-measure score, but the differences were not significant.

[Author missing in source]
Data: Academic, sociodemographic and phone conversation.
Algorithms: ANN.
Findings: Able to predict student drop-out with 76% accuracy.
Comparative studies of two or more prediction models were the core of 13 studies, while the remainder used a single classifier. In comparing prediction model performance, DT was the main choice among researchers, used in 11 studies, followed by NB/k-NN (six studies), RF (five studies) and ANN/SVM (four studies). Based on the review, it can be concluded that DT is the most popular choice for predicting student drop-out, as it is easy to comprehend and produces high-performance prediction results. In addition, classification models using ensemble learning, especially RF, have become increasingly popular among researchers in recent years because their performance is very high compared to a single classifier. Thus, a comparative study between classifiers, particularly DT, ANN and RF, is much needed to discover the best prediction model for drop-out among B40 students.

III. RESEARCH METHODOLOGY
In general, this research was conducted in three phases: Phase I - Feasibility Study, Phase II - Data Preparation and Phase III - Modelling and Evaluation. Fig. 1 shows the detailed activities for each phase of the research methodology. Three software tools were used in this study: RapidMiner for prediction model construction and performance evaluation, a MariaDB database to store and pre-process data, and SPSS for attribute selection and statistical tests.

A. Data Preparation
1) Data acquisition: The dataset was provided by Bahagian Pembangunan dan Perancangan Dasar (BPPD), Kementerian Pendidikan Malaysia (Pendidikan Tinggi), and consists of 44,406 records with 23 attributes. The dataset holds records of students in bachelor's degree programmes at 20 public universities, from the 2014 to 2017 intakes, who had either dropped out or graduated.
2) Data Pre-processing: Pre-processing is a method of transforming a dataset in order to better expose its information content to the mining tool. Real-world data is often incomplete and incoherent, and can contain noise such as errors and outliers. Pre-processing is therefore required to ensure that the data is formatted for a given mining tool and is adequate for a given method. Data cleaning was performed using a dimension reduction process. Attributes that had more than 20,000 unavailable values, or that were redundant or obsolete, were deleted from the dataset. Incomplete records and outliers were also discarded (Table II). Data cleaning also ensured that the dataset included only student records with B40 household income (not more than RM4,387.00). Next, the data were transformed into a structured, understandable format suited to data mining. Two new attributes were constructed from existing attributes: age from date_of_birth and class from student_status. The attribute class was the class label in this study. In the attribute student_status, records with the value 'Berhenti' or 'Diberhentikan' were translated into 'C' in the attribute class, representing students who dropped out, while records with the value 'Tamat' were translated into 'G', representing students who managed to graduate. Furthermore, attributes with many distinct values were aggregated or generalized by using concept hierarchies.
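The construction of the two derived attributes described above can be sketched as follows. This is a minimal illustration, not the authors' MariaDB procedure: the function names, the sample record and the choice of the enrolment date as the reference date for computing age are assumptions made for the example. Only the status values and the 'C'/'G' labels come from the text.

```python
from datetime import date

# Status values and labels as described in the text:
# 'Berhenti' / 'Diberhentikan' -> 'C' (dropped out), 'Tamat' -> 'G' (graduated).
STATUS_TO_CLASS = {"Berhenti": "C", "Diberhentikan": "C", "Tamat": "G"}


def derive_class(student_status):
    """Map a raw student_status value to the class label ('C' or 'G')."""
    return STATUS_TO_CLASS[student_status.strip()]


def derive_age(date_of_birth, reference_date):
    """Age in completed years at reference_date.

    Assumption: the reference date is the enrolment date; the paper does
    not state which date was used.
    """
    had_birthday = (reference_date.month, reference_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return reference_date.year - date_of_birth.year - (0 if had_birthday else 1)


# Hypothetical record, for illustration only.
record = {"student_status": " Tamat", "date_of_birth": date(1995, 9, 30)}
record["class"] = derive_class(record["student_status"])
record["age"] = derive_age(record["date_of_birth"], date(2014, 9, 1))
```

Stripping whitespace before the lookup guards against padded status strings, and an unknown status raises a KeyError rather than being silently assigned to a class.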
Only relevant attributes were selected and used in building the prediction models. For this reason, the Chi-Square test was used to assess the relationship between each attribute and the class label. Attributes with a test result of p < 0.05 were considered to have a significant association with the class label. Afterwards, Phi (φ) or Cramer's V (V) was computed for each associated attribute in order to measure the strength of association. The value of φ or V lies between 0 and 1, with 1 being the strongest and 0 the weakest. The interpretation of the association between attributes and the class label is shown in Table III. Referring to Table IV, the Chi-Square test results showed that all attributes had a significant association with the class label, as the p-value for each attribute was less than 0.05. However, based on the Cramer's V/Phi results, place_of_birth and family_income were discarded from the dataset as their level of association with the class label was negligible. The final dataset for model construction consisted of 28,844 records with 9 regular attributes and 1 special attribute (class) (Table V). Table VI and Table VII report the statistical analysis for each attribute after data pre-processing. Students who dropped out (class 'C') represented only 19.22% of the data in this study, compared to 80.78% for students who managed to graduate (class 'G'). More than half of the students (59.47%) came from UiTM, while UKM had the lowest number of students (0.21%).
The majority of B40 students obtained a CGPA higher than 2.00 in the first year of their study. Nearly 40% of the students obtained a first-year CGPA between 3.00 and 3.49. Even though students with a first-year CGPA lower than 2.00 formed the smallest group (13%), this group was the most likely to drop out, as almost all of these students (98.84%) did not continue their study.
The Business and Administration programme group contributed the largest number of students (10,809 students), followed by Engineering (3,976 students), and the lowest was Transport Service with only one student. Further analysis also found that six of the ten programme groups with the most students who dropped out and obtained a CGPA lower than 2.00 were Science, Technology, Engineering and Mathematics (STEM) majors (Engineering, Computing, Manufacturing and Processing, Mathematics and Statistics, Physical Science and Architecture).
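The attribute selection step described above (a Chi-Square test followed by Phi or Cramer's V) can be illustrated with a small standard-library sketch. The study itself used SPSS; the function names and the 2x2 counts below are hypothetical and chosen only so the arithmetic is easy to follow. The sketch computes the test statistic and the strength of association, and leaves out the p-value, which would come from a chi-square distribution with (rows - 1)(columns - 1) degrees of freedom.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of attribute and class label.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


def cramers_v(table):
    """Cramer's V; for a 2x2 table it equals the absolute value of Phi."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (n * k))


# Hypothetical counts: rows are two attribute values, columns are class C / G.
example = [[30, 10], [10, 30]]
example_chi2 = chi_square(example)
example_v = cramers_v(example)
```

With these illustrative counts every expected cell is 20, so the statistic is 20 and V is 0.5. An attribute whose V falls in the negligible band of Table III would be dropped, as was done for place_of_birth and family_income.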

B. Descriptive Analysis
Almost two-thirds of B40 students (63.61%) self-funded their study, while the rest (36.39%) used an education loan or received financial aid from government agencies, private institutions, foundations or other sources. Self-funded students also appeared prone to drop out, as 60% of the students quit their study. Being a part-time student can also be a disadvantage, as 99.47% of them failed to graduate. Students who were single had a higher chance of finishing their degree, as their drop-out percentage was very low compared to married or divorced/widowed students. With regard to gender, over 70% of the students in this study were female, but their drop-out rate was 15.61% lower than that of male students. Finally, students aged 20 years and below when they enrolled in a bachelor's degree programme were more likely to drop out than students aged 21 years and above.

C. Modelling and Evaluation
1) Model construction and testing: Each prediction model (DT, RF and ANN) was tested beforehand to determine the validation method and algorithm parameters that would produce a high-performance prediction model. All nine attributes were used in validation method testing and parameter tuning. Model validation was tested using holdout (70%-30% and 60%-40%) and 10-fold cross-validation methods, and the latter was chosen as it gave the highest accuracy for the majority of the models. Next, each prediction model was constructed repeatedly with different parameters to achieve the highest accuracy. The parameter tuning results are shown in Table VIII, and these parameters were used in building the final prediction models.
Different numbers of attributes were used in building the final prediction models. At first, the models were built with attributes that had a moderate to strong relationship with the class label. Subsequently, attributes with a weak relationship were added one by one based on attribute ranking. The importance of weak attributes cannot be neglected, as they might be useful in producing high-performance prediction models. Table IX shows the attribute representation for final model construction.
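The 10-fold cross-validation scheme chosen above can be sketched in a few lines. This is an illustration of the general technique only: the study used RapidMiner, whose sampling options (for example stratified sampling) are not described in the text, so the plain shuffled split, the seed and the function name below are assumptions.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Assumption: a plain shuffled split without stratification.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# 28,844 is the size of the final dataset reported in the paper.
splits = list(k_fold_indices(28844, k=10))
```

Each record appears in exactly one test fold, so every record is used for testing once and for training nine times. The reported performance would then be aggregated over the ten test folds.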
2) Model evaluation: Each prediction model's performance was evaluated by comparing the values of accuracy, precision, recall and F measure. Those values were calculated from the confusion matrix, as shown in Table X. Prediction results and actual classes were placed in a matrix for comparison as positive and negative values. Class 'C' was marked as the positive value while class 'G' was the negative value.
In addition to the performance comparison, a statistical test was performed to decide the best prediction model. This study used the McNemar test to determine whether there was a statistically significant difference in the proportion of errors between two prediction models, at a significance level of 0.05 (α = 0.05). A significant difference between the proportions of errors of two prediction models is also interpreted as a significant difference in performance between the two models (Dietterich, 1998).
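The four measures above follow directly from the confusion matrix counts, with class 'C' as the positive class. The sketch below uses the standard definitions; the function names and the example counts are hypothetical, while the precision and recall passed to f_measure at the end are the RF values reported in this paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F measure from confusion matrix counts.

    tp: dropped out, predicted drop-out      fp: graduated, predicted drop-out
    fn: dropped out, predicted graduate      tn: graduated, predicted graduate
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure(precision, recall),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=5, fn=20, tn=395)

# Consistency check with the RF precision (97.10%) and recall (81.26%)
# reported in this paper.
rf_f_measure = f_measure(0.9710, 0.8126)
```

Applied to the reported RF precision and recall, the harmonic mean comes to about 0.885, which agrees with the F measure reported for RF.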
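A minimal sketch of the McNemar test for comparing two classifiers on the same test records is given below. It assumes the continuity-corrected form of the statistic; the paper does not state which variant SPSS was configured to use, and the disagreement counts in the example are hypothetical. The p-value uses the chi-square distribution with one degree of freedom, which can be written with the complementary error function.

```python
import math


def mcnemar_test(b, c):
    """Continuity-corrected McNemar test (assumed variant).

    b: records model A classified correctly and model B misclassified
    c: records model A misclassified and model B classified correctly
    Returns (statistic, p_value).
    """
    if b + c == 0:
        # The two models never disagree: no evidence of a difference.
        return 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical disagreement counts, for illustration only.
stat_diff, p_diff = mcnemar_test(b=40, c=20)
stat_same, p_same = mcnemar_test(b=25, c=20)
```

Only the records on which the two models disagree enter the statistic. In the first example the p-value falls below α = 0.05, so the error proportions would be judged significantly different; in the second it does not.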

IV. RESULTS AND DISCUSSION
The results indicated that the RF model gave the highest accuracy in predicting student drop-out at 95.93%, followed by ANN at 95.86% and DT at 95.84%. The highest accuracy for the RF model was produced with seven attributes, while the other two models reached theirs with six attributes. However, the accuracy of the RF model with six attributes was still higher than that of the ANN and DT models with the same number of attributes (refer to Fig. 2). RF consistently yielded a higher accuracy than the other two models, even with different numbers of attributes. This shows that prediction performance could be improved with the use of ensemble learning. This result is also in line with the outcome of [20], which predicted student drop-out in a higher learning institution and found that the accuracy of the prediction model using RF was higher than that using DT.
Performance between prediction models was evaluated by comparing the values of accuracy, recall, precision and F measure (refer to Table XI). Aside from accuracy, the results also showed that RF led ANN and DT in recall, with 81.26% against 81.03% and 80.99%, respectively. This means that the RF model correctly predicted more of the students who actually dropped out (class 'C') in this study. Likewise, the highest precision was recorded by RF at 97.10%, which means that a larger share of the students it predicted to drop out actually belonged to class 'C'. ANN took second place in precision with 96.89%, while DT was last with 96.83%. In F measure, RF was also the highest with 0.885, followed by ANN with 0.883 and DT with 0.882.
Generally, RF is the best model for predicting drop-out among B40 students in this study, as it outperformed the other two models in accuracy, recall, precision and F measure, followed by the ANN and DT models. Nevertheless, the differences in accuracy and F measure between RF and the other two models were very narrow, at 0.07 to 0.09 percentage points and 0.002 to 0.003, respectively. Hence, the statistical test (McNemar) results were used to determine whether the difference in performance between the prediction models was significant.
Based on Table XII, the McNemar test results showed that, statistically: 1) the DT and RF models had a significant difference in proportion of error; 2) the DT and ANN models had no significant difference in proportion of error; and 3) the ANN and RF models had no significant difference in proportion of error.
This implied that even though RF is the best model in predicting drop-out among B40 students in this study, there is a significant difference in performance only between RF and DT, but contrarily, no significant difference in performance between RF and ANN.

V. CONCLUSION
Drop-out prediction among B40 students in bachelor's degree programmes can be implemented by using classification technique. Prediction model using RF was selected as the best model in this study as it outperformed ANN and DT in accuracy, recall, precision and F measure. However, statistically, the difference in performance was only significant between RF and DT, not between RF and ANN.
The results of this research are expected to benefit B40 students, public universities and the government. Public universities can deploy early prevention steps to avoid drop-out and produce more graduates. B40 students who are at risk of dropping out will be able to graduate with the help of their university and gain better job opportunities that improve their socioeconomic status. These students will also become assets to the government as professionals and skilled workers who can contribute to the nation's future development.
In future, this study can be extended by applying regression techniques to predict when attrition will happen, using additional data on students who are still studying and the exact date of drop-out. In addition, association rule techniques can be applied to discover hidden patterns that can be used to identify students at risk, and the results can be verified by experts from the ministry or universities.