Development of Mobile-Interfaced Machine Learning-Based Predictive Models for Improving Students’ Performance in Programming Courses

Abstract: Student performance modelling (SPM) is a critical step in assessing and improving students' performance in the course of their learning. However, most existing student performance models rely on statistical approaches whose results are probabilistic estimates, and the actual influence of hidden factors peculiar to students, lecturers, the learning environment and the family, together with their overall effect on student performance, has not been exhaustively investigated. In this paper, student performance models for improving students' performance in programming courses were developed using the M5P Decision Tree (MDT) and the Linear Regression Classifier (LRC). The data were gathered using a structured questionnaire from 295 students in the 200 and 300 levels of study who offered Web programming, C or Java at Federal University, Oye-Ekiti, Nigeria between 2012 and 2016. Hidden factors that are significant to students' performance in programming were identified. The relevant data were gathered, normalized, coded and prepared as variable and factor datasets, and fed into the MDT algorithm and the LRC to develop the predictive models. The developed models were validated and afterwards implemented in the Android Studio 1.0.1 environment. Extensible Markup Language (XML) and Java were used for the design of the Graphical User Interface (GUI) and the logical implementation of the developed models as a mobile calculator, respectively. Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE) were the metrics used to evaluate the MDT and LRC models. The evaluation results indicate that the variable-based LRC produced the best model, yielding the lowest MAE, RMSE, RAE and RRSE values in all the evaluations conducted. Further results established the strong significance of the attitude of students and lecturers, the fearful perception of students, erratic power supply, university facilities, student health and students' attendance to the performance of students in programming courses. The variable-based LRC model presented in this paper could provide baseline information about students' performance, thereby supporting better decision making towards improving teaching and learning outcomes in programming courses.


I. INTRODUCTION
Computer programming courses are a fundamental part of many universities' curricula and are among the most important subjects for computer science and information technology students. They require knowledge of programming tools and languages, problem-solving skills and effective strategies for program design and implementation [1]. Furthermore, students are exposed to various programming specifications and techniques, which normally entail an overview of algorithms, programming concepts, basic data structures, problem analysis and illustrations of how various techniques apply to problems that are quite difficult to understand [2]. In addition, the high level of abstraction and the very complex syntactic and semantic structures involved in programming make it a much dreaded task in which most students fail [2]. This is evidenced by the observation that the same set of students who failed programming courses performed better in non-programming courses [3]. As a matter of fact, the failure rate in programming courses at the university level suggests that learning to program is a difficult task [3]. The perceived complexity of programming courses is one of the main factors that may have contributed to the decline in the number of undergraduates who offer or intend to offer computer science in various institutions [4].
www.ijacsa.thesai.org
Chermahini [5] noted that students differ in their ability to learn, in how they respond to instructional practices and in their motivation, and that the better students understand the differences in their abilities, the better their chances of meeting their different learning needs and achieving good scores in examinations. Students' performance is affected by several social, economic, institutional, environmental, psychological and personal factors which vary across individuals and regions [6]-[8]. Unfortunately, poor performance has ravaged academic institutions owing to factors that include poor funding, lack of frequent curricular review, overpopulation, students' unrest, staff strikes, poor facilities, strained relations between the university and government, and inadequate teaching and research facilities needed to enhance students' learning and performance. More specifically, Ogbogu [6] and Irfan and Shabana [9] emphasized that challenges pertinent to developing countries, such as poorly equipped departmental and central libraries, overcrowded lecture rooms, the method of collating and accessing semester results, interruption of electricity supply, poor access to internet facilities, incessant strikes and school closures, and poor accommodation facilities, affect student performance.
Students' performance assessment has become a pressing issue that requires fair attention from all, regardless of differences in interest and intentions [9], [10]. Different methods have been used to evaluate students' performance and, more than ever before, the information generated by evaluation can help students and tutors to take timely, meaningful and effective decisions. Most existing student performance models have adopted statistical techniques for prediction; these are probability-based, so their results are estimates rather than exact values. To this end, several authors have adopted data mining and soft computing techniques in the educational domain and/or to evaluate students' performance [11]-[17].
Ashish, Saeed, Maizatul, and Hamidreza [14] focused on consolidating the different types of clustering algorithms being applied within the context of Educational Data Mining (EDM) to harness the power of the massive didactic data recently being generated in institutions. EDM was employed to analyze data generated in an educational setup by the various interconnected systems in a bid to develop a model for improving learning and institutional effectiveness. Among the clustering algorithms consolidated by the authors are Expectation Maximization, Hierarchical Clustering, Simple k-Means and x-Means, the Apriori Algorithm (as applied to the academic records of students in a bid to obtain the best association rules for student profiling), C-Means clustering, Ward's clustering, the Markov Clustering (MCL) algorithm, Unique Clustering with Affinity Measure (UCAM), Fuzzy sets, Transitive Closure and a hierarchical cluster analysis performed on questionnaire data. As concluded by these authors, data mining methods in the educational sector seek to turn previously hidden data into meaningful information that can be used for strategic and learning gains.
Kolo, Adepoju and Alhassan [18] aimed at predicting the performance of students with the decision tree approach. Gurmeet and Williamjit [13] employed a data-mining approach for the effective prediction of student performance based on personal, social, psychological and environmental variables. The aim was to achieve high accuracy in predicting student performance and thereby help to identify students with low academic achievement. The parameters employed in the study include gender, hometown, family income, previous semester grade, attendance, communication language (medium), seminar performance and participation in sports. These parameters were analyzed by implementing the algorithms in the WEKA tool. The Naïve Bayes and J48 algorithms were used for classification, and the results showed that the Naïve Bayes algorithm provided an accuracy of 63.59% while the J48 algorithm provided an accuracy of 61.53%.
Generally, the educational sector in developing countries faces a series of multi-factored challenges that contribute to the rapid decline in the performance of students within such environments. Teachers and students alike have long been unable to estimate the impact that certain factors have on academic performance and instead simply anticipate good performance in the long run. As a result, it becomes impossible for students to adjust quickly to the performance-diminishing challenges surrounding them, or to adjust their responses to such factors. More often than not, the actual influence of hidden factors that are peculiar to students, lecturers, the learning environment and the family, together with their overall effect on student performance, has not been exhaustively investigated in existing studies.
In this paper, the M5P decision tree and the linear regression classifier, which are among the most widely adopted machine learning techniques, are employed to develop the student performance predictive models. The metrics used to evaluate the performance of these machine learning techniques include mean absolute error, root mean squared error, relative absolute error, root relative squared error, correlation coefficient, the time taken to build the model and the time taken to test the model.
The major contributions of this paper are as follows:
a) New hidden factors and associated variables on which students' performance in programming courses depends, and which are peculiar to a prototype University in a developing economy, were exhaustively investigated, identified and established. These are significant technical extensions beyond most student performance models that currently exist;
b) Beyond the statistical approaches commonly used for student performance modelling in most existing works, which are based on probability and estimation, this study applied machine learning techniques (M5P Decision Tree and Linear Regression Classifier) to predicting student performance in programming courses, with the aim of improving the precision and accuracy of the resultant predictive models;
c) To facilitate the accessibility, availability and ubiquity of the developed predictive models, a mobile application that visually interfaces the stakeholders and all student performance indices with the models was developed. This enables real-time prediction of students' performance and promotes effective and efficient decision making on education planning by all stakeholders.
The rest of this paper is organized as follows: Section II discusses the materials and method, including the M5P decision tree and linear regression classifier, data acquisition, the development and validation of the machine learning-based predictive models and the performance evaluation metrics for the machine learning-based approaches. In Section III, the design and implementation of the mobile front-end application for the developed predictive models are presented and discussed. The results of the performance evaluation of the machine learning approaches are presented and discussed in Section IV, while the conclusion and future works are presented in Section V.

II. MATERIALS AND METHOD
In this research, models for predicting students' performance in programming courses were developed based on the M5P and linear regression classification algorithms in three basic steps: data acquisition, development of the predictive models and model validation. Furthermore, the performance of the machine learning approaches employed was evaluated and the developed predictive models were implemented as a mobile application.
A. The Classification Algorithms
1) M5P Decision Tree: This is a decision tree model that learns regression tasks. M5P learns efficiently and can cope with high-dimensional data with up to several hundred distinct attributes. According to Quinlan [19], the M5P decision tree is the most accurate among the family of regression tree learners and produces model trees that are much smaller than regression trees. It uses mean squared error as the impurity function. An M5P tree is constructed by recursively partitioning a dataset into a collection of sets $T$, each of which is associated either with a leaf or with a split function that divides $T$ into subsets according to some split criterion [20]. The subsets that emerge are further partitioned by the same process. The quality of a split (goodness of fit) is evaluated using a function $\Phi(s, t)$, where $s$ is the split candidate in node $t$, such that the split candidate that maximizes the goodness of fit is selected as the next node of the tree [21]. That is,

$$\Phi(s, t) = \Delta i(s, t) = i(t) - P_L \, i(t_L) - P_R \, i(t_R) \qquad (1)$$

where $i(t)$ is the impurity function at node $t$ for $K$ classes in a dataset, defined as:

$$i(t) = \phi\big(p(1 \mid t), p(2 \mid t), \ldots, p(K \mid t)\big) \qquad (2)$$

$P_L$ and $P_R$ are the probabilities that an instance goes to the left branch $t_L$ and the right branch $t_R$ of $t$ according to split $s$; $p(j \mid t)$ is the estimated posterior probability of class $j$ given a point in node $t$; and $\Delta i(s, t)$ is the difference between the impurity measure of node $t$ and that of its two child nodes $t_L$ and $t_R$ according to split $s$. The information gain in M5P is determined by the difference in the standard deviation obtained before and after the split function test. Simply put, given data $T$, where $T_i$ denotes the subsets of $T$ corresponding to the outcomes of a split function test, the expected error reduction is given by Hieu [22] as:

$$\Delta error = sd(T) - \sum_{i} \frac{|T_i|}{|T|} \, sd(T_i) \qquad (3)$$

The split function test criterion that maximizes this expected error reduction is then selected. To avoid overfitting, subtrees that do not improve the performance of the tree are pruned via an error-based estimation procedure, from the leaves to the root node [23]. At each internal node, pruning is decided by the difference between the estimated error of the node and the estimated error of the subtree below it.
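The expected error reduction described above (the standard deviation of the data minus the size-weighted standard deviations of the subsets produced by a split) can be illustrated with a minimal Python sketch. This is not the WEKA M5P implementation used in the study: the function names, the use of the population standard deviation, the midpoint thresholds and the toy attendance data are all assumptions made for illustration.

```python
import math
import statistics


def sd_reduction(targets, subsets):
    """Standard deviation of all targets minus the size-weighted
    standard deviations of the subsets produced by a split."""
    total = len(targets)
    weighted = sum(len(s) / total * statistics.pstdev(s) for s in subsets)
    return statistics.pstdev(targets) - weighted


def best_threshold(values, targets):
    """Return (threshold, reduction) for the binary split on one numeric
    attribute that maximizes the expected error reduction."""
    pairs = sorted(zip(values, targets))
    best = (None, -math.inf)
    for (lo, _), (hi, _) in zip(pairs, pairs[1:]):
        if lo == hi:
            continue  # no threshold fits between equal attribute values
        threshold = (lo + hi) / 2
        left = [t for v, t in pairs if v <= threshold]
        right = [t for v, t in pairs if v > threshold]
        reduction = sd_reduction(targets, [left, right])
        if reduction > best[1]:
            best = (threshold, reduction)
    return best


# Hypothetical data: coded attendance level against course score.
attendance = [1, 2, 3, 4]
scores = [40, 42, 70, 72]
threshold, reduction = best_threshold(attendance, scores)
```

In this toy case the split between attendance levels 2 and 3 separates the low scores from the high ones, so it yields the largest reduction and would be chosen as the split for that node.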
2) Linear Regression Classifier: The linear regression classifier is a mathematical measure depicting the mean relationship among two or more variables in terms of the original units of the data [24]. It often involves estimating and predicting an unknown value of one variable from the known value of another variable [25]. A linear regression exists between the variables if the regression curve is a straight line. With linear regression, the value of the dependent variable changes by a constant absolute amount for a unit change in the value of the independent variable. The general form of the linear regression measure is given as [26]:

$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i = \sum_{i=0}^{n} \beta_i x_i \qquad (4)$$

where the second form holds if $x_0 = 1$ is assumed.

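Algorithm: Linear Regression Classification [27]
Inputs: Class models $X_i \in \mathbb{R}^{q \times p_i}$, $i = 1, 2, \ldots, N$, and a test input student performance factors' vector $y \in \mathbb{R}^{q \times 1}$.
Output: Class of $y$.
i. $\hat{\beta}_i = (X_i^T X_i)^{-1} X_i^T y$ is evaluated against each class model.
ii. The predicted response $\hat{y}_i = X_i \hat{\beta}_i$ is computed for each $\hat{\beta}_i$.
iii. Distance between original and predicted response variables is determined by $d_i(y) = \lVert y - \hat{y}_i \rVert_2$.
iv. Decision is made with regard to the class that has the minimum distance $d_i(y)$.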
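A minimal sketch of a linear regression classification procedure of this kind is given below, assuming the usual least-squares formulation: each class model is a set of training vectors, the test vector is regressed onto each class model, and the class with the smallest distance between the original and predicted response wins. The class labels, the toy vectors and the helper names `solve` and `lrc_predict` are hypothetical and are not taken from the study.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("class model vectors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def lrc_predict(class_models, y):
    """class_models maps a label to a list of training vectors (the
    columns of that class model). Returns (label, distance)."""
    best_label, best_dist = None, math.inf
    for label, columns in class_models.items():
        # Normal equations (X^T X) beta = X^T y for this class model.
        gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns]
                for ci in columns]
        rhs = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
        beta = solve(gram, rhs)
        # Predicted response: y_hat = X beta.
        y_hat = [sum(b * c[k] for b, c in zip(beta, columns))
                 for k in range(len(y))]
        dist = math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Hypothetical class models built from three-element factor vectors.
models = {"pass": [[1, 0, 0], [0, 1, 0]], "fail": [[0, 0, 1]]}
label, dist = lrc_predict(models, [2, 3, 0.1])
```

Here the test vector lies almost entirely in the span of the "pass" vectors, so that class gives the smaller distance and is selected.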

B. Data Acquisition
Hidden factors that are significant to student performance were identified via a thorough literature review, interviews and field observations. A questionnaire was developed for the University under study with respect to information on programming courses and associated scores, as presented in the Appendix. The contextual definition of the variables is presented in Table I. Copies of the questionnaire were distributed to students who had offered programming courses and to their respective lecturers in the University. Relevant data were gathered, normalized and coded. The coded data were used by the machine learning techniques to develop the student performance models, which were further validated for prediction purposes. Twenty-one (21) factors were investigated in this study with a total of 81 variables. Each factor was coded based on the cumulative of the variables designated to investigate it, as conducted by Fagbola et al. [11]:
a) Student Study Habit (SSH): This is the amount of the student's effective study in the programming courses offered, relative to the frequency of revision and practice and the hours spent revising lecture notes. It was investigated by three variables.
b) Student Fear and Perception (SF): This is the students' fearful perception of programming courses, where a positive perception implies a reduction in the student's fear factor. This factor was investigated by its designated variables.
c) Student Attendance (SATD): This is the level of effort, seriousness and devotion of students towards learning to program. This factor was investigated by its designated variables.
d) Student Attitude (SAT): This is the level of responsiveness of a student relative to their interest, behaviour and seriousness towards programming courses, characterized by the student's participation in class activities and assignments, willingness to learn, and motivation from friends, colleagues and lecturer(s). This factor was investigated by its designated variables.
e) Tutorials and Extra Classes (ST): This is the extra effort made by students in order to gain a clear understanding of the subject matter discussed in programming classes. It includes extra classes attended, assistance from friends and the use of online forums and materials. This factor was investigated by its designated variables.
f) Lecturer Attitude (LAT): This is defined as the lecturer's assertiveness, interest in explaining the subject matter explicitly, and ability to motivate and relate with students so as to improve their interest in the course. This factor was investigated by its designated variables.
g) Teaching Style (LTS): This is defined as the teaching pattern of the lecturer in charge (for example, issuing voluminous handouts or excessive assignments) and whether the lecturer carries the class along and helps students grasp the concepts of the particular programming course. This factor was investigated by its designated variables.
h) Communication Skills (LCS): This is the ability of the lecturer to deliver the course content unambiguously and to the understanding of the students. It entails the clarity and explicitness of the lecturer. This factor was investigated by its designated variables.
i) Lecturer Availability (LA): This is the presence and accessibility of the lecturers when they are needed by the student(s). This factor was investigated by its designated variables.
j) Lecturer Dedication (LD): This is the devotion of the lecturers to the programming courses they teach. It includes the lecturers' commitment to their duty and the extra effort made to ensure excellent student performance. This factor was coded as presented in Table III and was investigated by its designated variables.
k) Health (OH): This is the influence of medical conditions on students' performance in programming courses. This factor was coded and investigated by its designated variables.
l) Electricity (OE): This is defined as the erratic nature of the power supply as it affects students' practice on computers and other laboratory work. This factor was coded and investigated by its designated variables.
m) Background knowledge (OB): This is the academic strength of the student in other courses that are fundamentally related to computer programming (mathematics and physics). This factor was investigated by its designated variables.
n) Facilities (UF): This is the availability of appropriate programming learning facilities (computer laboratory) within the university environment. This factor was investigated by its designated variables.
o) Class population (UCP): This is the student-to-tutor ratio during programming classes. This factor was investigated by its designated variables.
p) Lecture time (ULT): This is the conduciveness of the lecture schedule. This factor was investigated by its designated variables.
q) Teaching aids (UTA): This is the availability of teaching aids (audio-visuals) for demonstrating the concepts of programming courses. This factor was investigated by its designated variables.
r) Family income (FI): This is the robustness of the student's family income, as it influences the student's ability to afford textbooks, print handouts or own a personal computer for effective study. This factor was investigated by its designated variables.
s) Family stress (FS): This is the degree of disturbance from home. An unsettled home creates a tense atmosphere which seemingly affects student performance. This factor was investigated by its designated variables.
t) Parent education (FPE): This is the level of education of the student's parents. Poor motivation from home might destabilize the student's cognitive sense and hence influence the student's performance in programming. This factor was investigated by its designated variables.
u) Proper guidance (FPG): This is the level of guidance and support the student receives from the family for programming courses. A student from a family of computer scientists is likely to have considerable support and guidance from home. This factor was investigated by its designated variables.
After the final normalization and cleaning processes were completed, the entire data acquired was divided into variable and factor datasets, and each dataset was used to train the machine learning classifiers.
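The preparation of the variable and factor datasets can be sketched as follows. The source states only that the data were normalized and that each factor was coded as the cumulative of its variables; the min-max scaling, the 1 to 5 response scale, the variable names (`SSH1`, `SF1`, and so on) and the sample responses below are assumptions introduced purely for illustration.

```python
def normalize(response, low=1, high=5):
    """Min-max scale each coded answer from [low, high] to [0, 1].
    The 1-5 scale is an assumption, not taken from the questionnaire."""
    return {name: (value - low) / (high - low)
            for name, value in response.items()}


def code_factors(response, factor_map):
    """Code each factor as the cumulative (sum) of its variables."""
    return {factor: sum(response[name] for name in names)
            for factor, names in factor_map.items()}


# Hypothetical variable names for two of the twenty-one factors.
FACTOR_MAP = {"SSH": ["SSH1", "SSH2", "SSH3"], "SF": ["SF1", "SF2"]}

# One respondent's coded answers (illustrative values only).
response = {"SSH1": 4, "SSH2": 3, "SSH3": 5, "SF1": 2, "SF2": 1}

variable_row = normalize(response)                   # row of the variable dataset
factor_row = code_factors(variable_row, FACTOR_MAP)  # row of the factor dataset
```

Under these assumptions, each respondent contributes one row to the variable dataset (one column per variable) and one shorter row to the factor dataset (one column per factor).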

C. Development of the Machine Learning-based Student Performance Predictive Models
The M5P decision tree and the linear regression classifier, both of which have packaged working implementations in the WEKA environment, were trained using the variable and factor datasets and then applied to generate predictive models for determining students' performance. The variable-based student performance model generated by the linear regression classifier is presented in (5).

[Equation (5): variable-based student performance model generated by the linear regression classifier]

The learned models are further used to generate predictions on new instances. The factor-based student performance model obtained using the linear regression classifier is expressed in (6).

[Equation (6): factor-based student performance model generated by the linear regression classifier]

The M5 pruned model tree for the variable dataset is presented in Fig. 1. The variable-based M5P decision tree classifier generated smoothed Linear Models (LM) through 22 refinement processes. The first and the last generated models are presented in (7) and (8), respectively, although the latest refinement was used to predict student performance. The M5 pruned model tree for the factor dataset is presented in Fig. 2. The factor-based M5P classifier also generated smoothed Linear Models (LM) through 22 refinement processes. The first and the last models generated are presented in (9) and (10), respectively.

[Equations (9) and (10): first and last linear models generated by the factor-based M5P classifier]

D. Validation of the Developed Machine Learning-based Student Performance Predictive Models
The variable and factor datasets were employed in the development of the students' performance predictive models, which were then validated using the test dataset. Some instances of the validation results of the predictive models generated by the machine learning classifiers are presented in Table II. It is important to note that, given the limited data used for validation, the validation results alone cannot be used to justify the correctness of the developed models; standard evaluation measures are required for that purpose. Based on the validation results obtained, the best performing model is the factor dataset-based SPM generated by the linear regression classifier. This is followed by the variable dataset-based SPM generated by the M5P decision tree classifier, the factor dataset-based M5P decision tree and the variable dataset-based SPM generated by the linear regression classifier, in decreasing order of performance. Note that the best prediction values are marked in bold.

E. Performance Evaluation Metrics for the Machine Learning-based Approaches Used
The mean absolute error, root mean square error, relative absolute error, root relative squared error, and the time taken to build and test the models are the standard metrics used to evaluate the performance of the learning techniques.
a) Root Relative Squared Error (RRSE) is determined using the relation:

$$RRSE_i = \sqrt{\frac{\sum_{j=1}^{n} \left(P_{(ij)} - T_j\right)^2}{\sum_{j=1}^{n} \left(T_j - \bar{T}\right)^2}}$$

where $P_{(ij)}$ represents the value predicted by the individual program $i$ for sample case $j$ out of $n$ sample cases, $T_j$ is the target value for sample case $j$, and $\bar{T}$ is given by [28]:

$$\bar{T} = \frac{1}{n} \sum_{j=1}^{n} T_j$$

b) Relative Absolute Error (RAE) takes the total absolute error of the model and divides it by the total absolute error of a simple predictor that always outputs the mean target value $\bar{T}$. It is determined using the relation [24]:

$$RAE_i = \frac{\sum_{j=1}^{n} \left|P_{(ij)} - T_j\right|}{\sum_{j=1}^{n} \left|T_j - \bar{T}\right|}$$

c) Mean Absolute Error (MAE) is determined by adding the absolute values of the errors and then dividing the total by the number of sample cases $n$ [24]:

$$MAE_i = \frac{1}{n} \sum_{j=1}^{n} \left|P_{(ij)} - T_j\right|$$

d) Root Mean Square Error (RMSE): This is a measure of the differences between the values predicted by a model and those actually observed from the system being modelled [28]. It can also be used to compare the performance of one predictive model against another. Analytically,

$$RMSE = \sqrt{MSE}, \quad \text{where} \quad MSE = \frac{1}{n} \sum_{j=1}^{n} \left(y_j - \hat{y}_j\right)^2$$

such that $\hat{y}_j$ is the model-predicted response for input $x_j$ and $y_j$ is the corresponding observed value.
e) Time taken to build the model: This is the total time required to learn the discriminating features and to develop a model.
f) Time taken to test the model: This is the time taken to validate and ascertain the correctness of the developed model.
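The four error metrics can be computed from a list of predicted values and a list of target values, as in the minimal Python sketch below. It follows the standard definitions of MAE, RMSE, RAE and RRSE (the relative measures are normalized by a predictor that always outputs the mean target); the function name and the sample scores are hypothetical, and the relative measures are returned as fractions, so they would need to be multiplied by 100 to be read as percentages.

```python
import math


def regression_errors(predicted, target):
    """Return MAE, RMSE, RAE and RRSE for paired predictions and targets."""
    n = len(target)
    mean_target = sum(target) / n
    abs_err = sum(abs(p - t) for p, t in zip(predicted, target))
    sq_err = sum((p - t) ** 2 for p, t in zip(predicted, target))
    # Errors of the simple predictor that always outputs the mean target.
    abs_base = sum(abs(t - mean_target) for t in target)
    sq_base = sum((t - mean_target) ** 2 for t in target)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / abs_base,
        "RRSE": math.sqrt(sq_err / sq_base),
    }


# Hypothetical actual and predicted course scores for three students.
target = [50, 60, 70]
predicted = [52, 58, 73]
errors = regression_errors(predicted, target)
```

A model with RAE or RRSE below 1 does better than always predicting the mean score, which is why lower values of these metrics indicate a better model.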

III. THE DESIGN AND IMPLEMENTATION OF A MOBILE FRONT-END APPLICATION FOR THE DEVELOPED PREDICTIVE MODELS
The developed student performance models were implemented within the Android Studio 1.0.1 environment, using XML for the design of the Graphical User Interface (GUI) and Java for the logic that unifies the GUI and the implementation of the developed models. The flowchart for the implementation of the developed student performance models is presented in Fig. 3. The code and design interface is presented in Fig. 4. In the same vein, the mobile home interface of the SPM implementation, presented in Fig. 5, defines the model(s) to be applied and serves as a link to the questioning aspects of the application. Students and stakeholders can predict the performance of a student by selecting any of the options presented on the home activity of the application. Each of these options implements an underlying model which is used to predict student performance from the responses to the questions presented.
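The prediction step behind each option can be sketched as the evaluation of a linear model on the coded responses. The application itself was written in Java; Python is used here only to keep the sketch short. The intercept, the coefficients and the function name `predict_score` are hypothetical placeholders and are not the coefficients of the developed models.

```python
def predict_score(intercept, coefficients, responses):
    """Evaluate a linear model: intercept plus the sum of each
    coefficient multiplied by the matching coded response."""
    missing = sorted(set(coefficients) - set(responses))
    if missing:
        raise ValueError("unanswered questions: " + ", ".join(missing))
    return intercept + sum(weight * responses[name]
                           for name, weight in coefficients.items())


# Hypothetical factor-level model and one user's coded responses.
INTERCEPT = 50.0
COEFFICIENTS = {"SAT": 2.0, "SF": -1.5}
responses = {"SAT": 4, "SF": 2}

score = predict_score(INTERCEPT, COEFFICIENTS, responses)
```

Rejecting incomplete input mirrors the behaviour described for the application, where a prediction is produced only after every question under the selected perspective has been answered.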
The interface presented in Fig. 6 displays the questions that are relevant to the selected prediction perspective. Responses to these questions are then linked to the underlying models. In Fig. 7, the predicted performance of the student is displayed in an alert message box after the responses from prospective students and educational stakeholders have been substituted into the chosen model(s). This happens upon clicking the finish button, which appears after all the questions required for the prediction of student performance under the selected perspective have been answered.

IV. RESULTS AND DISCUSSION
In this section, the performance and comparative evaluation results of the machine learning predictive approaches and the developed student performance models are presented and discussed.

A. Results of Performance Evaluation of the Machine Learning Methods
The results for the mean absolute error, root mean square error, relative absolute error, root relative squared error, and the time taken to build and test the models for both the linear regression and M5P decision tree classifiers are presented in Table III. The variable-based Linear Regression Classifier produced the best model in terms of mean absolute error, root mean squared error, relative absolute error and root relative squared error, having yielded the lowest values in all these metrics. This is followed by the variable-based M5P decision tree, the factor-based M5P decision tree and the factor-based linear regression classifier, in decreasing order of performance. In terms of the time taken to build the model, the results indicate that the factor-based M5P decision tree is the most computationally efficient classifier, followed by the variable-based linear regression classifier, the variable-based M5P decision tree and the factor-based linear regression classifier. Using the model produced by the best performing classifier (variable-based LRC), three (3) out of the 70 variables investigated were found to be insignificant to student performance, as presented in Table IV. There are 32 variables with positive significance and 35 variables with negative significance to student performance in programming courses, as presented in Tables V and VI, respectively.
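One way to arrive at such a grouping is to inspect the sign of each variable's coefficient in the fitted linear model. The sketch below assumes that positive and negative significance correspond to positive and negative coefficients and that insignificant variables have a zero (or dropped) coefficient; this reading is an assumption, and the variable names and coefficient values are hypothetical.

```python
def group_by_sign(coefficients, tolerance=1e-9):
    """Split variables into positive, negative and insignificant groups
    according to the sign of their model coefficient."""
    groups = {"positive": [], "negative": [], "insignificant": []}
    for name, value in coefficients.items():
        if abs(value) <= tolerance:
            groups["insignificant"].append(name)
        elif value > 0:
            groups["positive"].append(name)
        else:
            groups["negative"].append(name)
    return groups


# Hypothetical coefficients for four questionnaire variables.
coefficients = {"SAT1": 1.8, "SF2": -0.9, "FPE1": 0.0, "UF1": 0.4}
groups = group_by_sign(coefficients)
```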

B. Comparative Evaluation of the Developed Student Performance Models
The variables with positive significance in the variable-based LRC model agree with some already established variables such as students' lack of understanding, absence from class, negative attitudes towards programming and students' performance in Mathematics [29], study habit [30], reviewing study materials, self-evaluation, rehearsing explanations of the materials and studying in a conducive environment [31], students' class attendance (Pudaruth, Nagowah, Sungkur, Moloo and Chinia [32]), teaching styles and strategies [33], availability of University facilities [6] and mathematics background [34]. However, this study established the negative significance of variables such as group discussions and a good background in physics and English, among others, on student performance in programming, contrary to the reports of, for example, Mohd and Abdullah [29] and Darwin et al. [30]. In general, the variable-based LRC model explicitly extends most existing counterparts with salient factors such as Lecturers' Teaching Style (LTS), Health (OH), Electricity (OE), Parental Education (FPE), Student Fear and Perception (SF) and Tutorials and Extra Classes (ST), among others, which have not been duly considered in previous works.

V. CONCLUSION AND FUTURE WORKS
This study was conducted to explore the factors affecting the academic performance of undergraduates in programming courses and to develop models with which the performance of students can be predicted. The research was conducted on a sample of students who had at one time or another offered Web programming, C or Java at the Federal University, Oye-Ekiti, Ekiti State, Nigeria between 2012 and 2016. It was based on students' performance records which cut across the second and third (200 and 300) levels of study within the institution. Machine learning approaches were employed to analyze the data retrieved from a defined number of respondents. The results indicate that the attitude of students and lecturers, the fearful perception of students, erratic power supply, university facilities, student health and students' attendance are significant to the performance of students in programming courses. It is recommended that future research adopt improved statistical machine learning approaches to comparatively model learning behaviour in private and public Universities in Nigeria and identify the salient factors significant to the performance of students in both systems, for robust evaluation of the quality of training and to aid effective decision making by the government, students and University education stakeholders. Furthermore, considering all the programming courses offered in the institution and a relatively larger population might improve the findings reported in this study. The existing statistical machine learning approaches can also be extended, and others can be introduced, for more accurate results.

Fig. 3. Flow control of the implementation of student performance models.

Fig. 4. Code and design interface of the student performance models.

Fig. 5. Home interface of the mobile student performance evaluator.

TABLE IV
I had to travel to settle quarrels within my family
My mother is familiar with computers
My parents are well educated

TABLE VI. VARIABLE-BASED LRC SPM VARIABLES WITH NEGATIVE EXPRESSIONS
Quarrel between family members is normal
Quarrel between my family members escalates a times
My father is familiar with computers
My parent would want me to offer programming courses
I received educational advices from family members often
My family believed that a proper study will help me in programming courses