Educational Data Mining for Monitoring and Improving Academic Performance at University Levels

This study applied Educational Data Mining (EDM) to a sample of 712 logs extracted from the Moodle Learning Management System (LMS) at an African university in order to measure students' and staff patterns of use of LMS resources, and hence to determine whether the quantity of participation, measured as the amount of time spent using LMS resources, improved students' academic performance. Data collected from the Moodle LMS were preprocessed and analyzed using machine learning algorithms for clustering, classification and visualization from the WEKA system tools. The dataset consisted of course tools (Quiz, Assignment, Chat, Forum, URL, Folder and Files) and lecturer and student usage of the tools. Furthermore, SPSS was used to obtain a matrix of correlation coefficients for course tools, tests and final grade. Correlation analysis was done to verify whether students' use of course tools had an impact on their academic performance. Findings indicated the pattern of usage for Course 1 as Quiz (38358), System (17910), Forum (8663), File (8566), Assignment (1235), Folder (514), File Submission (172), and Chat (37); for Course 2 as System (11920), Quiz (8208), Forum (4476), File (4394), Assignment (257), Chat (247), URL (125), and File Submission (38); and for Course 3 as System (2622), File (1022), Folder (570), Forum (258), and URL (2). Overall, evaluating the correlation between the use of LMS resources and students' performance, findings indicated a significant relationship between the use of LMS resources and students' academic performance at the 0.01 level of significance. The findings are useful for strategic academic planning with LMS data at the university.
Keywords: Educational data mining; learning management systems; Weka system tools; improved academic performance


I. INTRODUCTION
Data Mining (DM) and Machine Learning (ML), as subdisciplines of computer science, provide powerful tools for knowledge discovery from massive data sets [1,2]. As a process of discovering patterns in data, a DM process must be automatic or semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage [3]. Educational Data Mining (EDM) is defined in this paper as the application of data mining to derive meaningful patterns from educational system repositories, which in turn can be used to improve teaching and learning experiences. One such educational teaching and learning tool is the learning management system (LMS). LMS tools are computerized teaching and learning platforms for creating content, delivering courses to learners and managing courses in teaching and learning environments [4,5].

A. Statement of the Problem
Many African universities have invested in learning management systems. Over the years, massive data have accumulated through the LMS, but these data have not been appropriately mined to provide information useful for strategic decisions at university level. The aim of this paper is to demonstrate with empirical evidence the usefulness of EDM in discovering hidden but useful patterns in teaching and learning data accumulated through the LMS. Several studies, for example [4,5,6,7,8,9], have also shown the central place of LMSs in teaching and learning, although only a few addressed the need for EDM on LMS log data. The specific objectives of this study were:
1) To measure students' and staff use of LMS resources in teaching and learning at university level.
2) To evaluate the correlation between potential use of LMS resources and students' performances.

B. Research Questions
The following research questions were investigated in the study:
1) What were the recognizable patterns in teaching and learning from Moodle LMS logs for the sampled data set?
2) Is there any correlation between potential use of LMS and prevailing students' performance at the university?
The rest of this paper consists of three sections and a list of references. Section two presents a review of literature relevant to the study. Section three presents the empirical study and the methodology. Section four presents the results and discussion of the study.

II. LITERATURE REVIEW
The application of learning management systems and their effectiveness in higher education have been widely discussed in the literature [6,7,8,10,11,12,13,14,15,16,17]. A common approach to EDM modeling is a combination of sequential processes which includes data collection from LMSs, data preprocessing, data mining and analysis, preprocessing, result generation, and application (see Fig. 1).
570 | P a g e www.ijacsa.thesai.org
A successful application of EDM, which produced satisfactory results at the University of Cordoba, Spain, was reported in [9]. The EDM approach enabled the discovery of new association rules, which were used to improve the design of online courses at the university.
Similarly, strategies to mine data from activity logs found on the Moodle LMS were investigated in [6]. Applying data mining and using simple statistics to analyze the logs, the author recommended the use of Access watch analogue and Web start applications to infer students' attitudes to learning and to predict examination scores through multiple regression. Fig. 2 shows a data mining model using an LMS.
Following a data cycle approach, Fig. 3 demonstrates the usefulness of data mining in the context of this paper. Every decision-making process is based on data transformed into information, culminating in a decision being made. The cycle begins with identifying the problem, collecting and storing data using appropriate tools, preprocessing data, mining data, and discovering new knowledge from data, which provides the necessary feedback from the system for future activities. Fig. 3 portrays the Data Cycle in knowledge discovery using data mining.
A brief explanation of each stage of the data cycle in knowledge discovery (DCKD) follows:
1) Problem definition: An initial definition of the problem, the mission, or the purpose for which data is required.
2) Identifying data sources: Understanding what data are pertinent and where they can be located.
3) Data collection and storing: Retrieval of data from various sources and storing them in an accessible location.
4) Data mining: Selection of relevant data out of the Big Data using appropriate DM and ML tools.
5) Knowledge discovery: New knowledge is discovered and presented to decision makers (classified, clustered and visualized).
6) Learning and decision-making: The final stage, which is the purpose of the data cycle. The results are displayed to the decision makers and decisions are taken.
7) Feedback for further cycles: This stage is not always necessary. However, the need to make a certain decision is very often repetitive, so the customer (the decision maker) can affect the usefulness and the effectiveness of the cycle by forwarding comments and changes.

A. Methodology
A sample of 712 Moodle data logs accumulated using the Moodle Learning Management System was used in the study. The sample comprised three courses selected randomly from three faculties. The logs comprised student, academic staff, and course data in the format: user full name, description, time, components, affected user, event context, event name, origin and IP address. The following criteria were observed in the sample selection:
1) Permission was obtained to conduct the study from which the data was obtained.
2) The experience of a lecturer offering a course on Moodle LMS was considered.
3) A selected course was offered in the first semester.
4) A selected course was taught by a lecturer who showed enthusiasm in online activities.

B. Data Collection Instrument
The logs were downloaded as .csv files, then prepared and preprocessed using the Waikato Environment for Knowledge Analysis (WEKA) tool.
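As a rough illustration of how such a .csv export can be summarized before any mining, the sketch below counts log events per component using only the Python standard library. The column header `Component`, the three sample rows, and the function name `count_by_component` are assumptions made for illustration; the study itself used WEKA rather than Python.

```python
import csv
import io
from collections import Counter


def count_by_component(csv_text, column="Component"):
    """Count log rows per value of one column (the header name is assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


# Hypothetical excerpt shaped like a Moodle log export (not data from the study).
SAMPLE = """User full name,Component,Event name,Origin
Student A,Quiz,Quiz attempt viewed,web
Student B,Forum,Discussion viewed,web
Student A,Quiz,Quiz attempt submitted,web
"""

counts = count_by_component(SAMPLE)
for component, n in counts.most_common():
    print(component, n)
```

Counts of this kind correspond to the per-tool usage figures reported in the results, where each log row is one instance.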

C. Data Preparation and Preprocessing
The preparation and preprocessing consisted of data extraction, cleaning, aggregation/integration, filtering and transformation. Data was mined using machine learning schemes selected from the WEKA tool. The selected schemes included tools for preprocessing, classification, clustering and visualization. The algorithms were applied directly to each dataset at each stage of preprocessing, classification, clustering, and visualization, invoked from the Schemes menu as follows:
1) A data file was selected from the file menu.
2) Important attributes of the data were selected.
3) Aggregates of existing attributes were created using the spreadsheet.
4) A machine learning scheme was selected from the Schemes menu.
5) Results were viewed as trees, text or three-dimensional plots.
7) The scheme was re-run on the revised data.
Furthermore, in order to maintain format independence, the data was converted to an intermediate representation, the Attribute-Relation File Format (ARFF). Data logs of semester 1 of the academic year 2017/18 were downloaded, extracted and used for the study. The extraction process is shown in Fig. 4.
Data filtering was done through the WEKA system filtering tool targeting the attributes required for use in the selected machine learning algorithms ZeroR and J48.
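To make the conversion and filtering steps more concrete, the following is a minimal sketch of writing selected nominal attributes to ARFF text. The function name `to_arff`, the relation name `moodle_logs`, and the sample rows are hypothetical; WEKA's own CSV loader performs this conversion in practice, and this sketch only covers nominal attributes.

```python
def to_arff(relation, rows, attributes):
    """Render dict rows as ARFF text, keeping only the listed nominal attributes."""

    def quote(value):
        # Quote every token so names containing spaces stay intact.
        text = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return "'" + text + "'"

    lines = ["@relation " + quote(relation), ""]
    for name in attributes:
        # A nominal attribute lists every distinct value it can take.
        values = sorted({row[name] for row in rows})
        lines.append(
            "@attribute " + quote(name) + " {" + ",".join(quote(v) for v in values) + "}"
        )
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(quote(row[name]) for name in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical rows; only two of the three columns are kept (attribute filtering).
rows = [
    {"User full name": "Student A", "Component": "Quiz", "Origin": "web"},
    {"User full name": "Student B", "Component": "Forum", "Origin": "web"},
    {"User full name": "Student A", "Component": "Quiz", "Origin": "web"},
]
arff = to_arff("moodle_logs", rows, ["Component", "Origin"])
print(arff)
```

Dropping `User full name` here mirrors the idea of keeping only the attributes a scheme needs; which attributes the authors actually retained at each step is not stated beyond the five listed in the results.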

D. Data Mining and Analysis using Machine Learning Algorithms ZeroR and J48
The J48 and ZeroR machine learning algorithms were applied to the data sets. The choice of J48 and ZeroR in this study was due to their perceived usefulness in classification, clustering and visualization [18,19]. Both are supervised classifiers in WEKA: J48 builds a decision tree (it is WEKA's implementation of C4.5), while ZeroR is a baseline that always predicts the most frequent class. Both classifiers differentiate each case according to some set criteria.
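For readers unfamiliar with ZeroR, the sketch below shows the idea in a few lines of Python: the classifier ignores all attributes and predicts the majority class. The class name and method names are illustrative, not WEKA's API, and the label list is rebuilt from the Course 1 totals reported in the results (38530 Quiz events out of 76217 instances), with every non-Quiz event lumped into a hypothetical `Other` label.

```python
from collections import Counter


class ZeroR:
    """Baseline classifier: always predicts the most frequent training class."""

    def fit(self, labels):
        self.majority_, self.count_ = Counter(labels).most_common(1)[0]
        self.total_ = len(labels)
        return self

    def predict(self, n):
        # Every instance receives the same prediction.
        return [self.majority_] * n

    def training_accuracy(self):
        # Fraction of training instances that belong to the majority class.
        return self.count_ / self.total_


# Labels reconstructed from reported totals; "Other" is an illustrative lump.
labels = ["Quiz"] * 38530 + ["Other"] * (76217 - 38530)
model = ZeroR().fit(labels)
print(model.majority_, round(100 * model.training_accuracy(), 3))
```

Because ZeroR uses no attributes, its accuracy is simply the share of the majority class, which makes it a useful floor against which a decision tree such as J48 can be compared.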

A. Classification with ZeroR and J48 Algorithms
Fig. 5 shows the results of using ZeroR to predict Course 1. The predicted class value was Quiz, and the model took 0.09 seconds to build. The number of correctly classified instances was 38530 (50.553%), while the number of incorrectly classified instances was 37687 (49.447%).
The mean absolute error and the root mean squared error for these predictions were 0.147 and 0.2711 respectively. There were five attributes (User Full Name, Event Context, Components, Event Name and Origin), and the focus was on the Component attribute, which explains how the course activities and resources were used during the semester. The total number of instances, 76217, is the number of log entries generated by the lecturer and the students accessing the course. It should be noted that when an algorithm is based on probability, there is a risk of type 1 errors (false positives) and type 2 errors (false negatives). This accounts for the noticeable mean absolute error and root mean squared error above. The detailed accuracy by class is shown below in the confusion matrix (Table I), which shows the activity tools with their access counts. Table II, Table III, and Fig. 6 show the results of predicting activities in Course 1 using the J48 algorithm.
From Table II, the statistical summary of the results shows 75463 (99%) correctly classified instances and 754 (0.98%) incorrectly classified instances. The mean absolute error is 0.0028, the root mean squared error is 0.039, and the relative absolute error is 1.91%. The errors are minimal. The detailed accuracy by class is shown in the confusion matrix (Table III).
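The relation between a confusion matrix and the summary figures can be sketched as follows: correctly classified instances lie on the leading diagonal, and accuracy is their share of all instances. The 3x3 matrix is a hypothetical toy example, not one of the paper's tables; the final lines only recompute the percentage from the Course 1 counts quoted above.

```python
def accuracy_from_confusion(matrix):
    """Return (correct, total, accuracy) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total


# Hypothetical matrix: rows are actual classes, columns are predicted classes.
toy = [
    [50, 2, 1],
    [3, 40, 0],
    [0, 1, 30],
]
correct, total, acc = accuracy_from_confusion(toy)
print(correct, total, round(acc, 4))

# Recomputing the Course 1 J48 accuracy from the reported counts.
reported_correct, reported_incorrect = 75463, 754
reported_total = reported_correct + reported_incorrect
print(reported_total, round(100 * reported_correct / reported_total, 2))
```

The reported counts sum to 76217, the same number of instances given for the ZeroR run on Course 1.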
As a machine learning algorithm, J48 uses the decision tree technique. When applied to the LMS logs on course activities and resources, the classification shows that Quiz (column c) has the highest activity at 38358, followed by System (column b) at 17910, Forum (column d) at 8663, File (column f) at 8566, Assignment (column e) at 1238, Folder (column g) at 514, File submission (column h) at 172, and Chat (column i) at 37. Table IV and Table V show the result of the prediction for Course 2 and its confusion matrix in classification using the J48 algorithm.
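As background on the decision tree technique, C4.5 (the algorithm J48 implements) chooses the attribute to split on by gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below computes it for one nominal attribute. The function names and the four-row example are illustrative assumptions, and the sketch omits numeric attributes, missing values and pruning, all of which the real algorithm handles.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` on the nominal attribute `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected class entropy remaining after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical rows: an event-context attribute against the Component class.
labels = ["Quiz", "Quiz", "Forum", "Forum"]
informative = ["quiz page", "quiz page", "forum page", "forum page"]
uninformative = ["a", "b", "a", "b"]
print(gain_ratio(informative, labels), gain_ratio(uninformative, labels))
```

An attribute that determines the class almost perfectly yields a very small tree with near-zero error, which is one plausible reading of the very high J48 accuracies reported here.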
From Table V, the number of correctly classified instances is 29765 (99%), while the number of incorrectly classified instances is 276 (0.91%). The mean absolute error was 0.0026, the root mean squared error was 0.0387, and the relative absolute error was 1.84%. The errors are minimal. The detailed accuracy by class is shown below in Table VI. Table V above shows the confusion matrix of all classes. The pattern shows that System (column b) has the highest activity at 11920, followed by Quiz (column c) at 8308, Forum (column d) at 4476, File (column g) at 4398, Assignment (column f) at 257, Chat (column e) at 247, URL (column h) at 125 and File submission (column i) at 38.
Similarly, for ZeroR the prediction of Course 2 in classification is shown in Fig. 6, Table VI, and Table VII. The same explanation offered for J48 holds, except that the prediction identified System as the predictor, as shown in Fig. 6 and Table VI.
The class counts in the ZeroR results were Quiz (38530), System (18108), Forum (8802), File (8566), Assignment (1395), Folder (514), File submission (172), and Chat (122). ZeroR therefore predicted the Quiz tool, the most frequently used. This reflects how the lecturer taught the class. Access to the course (System: 18108) led learners to use the quiz tool (38530) more often, because they had to discuss the concepts of the given topic (Forum: 8802) and read notes (File: 8566). The learners also had to work on the given assignments (1395), so that the lecturer could check whether they had understood the concepts of the topic, and submit their assignments (172) for marking. Detailed accuracy by class is shown in Table VII, while the confusion matrix is shown in Table VIII.
The confusion matrix above (Table VIII) shows the activity tools with their access counts. The class counts in the ZeroR results were System (12109), Quiz (8308), Forum (4485), File (4394), Assignment (309), URL (125), Chat (125) and File submission (38). ZeroR predicted the System tool as the highest, followed by Quiz, Forum, File, Assignment, URL, and Chat. This reflects how the lecturer taught the class. The learners accessed the course (System: 12109) more often, which led them to use the quiz tool.
Table XI above and Table XII show the results of using ZeroR to predict Course 3. The data set has five attributes (User Full Name, Event Context, Components, Event Name and Origin), and the focus was on the Component attribute, which explains how the course activities and resources were used during the semester. The predicted class value was System, and the model took 0 seconds to build. The number of correctly classified instances was 2622 (58.5007%), while the number of incorrectly classified instances was 1860 (41.4993%). The total number of instances, 4482, is the number of log entries generated by the lecturer and the students accessing the course. The ZeroR algorithm is based on probability, hence the likely presence of errors. The mean absolute error and root mean squared error were 0.1676 and 0.2894 respectively.
Similarly, Fig. 7 shows the J48 pruned tree, with Course 3 as the event context, for the prediction of Course 3 activities in classification using J48 with five attributes. The summary of classification (Table X) reports 4482 (100%) correctly classified instances and 0 (0%) incorrectly classified instances. The mean absolute error, the root mean squared error, and the relative absolute error are all 0. Table XI shows the confusion matrix, where the leading diagonal displays numbers which represent interaction with the system, and the total of these numbers represents the logging behavior of students and the lecturer(s). Therefore, the J48 algorithm predicts System as the major classifier.

B. Pattern of Usage Discuss
1) Research question 1: With reference to research question 1, the data analysis revealed a mixed picture. There is substantial use of the quiz tool (about 70%) for assessment purposes such as gauging the level of achievement of instructional objectives. It is worth noting that the Resources tools (File, Folder and URL) were also used substantially, mainly for posting notes and for communication between lecturer and students. They can be considered an entry point for lecturers to make teaching and learning digital. In all, it appears that the patterns of usage identified above are complemented by blended learning, where traditional and technology-based approaches are mixed depending on the instructional goals.
2) Research question 2: With reference to research question 2, patterns were already revealed in the results presented. It was clearly observed that students log in to their Moodle portal to check for new course content (system use). Usually, students would have been alerted by the lecturer to the existence of new content. Students could download the course content to their personal devices to read and to discuss amongst themselves. However, they could also hold limited discussions using forums created by the lecturer. Students might spend minimal time online when the course content was not engaging. Furthermore, students might have exhibited "the student syndrome", where there was a rush to do assignments just before the due date. These activities were discernible from the students' activity logs and their dates of submission.
On the side of lecturers, the typical pattern of usage suggested posting of notes using the resources tool and creating forums. In addition, it was observed that lecturers made good use of quizzes, assignments, chats and URLs.
In terms of the correlation between potential use of the LMS and prevailing students' performance (research question 2), Table VII, Table XIII and Table XIV show students' interaction with the LMS and performance in the final exam.

C. Analysis of Correlations
In Course1 (Table XII) In course 2 (Table XIII) (0.564); therefore, the various components are related to the final grade. Because the correlations are significant at the 0.01 level, we reject the null hypothesis of no relationship and conclude that there is a significant relationship between use of the LMS and prevailing students' performance. The results show that Files, Folder, Quiz, Assignment, Forum, Final Exam, Test1 and Test2 all had a significant impact on academic performance. In this case the lecturer used the tools (research question 2).
In course 3 (Table XIV) 731). Therefore, the various components are related to the final grade. Because the correlation is significant at the 0.01 level, we reject the null hypothesis of no relationship and conclude that there is a significant relationship between use of LMS tools and prevailing students' performance.
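The kind of test behind such a correlation matrix can be sketched as follows: compute Pearson's r between a usage count and a grade, then convert it to a t statistic with n - 2 degrees of freedom. The data values and variable names are hypothetical; the study obtained its coefficients from SPSS, and this is not a reproduction of its tables.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing the null hypothesis of zero correlation (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical per-student quiz activity and final grades (not data from the study).
quiz_events = [1, 2, 3, 4, 5]
final_grade = [2, 4, 5, 4, 5]

r = pearson_r(quiz_events, final_grade)
t = t_statistic(r, len(quiz_events))
print(round(r, 4), round(t, 4))
```

The null hypothesis of no relationship is rejected at the 0.01 level when the p-value for t falls below 0.01, which is a different comparison from checking whether the coefficient itself exceeds 0.01.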

VI. CONCLUSION
In conclusion, using our approach of educational data mining on LMS data, it was possible to monitor students' and staff use of the LMS resources at the university. In addition, it was evident from the correlation analysis that students' use of the resources could affect academic performance. In the future, it is suggested that this study be conducted with several courses from different disciplines in order to determine whether academic discipline affects students' performance based on the same tools.