Towards Finding the Impact of Deep Learning in Educational Time Series Datasets – A Systematic Literature Review

Besides teaching, instructors handle many background tasks such as preparing study material, setting question papers, managing attendance, making log book entries, assessing students, and analyzing class results. Moreover, a Learning Management System (LMS) is mandatory if the course is online. The Massive Open Online Course (MOOC) is an example of a worldwide online education system. Nowadays, educators use Google to formulate study material and question papers efficiently, and especially for self-preparation. Student assessment and result analysis tools are also available that return instant results from student data. Artificial Intelligence (AI) is the driving force behind these applications. To deliver precise outcomes, AI requires historical data to train a model, and this sequential (year-wise, month-wise, etc.) information is called time series data. This Systematic Literature Review (SLR) is conducted to find the contribution of time series algorithms in education. Compared with the traditional neural network, algorithm architectures have changed enormously in order to handle all kinds of data. Although these changes significantly raise performance, they also increase complexity, resource requirements, and execution time. Consequently, comprehending the algorithm architecture and its execution process is a challenging phase before creating a model, yet this knowledge is essential for selecting a suitable technique for the problem at hand. The first part reviews time series problems in educational datasets using Deep Learning (DL). The second part describes the architecture of time series models, namely the Recurrent Neural Network (RNN) and its variants, Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), the differences among them, and the classification of performance metrics. Finally, the factors affecting time series model accuracy and the significance of this work are summarized to encourage those who wish to begin research on educational time series problems.

Human education makes it possible to automate massive amounts of work with little human intervention. The education system itself is now being automated with the help of AI. Machine Learning is a subset of AI. Likewise, Deep Learning (DL) is a division of broader machine learning based on the Neural Network (NN), which is designed to mimic the human brain. DL is becoming an important term in data handling technology because of its potential for prediction from extensive data. However, prediction systems came into their own only after the introduction of the RNN. The RNN is chosen for this study because its architecture operates on sequential data. An example is predicting the learner dropout rate in a MOOC using LMS interaction data (user click events, weekly assignments, etc.). Educational institutions switched online to continue classes during the COVID-19 lockdown period. Many online courses started and then boomed. The MOOC is one of the popular platforms for online education. However, the course completion rate is significantly lower than the number of registrations because registration is free of cost. The RNN helps to predict the success and dropout rate of MOOC learners. RNN applications span many fields, such as finance [1], [2], medicine [3], [4], [13], and nature-related forecasts, such as weather, rainfall, temperature, and wind speed [5]-[8]. Enough surveys are also available to enrich the existing work in those domains. In education, however, reviews still need to be conducted to collect the related work on sequential data. Hernández et al., [21] conducted a similar review, but it did not focus on time series. This work fills this research gap by covering the time series model architecture, its difference from the conventional neural network, and the parameters influencing model performance.
*Corresponding Author: jayashrr@srmist.edu.in, www.ijacsa.thesai.org
The following are the research questions identified for this work:
RQ1: Find the impact of Deep Learning in educational time series problems.
RQ2: Identify the architecture of time series models and how it differs from the traditional approach.
RQ3: Discover the significant factors affecting time series model accuracy.
The remainder of the paper is organized as follows. Section II defines the methodology of this work, and Section III describes the review results, including previous work using deep learning models, the working methodology of RNN, LSTM, and GRU, and the metrics used for the models. Section IV outlines the contribution of this paper through discussion. Finally, Section V presents the conclusion and future work of this article.

II. RESEARCH METHODOLOGY
This section describes the research methodology followed in this review and the filtering of the downloaded papers. The following research repositories were accessed: Google Scholar and IEEE Xplore. The keywords used for this work are "Deep Learning", "RNN", "Time Series", "Student", and "Education". The Google search results showed many research papers, and all were evaluated manually to select those suited to this work. The selection process considers journal articles using time series data in education and valid conference papers. The inclusion and exclusion criteria of this study are given in Table I. The article selection process followed the PRISMA method. PRISMA is an abbreviation of "Preferred Reporting Items for Systematic Reviews and Meta-Analyses". Fig. 1 explains the step-by-step article inclusion and elimination details through the PRISMA 2020 flowchart.

1) Identification:
The initial search retrieved two hundred and ninety-one (n=291) documents from Google Scholar and the IEEE database. A filter restricts the search to the last five years (2018-2022), excluding records before 2018 and after 2022. Four duplicate records from the different databases were then removed.

2) Screening:
There are two screening steps to check a paper's eligibility: i) preliminary check and ii) full-text analysis.
Step 1 investigates the title and abstract to verify the document's relevance. It removed one hundred and six (n=106) reports and included eighty-one (n=81) articles for full-text retrieval. Sixteen (n=16) documents were then eliminated because only a paid version was available. Step 2 excludes invalid articles, conference papers, and other documents irrelevant to this context.

3) Included:
The previous stage gives twenty-two reports (n=22), and the selected articles are used as a source for this Systematic Literature Review (SLR) or meta-analysis.

III. RESULTS
This section describes the review results obtained through the methodology of the previous section. Fig. 2 depicts each publisher's contribution to this topic. It shows that most well-known publishers are involved, but Springer published more articles than the others.

A. Contribution of Deep Learning in Educational Time Series Data
Wang et al., [24] proposed two novel methods to predict student learning status. The first extracts informative features and predicts performance using Conv-GRU. The second, xNN (Explainable Neural Network), explains which factors are relevant to a student's positive or negative results so that weak areas can be improved. This approach helps to identify hidden patterns in student behavior and supports early notification so that a particular area can be improved.
Waheed et al., used the same dataset (OULAD) in both of their papers [25,29] but followed different methodologies to predict the student category. The first work achieves notably higher accuracy (93%) than the second, which uses a DNN (84%). The deep neural network (DNN) proves its power by providing the highest accuracy. Mubarak et al., [26] predict learners' weekly performance from video clickstream data for timely intervention. The model is designed to adjust its window length automatically, which lets the RNN layer dimensions fit input data of different sizes.
In studies [27,28,39,42], all the authors used the same dataset (KDD Cup 2015) to predict student dropout, and the reported performance is above 85% in all of them. The dataset contains 39 courses and seven kinds of student behavioral information: access, video, wiki, discussion, navigate, page_close, and problem. These multiple parameters allow a multivariate time series approach. Although the dataset is the same, imbalanced data is handled only in [28,39].
Zhang et al., [30] introduced a predictive model to pick up micro-level patterns in student learning behavior. To avoid data sparsity, the authors divide the data into five clusters based on the nature of the students' learning behavior, because they believe that every student's online learning behavior changes depending on their free time. An auto-encoder is used to encode the time series data. The significant difference between the recall and accuracy values shows that classification errors still need to be fixed in this model. He and Gao [31] proposed a student performance prediction model that collects student learning behavior information through terminal data acquisition tools to find the student concentration level in the classroom and explore the factors influencing learning concentration. Aljaloud [32] suggested a model to predict student learning outcomes by selecting a number of essential features and evaluating the result as the number of features is reduced. Seven features (f1, f2, f3, f4, f5, f6, f7) and seven courses are used in this LMS, and the final result shows that the best accuracy is obtained with the larger feature combinations.
Chen et al., [33] created an intelligent framework to handle imbalanced datasets and spatiotemporal information. The LMS contains eight learning features (F1-F8): assignment, file, forum, homepage, label, page, quiz, and URL. The course length is 16 weeks, and this model warns at-risk students much earlier, and with higher accuracy, than other models.
Karim et al., [34] conducted ablation tests on time series data using LSTM. In this experiment, the LSTM block is substituted by other techniques such as GRU, RNN, and a Dense block. However, LSTM-FCN performed better than the others.
Chen et al., [35] provided a comparative study between deep learning and conventional machine learning using data retrieved from a Learning Management System (LMS). This data captures the temporal behavior of student activity in the form of time series. The authors used classification and clustering techniques to identify at-risk students early and then compared the results using AUC.
Li et al., [36] also carried out a comparative study using higher education data, such as student grades and levels, to predict performance. Prabowo et al., [37] tried dual input, a combination of categorical and numerical time series data. The proposed dual-input hybrid model combines MLP and LSTM networks, and its performance is then compared with that of the individual models.
Asish et al., [38] offered a comparative study using CNN, LSTM, and CNN-LSTM to classify the student distraction level using eye gaze data. The authors collected the data through a Virtual Reality environment. Wu et al., [39] proposed a hybrid model called CNN-Net to predict student dropout in MOOCs. Moreover, the authors handled the class imbalance caused by the massive dropout ratio of students. Shin et al., [40] created a model to predict student performance from time series data by clustering the students using the k-shape technique. Each cluster helps to identify the student category so that instructors can issue a warning. Bousnguar et al., [41] proposed a model for enrolment prediction using LSTM and statistical machine learning. The statistical model gives higher accuracy than deep learning because the data is insufficient for training.
Qiu et al., [42] developed a model for dropout prediction using CNN. The authors compare the results with baseline models, including LR, Naïve Bayes, Decision Tree, Random Forest, Gradient Tree Boosting, and SVM. Among those, CNN with a window size of 10 showed better results than the others. Chen et al., [43] created a model for predicting course performance using an imbalanced dataset. The SMOTE sampling technique is applied to balance the minority class. The authors used the KNN algorithm to fill in arbitrary missing values.
Aljohani et al., [44] proposed a model to find at-risk students at an early stage based on weekly performance sequence data. However, it achieved its highest accuracy only after 38 weeks. He et al., [45] suggested a model for student performance prediction. The authors used two fully connected neural networks for demographic information and an RNN for handling student assessment and clickstream time series data. The proposed method provided better performance than the baseline models.
Table II provides the vital points of this review, and Fig. 4 represents the classification of time series use cases and the workflow found in this review. Static and sequential information were combined for performance prediction. GRU gives better performance than LSTM because of the short sequence length of the data. The accuracy of the proposed model is above 80% in the last week.
A joint neural network is proposed to fit both static and sequential data, and a data completion mechanism is also adopted to fill in missing stream data.

B. The Architecture of Time Series Model and the Difference from the Traditional Approach
This section provides the history and technical background of Recurrent Neural Networks. Even though a few studies used other models (CNN and hybrid models) for time series problems, those are excluded here because they are not specific to handling temporal data. Initially, statistical methods were beneficial in predicting time series problems, but they are ineffective in handling nonlinear data. Deep learning was therefore adopted to overcome the limitations of conventional time series algorithms such as ARIMA and exponential smoothing techniques [9], [10], [20]. Similarly, a few classical machine learning algorithms (e.g., XGBoost) are applied to time series problems. In a time series, each observation pairs a time index "x" with a value "y". Here "x" denotes any time unit such as minutes, hours, months, years, etc., and "y" represents a numerical value such as weight, height, price, quantity, etc. Fig. 6 explains how the "y" value changes over time. The appropriate algorithm should be chosen based on the type of dataset. The RNN was introduced around the 1980s. However, it became well known after the invention of LSTM in 1997, which overcame the weaknesses of the RNN. The most common use cases for the RNN are time series problems [11] and natural language processing [12]. Fig. 7 depicts the workflow difference between traditional Feed Forward Neural Networks (FFNN) and the RNN.

1) Recurrent Neural Network (RNN):
The RNN is a type of Artificial Neural Network (ANN) specially designed to capture sequential information with the aid of memory cells. The memory cell retains the previous information for further processing, and the decision is based on both the prior and the current state. The RNN shares the same weight parameters across the time steps within each layer, whereas the traditional neural network uses different weights. There are three crucial components in the RNN: the input, the hidden neuron, and the activation function, as described in Fig. 8 and Fig. 9.
$$h_t = \tanh(U x_t + W h_{t-1}) \qquad (1)$$
Eq. (1) calculates the hidden state, where $h_t$ is the hidden neuron at time t, $x_t$ is the input at time t, U is the weight of the hidden layer, and W is the transition weight of the hidden layer. The input and the previous state information are combined and passed through the tanh activation function to produce a new hidden state. The RNN suffers from the vanishing gradient problem when handling long sequences. This problem is mitigated by Long Short-Term Memory [19], a variant of the RNN.
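The recurrence described above, in which the input and the previous hidden state are combined and passed through tanh, can be sketched as follows. This is a minimal NumPy sketch under stated assumptions rather than the implementation used in any reviewed study: U is taken to multiply the input, W to multiply the previous hidden state, no bias term is used, and the dimensions and random values are hypothetical.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W):
    """One step of the assumed recurrence h_t = tanh(U x_t + W h_{t-1})."""
    return np.tanh(U @ x_t + W @ h_prev)

def rnn_forward(xs, U, W):
    """Apply the same (shared) weights U and W at every time step."""
    h = np.zeros(W.shape[0])      # initial hidden state, assumed to be zero
    states = []
    for x_t in xs:                # xs has shape (T, input_dim)
        h = rnn_step(x_t, h, U, W)
        states.append(h)
    return np.stack(states)       # shape (T, hidden_dim)

rng = np.random.default_rng(0)
U = rng.normal(scale=0.5, size=(4, 3))   # input-to-hidden weights (hypothetical sizes)
W = rng.normal(scale=0.5, size=(4, 4))   # hidden-to-hidden weights
xs = rng.normal(size=(6, 3))             # toy sequence: 6 time steps, 3 features
hs = rnn_forward(xs, U, W)
print(hs.shape)
```

Because the same U and W are reused at every step, the number of parameters does not grow with the sequence length.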

2) Long Short Term Memory (LSTM):
LSTM is capable of processing data with long-term dependencies. It manages the previous context more effectively than the RNN by using three gates: the input gate, the forget gate, and the output gate, as depicted in Fig. 10. The input gate updates the memory cell, and the forget gate decides whether information is kept or discarded. The output gate is responsible for determining the next hidden state. The loop structure of the RNN and LSTM helps to choose better weight parameters. In the LSTM formulas, $i_t$, $f_t$, and $o_t$ refer to the input gate, forget gate, and output gate, respectively. W, U, and V are the weight matrices, b is the bias vector, $x_t$ is the input vector to the memory cell at time t, $h_t$ is the value of the memory cell at time t, and $\tilde{C}_t$ and $C_t$ are the candidate state and the state of the memory cell at time t, respectively. Here sigmoid (σ) and tanh are the activation functions.
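The gate mechanics described above can be sketched in NumPy as follows. This is a hedged illustration of the commonly used LSTM formulation, not necessarily the exact equations of the source: it uses only the W and U matrices and the bias b for each gate (the V matrices are omitted for brevity, which is an assumption of this sketch), and the parameter names, sizes, and random values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps hypothetical names to weights and biases."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # output gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde   # keep part of the old memory, add new content
    h_t = o_t * np.tanh(c_t)             # next hidden state
    return h_t, c_t

def init_params(input_dim, hidden_dim, seed=0):
    """Random weights and zero biases for the four blocks (i, f, o, c)."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "ifoc":
        p[f"W_{g}"] = rng.normal(scale=0.3, size=(hidden_dim, input_dim))
        p[f"U_{g}"] = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))
        p[f"b_{g}"] = np.zeros(hidden_dim)
    return p

params = init_params(input_dim=3, hidden_dim=4)
h, c = np.zeros(4), np.zeros(4)
for x_t in np.random.default_rng(1).normal(size=(5, 3)):  # toy 5-step sequence
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

The separate cell state `c_t` is updated additively, which is what lets information persist over longer spans than in the plain recurrence.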

3) Gated Recurrent Unit (GRU):
GRU is a simpler gated variant of the RNN in terms of architecture. It is easier to implement and runs faster than LSTM, while both architectures serve the same purpose. GRU uses fewer parameters, so it requires less hardware and training time, which makes it attractive for many applications. The three gates are reduced to two, the update gate and the reset gate, as shown in Fig. 11. The update gate combines the roles of the input gate and the forget gate in LSTM. It decides whether particular information is kept or discarded. The reset gate determines how much past information should be forgotten. In the GRU formulas, $r_t$ and $z_t$ are the reset gate and the update gate, respectively, $\tilde{h}_t$ is the memory content, $h_t$ is the final memory of the current time step, and σ and tanh are the activation functions. The two gates take values between 0 and 1 through the sigmoid function (σ). The memory content ($\tilde{h}_t$) uses the reset gate to store the significant information from the previous state, with values in the range −1 to 1 through tanh.
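A GRU step with its two gates can be sketched as follows, together with a rough parameter count that illustrates why GRU needs fewer parameters than LSTM. This is a minimal sketch under assumptions: it follows one common convention, h_t = (1 - z_t) * h_{t-1} + z_t * h~_t (some references swap the roles of z_t and 1 - z_t), and the parameter names, sizes, and random values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step under the assumed convention described in the lead-in."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    # Memory content: the reset gate scales how much of the past state is used.
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    # Final memory: blend of the previous state and the new memory content.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

def gated_param_count(input_dim, hidden_dim, n_blocks):
    """Each block holds W (hidden x input), U (hidden x hidden), and a bias."""
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return n_blocks * per_block

rng = np.random.default_rng(0)
p = {}
for g in "zrh":
    p[f"W_{g}"] = rng.normal(scale=0.3, size=(4, 3))
    p[f"U_{g}"] = rng.normal(scale=0.3, size=(4, 4))
    p[f"b_{g}"] = np.zeros(4)

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):   # toy 5-step sequence
    h = gru_step(x_t, h, p)

gru_params = gated_param_count(3, 4, n_blocks=3)    # update, reset, memory content
lstm_params = gated_param_count(3, 4, n_blocks=4)   # input, forget, output, candidate
print(gru_params, lstm_params)
```

With these toy sizes the GRU holds three weight blocks against four for the LSTM, so its parameter count is three quarters of the LSTM's.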

4) Metrics used for time-series data:
Choosing the right metric is essential for evaluating the model's performance. All decisions, such as tuning the hyper-parameters and selecting a suitable model, are based on the metric values. Before deciding on the metrics, the following should be checked: the nature of the dataset, the values to be handled, and whether results need to be compared across datasets. If so, are they all on the same scale or on different ones? Table III and Table IV illustrate the various metrics available for time series problems [14]-[18]. Fig. 12 shows the percentage of performance metrics reported in this study.

AUC/ROC
Description: The AUC measures the entire two-dimensional area under the ROC curve across all possible classification thresholds. The ROC is a plot of the true positive rate (TPR) against the false positive rate (FPR).
Advantage: Uses a graphical representation to show the trade-off between the TPR and FPR.
Disadvantage: Not suitable for highly imbalanced datasets; concentrates only on TPR and FPR.

Confusion Matrix
Description: Identifies the model correctness across all outcomes. The four elements of this table are TP, TN, FP, and FN, from which the following metrics are derived.
Advantage: Finds where the model failed.
Disadvantage: Interpreting the result is complex.

Accuracy
Description: The degree of model correctness.
Formula: Accuracy = (TP + TN) / (TP + FN + TN + FP)
Advantage: Easy to interpret.
Disadvantage: Misleading when the minority class has very few samples.

Precision
Description: Ability of the model to identify only the relevant data points.
Formula: P = TP / (TP + FP)
Advantage: Identifies the proportion of positive identifications that are correct.
Disadvantage: Does not consider type II classification errors.

Recall (Sensitivity)
Description: Ability of the model to find all the relevant data points.
Formula: R = TP / (TP + FN)
Advantage: Identifies the proportion of actual positives that are correctly identified.
Disadvantage: Does not consider type I classification errors.

F1-Score
Description: A single score that balances the concerns of precision and recall in one number.
Formula: F1-Score = 2 * (P * R) / (P + R)
Advantage: The harmonic mean of the precision and recall values.
Disadvantage: It combines precision and recall, so it is a bit harder to interpret.

Hidden Layer
Description: The layer between input and output; the number of hidden layers determines the depth of the neural network (usually 1 or 2 layers).

Batch-Size
Description: Number of samples that the network uses for each weight update.

Momentum
Description: Speeds up the learning process by preventing oscillation during convergence.

Weight initialization
Description: Defines the starting point of the optimization.

Regularization
Description: Prevents over-fitting by penalizing weights that are too high (L1, L2) [26].

Units
Description: Determines the level of knowledge that is extracted by each layer.
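The classification metrics defined above (accuracy, precision, recall, and F1-score) follow directly from the four confusion matrix counts. The sketch below computes them in plain Python and illustrates, on a hypothetical imbalanced dropout dataset, why accuracy alone can mislead. Returning 0.0 when a denominator is zero is a convention assumed here, and the labels are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # assumed convention
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical data: 95 students who complete (0) and 5 who drop out (1).
y_true = [0] * 95 + [1] * 5

# A model that always predicts "completes" looks accurate but finds no dropouts.
always_negative = [0] * 100
m_naive = classification_metrics(*confusion_counts(y_true, always_negative))

# A model that flags 14 students, 4 of them real dropouts.
flagged = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1
m_flag = classification_metrics(*confusion_counts(y_true, flagged))

print(m_naive)
print(m_flag)
```

The first model scores higher on accuracy while having zero recall, which is the minority-class weakness of accuracy noted above.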

C. Factors Affecting the Time Series Model Accuracy
Several factors affect model performance: the techniques used for pre-processing, the train-test ratio, and the selection of model hyper-parameters. Table V lists the hyper-parameters that affect the model accuracy [23].

1) Pre-processing:
Removing unwanted data and filling in missing values are the initial steps in pre-processing. Several methods are available for imputation, such as mean, median, mode, interpolation, weighted average [24], and k-nearest neighbor [43]. Mean and weighted average are the most widely used techniques. The first returns the average value of the feature column, and the second substitutes the average of the most frequent values.
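Three of the imputation methods listed above (mean, median, and interpolation) can be sketched in plain Python as follows. This is an illustrative sketch, not the procedure of any reviewed study: missing values are assumed to be marked with `None`, the weekly click counts are hypothetical, and filling leading or trailing gaps with the nearest observed value is a design choice made here.

```python
from statistics import mean, median

def impute_constant(values, strategy="mean"):
    """Replace every missing value with the mean or median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def impute_linear(values):
    """Fill interior gaps by linear interpolation between observed neighbours.

    Leading and trailing gaps copy the nearest observed value (assumption).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out

weekly_clicks = [10, None, 30, None, None, 60]   # hypothetical weekly activity
by_mean = impute_constant(weekly_clicks, "mean")
by_median = impute_constant(weekly_clicks, "median")
by_interp = impute_linear(weekly_clicks)
print(by_mean, by_median, by_interp)
```

For ordered data such as weekly activity, interpolation respects the trend between neighbouring weeks, whereas mean or median filling inserts the same value everywhere.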
The next step is to encode all the categorical information into numerical values that the model can process, using a technique such as label encoding or one-hot encoding [39]. Each method has merits and demerits of its own. After encoding, re-scaling the data (features) is very important since it makes the model less sensitive to the scale of the features and allows it to converge to better weights. There are two significant types of scaling: standardization (z-score) and normalization (min-max scaling). Standardization assumes that the values follow a Gaussian distribution and centers them on zero mean with unit standard deviation. It is less sensitive to outliers, so Karim et al., [34] used z-score normalization to handle outlier values, and Wang et al., used batch normalization for scaling in [24]. Batch normalization is specific to each layer and each batch of input in the neural network.
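The two scaling types named above can be sketched in plain Python as follows. This is a minimal illustration on a hypothetical feature column; using the population standard deviation is an assumption of the sketch, and a constant column (zero standard deviation or zero range) would need a guard that is omitted here.

```python
from statistics import mean, pstdev

def standardize(values):
    """z-score: subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-max normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

clicks = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical feature column (mean 5, std 2)
z = standardize(clicks)
mm = min_max(clicks)
print(z)
print(mm)
```

After standardization the column has zero mean and unit standard deviation, while min-max normalization pins the smallest and largest values to 0 and 1.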
2) Train-test split:
Normally, 80:20 is the suggested ratio for a train-test split if the samples are distributed evenly across the dataset. Wu et al., [27,39] split the dataset 80:20 for training and testing. Shin et al., use the same ratio in [40] but take 10% of the test data for validation. In studies [25,38,31,32], the ratio used for training and testing is 70:30; the remaining studies [43,26,28,29] use slightly different ratios.

3) Imbalanced dataset:
Most real-world datasets are imbalanced and should be handled appropriately to avoid classification errors. In studies [35,39,43,33], the authors used synthetic samples (SMOTE) to balance the target class counts. Mubarak et al., [28], however, introduced a cost-sensitive technique in the loss function to avoid type II classification errors. Dimension reduction is another issue when the feature count is large. Waheed et al., [25] used the Singular Value Decomposition (SVD) method to find the top 30 most useful features.
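One simple way to make a loss function cost-sensitive, as an alternative to resampling, is to weight each class inversely to its frequency. The sketch below is a generic illustration and not the specific loss used in [28]: the weighting rule n / (number of classes * class count), the function names, and the toy labels are all assumptions.

```python
import math

def class_weights(labels):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

def weighted_bce(y_true, y_prob, weights, eps=1e-12):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(y_true)

labels = [0] * 90 + [1] * 10          # hypothetical: 10% of learners drop out
w = class_weights(labels)

# The same confidence error costs more on the minority (dropout) class.
missed_dropout = weighted_bce([1], [0.1], w)    # dropout predicted as unlikely
false_alarm = weighted_bce([0], [0.9], w)       # completer predicted as dropout
print(w, missed_dropout, false_alarm)
```

Penalizing missed dropouts more heavily pushes the model away from false negatives (type II errors) without generating synthetic samples.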

4) Hyper-parameters:
The number of hidden layers is significant in deep learning because it reflects the complexity of the problem. Bousnguar et al., [41] used three LSTM layers with 50 cells in each layer. Qiu et al., [42] used two convolutional and two fully connected layers for binary classification with the sigmoid activation function. Aljohani et al., [44] applied three LSTM layers, each assigned 100 to 300 units. A deep ANN is applied in [25,45], using a minimum of three and up to seven hidden layers. After the hidden layers, a suitable optimizer is selected to update the weights at every iteration. The Adam optimizer is the most widely used [39,43,45,26,27,29,31,32,33] among the alternatives, such as gradient descent, stochastic gradient descent, and RMSProp.
The learning rate (0.0-0.1) then sets the speed of the network parameter updates. Frequently used values are 0.0025 [27], 0.001 [29,32,33], and 0.1 [31]. The activation function is another hyper-parameter that helps to predict complex nonlinear data. This parameter differentiates neural networks from machine learning models. ReLU, Leaky ReLU, tanh, and sigmoid are the activation functions used about equally across the papers. After these parameters are set, the number of training epochs must be specified. Sometimes fewer epochs, such as 15, 20, and 25, give better performance than a large number of iterations [27,31]. A dropout layer is added to the neural network to avoid overfitting, so it is used in most of the experiments. The frequent values are 0.1, 0.2, 0.3, and a maximum of 0.5 by Mubarak et al., in [28].
Concerning batch size, the authors adjusted the value over several experiments to improve accuracy. Waheed et al., found that increasing the batch size from 64 improved model performance for all the weeks, but increasing it further beyond 1364 degraded model performance, with AUC decreasing by 0.04. Regularization is rarely used [26] in the experiments. The model setups do not explain the other parameters, such as weight initialization and momentum.

IV. DISCUSSION
This section discusses the contribution of this paper with respect to the research questions stated in the introduction.

A. Finding the Impact of Deep Learning in Educational Time Series Problem
To answer RQ1, this SLR shows the success of deep learning models on educational time series data retrieved from various sources. Hernández et al., [21] also confirmed in their review that the number of publications has grown recently with the rising application of DL models, although the first publication appeared in 2015. Section III describes the previous work, and all the information is summarized in Table II to explain the types of models used, each paper's findings, its individuality, and the dataset details. CNN-LSTM is the most used hybrid technique, and LSTM is the most widely used single model across the research works. In education, student performance prediction is the typical use case, addressed in multiple investigations. Moreover, clustering of time sequence data was also applied to categorize students based on their performance. Because of its sequential nature, MOOC online data was used more often than offline data, mostly to predict student dropout.

B. Identify the Architecture of Time Series Model and How it Differs from the Traditional Approach
To answer RQ2, Section III describes the internal structure of RNN, LSTM, and GRU using the required diagrams and formulas. It presents the improvements and differences among them. Numerous investigations use LSTM rather than RNN and GRU even though its architecture is intricate. LSTM is also merged with CNN to retrieve spatiotemporal features effectively. Self-connected neurons help to maintain the previous information, and this is what distinguishes these models from the feedforward neural network.

C. Discover the Significant Factors Affecting the Time Series Model Accuracy
To answer RQ3, Table V provides information on the factors influencing the model accuracy. Tuning the neural network is necessary because it improves the model's performance. The number of hidden layers, epochs, batch size, dropout layer, and optimizer are the commonly tuned hyper-parameters because of their high impact on the outcome. Most authors select the best hyper-parameters manually instead of using an optimization technique such as grid search or a Bayesian method.
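As a contrast to manual selection, the grid search mentioned above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the learning rate and dropout candidates are the values reported in the reviewed studies, the batch sizes are hypothetical, and `toy_score` is a stand-in for "train the model with this configuration and return its validation score", not a real training run.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(grid)
    best_cfg, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        cfg = dict(zip(names, combo))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

grid = {
    "learning_rate": [0.1, 0.0025, 0.001],   # values reported in the reviewed studies
    "dropout": [0.1, 0.2, 0.3, 0.5],         # values reported in the reviewed studies
    "batch_size": [32, 64, 128],             # hypothetical candidates
}

def toy_score(cfg):
    # Hypothetical stand-in for a validation score; peaks at an arbitrary setting.
    return (-abs(cfg["learning_rate"] - 0.001)
            - abs(cfg["dropout"] - 0.2)
            - abs(cfg["batch_size"] - 64) / 1000)

best_cfg, best_score = grid_search(grid, toy_score)
n_trials = len(list(product(*grid.values())))
print(best_cfg, n_trials)
```

The number of trials is the product of the candidate counts, so the cost grows quickly as more hyper-parameters are added, which is one reason Bayesian methods are considered for larger search spaces.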

V. CONCLUSION
Understanding the DL methodology and the previous work done in a particular domain is fundamental before implementing a research idea. This study is the first work that gives a background for young researchers who want to apply Deep Learning to educational time series problems. The Google Scholar and IEEE Xplore scientific websites were accessed to collect relevant research papers. The collected documents (n=291) were then analyzed manually, and twenty-two (n=22) papers were selected for this SLR by following the PRISMA methodology. The essence of this survey is that deep learning is widely applied, but hybrid models gave higher accuracy than individual models. Student classification, clustering, forecasting student enrolment or grades, and dropout prediction using online course log data are the most common problem statements. Large sequential datasets are rarely used compared with other domains, which helps to avoid complex models. Finally, the RNN architecture, the types of metrics, and the factors influencing the model accuracy were discussed.

VI. FUTURE PERSPECTIVE
This review explains the previous work done in the educational domain using time series data, and future implementation work will draw on these findings to fill the research gap identified.