Convolutional Neural Network Model based Students’ Engagement Detection in Imbalanced DAiSEE Dataset

Abstract: The COVID-19 pandemic has significantly changed learning processes. Learning, which had generally been carried out face-to-face, has now moved online. This learning strategy has both advantages and challenges. On the bright side, online learning is unbound by space and time, allowing it to take place anywhere and anytime. On the other side, it faces a common challenge in the lack of direct interaction between educators and students, making it difficult to assess students’ engagement during an online learning process. Therefore, it is necessary to conduct research with the aim of automatically detecting students’ engagement during online learning. The data used in this research were derived from the DAiSEE dataset (Dataset for Affective States in E-Environments), which comprises ten-second video recordings of students. This dataset classifies engagement levels into four categories: very low, low, high, and very high. However, the issue of imbalanced data found in the DAiSEE dataset has yet to be addressed in previous research. This data imbalance can cause errors in the classification model, resulting in overfitting or underfitting of the model. In this study, a Convolutional Neural Network, a deep learning model, was utilized for feature extraction on the DAiSEE dataset. The OpenFace library was used to perform facial landmark detection, head pose estimation, facial action unit recognition, and eye gaze estimation. The pre-processing stages included data selection, dimensional reduction, and normalization. The PCA and SVD techniques were used for dimensional reduction. The data were then oversampled using the SMOTE algorithm. The data were split into training and testing sets at an 80:20 ratio. The results obtained from this experiment exceeded the benchmark evaluation values on the DAiSEE dataset, achieving the best accuracy of 77.97% using the SVD dimensional reduction technique.


I. INTRODUCTION
During the COVID-19 pandemic, the education sector has been compelled to adopt online learning. Conventional classroom learning has been transformed into online learning or "school from home." E-learning has become a standard solution for learning, and virtual conference technologies, such as Zoom, Google Meet, and others, have made online learning flexible and accessible from anywhere and at any time, which is well suited to the current digital era. However, despite the numerous advantages of online learning, one significant obstacle that needs to be addressed is the lack of direct interaction between teachers and students. During virtual conferences, some students may not turn on their cameras, making it challenging to determine their presence and participation in the online class. Consequently, it becomes difficult for teachers to observe the level of student engagement during online learning, especially while sharing their screen to explain teaching materials. To address this obstacle, it is necessary to conduct research to develop methods for automatically detecting students' engagement during the online learning process.
Students' engagement detection is an essential factor in improving the learning process. It is a qualitative indicator in the learning process [1]. It entails three structured learning dimensions: behavioral engagement, emotional engagement, and cognitive engagement [2]. While all the three dimensions of engagement are crucial for measuring students' level of involvement in the learning process, emotional engagement is the most widely studied. Detecting students' emotional engagement is particularly important in education because it has a significant impact on their learning rate and overall academic performance. Whitehill et al. [3] showed that both human and automatic engagement judgments are correlated with task performance. The study found that post-test student performance could be predicted based on engagement labels with similar accuracy to pre-test results.
The problem of automatically detecting students' engagement in online learning based on video data can be solved using a machine learning approach. Zang et al. [3] investigated engagement detection in online learning through a data-driven approach based on facial expressions and mouse usage behavior. Their study demonstrated that utilizing multiple features could significantly improve the accuracy of engagement detection. In contrast to previous studies that relied solely on students' facial expressions, they also took into account students' mouse usage behavior. Bhardwaj et al. [4] proposed a deep learning model, a Convolutional Neural Network (CNN), for students' engagement detection, while Selim et al. [5] conducted students' engagement detection in online learning using Hybrid EfficientNetB7 together with TCN, LSTM, and Bi-LSTM. Khenkar et al. [6] proposed an engagement detection method based on micro-body gestures using a 3D CNN. Ashwin et al. [7] conducted engagement detection using CCTV video recordings in a computer laboratory and showed that such recordings could be used to analyze students' engagement. CNNs identified students' engagement levels with a good level of accuracy. The results of that study revealed a positive correlation between students' scores (student learning) and students' predicted engagement levels. Meanwhile, Sharma et al. [8] detected students' engagement using video recordings of students' learning through emotional analysis and tracking of eye gaze and head movements based on two machine learning algorithms, namely the Haar Cascade algorithm (for face and eye detection) and the CNN algorithm (for emotion classification).
www.ijacsa.thesai.org
Based on these studies, CNN is a powerful deep learning model that has been successfully used in various studies to detect students' engagement levels in online learning. By analyzing emotional features, tracking eye gaze directions, and estimating head movements, CNN could predict students' engagement levels, which is essential for improving the effectiveness of online learning.
One of the widely used datasets for video-based students' engagement detection is the DAiSEE dataset (Dataset for Affective States in E-Environments). The DAiSEE dataset was first introduced in the study of Gupta et al. in 2016 [9]. The benchmark accuracy value of the DAiSEE dataset for the affective level of engagement was 51.07%. Based on the benchmark evaluation result, there are still many opportunities for improving the classification performance of the DAiSEE dataset. The data distribution for each label of the affective level of engagement is unequal, with 1% for very low engagement, 5% for low engagement, 50% for high engagement, and 45% for very high engagement. This data imbalance can result in errors in the classification model, leading to overfitting or underfitting. One solution to address this issue is to balance the data using undersampling or oversampling techniques [10], [11]. Ali et al. [12] presented a data-level approach and an algorithm-level approach for handling class imbalance problems. Bach et al. [13] examined some undersampling and oversampling methods for highly imbalanced data. The conclusion of their research was that the Synthetic Minority Oversampling Technique (SMOTE) boosted by the Edited Nearest Neighbours (ENN) method allowed for an improvement in classification precision. Fernandez et al. [14] also revealed through their research that the SMOTE algorithm improved performance in supervised learning problems. Therefore, imbalances in the DAiSEE dataset must be addressed. The current research's objective was to perform data balancing and feature selection to improve the benchmarking performance of the video-based students' engagement detection model on the DAiSEE dataset.
This article is organized as follows. Section II explains related works from previous studies. Section III describes the proposed model and methodology for students' engagement detection. Section IV presents the results of the methodology implementation. Section V provides a discussion of the results, and Section VI presents the conclusions of this study.

II. RELATED WORKS
Many studies related to the detection of students' engagement have been carried out. Bhardwaj et al. [4] used two datasets. The first one is the FER-2013 dataset, which is an image dataset used to train the CNN model, and the second one is the MES dataset, which is a tabular dataset used to do weight and subsequent calculations of the MES (Mean Engagement Score). The engagement level of students is classified into two classes: "engaged" and "not engaged." The proposed model achieved an accuracy level of 93.6%, a precision level of 98.48%, and a recall level of 87%. The proposed automated approach will certainly help educational institutions achieve an improved and innovative online learning method.
Selim et al. [5] also used the DAiSEE dataset to detect students' engagement and compared the performance of the proposed method with that on the VRESEE dataset. They proposed a Hybrid EfficientNetB7 model combined with TCN, LSTM, and Bi-LSTM. EfficientNet, a family of eight models ranging from EfficientNetB0 to EfficientNetB7, was pre-trained on the ImageNet dataset. The study also compared the proposed and previous models on the DAiSEE dataset. The accuracies of the three proposed models, EfficientNetB7+TCN, EfficientNetB7+Bi_LSTM, and EfficientNetB7+LSTM, were 64.67%, 67.39%, and 67.48%, respectively, outperforming the state-of-the-art ResNet+TCN model, which reached 63.59%. When the proposed models were evaluated on the VRESEE dataset, the highest accuracy achieved was 94.47% (using EfficientNetB7+Bi_LSTM).
Paidja et al. [15] used the DAiSEE dataset for engagement emotion classification. They proposed a Convolutional Neural Network (CNN) model and performed feature extraction using five facial landmarks and the Euclidean distance between points and center points from the facial image. They also compared CNN with other machine learning algorithms, such as Support Vector Machine (SVM) and Deep Neural Network (DNN). The accuracy results obtained indicated that CNN successfully recognized engagement emotions better than the other methods. However, the limitation of their research was that it did not use the entire DAiSEE dataset as only 77 out of 9068 videos were used.
Abedi et al. [16] described an improvement over the state of the art in detecting students' engagement using a hybrid ResNet and TCN network. This research also used the DAiSEE dataset and evaluated the performance of the ResNet+TCN method, comparing it to several previous studies on the DAiSEE dataset. The experimental results showed that the proposed ResNet+TCN model could improve the classification accuracy to 63.9%. It is very challenging to detect the minority engagement level with a very small sample in a supervised classification problem.
Zhang et al. [17] proposed an Inflated 3D Convolutional Network (I3D) for automatic students' engagement detection. The research also used the DAiSEE dataset, coupled with the use of OpenFace and AlphaPose for feature extraction. The proposed method achieved an accuracy of 52.35%. Bajaj et al. [18] proposed a state-of-the-art (SOTA) hybrid ResNet+TCN for the detection of students' affective states. They also used the DAiSEE dataset, with ResNet for feature extraction and TCN (Temporal Convolutional Network) for classification. The accuracy level reached by the study was 53.6%. The biggest challenge posed by this dataset is its high class imbalance.
Liao et al. [19] used the DAiSEE dataset and presented the Deep Facial Spatiotemporal Network (DFSTN) model for engagement prediction. To extract facial spatial features, they utilized pre-trained SE-ResNet-50 (SENet). The experiment obtained an accuracy of 58.84%.
Hasnine et al. [20] examined the extraction and visualization of students' emotions for engagement detection in online learning. The proposed model for emotion extraction and engagement detection consists of several steps. First, the OpenCV Face Recognition is implemented to detect emotions and eyes. This step results in emotion weight and eye gaze weight. These weights are used to calculate the Concentration Index (CI), which is then used to determine a student's engagement level based on specific rules. If the CI is greater than 65%, the student is detected to be highly engaged. If the CI is between 25% and 65%, the student is considered to be engaged. Otherwise, if the CI is less than 25%, the student is detected to be disengaged.
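The threshold rule described above can be written as a small function. This is only a sketch of the stated rule: the function name is hypothetical, the treatment of values exactly equal to 25% or 65% as "engaged" is an assumption, and the way the CI itself is computed from the emotion and eye gaze weights is not covered.

```python
def engagement_from_ci(ci_percent: float) -> str:
    """Map a Concentration Index (in percent) to an engagement label.

    Thresholds follow the rule described in the text; handling of the
    exact boundary values (25 and 65) is an assumption of this sketch.
    """
    if ci_percent > 65:
        return "highly engaged"
    if ci_percent >= 25:
        return "engaged"
    return "disengaged"


# Illustrative values only, not taken from the cited study.
for ci in (80.0, 40.0, 10.0):
    print(ci, "->", engagement_from_ci(ci))
```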
Brenner et al. [21] presented a social robot system that could detect a person's engagement by utilizing proxemics, body posture, and attention features. The proposed model achieved precision, recall, and F1 score results of 0.81, 0.82, and 0.81, respectively. The intended use of the proposed system is to design robots whose behaviors indicate awareness of a person's engagement.
The limitations of the previously discussed DAiSEE dataset studies are related to model performance. The accuracy of detection models on the DAiSEE dataset improved from an average benchmark accuracy of 57.9% in 2016 [9] for baseline benchmarking, to 63.9% in 2020 [26], to 67.48% in 2022 [5]. These accuracy levels are still relatively low, however, so many challenges remain. In addition, the selection of the features to be extracted to increase the model accuracy still needs improvement.

III. METHODOLOGY
The research methodology used is illustrated in Fig. 1. It is important to note that the facial images appearing in this discussion were taken from the DAiSEE open dataset developed by [9]. The explanation for each stage in the research methodology is as follows:

A. Dataset
The dataset used in this study was composed of secondary data from an open dataset called DAiSEE (Dataset for Affective States in E-Environments). The dataset was downloaded from https://people.iith.ac.in/vineethnb/resources/daisee/index.html. DAiSEE is a multi-label video classification dataset comprising 9,068 recorded video clips from 112 students, aimed at identifying students' affective states, including boredom, confusion, engagement, and frustration. Each affective state is labeled into four levels: very low, low, high, and very high. The videos were annotated by psychology experts and a crowd. This study focused solely on engagement levels, which were denoted by numbers: 0 for very low, 1 for low, 2 for high, and 3 for very high.

B. Feature Extraction
The subsequent step, feature extraction, was carried out using the OpenFace library. This open-source library is widely used for face recognition purposes, with the capabilities of facial landmark detection, head pose estimation, facial expressions (facial action units) recognition, and eye gaze estimation.

C. Data Pre-Processing
The next stage, data pre-processing, aimed to prepare data for the modeling stage. This stage involved three steps: data selection, feature dimensional reduction, and data normalization. The outcomes of this stage were feature matrices that could be utilized in the subsequent stage.

D. Imbalanced Dataset Handling
Oversampling or undersampling techniques can be employed to address data imbalance, which could otherwise lead to prediction errors in the model. Undersampling balances the data by reducing the number of instances in the majority class to match the number in the minority class. Oversampling, on the other hand, balances the data by increasing the number of instances in the minority class to match the number in the majority class.

E. Data Splitting
The feature matrix with balanced data was used in the training and testing processes of the classification model. The training data were used to form the classification model, while the testing data were used to evaluate the performance of the resulting model.

F. Classification Model
Fig. 2 is an illustration of the classification model formulation process. The features extracted from the video recordings were used as input data in the CNN classification model. The output of the CNN classification model was the predicted class, that is, the engagement level of the input video data.

G. Classification Model Evaluation
After obtaining the classification model through the training process, the model was tested using the testing data. The test results were then evaluated using metrics such as accuracy, precision, recall, and F1-score. These values were used to determine the performance of the proposed method. To further evaluate the classification model, the confusion matrix was referred to.
The confusion matrix in Fig. 3 is a matrix visualization of the prediction number and the actual data number on the classification model used. True positive (TP) is the number of correctly predicted data in the positive class. False positive (FP) is the number of incorrectly predicted data in the positive class. True negative (TN) is the number of correctly predicted data in the negative class. False negative (FN) is the number of incorrectly predicted data in the negative class [28]. The accuracy evaluation value compares the number of correctly predicted data with the entire data being tested. It can be calculated using Equation 1 based on the confusion matrix in Fig. 3.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

The precision evaluation value compares the number of correctly predicted data in the positive class with the overall positive predicted results. It can be calculated using Equation 2.

Precision (Pc) = TP / (TP + FP)    (2)

The recall or sensitivity evaluation value compares the number of correctly predicted data in the positive class with all the actual data in the positive class. It can be calculated using Equation 3.

Recall (Rc) = TP / (TP + FN)    (3)

Meanwhile, the F1-score evaluation value is the harmonic mean of precision and recall. It is a better measure of a classification model's performance than precision or recall alone. It can be calculated using Equation 4.

F1-score = 2 × (Pc × Rc) / (Pc + Rc)    (4)

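As a worked illustration of the four evaluation values, the sketch below computes them from the confusion-matrix counts TP, FP, TN, and FN. The function name and the example counts are hypothetical, and the F1-score is computed with the standard harmonic-mean formula 2 × Pc × Rc / (Pc + Rc). For the four-class engagement problem these binary metrics would have to be computed per class and then averaged; the averaging scheme is not specified here.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only, not results from this study.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```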
IV. RESULTS

A. Dataset
The current study used an open dataset named DAiSEE, which is a video dataset for recognizing students' affective states, including engagement. Each video has a clip ID and an engagement level label: very low, low, high, or very high. Fig. 4 shows some examples from the downloaded DAiSEE dataset. The data distribution for each engagement level can be seen in Table I. According to the table, the data for each level of engagement were highly imbalanced. The very low and low engagement levels were minority classes, accounting for 0.7% and 5.1% of the total available data, respectively. If data of this sort are processed as they are, errors will arise in the classification model due to overfitting. This can be addressed by balancing the data with undersampling or oversampling techniques.

B. Feature Extraction
The OpenFace library extracted facial features from each video frame. It can be downloaded from the following GitHub page: https://github.com/TadasBaltrusaitis/OpenFace. Fig. 5 is an example of the video output generated by the OpenFace library. In addition to the video output, a CSV file was generated for each video. The file contained the columns frame, face ID, timestamp, confidence, and success, followed by 709 facial feature values covering facial landmark detection, head pose estimation, eye gaze estimation, and facial expression estimation in the form of facial action unit (AU) features. The CSV file was also modified to store the file name and the level of engagement for each frame. The DAiSEE dataset comprises 10-second videos with a frame rate of 30 fps, producing 300 frames for each video.

C. Data Pre-Processing
As shown in Fig. 1, there were three pre-processing stages: data selection, dimensional reduction, and data normalization.

1) Data selection:
The first data selection stage involved selecting videos to ease the computational process. The video selection process was carried out in the following substages:
- Find id_people from the video name (taken from the first five digits)
The results of feature extraction using the OpenFace library had a confidence column. This column refers to the confidence level of the model as to whether the detected face_id was a face or not. The confidence value ranged from 0 to 1. The closer it was to 1, the more confident the model was that the detected object was a face. On the other hand, the closer it was to 0, the less confident the model was that the detected object was a face.
Fig. 6. An example of a frame that detected two face objects.
As shown in Fig. 6, in frame 65, two objects, face_id 0 and face_id 1, were detected at the timestamp of 2.133 seconds. face_id 0 was detected at a confidence level of 0.03, and face_id 1 was at 0.98. If more than one object was found in a frame, data selection would be performed, where the object with the highest confidence value was to be selected. The data distribution for each engagement level before and after data selection in stages 1 and 2 can be seen in Table II.
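The confidence-based selection can be sketched as follows. This is a minimal illustration that assumes each extracted row is available as a dictionary with the keys frame, face_id, and confidence; the function name is hypothetical, and the row for frame 66 is invented purely to show the behaviour across frames.

```python
from typing import Dict, List


def select_best_face_per_frame(rows: List[dict]) -> List[dict]:
    """Keep, for every frame, only the detection with the highest confidence."""
    best: Dict[int, dict] = {}
    for row in rows:
        frame = row["frame"]
        if frame not in best or row["confidence"] > best[frame]["confidence"]:
            best[frame] = row
    # Return the surviving rows in frame order.
    return [best[frame] for frame in sorted(best)]


rows = [
    {"frame": 65, "face_id": 0, "confidence": 0.03},  # values from the Fig. 6 example
    {"frame": 65, "face_id": 1, "confidence": 0.98},
    {"frame": 66, "face_id": 0, "confidence": 0.97},  # invented row for illustration
]
selected = select_best_face_per_frame(rows)
print(selected)
```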

2) Dimensional reduction:
The feature vector generated for each frame was very long, i.e., 1 x 709. Therefore, it became necessary to reduce the dimensions of the features to obtain unique features that could be used as differentiators for each level of engagement. The algorithms used at this stage were PCA (Principal Component Analysis) and SVD (Singular Value Decomposition). The explained variance value refers to the percentage of the variance of the initial data that is retained. The number of components extracted covered a minimum of 80% of the explained variance in the data. In other words, at least 80% of the variance of the data was successfully captured. The greater the value of the explained variance, the better the original data were represented. Based on Table III, component = 300 was chosen because it had the highest explained variance value for both PCA and SVD. In addition, it was chosen so that each video would form a square feature matrix of size 300 x 300 (300 frames by 300 features). Thus, using PCA and SVD, the number of features was reduced from 709 to 300.
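The idea of picking a component count that covers a minimum share of explained variance can be sketched as below. This is an illustration under stated assumptions rather than the procedure used in the study: it assumes the singular values of the mean-centered feature matrix are already available, in which case the variance explained by each component is proportional to its squared singular value. The function names and the toy singular values are hypothetical.

```python
from typing import List, Sequence


def explained_variance_ratio(singular_values: Sequence[float]) -> List[float]:
    """Share of total variance per component (assumes centered data)."""
    squared = [s * s for s in singular_values]
    total = sum(squared)
    return [v / total for v in squared]


def components_for_variance(singular_values: Sequence[float],
                            threshold: float = 0.80) -> int:
    """Smallest number of leading components whose cumulative share >= threshold."""
    cumulative = 0.0
    ratios = explained_variance_ratio(singular_values)
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= threshold - 1e-12:  # small tolerance for rounding
            return k
    return len(ratios)


toy_singular_values = [4.0, 2.0, 1.0, 1.0]  # toy values, sorted in descending order
print(explained_variance_ratio(toy_singular_values))
print(components_for_variance(toy_singular_values, threshold=0.80))
```

In practice, a library routine would typically report these ratios directly; the manual version is shown only to make the 80% criterion explicit.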

3) Normalization:
The feature matrices produced in the dimensional reduction stage had different ranges of values. It became necessary to normalize the data to prevent them from turning into noise in the model training process. The data normalization method used in this independent study was the min-max normalization method. This normalization method produced new feature values that had the same range from 0 to 1.
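Min-max normalization maps each value x to (x - min) / (max - min). The sketch below applies it per feature column, which is an assumption, since the text does not state whether scaling was done per feature or over the whole matrix. The function name and toy matrix are hypothetical.

```python
from typing import List


def min_max_normalize(matrix: List[List[float]]) -> List[List[float]]:
    """Scale every column of a row-major matrix to the range [0, 1]."""
    n_cols = len(matrix[0])
    mins = [min(row[j] for row in matrix) for j in range(n_cols)]
    maxs = [max(row[j] for row in matrix) for j in range(n_cols)]
    normalized = []
    for row in matrix:
        normalized.append([
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            (row[j] - mins[j]) / (maxs[j] - mins[j]) if maxs[j] > mins[j] else 0.0
            for j in range(n_cols)
        ])
    return normalized


toy = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]  # toy values for illustration
normalized = min_max_normalize(toy)
print(normalized)
```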

D. Imbalanced Dataset Handling
Based on Table II, the data for each level of engagement needed to be better balanced. It was necessary to balance the data to avoid overfitting in the prediction results. SMOTE (Synthetic Minority Over-sampling Technique) was used as the oversampling technique. It synthesizes new minority-class samples by interpolating between existing minority-class samples and their nearest neighbours, until the minority classes are balanced with the majority class. A comparison of the number of data before and after applying SMOTE is given in Table IV.
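The core interpolation step of SMOTE can be illustrated with a simplified, dependency-free sketch. It is not the implementation used in the study: the base sample is drawn at random for each synthetic point (a simplification of the original algorithm), samples are assumed to be flat feature vectors, and the function name, parameters, and toy data are all hypothetical.

```python
import random
from typing import List


def smote_like_oversample(minority: List[List[float]], n_new: int,
                          k: int = 5, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_like_oversample(minority, n_new=6, k=2)
print(new_samples)
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stay inside the region already occupied by the minority class.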

E. Data Splitting
Before entering the training stage of the classification model, the pre-processed data were divided into training and testing datasets, with 80% of the data being used for training and the remaining 20% for testing. The training data were used to form a supervised learning classification model, while the testing data were used to evaluate the classification model. The distribution of training and testing data for each engagement level can be found in Table V.
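An 80:20 split can be sketched as follows. The version below splits within each engagement level (a stratified split) so that both subsets keep the same class proportions; whether the study stratified its split is not stated, so this is an assumption, and the function name and toy labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import List, Tuple


def stratified_split(labels: List[int], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: ten samples for each of the four engagement levels (0 to 3).
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```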

F. Classification Model
The CNN classification model consisted of two stages, namely feature extraction and classification. The feature extraction stage consisted of four combinations of convolutional and pooling layers, as can be seen in Fig. 7. The numbers of feature maps in convolutional layers 1, 2, 3, and 4 were 32, 64, 128, and 256, respectively. The kernel size was 5, and the activation function was ReLU. The pooling layers used max-pooling with a pooling size of 2 x 2. The learning parameters tuned during the model-building process were batch size, number of epochs, learning rate, and optimizer. A trial-and-error approach was used to select the parameter values.
In learning with artificial neural networks, the best model is often not found in the final epoch. Therefore, checkpoints and early stopping were used in the training process. A checkpoint stores the CNN model each time the loss value decreases by a specified difference. In this way, if the loss value later increases or stagnates, the CNN model that achieved the lowest loss value is retained. Early stopping halts the CNN learning process when the loss value has not shown a significant decrease over a certain number of epochs, that is, when the model is considered to have converged. This method allows a large maximum number of epochs to be set while saving training time, because training stops once the model shows no further improvement. Early stopping has a patience parameter (p), which is the number of consecutive epochs without a decrease in the loss value after which training is stopped. The patience value used in this study was half the number of epochs.
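The interaction between checkpointing and early stopping can be illustrated with a framework-free sketch that replays a list of per-epoch loss values. It is not the training code of the study: the function name, the min_delta parameter, and the loss values are hypothetical, and whether the monitored quantity was the training or the validation loss is not specified in the text.

```python
from typing import List, Tuple


def run_early_stopping(losses: List[float], patience: int,
                       min_delta: float = 0.0) -> Tuple[int, float, int]:
    """Replay per-epoch losses and return
    (checkpoint_epoch, checkpoint_loss, last_epoch_run)."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if best_loss - loss > min_delta:
            # Loss improved enough: this is where the model would be checkpointed.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, best_loss, epoch  # stop early
    return best_epoch, best_loss, len(losses) - 1


# Toy loss curve for illustration only.
toy_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.69, 0.9]
print(run_early_stopping(toy_losses, patience=3))
print(run_early_stopping(toy_losses, patience=4))
```

With a patience of 3 the run stops before the later improvement at epoch 6 is seen, while a patience of 4 lets training reach it, which illustrates why a generous patience value trades training time for a chance of finding a lower loss.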

V. DISCUSSION
Based on the previous discussion, dimensional reduction was carried out with two approaches: PCA and SVD. Therefore, in evaluating the classification model, a comparison was made between the classification models trained on PCA-reduced data and on SVD-reduced data. Table VII shows the best evaluation value for each experiment. For PCA-CNN, the best model was model 19, with a maximum accuracy of 72.88%, an average accuracy of 69.66%, and the following parameter values: optimizer = Adam, epoch = 1600, learning rate = 10^-4, and batch size = 4. In comparison, model 8 had a higher maximum accuracy of 74.58% but a standard deviation greater than that of model 19 (3.54 > 3.34). The model with the smaller standard deviation was preferred because its accuracy values across the experiments were closer to the average value. For the experiments using SVD-CNN, the highest accuracy was found in model 38, with a maximum accuracy of 77.97% and an average accuracy of 71.02%. The best parameter values obtained for this model were as follows: optimizer = Adam, epoch = 1600, learning rate = 10^-4, and batch size = 8. Model 38 was found to have the smallest standard deviation value. This model was quite stable in its accuracy values over the ten iterations performed for each model.
Regarding the learning rate parameter, the best SVD-CNN and PCA-CNN models both used a learning rate of 10^-4. Compared to a learning rate of 10^-5, a learning rate of 10^-4 gave a shorter computation time: a lower learning rate updates the network weights in finer steps, so the training process takes longer. For the number of epochs, the best SVD-CNN and PCA-CNN models both used 1600 epochs. As can be seen in Table VII, when 800 epochs were applied, the optimal accuracy value had yet to be reached. However, in terms of computational time, the larger the number of epochs, the greater the time required. For the batch size parameter, the best SVD-CNN and PCA-CNN models had different values: SVD-CNN performed best with a batch size of 8, while PCA-CNN performed best with a batch size of 4. The smaller the batch size, the more batches are generated, requiring greater computation time. Table VII shows that SVD-CNN was better than PCA-CNN in terms of accuracy, batch size, number of epochs, and learning rate and required shorter computation time. Based on the accuracy, precision, recall, and average F1-score values, SVD-CNN performed better than PCA-CNN. PCA and SVD were used not only for dimensional reduction but also to select important features from the overall features. In terms of the variance value generated at the data pre-processing stage, PCA-CNN and SVD-CNN had the same variance value at component value = 300. The higher the variance value, the better the data representation for obtaining unique information from the data. Meanwhile, in terms of the correlation between features and engagement level, SVD-CNN had a higher correlation value than PCA-CNN. The features obtained from the SVD results had the best correlation with the engagement level data.
Therefore, in this study, SVD-CNN was superior to PCA-CNN.
The comparison of the values achieved by previous models and by the proposed model on the DAiSEE dataset can be seen in Table IX. The PCA-CNN and SVD-CNN models with data balancing produced the highest accuracy, at 72.88% and 77.97%, respectively, with fewer features than the previous models.

VI. CONCLUSION
In this study, we have successfully improved the benchmark performance of the DAiSEE dataset. The DAiSEE dataset experienced improvements from an average benchmark accuracy of 57.9% in 2016 [9] for baseline benchmarking, to 63.9% in 2020 [26], to 67.48% in 2022 [5]. We applied data balancing using oversampling and undersampling in the Convolutional Neural Network (CNN) classification model. The DAiSEE dataset also went through the pre-processing stages of data selection, dimensional reduction, and normalization. The features used in this study were taken from the OpenFace library, including 709 facial feature values from facial landmark detection, head pose estimation, eye gaze estimation, and facial expressions (facial action units (AUs)) estimations. Dimensional reduction was performed on the OpenFace features obtained using PCA and SVD techniques. A component number of 300 was applied in the PCA and SVD dimensional reduction, which means that the number of unique features was reduced from 709 to 300.
The best PCA-CNN model achieved an accuracy of 72.88%, and the best SVD-CNN model achieved 77.97%. The best CNN model parameter values were as follows: learning rate = 10^-4, optimizer = Adam, epoch = 1600, and batch size = 4 (PCA-CNN) and 8 (SVD-CNN). The best PCA-CNN and SVD-CNN models both had precision values higher than their recall values (0.75 > 0.73 for PCA-CNN and 0.79 > 0.78 for SVD-CNN). This indicates that false negatives occurred, that is, some samples of an actual engagement level were predicted as a different level. Meanwhile, the highest F1-score values were 0.73 (PCA-CNN) and 0.78 (SVD-CNN), which shows that the classification models had fairly good precision and recall values.
From all the experiments that have been carried out, it can be concluded that SVD-CNN performed better than PCA-CNN in terms of average accuracy, maximum accuracy, precision, recall, and F1-score. In terms of the variance value generated at the data pre-processing stage, PCA-CNN and SVD-CNN had the same variance value at component value = 300. The higher the variance value, the better the data representation for obtaining unique information from the data. Meanwhile, in terms of the correlation between features and engagement level, SVD-CNN had a higher correlation value than PCA-CNN. In other words, the features obtained from the SVD results had the best correlation with the level of engagement contained in the data. Therefore, in the current study, SVD-CNN was superior to PCA-CNN.
Although this study has provided better evaluation results than previous studies on the DAiSEE dataset, there remains room for improvement in further research. It is necessary to explore alternative approaches to determining the optimal component values to produce features that have a more significant impact on the classification model. Additionally, conducting a more in-depth analysis of features beyond facial expressions could increase the accuracy of students' engagement detection.