A Comparison of Sampling Methods for Dealing with Imbalanced Wearable Sensor Data in Human Activity Recognition using Deep Learning

Abstract—Human Activity Recognition (HAR) holds significant implications across diverse domains, including healthcare, sports analytics, and human-computer interaction. Deep learning models demonstrate great potential in HAR, but performance is often hindered by imbalanced datasets. This study investigates the impact of class imbalance on deep learning models in HAR and conducts a comprehensive comparative analysis of various sampling techniques to mitigate this issue. The experimentation involves the PAMAP2 dataset, encompassing data collected from wearable sensors. The research includes four primary experiments. Initially, a performance baseline is established by training four deep learning models on the imbalanced dataset. Subsequently, the Synthetic Minority Over-sampling Technique (SMOTE), random under-sampling, and a hybrid sampling approach are employed to rebalance the dataset. In each experiment, Bayesian optimization is employed for hyperparameter tuning, optimizing model performance. The findings underscore the paramount importance of dataset balance, resulting in substantial improvements across critical performance metrics such as accuracy, F1 score, precision, and recall. Notably, the hybrid sampling technique, combining SMOTE and Random Undersampling, emerges as the most effective method, surpassing other approaches. This research contributes significantly to advancing the field of HAR, highlighting the necessity of addressing class imbalance in deep learning models. Furthermore, the results offer practical insights for the development of HAR systems, enhancing accuracy and reliability in real-world applications. Future work will explore alternative public datasets, more complex deep learning models, and diverse sampling techniques to further elevate the capabilities of HAR systems.


I. INTRODUCTION
Human Activity Recognition (HAR) is a multidisciplinary field focused on the automated identification and categorization of human activities, primarily relying on data collected from diverse sensors. Its applications extend into critical domains, particularly sports and healthcare [1]. The automatic detection and classification of human activities can significantly improve the quality of life for elderly individuals and dependents, enhancing their safety, well-being, and independence [2]. HAR systems play a vital role in smart home environments by providing context-aware services to residents, monitoring their activities, and alerting caregivers in case of any abnormal situations [3].
Deep learning models have revolutionized HAR due to their capacity to process and analyze sensor data effectively. Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNN) are two prominent deep learning architectures that excel at learning complex patterns and temporal dependencies from sensor data. These models have demonstrated remarkable performance in HAR, making them the focal point of this study [4], [5]. While deep learning models have shown promise in HAR, one significant challenge is posed by imbalanced datasets [6]. In many real-world scenarios, certain activities or classes are more frequent than others in the data, creating an imbalance. This imbalance can adversely affect the performance and accuracy of HAR models, as they may become biased towards the majority class, leading to poor recognition of minority activities [7].
In summary, Human Activity Recognition (HAR) holds vital implications for various domains. Deep learning models, such as LSTM and CNN, enhance HAR by effectively processing sensor data. The use of sampling techniques to address class imbalance significantly boosts model performance. This research underscores the importance of balanced datasets in HAR and provides practical insights for real-world applications.
The primary contributions of this study are as follows:
• The profound impact of class imbalance on the performance of deep learning models in HAR is investigated, and a range of sampling techniques designed to alleviate this issue is introduced and rigorously evaluated, offering valuable insights into enhancing model performance on imbalanced datasets.
• Three distinct sampling techniques are evaluated: SMOTE, Random Undersampling, and a hybrid approach combining both methods.
• A detailed comparative analysis of the efficacy of these sampling methods in enhancing learning from imbalanced human activity data with deep learning algorithms is provided. Specifically, Vanilla LSTM, 2-Stacked LSTM, 3-Stacked LSTM, and a hybrid CNN-LSTM model are tested. The findings consistently demonstrate that the hybrid sampling technique outperforms state-of-the-art models across critical performance metrics, including accuracy, precision, recall, and F1 score.
The paper is organized into distinct sections to effectively present the research findings. Section II presents the Related Work, reviewing prior research related to the problem. Section III elaborates on the Materials and Methods employed in the experimental approach. Section IV presents the outcomes of the experiments, and Section V engages in a thorough discussion of the findings. Finally, Section VI presents the Conclusion, with the key findings and future directions of sensor-based HAR using deep learning models.

II. RELATED WORK
In the field of Human Activity Recognition (HAR), addressing imbalanced data presents a significant challenge, a common issue observed in various public datasets, including Opportunity [8], WISDM V1.1 [9], SPHERE [10], and PAMAP2 [11]. Imbalanced data can profoundly affect the performance of deep learning models utilized in HAR tasks. To tackle this challenge, several studies have explored the integration of deep learning models with sampling methods specifically designed for Human Activity Recognition based on sensor data. Jeong et al. (2022) conducted a comprehensive study focusing on the influence of undersampling and oversampling techniques for classifying physical activities using an imbalanced accelerometer dataset. Their findings proposed that ensemble learning, coupled with well-defined feature sets and undersampling, exhibits robustness in the classification of physical activities within imbalanced datasets. This approach proves particularly effective in real-world scenarios, where imbalanced class distributions are commonplace. Furthermore, the study underscored the superiority of ensemble learning over other machine learning and deep learning models in handling small datasets with subject variability [12]. Hamad et al. (2020) evaluated the efficacy of imbalanced data handling methods in the context of deep learning applied to smart home environments. Leveraging a CNN-LSTM model and a dataset comprising daily living activities collected from two real intelligent homes, their research demonstrated a significant performance improvement by applying the SMOTE oversampling method. This enhancement resulted in a notable increase in accuracy (from 0.60-0.62 to 0.71-0.73) when compared to training on the original imbalanced data [13]. Alani et al. (2020) delved into the classification of imbalanced multi-modal sensor data for HAR within smart home environments, using deep learning techniques in conjunction with oversampling (specifically, SMOTE) and undersampling methods. The results unequivocally favored the SMOTE method over undersampling in effectively addressing imbalanced data challenges within HAR tasks using the SPHERE dataset [14]. Alharbi et al. (2022) made significant contributions by investigating the effectiveness of oversampling methods, such as SMOTE and its hybrid variations, in improving the classification of minority classes in diverse datasets. For instance, on the PAMAP2 dataset, the MLP achieved an F1 score of 0.7185 using the SMOTE sampling method, compared to its baseline score of 0.7473 [7].
In addition to the aforementioned studies, recent research has showcased the potential of deep learning models for HAR [18].
Table I summarizes the performance of various deep learning models on the PAMAP2 dataset using different sampling methods for sensor-based HAR.
These studies collectively emphasize the positive impact of oversampling techniques, particularly SMOTE, in enhancing model performance when compared to training on imbalanced datasets. They lay the foundation for this research, which aims to further investigate the efficacy of sampling methods in improving the performance of deep learning models for HAR on the PAMAP2 dataset.
In the field of HAR, a significant gap exists in prior research: insufficient attention has been paid to how different sampling techniques affect the performance of deep learning models. While some studies have tackled imbalanced data in HAR, they often overlook the critical role that sampling methods play. This gap highlights the need for a more thorough investigation into how sampling techniques and deep learning intersect in HAR. This study addresses the gap by thoroughly examining how various sampling methods impact the performance of deep learning models in real-world HAR scenarios. The goal is to provide a clearer picture of how sampling methods and deep learning models work together, ultimately improving the accuracy and reliability of activity recognition in sensor-based applications.

III. MATERIAL AND METHODS
In this research, the impact of class imbalance on HAR using wearable sensor data and deep learning models was investigated. To address this issue, three sampling methods were thoroughly examined: SMOTE, Random Undersampling, and a hybrid combination of the two. The study involved training four deep learning models, namely Vanilla LSTM, 2-Stacked LSTM, 3-Stacked LSTM, and Hybrid CNN-LSTM, on the PAMAP2 dataset. Through rigorous experimentation and evaluation, the aim was to identify the most effective sampling approach to improve model performance and generalization in HAR. The findings are expected to contribute valuable insights towards enhancing the accuracy and reliability of HAR systems deployed in real-world scenarios.

A. PAMAP2 Dataset
The PAMAP2 dataset [11], which stands for "Physical Activity Monitoring using a Multipurpose Sensor," holds a prominent role in the realm of Human Activity Recognition (HAR) research. Its comprehensive data collection approach, diverse participant demographic, and meticulous data organization make it a valuable resource for the research community.
Key characteristics of the PAMAP2 dataset are summarized from the dataset documentation [11]. Table III provides an overview of the data distribution within the PAMAP2 dataset, highlighting a significant class imbalance among the activity labels in both the training and testing sets. This issue is further underscored in Fig. 1, where it becomes evident that the "Rope jumping" activity exhibits notably fewer instances than other activities. This skewed data distribution can exert a substantial impact on the performance of deep learning models. Consequently, it becomes imperative to implement suitable sampling strategies to guarantee the reliability and accuracy of results.

B. Deep Learning Models
1) Long short-term memory (LSTM): LSTM networks belong to the category of recurrent neural networks (RNNs) and hold significance in time series applications, particularly HAR, which involves the classification of activities based on sensor data such as accelerometer and gyroscope readings from smartphones. The strength of LSTM networks in HAR lies in their capability to capture and model long-term dependencies present within the sensor data [19].

2) Hybrid deep learning model (CNN-LSTM):
This research harnesses the power of hybrid models, specifically the integration of Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. This combination, as evidenced by several studies [14], [4], holds promise for achieving high performance in HAR tasks. The rationale behind selecting this hybrid model is compelling: CNN excels at capturing spatial relationships within data, while LSTM is adept at modeling temporal dependencies. This combination allows the strengths of both architectures to be leveraged [20]. One notable advantage of the hybrid model is that the CNN accelerates the feature extraction process, enhancing training efficiency. This synergy between CNN and LSTM contributes to the model's overall effectiveness in recognizing human activities based on sensor data.
3) Deep learning model configurations: In the following deep learning configurations for multiclass HAR classification, various layers play distinct roles. These include the LSTM layer, dropout layer, dense layer with softmax activation for probability estimation, convolutional (Conv1D) layer, and max pooling layers. The LSTM layer captures sequential dependencies in the data, making it suitable for time series or sequential data. The dropout layer helps prevent overfitting by randomly deactivating a fraction of neurons during training, enhancing the model's generalization. The dense layer, often found in the final stage, produces class scores (logits). The softmax activation function applied to these logits converts them into class probabilities. The convolutional (Conv1D) layer extracts spatial features from the input data. Max pooling layers reduce the spatial dimensions while retaining essential information, aiding feature selection and computational efficiency. Combined, these layers enable the deep learning model to process, understand, and classify data efficiently and accurately.
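The softmax step described above, which converts the dense layer's logits into class probabilities, can be illustrated with a small self-contained sketch (NumPy, with hypothetical logits; this is an illustration, not code from the study):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert a vector of class logits into probabilities.

    Subtracting the maximum first is the standard numerical-stability
    trick; it does not change the result.
    """
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical logits for a 4-class activity problem.
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)
print(probs)                  # probabilities sum to 1
print(int(np.argmax(probs)))  # predicted class index -> 0
```

The argmax of the probabilities equals the argmax of the raw logits, so softmax only matters when calibrated probabilities (or a cross-entropy loss) are needed.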
In this study, several configurations of deep learning models for HAR are explored. These configurations include:

a) Vanilla LSTM: This straightforward LSTM setup consists of a single hidden layer of LSTM units and an output layer for prediction. It has proven effective in various small sequence prediction tasks [21].

C. Sampling Techniques
To tackle the challenge of imbalanced data, three different sampling techniques were applied:

1) SMOTE (Synthetic minority over-sampling technique):
This sampling technique serves as an effective tool for addressing imbalanced datasets in machine learning. It creates synthetic data points for the underrepresented class by interpolating between existing samples. In the context of sensor-based Human Activity Recognition (HAR) using deep learning, SMOTE plays a vital role in enhancing the classification accuracy of models such as Multi-Layer Perceptrons (MLPs) [7].
Deep learning models require large amounts of data and are very sensitive to the imbalanced class problem. This is where SMOTE steps in, generating artificial samples for the minority class, thereby balancing the dataset and significantly improving the classification accuracy of these deep learning models [14].
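As a rough illustration of SMOTE's core idea, the following sketch interpolates between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified stand-in for the full algorithm (in practice, a library implementation such as imbalanced-learn's `SMOTE` would be used); the data and parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_like_oversample(X_min: np.ndarray, n_new: int, k: int = 3) -> np.ndarray:
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest minority-class neighbours.

    Simplified sketch of SMOTE's interpolation step only.
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # k nearest neighbours of x within the minority class (index 0 is x itself)
        dists = np.linalg.norm(X_min - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        nb = X_min[rng.choice(neighbours)]
        gap = rng.random()                    # random position along the segment
        synthetic.append(x + gap * (nb - x))  # point between x and its neighbour
    return np.array(synthetic)

X_minority = rng.normal(size=(10, 3))         # 10 minority samples, 3 features
X_new = smote_like_oversample(X_minority, n_new=15)
print(X_new.shape)  # (15, 3)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the minority class's region of the feature space rather than being arbitrary noise.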
2) Random undersampling: This sampling method addresses imbalanced datasets by randomly removing samples from the majority class to achieve balance. In sensor-based Human Activity Recognition (HAR) with deep learning, it is employed to boost the classification accuracy of deep learning models [14], which require large amounts of data and are sensitive to class imbalances. However, this method can lead to the loss of critical information from the majority class, potentially impacting the model's classification accuracy [7]. Hence, it is crucial to carefully select the samples for removal to prevent the loss of vital information.
3) Hybrid sampling: Hybrid sampling is a technique for dealing with imbalanced datasets that combines oversampling and undersampling: synthetic samples are generated for the minority class using SMOTE, while samples are randomly removed from the majority class using Random Undersampling. The combination of these two methods balances the dataset and improves the classification accuracy of deep learning models [7], [14]. Hybrid sampling is particularly effective in sensor-based Human Activity Recognition (HAR) when combined with deep learning models. It addresses the challenge of imbalanced classes while mitigating the risk, inherent in random undersampling alone, of losing valuable information from the majority class. By generating synthetic samples for the minority class through SMOTE, hybrid sampling ensures a well-represented minority class in the dataset. This balanced dataset significantly enhances the classification accuracy of deep learning models, while also promoting data diversity [14].
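A minimal sketch of the hybrid idea, assuming a toy `{label: samples}` layout and using simple pair interpolation in place of full SMOTE (illustrative only, not the study's implementation):

```python
import random

random.seed(0)

def hybrid_balance(samples_by_class, target=None):
    """Balance a {label: [samples]} dict: classes above the target are
    randomly undersampled; classes below it are oversampled by
    interpolating random pairs (a SMOTE-style simplification).

    `target` defaults to the mean class size, a common middle ground.
    """
    sizes = [len(v) for v in samples_by_class.values()]
    target = target or sum(sizes) // len(sizes)
    balanced = {}
    for label, samples in samples_by_class.items():
        if len(samples) > target:                      # random undersampling
            balanced[label] = random.sample(samples, target)
        else:                                          # SMOTE-style oversampling
            extra = []
            while len(samples) + len(extra) < target:
                a, b = random.sample(samples, 2)
                gap = random.random()
                extra.append(tuple(ai + gap * (bi - ai)
                                   for ai, bi in zip(a, b)))
            balanced[label] = list(samples) + extra
    return balanced

# Hypothetical imbalanced toy data: a majority and a minority activity.
data = {
    "walking":      [(float(i), float(i % 5)) for i in range(40)],
    "rope_jumping": [(0.0, 1.0), (1.0, 0.5), (0.5, 0.2), (2.0, 1.5)],
}
balanced = hybrid_balance(data, target=20)
print({k: len(v) for k, v in balanced.items()})  # {'walking': 20, 'rope_jumping': 20}
```

Choosing a target between the minority and majority sizes is what keeps the two halves complementary: the minority class needs fewer synthetic points than pure oversampling would require, and the majority class loses fewer real samples than pure undersampling would discard.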

D. Hyperparameter Tuning with Bayesian Optimization
Model hyperparameters are crucial in deep learning, shaping training algorithms and model performance. Bayesian optimization offers an effective means to optimize these parameters, particularly in complex, function-based problems lacking simple analytical solutions. To apply Bayesian optimization to time series and sensor-based Human Activity Recognition (HAR) using LSTM models, the following steps can be followed:

Step 1: Define the hyperparameter search space.
Step 2: Specify the objective function to evaluate model performance.
Step 3: Initialize the Bayesian optimization algorithm with hyperparameter values.
Step 4: Iteratively use the algorithm to suggest hyperparameters for evaluation.
Step 5: Continue until predefined convergence criteria are met, like a set number of iterations or desired performance levels.
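The five steps above can be sketched as a generic optimization loop. For brevity, the suggestion step below is plain random sampling; a real Bayesian optimizer (e.g. Keras Tuner's `BayesianOptimization`) replaces `suggest` with a surrogate-model-driven strategy, and `objective` would train and validate a model rather than score a toy preference:

```python
import random

random.seed(7)

# Step 1: define the hyperparameter search space.
SPACE = {
    "lstm_units": range(64, 257, 32),
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "learning_rate": [1e-3, 1e-4, 1e-5],
}

# Step 2: objective function scoring one hyperparameter set.
# A real objective would train the model and return validation accuracy;
# this placeholder encodes a made-up preference so the loop runs.
def objective(hp):
    return -abs(hp["lstm_units"] - 128) / 128 - hp["dropout"]

# Steps 3-4: initialize, then iteratively suggest and evaluate candidates.
def suggest(space):
    # Stand-in for the Bayesian suggestion step (random sampling here).
    return {k: random.choice(list(v)) for k, v in space.items()}

best_hp, best_score = None, float("-inf")
for trial in range(20):            # Step 5: stop after a fixed trial budget.
    hp = suggest(SPACE)
    score = objective(hp)
    if score > best_score:
        best_hp, best_score = hp, score

print(best_hp)
```

The value of Bayesian optimization over this random stand-in is that each suggestion is informed by all previous (hyperparameters, score) observations, so fewer expensive training runs are wasted on unpromising regions of the space.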

E. Evaluation Metrics
In the experiments, various evaluation metrics were used to assess the HAR models' performance: accuracy, F1 score, precision, recall, and the confusion matrix. The most common is the confusion matrix, a two-dimensional table of class labels in which one axis represents the actual class and the other the predicted class. Accuracy is the metric most widely used to evaluate model classification; it is the ratio of correct predictions to overall predictions. Accuracy is a good measure when the dataset classes are balanced, but is otherwise not appropriate on its own. For imbalanced datasets, other metrics are used, such as precision, recall, F-measure, and specificity. Table IV presents the definitions of all these metrics [22].
Understanding these performance metrics requires knowledge of four fundamental terms used in their measurement: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).

Specificity, defined as TN / (TN + FP), is the ratio of actual negative (class 0) instances that are correctly predicted as negative. The F1 score (F-measure), also defined in Table IV, is the harmonic mean of precision and recall.
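These definitions can be computed directly from the four confusion-matrix counts; the snippet below uses hypothetical counts for a single activity class:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics derived from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Hypothetical counts for one activity class.
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print({k: round(v, 3) for k, v in m.items()})
```

Note how the example illustrates the point made above: accuracy (0.85) looks healthy even though recall for the positive class is only 0.80, which is why per-class metrics matter on imbalanced data.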

IV. EXPERIMENTS AND RESULTS

A. Experimental Design
This research aims to investigate the impact of data balancing techniques on the performance of deep learning models for Human Activity Recognition (HAR) by addressing the following research questions:

1) How does class imbalance affect the performance of deep learning models in Human Activity Recognition (HAR) when applied to wearable sensor data?
2) What are the comparative effects of different sampling techniques, such as SMOTE, Random Undersampling, and Hybrid Sampling, in addressing the class imbalance in wearable sensor data for HAR?
3) What role does hyperparameter tuning play in improving the accuracy and performance of deep learning models for HAR, particularly in the context of imbalanced datasets?
4) Which combination of sampling technique and hyperparameter tuning strategy yields the most significant performance improvements in HAR using deep learning models for imbalanced wearable sensor data?
The hypothesis guiding this study is that balancing the dataset will result in enhanced classification accuracy in HAR using deep learning models. The experiments were carried out using the PAMAP2 dataset collected from wearable sensors, encompassing wrist, chest, and ankle devices. Four deep learning models were employed: Vanilla LSTM, 2-Stacked LSTM, 3-Stacked LSTM, and CNN-LSTM.

B. Experimental Setup
The experiments were performed on an NVIDIA V100 GPU using the Google Colaboratory Pro+ platform. The four models' hyperparameters were optimized through Bayesian hyperparameter optimization, utilizing the Keras Tuner library [23]. The experimental setup is detailed in Table V.

C. Experiment Pipeline
To evaluate the models' performance on the PAMAP2 dataset, a comprehensive experiment pipeline was executed. This pipeline is composed of multiple stages, each playing a vital role in the experiments (see Fig. 6):

1) Data collection: Initially, the raw sensor data from wearable devices were collected.
2) Data preprocessing: The dataset goes through a preprocessing phase, involving actions like data cleaning, noise reduction, and normalization.
During this stage, the raw sensor data from wearable devices are readied for the proposed models. The subject-specific files containing activity records are consolidated into one data frame. To adhere to PAMAP2 guidelines, invalid orientation columns are removed and transient activity rows are dropped. Non-numeric data is transformed into numeric form, and missing values are interpolated to ensure data integrity. Scaling is applied to normalize input features, ensuring data uniformity. Labels are encoded and converted into categorical variables, a critical step for activity classification during model training.
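A simplified sketch of these preprocessing actions on a toy data frame (the column names are invented for illustration, not the PAMAP2 schema):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged subject files: one sensor channel
# with missing readings plus an activity label.
df = pd.DataFrame({
    "hand_acc_x": [0.1, np.nan, 0.3, 0.4, np.nan, 0.6],
    "activity":   ["walking", "walking", "sitting",
                   "sitting", "walking", "sitting"],
})

# Interpolate missing sensor values, as in the preprocessing stage above.
df["hand_acc_x"] = df["hand_acc_x"].interpolate()

# Standard-scale the input feature (zero mean, unit variance).
col = df["hand_acc_x"]
df["hand_acc_x"] = (col - col.mean()) / col.std()

# Encode string labels as integer categories for classification.
df["label"] = df["activity"].astype("category").cat.codes

print(df["label"].tolist())
```

In a full pipeline the integer codes would then be one-hot encoded (the "categorical variables" step above) before being fed to a softmax output layer.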
The data is then split into training and testing sets, with 70% allocated for training and 30% for testing. The data is segmented into overlapping windows, with a window size of 1 second and a 50% overlap. This segmentation process creates segments and associated labels for both training and testing. The segments and labels are reshaped to align with the LSTM models' input format. The experiment validates the shape of the training and testing segments before moving on to model training and evaluation. The four models were trained using Bayesian optimization to fine-tune hyperparameters for optimal model performance, with the Keras Tuner library used to search for the best hyperparameters.
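The windowing step just described can be sketched as follows (a 100 Hz sampling rate is assumed so that 100 samples correspond to a 1-second window; the data are random placeholders):

```python
import numpy as np

def segment(signal: np.ndarray, labels: np.ndarray,
            window: int, overlap: float = 0.5):
    """Cut a (time, features) signal into overlapping windows.

    With 50% overlap, consecutive windows start `window // 2` samples
    apart, matching the segmentation described in the text.
    """
    step = int(window * (1 - overlap))
    segments, seg_labels = [], []
    for start in range(0, len(signal) - window + 1, step):
        segments.append(signal[start:start + window])
        # Label each window by its most frequent sample label.
        vals, counts = np.unique(labels[start:start + window],
                                 return_counts=True)
        seg_labels.append(vals[np.argmax(counts)])
    return np.array(segments), np.array(seg_labels)

X = np.random.default_rng(0).normal(size=(1000, 6))  # 10 s of 6-channel data
y = np.repeat([0, 1], 500)
segments, seg_labels = segment(X, y, window=100, overlap=0.5)
print(segments.shape)  # (19, 100, 6)
```

The resulting (windows, timesteps, features) array is exactly the 3-D input shape that Keras LSTM layers expect, which is what the reshaping step above refers to.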
The models are fine-tuned by adjusting several critical hyperparameters. The LSTM units, which determine the number of LSTM units in each LSTM layer, are explored within the range of 64 to 256, with a step size of 32. Similarly, the dense units, specifying the number of units in the dense layer, are considered within the range of 32 to 128, with a step size of 32. The batch size, significant for model training, is chosen from 32, 64, or 128. The learning rate, influencing the optimizer, is selected from 1e-3, 1e-4, and 1e-5. Furthermore, the dropout rate, controlling the dropout applied after each LSTM layer and the dense layer, varies from 0.1 to 0.5, with a step size of 0.1. The optimizer hyperparameter allows the choice of ADAM or RMSprop to compile the model. The number of epochs in these experiments ranges from 50 to 100. This extensive exploration and fine-tuning ultimately results in enhanced accuracy and robust performance for HAR tasks. All these hyperparameters are summarized in Table VI.
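For reference, the ranges just described can be collected into a single search-space sketch (a plain dictionary mirroring the values in the text; the layout is ours, not Keras Tuner syntax):

```python
# Hyperparameter search space as described in the text. The dictionary
# structure is an illustrative convention, not a library format.
SEARCH_SPACE = {
    "lstm_units":    {"min": 64, "max": 256, "step": 32},
    "dense_units":   {"min": 32, "max": 128, "step": 32},
    "batch_size":    [32, 64, 128],
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "dropout_rate":  {"min": 0.1, "max": 0.5, "step": 0.1},
    "optimizer":     ["adam", "rmsprop"],
    "epochs":        {"min": 50, "max": 100},
}

print(len(SEARCH_SPACE))  # 7 tuned hyperparameters
```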

6) Model evaluation: To evaluate the models' performance on the PAMAP2 dataset, the evaluation metrics used were accuracy, precision, recall, F1-score, and the confusion matrix. These metrics were compared against those reported in previous studies conducted on the same dataset, enabling a comprehensive assessment of the proposed models' effectiveness and advancements in HAR.

1) Experiment 1: Training on the imbalanced dataset: The four deep learning models were first trained on the preprocessed imbalanced dataset. The optimization of hyperparameters for these models was carried out using Keras Tuner Bayesian optimization. The best hyperparameters of each model are summarized in Table VII.

Table VIII presents the results of Experiment 1, showcasing the performance metrics of the models, including accuracy, precision, recall, and F1-score. These metrics were measured to establish a baseline for comparison.

2) Experiment 2: Balancing data with SMOTE: In this experiment, model performance was evaluated after training on a dataset balanced using the Synthetic Minority Over-sampling Technique (SMOTE). The four models were trained on the SMOTE-balanced dataset, and the search for the best hyperparameters for each model was conducted with Keras Tuner, as shown in Table VIII.
Performance metrics achieved in this experiment were observed and reported in Table IX, with a comparison to those from Experiment 1. Fig. 11 to Fig. 14 depict the confusion matrices for the Vanilla LSTM, 2-Stacked LSTM, 3-Stacked LSTM, and CNN-LSTM models, respectively, on the data balanced with SMOTE (see Table X).

3) Experiment 3: Random undersampling: Experiment 3 entailed the assessment of the models' performance when trained on a dataset balanced through Random Undersampling. The four models were trained on the randomly undersampled training set. The search for the best hyperparameters for each model was conducted using Keras Tuner, as indicated in Table XI.

In summary, this study conclusively demonstrates the efficacy of hybrid sampling techniques in effectively addressing class imbalance challenges in HAR. The proposed models consistently achieve strong results, especially the 3-Stacked LSTM, surpassing the other models in terms of accuracy, precision, recall, and F1 scores. This underscores the crucial importance of balancing data for better-performing deep models. The comparative plots in Fig. 23 to Fig. 26 provide a visual representation of these findings.

Comparison with previous studies: Previous research has extensively explored diverse deep learning models for Human Activity Recognition (HAR) using the PAMAP2 dataset. As demonstrated in Table XV, these prior studies have yielded impressive outcomes. In 2022, a convLSTM Autoencoder (AE) model exhibited remarkable accuracy, recording a value of 0.9433, along with an F1 score of 0.9446 [4]. Similarly, in 2023, a Bi-LSTM model demonstrated commendable performance, achieving a high accuracy of 0.9341 and an F1 score of 0.9341, complemented by notable precision and recall values [17]. Finally, this study demonstrates the effectiveness of hybrid sampling techniques in addressing class imbalance in HAR, leading to higher accuracy, precision, recall, and F1 scores. These models consistently outperformed the best-performing models from previous research, underscoring their potential to significantly enhance the accuracy and reliability of HAR systems and demonstrating the importance of tackling the imbalanced data problem.

V. DISCUSSION
Prior studies such as [6], [24] have highlighted the lack of works that address and investigate the impact of the class imbalance problem in human activity recognition.This present study fills this gap by comparing three sampling approaches, SMOTE, Random Undersampling, and Hybrid sampling to reduce the class imbalance and substantially improve human activity recognition (HAR) performance.
In this section, a comprehensive discussion of the experimental findings and their implications for the field of HAR using deep learning models is presented.The consideration encompasses the following key aspects: the impact of class imbalance, the effectiveness of sampling techniques, and the significance of hyperparameter tuning.

1) Hyperparameter tuning enhances model adaptability and performance:
In all the experiments, hyperparameter tuning was applied in each scenario, proving to be a highly beneficial approach. The optimization of hyperparameters for each experiment ensured that the deep learning models were tailored to perform optimally under specific conditions. This adaptability is crucial in real-world applications, where data characteristics and sampling techniques may vary. Moreover, hyperparameter tuning significantly contributed to the fairness of this comparative analysis: it prevented any model from having an unfair advantage due to suboptimal hyperparameters, ensuring a more equitable evaluation of the different sampling techniques.
Overall, the inclusion of hyperparameter tuning in this experimental design serves as a robust foundation for meaningful comparisons and insights into HAR.

2) Addressing class imbalance with sampling techniques:
The experiments aimed to investigate the impact of different sampling techniques on the performance of deep learning models in HAR. To address this, four experiments were conducted, each involving variations in data preprocessing and sampling, and each incorporating hyperparameter tuning.
The results clearly demonstrate the notable impact of sampling techniques on model performance, further enhanced by hyperparameter tuning.
In Experiment 2, following the application of SMOTE and hyperparameter tuning, substantial improvements in accuracy, F1 score, precision, and recall were observed across all models. This underscores the effectiveness of SMOTE in addressing the class imbalance issue, especially when combined with optimal hyperparameters. The balanced dataset led to enhanced recognition efficiency, with significant gains in accuracy and F1 score.
In Experiment 3, involving Random Undersampling and hyperparameter tuning, the models exhibited decreased performance compared to the baseline.
In Experiment 4, employing hybrid sampling and hyperparameter tuning, remarkable results were achieved. By combining the strengths of SMOTE and Random Undersampling with fine-tuned hyperparameters, high accuracy and F1 scores were achieved, surpassing the baseline. This confirms the potential of hybrid sampling as a powerful technique for enhancing model performance, especially when hyperparameters are tuned effectively.
Hybrid sampling demonstrates its effectiveness in balancing data by leveraging the strengths of both oversampling (SMOTE) and undersampling (Random Undersampling). It begins by oversampling the minority class, increasing its representation, and then undersamples the majority class to reduce redundancy. This approach enhances model performance, mitigates overfitting, and ensures that deep learning models are exposed to a more representative and diverse distribution of data. Consequently, these factors contribute to improved generalization, enabling models to make more accurate predictions. It is this combination of advantages that positions hybrid sampling as the outperforming technique compared to the other sampling methods.
3) Model performance and generalization: The findings suggest that deep learning models trained on balanced datasets exhibit improved performance compared to those trained on imbalanced data. This result highlights the significance of addressing class imbalance in HAR applications. Furthermore, these models demonstrated robust generalization capabilities, indicating their potential for real-world deployment.
4) Practical implications: The practical implications of this research extend to various applications, including healthcare, fitness tracking, and human-computer interaction. By improving the accuracy and reliability of HAR systems through both sampling techniques and hyperparameter tuning, this work contributes to enhancing user experiences and promoting healthier lifestyles.
5) Limitations and future work: It is important to acknowledge the limitations of this study. The choice of datasets, model architectures, and hyperparameters may affect the generalizability of the findings. Future research could explore additional datasets and more complex model architectures, and further investigate hyperparameter tuning techniques. Additionally, real-world deployment of HAR systems should consider challenges related to sensor placement, data privacy, and user variability.
In conclusion, this study emphasizes the critical role of both sampling techniques and hyperparameter tuning in improving the performance of deep learning models for HAR. SMOTE and hybrid sampling, when coupled with effective hyperparameter tuning, demonstrate their effectiveness in addressing class imbalance. The enhanced accuracy and F1 scores achieved through these combined techniques pave the way for more reliable and efficient HAR systems with broader applications.

VI. CONCLUSION
In this extensive study on Human Activity Recognition (HAR) using deep learning models and wearable sensor data, the goal was to enhance the accuracy and reliability of HAR systems, which are crucial in healthcare and sports analytics. The challenge of imbalanced datasets in HAR was addressed by exploring different sampling techniques: Synthetic Minority Over-sampling Technique (SMOTE), random undersampling, and hybrid sampling (a combination of SMOTE and random undersampling). These techniques were tested with various deep learning models, including Vanilla LSTM, 2-Stacked LSTM, 3-Stacked LSTM, and Hybrid CNN-LSTM. The findings showed significant improvements in model performance when using sampling techniques to balance the data. SMOTE and hybrid sampling were particularly effective in countering class imbalance, leading to notable enhancements in model accuracy, precision, recall, and F1 score. The importance of hyperparameter tuning, involving adjustments to specific model settings, was also highlighted: by fine-tuning these parameters, even better model performance was achieved, emphasizing the critical connection between data preprocessing and parameter configuration. As wearable sensors become more prevalent, this research contributes to the creation of systems that can better understand and interpret human actions in various real-world scenarios. Future work will involve experiments with more diverse public datasets, the exploration of more complex deep learning models, and the investigation of additional sampling techniques to further advance the field of Human Activity Recognition.

APPENDIX A

Fig. 2
Fig. 2 illustrates the architecture of the Vanilla LSTM model. It provides a visual representation of the model's structure, showcasing the flow of data through its layers, including the LSTM layer, dropout layers, and dense layers, ultimately leading to the output layer for activity classification.

Fig. 3. Structure of the 2-Stacked LSTM model.

c) 3-Stacked LSTM: Similar to the 2-Stacked LSTM but with an additional layer of LSTM units, this configuration aims to further enhance the model's capacity for temporal feature extraction [21]. Fig. 4 provides an overview of the 3-Stacked LSTM model's architecture, designed for Human Activity Recognition (HAR). This model excels at capturing intricate temporal patterns within sensor data. It comprises three LSTM layers, each with 32 units, to model complex temporal relationships, with dropout layers integrated to prevent overfitting during training. The model also includes two dense layers with 64 and 12 units, respectively, for feature extraction and final activity classification. In summary, the 3-Stacked LSTM model is engineered to achieve robust and accurate activity recognition in HAR scenarios by effectively handling temporal dependencies in the data and ensuring generalization through its dropout mechanisms.
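The layer stack described above can be expressed as a short Keras sketch. The window length, channel count, and dropout rate below are illustrative placeholders (the paper tunes such settings with Keras Tuner); only the three 32-unit LSTM layers and the 64- and 12-unit dense layers follow the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder input dimensions; the actual window length and sensor
# channel count depend on the preprocessing configuration.
TIMESTEPS, CHANNELS, CLASSES = 100, 9, 12

model = keras.Sequential([
    layers.Input(shape=(TIMESTEPS, CHANNELS)),
    # Three stacked LSTM layers of 32 units each; the first two return
    # full sequences so the next LSTM layer receives a time dimension.
    layers.LSTM(32, return_sequences=True),
    layers.Dropout(0.2),  # dropout rate is an assumption, not a tuned value
    layers.LSTM(32, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(32),
    layers.Dropout(0.2),
    # Dense layers with 64 and 12 units for feature extraction and
    # final activity classification, as described in the text.
    layers.Dense(64, activation="relu"),
    layers.Dense(CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```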
d) Hybrid Model (CNN-LSTM): The CNN-LSTM model is a hybrid architecture that combines Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers. Fig. 5 outlines the Hybrid CNN-LSTM model for HAR. The architecture starts with a CNN layer followed by dropout and max pooling for feature extraction. An LSTM layer then captures temporal patterns, again with dropout for regularization, and a final dense layer performs the activity classification. This design handles both the spatial and the temporal aspects of the sensor data, ensuring robust activity recognition in HAR.
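A hedged Keras sketch of this layer ordering follows; the filter count, kernel size, pooling size, LSTM units, and dropout rate are assumptions for illustration, since the text does not specify them.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder input dimensions, as in the other sketches.
TIMESTEPS, CHANNELS, CLASSES = 100, 9, 12

model = keras.Sequential([
    layers.Input(shape=(TIMESTEPS, CHANNELS)),
    # CNN stage: convolution + dropout + max pooling for spatial
    # feature extraction over each window.
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.Dropout(0.2),
    layers.MaxPooling1D(pool_size=2),
    # LSTM stage: captures temporal patterns in the pooled features.
    layers.LSTM(32),
    layers.Dropout(0.2),
    # Final dense layer performs the activity classification.
    layers.Dense(CLASSES, activation="softmax"),
])
```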

3) Data class balancing: In the data balancing stage, three sampling techniques were applied to tackle class imbalance in the experiments: SMOTE, Random Undersampling, and a hybrid approach. SMOTE was used to generate synthetic instances for the minority classes, Random Undersampling reduced the number of instances in the majority class, and the hybrid approach combined both methods. The objective was to create balanced datasets to enhance model training. These sampling techniques were applied exclusively to the training set to ensure class balance for improved model performance.
4) Data segmentation: The data were segmented into overlapping windows, using a window size of one second with 50% overlap, making them suitable for the deep learning models.
5) Training and hyperparameter optimization: The deep learning models performed feature extraction automatically to identify relevant patterns in the segmented data.
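The windowing step can be sketched as follows; the 100 Hz sampling rate (so that a one-second window is 100 samples) and the rule of labelling each window by its last sample are assumptions for this example.

```python
import numpy as np

def segment(signal, labels, win, step):
    """Cut a (T, channels) signal into overlapping windows of length
    `win`, advancing `step` samples each time. Each window is given
    the label of its last sample (this labelling rule is an assumption)."""
    starts = range(0, len(signal) - win + 1, step)
    X = np.stack([signal[s:s + win] for s in starts])
    y = np.array([labels[s + win - 1] for s in starts])
    return X, y

# One-second windows with 50% overlap; a 100 Hz sampling rate is
# assumed here, so win = 100 samples and step = 50 samples.
WIN = 100
windows, window_labels = segment(np.zeros((1000, 9)),
                                 np.zeros(1000, dtype=int),
                                 WIN, WIN // 2)
```

Each resulting window is a fixed-size (win, channels) array, which is the input shape the LSTM-based models expect.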

Four experiments were conducted:
 Experiment 1: Train and test the four models on the imbalanced dataset.
 Experiment 2: Train and test the four models on the dataset balanced with SMOTE.
 Experiment 3: Train and test the four models on the dataset balanced with Random Undersampling.
 Experiment 4: Train and test the four models on the dataset balanced with hybrid sampling (SMOTE & Random Undersampling).

D. Experiments Results
1) Experiment 1: Baseline: In Experiment 1, the baseline was established to compare the effects of the various data balancing techniques.
Fig. 15 to Fig. 18 illustrate the confusion matrices for the Vanilla LSTM, 2-Stacked LSTM, 3-Stacked LSTM, and CNN-LSTM models, respectively, on the data balanced through Random Undersampling.
4) Experiment 4: Hybrid sampling: In Experiment 4, the models' performance was examined when trained on a dataset balanced using hybrid sampling, which combines SMOTE and random undersampling. The four models were trained on the hybrid-sampled dataset, with the best hyperparameters for each model found using Keras Tuner, as indicated in Table XIII. Performance metrics from this experiment were documented in Table XIV and compared with the results from Experiment 1. Fig. 19 to Fig. 22 illustrate the confusion matrices for the Vanilla LSTM, 2-Stacked LSTM, 3-Stacked LSTM, and CNN-LSTM models, respectively, on the data balanced through hybrid sampling.

Fig. 22. Confusion matrix of CNN-LSTM on balanced data with hybrid sampling.

E. Comparative Results Analysis
In this research paper, a comparative study was conducted employing four distinct deep learning models: Vanilla LSTM, 2-Stacked LSTM, 3-Stacked LSTM, and hybrid CNN-LSTM. The study aimed to address the challenge of class imbalance in Human Activity Recognition (HAR) through three sampling techniques: SMOTE, Random Undersampling, and a hybrid sampling approach. The performance of these models was evaluated on key metrics, including accuracy, F1 score, precision, and recall.
Accuracy comparison of deep learning models on the different sampling techniques (illustrated in Fig. 23):
 For models trained on imbalanced data, the 2-Stacked LSTM exhibited the highest accuracy (0.9531), closely followed by the Vanilla LSTM (0.9257).
 When using SMOTE to balance the data, the Vanilla LSTM performed remarkably well (0.9499); the 2-Stacked LSTM also showed strong performance (0.9438).
 With hybrid sampling, the models reached even higher accuracy: the 2-Stacked LSTM achieved 0.9755, the Vanilla LSTM 0.9821, and the 3-Stacked LSTM the most remarkable performance at 0.9828.
F1-score comparison of deep learning models on the different sampling techniques (see Fig. 24):
 In terms of F1 score, similar trends were observed; the 2-Stacked LSTM model performed exceptionally well

Fig. 23. Accuracy comparison of deep learning models on different sampling techniques.

TABLE I.
PREVIOUS STUDIES' PERFORMANCE ON THE PAMAP2 DATASET USING DEEP LEARNING MODELS AND SAMPLING METHODS FOR SENSOR-BASED HAR

Table A1 in Appendix A provides a comprehensive overview of the dataset instances before and after the application of the various sampling methods, allowing a clear visualization of how each sampling technique impacts the dataset composition.

TABLE IV.
PERFORMANCE METRICS

TABLE VII.
THE SUMMARIZED HYPERPARAMETERS OF THE FOUR MODELS FOUND BY KERAS TUNER ON IMBALANCED DATA

TABLE VIII.
RESULTS OF EXPERIMENT 1 ON IMBALANCED DATA

TABLE IX.
THE SUMMARIZED HYPERPARAMETERS OF THE FOUR MODELS FOUND BY KERAS TUNER ON BALANCED DATA WITH SMOTE

Fig. 11. Confusion matrix of Vanilla LSTM on balanced data with SMOTE.
Table XII shows the results of Experiment 3 on the data balanced with random undersampling. Performance metrics achieved in this experiment were observed and reported there, with a comparison to those from Experiment 1.

TABLE X.
RESULTS OF EXPERIMENT 2 ON BALANCED DATA WITH SMOTE

TABLE XI.
THE SUMMARIZED HYPERPARAMETERS OF THE FOUR MODELS FOUND BY KERAS TUNER ON BALANCED DATA WITH RANDOM UNDERSAMPLING

TABLE XII.
RESULTS OF EXPERIMENT 3 ON BALANCED DATA WITH RANDOM UNDERSAMPLING

Fig. 15. Confusion matrix of Vanilla LSTM on balanced data with random undersampling.

TABLE XIII.
THE SUMMARIZED HYPERPARAMETERS OF THE FOUR MODELS FOUND BY KERAS TUNER ON BALANCED DATA WITH HYBRID SAMPLING

TABLE XIV.
RESULTS OF EXPERIMENT 4 ON BALANCED DATA WITH HYBRID SAMPLING

Fig. 19. Confusion matrix of Vanilla LSTM on balanced data with hybrid sampling.

TABLE XV.
COMPARISON WITH PREVIOUS WORKS

TABLE A1:
DATASET INSTANCES BEFORE AND AFTER APPLYING SAMPLING METHODS