A Hybrid Deep Neural Network for Human Activity Recognition based on IoT Sensors

Internet of Things (IoT) sensors have received a lot of interest in recent years due to rising application demands in domains such as ubiquitous and context-aware computing, activity surveillance, ambient assisted living and, more specifically, human activity recognition (HAR). Recent developments in deep learning make it possible to extract high-level features automatically and remove the reliance on traditional machine learning techniques, which depend heavily on handcrafted features. In this paper, we introduce a network that can identify a variety of everyday human actions carried out in a smart home environment, using raw signals generated by IoT motion sensors. Our architecture combines convolutional neural network (CNN) and gated recurrent unit (GRU) layers. The CNN is first deployed to extract local and scale-invariant features, then the GRU layers are used to extract sequential temporal dependencies. We tested our model, called CNGRU, on three public datasets. It achieves accuracy better than or comparable to existing state-of-the-art models.

Keywords: IoT; deep learning; CNN; GRU; CNGRU; human activity recognition


I. INTRODUCTION
The Internet of Things (IoT) is a technology with a lot of potential: it provides a platform where sensors and devices can communicate seamlessly within a smart environment. Each year, the number of IoT-capable devices increases; sectors such as transport, healthcare, security, smart cities, education, agriculture, and many others have already benefited from its development. This will result in a generation of applications capable of completing complex sensing and recognition tasks to support a new world of human-things interactions. Human activity recognition (HAR) is a field at the interface between computers and humans that has recently been promoted by the expansion of artificial intelligence. This progress has reached a stage where HAR is integrated into several fields, to the point that its applications are found in everyday life: in security, by making surveillance more intelligent [1]; in smart homes, by improving security, monitoring the health condition of the residents [2], and increasing the degree of independence and quality of life, especially for the elderly [3]; and in healthcare, by deploying one or more recognition techniques that notify the medical staff once an intervention is necessary [4].
This widespread availability is owing to significant efforts to reduce the size of the electronic components and create sensors that can be included in smartphones, smart watches, and other wearable internet of things devices.
Depending on the type of sensors used, activity recognition is categorized as vision-based or sensor-based. The first category deploys cameras to obtain images and videos and uses them to detect and classify activities; however, it faces challenges such as image variation, object deformation, and mobility constraints imposed by visual sensors, besides other problems related to power consumption and privacy. Sensor-based recognition, on the other hand, relies on accelerometers, gyroscopes, geomagnetic sensors and others, which are simple to use and generate relatively accurate and reliable data. The classic approaches require a lot of data pre-processing and domain knowledge for feature engineering, which must be repeated at every change of dataset and limits the generalization of the model.
Recently, deep learning has achieved strong performance in image, speech, and natural language processing, and it is now being introduced in human activity recognition to profit from its capacity to learn complex movements by abstracting features automatically from raw data rather than handcrafting them. Deep learning's layer-by-layer structure enables it to progressively learn features from simple to complex, which is effective in the analysis of multimodal sensory data. The various deep learning architectures are capable of encoding these features from diverse perspectives. For example, CNNs can capture local multimodal sensory connections, whereas RNNs can extract temporal dependencies and learn information incrementally across multiple time intervals.
Sensor-based HAR is achieved through four major steps: data collection, data segmentation, feature selection or extraction, and finally classification of the activity. Most previous works in HAR rely on manual feature engineering, which requires expert knowledge. The method proposed in this article does not require any design or creation of features; it directly exploits the data generated by the accelerometer and gyroscope. The key contributions of our work are as follows. First, we propose CNGRU, an end-to-end network for HAR capable of automatically extracting and learning features from raw data without pre-processing. Second, we deploy a combination of two types of neural networks: convolutional and gated recurrent units. Third, the network can recognize various activities and gestures recorded using different types and combinations of sensors. Fourth, experiments on three of the most widely used open datasets show that we reach results comparable to or better than previous methods, which demonstrates the generalization capability of the model.
We organize our paper as follows: Section II reviews related work on human activity recognition. In Section III, we propose our model for HAR. Section IV presents and examines the experimental results. Finally, in Section V, we draw our conclusion.

II. RELATED WORK
Prior studies on human activity recognition have been conducted using open-access datasets available on the internet. The UCI HAR dataset has mainly been exploited, alone or with other datasets such as Opportunity [5], WISDM V1.1 [6], and PAMAP2 [7]. This availability of data has facilitated the design and evaluation of activity recognition approaches based on motion sensors. Whereas some works investigate feature selection in order to achieve higher accuracy, others attempt to avoid this design and engineering task by exploiting the capacity of deep learning models. The convolutional neural network is the most common model in the approaches proposed in the literature; researchers exploit its ability to capture local connections, as well as the recurrent neural network and its variants, which are capable of capturing temporal dependencies between signal readings. In other works, these two networks are fused or cascaded to learn the most important features.
The authors in [8] proposed a hybrid architecture that combines LSTM and CNN. After preprocessing the data, they fed it to two LSTM layers for temporal feature extraction, while two convolution layers extracted the spatial features.
Deep et al. [8] used the UCI HAR dataset to test their model, composed of a CNN followed by an LSTM network. They achieved better recognition scores than a simple LSTM architecture. On the same dataset, Hernández et al. [9] presented the idea of using bidirectional LSTM networks to recognize the six activities of this dataset. They attained high recognition performance, except for the static activities laying and standing. Ahmad et al. [10] introduced a new approach based on an architecture called multi-head CNN to recognize human activities. The fundamental idea is to employ three CNNs supplied by three streams: overall acceleration, body acceleration, and body gyroscope. The results of these parallel CNNs are then integrated and passed to an LSTM layer, resulting in high recognition accuracy. Sikder et al. [11] used frequency and power features of raw activity signals and fed each stream to a CNN channel; the results are concatenated for classification, and an accuracy of 95.25% is obtained on UCI HAR.
Other works have explored the effect of depth on recognition. The authors in [12] proposed HDL, a Hierarchical Deep Learning model capable of recognizing activities with an accuracy of 97.95% on the UCI HAR dataset. Their model is composed of several BLSTM layers, which capture information from the original data, followed by CNN layers that learn features from the output of the last BLSTM layer; classification is obtained at the end using a Softmax layer. Xu et al. [13] proposed InnoHAR, a network that takes advantage of Inception-like modules for feature extraction, combined with GRU for extracting sequential temporal dependencies. Gao et al. [14] proposed a method called DanHAR, designed for challenging scenarios with multimodal sensors. Their model uses a hybrid approach that fuses information using a dual-attention mechanism with a CNN, which improves the ability to capture temporal and spatial patterns, resulting in better performance while keeping the number of parameters small. Teng et al. [15] proposed a convolutional network with a local loss after each CNN module. They compared a baseline model containing three CNN layers and one fully connected layer with the same model trained first with a similarity matching loss, second with a cross-entropy loss, and third with a combination of the two. Sena et al. [16] divided the data into several inputs according to the type of sensor, then built a deep CNN (DCNN) for each of them to extract temporal scales and features. Each DCNN is made up of two convolutional layers followed by a max pooling layer. In the end, the ensemble of DCNNs is merged using a late fusion method. A different approach was used by Bokhari et al. [17], who exploited Channel State Information (CSI) to estimate and classify activities performed in an indoor environment using a deep gated recurrent network (DGRU).

III. MATERIALS AND METHODS
Even though conventional HAR methods have reached good scores, their reliance on handcrafted features and their need for heavy data preprocessing limit their scalability to other datasets. Convolutional neural networks, recurrent neural networks, and their combinations have enabled the creation of shallow and deep end-to-end models, resulting in high recognition scores on complicated tasks.

A. Convolutional Neural Network
This architecture is based on the convolutional layer, which performs the convolution operation on the input by multiplying it element-wise by the weights of a filter and summing the products to obtain the value at each position. The output of this linear operation is passed to a nonlinear activation function $g$ and can be expressed as:

$$a_{i,j} = g\left(\sum_{l=0}^{L-1}\sum_{k=0}^{K-1} w_{l,k}\, x_{i+l,\,j+k} + b\right)$$

where $x_{i+l,\,j+k}$ is the activation of the higher neurons linked to the neuron $(i, j)$, $w$ is a matrix of size $L \times K$ containing the weights of the convolution filter, and $b$ is the bias [18].
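As an illustration of the weighted sum over an L-by-K filter described above, the following minimal NumPy sketch computes one convolutional feature map. The function name `conv2d_valid`, the choice of ReLU for the activation g, and the absence of stride and padding are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def conv2d_valid(x, w, b=0.0, g=lambda v: np.maximum(v, 0.0)):
    """Slide an L-by-K filter w over x, add the bias b, and apply g.

    Each output value is g(sum_l sum_k w[l, k] * x[i + l, j + k] + b),
    computed only where the filter fits entirely inside x.
    g defaults to ReLU, which is an assumption of this sketch.
    """
    L, K = w.shape
    rows = x.shape[0] - L + 1
    cols = x.shape[1] - K + 1
    a = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # Element-wise product of the filter with the patch, then sum.
            a[i, j] = np.sum(w * x[i:i + L, j:j + K]) + b
    return g(a)

# Toy input: a 4x4 grid and a 2x2 filter that subtracts the top-left
# value of each patch from its bottom-right value.
x = np.arange(16, dtype=float).reshape(4, 4)
w = np.array([[-1.0, 0.0], [0.0, 1.0]])
out = conv2d_valid(x, w, b=0.5)
```

On this toy input every patch yields the same weighted sum, so the 3-by-3 output is constant; real sensor data would of course produce position-dependent responses.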
The convolutional network is a type of neural network mainly composed of convolutional layers, but other layers such as max pooling and fully connected layers can also be present and stacked one after another to add depth and build a hierarchical network [19]. For feature extraction, the convolutional layer and the max pooling layer can be deployed together as a single part, whereas the second part, which classifies the resulting feature vectors, is dedicated to the fully connected layer and typically contains a number of nodes equal to the number of classes [20].
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 11, 2021, www.ijacsa.thesai.org

B. Gated Recurrent Unit
Conventional recurrent neural networks suffer from the vanishing gradient problem: the network cannot propagate useful gradient information back to the input layers, which makes optimization difficult and prevents it from learning long-term dependencies [21]. Long short-term memory units (LSTMs) [22] and, more recently, gated recurrent units (GRUs) [23] are two modifications of the RNN designed to solve this problem. Although LSTMs achieve state-of-the-art performance, they need more inference time and processing. In our work we use GRUs, which are simpler than LSTMs, have fewer parameters, and give a good trade-off between speed and performance [24]. The recurrent transitions of the GRU are given by:

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$$
$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$$
$$\tilde{h}_t = \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t]\right)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $\{W_z, W_r, W_h\}$ designate the recurrent weights, $x_t$ is the input at time $t$, $h_t$ and $\tilde{h}_t$ are hidden states, $\sigma$ denotes the sigmoid function, and $\odot$ denotes component-wise (Hadamard) multiplication. $z_t$ is the update gate and $r_t$ is the reset gate.
The update gate determines how much of the hidden state $h_{t-1}$ is replaced by the new candidate hidden state $\tilde{h}_t$, that is, whether the update is performed. The reset gate regulates how much of the prior state we wish to retain: if $r_t$ is equal to 1, we keep the information from the previous state; otherwise, this state is neglected.
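To make the roles of the update and reset gates concrete, here is a minimal NumPy sketch of a single GRU transition. It assumes the common formulation in which each weight matrix acts on the concatenation of the previous hidden state and the current input, omits biases, and lets the update gate weight the candidate state; the name `gru_step` and the toy dimensions are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU transition (biases omitted, an assumption of this sketch)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)  # update gate
    r_t = sigmoid(W_r @ hx)  # reset gate
    # Candidate state: the reset gate scales the previous state first.
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    # Blend the previous state and the candidate using the update gate.
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand
    return h_t, z_t, r_t

# Run a short random sequence through the cell (toy sizes).
rng = np.random.default_rng(0)
n_hidden, n_input = 4, 6
shape = (n_hidden, n_hidden + n_input)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):
    h, z, r = gru_step(x_t, h, W_z, W_r, W_h)
```

Because the new state is a gate-weighted blend of the previous state and a tanh-bounded candidate, the hidden state stays within (-1, 1) when it starts at zero.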

C. Overview
Activity recognition is considered a classification problem, and the signals extracted from motion sensors are time series data. In our approach, convolutional neural networks are applied to these raw signals to avoid the need for feature engineering and to take advantage of local dependencies and correlations between signal measurements [25]. The extraction of temporal features is the next stage. Because a simple RNN suffers from the vanishing gradient problem, we opted to run the signals through three consecutive GRU layers. We chose GRU because of its ability to deal with long sequences and its time efficiency [26].

D. Proposed Architecture
Our architecture is inspired by LeNet-5 [27] and benefits from its simplicity and straightforwardness. The original network uses a pair of convolutional and average pooling layers, followed by a flattening layer, two fully connected layers, and finally a Softmax classifier. It was initially designed for handwritten and printed character recognition. We made the following changes: we divided the layers into two groups, convolution layers and dense layers; we reduced the number of units in the last layer; we replaced two-dimensional convolution and two-dimensional average pooling with one-dimensional convolution and one-dimensional average pooling; and finally we injected what we call a GRU block in between.
Different GRU block configurations were tested and evaluated in order to select the one with the highest accuracy. TABLE I contains the configuration of each injected block.
The first GRU block contains only one layer with 100 units, followed by a dropout layer of 20%. This architecture has the advantage of being simple and light, and its training was fast, but it could not recognize all the activities well. To solve this problem, we added a second layer, kept the number of nodes at 100 for each layer, and preserved the 20% dropout after each layer; the results showed an increase in accuracy of more than 2%. In the third architecture, we wanted to test the effect of depth on the initial network: in GRU block 3 we increased the number of nodes in the first two layers to 128, added a third layer with 64 nodes, and used batch normalization instead of dropout after each layer. The experimental results for each network (CNN + GRU block) are presented in TABLE II. We find that the third network has the best accuracy, which means that stacking three GRU layers gives the model the capability to better extract the sequential temporal dependencies, while batch normalization layers served better than dropout in reducing overfitting. This improvement in accuracy is also accompanied by a reduction in the number of parameters from 455,566 to 427,950. Fig. 1 illustrates the final architecture, and Fig. 2 presents the diagram of the solution proposed in this paper.
Several recent studies have demonstrated that a one-dimensional convolutional neural network is well suited for the analysis and extraction of discriminative features from time series data generated by sensors such as accelerometers and gyroscopes, and that it has the ability to learn an internal representation of data sequences [28]. Average pooling is often used instead of max pooling since it can extract features more smoothly. As mentioned earlier, the 128-128-64 combination of GRU layer nodes outperformed the 100-100 and 100 node combinations used in the other two architectures.
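The one-dimensional convolution and average pooling applied to a sensor window can be sketched in NumPy as follows. This is only an illustration: the window shape (128 time steps, 9 channels), the filter width, the number of filters, the ReLU activation, and the helper names `conv1d_valid` and `pool1d` are assumptions, since the actual layer parameters are those listed in TABLE III.

```python
import numpy as np

def conv1d_valid(window, kernels, bias):
    """1-D convolution over time without padding, followed by ReLU.

    window:  (timesteps, channels) raw sensor window
    kernels: (width, channels, filters) filter bank
    bias:    (filters,) one bias per filter
    """
    width, _, filters = kernels.shape
    steps = window.shape[0] - width + 1
    out = np.empty((steps, filters))
    for t in range(steps):
        patch = window[t:t + width]  # (width, channels) slice in time
        out[t] = np.tensordot(patch, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

def pool1d(features, pool=2, reduce=np.mean):
    """Non-overlapping pooling over time; np.mean gives average pooling."""
    steps = features.shape[0] // pool
    trimmed = features[:steps * pool]
    return reduce(trimmed.reshape(steps, pool, -1), axis=1)

rng = np.random.default_rng(0)
window = rng.normal(size=(128, 9))                # hypothetical input window
kernels = rng.normal(scale=0.1, size=(5, 9, 16))  # hypothetical filter bank
features = conv1d_valid(window, kernels, np.zeros(16))
pooled = pool1d(features)                         # average pooling

# Average pooling dampens an isolated spike, max pooling keeps it.
spiky = np.array([[0.0], [8.0], [2.0], [2.0]])
avg = pool1d(spiky, reduce=np.mean)
mx = pool1d(spiky, reduce=np.max)
```

The last three lines illustrate the smoothing remark: the spike of 8 is averaged down to 4 by average pooling, while max pooling passes it through unchanged.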
We used the Adam optimizer with a learning rate fixed at 0.001, tested batch sizes of 32, 64, and 128, and finally chose 64 since it produced the best results. We trained the model for up to 1000 epochs with early stopping. TABLE III contains a definition of each layer and the parameters used in our network.
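The early stopping rule can be sketched in plain Python as below. The patience value, the function name `early_stopping_epoch`, and the validation history are hypothetical; the text only states that training runs for at most 1000 epochs and ends when the validation accuracy stops increasing.

```python
def early_stopping_epoch(val_accuracies, max_epochs=1000, patience=10):
    """Return (last_epoch_run, best_epoch, best_accuracy).

    Training stops once `patience` epochs have passed without the
    validation accuracy exceeding its best value so far, or when
    max_epochs is reached. The patience value is an assumption.
    """
    best_acc = float("-inf")
    best_epoch = 0
    last_epoch = 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        last_epoch = epoch
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return last_epoch, best_epoch, best_acc

# Hypothetical validation accuracies, one value per epoch.
history = [0.80, 0.88, 0.91, 0.93, 0.92, 0.93, 0.92, 0.91]
stopped, best, acc = early_stopping_epoch(history, patience=3)
```

With this history and a patience of 3, the best accuracy is reached at epoch 4 and training ends at epoch 7, so the final epoch in the list is never run.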

IV. EXPERIMENTAL RESULTS

A. Evaluation Methodology
We ran tests on three publicly available datasets. Here is a short description of each one:
UCI HAR [29]: this dataset was gathered from 30 users aged 19-48 who wore smartphones around their waists while performing a series of activities. The information gathered is classified into six activity classes, three of which are static (standing, sitting, and laying) and the other three dynamic (walking, going upstairs, and going downstairs). The accelerometer and gyroscope embedded in the phone (Samsung Galaxy SII) enabled the measurement of three-axial linear acceleration as well as three-axial angular velocity.
WISDM V1.1 [6]: this dataset was collected using only one IMU (an accelerometer). The activities were selected carefully, depending on how regularly they are performed in daily life: Walking, Jogging, Upstairs, Downstairs, Sitting, and Standing. This dataset has approximately the same activities as UCI HAR; Fig. 3 contains a description of its activities.
SKODA [30]: this dataset was recorded using only one type of IMU in a manufacturing scenario, and covers the problem of recognizing the activities of assembly-line workers in a car production environment. A worker wore a number of sensors while performing manual quality checks for the correct assembly of parts in newly built cars. Ten resulting hand movements are considered.

The validation protocol should also be taken into consideration when dividing data into training, validation, and test sets, since it impacts the recognition results and their comparison. The parameters we used to compare the model's performance are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where T: True, P: Positives, F: False, N: Negatives. We also use the confusion matrix to obtain a summarized view of the classification performance and to see which types of errors are made and where the confusion occurs.
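As a sketch of how such scores can be derived from a confusion matrix in the multi-class case, the following NumPy example computes the overall accuracy and the per-class precision, recall, and F1 score. The function name `scores_from_confusion` and the three-class matrix are hypothetical and do not reproduce any result from the paper.

```python
import numpy as np

def scores_from_confusion(confusion):
    """confusion[i, j] = number of samples of true class i predicted as j."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp  # predicted as the class but wrong
    fn = confusion.sum(axis=1) - tp  # belonging to the class but missed

    def safe_div(num, den):
        # Return 0 where the denominator is 0 instead of dividing.
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = tp.sum() / confusion.sum()
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical three-class confusion matrix (rows: true, columns: predicted).
confusion = [[50, 0, 0],
             [0, 40, 10],
             [0, 5, 45]]
report = scores_from_confusion(confusion)
```

In this toy matrix the first class is never confused, while the second and third are partly confused with each other, which lowers their F1 scores even though the overall accuracy stays at 0.9.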

C. Results
We ran several tests on two other datasets to evaluate the performance and validate the efficiency of the proposed method. We used WISDM V1.1 and SKODA: the first contains activities similar to those of UCI HAR, while the second contains a different type of gesture. We present the detailed results for UCI HAR, which was used in the design and tuning of our model, then we compare the results obtained with WISDM V1.1 at the level of each activity, and finally we evaluate our approach on SKODA. The UCI HAR signals were pre-processed by noise filtering and then sampled in fixed-width sliding windows of 2.56 s with 50% overlap; we took 21 subjects for training and 9 for testing. We fed our network with data in a specific shape. Accuracy and loss over each epoch are used for evaluation. We trained the model for up to 1000 epochs, using the early stopping technique to end training when the validation accuracy stops increasing. All the datasets were uploaded to Google Drive, and we used Google Colaboratory for the experiments. Our model achieved an accuracy of 96.77%. As shown in TABLE V, this value is comparable to the state of the art and to other works that use handcrafted features, classical machine learning algorithms, unsupervised machine learning algorithms, or models combining these methods.
To show the correspondence between the predicted labels and the true ones, we used the confusion matrix illustrated in Fig. 4. It shows that we achieve good recognition for all activities. The static activity LAYING is easily identified, with an accuracy of 100%, and is not confused with any other activity. The dynamic activities WALKING_UP and WALKING are also well recognized, but the accuracies for STANDING and SITTING are relatively lower, which reduces the total score of the model. Furthermore, these two classes are often confused with each other, which could be explained by the similarity of their signals.
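The fixed-width sliding-window segmentation with 50% overlap can be sketched as follows. The 50 Hz sampling rate comes from the UCI HAR dataset documentation rather than from the text above, which makes each 2.56 s window 128 samples long; the function name `sliding_windows` and the synthetic six-channel signal are assumptions for illustration.

```python
import numpy as np

def sliding_windows(signal, window_size=128, overlap=0.5):
    """Cut a (n_samples, n_channels) recording into overlapping windows."""
    step = int(round(window_size * (1.0 - overlap)))
    starts = range(0, signal.shape[0] - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])

sampling_rate_hz = 50                          # from the dataset documentation
window_size = round(2.56 * sampling_rate_hz)   # 128 samples per window

# Synthetic recording: 640 samples (12.8 s) of 6 channels.
signal = np.arange(640 * 6, dtype=float).reshape(640, 6)
windows = sliding_windows(signal, window_size=window_size, overlap=0.5)
```

Each window starts 64 samples after the previous one, so consecutive windows share half of their readings and the 640-sample recording yields 9 windows of shape (128, 6).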
The second experiment was on WISDM V1.1, again using raw data. This time we evaluated our results using K-fold cross-validation to allow a reasonable comparison with preceding works. The model predicts all activities with high accuracy. The overall accuracy is 98.21%, which is close to the result obtained on the same dataset by Alsheikh et al. [39] with DL-HMM, a hybrid model using deep learning and hidden Markov models (98.23%). It improves on the accuracy of the ensemble learning method [40], and is slightly above the model proposed by Ravi et al. [41] on the basis of a shallow CNN architecture.
[Figure: comparison on WISDM V1.1 of Ravi et al. [41], Alsheikh et al. [39], Ravi et al. [42], and our model]
The confusion matrix of the WISDM V1.1 dataset is presented in Fig. 6. Walking and Sitting achieved recognition close to 100%. We also note that the relative lack of samples for the Sitting and Standing classes did not affect their recognition, which means that the change in orientation of the sensor on the thigh is easily detected and learned, helping to better identify each class. Jogging, an activity that requires the movement of the whole body from point A to point B, is well identified, whereas Walking Upstairs and Downstairs are often confused with each other, which indicates that the model has difficulty distinguishing between these types of movements.
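The K-fold cross-validation protocol used for WISDM V1.1 can be sketched as an index generator in NumPy. The number of folds used on this dataset is not stated, so k = 10 is an assumption here, as are the shuffling seed, the sample count, and the function name `k_fold_indices`.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)  # shuffle before splitting
    folds = np.array_split(order, k)    # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Hypothetical dataset of 103 windows split into 10 folds.
splits = list(k_fold_indices(103, k=10))
```

The model would be trained and evaluated once per split, and the reported score would be the average over the k test folds. For sensor data recorded from several subjects, a subject-wise split is a stricter alternative to this random split.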
In this part we compare the ability of our model to detect each activity belonging to UCI HAR and WISDM V1.1 with that of other models. We chose these activities because they are the most regularly performed in daily life, and they are recorded differently in the two datasets. This comparison should help us understand the relevance of our approach.
The UCI HAR and WISDM V1.1 datasets both contain six activities: five are shared, and one differs in each dataset (jogging in WISDM V1.1 and laying in UCI HAR). Dividing the activities into two categories, static and dynamic, helps to understand the behavior of the model. We compare and evaluate each activity according to its F1 score, since the classes are imbalanced.
We observe that the static activities sitting, standing, and laying are distinguished by the model from the others even when the dataset changes, which indicates its aptitude to detect these activities despite using only an accelerometer instead of its combination with a gyroscope. We also deduce that the location of the sensors does not affect the detection of these activities. The remaining activities, "Walking Downstairs", "Jogging", "Walking Upstairs", and "Walking", are dynamic; they represent the vast majority of the data in WISDM V1.1 and almost half of the UCI HAR dataset. Jogging and Walking are correctly identified 99% of the time in WISDM V1.1, and Walking 96% of the time in UCI HAR. The lowest score achieved is 93% in WISDM V1.1, which indicates that the model does not easily detect the Downstairs class. Considering the number of sensors, we remark that a single accelerometer alone did not provide the information needed to identify the dynamic actions related to climbing or descending, specifically moving downstairs or upstairs: they obtain the lowest scores among the classes, and this also holds for the other works presented in TABLE VI. On the other hand, we note that the recording in UCI HAR, made with both a gyroscope and an accelerometer, allowed good detection despite the small number of samples, as indicated in TABLE IV.
The best recognized classes are jogging and walking in the WISDM V1.1 dataset, followed by walking upstairs in the UCI HAR dataset; the lowest score, 93%, is obtained for walking downstairs.
Comparing our results with other approaches, we see that our network classifies activities as well as or better than other works that use feature engineering (such as the spectrogram domain of the time series signal), a hierarchical continuous hidden Markov model, or complex end-to-end deep learning networks.
In this part we test our model on a dataset that does not share the characteristics of the two previous ones. As previously mentioned, SKODA contains hand gestures made in an assembly environment. Performed by a single subject and recorded with one type of sensor, it contains 10 gesture classes. To evaluate our work and compare it with others, we used 10-fold cross-validation. The accuracy of our network is 96%. Fig. 7 shows that it outperforms other works previously done on the same dataset. The classification results are shown in Fig. 8 in the form of a confusion matrix. In this matrix we see that the model recognizes all the activities with a high score, except for the activity "close both left Front door", which is confused with "opening left front door" and "closing left front door". We also see that the NULL class causes the largest confusion.

V. CONCLUSION
In this paper we aimed to integrate Internet of Things (IoT) technology and deep learning to recognize human activities. We presented CNGRU, a new structure that combines convolution layers with GRU. This architecture is able to learn features automatically from raw data, unlike previous works based on handcrafted features. The effectiveness of this architecture is demonstrated by experiments on three datasets containing a variety of activity classes and recorded using different sensors. We achieved 96.77% on UCI-HAR, 98.21% on WISDM V1.1, and 96.70% on SKODA. These results are superior or close to those of existing state-of-the-art approaches that use shallow or deep designs or classical methods. Future work will investigate a resource-efficient implementation of this network for IoT devices and explore other datasets that contain more complex activities.