Multi-Modal RGB-D Action Recognition with CNN-LSTM Ensemble Deep Network

Human action recognition has evolved from a video processing problem into a multi-modal machine learning problem. The objective of this work is to perform multi-modal human action recognition with an ensemble hybrid network of CNN and LSTM layers. The proposed CNN-LSTM ensemble network is a two-stream framework in which one ensemble stream learns RGB sequences and the other learns depth sequences. The framework can learn both spatial and temporal dynamics in RGB and depth action data. The hybrid network is receptive to both spatial and temporal fields because of the hierarchical structure of CNNs and LSTMs. To test the proposed model, we used our own BVRCAction3D dataset and three benchmark RGB-D action datasets. Experiments on all the datasets show that the proposed framework is effective when compared with similar deep learning architectures.
Keywords: Human action recognition; RGB-D video data; convolutional neural networks; long short-term memory


I. INTRODUCTION
Human action recognition has traditionally been treated as a computer vision problem in which a set of video processing algorithms extracts features that become the input to a classification algorithm. However, these video processing algorithms depend heavily on the orientations of pixels in the video frames, which affects the performance of the classifier as a whole. Despite this instability in generalizing classifier performance, human action recognition is applied in surveillance networks, industrial automation, medical analysis and sports analysis, to name a few. In contrast to RGB-only video sensors, low-cost multi-modal sensors such as Microsoft Kinect can now augment RGB sequences with depth and skeletal information. At the same time, the progress of deep learning algorithms such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) has been instrumental in enhancing the performance of multi-modal action recognition systems.
In recent years, deep learning architectures have been shown to learn and complement the unique features in RGB, depth and skeletal data, improving performance in action recognition tasks [1], [2]. Specifically, the work in [3] shows that using auxiliary data in the form of skeleton and depth enhances the accuracy of an action recognition system based on RGB videos. A multiple kernel learning framework was applied effectively to RGB-D action data to extract multi-modal features and fuse them, which improved accuracy [4]. Further, a few of these models explored sparse modelling of dense RGB and depth features that were translated into a weighted bag of words (BOW) [5] representation for classification. Most of these works experimented with full action sequences and ignored the temporal information accompanying the action.
Initially, the idea was to extract motion information from RGB and depth sequences using optical flow, Kalman tracking and, in some cases, by packing motion into a single image called a motion history image (MHI) [6]. Even though these methods improved classifier performance, they had difficulty learning spatio-temporal features that generalize an action. The fundamental difficulty with multi-modal sequences is forming a multi-dimensional tensor that indexes the modalities together with their spatial and temporal information in one field. Learning and temporal pooling on this multi-dimensional tensor is then a challenging task. Moreover, time-varying modalities always induce constraints due to variable sequence lengths. Despite these gaps in data acquisition and processing, time-varying multi-modal features can enhance the performance of the learning algorithms. The question, then, is how to teach a classifier the spatio-temporal modalities for RGB-D action recognition tasks.
Previously, we approached this problem by dividing the multiple modalities into fixed-length action sequences, which are then arranged as a multi-layered multi-modal tensor. These multi-dimensional tensors are processed by deep convolutional neural networks (CNNs) to learn spatial representations, thereby completely ignoring the temporal structure [7].
In this paper, we propose a hybrid recurrent CNN-based deep learning framework for multi-modal action recognition from RGB and depth data. Our proposed CNN-LSTM network is inspired by the recurrent CNNs for action recognition in [8]. However, it differs from models proposed previously for multi-modal action sequences [9], [10], [11], [12] in two aspects. First, the multi-modal data used in our work consists of RGB and depth sequences. Second, our proposed CNN-LSTM Action Network (CLANet) is an ensemble of streams of layers.
CLANet extracts the spatial features from RGB and depth sequences using CNNs and feeds the extracted features into an LSTM network that is bidirectional in structure. The LSTM streams learn the temporal patterns in both the RGB and depth sequences during training. The last layer of CLANet is a dense layer with SoftMax activation, the outputs of which are score fused to decide on the input class. We opted for RGB and depth sequences in this experimentation, and not skeletal action inputs, because of the difference in data representation between them. Skeletal data has a higher dimensionality than RGB and depth, which share a common representation.
To validate the proposed framework, we use our own BVRCAction3D RGB-D action dataset with 40 actions from 10 different actors with 10 repetitions per action. In addition, we evaluated the proposed framework on the benchmark RGB-D datasets NTU RGB-D, MSRDailyActivity3D and UTKinect to test the learning strategies of the proposed CNN-LSTM network.
The rest of the paper is organized as follows. Section II presents previous work on RGB-D multi-modal action recognition, with an insight into the gaps and the breakthroughs achieved. Section III discusses the methods applied in this study to achieve higher performance across datasets for multi-modal action recognition. Results are presented in Section IV, and conclusions drawn from the results are given in Section V.

II. BACKGROUND
Multi-modal or RGB-D based action recognition models have been studied extensively, which led to the development of the proposed CLANet. Previous methods have used data with RGB frames, depth and skeletal information to recognize human actions across multiple identification platforms, ranging from machine learning [1] to deep learning [13]. The machine learning models apply segmentation and feature extraction algorithms to RGB frames, depth frames or both to extract meaningful representations of actions [14]. Deep learning models, on the other hand, extract features and segments from the RGB-D data sequences through their training algorithms [9]. The most prominent of these deep learning models are grouped into spatial and temporal domains. In the spatial domain, the models extract features with respect to pixel location in image space using models such as Convolutional Neural Networks (CNNs) [2], [7]. For temporal or time series modelling of RGB-D data, Recurrent Neural Networks (RNNs) and their extensions such as Long Short-Term Memory (LSTM) networks are used [15], [16].
However, the spatial and temporal models each have advantages and disadvantages. The spatial models cannot effectively learn the time series information that is necessary to represent action sequences, which depend on continuous data variations. Conversely, time series modelling alone on video frames does not capture the spatial representations of action movements in image space. Hence, a hybrid combination involving both spatial and temporal models is found to be necessary to represent actions in video sequences for recognition [17], [18]. The early models applied optical flow to extract temporal features from RGB video frames, which were then fused with the spatial features during the training of CNNs. A few state-of-the-art models used multiple streams of independent CNNs with inputs from RGB and optical-flow-based RGB, giving satisfactory results [19], [20]. One CNN stream uses RGB spatial features and the other uses motion information, and the networks are trained simultaneously. All these networks include a feature fusion layer before or after the dense layers for decision making on the input action sequence. However, these models require additional computation time to produce motion vectors and suffer from data alignment problems, which makes them computationally inefficient. Moreover, a few works also tried four streams by adding motion information from depth sequences, producing better recognition accuracies than the previous two-stream model [7]. Similarly, the properties of the RGB and depth modalities have produced efficient action recognition algorithms such as depth rank pooling with CNNs [21], scene-flow-based RGB-D channels on CNNs [22] and sequence-based methods with RNNs [23]. However, the most successful models are those that combine the advantages of both spatial and temporal networks. These models are called spatio-temporal recurrent convolutional neural networks (rCNNs) [24].
These models operate in two stages: first, the primary network extracts the spatial features using CNNs; second, the secondary network encodes those spatial features into temporal data using recurrent models. The most frequently applied recurrent model is the Long Short-Term Memory (LSTM) network, which represents temporal information in action video sequences well because of its ability to handle long-term dependencies by avoiding the vanishing gradient problem [25]. It was also found that the choice of feature pooling model used with the LSTM can influence the temporal learning capabilities of hybrid CNN-LSTM architectures. Through a feature sharing mechanism between the two networks, these models were able to produce higher-level representations of actions in a video sequence [26]. Moreover, bidirectional LSTM based methods have been shown to handle video sequences of varying length better than RNNs. Therefore, the hybrid combination of CNNs and LSTMs is the most widely applied model for human action recognition because of its ability to decode spatial and temporal information simultaneously [27].
The literature is filled with CNN-LSTM models for action recognition that use skeletal actions as inputs [28], [29], [30]. These models use 3D skeletal joints as time series data along with RGB video frames for training and testing. However, depth data has rarely been used with these hybrid models [31]. In this work, we learn through a hybrid model that uses both RGB and depth data to draw inferences on the input action sequences. Both CNNs and LSTMs allow end-to-end trainable models, which eliminates the need for tracking variations through time series data. The advantages of using RGB inputs along with depth instead of skeletal data are threefold. First, depth features are more effective than skeletal data in assisting the spatial information in RGB data. Second, depth data is analogous to RGB data, which avoids the complex processing required to transform skeletal data into image data. Finally, skeletal data is at times noisy, with missing or overlapping joints, making it difficult to process.
In summary, in this work we describe a hybrid framework, called CLANet, that combines LSTMs with CNNs for action recognition to construct an end-to-end trainable architecture capable of handling visual action recognition and sequence prediction tasks.

III. METHODOLOGY
This section provides a detailed description of the proposed CLANet hybrid CNN-LSTM architecture for action recognition. First, we design a deep CNN model that extracts features from multiple RGB and depth frames to generate spatial features in the RGB and depth modalities, respectively. We then build an end-to-end pipeline architecture by combining multi-modal CNNs with bidirectional LSTMs, followed by a multiply score fusion to estimate the actions. The proposed architecture is shown in Fig. 1.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 12, 2020

A. The Spatial CNN Network
This subsection describes in detail the architecture for extracting spatial information from RGB and depth video frames. To accomplish this, we employed convolutional neural networks in multiple streams that take RGB and depth video frames as input. Based on the available GPU memory, we found that the maximum number of streams that can be applied in a batch is 16. Hence, the first hyperparameter selected was the batch size, which is set to 16, and each ensemble of CNNs is fed 16 RGB or depth frames. We name the two ensembles CRGBe and CDe. CRGBe and CDe are multi-stream ensembles of CNNs for RGB inputs and depth inputs, respectively. Fig. 2 shows the CNN architecture developed for extracting spatial features from RGB video frames. A similar network, CDe, processes the depth frames.
Given an RGB action video frame $V_{rgb}(v_r, v_g, v_b)$ with pixel position $(x, y)$, the outputs of the 2D convolutional kernels are feature maps. The $j$-th feature map from the $i$-th convolutional layer is extracted using Eq. (1), where $N \times M$ is the size of the video frame $V$ and $f$ is the activation function. $W^{nm}_{ijp}$ is the weight at position $(n, m)$ associated with the $p$-th feature map in the $(i-1)$-th layer of the CNN. The parameter $b_{ij}$ is the bias associated with each of the neurons. Eq. (1) describes the convolution between the video frames and the weight matrix, which is updated sequentially during training of the network. There are 16 streams in the CRGBe ensemble network to extract spatial features from 16 consecutive frames per action video. To maintain uniformity, we divided each action class video into 128 frames. That is, there are 8 batches of RGB video frames per action class video for training the CRGBe network. The depth network, CDe, has the same configuration as CRGBe and extracts the depth spatial features from the depth sequences of actions.
Fig. 2 shows the architecture of the RGB spatial feature extraction module with RGB input video frames. CRGBe is an ensemble of 16 streams, each with a depth of 10 layers. The 10 layers in each stream consist of 6 convolutional plus ReLU layers, 3 max pooling layers and a flatten layer. The filter kernel sizes are 7 × 7, 5 × 5 and 3 × 3. This kernel selection gives a hierarchical feature extraction model that ensures maximal spatial preservation of pixels towards the end of the network. Similar functionality is achieved on depth frames using the CDe network. The spatial maps from CRGBe and CDe are then passed through the LSTM module to model the temporal information in the features. The LSTM module is presented in the following subsection.
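As a rough illustration of the convolution described for Eq. (1), the NumPy sketch below computes one output feature map from the feature maps of the previous layer. The function name `conv_feature_map`, the "valid" boundary handling, the use of ReLU for f, and the reading of N × M as the extent of the weight window indexed by (n, m) are assumptions made for illustration, not details confirmed by the text. The last lines reproduce the frame arithmetic stated above.

```python
import numpy as np

def conv_feature_map(prev_maps, weights, bias, f=lambda z: np.maximum(z, 0.0)):
    """Compute one feature map of layer i from the P maps of layer i-1.

    prev_maps: (P, H, W) feature maps from layer i-1
    weights:   (P, N, M) weights W[p, n, m] for this output map
    bias:      scalar bias b_ij
    f:         activation function (ReLU assumed here)
    """
    P, H, W = prev_maps.shape
    _, N, M = weights.shape
    out = np.empty((H - N + 1, W - M + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # Sum over previous maps p and window offsets (n, m)
            window = prev_maps[:, x:x + N, y:y + M]
            out[x, y] = bias + np.sum(weights * window)
    return f(out)

# 128 frames per video, processed 16 consecutive frames at a time
frames_per_video, streams = 128, 16
batches_per_video = frames_per_video // streams
```

The explicit loops are slow but mirror the index structure of the expression; a practical implementation would rely on a deep learning framework's convolution layer instead.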
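To make the 10-layer stream more concrete, the sketch below traces tensor shapes through one possible arrangement of the layers. Only the layer counts (6 convolution plus ReLU, 3 max pooling, 1 flatten), the three kernel sizes and the 256 × 256 input size come from the paper. The grouping into three blocks of two convolutions, the "same" padding, the 2 × 2 pooling and the channel counts (32, 64, 128) are hypothetical choices made for illustration.

```python
def trace_stream_shapes(height=256, width=256, channels=3):
    """Trace tensor shapes through one hypothetical 10-layer CNN stream."""
    # (kernel size, output channels); channel counts are illustrative assumptions
    blocks = [(7, 32), (5, 64), (3, 128)]
    shapes = []
    h, w, c = height, width, channels
    for kernel, out_channels in blocks:
        for _ in range(2):              # two conv + ReLU layers per block (assumed)
            c = out_channels            # 'same' padding, stride 1 keep h and w
            shapes.append((f"conv{kernel}x{kernel}+relu", (h, w, c)))
        h, w = h // 2, w // 2           # 2x2 max pooling with stride 2 (assumed)
        shapes.append(("maxpool2x2", (h, w, c)))
    shapes.append(("flatten", (h * w * c,)))
    return shapes

for name, shape in trace_stream_shapes():
    print(name, shape)
```

Under these assumptions the flattened vector that each stream hands to the LSTM module has 32 × 32 × 128 elements; a different channel or pooling choice would change that size but not the overall structure.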

B. The LSTM Temporal Coding
The spatial features extracted by the two ensemble nets, CRGBe and CDe, are then temporally coded for recognition of actions at the highest level. LSTM blocks provide temporal dynamics for the extracted spatial features across both input modalities. Fig. 3 illustrates the single LSTM block used in this work. The following expressions are implemented during the operation of an LSTM block. It consists of an input unit $I_t$, a forget gate $F_t$, an output gate $O_t$, a momentum factor $G$ and the LSTM cell outputs $(C_t, h_t)$.
Here, $W$ and $b$ are the weights and biases, and $x_t$ are the feature inputs extracted by the spatial CNN network. $C_t$ and $h_t$ are the LSTM's cell state and cell output at time step $t$. The sigmoid $\sigma$ acts as a control gate for the transfer of inputs to the outputs. The forget gate controls the progress of inputs to the next LSTM block. Based on the state of the forget gate, the LSTM cell either forgets or memorizes the features in a sequence. However, the flow is unidirectional in a single-stream LSTM model. In general, in a sequence labelling problem such as video-based action recognition, we need access to both past and future inputs at a single time step of the training sequence. This has been achieved in the past using bidirectional LSTM networks, as shown in Fig. 1. It is performed by two LSTM streams, one moving past data forward and the other moving future data backward for a specific time step. This biLSTM network is also trained using the same backpropagation through time algorithm. In our work, we performed the backward and forward passes for each action sequence, and the hidden states of the LSTMs were reset after each action class. Our work uses the bidirectional LSTM architecture from [25]. The following subsection describes the complete multi-modal action recognition framework with the bidirectional LSTM network on top of the CNN networks.
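The gate operations described above can be sketched in NumPy as follows. This is a minimal sketch that assumes the standard LSTM formulation, with G treated as the tanh candidate update and all four gates packed into one weight matrix; the function names, the gate ordering and the zero initial states are illustrative assumptions rather than details taken from [25].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*H, D+H), b: (4*H,), gate order [I, F, O, G]."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i_t = sigmoid(z[0:H])            # input gate I_t
    f_t = sigmoid(z[H:2 * H])        # forget gate F_t
    o_t = sigmoid(z[2 * H:3 * H])    # output gate O_t
    g_t = np.tanh(z[3 * H:4 * H])    # candidate update G (assumed tanh)
    c_t = f_t * c_prev + i_t * g_t   # new cell state C_t
    h_t = o_t * np.tanh(c_t)         # new cell output h_t
    return h_t, c_t

def run_lstm(xs, W, b, hidden):
    """Run one direction over a (T, D) sequence, starting from zero states."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        outputs.append(h)
    return np.stack(outputs)

def bilstm(xs, params_fwd, params_bwd, hidden):
    """Concatenate forward and time-aligned backward outputs: (T, 2*hidden)."""
    fwd = run_lstm(xs, *params_fwd, hidden)
    bwd = run_lstm(xs[::-1], *params_bwd, hidden)[::-1]
    return np.concatenate([fwd, bwd], axis=1)
```

Each call to `bilstm` starts both directions from zero states, which loosely corresponds to resetting the hidden states between sequences; here `xs` stands for the per-frame CNN feature vectors of one 16-frame batch.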

C. The Hybrid CLANet Training
The hybrid CLANet is designed by stacking bidirectional LSTM cells on top of spatial CNNs to create an end-to-end trainable model. The CNNs are capable of extracting global and highly discriminating spatial features from the RGB and depth video frames. The LSTMs, on the other hand, capture the local and temporal representations in the extracted features. Finally, the output of the LSTM network is passed through dense layers and a SoftMax layer to compute the probability distribution over the class labels. In the proposed bidirectional LSTM, the hidden states from the forward pass and the backward pass are combined in the output dense layer. We used 2 dense layers of size 1024 each, along with a SoftMax, to compute the recognition scores. The validation losses are calculated after the first dense layer to update the weights and biases through backpropagation. The validation data is 15% of the total training data, and cross entropy loss is used for error calculation. The weights and biases are initialized randomly using a zero-mean Gaussian generator. Stochastic gradient descent is used to minimize the loss during training, with an initial learning rate of 0.001 across all datasets. The learning rate is readjusted whenever the loss becomes constant during training. The entire CLANet is end-to-end trainable.
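The SoftMax scoring, cross entropy loss and multiply score fusion mentioned in this section can be sketched as below. This is a minimal NumPy sketch: the function names and the example logits are hypothetical, and renormalizing the fused scores so that they sum to one is an assumption, since the text only states that the stream scores are fused by multiplication.

```python
import numpy as np

def softmax(logits):
    """Numerically stable SoftMax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(probs, label, eps=1e-12):
    """Cross entropy loss for one sample with an integer class label."""
    return -np.log(probs[label] + eps)

def multiply_score_fusion(p_rgb, p_depth):
    """Fuse two class-score vectors by element-wise product (renormalized)."""
    fused = p_rgb * p_depth
    return fused / np.sum(fused)

# Hypothetical dense-layer outputs of the RGB and depth streams for 3 classes
rgb_scores = softmax(np.array([2.0, 1.0, 0.1]))
depth_scores = softmax(np.array([1.5, 1.8, 0.2]))
fused = multiply_score_fusion(rgb_scores, depth_scores)
predicted_class = int(np.argmax(fused))
```

With product fusion a class must score well in both streams to win, so a confident stream can overrule a mildly disagreeing one, as happens in this example where the depth stream alone would have picked a different class.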

D. The CLANet Testing
The action datasets are divided into 65% training, 15% validation and 20% testing. The outputs of the network give a probability distribution across classes for a particular test input sample. We used multiple machines for training and testing at different frame rates to understand the characteristics of CLANet in processing multi-modal spatio-temporal data.
Finally, we perform multiple tests on our own BVRCAction3D dataset and on other benchmark datasets such as MSRDailyActivity3D, UTKinect and NTU RGB-D.
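The 65/15/20 division described above could be implemented along the following lines. This is a minimal sketch using only the Python standard library; the function name `split_dataset`, the fixed seed and the sample-level shuffling are assumptions. A cross-subject protocol would additionally need to group samples by subject before splitting, which is not shown here.

```python
import random

def split_dataset(samples, train=0.65, val=0.15, seed=0):
    """Shuffle samples and split them into train, validation and test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)       # reproducible shuffle
    n_train = int(round(train * len(items)))
    n_val = int(round(val * len(items)))
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]       # remaining samples, about 20%
    return train_set, val_set, test_set

# Example with 500 sample indices
train_set, val_set, test_set = split_dataset(range(500))
```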

IV. RESULTS AND ANALYSIS
This section presents the experimental results, with an analysis of the components that were instrumental in generating the results on the various datasets. We start by describing the datasets used for training, validation and testing. Next, we train and test the proposed CLANet across different actions in our dataset. We then apply the benchmark datasets to CLANet to compare its behaviour against our dataset. Finally, we compare CLANet with other state-of-the-art multi-stream CNN-LSTM models for cross-data action recognition.

A. Datasets and Performance Measures
NTU RGB-D [32] is the largest dataset, with 60 action classes in 80 views recorded with 40 subjects and a total sample size of 56880 videos of skeleton, depth and RGB data. We selected 60 action classes with 40 subjects for training and testing the proposed CLANet, so the NTU RGB-D subset used in our work has 2400 video samples. MSRDailyActivity3D [33] is another standard benchmark dataset captured with Microsoft Kinect, with 16 activity types. It consists of 320 video samples in both RGB and depth modes, with actions performed in both sitting and standing positions. The other widely used RGB-D action dataset for benchmarking is UTKinect [34], which has 10 actions from 10 subjects, each performing the action twice. It has 10 classes with 10 × 10 × 2 = 200 videos of both RGB and depth data. Inspired by the above benchmark datasets, we collected our own BVRCAction3D action dataset with 40 single-human and 10 two-human actions using 5 subjects. The complete list of actions is available in [7]. Fig. 4 presents some action sequences in RGB and depth from our BVRCAction3D dataset. The dataset consists of 50 × 5 × 2 = 500 video sequences with 50 classes from 5 subjects, each performing the action twice. We used Kinect 1.0 for capturing the actions. Each action was recorded for 60 seconds at 30 fps. Consequently, each action video has 1800 frames, with a resolution of 640 × 480 for RGB and 320 × 240 for depth. To maintain uniformity across datasets, we resized the frames to 256 × 256 in both the RGB and depth videos. Moreover, we found that the frames in each video clip have a high degree of similarity among themselves. To reduce this redundancy in the action videos, we selected 120 key frames per action by applying a correlation-based key frame extractor [35].
The performance of the proposed deep network is measured using two standard parameters: mean recognition accuracy (mRA) and mean F1 score (mF1). Apart from these two, we also obtained confusion matrices and receiver operating characteristic (ROC) plots across all datasets. In the following subsection, we apply the various datasets to the proposed CLANet and evaluate its performance.
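The idea of selecting key frames by correlation can be illustrated with the simple stand-in below. It is not the extractor of [35], whose details are not given here: this sketch keeps the first frame and then the frames that are least correlated with their immediate predecessor, on the assumption that such frames carry the most new information. The function names are hypothetical.

```python
import numpy as np

def frame_correlation(a, b, eps=1e-8):
    """Pearson correlation between two frames, flattened to vectors."""
    a = a.ravel().astype(np.float64)   # astype returns a copy, safe to modify
    b = b.ravel().astype(np.float64)
    a -= a.mean()
    b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def select_key_frames(frames, k):
    """Return sorted indices of k key frames from a (T, H, W[, C]) array."""
    if len(frames) <= k:
        return list(range(len(frames)))
    # Correlation of each frame t >= 1 with the frame before it
    corr = np.array([frame_correlation(frames[t - 1], frames[t])
                     for t in range(1, len(frames))])
    # Keep frame 0 plus the k-1 frames least similar to their predecessor
    novel = np.argsort(corr)[:k - 1] + 1
    return sorted([0] + novel.tolist())
```

For a 1800-frame recording, `select_key_frames(frames, 120)` would return 120 frame indices in temporal order, so the reduced clip can be fed to the network without reordering.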
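Both measures can be computed from a confusion matrix, as sketched below. The sketch assumes that mean recognition accuracy is the average of the per-class recognition rates and that the mean F1 score is the unweighted (macro) average of per-class F1 scores; these interpretations and the function names are assumptions, since the paper does not define the two measures explicitly.

```python
import numpy as np

def mean_recognition_accuracy(cm):
    """Average per-class accuracy; rows of cm are true classes."""
    cm = np.asarray(cm, dtype=float)
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1.0)
    return float(per_class.mean())

def mean_f1(cm):
    """Macro-averaged F1 score; columns of cm are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1.0)
    recall = tp / np.maximum(cm.sum(axis=1), 1.0)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return float(f1.mean())

# Hypothetical 2-class confusion matrix
cm = [[8, 2],
      [1, 9]]
mra, mf1 = mean_recognition_accuracy(cm), mean_f1(cm)
```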

B. CLANet Performance
The proposed multi-modal CLANet is trained with RGB-D action sequences from our BVRCAction3D dataset and the other benchmark datasets. The training parameters were kept constant across datasets to understand the implications of the data on the network. The hyperparameters of the network were selected as discussed in the previous section. Fig. 5 shows the confusion matrices on the datasets used in this work. The performance of CLANet on our dataset is high compared to the other datasets because of the less noisy backgrounds in BVRCAction3D, as shown in Fig. 4. The scores from CLANet are better than those of our previous work in [7], where we used a multi-stream CNN with motion information. The higher accuracies are due to the LSTM network, which models the time series information more accurately. The testing in this case is performed with 10 test samples only. We then tested the trained CLANet with the entire testing set from each dataset and report the results in Table I. Table I gives the two performance parameters, mRA and mF1, for the proposed CLANet across the datasets used in this work. The testing is conducted in cross-subject mode, meaning that the network is shown samples from subjects that were not seen during training. The average recognition rate achieved is around 93.32% on our BVRCAction3D dataset, which is better than our previous work in [7].
The above comparison with our work in [7] is important for understanding the need for time series modelling as opposed to motion modelling using optical flow. Optical-flow-based motion estimation followed by a regular spatial network has limitations in characterizing the changes across multiple frames. Additionally, flow-based models fail to capture the long-term dependencies in action video sequences. Hybrid CNN-LSTM networks, in contrast, have performed exceptionally well by modelling the spatio-temporal content of the action video sequences. The depth data assists this process by increasing the performance of the network. However, it is not always possible to generate depth data in real time, and hence we conducted an RGB-only test on the proposed CLANet to understand its usability in a real-time application. We supplied zero matrices in place of depth data to the depth stream during testing. This test resulted in a mean accuracy of 84.76% and a mean F1 score of 0.862 for the BVRCAction3D dataset. The right half of Table I shows the results on all datasets. In spite of the absence of depth data during testing, the proposed CLANet performed better on our BVRCAction3D dataset than on the other benchmark datasets. The performance of the network must also be gauged against state-of-the-art networks, as presented in the next subsection.

C. Comparison with Recurrent Hybrid Networks
This subsection compares hybrid CNN-LSTM networks that use RGB and depth inputs as training data. Surprisingly, very few works have used both RGB and depth data with hybrid networks for action recognition. However, there is a large body of networks for skeleton-based action recognition using CNN-LSTM architectures. Table II compares the proposed CLANet with previously proposed methods for action recognition on the benchmark datasets. We implemented all these networks on the datasets, with all hyperparameters adopted from the proposed CLANet, and the mean recognition accuracy is calculated across the training data. The results show that the proposed network outperforms the existing models. This is because the network effectively learns spatio-temporal characteristics in two modalities simultaneously. It is also of interest to check the network performance against different action recognition models. Hence, in the next subsection, we compare our method with other state-of-the-art RGB-D based action recognition models.

D. Comparison with RGB D Action Recognition Deep Models
The parameters used for training and testing were as described in Section III. In all experiments, the video resolution was fixed at 256 × 256 × 3 for training and testing, for both RGB and depth data, across all subjects. The aim of this section is to investigate the suitability of RGB and depth information for action classification with deep learning networks. To that end, we compare the mRA of multiple architectures on three multi-modal action datasets. Table III presents the results of our investigation. The networks were taken from previous methods and were trained from scratch on the datasets used in this work. All the networks are trained and tested only once. Table III yields two insights regarding the performance of the action recognition models: one concerns the input data and the other concerns the deep networks. We see that RGB-based methods performed poorly compared to the other two modalities, depth and skeletal. This is because of the background noise in RGB video frames, which is difficult to learn during the training of spatial networks. In contrast, this background noise is relatively low in depth frames and completely absent in skeletal data. Hence, skeletal action recognition is the popular choice for producing higher accuracies with deep networks. Despite this success, skeletal action data becomes noisy when joints overlap during the action sequence, producing ambiguous results.
Skeletal action data is represented as time series data, which is well characterized and discriminated using RNNs and LSTMs together. These networks have produced the highest recognition accuracies across all datasets. Modelling RGB and depth as time series data, by extracting features and feeding those features to recurrent networks, has also been shown to improve performance. Nevertheless, the most common choice of combination is skeletal data with either depth or RGB, and fusion with skeletal data has improved the discriminating confidence of the networks. The most suitable network architecture is the hybrid CNN-LSTM, which can extract the spatial and temporal dynamics of the action data. In contrast to the common practice, we applied RGB and depth modalities to a CNN-LSTM architecture to generate a highly discriminating feature vector for action recognition. Table III shows that the proposed method is on par with the existing state-of-the-art models and in fact better than some of them. All the models are tested with cross-subject data. Finally, the last subsection evaluates the networks for cross-data validation.

E. Cross Data Validation
This section presents the experimental evaluation of CLANet across datasets. We identified the actions common to the datasets and evaluated the performance of CLANet with training and testing data drawn from different datasets. Specifically, we trained CLANet with our BVRCAction3D dataset and tested it with the same actions from another dataset. We used seven common actions across datasets. The results of this experiment are presented in Table IV as the mean recognition accuracy across the seven actions used for training and testing. Here, the network has to be fine-tuned multiple times, and the recognition rates obtained are averaged across multiple runs of the algorithm. Table IV shows that the proposed network is capable of cross-data action recognition. Interestingly, we found that training with less noisy data results in better recognition accuracies than training with noisy data. The average recognition rate was around 65% across datasets for the proposed CLANet with RGB and depth input data.
Despite the better performance of the hybrid CNN-LSTM architecture across RGB-D action datasets for recognition tasks, many challenges such as view invariance, cross-data recognition and occlusions still need attention. We found that it is difficult to achieve a high degree of robustness for some complex actions with the existing deep learning frameworks. Moreover, deep networks are data-intensive models and require a wide variety of data to provide actionable intelligence across action recognition platforms. Finally, more hybrid models with multiple levels of abstraction are required for designing deployable action recognition models.
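The cross-data protocol above requires aligning the labels of the actions that two datasets share. A minimal sketch of that bookkeeping follows, using only the Python standard library. The function names, the matching of classes by lower-cased name and the `predict` callable (standing in for a trained network) are assumptions made for illustration.

```python
def common_action_map(source_classes, target_classes):
    """Map target label index -> source label index for shared action names."""
    source_index = {name.lower(): i for i, name in enumerate(source_classes)}
    return {j: source_index[name.lower()]
            for j, name in enumerate(target_classes)
            if name.lower() in source_index}

def cross_data_accuracy(predict, target_samples, label_map):
    """Accuracy on target samples whose action also exists in the source set.

    predict:        callable returning a source-dataset label for a sample
    target_samples: iterable of (sample, target_label) pairs
    label_map:      output of common_action_map
    """
    shared = [(x, label_map[y]) for x, y in target_samples if y in label_map]
    if not shared:
        return 0.0
    correct = sum(1 for x, y in shared if predict(x) == y)
    return correct / len(shared)
```

Samples from actions that exist in only one dataset are dropped before scoring, so the reported accuracy covers the shared actions only, in line with the seven-action setup described above.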

V. CONCLUSION
In this paper, we have proposed a novel approach for recognizing actions from RGB-D data. Specifically, our method involves training a hybrid CNN-LSTM multi-stream network on multi-modal data, namely RGB and depth videos. The CNN network is designed to extract spatial features from both RGB and depth action frames. A bidirectional LSTM network is then used to model the sequential information in the multi-modal features extracted at the output of the CNN. The hybrid CLANet is trained and tested using our BVRCAction3D dataset and other benchmark datasets. The results show that the proposed network achieves an average recognition rate of around 93.32% on our dataset and an average of 90.24% across all benchmark datasets.