SSEC: Semantic Segmentation and Ensemble Classification Framework for Static Hand Gesture Recognition using RGB-D Data

Hand Gesture Recognition (HGR) refers to identifying the various hand postures used in Sign Language Recognition (SLR) and Human Computer Interaction (HCI) applications. A complex background under uncontrolled environmental conditions is the major challenge affecting the recognition accuracy of an HGR system. This can be effectively addressed by discarding the background with a suitable semantic segmentation method, which assigns hand region pixels to the foreground and the remaining pixels to the background. In this paper, we analyze and evaluate well-known semantic segmentation architectures for hand region segmentation using both RGB and depth data. Further, an ensemble of the segmented RGB and depth streams is used for hand gesture classification through probability score fusion. Experimental results show that the proposed novel framework of Semantic Segmentation and Ensemble Classification (SSEC) is suitable for static hand gesture recognition and achieves an F1-score of 88.91% on the OUHANDS test dataset.


I. INTRODUCTION
Hand gestures play a significant role in many real-time applications such as robotics control, gaming, 3D modeling and virtual environments. Various methods are used to detect and recognize hand gestures, depending on the data acquisition and processing system [1] [2]. Hand gesture recognition systems can be broadly classified into sensor-based and vision-based systems.
In sensor-based systems, data is captured using sensor modules attached to the hand, which convert the hand movements into time-varying signals. These devices capture hand movement data very precisely but are not easy to use, as they require the user to wear the device and do not support contactless operation [3] [4].
In vision-based systems, RGB cameras are widely used to capture the hand pose as color images. Hand detection or segmentation is one of the important pre-processing steps in the gesture recognition pipeline; it localizes the hand region and discards the background in the image [5]. Feature extraction methods are then applied to the segmented hand data to obtain a characteristic representation, which is used in the classification stage to recognize the various hand gestures. Current state-of-the-art recognition systems based on color image data face many challenges in hand segmentation and recognition due to complex backgrounds and varying illumination conditions. Gestures performed by subjects who differ in hand size and color are difficult to identify because of large intra-class and inter-class variations. These issues are difficult to address using color data alone. Hence recent systems are built on dual-modality RGB and depth data, known as RGB-D data. In these systems, data is captured simultaneously using color and depth sensors to obtain an aligned RGB-D pair registered to the same view of the camera coordinate system [6].
A depth map can be obtained using various techniques such as stereo vision or infrared (IR) time-of-flight, and provides the distance between the depth sensor and the objects in the scene [7]. The Kinect sensor uses the time-of-flight between the IR light emitted by its projector and the reflected light to provide a raw depth map in which each pixel value represents the distance in millimeters. The depth modality can be used to discard far-away background based on a depth distance range. This supports effective hand segmentation, grouping hand region pixels into the foreground and the remaining image pixels into the background [8]. In low-light scenarios, where RGB sensors fail to capture usable data, the Kinect depth sensor can still be used because it captures data using IR light.
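As an illustration of discarding far-away background by depth range, a minimal sketch is given below. The function name, the near and far limits in millimeters and the toy depth map are assumptions made for illustration; they are not values used in this paper.

```python
import numpy as np

def depth_range_mask(depth_mm, near_mm=400, far_mm=900):
    """Return a boolean mask that is True where depth lies in [near_mm, far_mm].

    near_mm and far_mm are hypothetical limits. Pixels with value 0 fall
    outside the range and are therefore treated as background.
    """
    depth_mm = np.asarray(depth_mm)
    return (depth_mm >= near_mm) & (depth_mm <= far_mm)

# Toy 3x3 raw depth map in millimeters (hypothetical values).
depth = np.array([[0,    650,  700],
                  [2500, 600,  720],
                  [2600, 2700, 0]])

mask = depth_range_mask(depth)
foreground_depth = np.where(mask, depth, 0)  # background pixels set to 0
```

A fixed range of this kind only works when the hand is the closest object to the sensor, which is one reason to prefer a learned segmentation model in cluttered scenes.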
Various image processing and computer vision algorithms for hand region segmentation are discussed in the literature. According to recent studies, CNN architectures are widely used to address various real-time problems, and segmentation is one of the most studied areas. CNN-based semantic segmentation networks perform pixel-level localization of the region of interest by classifying each pixel into its corresponding segmentation class. This provides a fine boundary for each distinct region mapped to a unique segmentation class.
In this paper, we analyze various segmentation methods on RGB and depth data and provide a comparative analysis to identify the most suitable method for hand segmentation.
The paper is organized as follows. Section II presents a brief review of existing methods in vision-based hand gesture recognition. Section III discusses the problem statement and the proposed method. Section IV presents detailed experimental results. The experimental outcomes are briefly discussed in Section V. Finally, the conclusion is given in Section VI.

II. LITERATURE REVIEW
In this section, we discuss state-of-the-art methods for hand region segmentation and gesture classification along with their advantages and limitations.
Earlier approaches to hand segmentation in color images were based on color intensity thresholding in RGB, HSV, YCbCr and other color spaces [9]. In these methods, a suitable color range for segmenting the skin region was identified through experimentation. The limitation of this approach is the difficulty of selecting a threshold range that covers all variations of skin color and lighting, which significantly affects the segmentation accuracy.
Hand region segmentation based on human skin tones was proposed in [10], using an MLP network to learn skin color tones and classify the image pixels that belong to the skin color sets. A user-independent recognition system using the low-cost Microsoft Kinect depth sensor was proposed in [11] to overcome illumination and background variation issues in color-based sign language recognition. Here the hand region was segmented by applying a pre-processing algorithm to the depth image. Features were extracted from the hand-segmented data using the CNN-based unsupervised Principal Component Analysis Network (PCANet) and classified using a Support Vector Machine (SVM) classifier.
A real-time hand gesture recognition method was put forth in [12], using a light-weight semantic segmentation method (FASSD-Net) to produce hand segmentation masks. These masks are combined with RGB frames for gesture classification using Temporal Segment Networks (TSN) and Temporal Shift Modules (TSM), and the method was tested on the IPN Hand dataset.
Various interactive methods such as graph cut, random walker and geodesic star convexity were analyzed in [13] for hand region segmentation. Five distinct types of hand motions against various backgrounds were tested. The Expectation-Maximization technique was used to learn the parameters of the Gaussian Mixture Model, and the image was segmented with a Gibbs random field by minimizing the Gibbs energy using the min-cut theorem. According to the experimental findings, using manually segmented images improves recognition accuracy compared to unsegmented images.
Bin et al. [14] proposed a fine-tuned Inception V3 RGB-D static gesture recognition method. This framework eliminates the gesture segmentation and feature extraction steps of traditional algorithms. It consists of a CNN architecture in which a feature concatenation layer joins the features of the RGB and depth images. Compared with a general CNN, the Inception V3 based gesture recognition resulted in improved accuracy. D. Kumar et al. [15] proposed a two-stage approach for static hand gesture recognition using RGB-D data. In the first stage, the k-means clustering algorithm is applied to the depth image to cluster the foreground and background depth pixels based on distance. The depth threshold is computed as the mean of the cluster centers, and the background is discarded using this dynamic threshold. In the classification stage, the segmented RGB-D data is stacked to form the input to the data layer of a custom CNN network.
A coarse-to-fine segmentation approach using the depth map was proposed in [8], where a pre-trained YOLO-v3 model was used to detect and localize the hand region at the coarse level. The detected hand bounding region was used to initialize the foreground in the graph cut segmentation algorithm, which refines the hand region boundary and discards the background. The hand-segmented RGB-D data was then used in the classification stage to recognize the hand gestures.
The hand region in the depth map was segmented using a depth thresholding approach in [16]. Additionally, a two-stream network with AlexNet and VGG16 was employed with a score-level fusion technique to recognize static hand gestures from the Massey University (MU) and HUST American Sign Language (HUST-ASL) datasets, with accuracies of 98.14% and 64.55%, respectively.
A two-stream Hand Gesture Recognition Approach (HGRA) on RGB data was proposed in [17]. In the first branch, U-Net combined with a Multi-Scale Attention module is used to segment the hand region and extract shape features. In the second branch, Multi-Scale Fusion (MSF) and Light-Weight Multi-Scale (LWMS) modules are used to extract multi-scale appearance and color features. This method was evaluated on the OUHANDS and HGR1 datasets and achieved accuracies of 90.9% and 83.8%, respectively.
A three-stage spatial attention-based neural network was proposed in [18]. The first two stages generate a feature vector and an attention map using the feature extraction architecture and a self-attention technique. The final feature is obtained by multiplying the features with the attention map and is fed to the classification module in the third stage to predict the hand gesture label. This model achieved 99.75%, 99.46% and 99.67% accuracy on the Kinematic, NTU and senz3D datasets, respectively.
A dual-stream dense residual fusion network (DeReFNet) was proposed in [19], which utilizes the strength of global features and spatial information from the residual stream and the other stream. Both streams are fused using a feature concatenation module. A subject-independent cross-validation technique is used to validate DeReFNet on four publicly available benchmark datasets.
A Kinect sensor device is used to capture hand gesture depth images in [20], where serial binary image extraction eliminates the undesired shadow region in the depth image and improves the recognition accuracy of a VGG-type CNN. The emergence of Industry 4.0 and the need for natural human-robot interaction in manufacturing, using vision-based and wearable-based approaches for gesture-based interaction, are discussed in [21]. Position data from Microsoft Kinect RGB-D cameras and acceleration data from inertial measurement units (IMUs) are compared to evaluate the recognition accuracy.
Based on this brief literature review, it can be observed that most recent research in hand gesture recognition uses RGB-D data. Early methods of hand region segmentation used skin-color-based segmentation in different color spaces; later, CNN-based semantic segmentation methods gained much attention due to their efficiency and robustness even in complex scenes. The depth modality can be used in both segmentation and classification, hence active research is being carried out on state-of-the-art methods to evaluate various ensembling and fusion techniques for the RGB and depth modalities [22]. In the next section, we discuss the details of the proposed method and the experimental analysis.

III. PROPOSED METHOD
Based on the literature review, it is evident that hand gesture recognition is still an active area of research that tries to solve the challenges of gesture recognition in real scenarios with complex background scenes and varying lighting conditions. Current research has also shown that multi-modal RGB and depth stream data is more effective than uni-modal RGB data for hand gesture recognition.
In this paper, we analyze various semantic segmentation methods to segment the hand region effectively using RGB and depth stream data. The hand-segmented RGB and depth data are then used to train a custom CNN model for gesture classification. Suitable approaches for fusing the probability scores from both models are analyzed, and an ensemble classification framework for static hand gesture recognition is proposed.
The main contributions of this paper are: 1) An analysis of semantic segmentation model accuracy using RGB data, depth data and combined RGB-D data. 2) An ensemble approach of score fusion for static HGR classification on RGB and depth data.

A. Semantic Segmentation
Semantic segmentation is pixel-based classification in which each pixel of an image is assigned to its corresponding class. Because the class labels of all pixels of the image are predicted, segmentation is also termed dense prediction. Hand region segmentation is a binary case of segmentation with two output classes; it provides a pixel-level mapping into the required foreground and background regions, as in Fig. 1. In this work, we evaluate the CNN architectures UNet, ResUNet and DeepLabV3+ for semantic segmentation of the hand region on both RGB and depth data.
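The binary dense prediction described above can be illustrated with a small sketch that turns a per-pixel foreground probability map into a hand/background mask. The function name, the 0.5 threshold and the toy probabilities are assumptions for illustration only.

```python
import numpy as np

def probs_to_mask(prob_map, threshold=0.5):
    """Label each pixel as hand (1) or background (0).

    prob_map holds one foreground probability per pixel; the threshold of
    0.5 is an assumed default, not a value taken from the paper.
    """
    return (np.asarray(prob_map) >= threshold).astype(np.uint8)

# Toy 2x2 probability map (hypothetical network output).
prob_map = np.array([[0.1, 0.8],
                     [0.6, 0.3]])
mask = probs_to_mask(prob_map)
```

The mask has the same height and width as the input, which is what makes the prediction dense: every pixel receives exactly one class label.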

B. UNet
The U-Net architecture [23] adopts an auto-encoder framework consisting of two components, an encoder and a decoder, as in Fig. 1. The encoder generates a compressed feature representation of the image using down-sampling and strided convolution; these features contribute to classifying each pixel into its corresponding segmentation class. The encoder and decoder layers are symmetrical to each other. The decoder includes up-sampling and transposed convolution and generates an output segmentation map with the same resolution as the input image. The least-squares reconstruction error is back-propagated from the decoder to the encoder, and the weights are updated to obtain an optimal feature representation.
The encoder generally has the following sub-layers: a convolution layer, a ReLU activation layer and a pooling layer. The input image is fed into the data layer, followed by a convolution layer with filters of size 3x3 and a ReLU activation that adds non-linearity. In subsequent layers the number of kernels is doubled while the kernel size stays constant, and a max pooling layer down-samples the feature map while retaining the locally dominant features of each image patch. We have modified the original architecture by removing the last block of convolution layers, using only three blocks of two convolution layers each, to aid model convergence.
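To make the three-block encoder concrete, the sketch below traces the feature-map shape after each block for a 320 x 320 input. It assumes 'same' padding in the convolutions, 2x2 max pooling and 16 filters in the first block; none of these three values is stated in the text, so they are illustrative only.

```python
def encoder_shapes(height, width, base_filters=16, num_blocks=3):
    """Trace (height, width, channels) after each encoder block.

    Assumptions: each block is two 3x3 convolutions that keep the spatial
    size, followed by 2x2 max pooling that halves it; the number of filters
    doubles from one block to the next. base_filters is hypothetical.
    """
    shapes = []
    filters = base_filters
    for _ in range(num_blocks):
        height, width = height // 2, width // 2  # effect of 2x2 max pooling
        shapes.append((height, width, filters))
        filters *= 2                             # doubled for the next block
    return shapes

shapes = encoder_shapes(320, 320)
```

Under these assumptions the encoder output is an eighth of the input resolution in each spatial dimension, which the decoder then has to recover.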
The decoder reconstructs the input image from the reduced representation produced by the encoder. Encoded input images are decoded by a series of up-sampling and de-convolution blocks. The up-sampling operations of the decoder layers use the max-pooling indices of the corresponding encoder layers. The decoder architecture follows the pattern of its encoder design, the decoder being a mirror replica of the encoder. The decoded image is evaluated against the input image while the feature representation is learned.

C. ResUNet
Residual U-Net [24] is a semantic segmentation network in which residual blocks are used in the encoder and decoder blocks of the U-Net architecture. Residual learning helps to improve on the U-Net results with fewer parameters. Fig. 3 shows the basic unit blocks of U-Net (a) and ResUNet (b). Each residual unit can be expressed mathematically as in Eq. 1.

$$ y_l = h(X_l) + F(X_l, W_l), \qquad X_{l+1} = f(y_l) \tag{1} $$

where $X_l$ and $X_{l+1}$ are the input and output of the $l$-th residual unit, $F$ is the residual function with weights $W_l$, $f(y_l)$ is the activation function and $h(X_l)$ is an identity mapping function, $h(X_l) = X_l$.

ResUNet comprises three parts built with residual units: encoding, bridge and decoding. The first part encodes the input image into a latent feature representation. The second part forms a bridge connecting the encoding and decoding paths. The third, decoding part assigns a semantic label to each pixel by pixel-wise classification. Each residual unit consists of an identity mapping and two 3 × 3 convolution blocks, each with a BN layer and a ReLU activation layer. The identity mapping connects the input and output of the unit. The decoding path consists of three residual units, each concatenated with the feature maps from the corresponding encoding path. After the last level of the decoding path, a 1 × 1 convolution and a sigmoid activation are applied to obtain the desired segmentation.
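The residual unit described above can be sketched numerically as follows. This is only an illustration of the identity shortcut: the residual branch here is a hypothetical element-wise function standing in for the two 3 × 3 convolution blocks, and all names are assumptions.

```python
import numpy as np

def relu(x):
    """ReLU activation, used here as the function f."""
    return np.maximum(x, 0.0)

def residual_unit(x, residual_fn, activation=relu):
    """Compute f(h(x) + F(x)) with the identity mapping h(x) = x."""
    y = x + residual_fn(x)  # identity shortcut plus residual branch
    return activation(y)

def toy_branch(x):
    """Hypothetical residual branch F; a real unit would use convolutions."""
    return -0.5 * x + 1.0

x = np.array([-4.0, 0.0, 2.0])
out = residual_unit(x, toy_branch)
```

Because the shortcut passes x through unchanged, the branch only has to learn the difference between input and output, which is the usual motivation for residual learning.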

D. DeepLab-V3+
DeepLabv3+ [25] is an extended version of the DeepLabv3 segmentation architecture. It follows an encoder-decoder structure with an Atrous Spatial Pyramid Pooling (ASPP) module in the encoder block. The encoder module thus processes multi-scale contextual information by applying dilated convolution at multiple rates and multiple effective fields-of-view. The decoder module, with depthwise separable convolution, refines the segmentation results along object boundaries by gradually recovering the spatial information.
Fig. 2 shows the block diagram of the segmentation architecture analysis. The UNet and ResUNet models are trained on RGB, depth and stacked RGB-D data; DeepLabv3+ is trained on RGB and depth data. All models are trained separately, and the results are analyzed using the mean IoU (Intersection over Union) and average F1-score metrics. DeepLabv3+ trained on depth data gave better accuracy with comparatively fewer parameters, and is therefore selected as the best model for hand region segmentation. The binary segmentation mask predicted by this model is combined with the RGB and depth data to obtain segmented RGB and segmented depth data, which are used as input to the classification model.
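To illustrate how the dilated convolutions of the ASPP module described above widen the field of view without adding weights, the sketch below computes the span covered by a 3x3 kernel at several dilation rates. The rates 6, 12 and 18 are the ones commonly quoted for DeepLabv3 and are an assumption here, since the text does not list the rates used.

```python
def effective_kernel_size(kernel_size, rate):
    """Span along one axis covered by a dilated (atrous) kernel.

    A kernel with k taps and dilation rate r has (r - 1) gaps between
    neighbouring taps, giving a span of k + (k - 1) * (r - 1).
    """
    return kernel_size + (kernel_size - 1) * (rate - 1)

# Assumed ASPP dilation rates; rate 1 is an ordinary convolution.
rates = [1, 6, 12, 18]
spans = {rate: effective_kernel_size(3, rate) for rate in rates}
```

Each branch still has only nine weights per channel pair, yet the branches see context at very different scales, which is the multi-scale behaviour the ASPP module relies on.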

E. Ensemble Classification
The classification block diagram is depicted in Fig. 4. Two classification models, one on the segmented RGB stream and one on the segmented depth stream, are trained independently using the custom CNN network shown in Table I. The classification probabilities of the segmented RGB and depth models are then analyzed and fused using the max and average operators to select the best classification model. The network in Table I consists of four groups, each with two convolution layers (CONV2D) with ReLU activation, batch normalization and max pooling. The number of kernels increases as [16, 32, 64, 128] across the groups. A global average pooling layer produces the final feature map, followed by two fully connected layers and a Softmax activation that outputs the probability of each class.

Table I. Custom CNN-Net architecture.
The model is trained using the Adam optimizer with a learning rate of 0.001 and categorical cross-entropy as the loss function. Batch normalization and dropout layers are used to prevent the model from over-fitting. RGB and depth data from the OUHANDS dataset are resized to 320 x 320 and segmented using the binary mask obtained from the DeepLabv3+ depth segmentation model. Two classification models are trained on the segmented RGB and depth data; these models are evaluated on the test data and the misclassified images are analyzed. It is observed that some images wrongly classified by the RGB stream are classified correctly by the depth stream and vice versa. This forms the basis for a score fusion ensemble model that combines the RGB and depth streams and gives better accuracy than the uni-modal results.
$$ P_E(\mathrm{max}) = \big(\max(p_{r0}, p_{d0}),\ \max(p_{r1}, p_{d1}),\ \max(p_{r2}, p_{d2}),\ \ldots,\ \max(p_{r(n-1)}, p_{d(n-1)})\big) \tag{2} $$

$$ P_E(\mathrm{avg}) = 0.5 * \big((p_{r0} + p_{d0}),\ (p_{r1} + p_{d1}),\ (p_{r2} + p_{d2}),\ \ldots,\ (p_{r(n-1)} + p_{d(n-1)})\big) \tag{3} $$

Probability score fusion by the max and average methods is shown mathematically in Eq. 2 and Eq. 3, where $p_{ri}$ and $p_{di}$ are the probabilities of class $i$ from the RGB and depth models and $n$ is the number of classes. In max fusion, the maximum of the RGB and depth probabilities is taken for each class, whereas in average fusion their mean is taken. The class with the highest fused probability is then chosen as the label predicted by the ensemble classification model. The experiments show that average fusion gives better results than max fusion.
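The two fusion rules can be sketched in a few lines of NumPy. The function name and the three-class toy probabilities are assumptions for illustration; the example is chosen so that the RGB stream alone would pick class 0 while the fused scores pick class 1.

```python
import numpy as np

def fuse_scores(p_rgb, p_depth, method="avg"):
    """Fuse per-class probabilities of two streams and return (scores, label).

    method="max" takes the class-wise maximum, method="avg" the class-wise
    mean; the predicted label is the argmax of the fused scores.
    """
    p_rgb = np.asarray(p_rgb, dtype=float)
    p_depth = np.asarray(p_depth, dtype=float)
    if method == "max":
        fused = np.maximum(p_rgb, p_depth)
    elif method == "avg":
        fused = 0.5 * (p_rgb + p_depth)
    else:
        raise ValueError("method must be 'max' or 'avg'")
    return fused, int(np.argmax(fused))

# Hypothetical softmax outputs for one image and three classes.
p_rgb = [0.50, 0.45, 0.05]
p_depth = [0.10, 0.60, 0.30]

fused_max, label_max = fuse_scores(p_rgb, p_depth, "max")
fused_avg, label_avg = fuse_scores(p_rgb, p_depth, "avg")
```

Note that max-fused scores no longer sum to one, whereas averaged scores do; this does not affect the argmax decision but matters if the fused scores are interpreted as probabilities.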

F. Evaluation Metrics
The principal measures most commonly used to evaluate semantic segmentation and classification performance are briefly explained below.
Intersection over Union (IoU): It is computed as the intersection of the pixels predicted for a given class with the ground truth pixels of that class, divided by their union. IoU is computed class-wise in multi-class segmentation. Our work is a binary case of foreground hand region and background, hence only the class of pixels belonging to the foreground is considered.

$$ IoU_j = \frac{c_{jj}}{c_{jj} + c_{ij} + c_{ji}} = \frac{T_p}{T_p + F_p + F_n} \tag{4} $$

where $c_{jj} = T_p$ is the number of pixels labeled as class $j$ in the ground truth and also predicted as class $j$, and $c_{ij} = F_p$ is the number of pixels labeled as class $i$ but classified as class $j$, that is, the False Positives for class $j$. Similarly, $c_{ji} = F_n$, the number of pixels labeled as class $j$ but classified as class $i$, are the False Negatives (misses) for class $j$.

Mean Intersection over Union (mIoU): mIoU is the class-averaged IoU across all the images, where $k$ is the number of classes.

$$ mIoU = \frac{1}{k} \sum_{j=1}^{k} IoU_j \tag{5} $$

Precision: It is the ratio of hits to the sum of hits and false alarms. It indicates the number of positive cases predicted correctly over all predicted positive cases.

$$ Precision = \frac{T_p}{T_p + F_p} \tag{6} $$

Recall: It is the ratio of hits to the sum of hits and misses. It indicates the number of positive cases predicted correctly over all actual positive cases.

$$ Recall = \frac{T_p}{T_p + F_n} \tag{7} $$

F1-score: This measure, also known as the Dice coefficient, is computed as the harmonic mean of precision and recall.

$$ F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{8} $$
In this work, precision, recall and F1-score are computed for both the segmentation and classification models. In segmentation the predicted pixel class is considered, whereas in classification the predicted image class label is considered. Additionally, IoU metrics are used to evaluate the segmentation models.
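For the binary foreground case, these measures can be computed from a predicted mask and a ground truth mask as in the following sketch. The function name, the zero-division convention (returning 0.0) and the toy masks are assumptions for illustration.

```python
import numpy as np

def binary_metrics(pred, truth):
    """Return IoU, precision, recall and F1 for the foreground class."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = int(np.sum(pred & truth))    # hits
    fp = int(np.sum(pred & ~truth))   # false alarms
    fn = int(np.sum(~pred & truth))   # misses

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Toy 2x3 masks: two hits, one false alarm, one miss.
truth = np.array([[1, 1, 0],
                  [0, 1, 0]])
pred = np.array([[1, 0, 0],
                 [1, 1, 0]])
m = binary_metrics(pred, truth)
```

In this toy case the IoU (0.5) is lower than the F1-score (about 0.667), which reflects the general relation that F1 is never smaller than IoU for the same prediction.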

IV. EXPERIMENTAL RESULTS
To appraise the proposed semantic segmentation and ensemble classification framework, experiments are conducted on the widely used OUHANDS [26] benchmark dataset for static hand gesture recognition. The experiments are performed using the TensorFlow 2.0 Keras deep learning library in the Google Colab environment with an NVIDIA GPU.
The OUHANDS dataset [26] contains RGB images, raw depth data and segmentation ground truth images. It consists of 10 unique gestures captured from 23 subjects with different hand sizes and shapes, under complex backgrounds and lighting changes. The training dataset consists of 2000 images, split into 1600 for training and 400 for validation. The test dataset contains 1000 images of individuals not seen in the training set. All images in the training and test datasets are resized to a resolution of 320 x 320 and used in segmentation and classification.
As depicted in Fig. 2, we have chosen three well-known segmentation networks: UNet, ResUNet and DeepLabV3+. RGB and depth images are used as input to these networks, which output the segmented hand region mask. We trained these networks with different types of input data: RGB, depth and stacked RGB-D. Of these three input types, the depth-based models gave better accuracy with a finely segmented hand region mask.
As shown in Table II, the DeepLabV3+ segmentation results are better than those of the UNet and ResUNet models. It can also be observed that the depth-based models provide a better F1-score than the models based on RGB data and stacked RGB-D data. The DeepLabV3+ depth segmentation model achieved the highest F1-score, 0.9235, among the networks. We have analyzed the segmentation accuracy for each subject performing various gestures under different lighting conditions, as shown in Fig. 5. As can be observed, the DeepLabV3+ depth segmentation model gave better accuracy for all subjects.
Segmentation results were also analyzed for each gesture class, as shown in Fig. 6. The plot shows that the DeepLabV3+ model gives better segmentation accuracy than the UNet and ResUNet models. Hence we conclude that the DeepLabV3+ segmentation model trained with depth data is suitable for efficiently segmenting the hand region in complex scenarios.
Based on the experimental results, DeepLabV3+ is selected as the best segmentation model. Its predicted binary segmentation mask is used to discard the background by a bitwise AND operation on the input RGB and depth images. This yields a finely segmented foreground hand region in the RGB and depth images, which constitute the segmented RGB and segmented depth images. These images are used to train two classification models with the custom CNN architecture given in Table I. The accuracy of these two models was analyzed, and it was found that some of the images misclassified by one model are classified correctly by the other; hence score fusion is performed using max and average operations to obtain the ensemble gesture classification results. It can be observed that the average ensemble gives the best result of 88.91%. It is also evident from Fig. 7 that the average ensemble provides the best accuracy over all models for all gesture classes.
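The masking step described above can be sketched with NumPy as follows. For a 0/1 mask, keeping pixels where the mask is set and zeroing the rest gives the same result as a bitwise AND with an all-ones mask value. The function name and the toy arrays are assumptions for illustration.

```python
import numpy as np

def apply_mask(image, mask):
    """Keep pixels where mask is non-zero and set all other pixels to 0."""
    image = np.asarray(image)
    keep = np.asarray(mask).astype(bool)
    if image.ndim == 3:          # H x W x C color image
        keep = keep[..., None]   # broadcast the mask over the channels
    return np.where(keep, image, 0).astype(image.dtype)

# Toy 2x2 inputs (hypothetical values); the left column is the hand region.
rgb = np.full((2, 2, 3), 200, dtype=np.uint8)
depth = np.array([[600, 2500],
                  [650, 2400]], dtype=np.uint16)
mask = np.array([[1, 0],
                 [1, 0]], dtype=np.uint8)

seg_rgb = apply_mask(rgb, mask)
seg_depth = apply_mask(depth, mask)
```

Using one mask for both modalities assumes that the RGB and depth images are registered to the same view, as is the case for aligned RGB-D pairs.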
The proposed framework of semantic segmentation and ensemble classification (SSEC) is compared with state-of-the-art methods in Table IV. The DeepLabV3+ depth segmentation model followed by classification using average score fusion gives the best F1-score on the OUHANDS test dataset.

V. DISCUSSION
Comprehensive experiments using RGB and depth data streams are conducted in both the segmentation and classification stages. As in Fig. 2, hand region segmentation is performed using three segmentation networks, UNet, ResUNet and DeepLabV3+, with RGB, depth and stacked RGB-D data. The corresponding experimental results in Table II indicate that DeepLabV3+ with depth data gives the best segmentation accuracy. This is also evident from the plots in Fig. 5 and Fig. 6, where segmentation accuracy is analyzed for each subject and each class, respectively (aqua plot in both figures).
The segmented RGB and depth data are used to train the classification model as depicted in Fig. 4. The corresponding experimental results in Table III and the plot in Fig. 7 show that the average ensemble (blue plot) gives the best classification accuracy.
The proposed SSEC framework, consisting of DeepLabV3+ based semantic segmentation and average score ensemble classification, is evaluated on the OUHANDS benchmark dataset and its accuracy is compared with existing methods in Table IV. The experimental results show that the proposed method gives the highest F1-score, 0.8891, and performs better than the state-of-the-art methods compared.

VI. CONCLUSION
In this paper, we introduce SSEC, a novel semantic segmentation and ensemble classification framework on RGB-D data for static hand gesture recognition. Specifically, we have analyzed three segmentation networks, UNet, ResUNet and DeepLabV3+, with RGB, depth and stacked RGB-D data. DeepLabV3+ with depth data gave the highest F1-score, and the prediction of this model is used as the segmentation mask to discard the background in the RGB and depth data. In the classification stage, a custom CNN network was trained with segmented depth and segmented RGB data individually. Experimental results show that the average score ensemble of these models gives better accuracy than the individual models. Hence it is inferred that RGB-D data with score fusion model ensembling is suitable for hand gesture classification compared to the state-of-the-art methods.
A limitation of the current approach is the use of the same network in both RGB-D streams, which may not yield diverse features. This can be improved by using different CNN architectures in the two streams to extract complementary features.
In future work, we intend to develop a framework for dynamic hand gesture recognition considering the temporal sequence of RGB-D data for word and sentence level classification.