3D Hand Gesture Representation and Recognition through Deep Joint Distance Measurements

Hand gestures that involve finger relationships are among the hardest features to extract for machine recognition. This paper addresses that challenge with 3D hand joint features extracted from distance measurements, which are then colour mapped as spatio temporal features. The resulting patterns are learned using an 8-layer convolutional neural network (CNN) to estimate the hand gesture. The results show higher recognition accuracy than similar 3D hand gesture methods. The recognition accuracy on our dataset, KL 3DHG, with 220 classes was around 94.32%. The robustness of the proposed method was validated on the only available benchmark 3D skeletal hand gesture dataset, DGH 14/28.
Keywords: Gesture recognition; 3D motion capture; deep learning; joint relational distance maps


I. INTRODUCTION
Hand gestures are considered one of the most powerful forms of communication known to humans. They have evolved over generations and are now regarded as a formidable means of communication between humans and machines. Hand gestures are now treated as part of natural language processing and have become an increasingly important part of human computer interaction (HCI) [1].
Only three sensors are available specifically for capturing 3D hands and fingers: Kinect [2], Leap Motion [3] and Time of Flight (ToF) [4] sensors. Kinect 2 can capture fingers coarsely, though noticeably imperfectly at times. Leap Motion is a good choice for hand capture, but its quality depends on how precisely the hand moves over the sensor, which at times leads to failures. The ToF sensor reconstructs 3D images from time series data, which are quite complex to use for effective hand gesture prediction. Apart from the above, the most popular approaches at present are based on 3D depth sensing technologies [5]. Depth-based hand gesture methods use 3D modelling of finger relationships for recognition [6]. Moreover, 3D hand gesture recognition has been much sought after because of its challenging nature. Recent studies address static, trajectory and continuous 3D hand gesture recognition for applications such as human robot interaction, daily assistance, gaming and sign language recognition [7].
In contrast to the above sensors for 3D hand capture, we propose hand gesture recognition based on 3D motion capture technology. In this work, we used an 8-camera motion capture system to extract hand gestures representing Indian sign language. Here, 3D hand gestures are modelled as time series of 3D joints on the hands. The two hands are used together, in contrast to existing techniques that treat them separately.
Each 3D hand joint is modelled as a time series of position vectors that change across frames. The data from all 3D joints is converted into a spatio temporal image representing the varying hand gestures. Hence, the 3D hand gesture recognition problem translates into a spatio temporal RGB image recognition problem, which is handled efficiently using deep networks. An 8-layer CNN based on the VGG-16 architecture is built for this purpose. However, such networks showed resistance to inter hand variations, which resulted in non-discriminatory features at the end of the network. In this work, we propose a multi layered CNN that preserves the long-term spatial relationships among actions, thus generating discriminatory features that facilitate better performance.
To test the proposed multi layered CNN architecture, we use our own 3D hand gesture dataset (KL 3DHG) in skeletal form along with the only available skeletal dataset, DGH 14/28 [8]. The rest of the paper is organized as follows. Section 2 reviews the literature related to the proposed framework. Section 3 gives the methodology followed for 3D hand gesture recognition. Section 4 presents the results and discussion. Finally, Section 5 concludes the work.

II. LITERATURE REVIEW
Hand gestures are an important part of human communication. They are treated as a natural language processing tool when it comes to interaction between humans and machines. Numerous studies over the last few decades have developed frameworks for hand recognition using multiple sensors for data capture, with subsequent experimentation to improve recognition performance. This section describes these methods and their findings, along with the gaps that remain in developing a 3D hand gesture recognition system.
Hand gesture recognition has been attempted visually through video data captured using 2D sensors. However, processing this 2D video data involves a series of steps such as pre-processing, segmentation, feature extraction and finally classification [7], [9]. Consequently, these methods have generated only mild interest and could not make an impact on applications related to 3D hand gestures. The underlying reason for the poor performance lies in the limited ability of the input sensors to capture real time hand gestures effectively [10].
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 4, 2020
Consequently, sensors such as Kinect and ToF were instrumental in taking 3D human hand gesture recognition to a new dimension involving depth and skeletal data [11]. The 3D hand gesture recognition problem has been approached in two ways: 1) static hand poses and 2) dynamic poses. Static 3D hand shapes are represented as the original 3D depth data or in some transform domain. In [12], the 3D hand features are projected as pixel wise depth features in different hand positions, which produces a large feature space with high computational complexity. In contrast, an ensemble of shape functions has been proposed to represent 3D shapes as a point cloud, which greatly reduced the feature space [13]. Apart from the spatial domain, transform domain methods used Haar [14], Gabor [15] and invariant moments [16] as features to model the intensity and orientation of 3D hand shapes.
More efficient methods were proposed for representing 3D hand gestures using a histogram of 3D facets as features that modelled surfaces on 3D point clouds [17]. However, the most successful features were SIFT [18], SURF [19] and BoW [20], which achieved the highest classification accuracies across a large set of classifiers. Moreover, hybrid features such as bag of words (BoW) have improved the performance of 3D hand gesture recognition methods. Apart from BoW, other hybrid methods that have shown promising improvements in recognition accuracy are feature fusion [21] and sensor fusion [22]. After feature extraction, an efficient classifier is necessary for highly accurate 3D hand gesture recognition. The most widely employed classifiers for 3D static hand gesture recognition are support vector machines (SVM), artificial neural networks (ANN), random forests (RF) and template matching (TM) [5].
Dynamic hand gestures, however, are sets of time varying hand representations that need trajectories and orientations for efficient recognition. The two most widely used methods for dynamic hand recognition are hidden Markov models (HMM) [23] and dynamic time warping (DTW) [24]. Besides these models, conditional random fields (CRF) [25] and windowed DTW [26] have been shown to achieve higher accuracies for continuous 3D hand gesture recognition.
In the last couple of years, hand gesture recognition has shifted towards real time application capabilities using deep learning models. The most widely employed deep learning model for 3D human action recognition is the convolutional neural network (CNN) [27]. Deep learning has been popular on 2D hand gesture video data, with 3D CNNs at the learning core to estimate gestures [28]. Two stream models are more popular than single stream methods. Depth and skeletal data have been exploited simultaneously for recognition with multi stream CNNs [8], where the SoftMax scores from the skeletal and depth streams are fused to generate a class score. However, the most challenging data for 3D hand recognition has been skeletal data, because joint occlusions and overlaps are hard for a CNN to analyse [29], [30]. Moreover, these methods operate directly on the raw positional vectors as inputs to the CNNs. The results point to poor recognition accuracy due to inconsistencies in the data during the signing process, which involves many possible joint interactions.
Apart from CNNs, other deep learning methods used for 3D skeletal hand recognition are memory based architectures called recurrent neural networks (RNNs) and derived models such as Long Short-Term Memory networks (LSTMs). The most accurate are hybrid spatial and temporal feature learning models that use CNNs for spatial features and RNNs or LSTMs for temporal features. The Recurrent CNN (R-CNN) [31] used 3D convolutional neural networks to extract spatial features, which are then learned over time by RNNs to achieve complete spatio temporal learning. However, RNNs are slow and cannot handle long data sequences, making them sluggish for real time operation. These shortcomings were handled by LSTMs, and a multitude of CNN-LSTM combinations [32], [33], [34] with different network architectures have proved effective in learning spatio temporal features for 3D hand gesture recognition. A drawback is that these hybrid recurrent CNNs are not end-to-end trainable, which limits their capacity for real time modelling. The solution is to develop complete spatio temporal features that represent both the spatial and the time series variations in 3D hand gestures. This has been managed effectively by extracting features from the raw time series positional data as motion maps [35]. The problems in raw 3D joint data have been addressed by transforming the joint time series positional data into spatio temporal feature maps such as joint distance maps (JDMs) [36], joint angular displacement maps (JADMs) [37], joint velocity maps (JVM) [38], joint quad maps (JQM) [39] and joint trajectory maps (JTM) [40]. Joint surface maps and joint acceleration maps [36] have also been proposed on skeletal data. All these maps represent the spatio temporal information in the joints as colour coded images that can be learned effectively by a deep convolutional neural network.
The key objectives of this work are: 1) To generate a 3D hand skeletal dataset with 36 joints on both hands using 3D motion capture technology, the first dataset of its kind with the highest number of joint representations. 2) To extract features from the 3D skeletal hand gesture data and characterize them using maximally discriminant spatio temporal colour coded feature maps. 3) To design and train an end-to-end deep learning model that learns the 3D gesture characterizations from the spatio temporal maps to accurately recognize gestures of Indian sign language.
The proposed work differs from existing 3D hand recognition models in three aspects: 1) It uses the largest number of hand joints to date for accurately modelling real time 3D hand motions. 2) It uses a colour coded feature map to characterize the spatio temporal variations in the 3D hand skeletal data, which has not been fully explored for hand gesture recognition. 3) It uses a fast training CNN architecture that can estimate gestures accurately from the proposed features.
The following section describes in detail the methodology of the 3D hand gesture recognition framework, including the datasets, map creation and CNN operation.

III. PROPOSED METHODOLOGY
This section presents a detailed description of the methods used in 3D hand gesture recognition with deep CNNs. The 3D data describes hand to hand communication in Indian sign language and is captured using a 3D motion capture system with 8 cameras. The captured 3D data is a time series representation of hand joints, as shown in Fig. 1. Joint distance features of the hands are then computed and transformed into spatio temporal RGB images. Finally, these images are fed to a deep CNN to estimate a class label for the sign. This section covers the 3D hand gesture datasets, the joint distance measurements, the colour coding of joint distances into features, and the CNN training and testing procedures.
The 3D hand gesture skeletal data for sign language is highly complex, which makes learning features for recognition a challenging task. Since Indian sign language is a two-hand system, both hands are used in this work to generate data. Each hand is marked with 18 joints, taking the total number of joints on both hands to 36. This is currently the highest number of joint representations for 3D hand gesture recognition in a sign language application. The recorded 3D data gives the position of each finger joint individually within a video frame. For a particular sign, these 3D hand joints vary across frames in a video sequence.
The time series of 3D positional values of the hand joints represents the spatio temporal information of a particular class of signs. To construct an entire dataset for training and testing the proposed CNN, we capture 220 sign classes with 10 subjects in 4 views. Fig. 2 shows 3D hand gestures of Indian sign language. Each 3D video is recorded for 280 frames, which is treated as a hyperparameter for optimal capture of all signs. A total of 220 × 280 × 10 × 4 × 3 = 7,392,000 2D tensors, or 2,464,000 3D video frames, are available for processing by the proposed CNN.
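The dataset size quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the variable names are illustrative, and reading the factor 3 as one 2D tensor per coordinate axis is an assumption, since the text does not say what the factor stands for.

```python
# Counts stated in the text for the KL 3DHG dataset.
classes, frames, subjects, views = 220, 280, 10, 4

# ASSUMPTION: the factor 3 is read as one 2D tensor per coordinate axis (x, y, z).
tensors_per_frame = 3

video_frames = classes * frames * subjects * views   # 3D video frames
tensors_2d = video_frames * tensors_per_frame        # 2D tensors

print(f"{video_frames:,} 3D video frames")
print(f"{tensors_2d:,} 2D tensors")
```

The two totals correspond to 2,464,000 frames and 7,392,000 tensors in Western digit grouping.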
Apart from our 3D hand gesture dataset (KL 3DHG), we test the proposed network on the benchmark skeletal dataset DGH 14/28 [8], captured using Intel's RealSense technology. This is the only skeletal dataset available for hand gesture recognition. It records 3D skeletal data with 22 joints for a single hand pose. The system has a resolution of 640 × 480 and captures hand poses at 30 fps. Each 3D skeletal video in the dataset has 20 to 50 frames per gesture. There are around 2800 samples with 14 or 28 class labels in the DGH 14/28 hand gesture dataset. In comparison, our KL 3DHG dataset has more sign gestures, full HD resolution, a recording frame rate of 120 fps and a greater number of frames per class than the DGH 14/28 dataset. The next section presents the feature calculation and colour coded map generation.

B. JRDM Feature Calculations
Inspired by the methods in [27], [35], [36], [37], [38], [39], [40], we propose to calculate joint relational distance maps (JRDMs) by computing features for the two hands separately and combining them into a single mapping entity. We calculate the joint distances of each hand separately in each frame, and then calculate the distance between the two hands from the individual distances of corresponding joint pairs. The JRDM is thus calculated between the joint distances of paired joints on the individual hands.
The location of the $i$-th joint can be represented in 3D space by its coordinates $p_i = (x_i, y_i, z_i) \in \mathbb{R}^{3}$, $i = 1, \dots, J$, where $J$ is the number of joints. The combined position data for the full set of $J$ joints in an $N$-frame hand sign can therefore be expressed as $S_h = \{p_1, p_2, \dots, p_N\} \in \mathbb{R}^{J \times 3 \times N}$, with one entry per frame, where $h$ is the hand pointer, which takes the value $l$ for the left hand and $r$ for the right hand. The intra frame hand distance between the $i$-th and $j$-th joints in the $n$-th video frame is denoted $d^{n}_{ij,l}$ for a left hand pair $(i, j)$ and $d^{n}_{ij,r}$ for a right hand pair. The two-hand joint relative distance (JRD), which gives the relationship between the hands, is denoted $D^{n}_{ij}$; it characterizes the hand relationships between joint pairs across an entire video sequence. However, if only one hand is present during a signing process, only the intra hand distances are used as the feature vector. The final feature matrix for an entire 3D hand sign sequence of $N$ frames, the JRD matrix, captures three types of motion detail: the intra hand joint distances, the inter hand relational joint distances and time. Finally, the JRD matrix is transformed into a JRDM that represents 3D hand movements in Indian sign language.
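The construction of the JRD matrix can be illustrated with a short Python sketch. The exact distance formulas are not given in this text, so two assumptions are made here: the intra hand distance is taken to be Euclidean, and the inter hand relation is taken, purely hypothetically, as the absolute difference between corresponding left and right pair distances. All function names are illustrative.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 3D points (ASSUMED distance measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def intra_hand_distances(joints):
    """Distances for every joint pair (i, j), i < j, of one hand in one frame."""
    pairs = {}
    for i in range(len(joints)):
        for j in range(i + 1, len(joints)):
            pairs[(i, j)] = euclidean(joints[i], joints[j])
    return pairs

def jrd_frame(left, right=None):
    """Feature vector for one frame.

    HYPOTHETICAL inter hand relation: |d_left - d_right| for each joint pair.
    If only one hand is present, the intra hand distances are used directly,
    as stated in the text.
    """
    d_left = intra_hand_distances(left)
    if right is None:
        return [d_left[k] for k in sorted(d_left)]
    d_right = intra_hand_distances(right)
    return [abs(d_left[k] - d_right[k]) for k in sorted(d_left)]

def jrd_matrix(left_seq, right_seq):
    """Stack frame vectors so that rows are joint pairs and columns are frames."""
    columns = [jrd_frame(l, r) for l, r in zip(left_seq, right_seq)]
    return [list(row) for row in zip(*columns)]

# Toy example: 3 joints per hand, 2 identical frames.
left = [(0, 0, 0), (3, 4, 0), (0, 0, 1)]
right = [(0, 0, 0), (0, 0, 2), (1, 0, 0)]
frame = jrd_frame(left, right)
matrix = jrd_matrix([left, left], [right, right])
print(len(matrix), "joint pairs x", len(matrix[0]), "frames")
```

With 18 joints per hand, the same pairing yields 153 joint pairs per frame, so a 280-frame sign would give a 153 × 280 matrix under these assumptions.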
In contrast to previous studies [27], we simply encode the JRD matrix into an image using a standard mapping procedure [36] with the "Jet" colour map. Combining the three RGB colour planes into one produces a JRD image that consists of intensity values only. Previous methods have encoded distance maps into colour images [27], but these are affected by the subject's dimensions, leading to more misclassifications. In the present study we use the inter hand relationships between hand joints to account for these differences, which makes our approach robust to subject to subject differences in dimensions. Fig. 3 shows how the JRDM is encoded for a 3D hand sign video.
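As a rough illustration of the colour coding step, the sketch below maps a small JRD matrix to RGB triples. It is not the mapping procedure of [36]: the min-max normalisation is an assumption, and `jet` is a common piecewise-linear approximation of the "Jet" colour map rather than an exact reproduction of any library's table.

```python
def jet(v):
    """Approximate 'Jet' colour for v in [0, 1], returned as (r, g, b) in [0, 1]."""
    def clamp(x):
        return max(0.0, min(1.0, x))
    return (clamp(1.5 - abs(4.0 * v - 3.0)),
            clamp(1.5 - abs(4.0 * v - 2.0)),
            clamp(1.5 - abs(4.0 * v - 1.0)))

def encode_jrd(matrix):
    """Min-max normalise a JRD matrix (ASSUMED normalisation) and colour every entry.

    Rows are joint pairs and columns are frames, so the output is an image
    whose horizontal axis is time.
    """
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant matrix
    return [[jet((x - lo) / span) for x in row] for row in matrix]

image = encode_jrd([[0.0, 5.0], [10.0, 5.0]])
print(image[0][0], image[1][0])
```

Small distances map to the blue end of the scale and large distances to the red end, so changes in a joint pair over time show up as colour transitions along a row.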

C. Proposed 3DH CNN
The proposed 3DH CNN is inspired by the modified signet VGG architecture developed in [36], a moderately deep CNN model that demonstrated state-of-the-art classification and precision for 3D sign language recognition. The architecture of the 3DH CNN is shown in Fig. 4. It has 8 convolutional layers followed by max pooling and ReLU layers. A dropout of 0.5 was introduced at the end of the 8th layer to regularize the feature vectors. Two dense layers and a SoftMax layer at the end of the network assign class probabilities during training and testing. The filter size is kept constant in every layer, while the number of filters increases every two layers: the paired layers have 16, 32, 64 and 128 filters.
An image resolution of 256 × 256 is used for both training and testing to match the filter resolutions and their number, which avoided vanishing gradients.
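To make the layer layout concrete, the following sketch traces a 256 × 256 RGB input through eight convolutional layers with 16, 16, 32, 32, 64, 64, 128 and 128 filters. Several details are assumptions, not taken from the text: a 3 × 3 kernel with stride 1 and "same" padding, and a 2 × 2 max pooling step after every second convolution. The function name is illustrative.

```python
def trace_shapes(input_size=256, channels=3,
                 filters=(16, 16, 32, 32, 64, 64, 128, 128),
                 kernel=3, pool_every=2):
    """Return per-layer (index, spatial size, channels) and total conv parameters.

    ASSUMPTIONS: square kernels, stride 1, 'same' padding (size unchanged by
    convolution), and 2x2 max pooling after every `pool_every` conv layers.
    """
    size, in_ch, params, log = input_size, channels, 0, []
    for idx, out_ch in enumerate(filters, start=1):
        params += kernel * kernel * in_ch * out_ch + out_ch  # weights + biases
        in_ch = out_ch
        if idx % pool_every == 0:
            size //= 2  # 2x2 max pooling halves each spatial dimension
        log.append((idx, size, out_ch))
    return log, params

log, params = trace_shapes()
for idx, size, ch in log:
    print(f"conv{idx}: {size} x {size} x {ch}")
print("convolutional parameters:", params)
```

Under these assumptions the last feature map is 16 × 16 × 128, which flattens to 32,768 values before the dense layers; a different kernel size or pooling placement would change these figures.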

D. Training 3DH CNN
The 3DH CNN is implemented in Python 3.7 with a Keras frontend and a TensorFlow backend, and trained on our KL 3DHG dataset with 220 class labels. We used the same hyperparameters for all datasets, except for the learning rate, which was reassigned during training for the benchmark dataset DGH 14/28. Specifically, we decreased the learning rate exponentially from 0.001 until the error became constant. At the start of the training phase for each dataset, we initialized the network's weights and biases randomly from a zero-mean Gaussian distribution with variance 0.01. The 3DH CNN learned by updating its weights and biases using gradient descent with back propagation. We applied ReLU and SoftMax activations in the convolutional and dense layers, respectively. Finally, we used a fixed batch size of 64 for training, based on the image resolution and the amount of GPU memory available. During training, we used k-fold cross validation, with each validation fold set to 20% of the training set. After training on each dataset, the trained model was saved, and its hyperparameters were then tuned based on feedback acquired through layer visualizations. Later, we compared our model's performance against those of several state-of-the-art DNNs used for 3D hand gesture recognition in [27], [28], [8], [29], [30], [35]. The training accuracy and loss plots for the proposed 3DH CNN on KL 3DHG are shown in Fig. 5.
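The learning rate schedule described above can be sketched as follows. Only the starting value of 0.001 comes from the text; the decay rate and the tolerance used to decide that the error has become constant are hypothetical values chosen for illustration, as are the function names.

```python
import math

def exponential_lr(epoch, initial_lr=0.001, decay_rate=0.1):
    """Learning rate after `epoch` epochs (decay_rate is a HYPOTHETICAL value)."""
    return initial_lr * math.exp(-decay_rate * epoch)

def stop_epoch(losses, tol=1e-4):
    """First epoch at which the loss changes by less than `tol` (HYPOTHETICAL criterion).

    Returns None if the loss never settles within the recorded history.
    """
    for epoch in range(1, len(losses)):
        if abs(losses[epoch] - losses[epoch - 1]) < tol:
            return epoch
    return None

# Illustrative loss history, not measured data.
history = [1.0, 0.5, 0.3, 0.29995, 0.2999]
last = stop_epoch(history)
for epoch in range(last + 1):
    print(epoch, exponential_lr(epoch))
```

In a Keras training loop the same function could be supplied through a learning rate callback, but that wiring is left out here to keep the sketch free of framework dependencies.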

E. Testing and Performance Evaluation
After training on each dataset, the 3DH CNN and the other DNNs were tested on the test sets described in Table I. Table I shows the recognition accuracies obtained for each of the two skeletal hand gesture datasets, averaged over the entire set. Further, the video-based 3D hand gesture recognition datasets used with CNNs in [41] and [42] were also tested with our network. These results show that the recognition accuracies of our 3DH CNN were higher than those of the state-of-the-art DNNs. The proposed 3DH CNN showed no signs of vanishing gradients, and weight decay was relatively smooth in the dense layers. The promising results on the 2D video hand gesture datasets inspired us to look into the more difficult problem of recognition on 3D human action skeletal datasets such as NTU RGB D, HDM05 and CMU [27].

IV. EXPERIMENTAL EVALUATIONS AND DISCUSSIONS
First, the proposed method is evaluated for characterizing 3D hand gesture skeletal data using JRDMs with the 3DH CNN. Second, various colour coded maps are tested with the 3DH CNN on the KL 3DHG and DGH 14/28 hand skeletal datasets. Third, different DNNs are used to gauge the performance of the proposed 3DH CNN on the two hand gesture datasets. Finally, we test the performance of the proposed JRDMs on 3D skeletal action datasets with the 3DH CNN and other popular models.

A. Evaluation on KL 3DHG with 3DH CNN
The implementation is built on the Keras and TensorFlow toolboxes available in Python 3.6, with considerable adjustments during training and testing. Training is carried out on an NVIDIA GTX1080 GPU with 8 GB of memory.
The proposed 3DH CNN is tested on KL 3DHG mocap data on the above GPU system. The performance of each network with the proposed JRDM encoding format is evaluated in terms of mean average recognition (mAR) on the entire training set. The 3DH CNN is shown examples from 8 subjects in 2 views during training, and the remaining 2 subjects with 2 views are used for testing. Table II shows the mAR for both same and cross subject tests, as well as for same and cross view tests.
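The text does not define mAR formally. One plausible reading, assumed here, is the per-class recognition rate averaged over all classes, which can be computed from a confusion matrix as in the sketch below. The function name and the toy counts are illustrative, not results from the paper.

```python
def mean_recognition_rate(confusion):
    """Average of per-class recognition rates (ASSUMED definition of mAR).

    confusion[t][p] is the number of test samples of true class t that
    were predicted as class p. Classes without test samples are skipped.
    """
    rates = []
    for t, row in enumerate(confusion):
        total = sum(row)
        if total:
            rates.append(row[t] / total)
    return sum(rates) / len(rates)

# Toy 2-class confusion matrix with made-up counts.
toy = [[8, 2],
       [1, 3]]
score = mean_recognition_rate(toy)
print(round(score, 3))
```

Averaging per class rather than per sample keeps classes with few test recordings from being swamped by larger ones, which matters when 220 sign classes are evaluated on only two held-out subjects.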
In this part, we plot confusion matrices of the proposed JRDMs on our 3DH CNN architecture, obtained from cross subject and cross view testing of the trained network. Figs. 6 and 7 show the confusion matrices for 30 hand gestures in Indian sign language. The confusion matrices clearly show the influence of putting relational information between hands into the distance maps with the help of Eq. (4). The overall recognition accuracies achieved are around 94.32% for cross subject and 91.28% for cross view testing, respectively.
The proposed JRDMs are compared with other colour coded maps, namely joint distance maps (JDMs) [36], joint angular displacement maps (JADMs) [37], joint velocity maps (JVM) [38], joint quad maps (JQM) [39] and joint trajectory maps (JTM) [40]. We present the average recognition accuracies for JRDM and the other encoded images for front view, cross view and cross subject evaluation on the two datasets in Table III.
The superior performance of JRDMs over other maps on 3D hand gesture datasets can be attributed to the joint relational information that captures relationships between joints on both hands. All values are averaged over the number of test subjects used for testing the proposed 3DH CNN. The image encoding model is further evaluated on popular state-of-the-art single stream CNN architectures to show that the encoding works across architectures. All architectures are trained from scratch, keeping network attributes such as learning rate, learning momentum and stopping criterion the same.
From Table III, the recognition accuracies for cross subject and cross view testing show that JRDM colour texture encoding is better than all the other encodings on our KL 3DHG data. All 220 class labels are tested with the encoded images as input for each deep net architecture. Table IV gives the mAR of the benchmarked deep learning models trained with JRDMs and the other maps. The cross view scores are slightly lower than the cross subject scores in all cases due to inter finger occlusions of joints during the signing process.

D. Performance on 3D Skeletal Action Recognition
This section evaluates the advantages of using our JRDMs across different 3D skeletal action datasets with multiple DNN classifiers. Table V lists the recognition accuracies on the HDM05 [46], CMU [47] and NTU RGB-D [48] 3D action datasets. The JRDMs were generated from the positional vectors using the process described in Section 3. All the maps were normalised and resized to 256 × 256, irrespective of the number of joints in the skeletons. Since the performance of the other maps has already been reported in earlier works [36], [37], [38], we refer the reader to them for comparison with the present relative geometric maps.
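Resizing every map to a common 256 × 256 grid is what lets skeletons with different joint counts share one network input. The interpolation method is not stated in the text; the sketch below uses nearest neighbour sampling as an assumed, minimal choice, and the function name is illustrative.

```python
def resize_nearest(image, out_h=256, out_w=256):
    """Resize a 2D grid (list of rows) by nearest neighbour sampling.

    ASSUMPTION: nearest neighbour is used only for illustration; the
    interpolation actually applied to the maps is not specified.
    """
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

small = [[1, 2],
         [3, 4]]
big = resize_nearest(small, 4, 4)
full = resize_nearest(small)
print(big)
print(len(full), len(full[0]))
```

The same call works whether a map has a handful of rows or several hundred, since each output pixel simply looks up the nearest source entry.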

V. CONCLUSION
This work proposes joint relational distance maps (JRDMs) for representing spatio temporal information in 3D mocap hand gesture data. Unlike previously proposed joint maps for action recognition, the proposed JRDM maps to rich colour coded images whose local information is computed from paired joint distances of the left and right hands. Further, a 3DH CNN architecture is proposed for classifying the encoded images. The CNNs are trained from scratch on the KL 3DHG and DGH 14/28 hand gesture datasets. The results show that the JRDM encoded images generate unique representations of 3D mocap hand gesture data, which are recognized by deep CNN frameworks.