Deep Multi View Spatio Temporal Spectral Feature Embedding on Skeletal Sign Language Videos for Recognition

Abstract: The primary objective of this work is to build a competitive global view from multiple views that represents all the views within a class label. The first phase involves the extraction of spatio temporal features from videos of skeletal sign language using a 3D convolutional neural network. In phase two, the extracted spatio temporal features are ensembled into a latent low dimensional subspace for embedding in the global view. This is achieved by learning the weights of the linear combination of Laplacian eigenmaps of multiple views. Subsequently, the constructed global view is applied as training data for sign language recognition.


I. INTRODUCTION
Sign Language Recognition (SLR) is the task of capturing highly coordinated hand movements through sensors as 1D, 2D or 3D data and translating them into text or voice through a machine learning interface [1]. Sign language is a communication medium for hearing impaired people; it consists of hand movements and finger shapes that operate independently or collaboratively with respect to the upper body parts. SLR is considered an extension of human action recognition (HAR) [2]. Automated HAR or SLR is accomplished through machine learning approaches on multi modal datasets such as RGB, depth and skeletal information in image, video and data formats. The RGB and depth formats provide appearance information, whereas the skeletal joint data exclusively models pose details. Although SL knowledge representation is largely modelled in RGB video formats, it is bottlenecked by motion blur and by the low spatial resolution of the fingers relative to the frame size. Therefore, skeletal data has gained wide acceptance for human action and sign language recognition problems. The 3D skeletal data has been used in vectorized, image and RGB video formats for recognition.
However, pattern identification on skeletal 3D video data for building a real time application is a highly challenging task. Traditional models employed vectorized 3D data for recognition with deep neural networks (DNN) [3]. Among the DNN models for 3D skeletal action data, long short-term memory (LSTM) [4] networks have shown greater reliability and robustness for HAR tasks. Similarly, 3D skeletal SLR on vectorized data was successfully designed and tested with color coded Spatio-Temporal features [5]. Notably, most of these methods reported poor performance in cross view testing because the models received only single view training. As a result, they failed to generalize when building a real time engine for HAR or SLR.
Meanwhile, this problem is being addressed through multi view training of deep learning models. Though multi view processing of video data has two decades of research history, it has gained extensive attention in the last few years due to the progress in deep learning approaches. Earlier DNNs were constructed with multiple streams, each fed independently by an individual view, whose Softmax scores are fused to obtain a final recognition score. Later approaches trained a separate CNN for each view and then learned the concatenated features in the dense layers. This allowed multiple views to share features across classes. However, this process does not suppress the features that are not significant for decision making. Additionally, the view specific features that play a major role in articulating the desired outcome are ignored.
To overcome the above challenges, we propose to learn a global synthesized target view by linearly combining the independent multiple views as suggested in [6]. However, these intra class independent views have been shown to exhibit unequal similarities with other views, which biases the result towards false positives. Hence, to overcome the uniform weighting across views that influences the target class, we adopt higher order Laplacian eigenmaps from [7]. This enables the target feature reconstruction to have a fully non uniform distribution across the multiple independent views. Consequently, we learn a non uniform linear combination of weights on independent views which can be generalized to any target view. Finally, the synthesized target view features of all classes are classified using standard deep learning architectures. The proposed methodology, called multi view spatio temporal feature embedding (MVSTFE), is illustrated in Fig. 1.
The proposed MVSTFE is investigated on our 3D skeletal video dataset of sign language (KLEF3DSL 2Dskeletal) [8] and four other multi view action datasets: NTU RGB-D [9], SBU Kinect Interaction [10], KLYoga3D [11] and KL3D MVaction [12]. The performance of the proposed method was tested against the state-of-the-art on these datasets. The remainder of the paper is organized into four sections. The second section highlights the key historical aspects associated with multi view learning, sign language recognition and deep networks. The methodology is presented in the third section, and the experimental results with analysis are presented in section four. Finally, conclusions are drawn from the analytical insights gained on the overall performance of the proposed models.

II. LITERATURE REVIEW
This section discusses the advantages and disadvantages of previous methods for sign language and action recognition in multiple views. It also discusses the current models in deep metric learning.
With the advent of deep learning frameworks, 2D video based SLR has become powerful with the option of feature learning rather than feature extraction. A large collection of such methods is available in [13]. The accuracies reported by these methods are either not reproducible or fail to generalize across video quality or signers. This has motivated researchers towards higher dimensional data such as RGB D or 3D skeletal representations. Multi modal video sequences fed into multiple streams of a CNN are predominantly researched and have shown evidence of exceptional performance in real time sign (action) recognition applications [14]. Their recognition accuracies were better than those on single modal datasets. However, the training requires greater computing power, and the datasets are captured with special devices, which makes this an infeasible solution to deploy.
Eventually, to develop a real time SLR or HAR system, it is intuitive to learn from multiple views across datasets. This has steered action recognition research towards view-based learning algorithms [15], [16]. Multi view HAR has evolved through research using dictionary learning [17], neural networks with adaptable views [18], convolutional neural networks [19] and deep attention models [20], to name a few. However, the most widely researched and acknowledged models are deep learning networks. Moreover, visual attention models with deep CNNs have established themselves as a formidable solution to multi view learning [21]. Despite their success, attention models are specific to a particular view, and the view specific features have to be fused for classification by the dense layers. The fusion mechanisms ensemble the view specific features into a multi view feature vector, which has failed to capture the variations in multi view data [22].
Multi view approaches are primarily classified as multi view learning and view invariant models. In multi view learning, the video input is considered as a time series of data frames in different views which are learned independently by the classifier [23], [24]. Most of the methods used low level observable features for generating discriminative features [17]. Subsequently, multiple training methods were employed for each of the views to find a set of consistent features between a pair of views [25], [26]. The algorithms used for finding relationships between views include canonical correlation analysis (CCA) [27] and projection matrices [28]. Extensions of these methods include matrix factorization [29] and low rank constrained matrix factorization [30] for capturing view similarities. All these models have shown good performance where the number of views is limited, and they require extensive computational power for deployment.
Alternatively, view invariant models developed linear descriptors to transfer information between views. Accordingly, these models consider target views as a linear combination of views within a class label [6], [7]. Subsequently, the weight vectors are computed by applying optimization in Laplacian space. Moreover, these works assume that all views contribute equally to the target view features. However, in sign language recognition with video data from multiple source views it is difficult to impose the above assumption in real time. To overcome the disadvantage of equal contribution by all views to the target view, we propose to learn these contributions in the Laplacian space using deep learning.
The following points make the proposed method unique from the existing ones:
1) To design an unequal linear view combiner to extract target view features.
2) To construct highly discriminative Spatio-Temporal features in the Laplacian space.
3) To reconstruct learned target vectors into a Spatio-Temporal feature representation with 3D CNNs.
In order to find an appropriate solution for multi view problems, the following objectives are formulated:
1) To design an unequally contributing linear view combiner to identify the linear combinations.
2) To learn the mapping function for generating a singularly trainable view invariant Spatio-Temporal feature.
3) To initiate a model that can be tested with any one view.
We call our proposed model multi view spatio temporal feature embedding (MVSTFE).

III. MULTI VIEW SPATIO TEMPORAL FEATURE EMBEDDING (MVSTFE)
This section describes the proposed multi view spatio temporal feature embedding model for multi view sign language recognition on skeletal video datasets. First, a cluster of 3D CNNs is trained independently on individual views for all classes in the dataset. Secondly, a target view is selected randomly and passed through the pre trained 3D CNNs for feature extraction. The extracted features from the independent view streams are then learned by compiling Laplacian eigenmaps to construct a combined target view. The combined target view features represent a linear combination of Laplacian eigenmaps from multiple views, generating a highly discriminative feature for all views of the target view class. Finally, these learned target view features are used for training any deep classifier for sign or action recognition.

A. Independent View 3D CNN Model
The primary step in the process of multi view sign language recognition is to design and train a 3D convolutional neural network (3D CNN). The 3D CNN takes the skeletal video sequences as input for supervised training. The number of 3D CNN streams is equal to the number of source views available for training. The 3D CNN architecture used in this work is shown in Fig. 2. The model has 4 pairs of 3D convolutional layers, with one set of batch normalization and maximum pooling layers after each pair. The input of the network is a 2D skeletal video sequence of 100 frames, each of size 256 × 256 × 3. The features at the end of the convolutional layers are flattened and fed to two fully connected layers, the last of which is a Softmax layer.
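As a rough check on the architecture described above, the following sketch traces the spatial size of a 256 × 256 frame through four convolution pairs, each followed by pooling. It assumes stride-one 3 × 3 convolutions with one pixel of padding and unpadded 2 × 2 pooling with stride two. The padding mode and the function names are assumptions made for illustration and are not stated in the text.

```python
from typing import List


def conv_out(size: int, kernel: int = 3, stride: int = 1, pad: int = 1) -> int:
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size: int, window: int = 2, stride: int = 2) -> int:
    """Output length of an unpadded max pooling layer along one axis."""
    return (size - window) // stride + 1


def trace_spatial_size(size: int = 256, blocks: int = 4) -> List[int]:
    """Spatial size after each block of (conv, conv, pool)."""
    sizes = []
    for _ in range(blocks):
        size = conv_out(conv_out(size))  # pair of stride-1, pad-1 convolutions
        size = pool_out(size)            # 2x2 pooling with stride 2 halves the size
        sizes.append(size)
    return sizes


print(trace_spatial_size())  # [128, 64, 32, 16]
```

Under these assumptions each frame reaches the flatten layer as a 16 × 16 map per channel; a different padding choice would change these numbers.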
The 3D CNN model extracts the features $f_v$ from $x_v$ with view specific labels $y_v$ using the trainable parameters $\theta_{3D}$ by optimizing a loss function $L$ on the overall multi view dataset. For classification tasks, we need a global loss function to discriminate the classes with the help of Softmax layers. The class label prediction is computed on the embedding space using the cross-entropy loss function defined as

$$l_{CrossEnt} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

Here, $l_{CrossEnt}$ is the loss function for training the network, $\hat{y}_i$ is the predicted label, $y_i$ is the actual label and $C$ is the total number of classes in the dataset. Each stream in the network is view independent with the specifications shown in Fig. 2. The weights and biases are initialized using a zero mean, unit variance Gaussian random variable. The filter size in all 3D CNN layers is fixed at 3 × 3 × 3. The learning rate is dynamically controlled: it is decreased by 10% from its previous value whenever the loss remains constant across 10 epochs. The initial learning rate was selected as 0.0001. A stochastic gradient descent optimizer is applied to update the weights and biases in the network. This trained network is used to extract spatio temporal features from a target view, which are further used to construct the combined view features. These constructed view features have the ability to represent all the views within a class label.
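The learning rate rule described above (start at 0.0001, cut by 10% when the loss stays constant for 10 epochs) can be sketched in plain Python as follows. The function name and the tolerance `tol` used to decide that the loss is "constant" are assumptions introduced here; the text does not specify them.

```python
def plateau_schedule(losses, initial_lr=1e-4, patience=10, decay=0.10, tol=1e-6):
    """Return the learning rate used at each epoch.

    The rate is reduced by `decay` (10%) whenever the loss has not improved
    by more than `tol` for `patience` consecutive epochs. `tol` is an assumed
    threshold for treating the loss as constant.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0
    rates = []
    for loss in losses:
        rates.append(lr)
        if best - loss > tol:
            best = loss          # loss improved: reset the plateau counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= 1.0 - decay  # keep 90% of the previous value
                stale = 0
    return rates


# Two improving epochs, ten flat epochs, then one more epoch.
history = [1.0, 0.5] + [0.5] * 10 + [0.4]
rates = plateau_schedule(history)
print(rates[0], rates[-1])
```

In this toy history the rate stays at its initial value through the flat stretch and is reduced once, just before the final epoch.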

B. Combined View Feature Generation
Given a sign class in a specific target view $x_{vt}$ as input to the trained model $\theta_{3D}$, the output features $f_v$ are taken at the end of the dense layers. The feature extraction network is shown in Fig. 3. The network consists of four pairs of convolutional layers with rectified linear activations, each followed by a maximum pooling layer with a 2 × 2 window. The stride of the kernels in the convolution layers is one and that of maximum pooling is two. After maximum pooling, a batch normalization layer is added to standardize the inputs to the deeper layers. Finally, two fully connected layers are added to learn on the features extracted in the convolutional layers. Subsequently, the spatial features at the output of the dense layers are concatenated along the frames to generate a complete spatio temporal feature matrix representing the 2D skeletal video sequence. Altogether, $V$ streams operate independently in the network, generating view specific class features $F_{cv} = \{f_{ic}\}$ for $i = 1$ to $V$, with features in $\mathbb{R}^{g \times N}$, where $g$ is the dimensionality of the features and $N$ is the number of frames. The model is trained with categorical cross entropy loss and a stochastic gradient descent optimizer on the entire dataset. The trained model $\theta_{3D}$ is applied to all the input video frames to extract the feature samples.
The spatio temporal feature matrix $F_{cv}$ consists of the target view features inferred from the independently trained views across all classes. The objective is to generate a feature matrix that represents all views in a class as a linear combination of the extracted features. Traditionally, this is achieved by assuming that the mixing coefficients are equally distributed across all views. However, an equal distribution of information across all views has produced ambiguous recognition accuracies. To overcome this, a non-uniform distribution with Laplacian eigenmaps is proposed in [7].
In this work, we incorporate the process of spectral embedding using Laplacian eigenmaps to calculate the mixing coefficients of the linear combination. Given a set of spatio temporal target view features $F_{cv} \in \mathbb{R}^{d \times V}$ from a particular class label $y_{i \in C}$ with $V$ views, these views can be linearly combined with coefficients as

$$F_{cv}^{Comb} = \sum_{i=1}^{V} \lambda_i f_{ic}$$

where $V$ is the total number of source views, $d$ is the feature matrix dimensionality and $F_{cv}^{Comb}$ is the combined feature representation of the target feature. The mixing coefficients $\lambda_i$, $i = 1$ to $V$, are the weights of the combination. The constraint on the mixing coefficients is

$$\sum_{i=1}^{V} \lambda_i = 1$$

The intent of the above representation is to generate a global view that is compatible with all the views in the class. Usually, each coefficient $\lambda_i$ is set to the average $1/V$ across all the views. However, in reality, the views that are in close proximity to the target view contribute more than $1/V$. Consequently, the linearly combined global view features obtained with uniform weights are poorly suited to representing all the views in a class. This problem is solved by evaluating the mixing coefficients of individual views with the help of a cost function derived using Laplacian eigenmaps [7].
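To make the difference between uniform and non-uniform mixing concrete, here is a minimal NumPy sketch of the weighted view combination under the sum-to-one constraint. The function name, the array layout `(V, g, N)` and the example weights are assumptions chosen for illustration only.

```python
import numpy as np


def combine_views(view_features, weights):
    """Weighted linear combination of per-view feature matrices.

    view_features: array of shape (V, g, N), one g x N matrix per view.
    weights: length-V mixing coefficients; they must sum to 1.
    """
    view_features = np.asarray(view_features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if view_features.shape[0] != weights.shape[0]:
        raise ValueError("one weight per view is required")
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("mixing coefficients must sum to 1")
    # Contract the view axis: sum_i weights[i] * view_features[i]
    return np.tensordot(weights, view_features, axes=1)


rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4, 5))           # V = 3 toy views, g = 4, N = 5

uniform = combine_views(features, [1 / 3] * 3)  # every view contributes 1/V
skewed = combine_views(features, [0.6, 0.3, 0.1])
print(uniform.shape, skewed.shape)
```

With uniform weights the result is simply the mean over views; the skewed weights let the first view dominate, which is the behaviour the learned coefficients are meant to provide.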
First, the target features are arranged as $V$ data matrices, as in Fig. 3. The objective is to calculate a set of mixing coefficients $\lambda = \{\lambda_i\}_{i=1}^{V}$. We start by initializing $\lambda = (1/V, \dots, 1/V)$. Subsequently, we set the $g \times N$ feature points obtained from the trained network in the $t$th target view.
To compute the combined target view embedding features, we compute the weighted adjacency matrix $A_t$ on the target features and the Laplacian matrix $L_i$ of each individual view, with $i \in \{1, 2, \dots, V\}$. The global Laplacian $L_G$ of the entire target view class is then computed as a linear combination of the $L_i$ with the initial weights. The spectral encoding $Y_G$ is computed from the eigenvalue decomposition of $L_G$ as a Laplacian eigenmap. Accordingly, the smallest eigenvalues other than the zeroth one are selected to reconstruct the spectral encoding $Y_G^{*}$. Using the reconstructed spectral encoding $Y_G^{*}$ and $Y_G$, the mixing coefficients $\lambda_i$ of the linear combination are updated. The optimization continues until the distance between the reconstructed and the original spectral encodings is less than a set experimental threshold.
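The step of forming a global Laplacian from per-view Laplacians can be sketched as below. This assumes the unnormalized graph Laplacian (degree matrix minus adjacency) for each view, which is a common convention but is not confirmed by the text; the toy adjacency matrices and function names are hypothetical.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A for a symmetric adjacency matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def global_laplacian(laplacians, weights):
    """Weighted sum of per-view Laplacians: L_G = sum_i weights[i] * L_i."""
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights, np.asarray(laplacians, dtype=float), axes=1)


# Two toy views over the same three feature points.
A1 = np.array([[0.0, 1.0, 0.2], [1.0, 0.0, 0.5], [0.2, 0.5, 0.0]])
A2 = np.array([[0.0, 0.3, 0.9], [0.3, 0.0, 0.1], [0.9, 0.1, 0.0]])
views = [graph_laplacian(A1), graph_laplacian(A2)]

L_G = global_laplacian(views, [0.5, 0.5])  # uniform start, lambda_i = 1/V

# A non-negative combination of Laplacians is still a Laplacian:
print(np.allclose(L_G.sum(axis=1), 0.0))         # rows sum to zero
print(np.all(np.linalg.eigvalsh(L_G) > -1e-10))  # positive semidefinite
```

Because the combination keeps these two properties, the smallest eigenvalue of the global Laplacian remains zero, which is why the zeroth eigenvalue is skipped when the spectral encoding is built.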

D. Construction of Laplacian Eigenmaps and Spectral Embedding
Given the feature data points in the $t$th target view $\{F_{cv} \in \mathbb{R}^{d}\}_{v=1}^{V}$ with $g \times N$ data points, we first compute the adjacency matrix $A_t$, a symmetric matrix of size $gN \times gN$ whose entries depend on the distance between pairs of features. The value of $\sigma$ in the adjacency computation is selected as 2. The adjacency matrix establishes a link between the target features extracted in all views from the trained CNN in Fig. 2. If the distance between two features is small, the value in the $(i, j)$th position tends towards 1, and it tends towards 0 otherwise. Consequently, $A_t$ establishes a relationship between the feature points formed by a set of $d$ data points in multiple views.
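One adjacency function with the behaviour described above (values near 1 for close features, near 0 for distant ones, controlled by $\sigma$) is the Gaussian heat kernel. The sketch below uses that kernel as an assumption; the exact form and the $2\sigma^2$ scaling are not given in the text, and the function name is hypothetical.

```python
import numpy as np


def gaussian_adjacency(points, sigma=2.0):
    """Heat-kernel adjacency A[i, j] = exp(-||p_i - p_j||^2 / (2 * sigma^2)).

    points: array of shape (n, d), one feature point per row.
    The 2 * sigma^2 scaling is an assumed convention.
    """
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]   # all pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


# Two nearby points and one distant point.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0]])
A = gaussian_adjacency(pts, sigma=2.0)
print(np.round(A, 4))
```

Note that this builds a dense n × n matrix, so for $gN$ feature points the memory cost grows quadratically; a sparse nearest-neighbour graph is a common alternative when that becomes a limit.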
Subsequently, Laplacian eigenmaps from [30] are used to compute a single view feature combination from the multiple target view features. Laplacian eigenmaps reduce the data by projecting it onto a different spectral view without compromising the relationships between the feature points. Accordingly, the spectral encoding $Y_G^{*}$ is computed by minimizing a cost function that measures the difference between pairs of embedded features in multiple views, modulated by their association values in the adjacency matrix. If the feature points in multiple views are in close proximity, the adjacency matrix value is large and contributes more to the cost function. As a result, similar data points from different views are preserved in the spectral embedding. The solution to this optimization is transformed into a minimization problem as described in [30], in which the global Laplacian matrix $L_G$ is computed from the adjacency matrix and the matrix $D$ that gives the degree of connectivity in the data. Solving the minimization problem (9) is equivalent to finding the eigenvectors $Y_G^{*}$ that satisfy $L_G Y_G^{*} = \alpha D Y_G^{*}$. The spectral embedding $Y_G^{*}$ can also be calculated by simply computing the eigendecomposition of $L_G$. Finally, the Laplacian eigenmaps $L_G$ and the spectral embedding $Y_G^{*}$ are used to compute the cost function (10) that yields the mixing coefficients. The convergence of (10) is decided based on the $l_2$ norm between iterations as

$$\lVert \lambda^{k} - \lambda^{k-1} \rVert_2 < \delta$$

Here, $\lambda^{k}$ is the vector of mixing coefficients $\lambda_i^{k}$ at the $k$th iteration and $\lambda^{k-1}$ is that at the $(k-1)$th iteration. The constant $\delta$ is a user defined parameter less than 1. Eventually, the values of $\lambda_i$ differ from $1/V$ when multiple views contribute differently to the target view. Finally, by multiplying the obtained mixing coefficients with the target features from different views, we obtain a global view feature that closely relates to the target view features.
Furthermore, the resulting single target view feature is highly discriminative across classes and is found to be in close proximity to all the views within a class label. The following section describes the datasets and experiments conducted to ascertain the performance of the proposed method.
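The alternating procedure (embed with the current weights, re-weight the views, stop when the weights change by less than $\delta$) can be sketched as follows. The generalized eigenproblem $L y = \alpha D y$ is solved through the equivalent symmetric form $D^{-1/2} L D^{-1/2}$. The re-weighting rule used here, an inverse square root of each view's embedding cost followed by normalization, is borrowed from auto-weighted multiple graph learning and is only a stand-in: the paper's own cost function (10) is not reproduced in the text. All function names are hypothetical, and the sketch assumes connected graphs without self-loops.

```python
import numpy as np


def spectral_embedding(laplacian, degree, dim):
    """Solve L y = alpha * D y and keep the `dim` smallest non-trivial eigenvectors."""
    d_inv_sqrt = 1.0 / np.sqrt(np.diag(degree))
    # Symmetric form D^{-1/2} L D^{-1/2} has the same eigenvalues alpha.
    sym = d_inv_sqrt[:, None] * laplacian * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(sym)            # eigenvalues in ascending order
    embedding = d_inv_sqrt[:, None] * eigvecs   # map back: y = D^{-1/2} u
    return embedding[:, 1:dim + 1]              # drop the zero-eigenvalue vector


def learn_mixing_coefficients(laplacians, dim=1, delta=1e-6, max_iter=100):
    """Alternate embedding and re-weighting until ||lam_k - lam_{k-1}||_2 < delta."""
    laplacians = np.asarray(laplacians, dtype=float)
    num_views = laplacians.shape[0]
    lam = np.full(num_views, 1.0 / num_views)        # start from 1/V
    for _ in range(max_iter):
        L_G = np.tensordot(lam, laplacians, axes=1)  # global Laplacian
        D_G = np.diag(np.diag(L_G))                  # degrees (no self-loops assumed)
        Y = spectral_embedding(L_G, D_G, dim)
        # Smoothness of the shared embedding on each view's graph: tr(Y^T L_i Y)
        costs = np.array([np.trace(Y.T @ L @ Y) for L in laplacians])
        new_lam = 1.0 / (2.0 * np.sqrt(costs + 1e-12))  # assumed stand-in update
        new_lam /= new_lam.sum()                        # keep sum(lam) == 1
        converged = np.linalg.norm(new_lam - lam) < delta
        lam = new_lam
        if converged:
            break
    return lam, Y  # Y is the embedding from the last iteration


# Toy example: two views over the same three feature points.
A1 = np.array([[0.0, 1.0, 0.2], [1.0, 0.0, 0.5], [0.2, 0.5, 0.0]])
A2 = np.array([[0.0, 0.3, 0.9], [0.3, 0.0, 0.1], [0.9, 0.1, 0.0]])
view_laplacians = [np.diag(A.sum(axis=1)) - A for A in (A1, A2)]

lam, Y = learn_mixing_coefficients(view_laplacians, dim=1)
print(lam, lam.sum())
```

Views whose graphs agree with the shared embedding get a lower cost and therefore a larger weight, so the learned coefficients move away from $1/V$ when the views differ.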

IV. EXPERIMENTATION
The proposed view invariant method, MVSTFE, was trained and tested on multi view skeletal sign (action) video datasets in multiple ratios. We present one-to-one, one-to-many, many-to-one and many-to-many cross view training and testing approaches on MVSTFE. Further, we compare the results of our approach with other state-of-the-art multi view methods. Finally, multiple CNN architectures for classification were tested to check the robustness of the proposed feature extraction process in generating view invariant features.

A. Skeletal Video Datasets and Evaluation Metrics
The multi view sign language dataset KLEF3DSL 2Dskeletal with V = 15 views, 200 classes is generated at KL Biomechanics and Vision Computing Research Centre using 3D motion capture technology [8].
Further, the proposed model is evaluated on multi view benchmark skeletal action datasets such as NTU RGB-D [9], SBU Kinect Interaction [10], KLYoga3D [11] and KL3D MVaction [12]. A small subset of data samples from KLEF3DSL 2Dskeletal is presented in Fig. 4 for the sign "basketball". In this work we limit our views to 15 due to computational constraints. The training testing ratios are kept constant across all networks and datasets. The selected train test ratios are one-to-one and one-to-many. The remaining views were also evaluated but are not presented here as they did not produce any noticeable performance changes when compared to the selected ones. Since there are no other multi view sign language datasets, we evaluated our model on multi view benchmark action datasets. Despite the availability of a large number of classes in action datasets, we selected only 40 action classes for training, with 15 views from each class, to maintain uniformity during comparison. In some cases, the unavailability of views prompted us to generate random views by altering the viewing angles of joints. Here, the evaluation is performed independent of the type of view in which the action is recorded. Fig. 5(a), (b) and (c) show samples from the NTU RGB-D, KL3D MVaction and KLYoga3D datasets respectively. We used mean recognition accuracy (mRA) for performance evaluation. The first 3D CNN network in Fig. 2 extracts the features from skeletal sign (action) video datasets. The network in Fig. 2 is trained on all the available views with similar hyper parameters except for the learning rate and number of epochs. The learning rate for the KLEF3DSL 2Dskeletal sign language video dataset is 0.001 and it was 0.005 for all other action datasets. However, KLYoga3D was trained with a learning rate of 0.0001 for 200 epochs due to the large number of skeletal joints. The remaining datasets were trained for 150 epochs.
The maximum recognition accuracy achieved during training was around 0.973 for KLEF3DSL 2Dskeletal sign language, 0.942 for NTU RGB-D, 0.845 for SBU Kinect Interaction, 0.902 for KLYoga3D and 0.985 for KL3D MVaction datasets respectively. Consequently, these individual view trained 3D CNN streams are used for inference on all dataset samples to generate global view features which represent all views within a class label.
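The generation of additional views by altering the viewing angles of joints, mentioned above, could be done by rotating the 3D joint coordinates and projecting them to 2D. The sketch below is one possible way to do this and is not taken from the paper: the rotation axes, the orthographic projection, the array layout `(frames, joints, 3)` and the function names are all assumptions.

```python
import numpy as np


def rotate_view(joints, yaw_deg, pitch_deg=0.0):
    """Rotate 3D joints of shape (frames, joints, 3) about the y axis, then the x axis."""
    yaw, pitch = np.radians([yaw_deg, pitch_deg])
    rot_y = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                      [0.0, 1.0, 0.0],
                      [-np.sin(yaw), 0.0, np.cos(yaw)]])
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0, np.cos(pitch), -np.sin(pitch)],
                      [0.0, np.sin(pitch), np.cos(pitch)]])
    rotation = rot_x @ rot_y
    # Row vectors: (R p)^T = p^T R^T, applied to every joint in every frame.
    return np.asarray(joints, dtype=float) @ rotation.T


def project_2d(joints):
    """Orthographic projection: drop the depth (z) coordinate."""
    return joints[..., :2]


rng = np.random.default_rng(0)
skeleton = rng.normal(size=(100, 20, 3))      # toy clip: 100 frames, 20 joints

new_view = project_2d(rotate_view(skeleton, yaw_deg=30.0, pitch_deg=10.0))
print(new_view.shape)
```

Because a rotation preserves distances between joints, the pose itself is unchanged and only the apparent viewpoint differs, which is the property a synthetic view needs.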
To accomplish the proposed objectives of MVSTFE, we select a target view from each class for inference on the trained 3D CNN in Fig. 2, as shown in Fig. 3. The outputs of Fig. 3 are the features extracted from each of the individual views for the input target view. These target view features are combined using the non-uniform linear combiner by computing the linear combination coefficients $\lambda_i$ using the spectral embedding of Laplacian eigenmaps. The hyperparameter $\delta$ for MVSTFE is selected iteratively for each dataset: KLEF3DSL 2Dskeletal ($\delta = 0.54$), NTU RGB-D ($\delta = 0.71$), SBU Kinect Interaction ($\delta = 0.94$), KLYoga3D ($\delta = 0.83$) and KL3D MVaction ($\delta = 0.57$). Finally, the generated combined view target features are used for classification. Specifically, to test the robustness of the features in the classification process, we standardized it by training and inferencing on benchmark CNN architectures. However, these architectures are reduced in layers and depth to accept feature inputs of size 100×100. Moreover, the regular 2D convolutional layers in these models were replaced with 3D layers. This has been done to directly extract spatio temporal features from the network. To demonstrate the usefulness of these view invariant features, we formulated multiple performance evaluation procedures on the classifier, as presented in the following sections.

B. One-to-One Classifier Performance Evaluation
The one-to-one cross view recognition experiment is conducted by training the classifier in Fig. 6 with one view global target feature representing all views and inferencing on a different view. Specifically, the key aspect of this experiment is to test the robustness of the generated view invariant features in estimating a class label based on the constituent views from which they are formulated. To demonstrate this, we designed a CNN inspired by VGG-16 with 6 convolutional layers, 3 maximum pooling layers, one flatten layer and 2 dense layers. The network is trained with the generated view invariant features in each class and tested with view specific features. We selected a learning rate of 0.01 for this network with categorical cross entropy loss and the Adam optimizer. Subsequently, the above procedure is repeated for all datasets with the same hyper parameters. Furthermore, three benchmark architectures, Inception-V4, GoogleNet and ResNet-50, were trained and tested. However, vanishing gradient and overfitting problems were eliminated by re-designing the architectures with only half the layers of the original models.
On the other hand, the structure of the original models was preserved to achieve the highest performance. Eventually, mRA is computed during inferencing and the maximum value over 10 folds is presented in Table I for all the datasets.
After examining the mRA in Table I, it is evident that all the models perform well on test views that have more visual information when compared to views with overlapping joints. The outcomes from Table I   classes. Moreover, the proposed work also highlights the use of any single view for testing, as against the previous models, where all views are required as input. Consequently, it will be interesting to test the many-to-one cross view performance, where the models are trained with view specific features and tested with only one target view invariant feature.

C. Many-to-One Classifier Performance Evaluation
Here, we train the classifiers with all the views and test them with only one target view feature. Table II shows mRA values for multiple sets of training views. The results in Table II show that the performance of the MVSTFE model increases when it is trained with multiple view features. Moreover, Inception-V4 outperforms all other classifiers used for experimentation because it contains multiple attention layers for selecting maximally contributing vectors.

D. Comparisons against other View Invariant Generation Techniques
The previous models, which apply spectral clustering with matrix factorization [28], auto-weighted spectral clustering [7] and multi view temporal ensemble [6], are designed to generate complementary views and correspondingly reconstruct a global view. Additionally, the number of views used in these models is comparatively lower than in our proposed work. Increasing the number of views in the above models will increase the computational complexity, which was reduced in MVSTFE. Table III   for generation and classification of view video data. Since the data used in these methods were different, we recreated these models from scratch as given in their respective manuscripts.
All the experiments were conducted on the benchmark skeletal datasets used in this work with a one-to-one train-test pattern. We present our best result, obtained from the Inception-V4 classifier, in this comparison. The hyper parameters for the comparison networks were adopted from our Inception-V4. The proposed MVSTFE has outperformed the existing models, as can be seen in Table IV.

V. CONCLUSION
This work proposed a deep learning based spectral embedding method for generating a single global view from a set of multi view features. We trained a 3D CNN on each of the available views and performed inference on target view video data to extract features. These target features are combined linearly by calculating the mixing coefficients to make a global feature representation for all possible views. The mixing coefficients are computed using spectral embedding in the Laplacian eigenspace, which preserves proximity between views within the class label. Experimentation has shown that the proposed MVSTFE on 2D video based skeletal sign language dataset and the benchmark action