Action Recognition using Key-Frame Features of Depth Sequence and ELM

Recently, the rapid development of inexpensive RGB-D sensors, such as the Microsoft Kinect, has provided adequate information for human action recognition. In this paper, a recognition algorithm is presented in which the feature representation is generated by concatenating spatial features from the human contour of key frames and temporal features from the time difference information of a sequence. Then, an improved multi-hidden-layer extreme learning machine is introduced as the classifier. Finally, we test our scheme on the public UTD-MHAD dataset in terms of recognition accuracy and time consumption.

Keywords: Action recognition; features; key frame; temporal; extreme learning machine


I. INTRODUCTION
Action recognition has been a hot research topic due to its wide range of applications in many areas, such as intelligent video surveillance, smart living and human-computer interaction [1]-[3]. Although many achievements have been reported in recent years, human action recognition remains difficult [4], [5]. The main challenges are high intra-class variability, e.g. the same action performed by different subjects, and inter-class similarity, e.g. different actions captured from the same person. These issues constrain the further progress of video technology based on RGB sequences [6], [7]. However, the release of the Kinect sensor offers a new approach and more information for resolving these problems [8]-[10]. The Kinect sensor provides high-resolution RGB images, depth maps and skeleton data at the same time. Compared with traditional color sequences, depth sequences are invariant and stable with respect to illumination and body appearance. In addition, they provide body structure and shape information for action classification.

Based on these advantages, many methods have been proposed in recent years. In [11], Chen et al. projected depth images onto three orthogonal planes and produced depth motion maps (DMMs) by stacking the projected maps. Histogram of oriented gradients (HOG) [12] was then used as the feature descriptor. Xia et al. [13] detected STIPs directly from depth maps (called DSTIPs) and then used a correction function to remove interest points resulting from noise. For every DSTIP, they further extracted a depth cuboid similarity feature (DCSF), which describes the local 3D depth cuboid, whose size is adaptable.

Besides methods that use only depth sequences, some methods combine multiple sources of information for action recognition, such as color data, skeleton data and depth maps. Ni et al. [14] proposed two multimodality fusion methods, which are simply based on the concatenation of color and depth sequences. Moreover, two feature representation methods were introduced for action classification. Zhang et al. [15] extracted a coarse depth-skeleton (DS) feature by utilizing gradient information from the depth sequence and distance information from skeletal joints. To refine the coarse DS feature, they combined sparse coding and max pooling. Random Decision Forests (RDF) were then used to classify different actions. Hsu et al. [16] introduced a new scheme that produces Spatio-Temporal Matrix Intensity (STMI) images from raw RGB images and Spatio-Temporal Matrix Depth (STMD) images from depth images. This method was demonstrated to be view-invariant. HoG and HoF features were generated by constructing BoW-Pyramids, which made it possible to classify reversed actions, such as sit-to-stand and stand-to-sit. Finally, the resulting representation was used to train a support vector machine (SVM) to recognize different actions. Theoretically, combining data of different types can effectively improve the recognition rate. However, the difficulties and disadvantages are not negligible: feature selection, the fusion of features with different dimensions, and training and testing time consumption all bear heavily on whether an algorithm can be used online.
Inspired by the effectiveness of depth-based action recognition, in this paper we propose a novel recognition algorithm using depth maps. To reduce the computational burden, key frames are extracted from the skeleton sequence by using joints as spatial-temporal interest points (STIPs) and are mapped into the depth sequence to represent an action sequence. The human contour is extracted from each key frame. A feature representation is then introduced that includes features obtained from the human contour and from temporal differences. Finally, an improved multi-hidden-layer extreme learning machine is used as the classifier for action recognition.

The rest of the paper is organized as follows. In Section 2, we introduce the key-frame extraction technique. Section 3 describes the proposed feature representation method. In Section 4, an improved multi-hidden-layer extreme learning machine is presented for performing action recognition. In Section 5, experimental results demonstrate the effectiveness of our framework in terms of recognition accuracy and time consumption. Finally, we conclude our work in Section 6.

II. KEY FRAMES EXTRACTION
Key frames are usually regarded as the most informative frames because they capture the major elements of a sequence. Key-frame extraction approaches can be roughly divided into two categories: those based on the interframe difference and those based on clustering [17], [18]. In interframe-difference approaches, a new key frame is extracted if the interframe difference exceeds a set threshold. Clustering-based approaches look for similar low-level features across frames and group them; the frame located closest to each cluster center is then selected as the key frame [19], [20]. In this paper, because the skeleton provides detailed body joint positions, a key-frame extraction method based on distance difference accumulation is proposed. Define a joint position as $P_{i,j} = \{x_{i,j}, y_{i,j}, z_{i,j}\}$, where $i$ is the frame index and $j$ is the joint index. The accumulated difference of the $i$th frame can be calculated as follows:

$$D_i = \sum_{j=1}^{n} \left\| P_{i,j} - P_{i-1,j} \right\|$$

where $\|\cdot\|$ and $n$ denote the Euclidean distance and the number of skeletal joints, respectively.
Usually, the key frames are defined as the frames with maximum or minimum $D_i$ within a sliding window. However, in most cases $D_i$ is low in the first or last several frames, or shows extremes in the intervening time. As a result, the extracted key frames become concentrated, and the sequence cannot be expressed accurately and comprehensively. We propose the following steps to address these issues:
1) For an action video with $N$ frames, accumulate the total difference from the second frame to the $N$th frame.
2) Set the number of key frames to $K$ and calculate the average difference increment.
3) From the second frame to the $L$th frame, calculate the difference $W_L$ for each $k \in K$. This yields a set $\{W_L\}$ whose minimum value is attained at the $s$th frame, so the $s$th frame is the key frame.
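The selection steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that $D_i$ sums the Euclidean displacement of every joint between consecutive frames, that the average increment is the total difference divided by $K$, and that the $k$th key frame is the frame whose running total is closest to $k$ times that increment. The function names and the 0-based frame indexing are hypothetical.

```python
import math


def accumulated_differences(skeleton):
    """skeleton[i][j] is the (x, y, z) position of joint j in frame i.

    Returns D, where D[i] is the summed Euclidean displacement of all joints
    between frame i-1 and frame i (D[0] is 0.0: frame 0 has no predecessor).
    """
    diffs = [0.0]
    for prev, curr in zip(skeleton, skeleton[1:]):
        diffs.append(sum(math.dist(p, q) for p, q in zip(prev, curr)))
    return diffs


def select_key_frames(skeleton, num_key_frames):
    """Return 0-based indices of the selected key frames (needs >= 2 frames)."""
    diffs = accumulated_differences(skeleton)
    total = sum(diffs)
    step = total / num_key_frames  # assumed "average difference increment"

    # running[L] is the motion accumulated up to and including frame L.
    running, acc = [], 0.0
    for d in diffs:
        acc += d
        running.append(acc)

    key_frames = []
    for k in range(1, num_key_frames + 1):
        target = k * step
        # Assumed W_L = |running[L] - k * step|; keep the frame minimising it.
        best = min(range(1, len(running)),
                   key=lambda L: abs(running[L] - target))
        key_frames.append(best)
    return key_frames
```

For a single joint moving one unit per frame over seven frames, `select_key_frames(frames, 3)` returns `[2, 4, 6]`. When all the motion happens in the middle of the sequence, the selected frames move to where the motion occurs instead of being spread evenly in time.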
The improved algorithm can effectively extract key frames that express the whole sequence. The key-frame numbers are mapped to the depth sequence, and the human contour is then extracted. A complete overview of the stages involved can be seen in Fig. 1. We select the action 'draw circle (clockwise)' from the UTD Multimodal Human Action Dataset (UTD-MHAD) as an example [21]. The first row of Fig. 1 shows the six key frames extracted from the skeleton sequence. The second row shows the corresponding depth images. The third row shows the human contours obtained from the second-row images. To facilitate feature representation in the next step, data smoothing and curve fitting are applied during contour extraction.

III. FEATURE REPRESENTATION
For a contour with $m$ points $CP = \{cp_1, cp_2, \ldots, cp_m\}$, taking the center of mass as the origin, we divide the contour into $Q$ radial bins of equal angle. Then, based on the work in [22], [23], we extract features from the contour. The pointwise Euclidean distance between each contour point $cp_i$, $\forall i \in \{1, \ldots, m\}$, and the center of mass is calculated and recorded.
Because the contour points must be in a consistent order, each contour point $cp_i$ is assigned to a corresponding bin $q_i$. The feature of each bin is then computed, where $u_i$ is the average distance of the contour points in bin $q_i$. Next, we concatenate the features of the $Q$ bins to form a feature $V_k^d$ for the $k$th key-frame image based on its contour, where $d$ is the feature dimension.
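The radial binning described above can be illustrated with the following sketch. It rests on several assumptions that the text does not confirm: the contour is a list of 2D points, the center of mass is the mean of the contour points, bins are assigned by polar angle, and the per-bin feature is simply the mean center distance of the points in that bin. The function name `radial_bin_features` is hypothetical.

```python
import math


def radial_bin_features(contour, num_bins):
    """Summarise a contour by the mean centre distance in each radial bin.

    contour: list of (x, y) points; num_bins: Q equal-angle sectors.
    Returns a list of Q values; bins without points get 0.0.
    """
    # Assumed centre of mass: the mean of the contour points.
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)

    sums = [0.0] * num_bins
    counts = [0] * num_bins
    sector = 2.0 * math.pi / num_bins
    for x, y in contour:
        distance = math.hypot(x - cx, y - cy)
        angle = math.atan2(y - cy, x - cx) % (2.0 * math.pi)
        # Clamp in case rounding puts the angle exactly on 2*pi.
        q = min(int(angle / sector), num_bins - 1)
        sums[q] += distance
        counts[q] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

Concatenating the values returned for one key frame gives a fixed-length vector of dimension $Q$, regardless of how many points the contour has, which is what allows contours of different sizes to be compared.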
However, our analysis shows that some distinct activities can be very similar to each other in their key frames. For example, for the two different activities 'sit to stand' and 'stand to sit', the high similarity of the key frames makes misclassification very likely. They contain almost identical frames that differ only in temporal order. We therefore calculate the time difference of the key frames as temporal features, which effectively helps to distinguish such actions.

Assume that the original frame number of the $k$th key frame in the depth sequence is $k'$, and that the feature of the $k$th key frame obtained above is $V_k^d$. The temporal difference of the feature vector is denoted $V_k^{td}$. The final feature $V$ of a key frame is the concatenation of the spatial feature $V_k^d$ and the temporal feature $V_k^{td}$.
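One way to realise such a temporal feature is sketched below. The exact definition is an assumption made for illustration: the temporal feature of a key frame is taken as the change of its spatial feature relative to the previous key frame, divided by the number of original frames between them, with zeros for the first key frame. The function names are hypothetical.

```python
def temporal_difference_features(spatial, frame_numbers):
    """spatial[k] is the spatial feature of key frame k; frame_numbers[k] is
    its original frame number k' in the depth sequence (strictly increasing).
    """
    temporal = [[0.0] * len(spatial[0])]  # no predecessor for the first key frame
    for k in range(1, len(spatial)):
        gap = frame_numbers[k] - frame_numbers[k - 1]
        temporal.append([(a - b) / gap
                         for a, b in zip(spatial[k], spatial[k - 1])])
    return temporal


def fuse_features(spatial, frame_numbers):
    """Concatenate each spatial feature with its temporal difference."""
    temporal = temporal_difference_features(spatial, frame_numbers)
    return [list(s) + t for s, t in zip(spatial, temporal)]
```

Under this assumed definition, playing the same key frames in reverse order flips the sign of the temporal part while the spatial part stays the same, which is the property needed to separate pairs such as 'sit to stand' and 'stand to sit'.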

IV. ACTION RECOGNITION
Extreme learning machine (ELM) was proposed by Huang et al. [24], [25] as a novel learning algorithm based on single-hidden-layer feedforward neural networks (SLFNs). In ELM, the input weights and first-hidden-layer biases are assigned randomly instead of being learned. This makes learning and classification extremely fast and particularly suitable for online applications.
A standard SLFN has $Z$ hidden neurons. The input weight vector connects the $i$th hidden neuron and the input, $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the weight vector connecting the $i$th hidden node to the output nodes, and $b_i$ is the bias term of the $i$th hidden neuron. If the SLFN can approximate the $N$ samples with zero error, the network can be expressed as a matrix calculation in terms of the matrix $H$ and the output weight $\beta$. In ELM, the input weights and biases are initialized randomly, and the output weight is generated by solving a least-squares problem for $\beta$. When the number of hidden nodes equals the number of input samples, the resulting matrix $H$ is square and invertible. In practical applications, however, the number of hidden nodes usually differs from the number of input samples, which makes $H$ non-invertible. As a result, $\beta$ is obtained as a least-squares solution.

To enhance the stability of the numerical solution of SLFNs, a regularization coefficient $\varphi$ is introduced, following ridge regression and Tikhonov regularization. The least-squares solution of (8) can be expressed as follows:

$$\beta = H^T (H H^T + \varphi I)^{-1} T \quad (10)$$

Therefore, the output function of ELM can be expressed as $o(x) = g(x)\beta$. $P_k^s$ can then be formulated as in (12), where $c_q \in \{0, 1\}$, $q = 1, \ldots, n$, is a binary variable that expresses whether an input is retained or not.
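The regularized least-squares step can be illustrated with a small single-hidden-layer ELM written with the Python standard library only. This is a sketch of the standard ELM of Huang et al. with a ridge term, not the improved multi-hidden-layer variant used in this paper. The sigmoid activation, the uniform weight range, the one-hot target matrix and every function name are assumptions made for the example.

```python
import math
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def hidden_output(x, weights, biases):
    """Hidden-layer output matrix H: one row per sample, one column per neuron."""
    return [[1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(ws, sample)) + b)))
             for ws, b in zip(weights, biases)]
            for sample in x]


def train_elm(x, targets, num_hidden, phi, seed=0):
    """Random input weights and biases; output weights from the ridge solution."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in x[0]] for _ in range(num_hidden)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(num_hidden)]
    h = hidden_output(x, weights, biases)
    gram = matmul(h, transpose(h))          # H H^T, size N x N
    for i in range(len(gram)):
        gram[i][i] += phi                   # add phi * I
    beta = matmul(transpose(h), solve(gram, targets))  # H^T (H H^T + phi I)^-1 T
    return weights, biases, beta


def predict(x, weights, biases, beta):
    """Class index with the largest output score for each sample."""
    scores = matmul(hidden_output(x, weights, biases), beta)
    return [max(range(len(row)), key=row.__getitem__) for row in scores]
```

Because only a linear system is solved, training needs no iterative weight updates. The sketch solves an $N \times N$ system, which suits the case of fewer training samples than hidden neurons; the coefficient `phi` keeps that system well conditioned even when $H H^T$ alone would be singular.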
Multi-hidden-layer ELM is a multilayer neural network based on the extreme learning machine. It can approximate complicated functions and needs no iteration during the training process. It also offers much better generalization performance and processing rate.
For a single-hidden-layer feedforward neural network, the activation function is usually defined as a sigmoidal function. For multi-hidden-layer ELM, however, we define the activation functions differently. Once $P_k^s$ is calculated, it is used to determine the final classification of an input.

V. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we test the proposed action recognition scheme on the public UTD-MHAD dataset [21], which consists of depth sequences and skeleton data. Our method is then compared with several existing methods.

A. UTD-MHAD Dataset and Tests Setting
The dataset records 27 different actions performed by 8 persons (4 females and 4 males). Each subject repeated each action 4 times. The subjects were required to face a Kinect during the performance. We follow the same experimental settings as reported in [26]. 20 actions are divided into three subsets, as illustrated in Table 1. In test one, half of the action samples are used for training and the rest for testing; in test two, 3/4 of the action samples are used for training; and in the cross-subject test, half of the subjects (1, 3, 5, 7) are used for training and the remaining subjects for testing.
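The cross-subject setting can be written down as a short sketch. The sample records below, dictionaries with `action`, `subject` and `trial` keys, are a hypothetical layout chosen for illustration and do not reflect the dataset's actual file format. Only the cross-subject split is shown, because the text does not state how samples are chosen for tests one and two.

```python
TRAIN_SUBJECTS = {1, 3, 5, 7}  # odd-numbered subjects train, the rest test


def cross_subject_split(samples):
    """Split sample records into (train, test) by the performing subject."""
    train = [s for s in samples if s["subject"] in TRAIN_SUBJECTS]
    test = [s for s in samples if s["subject"] not in TRAIN_SUBJECTS]
    return train, test


# Hypothetical index: every action performed 4 times by each of 8 subjects.
samples = [{"action": a, "subject": s, "trial": t}
           for a in range(1, 28)
           for s in range(1, 9)
           for t in range(1, 5)]
train, test = cross_subject_split(samples)
```

No subject appears on both sides of this split, which is why the cross-subject test is the hardest of the three settings: the classifier is evaluated only on people it has never seen.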

B. Comparison with Other Methods
To evaluate the effectiveness of the proposed approach, our method is compared with existing methods and the obtained classification accuracies are recorded. Three algorithms are selected. The first is the algorithm from [13], which uses spatio-temporal information and the depth cuboid similarity feature (DCSF), followed by bag-of-words for classification. The second is the algorithm reported in [27], a depth motion maps (DMM)-based human action recognition method using an l2-regularized collaborative representation classifier. The third is the method in [28], in which skeleton joint position information with temporal difference is used as the final feature and an extreme learning machine is used for action recognition. The comparison results are listed in Table 2, with the best recognition results highlighted in bold. Our scheme outperforms the approach published in [13] in all three test cases. For the challenging cross-subject test, the algorithm in [28] produces better results on AS2 and AS3. The most probable reason is that the actions in these two subsets are more complicated, and the accurate joint position information used in [28] can effectively address high intra-class variability and inter-class similarity. In test one and test two, our results are slightly lower than those of C. Chen's method [27] only on action set 1, by 0.3% to 0.6%, while our method shows the highest recognition rate in the overall results.

The real-time efficiency of the proposed scheme is also reported. There are three major processing components: key-frame computation, feature extraction and fusion, and classification. In Table 3, we list the average time needed by each component for the UTD-MHAD dataset. All the experiments are carried out using MATLAB on a PC equipped with an Intel Xeon 3.4 GHz CPU and 16 GB of memory [29]. Table 3 shows that the proposed scheme can be applied to real-time depth video processing, which requires a processing rate of not less than 30 frames per second.

VI. CONCLUSION
In this work, we present an action recognition scheme for Kinect-captured data. We extract features from the human contour of key frames in the depth sequence and calculate temporal differences as a constraint. We use an improved multi-hidden-layer extreme learning machine as the classifier for its high classification accuracy and low time consumption. Experimental results indicate that the proposed features can be obtained easily and also provide distinctive information for classification. To extend our work, we plan to conduct experiments involving human-human interactions using the method proposed in this paper.

Fig. 1. Human contour based on key-frame extraction.

III. FEATURE REPRESENTATION
To a contour with m points $CP = \{cp_1, cp_2, \ldots, cp_m\}$, the

used as weight vector, which connects the $i$th hidden neuron and the input.
$\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is defined as the weight vector connecting the $i$th hidden node to the output nodes.
$b_i$ is applied as the bias term of the $i$th hidden neuron. If SLFNs can approximate the $N$ samples with zero error, which means that

) can be expressed as follows:

$$\beta = H^T (H H^T + \varphi I)^{-1} T \quad (10)$$

Therefore, the output function of ELM can be expressed as $o(x) = g(x)\beta$.
$= 1, \ldots, Z$ is variable quantity of activation function of hidden layer nodes. Then, we construct a matrix $N \times m$

Table 2 column headers: X Lu et al. [13]; C Chen et al. [27]; X Chen et al. [28]; Our method. First row group: Test One.
, we list the average time needs of each component for the UTD-MHAD dataset.All the experiments are carried out using MATLAB on a PC equipped with Intel Xeon 3.4 GHz CPU with 16 GB

TABLE III. PROCESSING TIMES ASSOCIATED WITH THE COMPONENTS OF OUR METHOD