Regression-Based Feature Selection on Large Scale Human Activity Recognition

In this paper, we present an approach to regression-based feature selection for human activity recognition. Because the features in human activity recognition are high dimensional, models tend to over-fit and cannot learn their parameters well; moreover, many of the features are redundant or irrelevant. The goal is to select the most discriminative features for recognizing human activities in videos. The R-Squared regression criterion identifies the best features by measuring how well each feature explains the variation in the target class. The feature set is reduced by nearly 99.33%, resulting in better classification accuracy. A Support Vector Machine with a linear kernel is used to classify the activities. The experiments are conducted on the UCF50 dataset, and the results show that the proposed model significantly outperforms state-of-the-art methods.

Keywords: Action Bank; Template Matching; Spatiotemporal Orientation Energy; Correlation; R-Squared; Support Vector Machine; Logistic Regression; Linear Regression; Human Activity Recognition


I. INTRODUCTION
Human activity recognition is an active research area in artificial intelligence, human-computer interaction and computer vision. Applications include patient monitoring systems, surveillance systems, human-computer interfaces, virtual reality, motion analysis, robot navigation, video indexing and browsing, choreography, and more. Human activities are conceptually partitioned by complexity into four categories: gestures, actions or activities, group activities, and interactions. Nowadays, digital cameras record much of people's daily activity, which makes video sources on the internet abundant and raises the problem of video categorization: how should a new input video be assigned to an activity class? Manually classifying real-world video collections is impractical and time-consuming. Many researchers have therefore addressed this problem by building recognition models in which feature descriptors extracted from training videos are used to automatically recognize the activities in new videos [1], [2], [3].
Feature selection is a significant step in human activity recognition: it identifies the minimum number of features that improve the accuracy of the model. Moreover, models with fewer features are simpler and faster to build and to understand. The main types of feature selection are filter, wrapper, and embedded methods; the last type selects features as part of the machine learning process itself.
Filter methods evaluate features using properties of the data, independently of any learning method; they use statistical measures such as information gain or correlation (used, for example, as splitting criteria in decision trees) to assess how well each feature partitions the dataset. Wrapper methods score features using the predictions of a machine learning algorithm, feeding the predictive estimates back as a selection criterion. Embedded methods perform feature selection during the training process itself; a common example is regularization, which penalizes irrelevant features (shrinking their coefficients) during the optimization of a predictive model, thereby selecting the most important features and reducing complexity (over-fitting), as in LASSO and Ridge regression. Embedded selection is efficient because it needs no separate validation split, and it is fast because no per-feature re-training is necessary. Wrapper methods usually give better results than filters but at a higher computational cost; embedded methods strike a good balance between performance and cost [4], [5].
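As a minimal sketch of the embedded (LASSO-style) selection described above, the following hypothetical example uses coordinate descent with soft-thresholding on synthetic data; the L1 penalty drives the coefficients of irrelevant features to exactly zero. All names and parameter values here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
# only features 0 and 1 matter; the other 8 are irrelevant
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ b + X[:, j] * b[j]      # residual excluding feature j
            rho = X[:, j] @ r / n
            # soft-thresholding: the L1 penalty zeroes weak coefficients
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

b = lasso_cd(X, y, lam=0.2)
selected = [j for j in range(d) if abs(b[j]) > 1e-6]
print(selected)
```

Only the two truly informative features survive the penalty, which is exactly the pruning behavior that makes embedded methods attractive for redundant, high-dimensional feature banks.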
The rest of this paper is organized as follows. Section II discusses related work. Section III presents the model framework. Section IV presents feature detection based on spatiotemporal orientation energy and feature description based on max-pooling of template matching. Section V presents the feature selection process, which is mainly based on the R-Squared regression model. The support vector machine classifier is introduced in Section VI. Section VII shows the simulation results, and the conclusions are summarized in Section VIII.

II. RELATED WORKS
At present, local spatiotemporal features are the most popular techniques for video representation. These techniques rely on detectors and descriptors: detectors capture spatiotemporal interest point locations, e.g., Cuboids [6] and Harris3D [7], and descriptors such as HOG3D [8] or HOG/HOF [9] are extracted around those points. Pre-learned codebooks then quantize the extracted features, and Bag of Visual Words (BoVW) [10] can model the videos. Local descriptors are repeatable and describe the appearance and motion of a local cuboid around each interest point; thanks to their simplicity and repeatability, they are robust to deformation and intra-class variability. Their drawback is that they capture only low-level information, not high-level motion, which limits the discriminative power of the features. Many recent works address this issue with high-level models such as Silhouette [11], Space-time Shape [12], and Motion Energy and History Images [13]. A more recent approach is Action Bank [14], in which a large collection of activity detectors is applied to the input videos and the responses form a rich video representation. The detectors are global, discriminative activity templates; however, such global features are sensitive to deformation and intra-class variation.

III. THE PROPOSED FRAMEWORK
The proposed model of human action recognition is composed of four steps: feature detection, feature description, feature selection and classification (see Fig. 1). The algorithm for each step is described in detail in the following sections.

Fig. 1: The Proposed Framework of Human Action Recognition (feature detection via spatiotemporal orientation energy, feature description via template matching, feature selection, and classification)

IV. FEATURE DETECTION AND DESCRIPTION
Videos are represented via high-level features using the Action Bank representation [14], which is closely related to Object Bank [15]. It represents a video through a set of action detectors, each producing a correlation volume. The base element is a template-based action detector, which is invariant/robust to variations in appearance, scale, viewpoint, and tempo.

A. Spatiotemporal Orientation Energy
Motion energies can represent an activity or video along various spatiotemporal orientations: a composition of energies along different space-time orientations captures the motion at a point when the video is decomposed. These energies are the basis of the low-level activity representation. The decomposition into spatiotemporal orientation energies is performed with third derivatives of a 3D Gaussian steerable filter, which measures the strength of motion and acts as a local filter. Let G3_theta(x) denote the 3D Gaussian third-derivative filter, where x = (x, y, t) is a spatiotemporal location and theta is a unit vector giving the 3D direction. The spatiotemporal orientation energy is computed at every pixel as

E_theta(x) = sum over x' in Omega(x) of (G3_theta * V)^2 (x'),    (1)

where Omega(x) is a local region around x, V ≡ V(x) is the input video, and (*) denotes convolution. Gaussian derivative filters are separable and steerable, which allows the spatiotemporal orientation energy to be estimated without executing a convolution for every direction. The convolution result is pointwise squared and summed over the space-time neighborhood Omega to obtain the energy measurement.
Marginalization of the energy removes the influence of spatial orientation. A basis set of N + 1 = 4 third-order filters (N = 3 being the order of the Gaussian derivative) is computed according to conventional steerable filter theory, with directions

theta_i = cos(i*pi/4) * theta_a(n) + sin(i*pi/4) * theta_b(n),  0 <= i <= 3,    (2)

where theta_a(n) = n x e_x / |n x e_x|, theta_b(n) = n x theta_a(n), and e_x is the unit vector along the spatial x axis in the Fourier domain. This basis set makes it possible to compute the energy along any frequency-domain plane (i.e., any spatiotemporal orientation) with normal n by a simple sum

E_n(x) = sum over i = 0..3 of E_theta_i(x),    (3)

with theta_i one of the four directions calculated according to (2).
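As a small sketch of the basis-direction construction just described (assuming the cos/sin combination of theta_a and theta_b, which is the standard steerable-filter form; the function name is hypothetical):

```python
import numpy as np

def basis_directions(n):
    """Four steerable-basis orientations theta_i lying in the
    frequency-domain plane with unit normal n (n must not be
    parallel to the spatial x axis)."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    e_x = np.array([1.0, 0.0, 0.0])        # unit vector along spatial x
    theta_a = np.cross(n, e_x)
    theta_a /= np.linalg.norm(theta_a)
    theta_b = np.cross(n, theta_a)         # already unit length
    return [np.cos(i * np.pi / 4) * theta_a + np.sin(i * np.pi / 4) * theta_b
            for i in range(4)]

# each theta_i is a unit vector orthogonal to n, i.e. it lies in the plane
thetas = basis_directions([0.0, 1.0, 1.0])
for t in thetas:
    print(np.round(t, 4))
```

Because every theta_i lies in the plane with normal n, summing the four basis energies marginalizes over spatial orientation within that plane, as in (3).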
For the action bank detector, seven raw spatiotemporal energies are defined (via different n): static E_s, leftward E_l, rightward E_r, upward E_u, downward E_d, flicker E_f, and lack of structure E_o (which is computed as a function of the other six and peaks when none of them has strong energy). We have experimentally found that these seven energies do not always sufficiently discriminate an action from common background. Since lack of structure E_o and static E_s are not associated with any action, their signal is used to separate the salient energy from each of the other five energies, yielding a five-dimensional pure orientation energy representation over the channels i in {l, r, u, d, f}. Finally, the five pure energies are normalized so that the energy at each voxel sums to one over the five channels.
There is neurophysiological evidence that mammalian brains have an action bank-like representation of human motion. Perrett et al. discovered that neurons in the superior temporal sulcus of the macaque monkey brain are selective for certain types of mammalian motion, such as head rotation. Early research in human motion perception has also suggested that humans recognize complex activities as compositions of simpler canonical motion categories, such as that of a swinging pendulum [14]. Most significantly, other neurophysiological research suggests that view-specific representations are constructed in the visual pathway: recognition of certain point-light motions degrades with the angle of rotation away from the learned viewpoint. Such view-specific exemplars (templates) of action are exactly what make up the action bank (see Fig. 2).

Fig. 2: The spatiotemporal orientation energy representation [14]. Seven raw spatiotemporal energies are defined for different velocities: static E_s, leftward E_l, rightward E_r, upward E_u, downward E_d, flicker E_f, and lack of structure E_o. The lack-of-structure energy is calculated as a function of the other six energies and peaks when none of them responds strongly; its purpose is to suppress the instabilities of low-energy points and obtain a saliency signal. The pure energies are extracted by subtracting the background and noise energies and are normalized to reduce the influence of illumination and contrast changes.

B. Template Matching
An activity in a small video, called the "template video", is detected within a larger "search video" by scanning the 3D template over all positions in space-time. Similarity at each location is computed between the histograms of oriented energy of the template and of the search video. The recent "action spotting" detector is applied because of its desirable properties: invariance to activity localization and appearance variation, a natural interpretation through the decomposed oriented energies, and efficiency [16], [14]. The correlation between the template video T and the search (query) video V is calculated with the Bhattacharyya coefficient m(.):

M(x) = sum over u of sqrt( T(u) * V(x + u) ),    (4)

where M(.) denotes the resulting correlation volume and u ranges over the spatiotemporal support of the template. The correlation is efficiently performed in the frequency domain, and the output lies between 1 (a complete match) and 0 (a complete mismatch), which suits the volumetric max-pooling step that follows.
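A brute-force sketch of this Bhattacharyya correlation follows (hypothetical Python on a single-channel toy volume; the paper's implementation works per oriented-energy channel and in the frequency domain for efficiency). Planting a sum-normalized template inside an otherwise empty search volume should give a correlation of exactly 1 at the true offset and less elsewhere.

```python
import numpy as np
from itertools import product

def bhattacharyya_match(T, V):
    """Slide template T over search volume V (both nonnegative);
    M[x] = sum_u sqrt(T[u] * V[x + u]). With T and the matching
    window normalized as distributions, M lies in [0, 1]."""
    t, h, w = T.shape
    Tv, Hv, Wv = V.shape
    M = np.zeros((Tv - t + 1, Hv - h + 1, Wv - w + 1))
    for i, j, k in product(*(range(s) for s in M.shape)):
        M[i, j, k] = np.sum(np.sqrt(T * V[i:i+t, j:j+h, k:k+w]))
    return M

T = np.random.default_rng(1).random((2, 3, 3))
T /= T.sum()                      # treat the template as a distribution
V = np.zeros((6, 8, 8))
V[3:5, 2:5, 4:7] = T              # plant the template at offset (3, 2, 4)
M = bhattacharyya_match(T, V)
print(np.unravel_index(M.argmax(), M.shape), round(float(M.max()), 6))
```

At the planted offset the sum collapses to sum_u sqrt(T[u]^2) = sum_u T[u] = 1, the "complete match" value mentioned above.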
Let N_a denote the number of detectors in a given action bank and N_s the number of activity scales (run times); the correlation step then produces N_a x N_s correlation volumes. The max-pooling technique of [17] is adapted, as in Fig. 3, to three octree levels, giving 1^3 + 2^3 + 4^3 = 73 maxima, i.e., a 73-dimensional vector per volume [14]. The total feature vector length for each video is therefore N_a x N_s x 73.
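The three-level volumetric max-pooling can be sketched as follows (a hypothetical minimal version: the volume is split into 1x1x1, 2x2x2 and 4x4x4 grids of cells and the maximum of each cell is kept, giving 1 + 8 + 64 = 73 values):

```python
import numpy as np

def volumetric_max_pool(M):
    """Three-level octree max-pooling of a correlation volume M:
    returns 1 + 8 + 64 = 73 cell maxima."""
    feats = []
    for g in (1, 2, 4):                          # grid resolution per level
        # split each axis into g roughly equal slabs
        edges = [np.linspace(0, s, g + 1).astype(int) for s in M.shape]
        for i in range(g):
            for j in range(g):
                for k in range(g):
                    cell = M[edges[0][i]:edges[0][i+1],
                             edges[1][j]:edges[1][j+1],
                             edges[2][k]:edges[2][k+1]]
                    feats.append(cell.max())
    return np.array(feats)

v = volumetric_max_pool(np.random.default_rng(0).random((8, 16, 16)))
print(v.shape)
```

The first entry (the 1x1x1 level) is the global maximum of the volume, and the finer levels localize where in space-time the strong detector responses occur.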
Although more elaborate classifiers could be used, as in Object Bank [15], we have not found them to outperform the standard hinge loss with L2 regularization. Being a template-based method, there is no training of the individual bank detectors; presently, the detector templates in the bank are selected manually. In the future, we foresee an automatic process of building the action bank by selecting best-case templates from among those possible. Nevertheless, only a small subset of the actions in the bank receive a nonzero weight in the SVM classifier. This resembles a feature selection process in which the bank detectors serve as a large feature pool and the training process selects a subset of them, mitigating the manual selection of the individual bank templates as a limiting factor. At present, the manually built action bank performs significantly better than current methods on activity recognition benchmarks [14].

The action bank allows a great deal of flexibility in choosing what kind of action detectors are used; indeed, different types of action detectors can be used concurrently. Our implementation uses the recent "action spotting" detector [16] for its desirable invariance to appearance variation and its evident capability to localize actions.

Fig. 3: Volumetric max-pooling technique [14]

V. FEATURE SELECTION
Feature selection is an important area of predictive modeling and statistics. Theory and practice have shown that feature selection is an effective way of improving learning, enhancing recognition accuracy, and decreasing the complexity of human activity recognition. In supervised learning, the objective of feature selection is higher classification accuracy [18], [19], [20].
One of the most crucial issues with high-dimensional data is determining which features should be included in a model of human activity recognition. From a practical point of view, a model with fewer features is more interpretable and less complex; statistically, it is also often more attractive, and some models are negatively affected by irrelevant features [21], [20]. R-Squared is a statistical measure of how close the data are to the fitted regression line; it is also known as the coefficient of determination (or the coefficient of multiple determination in multiple regression). The selection procedure uses a forward stepwise least-squares regression that maximizes the model R-Squared value. Feature assessment is fast, serving as a preparatory step, so predictive models can be simplified rapidly even for huge data; linear models quickly identify which input features are useful for classifying the target classes. The R-Squared feature selection criterion applies a two-step process, as follows.

A. Squared Correlations
The squared correlation coefficient is the proportion of the variation in the target class explained by a single input feature, with all other features excluded from the calculation; in statistics it is also called the Coefficient of Determination (CoD). It ranges between 0 (no relationship between the target class and the input feature) and 1 (the variation of the target class is totally explained by the input feature). In human activity recognition all input features are interval-valued, so the squared correlation coefficient is calculated from a simple linear regression:

Y = beta_0 + beta_1 * X + epsilon,    (5)

where Y denotes the response (target) variable, X the input feature, beta_0 the intercept parameter, beta_1 the slope parameter, and epsilon the error deviation of Y about beta_0 + beta_1 * X (see Fig. 4a).
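The per-feature score can be sketched in a few lines (hypothetical stdlib-only Python): the R^2 of a simple linear regression of y on one feature equals the squared Pearson correlation, so a feature that explains the target scores near 1 and an irrelevant one scores near 0.

```python
import random

def squared_correlation(x, y):
    """R^2 of a simple linear regression of y on a single feature x,
    i.e. the squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

random.seed(0)
y = [random.random() for _ in range(100)]
strong = [2 * v + 0.01 * random.random() for v in y]   # explains y well
noise = [random.random() for _ in range(100)]          # irrelevant
print(squared_correlation(strong, y), squared_correlation(noise, y))
```

Ranking all input features by this score is the first, "filter-like" step of the two-step R-Squared criterion.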
A feature has a significant influence if it explains the target, so the simple linear regression model is compared to a baseline model (Fig. 4b). The baseline regression is a horizontal fitted line over all values of the input feature, with slope equal to 0 and intercept equal to the mean of the response target, Ȳ.
Explained variability is the difference between the regression line and the baseline line; the regression sum of squares (SSR) is the amount of variability explained by the model:

SSR = sum over i of (Yhat_i - Ȳ)^2.

Unexplained variability is the difference between the actual values and the regression line; the error sum of squares (SSE) is the amount of variability left unexplained by the regression model:

SSE = sum over i of (Y_i - Yhat_i)^2.

Total variability is the difference between the actual values and the baseline regression line; the corrected total sum of squares (SST) is the sum of the explained and unexplained variability:

SST = sum over i of (Y_i - Ȳ)^2 = SSR + SSE.

Comparing the explained to the unexplained variability shows how much variability the regression line accounts for beyond the baseline (Fig. 4c). R-Squared is the proportion of the variability observed in the data that is explained by the regression line:

R^2 = SSR / SST.

In the baseline model there is no association between the response variable and the predictor variable; knowing the value of the predictor therefore does not improve predictions over simply using the mean of the response for every observation.
Note that the relationship total = explained + unexplained holds for sums of squares over all observations, not necessarily for any individual observation.

B. Forward Stepwise Regression & Logistic Regression
After the squared correlation coefficient has been calculated for all input features, further important features are selected using a forward stepwise R-Squared regression. The sequential forward procedure first chooses the feature with the highest squared correlation coefficient, i.e., the one explaining the largest amount of variation in the target class. At each subsequent iteration, the input feature giving the largest incremental increase in model R-Squared is added. The stepwise algorithm stops when no remaining input feature can meet the stop R-Squared criterion. A final logistic regression analysis is then performed using the predicted values output by the forward stepwise selection as the independent input.
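The greedy loop above can be sketched as follows (a hypothetical minimal version with a simple R-Squared-gain stopping rule standing in for the paper's stop criterion; the final logistic regression step is omitted):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares regression of y on the columns of X
    (with an intercept term)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def forward_stepwise(X, y, stop_gain=1e-3):
    """Greedily add the feature giving the largest increase in model R^2;
    stop when no candidate improves R^2 by more than stop_gain."""
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        r2, j = max((r_squared(X[:, selected + [j]], y), j) for j in remaining)
        if r2 - best_r2 <= stop_gain:
            break
        selected.append(j); remaining.remove(j); best_r2 = r2
    return selected, best_r2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 2 * X[:, 3] - X[:, 5] + 0.05 * rng.normal(size=300)
sel, r2 = forward_stepwise(X, y)
print(sorted(sel), round(r2, 3))
```

On this toy data the two informative features are picked up and the seven noise features never clear the stopping threshold, mirroring how the stepwise criterion keeps the model small.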

VI. SUPPORT VECTOR MACHINE
The concluding stage of the recognition process is the classification of the selected features into a predefined set of classes. The field of machine learning offers many powerful classification models; our goal in this stage is to employ a reliable and accurate classifier.
Human activity recognition is formulated as a multiclass classification problem in which each activity is represented by a class, and the goal is to assign a video sequence to one of the activity classes. Many supervised learning methods can be trained as activity recognizers. The Support Vector Machine (SVM) is one of the best-performing learners for human activity recognition and for high-dimensional data, owing to its strong generalization ability and highly accurate results. Based on structural risk minimization theory, the SVM avoids the over-fitting seen in neural networks. It handles a high-dimensional space by creating a maximal-margin hyperplane that separates non-overlapping classes: two parallel supporting hyperplanes are constructed, and the SVM seeks the separating hyperplane that maximizes the distance between them (Fig. 5). The larger this margin, the better the classification and the lower the generalization error.

Formally, let D = {(x_i, y_i) | x_i in R^d, y_i in {-1, +1}} be a training dataset with n observations in a d-dimensional space, where y_i denotes the class label. Vapnik showed that the problem is best addressed by allowing some examples to violate the margin constraints. The SVM handles non-separable observations by introducing a slack variable xi_i for each observation x_i, indicating how much that observation violates the soft margin: 0 <= xi_i <= 1 means the observation lies between the margins, and xi_i >= 1 means it is wrongly classified and appears on the wrong side of the hyperplane. The generalized optimal separating hyperplane is determined by solving the quadratic programming problem

minimize (1/2)||w||^2 + C * sum over i = 1..n of (xi_i)^k
subject to y_i(<w, x_i> + b) >= 1 - xi_i and xi_i >= 0 for all i.

The parameter C is a constant called the regularization constant; it controls the misclassification cost, governing the trade-off between a maximal margin and a minimal loss. The term sum over i = 1..n of (xi_i)^k denotes the loss, which becomes the hinge loss when k = 1 and the quadratic loss when k = 2. For computational reasons the SVM is usually solved in its dual formulation, using the Lagrangian method with Lagrange multipliers alpha_i. The weight vector of the decision function is then w = sum over i of alpha_i * y_i * x_i with 0 <= alpha_i <= C; the instances x_i with alpha_i > 0 are called support vectors, as they uniquely define the maximum-margin hyperplane.
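The primal objective with hinge loss (k = 1) can be minimized directly by sub-gradient descent, as in this hypothetical toy sketch on two separable Gaussian clusters (not the paper's solver, which uses the dual formulation):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Sub-gradient descent on the primal SVM objective
    (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(<w, x_i> + b)),
    i.e. hinge loss (k = 1) with L2 regularization."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                     # points violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (40, 2)), rng.normal(-2, 0.5, (40, 2))])
y = np.r_[np.ones(40), -np.ones(40)]
w, b = train_linear_svm(X, y)
acc = float(np.mean(np.sign(X @ w + b) == y))
print(acc)
```

Only margin-violating points contribute to the gradient, which is the sub-gradient counterpart of the dual-side fact that only support vectors (alpha_i > 0) define the hyperplane.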

VII. SIMULATION RESULTS
The experiments are conducted on the UCF50 action dataset [22], an activity recognition dataset with 50 activity classes composed of realistic YouTube videos. Large variations in cluttered background, camera motion, object scale, object appearance and pose, illumination conditions and viewpoint make the dataset very challenging. UCF50 contains 6680 videos in total, grouped into 25 groups; within each group the video clips share features such as the same person, a similar viewpoint, or a similar background. The classes are shown in Fig. 6 (Fig. 6: UCF50 Dataset). The experiments are run on a computer with a 2.6 GHz Intel Core i7 CPU and 16 GB of RAM, using Matlab 2013b and R-Studio. First, the features are extracted with the spatiotemporal orientation energy and then described as action bank vectors using template matching; the feature vector length is 14746 and the number of observations is 6680. The R-Squared model is then applied to select the features that explain the variation in the target; the remaining features are treated as redundant or irrelevant. The minimum R-Squared in our implementation is 0.005, which specifies the lower bound on the individual R-Squared value a feature must reach to be eligible for the selection process. The number of selected features for each action is shown in Fig. 7. On average, R-Squared selects 99 features, which is 0.67% of the original dimensionality. The remaining 99.33% of the features do not improve the performance of the model; on the contrary, they degrade recognition, because such a large number of redundant or irrelevant features causes the model to over-fit.
The UCF50 features are evaluated using 5-fold group-wise cross-validation, 5-fold video-wise cross-validation, and a held-out test set of one third (34%) of the data. A one-vs-rest SVM with a linear kernel is applied to classify the actions, with penalty C = 1 and a maximum of 25 iterations. For each action, positive video clips are labeled +1 and negative clips -1, and R-Squared selection followed by the SVM is applied; the accuracies are reported per action for each evaluation protocol. Table II compares published results on UCF50:

UCF50 [22] — Leave-One-Group-Out cross-validation (25 folds): 76.9%
Sadanand and Corso [14] — video-wise cross-validation: 76.4%
Sadanand and Corso [14] — group-wise cross-validation: 57.90%
Todorovic [23] — 2/3 training and 1/3 testing per class: 81.03%
Solmaz et al. [24] — Leave-One-Group-Out cross-validation (25 folds): 73.70%
Kliper-Gross et al. [25] — Leave-One-Group-Out cross-validation (25 folds): 72.60%
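Group-wise cross-validation keeps all clips of a group (e.g. the same person or scene) in one fold, so the test fold never shares a group with the training data. A hypothetical stdlib-only sketch of such a splitter:

```python
import random

def groupwise_kfold(groups, k=5, seed=0):
    """Split sample indices into k folds such that all samples sharing a
    group id land in the same fold, preventing leakage between the
    training and test partitions."""
    ids = sorted(set(groups))
    random.Random(seed).shuffle(ids)
    fold_of = {g: i % k for i, g in enumerate(ids)}   # group -> fold
    folds = [[] for _ in range(k)]
    for idx, g in enumerate(groups):
        folds[fold_of[g]].append(idx)
    return folds

# toy example: 12 clips from 6 groups (2 clips per group)
groups = [g for g in range(6) for _ in range(2)]
folds = groupwise_kfold(groups, k=3)
print([sorted(f) for f in folds])
```

Video-wise cross-validation, by contrast, shuffles individual clips into folds regardless of group, which explains why the two protocols yield different accuracies in Table II.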

VIII. CONCLUSIONS
Human activity recognition based on spatiotemporal orientation energy and activity templates provides simple yet discriminative techniques for detecting and extracting features with multiple activity detectors. The number of features in human activity recognition is often larger than the number of observations, so feature selection is a major step before classification to avoid irrelevant or redundant features and over-fitting. The R-Squared model is applied to obtain the most important discriminative features that explain the target; it also handles huge data in a rapid, simplified manner. The model significantly improves recognition accuracy while greatly reducing the number of features.
In future work, we plan to apply regression-based feature selection to human activity recognition with other feature extraction methods that produce large numbers of features.

Fig. 4: Regression Model. (a) Simple linear regression: Y = beta_0 + beta_1 * X + epsilon, where beta_0 is the intercept (the value of the response when the predictor is 0), beta_1 is the slope (the change in the response per unit change in the predictor), and epsilon is the error term representing deviations of Y about beta_0 + beta_1 * X. (b) Baseline regression: a horizontal fitted line across all values of the predictor, with slope 0 and intercept equal to the sample mean Ȳ of the response. (c) Explained (SSR), unexplained (SSE) and total (SST) variability.

Fig. 7: Number of Selected Features using R-Squared Feature Selection for each Action

Fig. 8:

TABLE II: Comparison with the Literature Results on UCF50 Dataset