Intelligent Real-Time Facial Expression Recognition from Video Sequences based on Hybrid Feature Tracking Algorithms

In this paper, a method for automatic facial expression recognition (FER) from video sequences is introduced. The features are extracted by tracking facial landmarks. Each landmark component is tracked by the method best suited to it, which results in a hybrid technique that achieves high recognition accuracy with limited feature dimensionality. Moreover, our approach aims to increase the overall system accuracy by improving the recognition accuracy of the most frequently confused expressions while keeping processing time low. To this end, the paper also introduces an intelligent Hierarchical Support Vector Machine (HSVM) that reduces the confusion between these expressions. The proposed system was trained and tested using a standard video sequence dataset for six facial expressions, and compared with previous work. Experimental results show an average recognition accuracy of 96% and an average processing time of 93 msec.

Keywords: facial expression recognition (FER); Hierarchical Support Vector Machine (HSVM); Human Computer Interaction (HCI); real-time facial expression recognition


I. INTRODUCTION
The science of human computer interaction (HCI) focuses on making the interaction between humans and computers resemble the interaction between humans. Research in this field is mainly concerned with understanding human behavior in order to facilitate interaction with computers. This research shows that facial expression is considered the most important way of conveying emotion, which essentially reflects human behavior.
FER has many applications, such as video conferencing, medical applications, forensics, virtual reality, computer games, machine vision and many more. FER is categorized according to the type of input data into two types: video-based FER and image-based FER. This paper mainly considers real-time video-based FER; a standard dataset of facial expression videos is used for testing and evaluation. Moreover, the proposed system is designed to recognize the six most effective expressions of the human face: anger, disgust, fear, happiness, sadness, and surprise [1,2,3].
Most video-based real-time FER systems depend on tracking the motion and position of the muscles in the face. These movements indicate the emotional state of a person from the face. An automatic FER system usually consists of three steps [1,2]: i) face detection, ii) facial feature extraction, iii) facial expression recognition. Each of these steps is considered a separate research area with its own challenges. For video-based real-time FER, however, the trade-off between processing time and recognition accuracy becomes the main challenge. A high level of accuracy and a processing time of a few milliseconds are the main goals of such a system. Many FER systems have been developed to satisfy these goals [1,2,3,4]. However, some limitations still exist. Several constraints affect the accuracy and processing time of an FER system, such as the number of features, the number of recognized emotions and the pose of the face.
The goal of this paper is to develop a hybrid feature tracking technique that tracks the most effective face features, each with the technique that corresponds to the nature of its movement, while keeping the dimensionality of the feature vectors as low as possible. Moreover, an intelligent Hierarchical SVM classification system is deployed to achieve low processing time and high recognition accuracy.
Previous research has shown the superiority of FER based on geometric features (GF) over FER based on appearance features (AF) [1,2]. Processing time is among the important challenges in FER. In [3,5] the authors achieved accuracy rates of up to 90% and 91% within 31 msec of processing time, while the processing time was reduced to 5 msec using HSVM with the same accuracy rate [4]. In [6,7,8], geometric feature extraction achieved 78.3%, 88.8% and 89% with low processing time. The low accuracy rate combined with low processing time results from using only specific face components such as the mouth and eyes. In [1], triangle-based features were deployed instead of distance-based or point-based features, with accuracy rates of 95% and 97% on the MUG and CK+ datasets. In [2], the accuracy using point, distance and triangle features is 96.37%, 96.58% and 97.80% on the CK+ database, 67.64%, 74.31% and 77.22% on the MMI database, and 91.41%, 94.13% and 95.50% on the MUG database, respectively. The systems in [1,2] achieve high accuracy, but at the cost of high processing time.
www.ijacsa.thesai.org
The rest of this paper is organized as follows: the proposed framework, face detection and landmark point extraction are described in Section II. The implementation of the three FER systems (distances, triangles and hybrid) is presented in Section III. Section IV describes the intelligent HSVM of the three systems. Experimental results of the three systems are presented in Section V. Finally, conclusions and future work are given in Section VI.

II. THE PROPOSED FRAMEWORK
Most automatic FER systems consist of a sequence of main processing blocks [2]: video acquisition, frame preprocessing, feature extraction, feature tracking and classification. Fig. 1 shows the framework of the proposed systems. The input of the proposed system is a video, from which the frames are extracted. The face detection algorithm is then applied to the first frame, and the facial landmark points are extracted from the detected face. The accuracy of a geometric FER system depends mainly on the accuracy of landmark point detection. Once the landmark points are extracted, the point tracking algorithm is applied to the video frames, and the feature vector is formed from the tracking results of the landmark points.
Since the scope of this paper is to accelerate the real-time video FER process by integrating the most effective algorithms at each stage, the proposed approach implements three systems with different feature vector formations: one based on distances, one based on triangles and a hybrid of the two. The distances-based system achieves low processing time, but some expressions have low accuracy. The triangles-based system achieves high accuracy with high processing time. The hybrid system is proposed to improve the accuracy of the expressions that are poorly recognized by the distances-based system while reducing processing time. The hybrid system recognizes the happy, anger and disgust expressions by distances and the other expressions by triangles.
At the training stage of the intelligent HSVM, the feature vector is input as training data together with its class label. After the HSVM is trained, a feature vector is input as test data and the HSVM recognizes its expression. The proposed approach uses a tree to represent the HSVM and a depth-first search algorithm to search for the expression. In the coming subsections, each step is explained and discussed.

A. Face Detection and Feature Points Extraction
Face detection is the first step in any FER system. This step is applied only to the first frame; the face is then tracked through the following frames using specific, predefined tracking points. The proposed approach employs an adaptive version of the Viola-Jones (VJ) face detector, which is based on Haar-like features [3]. This approach is well suited to real-time video applications, since it is approximately 15 times faster than most recent approaches.
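To illustrate what a Haar-like feature is, the following minimal Python sketch computes a two-rectangle feature using an integral image (summed-area table), which is the standard device that lets such features be evaluated in constant time. It is only an illustration of the general idea and not the adaptive VJ detector of [3]; the function names and the toy 4x4 image are assumptions introduced here.

```python
def integral_image(img):
    """Summed-area table with one extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_haar(ii, x, y, w, h):
    """Two-rectangle feature: sum of the upper half minus sum of the lower half."""
    half = h // 2
    top = rect_sum(ii, x, y, w, half)
    bottom = rect_sum(ii, x, y + half, w, half)
    return top - bottom


# Toy image (hypothetical values): a bright band above a dark band.
image = [
    [9, 9, 9, 9],
    [9, 9, 9, 9],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
ii = integral_image(image)
feature = two_rect_haar(ii, 0, 0, 4, 4)
print(feature)
```

A large feature value indicates a strong bright-over-dark contrast inside the window; a cascade detector combines many such features evaluated at different positions and scales.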
The detected face is represented by a number of static and dynamic points. The static points are the points that are not located on the face components (e.g., eyes, eyebrows, nose, and mouth). These points are located on the face border, in addition to two points located on the nose; the total of 8 static points is shown in blue in Fig. 2. The dynamic points, which are located on the face components (eyebrows, eyes, nose, and mouth), are strongly related to FER accuracy and are therefore more important than the static points. They must be located carefully and should not be determined by face ratios. Since this paper integrates the most powerful algorithms to achieve high FER accuracy, the next subsections briefly introduce the algorithms employed for extracting these dynamic points.
In the exponential operator, k is the value used for binarizing the image I'. The binarization is defined in terms of the average and the standard deviation of the pixel intensities of the image, together with a constant Z equal to 0.9. This method detects four points on each eye and three points on each eyebrow, as shown in Fig. 2. The algorithm for detecting the eye and eyebrow points is considered the most effective technique for locating these dynamic points [10].
2) Nose Points Extraction: Locating the nose dynamic points is the next step after detecting the eye points. The nose ROI is specified as the vertical region whose top lies between the eyes, which have been detected by the enhanced VJ algorithm [3]. The nose is bounded by three lines: two vertical lines and a horizontal line passing through the nostrils, which are determined by the highest gradient H of the Sobel projection curve proposed in [3]. Finally, the two points at the intersections of the three lines (the vertical lines and the horizontal line) are the landmark points of the nose. This method detects two points at any frontal face position, as shown in Fig. 6.

3) Mouth Points Extraction: Detecting the mouth dynamic points depends mainly on the static points of the face border and the previously located nose points. First, the mouth box ROI is determined as a ratio of the side R of the face box. The nose is located at the top of the mouth region; it is at 0.67*R from the top border of the face and has a width equal to 0.25*R. This part is the bounding box of the face with a width of 0.1*R [3]. Once the mouth ROI is determined, a lip map is employed to detect the mouth landmark points [11]. The lip map is computed from r, g and b, the RGB components after normalization [11]. The four mouth landmarks are determined as shown in Fig. 7.

III. REAL-TIME HYBRID FEATURE TRACKING ALGORITHMS FROM VIDEO SEQUENCES
The movements of the dynamic points change only slightly between different expressions, and confusion may occur between two or more facial expressions because of common movements. It is therefore necessary to employ different feature tracking algorithms corresponding to the nature of each dynamic point. In this paper, the proposed approach implements hybrid feature tracking algorithms based on the distances feature vector and the triangles feature vector [1,3].

A. FER System Based on Distances Feature Vector
This technique relies on defining 43 universal distances on the face, named distance vectors. These vectors are derived from the 2D distribution between two types of facial landmark points and are known to be effective for tracking facial expressions through video sequences [3]. The distance vectors are shown in Fig. 9. For data normalization, each distance is divided by its value in the first frame, which represents the neutral expression, as in [3]. This technique achieves low processing time because of the low dimensionality of its feature vector. However, it fails to achieve high recognition accuracy for some expressions.
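The normalization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the landmark coordinates, the index pairs and the function names are hypothetical, and only two distances are used instead of the 43 defined in [3].

```python
import math


def distance_features(landmarks, pairs):
    """Euclidean distance for each (i, j) pair of landmark indices."""
    return [math.dist(landmarks[i], landmarks[j]) for i, j in pairs]


def normalized_distance_vector(first_frame, current_frame, pairs):
    """Divide each distance in the current frame by its first-frame value."""
    base = distance_features(first_frame, pairs)
    current = distance_features(current_frame, pairs)
    return [c / b for c, b in zip(current, base)]


# Hypothetical landmarks: two mouth corners and one nose point.
neutral = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
smiling = [(-1.0, 0.0), (5.0, 0.0), (2.0, 3.0)]
pairs = [(0, 1), (0, 2)]  # assumed distance definitions, not those of Fig. 9

features = normalized_distance_vector(neutral, smiling, pairs)
print(features)
```

A value above 1 means that the distance has grown relative to the neutral first frame, and a value below 1 means that it has shrunk, so the features are independent of the absolute face size.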

B. FER System Based on Triangles Feature Vector
This technique tracks the movements of the feature points through a triangle shape: three facial points are tracked at a time rather than one or two facial landmarks [1]. The tracking information of the facial points and the relationships between them can be captured well by tracking three points at a time.
The triangle components in each frame are subtracted from the triangle components in the first frame of the video. Each triangle has four components that are saved in the feature vector, named 'a', 'b', 'α' and 'β'; the changes in their values relative to the first frame form the feature vector that is used in FER. Fig. 10 shows the mathematical representation of the difference between the triangle components of two frames. The FER system based on this technique proposed in [1] employs 52 landmark points. Its feature vector is too long, which leads to an increase in the processing and classification time. In this paper, 30 specific landmark points are tracked by triangle vectors. These points were chosen to identify expressions easily [12]. There are a total of 30!/(3!(30-3)!) = 4060 unique triangles. If the feature vector were formed from all of these triangles, it would be very long. The AdaBoost algorithm was therefore used to choose the triangles that best represent the expressions. Fig. 11 shows the 4060 triangles and the 80 triangles chosen by AdaBoost. Although AdaBoost chooses the best triangles, the feature vector is still long and the processing time would be high. The proposed approach therefore implements a hybrid system that uses triangles to recognize the expressions that have low accuracy in the distances-based system and uses distances for the other expressions.
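The triangle count and the four-component feature can be illustrated with a short Python sketch. The exact geometric definition of 'a', 'b', 'α' and 'β' is given in [1] and Fig. 10; the sketch below assumes, for illustration only, that 'a' and 'b' are the two sides meeting at the first vertex and that 'α' and 'β' are the interior angles at the other two vertices. The coordinates and function names are hypothetical.

```python
import math


def angle_at(vertex, p, q):
    """Interior angle in degrees at `vertex` between the rays to p and q."""
    v1 = (p[0] - vertex[0], p[1] - vertex[1])
    v2 = (q[0] - vertex[0], q[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_t = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def triangle_components(p1, p2, p3):
    """Assumed components: sides a = |p1p2|, b = |p1p3|, angles at p2 and p3."""
    a = math.dist(p1, p2)
    b = math.dist(p1, p3)
    alpha = angle_at(p2, p1, p3)
    beta = angle_at(p3, p1, p2)
    return (a, b, alpha, beta)


def triangle_feature(first, current):
    """First-frame components minus current-frame components."""
    f = triangle_components(*first)
    c = triangle_components(*current)
    return tuple(fv - cv for fv, cv in zip(f, c))


# Number of distinct triangles over 30 landmark points.
n_triangles = math.comb(30, 3)

# Hypothetical triangle whose third point moves between two frames.
first = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
current = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
delta = triangle_feature(first, current)
print(n_triangles, delta)
```

With 4 components per triangle, the length of the feature vector grows in proportion to the number of selected triangles, which is why the selection step matters for processing time.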

C. Hybrid System Based on Triangles Feature Vector
The proposed hybrid approach employs distance vectors to recognize the expressions that are clearly defined, and uses triangles to recognize the expressions that are not recognized accurately by the distances system. The hybrid technique tracks certain dynamic points for confusing expressions. For example, the fear and sadness expressions have a low recognition accuracy when the distances-based method is employed. Thus, the triangles-based method is implemented to enhance their recognition accuracy. However, the fear expression is usually confused with the surprise expression, and also with sadness. Therefore, the fear and surprise expressions are used to train AdaBoost to select the triangle features for these expressions; this AdaBoost selects 35 triangles. Likewise, the sadness, anger and disgust expressions are used to train AdaBoost to select the triangle features for these expressions; this AdaBoost selects 41 triangles. Fig. 12 shows the superiority of the triangle technique in detecting the variations of eye height in the fear expression. Fig. 13 shows the varying angles of the eye triangle in the fear expression. The first angle changes from 108° to 120° and the second angle changes from 38° to 50°, in addition to the variation of the triangle side that represents the eye height. These two figures show that tracking three points and the relationships between them is more efficient than tracking the distance between two points.

IV. INTELLIGENT RECOGNITION TECHNIQUE BASED ON HIERARCHICAL SVM CLASSIFICATION
SVM is used for classification and regression analysis. SVMs exhibit high classification accuracy for small training sets and good generalization performance on data that are highly variable and difficult to separate [3]. The proposed approach generates a hierarchical SVM using an RBF kernel [13]. The HSVM framework of the three systems is similar to that shown in Fig. 7. Note that the input feature vectors differ.
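For reference, the RBF kernel has the standard form K(x, y) = exp(-γ‖x − y‖²). The following minimal Python sketch evaluates it for two feature vectors. The value of γ and the example vectors are assumptions chosen for illustration; the parameters used in the proposed system are not stated here.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical normalized feature vectors and an assumed gamma.
u = [1.00, 1.05, 0.98]
v = [1.20, 0.90, 1.10]
similarity = rbf_kernel(u, v, gamma=0.5)
print(similarity)
```

The kernel value equals 1 for identical vectors and decays towards 0 as the vectors move apart, with γ controlling how quickly the similarity falls off.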
The proposed HSVM recognition strategy is treated as an uninformed artificial intelligence search problem. The search tree consists of six SVMs that are trained to differentiate accurately among the six most common expressions. The search process is based on the depth-first algorithm [13]: it starts from the root node and goes through the parent nodes, and it ends as soon as the goal is achieved, with no need to go through all the leaves. Therefore, it reduces the processing time.
As shown in Fig. 14, each SVM examines different input features corresponding to the expressions that are going to be tested. In Fig. 7 the SVM input features are labeled from "A" to "F"; each label indicates specific input features, as explained in Table I. For each expression there are governing features that help to recognize it accurately and quickly. For example, the root node SVM1 is fed by the "A" input features, which are considered common features for several expressions. If the SVM1 output is close to the happy or disgust features, it leads to SVM2; otherwise it leads to SVM3. The SVM2 node then tests the input features "B", which are more specific features used to differentiate between two conflicting expressions. Based on the SVM2 test result, it decides whether the expression is happy or disgust. The process continues, with each SVM node examining more details about a pair of confusing expressions. The search process ends as soon as the goal is achieved.
Table I shows the input vectors of the HSVM in the three systems: the distances-based system, the triangles-based system and the proposed system based on both. The distances D1 to D43 are shown in Fig. 9. The triangles of a face component (eyes, eyebrows, nose, mouth) are the triangles that have a landmark point on this component, i.e. the triangles of the mouth are all triangles that have one or more landmark points on the mouth. In the triangles-based system, the two triangles that are the first inputs to SVM1 are shown in Fig. 15. The all-face triangles are the triangles chosen by AdaBoost. The triangles that are input to SVM4 and SVM5 in the hybrid system were chosen by the AdaBoost algorithm, i.e. the 35 triangles that are input to SVM4 were chosen by AdaBoost when training it on the surprise and fear expressions. The distances in the hybrid system are shown in Fig. 9.
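The early-terminating tree search described in this section can be sketched in Python as follows. This is a simplified illustration, not the trained system: each node uses a plain threshold rule as a stand-in for a trained SVM, only five nodes are shown, and the leaves under SVM5 as well as the assignment of the feature groups "C" to "E" to nodes are hypothetical. Only the SVM1 to SVM2 or SVM3 branching, the happy and disgust leaves of SVM2 and the fear and surprise leaves of SVM4 follow the description in the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Union


@dataclass
class Node:
    name: str
    feature_key: str                          # feature group read by this node
    decide: Callable[[Sequence[float]], str]  # stand-in for a trained SVM
    children: Dict[str, Union["Node", str]]   # branch -> child node or label


def classify(root, features):
    """Follow a single root-to-leaf path and stop at the first leaf reached."""
    node, path = root, []
    while True:
        path.append(node.name)
        branch = node.decide(features[node.feature_key])
        child = node.children[branch]
        if isinstance(child, str):  # a leaf holds the recognized expression
            return child, path
        node = child


# Threshold rules below are placeholders for real classifiers.
svm2 = Node("SVM2", "B", lambda v: "happy" if v[0] > 0 else "disgust",
            {"happy": "happy", "disgust": "disgust"})
svm4 = Node("SVM4", "D", lambda v: "fear" if v[0] > 0 else "surprise",
            {"fear": "fear", "surprise": "surprise"})
svm5 = Node("SVM5", "E", lambda v: "sadness" if v[0] > 0 else "anger",
            {"sadness": "sadness", "anger": "anger"})  # hypothetical leaves
svm3 = Node("SVM3", "C", lambda v: "to_svm4" if v[0] > 0 else "to_svm5",
            {"to_svm4": svm4, "to_svm5": svm5})
svm1 = Node("SVM1", "A", lambda v: "to_svm2" if v[0] > 0 else "to_svm3",
            {"to_svm2": svm2, "to_svm3": svm3})

sample = {"A": [-1.0], "B": [0.0], "C": [1.0], "D": [1.0], "E": [0.0]}
label, visited = classify(svm1, sample)
print(label, visited)
```

Because each node reads only its own feature group, a sample that is resolved near the root never needs the longer feature vectors of the deeper nodes, which is where the saving in processing time comes from.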

V. EXPERIMENTS AND EVALUATION
A series of experiments and tests were conducted using C++ to simulate the techniques described above. Testing was based on the common standard 'FEEDTUM' video sequence dataset for six different facial expressions. It contains 385 sequences. Each sequence begins with the neutral expression and ends with its emotion. The proposed approach used 20 videos of each expression for training and 344 sequences for testing. In training, the sequences were accompanied by expression labels. In testing mode, the sequences were input to the system without labels. Table II summarizes the proposed technique, showing the dynamic points that are tracked as distance vectors, triangles or a hybrid of both; the input features corresponding to each SVM are shown in Table III.
Table IVV shows the FER accuracy for the six facial expressions based on the distance vector only, the triangle vectors only and the proposed hybrid technique, together with the average accuracy and the average processing time. The most important factor in the processing time is the feature vector length: there is a positive relation between the feature vector length and the recognition processing time.
Regarding the comparable results introduced in Table VIVII, it can be concluded that the hybrid technique achieves the tradeoff between high recognition accuracy and low processing time, which is the main challenge of real-time FER from video sequences.
The following example shows that the hybrid system is better than the distances system and the triangles system. To recognize the fear expression, the feature vector follows the path from SVM1 to SVM3 and finally to SVM4; the total feature vector lengths along this path are calculated for the distances system, the triangles system and the hybrid system. Fig. 16 shows the relationship between the feature vector length and the accuracy of the fear expression in the three systems.

Fig. 16. The relationship between the feature vector lengths of fear expression and its accuracy in the three systems

Previously published results using the FEEDTUM dataset [8,14] achieved accuracy rates of 93.4% and 94.6%, respectively. Moreover, the feature vector dimension in [1] is larger than that of the proposed feature vector. The proposed approach achieved average recognition rates of 93.8%, 97.2% and 96% with processing times of 22 msec, 256 msec and 93 msec for the distance-based, triangle-based and hybrid techniques, respectively. The recognition accuracy of each expression is represented graphically in Fig. 17. Moreover, the proposed intelligent hierarchical SVM with the hybrid vector succeeds in decreasing the classification confusion between the conflicting expressions, as shown in Table VIIIIXX.
Finally, Table XIV compares the related work with the proposed system and shows the superiority of the proposed algorithm in terms of FER accuracy.
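The trade-off reported in this section can be made explicit with a short calculation. The Python sketch below uses only the average accuracies and processing times quoted above; the dictionary layout and the helper names are introduced here for illustration.

```python
# Average accuracy (%) and processing time (msec) reported for the three systems.
results = {
    "distances": {"accuracy": 93.8, "time_ms": 22},
    "triangles": {"accuracy": 97.2, "time_ms": 256},
    "hybrid": {"accuracy": 96.0, "time_ms": 93},
}


def time_ratio(slower, faster):
    """How many times longer `slower` takes than `faster`."""
    return results[slower]["time_ms"] / results[faster]["time_ms"]


def accuracy_gap(system_a, system_b):
    """Accuracy of system_a minus accuracy of system_b, in percentage points."""
    return results[system_a]["accuracy"] - results[system_b]["accuracy"]


speedup_over_triangles = time_ratio("triangles", "hybrid")
slowdown_vs_distances = time_ratio("hybrid", "distances")
loss_vs_triangles = accuracy_gap("triangles", "hybrid")
gain_vs_distances = accuracy_gap("hybrid", "distances")

print(f"hybrid is {speedup_over_triangles:.2f}x faster than triangles, "
      f"losing {loss_vs_triangles:.1f} points")
print(f"hybrid is {slowdown_vs_distances:.2f}x slower than distances, "
      f"gaining {gain_vs_distances:.1f} points")
```

On these figures, the hybrid system gives up about 1.2 percentage points of accuracy relative to the triangles system in exchange for a processing time that is roughly 2.75 times shorter, while gaining about 2.2 points over the distances system.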

VI. CONCLUSIONS AND FUTURE WORK
This paper proposed a real-time FER method for video sequences based on a hybrid feature tracking algorithm and an intelligent HSVM search method. The proposed technique integrates the best well-known algorithms at each step. The main goal is to achieve the tradeoff between real-time video demands and high recognition accuracy. The hybrid system is implemented to reduce the feature vector dimensionality, and the intelligent HSVM achieves high accuracy with low processing time. The proposed approach was tested using the standard FEEDTUM database. The FEEDTUM database is difficult to handle because of intra-class confusions. This confusion between some expressions makes them very difficult to recognize, even for humans. Despite this limitation, the proposed technique showed good results regarding accuracy and processing time. For future work, the proposed approach may be implemented on real-life image sequences instead of the FEEDTUM database. For feature selection, other methods will be considered. Moreover, various angles of the human face may be used for training and testing the new approach.

Fig. 1. The steps of the proposed FER system

Fig. 2. The static landmark points (blue points) and the feature landmark points (green points)

1) Eyes and Eyebrows Points Extraction: The proposed approach employs the enhanced VJ algorithm [9]. The face is divided into three sub-portions: upper left half, upper right half and lower half, as shown in Fig. 3. VJ eye detection is applied to the upper left half and the upper right half. The face image division is based on the physical approximation of the locations of the eyes and mouth on the face. This algorithm increases the accuracy of the VJ technique and decreases processing time. The algorithm used for detecting the eye and eyebrow points is based on color segmentation, since skin pixels have high red intensities compared with eye and eyebrow pixels. This algorithm operates on the RGB image, as shown in Fig. 4 [10]:
• Complement the red channel of the eye ROI.
• Calculate the exponential operator for each pixel in the image I = CR.

Fig. 5. The steps of detecting eyebrows landmark points

Fig. 6. Description of the steps of detecting nose feature points: (a) 1. Gradient image ∇Ix; (a) 2. Gradient image ∇Iy; (b) Curve of gradient projection; (c) Two points at the intersection of three lines (vertical lines, horizontal line)

Fig. 7. Example of mouth landmark points detection steps

Finally, after detecting the static and dynamic points on the first video frame, a pyramidal implementation of the Kanade-Lucas-Tomasi (KLT) tracker is used to locate these points in the subsequent frames [1]. An example of tracking the distances is shown in Fig. 8.

Fig. 8. Samples of point tracking of image sequence of happy expression

Fig. 10. Difference in components of two triangles used as features [1]

Fig. 12. Distance varying of eye in the fear expression

Fig. 15. The two triangles that are first input in the triangles system

TABLE I. THE CLASSIFIERS OF HSVM AND THEIR INPUT DISTANCES FOR THE CLASSIFICATION PROCESS

TABLE II. RECOGNITION ACCURACY OF ALL EXPRESSIONS IN THE THREE TECHNIQUES

TABLE IV. THE COMPARISON OF THE RELATED WORK AND THE PROPOSED SYSTEMS