A Face Replacement System Based on Face Pose Estimation

Face replacement systems play an important role in the entertainment industry. However, most such systems today require manual assistance and specific tools. In this paper, a new system for automatically replacing a face using image processing techniques is described. The system is divided into two main parts: facial feature extraction and face pose estimation. In the first part, the face region is determined and the facial features are extracted and located. The eyes, mouth, and chin curve are extracted by their statistical and geometrical properties. These facial features provide the information for the second part, where a neural network classifies the face pose according to feature vectors obtained from ratios of the facial features. Experiments and comparisons show that this system performs well when dealing with different poses, especially non-frontal poses.

Keywords: facial feature; face replacement; neural network; support vector machine (SVM)


I. INTRODUCTION
For the entertainment and special effects industries, the ability to automatically replace a face in a video sequence with that of another person has huge implications. For example, consider a stunt double performing a dangerous routine in full view of the camera; the stunt double's face could later be automatically replaced with that of the desired actor in post-processing. While a few recent films have achieved good results when performing face replacement on stunt doubles, there are still some limitations: the illumination conditions in the environment must be controlled, and the stunt double has to wear a special custom-fit mask with reflective markers for tracking [1].
In order to accurately replace a face in a photograph or a frame of video, we separate the system into two main parts: facial feature extraction and face pose estimation. Generally, the common approach to face region detection is to exploit the characteristics of skin color. After locating the face region, the facial features can be determined from geometric relations and statistical information. For example, the most common pre-processing method is to detect skin regions with a trained skin tone model. R. L. Hsu et al. [2] proposed a face detection method based on a novel light compensation technique and a nonlinear color transformation. Many other color models have been used for human skin color [3]-[5]. H. K. Jee et al. [6] used color, edge, and binary information to detect eye pairs from input images with a support vector machine. Boosted classifiers are used to detect the face region in [7]-[8]. Neural network-based approaches, however, require a large number of face and non-face training examples [9]-[11]. C. Garcia et al. [12] presented a novel face detector based on a convolutional neural architecture, which synthesizes simple problem-specific feature extractors. There are also several algorithms for facial feature extraction. C. H. Lin [13] located facial feature points based on a deformable templates algorithm. C. Lin [14] used the geometric triangle relation of the eyes and the mouth to locate the face position. Yokoyama [15] combined color and edge information to locate facial features.
The second part of the face replacement system is face pose estimation. It is assumed that the viewpoint is at a fixed location and the face has an unknown pose that needs to be determined from one or more images of the human head. Previous face pose estimation algorithms can be roughly classified into two main categories: window-based approaches [16]-[19] and feature-based approaches [20]-[23]. Window-based approaches extract a face block from the input image and analyze the whole block with statistical algorithms. Among window-based approaches, multi-class classification methods divide the whole head pose parameter space into several intervals and determine the head pose [16]-[17]. For example, Y. M. Li et al. [18] used support vector regression to estimate the head pose, which can provide crucial information and improve the accuracy of face recognition. L. Zhao et al. [19] trained two neural networks to approximate the functions that map a head image to its orientation. Window-based approaches have the advantage of simplifying the face pose estimation problem. However, the face pose is generally coupled with many factors, such as differences in illumination and skin color, so the learning methods listed above require a large number of training samples.
On the other hand, feature-based approaches extract facial features by making use of the 3D structure of the human face. These approaches build 3D models of human faces and match facial features, such as the face contour and the facial components of the 3D face model, with their projections on the 2D image. Y. Hu et al. [20] combined facial appearance asymmetry and 3D geometry to estimate face poses. In addition, some sensors are used to improve feature location. For instance, D. Colbry et al. [21] detected key anchor points with 3D face scanner data; these anchor points are used to estimate the pose and then to match the test image to a 3D face model. Depth and brightness constraints can also be used to locate features and determine the face pose [22]-[23].
This paper is organized as follows. Face region detection and facial feature extraction are introduced in Section 2. Section 3 describes the face pose estimation system. The face replacement system is presented in Section 4. Section 5 shows the experimental results and comparisons. Finally, conclusions and future work are given in Section 6.

II. FACIAL FEATURE EXTRACTION
Facial feature extraction plays an important role in face recognition, facial expression recognition, and face pose estimation. A facial feature extraction system contains two major parts: face region detection and facial feature extraction. According to the skin color model, candidate face regions are detected first. Then, the facial features are extracted from the face region by their geometric and statistical properties. In this section, face region detection and facial feature extraction are described.

A. Face Region Detection
The first step of the proposed face replacement system is to detect and track the target face in an image. A skin color model is used to extract skin-colored regions, which are candidate face regions. The skin color model is built in the YCbCr color space [24]. This color space is attractive for skin color modeling because it separates chrominance from luminance. Hence, an input image is first transformed from the RGB color space to the YCbCr color space. The skin-color pixels are then obtained by applying threshold values learned from training data. After the skin color region is extracted, a morphological operator and 4-connectivity are adopted to enhance the possible face region. The larger connected regions of skin-color pixels are considered face region candidates, and the real face region is determined by eye detection: a skin color region containing eyes is defined as a face region. An SVM classifier [25] is used here to detect eyes. Hence, for an input image as in Fig. 2a, the skin color region, which may be a face candidate, is extracted by applying the skin color model, as in Fig. 2b. A morphological operator and 4-connectivity are then used to eliminate noise and enhance the region shape and boundary, as in Fig. 2c. The skin color region is confirmed as a face when the eyes can be found by the SVM-based eye detection, as in Fig. 2d.
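As a concrete illustration of the skin-color step, the sketch below converts an RGB pixel to YCbCr and thresholds the chrominance channels. The threshold ranges here are commonly cited example values, not the thresholds trained from this paper's data:

```python
def rgb_to_ycbcr(r, g, b):
    """ITU-R BT.601 conversion from 8-bit RGB to YCbCr."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def is_skin_pixel(r, g, b, cb_range=(77, 127), cr_range=(133, 173)):
    """Classify a pixel as skin by thresholding Cb and Cr.
    The ranges are illustrative literature values, not the
    thresholds obtained from this paper's training data."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]
```

In practice the test is applied per pixel to produce the binary mask of Fig. 2b, which is then cleaned with morphology and connectivity analysis.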

B. Facial Feature Extraction
After the eyes are located and the face region is determined, the other facial features, such as the lips region and the chin curve, can be extracted according to their geometrical relationships. In this section, the locating of the left and right tips of the lips region, the construction of the chin curve, and the hair region segmentation are described.
To extract the lips region, two properties are used: the lips lie on the lower part of the face, and their color differs from the skin color. Since the lips region is redder than the skin region, a red-to-green function RG(x, y) is employed to enhance the difference between lips color and skin color [26]. From the experimental results, the function RG(x, y) is defined as follows:

RG(x, y) = R(x, y) / (R(x, y) + G(x, y) + B(x, y)) − G(x, y) / (R(x, y) + G(x, y) + B(x, y))
RG(x, y) has a higher value when the red channel is larger than the green channel, which indicates a probable lips pixel. The possible lips region with higher red value is shown as a binary image in Fig. 3b. Edge information is also taken into account to improve the lips region locating: in the YCbCr color space, the Sobel operator is employed to find horizontal edges in the luminance (Y) channel, as shown in Fig. 3c. Using the union of the redder region and the edge information, the left and right tip points of the lips region can be determined; the result is shown in Fig. 3d.

The next facial feature to be extracted is the chin curve. Extracting the chin curve has two advantages: it separates the head from the neck, and it helps estimate the face pose. Since the chin curve holds strong edge information, the face block image is transformed into a gray-level image, as in Fig. 4a, and an entropy function is applied to measure the edge information; regions with large entropy contain more edge information, as shown in Fig. 4b. Hence, for an input image as in Fig. 5a, the skin color region is found first, as in Fig. 5b. Using the fitted chin curve, the face region can be separated from the neck, as shown in Fig. 5c.

After the face region is found, the hair region can be defined easily, since the hair region is above the face region. If an appropriate block above the face region is chosen and the skin color region is excluded, the remaining pixels, as in Fig. 6a, can be used as the seeds for a seeded region growing (SRG) algorithm, which extracts the hair region. The result is shown in Fig. 6b.
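A minimal sketch of the lips-enhancement map, assuming RG(x, y) is the difference between the normalized red and normalized green components (the exact trained form may differ):

```python
def rg_value(r, g, b):
    """Red-to-green measure for lips enhancement: normalized red
    minus normalized green. Higher values suggest lips pixels."""
    s = r + g + b
    if s == 0:
        return 0.0
    return (r - g) / s
```

A reddish lips pixel such as (180, 80, 90) scores noticeably higher than a typical skin pixel such as (200, 150, 130), which is what allows a simple threshold on the RG map to isolate the lips region.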

III. FACE POSE ESTIMATION

A. Pose Angle Estimation of Class A
An input image of class A is first normalized and rotated so that the line crossing the two eyes is horizontal. In other words, the roll angle γ of the input face must be found first. The roll angle γ is defined as the elevation or depression angle of the line joining the eyes. Using the relative vertical and horizontal distance of the two eyes, the roll angle γ can be obtained. Let x6 and x7 be the centers of the left eye and right eye respectively, as shown in Fig. 9a; the roll angle γ is then given by the arctangent of the vertical offset between the eye centers over their horizontal offset. After retrieving the roll angle, the face is normalized to horizontal. Five scalars, v1, v2, v3, v4, and v5, are used as the input of a neural network to estimate the face pose in class A. The first scalar v1 is defined as the ratio of L1 to L2, where L1 is the horizontal distance between the left tip of the lips and the constructed chin curve fc, and L2 is the distance between the right tip of the lips and fc. The scalar v1 is related to the yaw angle α: it is close to 1 when the face is frontal, as in Fig. 10a.
When the face turns to the right, as in Fig. 10b, L1 is smaller than L2 and v1 is smaller than 1. Conversely, when the face turns to the left, as in Fig. 10c, L1 is larger than L2 and v1 is larger than 1. The second scalar v2 is defined as the ratio of L3 to L4, where L3 is the vertical distance between the middle point of the two eyes, defined as x8, and the constructed chin curve, and L4 is the vertical distance between the center of the lips and x8, as in Fig. 11a. The scalar v2 is related to the tilt angle β, as in Fig. 11b. The third scalar v3 is defined in terms of L5, the distance between x6 and x7; v3 is related to the tilt angle β, as in Fig. 11c, and to the yaw angle α, as in Fig. 11d, simultaneously. Before defining the last two scalars, two more parameters, L6 and L7, are defined. Connecting the feature point x3 of the chin curve with the two tip points of the lips, the extended lines intersect the extended line crossing x6 and x7 at two points, defined as x9 and x10 from left to right, as in Fig. 12a. Parameter L6 is then defined as the distance between x6 and x9, and L7 as the distance between x7 and x10; the definitions of L6 and L7 are shown in Fig. 12b. The fourth scalar v4 is then defined in terms of L6, and the last scalar v5 in terms of L7.
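The normalization step and the first two scalars can be sketched as follows; note that v1 = L1/L2 is inferred from the surrounding discussion rather than quoted directly from the paper:

```python
import math

def roll_angle(left_eye, right_eye):
    """Roll angle (degrees) of the line joining the eye centers
    x6 and x7; 0 means the eyes are already horizontal."""
    (x6, y6), (x7, y7) = left_eye, right_eye
    return math.degrees(math.atan2(y7 - y6, x7 - x6))

def pose_scalars_v1_v2(L1, L2, L3, L4):
    """First two inputs of the class-A pose network:
    v1 = L1 / L2 (lips tips to chin curve, a yaw cue) and
    v2 = L3 / L4 (eye midpoint to chin vs. eye midpoint to
    lips, a tilt cue). The ratio form of v1 is an assumption."""
    return L1 / L2, L3 / L4
```

For a frontal face with symmetric lips-to-chin distances (L1 = L2), v1 comes out as exactly 1, matching the behavior described for Fig. 10a.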

B. Pose Angle Estimation of Class B
A face is classified into class B if only one eye is found by the eye detection. For a face in class B, five scalars are likewise used as the input of the neural network. Feature points x11 and x12 are the intersection points of the face edge with the horizontal extended lines crossing the eye and the lips respectively. Feature point x13 is the tip point of the chin curve, found as the point with the largest curvature, and feature point x14 is the center of the only extracted eye. With these four feature points, shown in Fig. 13a, the first scalar v'1 is defined using L8, the distance between x14 and the face edge point x11, and L9, the distance between x14 and the middle point of the lips; these two parameters are shown in Fig. 13b. The scalar v'1 is related to the yaw angle α, as in Fig. 13c. The scalar v'2 is the slope of the line crossing x12 and x13, as shown in Fig. 14a.
Connecting x14 with the middle point and the right tip point of the lips, the extended lines intersect the horizontal line passing through x8 at two points. L10 is defined as the distance between these two intersections, as in Fig. 15a. The scalar v'4 is then defined in terms of L10 (Eq. 12), and the scalar v'5 is defined from the same feature points.
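For class B, the slope scalar v'2 follows directly from its two feature points; the ratio form of v'1 is an assumption here, since the equation itself is not reproduced in the text:

```python
def slope(p1, p2):
    """Scalar v'2: slope of the line through x12 (face-edge
    point at lips height) and x13 (chin tip)."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

def ratio_v1(L8, L9):
    """Scalar v'1 built from L8 (eye center to face edge) and
    L9 (eye center to lips midpoint); a simple ratio is assumed."""
    return L8 / L9
```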

IV. FACE REPLACEMENT
In this section, the procedure of face replacement is detailed. For an input target face, the face pose is estimated first, and then a face with a similar pose is chosen from the database as the source face to replace the target face. However, several problems arise when replacing the face, such as mismatches in face size, face position, face pose angle, and skin color. Hence, image warping and shifting are first adopted to adjust the source face so that it better matches the target face. Color consistency and image blending are then used to reduce the discontinuity caused by the replacement. All the details are described below.

A. Image Warping and Shifting
After the face pose angle of the target face is determined and the face region of the source face is segmented, the target face is replaced by the source face. However, the resolution, face size, and face pose angle may not be exactly the same, so image warping is adopted to deal with this problem.
Image warping is applied according to feature matching. It is a spatial transformation that includes shifting, scaling, and rotating. In this paper, an affine matrix with bilinear interpolation is used to achieve image warping. The affine transformation is defined by

X' = m1 X + m2 Y + m3,  Y' = m4 X + m5 Y + m6,

where (X', Y') is a feature point coordinate of the target face, (X, Y) is the corresponding feature point coordinate of the source face, and m1, ..., m6 are parameters. For faces in class A, six feature points, the two eyes (x6 and x7), the center of the lips, and the feature points x1, x3, and x5 of the chin curve, are used to solve for the parameters by the least squares method, as in Fig. 16a, while four feature points, x11, x12, x13, and x14, are used for faces in class B, as in Fig. 16b. After the source face is warped, a suitable position is sought at which to paste it over the target face. A better face replacement is achieved when more of the face and hair regions are matched between the source face and the target face. The source face is first pasted so that the coordinates of the pasted feature point coincide in the source face and the target face. The pasted feature point is the middle point of the chin curve, x3, for class A, and the tip point of the chin curve, x13, for class B.
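The least-squares fit of the six affine parameters can be sketched with NumPy; this is a sketch of the warping setup, not the paper's exact implementation:

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Solve the affine parameters m1..m6 mapping source-face
    feature points (X, Y) to target-face points (X', Y') by
    least squares: X' = m1*X + m2*Y + m3, Y' = m4*X + m5*Y + m6."""
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    n = len(src)
    A = np.zeros((2 * n, 6))
    b = dst.reshape(-1)          # [X'1, Y'1, X'2, Y'2, ...]
    A[0::2, 0:2] = src           # rows for the X' equations
    A[0::2, 2] = 1.0
    A[1::2, 3:5] = src           # rows for the Y' equations
    A[1::2, 5] = 1.0
    m, *_ = np.linalg.lstsq(A, b, rcond=None)
    return m                     # [m1, m2, m3, m4, m5, m6]
```

With six (class A) or four (class B) point correspondences the system is overdetermined, and the least-squares solution absorbs small feature-localization errors.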
The pasted source face is then shifted around the pasted feature point to the position with the most matching points. A matching degree function M(x, y) for a pasting point (x, y) is used to evaluate the degree of matching:

M(x, y) = Σ over (i, j) in I of [ h(Fs(i, j), Ft(i, j)) + h(Hs(i, j), Ht(i, j)) ],

where Fs(i, j) and Ft(i, j) are binary face images with value 1 only for face region pixels in the source and target images respectively, Hs(i, j) and Ht(i, j) are binary hair images with value 1 only for hair region pixels, and I is the region of interest, which is larger than the pasted region. The function h(a, b) in equation (15) returns +1 when both masks mark the pixel, -1 when they disagree, and 0 otherwise. For each point near the pasted feature point, the matching degree can be calculated, and the point with the highest matching degree is chosen as the best position to paste the source face. For example, the face region (white) and hair region (red) for the source and target face images are shown in Fig. 17a and 17b respectively. When the target face is pasted at a random position, as in Fig. 17c, there are more "-1" pixels, denoted by the red region, and fewer "+1" pixels, denoted by the white region, meaning that the matching degree is low. After calculating the matching degrees of all nearby points, the best pasting point with the most "+1" and the fewest "-1" pixels is found, as in Fig. 17d.
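A sketch of the matching-degree computation, with h reconstructed from the description of the "+1" and "-1" regions (the exact published definition may differ):

```python
def h(a, b):
    """Pixel agreement: +1 when both binary masks mark the
    pixel, -1 when they disagree, 0 when both are background."""
    if a == 1 and b == 1:
        return 1
    if a != b:
        return -1
    return 0

def matching_degree(Fs, Ft, Hs, Ht):
    """M(x, y): sum of face-mask and hair-mask agreements over
    the region of interest; higher means a better pasting point."""
    total = 0
    for i in range(len(Fs)):
        for j in range(len(Fs[0])):
            total += h(Fs[i][j], Ft[i][j]) + h(Hs[i][j], Ht[i][j])
    return total
```

In the shifting step this score is evaluated for each candidate pasting point near the chin feature point, and the maximizer is kept.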

B. Color Consistency and Image Blending
Because of differences in luminance and skin tone between individuals, the skin color of the target face may not be similar to that of the source face. To solve this problem, skin color consistency is adopted. The histograms of both the source face and the target face are analyzed first, and the mean skin color of the target face is shifted to the same value as the mean of the source face. For example, the source face in Fig. 18a is darker than the target face in Fig. 18b. If the face replacement is applied without skin color consistency, the skin colors of the face region and the neck region differ in the result, as shown in Fig. 18c. To avoid this, the mean of the histogram of the target face is shifted to the same value as that of the source face, as in Fig. 18d, so that the skin colors of the face region and the neck region are similar after replacement. The result of face replacement with skin color consistency is shown in Fig. 18e.

Finally, an image blending method is applied to deal with the boundary problem. Even though the source skin color is adjusted to be consistent with the target face, a visible boundary remains when the source face replaces the target face. The objective of image blending is to smooth the boundary by interpolation, using the hyperbolic tangent as the weight function. The interpolation is applied both horizontally and vertically, where I(x, y) is a boundary point and L(x, Y), R(x, Y), U(X, y), and D(X, y) represent the left, right, up, and down images respectively. The result of image blending is exhibited in Fig. 19. These images were not processed with color consistency, so the boundary from the face replacement is sharper; however, it can be seen that the image in Fig. 19b, with image blending, has a smoother boundary than the one in Fig. 19a, without image blending.
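The two post-processing steps can be sketched as a mean shift of the skin histogram and a hyperbolic-tangent blending weight; the tanh scale below is illustrative, as the paper does not give its parameters here:

```python
import math

def shift_skin_mean(target_vals, source_mean):
    """Shift the target face's skin-color values so their mean
    matches the source face's mean, clamping to [0, 255]."""
    t_mean = sum(target_vals) / len(target_vals)
    d = source_mean - t_mean
    return [min(255, max(0, v + d)) for v in target_vals]

def blend_weight(d, scale=4.0):
    """Hyperbolic-tangent weight as a function of signed distance
    d from the pasted boundary: near 0 deep inside one image,
    near 1 deep inside the other, 0.5 on the boundary itself."""
    return 0.5 * (1.0 + math.tanh(d / scale))
```

Interpolating pixel values with this weight across the seam (horizontally, then vertically) produces the smoother boundary seen in Fig. 19b.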

V. EXPERIMENTAL RESULTS
In this section, the results of face pose estimation and face replacement are shown, together with some analyses and comparisons.

A. Face Pose Estimation
To verify the accuracy of the face pose estimation system, face images under various poses were collected and tested, and the accuracy rate of face pose estimation was calculated. The proposed method is compared with other face pose estimation methods, such as a pose estimation method based on Support Vector Regression (SVR) using Principal Component Analysis (PCA) [18] and a Neural Network (NN) based approach [19]. The comparisons of the face pose estimation methods are shown in Fig. 21.

B. Face Replacement
In this section, the results of face replacement are shown. Various conditions are considered to test the robustness of this automatic face replacement system, such as wearing glasses and differences in resolution, luminance, skin color, yaw angle, roll angle, and tilt angle. The results show that the system performs well under these conditions. In Fig. 22, the effect of glasses is examined: the target face with glasses, as in Fig. 22b, is replaced by a source face without glasses, as in Fig. 22a. Since the target face region is replaced by the entire source face, wearing glasses does not affect the results.
When the face size and luminance of the target face and source face differ, the face size and the skin color are adjusted. The source face in Fig. 23a is resized to fit the target face by the affine matrix according to the facial feature matching. The color consistency method is also applied in this case: it adjusts the skin color of the target face in Fig. 23b so that it is similar to the source face, giving a better result after replacement. From the result in Fig. 23c, it can be seen that the skin color of the target face is shifted to a value similar to that of the source face, especially in the neck region. Since the skin color and the face size are adjusted, the replacement result looks more natural.

When dealing with a profile pose, such as a face with a 90-degree yaw angle as in Fig. 24a, a face image with a similar pose is chosen from the database to replace the target face; the result would be poor if the replacement were done with only an affine matrix and an improper face pose. From the result shown in Fig. 24b, it can be seen that the system performs well on profile poses, even at a 90-degree yaw angle. Similarly, when dealing with a face with a tilt angle, such as Fig. 25b, a proper source face is found from the database first. According to the face pose estimation system, the face with the most similar pose is chosen, as in Fig. 25a. After applying a reflection matrix, the pose of the source face is almost the same as that of the target face. With the color consistency method, the replacement can be done even when the target face has both tilt and yaw angles; the result is shown in Fig. 25c.
When the target face has a roll angle, such as in Fig. 26b, the roll angle is calculated first from the two eyes. After the roll angle is found and a similar pose is chosen from the database for the source face, as in Fig. 26a, a rigid transformation is adopted to rotate the source image so that the roll angle of the source face matches that of the target face. In Fig. 26c, it can be seen that the replacement is accomplished with this rigid transformation.

VI. CONCLUSIONS AND FUTURE WORK
Face replacement systems play an important role in the entertainment industry. In this paper, a face replacement system based on image processing and face pose estimation is described. Various conditions are considered when replacing the face, such as different yaw, roll, and tilt angles, skin color, luminance, and face size. The experimental results show that this face replacement system performs well under these conditions. In the future, facial expression would be a further challenging task to consider; with facial expression handled, the results of face replacement will be more realistic and the system will be even more useful in the entertainment industry.
Three sets of eye data are used for training the SVM eye detector. Eye images with frontal pose (set A) or profile pose (set B) are trained as positive patterns. For negative patterns, non-eye images (set C), such as the nose, lips, and ears, are included. All the training sets for eye detection are shown in Fig. 1.

Figure 1. Training data of the SVM. (a) Eye images of frontal pose. (b) Eye images of half-profile or profile pose. (c) Non-eye images.

Figure 2. Face region detection. (a) Original image. (b) Binary image after applying the skin color model. (c) Possible face candidate regions after applying the morphological operator and 4-connectivity. (d) The remaining face region after applying eye detection.

Figure 3. Lips region locating. (a) Original image. (b) Binary image of the function RG(x, y). (c) Horizontal edges found with the Sobel operator. (d) The left and right tip points of the lips region.

Figure 4. Chin curve construction. (a) Input gray-level image. (b) The entropy of the input gray-level image. (c) Five feature points, x1, x2, x3, x4, and x5, from the leftmost point to the rightmost point respectively. (d) The fitted chin curve.

Figure 5. Face region segmentation. (a) Input image with curve fitting function. (b) Skin color region. (c) Face region obtained using the curve fitting information.

Figure 6. Hair region extraction. (a) The remaining pixels are used as the seeds for SRG after the skin color region is neglected. (b) The result of hair region extraction.

Figure 7. Parameters of face pose estimation.

Since the profile pose is much different from the frontal pose, two different methods are proposed. All the poses are roughly divided into two classes according to the number of eyes extracted by the SVM-based eye detection system. When two eyes are extracted, the face pose belongs to class A, which is more frontal. Otherwise, if only one eye is extracted, the face pose belongs to class B, which is more profile. Examples of class A and class B are shown in Fig. 8a and Fig. 8b respectively.

Figure 8. Two kinds of face pose. (a) Frontal face pose with two eyes extracted. (b) Profile face pose with only one eye extracted.

Here x and y represent the x-coordinate and y-coordinate respectively. Using the roll angle γ, the image can be rotated to horizontal, as in Fig. 9b. For the input image in Fig. 9c, the normalization result is shown in Fig. 9d.

Figure 9. The roll angle γ. (a) The definition of x6 and x7. (b) The rotated image with horizontal eyes. (c) Input image. (d) Normalization result of input image (c).

Figure 10. The relationship between scalar v1 and the yaw angle α. (a) Scalar v1 is close to 1 when the face is frontal. (b) Scalar v1 is smaller than 1 when the face turns to the right. (c) Scalar v1 is larger than 1 when the face turns to the left.

Figure 11. The relationship between scalars v2 and v3 and the pose parameters, tilt angle β and yaw angle α. (a) The definitions of x8, L3, and L4. (b) The relationship between scalar v2 and tilt angle β. (c) The relationship between scalar v3 and tilt angle β. (d) The relationship between scalar v3 and yaw angle α.

Scalar v4 is related to the tilt angle β, as shown in Fig. 12c, and scalar v5 is related to the yaw angle α, as shown in Fig. 12d.

Figure 12. The relationship between scalars v4 and v5 and the pose parameters, tilt angle β and yaw angle α. (a) The definitions of x9 and x10. (b) The definitions of L6 and L7. (c) The relationship between scalar v4 and tilt angle β. (d) The relationship between scalar v5 and yaw angle α.

Figure 13. The relationship between scalar v'1 and the yaw angle α. (a) The definition of feature points x11, x12, x13, and x14. (b) The definition of parameters L8 and L9. (c) The scalar v'1 is related to the yaw angle α.

Figure 14. The relationship between scalars v'2 and v'3 and the pose parameters, tilt angle β and yaw angle α. (a) The line crossing x12 and x13. (b) The scalar v'2 is related to the tilt angle β. (c) The definition of angle θ. (d) The scalar v'3 is related to the tilt angle β. (e) The scalar v'3 is related to the yaw angle α.

Figure 15. The relationship between scalars and pose parameters. (a) The definition of L10. (b) The scalar v'4 is related to the tilt angle β. (c) The scalar v'5 is related to the yaw angle α.

Figure 16. Feature points for image warping. (a) Six feature points are used for class A. (b) Four feature points are used for class B.

Figure 17. Image shifting according to matching degree. (a) Source face image. (b) Target face image. (c) Face replacement with a lower matching degree. (d) Face replacement with the highest matching degree.

Figure 18. Skin color consistency. (a) Source face. (b) Target face. (c) Face replacement without skin color consistency. (d) The mean of the histogram of the target face is shifted to the same value as the source face. (e) Face replacement with skin color consistency.
In the database, the face poses are divided into 21 classes according to different yaw angles α and tilt angles β. The face pose is estimated by multi-class classification based on a neural network. The yaw angle is divided into 7 intervals and the tilt angle into 3 intervals, as shown in Fig. 20a and Fig. 20b respectively. Because the face poses of turning right and turning left are symmetric under a reflection matrix, only left profile face poses are considered.

Figure 20. Intervals of different face angles and different face poses. (a) 7 intervals of different α; positive values represent turning left and the magnitude represents the turning degree. (b) 3 intervals of different β; positive values represent looking up and the magnitude represents the turning degree. (c) 21 different face poses.

There are 1680 images from 80 people in the database for training, and another 1680 images from 80 other people are used for testing. The accuracy rate of pose estimation, R_PE, is defined as the ratio of correctly estimated poses to the total number of test images.

Figure 21. Accuracy comparisons of face pose estimation.

Figure 22. Face replacement result when considering glasses. (a) Source face image without glasses. (b) Target face image with glasses. (c) The replacement result.

Figure 23. Face replacement result when the face size and luminance differ. (a) Source face image. (b) Target face image. (c) The replacement result.

Figure 24. Face replacement result when considering the profile face. (a) Target face image. (b) The result of face replacement.

Figure 25. Face replacement result when considering the face with tilt angle. (a) Source face image. (b) Target face image with a tilt angle. (c) The replacement result.

Figure 26. Face replacement result when considering the face with roll angle. (a) Source face image. (b) Target face image with a roll angle. (c) The replacement result.