A New 3D Model-Based Tracking Technique for Robust Camera Pose Estimation

In this paper we present a new robust camera pose estimation approach based on 3D lines features. The proposed method is well adapted for mobile augmented reality applications We used an Extended Kalman Filter (EKF) to incrementally update the camera pose in real-time. The principal contributions of our method include first, the expansion of the RANSAC scheme in order to achieve a robust matching algorithm that associates 2D edges from the image with the 3D line segments from the input model. And second, a new powerful framework for camera pose estimation using only 2D-3D straight-lines within an EKF. Experimental results on real image sequences are presented to evaluate the performances and the feasibility of the proposed approach in indoor and outdoor environments.


INTRODUCTION
Estimation of relative 3D camera position and orientation (pose) is one of the most challenging problems in computer vision.Indeed, the knowledge of the camera pose is useful for numerous applications, including motion analysis, robotic control tasks, and Augmented Reality (AR) [1] [2].For example, Dasgupta and Banerjee [3] developed an AR system for autonomous navigation which uses a panoramic vision.Their system allows augmenting, via a laptop, the real video by simultaneously viewing an artificial 3D view of the scene.During the last years, several approaches based on natural features (points, planes, edges, etc.) extracted from the image scene have been developed.The mean idea of these techniques is to locate the camera in three dimensions given a sequence of 2D intensity images of the scene containing the 3D features whose positions and orientations are known relative to a reference frame.When the correspondences between 2D features extracted from the image and 3D features defined in the world frame are established, the problem is then solved using 2D-3D registration techniques.Numerical nonlinear optimization methods like the Newton-Raphson or Levenberg-Marquardt algorithm are generally used for the minimization [4][5] [6].Thus, Wang et al. [7] propose a 3D localization approach using Harris-Laplace interest points as natural landmarks.Their localization system is composed of two stages: coarse localization from a location vector space model (LVSM) and affine localization from the location database using a voting algorithm.The pose recovery of the camera is obtained from essential matrix decomposition.Line-based methods are more robust and more efficient than point-based techniques.Indeed, line features are present in many scenes and are more visible than points under a wider range of lighting and environmental conditions.Wuest et al. [8] present a modelbased line tracking approach that can handle partial occlusion and illumination changes.The camera pose is computed by minimizing the distances between the projection of the model lines and the most likely matches found in the image.Drummond and Cipolla [9] propose a novel framework for 3D model-based tracking.Objects are tracked by comparing projected model edges to edges detected in the current image.Their tracking system predicts the edge locations in order to rapidly perform the edge search.They have used a Lie group formalism in order to transform the motion problem into simple geometrics terms.Thus, tracking becomes a simple optimization problem solved by means of iterative reweighed least squares.Recently, Comport et al. [10] propose a real-time 3D model-based tracking algorithm.They have used a visual servoing approach to formulate the pose estimation problem.A local moving edges tracker based on tracking of points normal to the object contours is implemented.In order to make their algorithm robust, they have integrated a M-estimator into the visual control law.Other approaches have also been applied where different features have been combined to compute the camera pose.Ababsa and Mallem [11] propose to combine point and line features in order to handle partial occlusion.They integrated a M-estimator into the optimization process to increase the robustness against outliers.Koch and Teller [12] describe an egomotion estimation algorithm that takes as input a coarse 3D model of an environment.Their system uses a prior visibility analysis to speed initialization and accelerate image/model matching.
For real-time applications, extended Kalman filtering (EKF) is the most widely used method for recursive estimation.EKF allows also to achieve noise and disturbance rejection and to enhance estimation accuracy.EKF was used for the first time to compute the 3D location by Wilson et al. [13].Yu et al. [14] propose an EKF based method to recover structure and motion from image sequences.The EKF is used to estimate the object's pose and to refine the positions of the model points in the 3D space.More recently the same authors propose an EKF algorithm for pose tracking using the trifocal tensor [15].Their approach is also based on natural points and incorporates the tri-focal constraint into the measurements model.Yoon et al. [16] present a model-based object tracking to compute the 3D camera pose.Their algorithm uses an EKF to provide an incremental pose-update scheme in a prediction-verification framework.In order to enhance the accuracy and the robustness of the tracking against occlusion, they take into account the www.ijacsa.thesai.orgmeasurement uncertainties associated with the location of the extracted image straight-lines.Lippiello et al. [17] developed an algorithm for pose estimation of a moving object.In order to enhance the robustness with respect to measurement noise and modeling error, they proposed an adaptive version of the EKF customized for visual applications.Jiang and Neumann [18] propose an EKF-based extendible tracking method that can extend tracking from prepared to unprepared regions of the environment by auto-calibrating a 3D structure of a priori unknown 3D lines.
Other approaches use Simultaneous Localization And Mapping (SLAM) to track the camera pose while building a 3D map of the unknown scene.William et al. [19] proposed a recovery module that finds the pose of a camera using a map point features created by a single-camera SLAM system.They used an efficient version of RANSAC to achieve pose hypotheses verification and to reject outliers.Davison et al.
[20] used the EKF to estimate the 3D trajectory of a monocular camera moving through an unknown scene.The core of their system, named MonoSLAM, is the online creation of a sparse map of natural landmarks within a probabilistic framework.However, the main problem with most existing monocular SLAM techniques is a lack of robustness when rapid camera motions, occlusion and motion blur occur.
In this paper we present an original robust camera pose tracking using only straight lines and which differs from existing works.We propose to combine an EKF with a RANSAC scheme in order to achieve a robust 2D-3D lines matching.This gives an efficient solution for outlier's rejection.
To our knowledge such solution has not been explored before.Furthermore, we have combined the 2D-3D lines correspondence constraints for object pose estimation, developed by Phong et al. [21], with an EKF in order to update recursively the camera pose.We have compared our results with classical approaches where pose estimation is solved using least square approaches [22][23] [24].Our method requires no training phase, no artificial landmarks, and uses only one camera.
The rest of the paper is structured as fellows: In section II, we describe the camera pose estimation problem formulation when using straight lines, and we also give a complete implementation of the Extended Kalman Filter to update the camera pose recursively over the time using 2D and 3D lines features.In section III, we explain how we have expended the RANSAC scheme [25] in order to achieve robust 2D-3D lines matching.In section IV, we show experimental results and evaluations; we discuss also the merits and the limitations of the proposed approach.Conclusion and future work are presented in section V.

II. CAMERA POSE ESTIMATION ALGORITHM
In any Kalman Filter implementation, the system state is stored as a vector.In our algorithm the state is represented by the position and the orientation of the camera with respect to the world coordinate system.For computational we use a unit quaternion to represent the rotation [26].Thus, the state vector is given by: We denote the camera state at time t by the vector t X .The EKF is used to maintain an estimate of the camera state X in the form of a probability distribution   , where Z t is the measurement vector at time t.The Kalman filter models the probability distribution as Gaussian, allowing it to be represented by a covariance matrix .In an extended Kalman filter, the nonlinear measurements and motion models are linearised about the current state estimate as Jacobian matrices [27] [28].Our algorithm follows the usual predict-refine cycle, whereby the state X is predicted at timestep t, and measurements are used to refine the prediction.

A. Time update
The time update model is employed in order to predict the camera pose at the following time step.In our case, the time update is simple because of the fact that we estimate the camera pose at each frame of the images sequence.Therefore the 3D cameras pose between two successive frames changes very little.The time update equation is then given by : Where A is 77 identity matrix.
The time update step also produces estimates of the error covariance matrix  from the previous time step to the current time step t.To perform this prediction we use the general update equation of the Kalman filter: Where Q t represents the covariance matrix of the process noise. reflects the variance of the state distribution.

B. Measurement model and estimate update
The measurement update model relates the state vector to the measurement vector.Since our goal is to estimate the camera pose using only straight lines, we will first describe the constraint equation which relates the state vector to the 3D model lines and their corresponding 2D image edges.We choose to base our technique on line features, rather than points, because this approach is relatively unexplored in the vision literature.We consider a pin-hole camera model and we assume that the intrinsic camera parameters are known.The world coordinate frame is a reference frame.All the 3D model lines are defined with respect to it.Let Li be a 3D line.Li is represented with the Cartesian coordinates of its two end-points i P 1 and i P 2 (see figure 1).The points i P 1 and i P 2 in world coordinates can be expressed in the camera frame as well : Where the 33 rotation matrix R and the translation vector T describe the rigid body transformation from the world coordinate system to the camera coordinate system and are precisely the components of the camera state vector.We can see that the points Furthermore, a measurement input of the normal vector i N  can be obtained from the image data.Indeed, image line matched with model line belongs also to the projection plane defined above.Let l i be a 2D image line corresponding to the 3D line L i .In similar manner l i is represented by its two extremities defined in the 2D image coordinates frame.The points i m 1 and i m 2 can be expressed in the camera coordinate frame as follows: Where the matrix K contains camera calibration parameters, such as focal length, aspect ratio and principal point coordinates.
A measurement i n  of the unit vector i N  normal to the projection plane is thus given by (see figure 1): Where Combining equations ( 5) and ( 7), a measurement equation can be written, for each matching event L i  l i : Where v t represent the noise term in the measurement input with covariance R t .The noise is due to the uncertainty in the measured image position of the end points of the extracted 2D lines.The nonlinear function h(X) in measurement equation (8) relates the state to the measurement input.Three 2D-3D line correspondences are sufficient in theory to recover 6-DOF camera pose [29] through in practice mores line may be required to increase accuracy.
The state estimate and covariance are refined after each feature measurement z t using the standard equation of the EKF as follows: Where H t is the Jacobian matrix defined by: The measurement update model is executed once a set of 2D-3D matched lines become available.

C. Iterated EKF
The standard EKF method does not consider errors due to the linearization of the nonlinear function h(X) in the vicinity of  t X .However, these errors can lead to wrong estimates and/or divergence of the camera pose.Since the nonlinearity is only in measurement equation, the Iterated Extended Kalman Filter (IEKF) is the best technique to deal with it [30] [31].The IEKF uses the same prediction equation as EKF, namely (2) and (3).The measurement update relations are replaced setting   t t X X 0 and doing iteration on : The iteration could be stopped when consecutive values differ by less than a defined threshold.

III. ROBUST 2D-3D LINES MATCHING ALGORITHM
In this section we explain the expansion of the RANSAC scheme that we have developed in order to achieve a robust matching algorithm that associates 2D edges from the image with the 3D line segments from the input model, and without using any verification algorithm.
be a set of 2D edges extracted from the image and   a set of 3D model lines.Our robust lines matching algorithm is summarized as follows: 1. Randomly sample subsets of four {li  L j } pairs of 2D and 3D lines.In theory a minimum three pairs are of lines are sufficient to compute an accurate rigid transformation.
2. For each sample, compute the camera pose (R,T) using the IEKF algorithm described in section II.
3. Each candidate  is tested against all the correspondences j i L l  by computing, in the camera frame, the angle between the normal vector i n  (see figure 1) associated with the image line i l and the transformed line j L R  .If this match is wrong with respect the pose , then the co sinus of the angle should be significantly larger than zero.
4. We choose the pose  which has the highest number of inliers, i.e the  for which all the pairs are within a fixed angle threshold.
Hence, the obtained camera pose for the current image is robustly updated using only inliers of correspondences.

IV. EXPERIMENTAL RESULTS
The proposed camera pose estimation algorithm has been tested in real scenes and the registration accuracy was analyzed.The results are presented for both indoor and outdoor images (Figure 2).The indoor scene consists of a camera moving in an office room whereas the outdoor one corresponds to a moving camera pointing towards one frontage of a building.The frame rate of the recorded image sequences is about 25 frames/s and the resolution of the video images is 320240 pixels.The 3D models of the office room and the building frontage are known, they are composed, respectively, of 19 lines and 120 line defined by the 3D coordinates of their end points within the world coordinates frame (Figure 3).In order to estimate the camera pose accuracy we defined, in the camera space, the registration error .
Given a set of correspondences between image edges and model segments the error  corresponds to the normalized square sum of the sinus of the angular disparities i  for each correspondence between image edge and the re projected model segments (Figure 1): Where M is the number of correspondences and  i the angle between the two planes spanned by the camera center, the observed image edge l i and the model segment L i (see figure 1).
In our experiment we took M=10 correspondences.We have first considered that the data set has no outliers (100 % inliers or good matching) and we computed the registration error for several frames of the image sequence.Our algorithm was run on both the indoor and outdoor scenes.Similar results were obtained in both situations.Thus, for the indoor scenario, the mean error is about  In the second experiment, we have evaluated the capacity of our robust algorithm to reject outliers in observed data.In our case, an outlier corresponds to a wrong feature matching between 2D and 3D line.
For that, we have contaminated the input data set with different percentage of outliers, for both indoor and outdoor scenes, and have computed the corresponding registration error.The obtained results are summarized in table 1.As can be seen, our robust algorithm succeeded in all the cases to detect and delete the outliers.The camera pose is then estimated using the final data consensus which contains only the good 2D-3D lines correspondences.For example, in the worst case when 60% of the input data for the indoor scene (i.e. 6 lines correspondences among 10) are labeled as outliers, our algorithm was been able to identify the four inliers in the data.The camera pose returned using this inliers gives a registration error about 8.1810 -5 .This result demonstrates the robustness and the accuracy of the proposed approach.Furthermore, we note that the number of trials needed to get the best solution increase with number of outliers (for example 240 trials for 60% of outliers for the indoor scenario and 56 trails for 50% of outliers for the outdoor scene).This means more processing time and will decrease the real time performance of the algorithm.40% of outliers (15 trials on average) represents a good compromise.Figure 5 shows the projection of the models using the pose estimated by our algorithm for different frames of the indoor and outdoor images sequence and when run with different percentage of outliers.We can see that all lines are well aligned, and in this cases, the outliers have not affected the registration performance of our robust algorithm.Another advantage of our approach is its robustness to severe lines occlusion.Indeed, as the line constraint equation (see section II-B) for the cameras pose parameters was developed in the case of "infinite image line".Any image points on the 2D line can be used to construct the corresponding projection plane.So, when partial occlusion occurs, it is enough to detect only small parts of the image edges to estimate the camera pose.In figure 6 we can see that several image edges are partially occluded (table and door for the indoor scene and because of the trees and the cars for the outdoor scene), in spite of that, the camera pose was successfully estimated.
We analyzed the processing time needed for camera pose estimation on a Pentium IV with 3GHz.All computations were performed in Matlab.The pose estimation process using IEKF does not take much time.Indeed, we have tuned the parameters of the IEKF in such manner so that it converges in a few iterations (20 at the maximum).
The processing time strongly depends only on the number of outliers in the current field of view.For example, the average time is about 28 millisecond per frame when having 40% of outliers in 10 input data for indoor scene.3 milliseconds are used to estimate the camera pose with the IEKF and 25 milliseconds are measured for the time needed to reject outliers.
We have also evaluated the 3D camera localization accuracy.For that, we considered the outdoor scene, and we compared the computed camera trajectory obtained with our algorithm with the ground truth data.The ground truth for position is provided by a Trimble Pathfinder ProXT GPS receiver giving a submeter accuracy, The covered distance is about 30 meters.The ground truth for rotation was obtained using an Xsens MTx gyro mounted rigidly with the camera [32].www.ijacsa.thesai.org The results of the experiments are given in figure 6 which shows the error between the ground truth data and the recovered position and orientation of the camera for different frames of the image sequence.Table 2 reports the mean values and the standard deviation of the position and the orientation errors.It can be seen that our algorithm allows a good 3D localization performance, especially for orientation components.

V. CONCLUSION
In this paper, we proposed a new approach for 6-DOF camera localization based on matching between 2D image edges and 3D model segments.We performed a generic camera pose estimation framework based only on lines features using an Iterated Extended Kalman Filter.We also achieved significant improvements on robust 2D/3D lines matching scheme by adapting the well-know RANSAC algorithm to our application.The experimental results confirm the robustness of our approach against severe occlusion and outliers for both indoor and outdoor applications.We also showed the good performance of our algorithm to localize a moving camera in 3D environment.Future work will be devoted to extend our approach by using other outdoor sensors (e.g. an inertial sensor and a GPS).Thus, the system could be used for navigation and localization in large-scale outdoor environments.An hybrid algorithm will the fuse the data provided by the three sensors in order to refine the camera pose.

Figure 1 .
Figure 1.Projection plane.The model line, its projection onto the image and the center of projection OC are coplanar.

Figure. 2 .
Two frames from the recorded images sequence.(a) indoor scene.(b) outdoor scene.Figure.3. 3D models of indoor and outdoor scenes used for experiments.(a) the office room model.(b) The frontage building model.
the projection of the office model using the camera pose estimated by our algorithm, and as can see it is quite skewed.All the lines are fairly well aligned.
. Projection of the 3D models using the final camera pose estimate when the input data has no outliers.(a) indoor scene (b) outdoor scene

Figure. 5 .
Camera pose estimation results

Figure. 6 .
Figure. 6. Position and orientation accuracy with respect to ground truth

TABLE I :
EXPERIMENTAL RESULTS OF THE ROBUST ALGORITHM