A New Framework of Moving Object Tracking based on Object Detection-Tracking with Removal of Moving Features

Object Tracking (OT) on a Moving Camera socalled Moving Object Tracking (MOT) is extremely vital in Computer Vision. While other conventional tracking methods based on fixed camera can only track the objects in its range, a moving camera can tackle this issue by following the objects. Moreover, single tracker is used widely to track object but it is not effective due to the moving camera because the challenges such as sudden movements, blurring, pose variation. The paper proposes a method inherited by tracking by detection approach. It integrates a single tracker with object detection method. The proposed tracking system can track object efficiency and effectively because object detection method can be used to find the tracked object again if the single tracker loses track. Three main contributions are presented in the paper as follow. First, the proposed Unified Visual based-MOT system can do the tasks such as Localization, 3D Environment Reconstruction and Tracking based on Stereo Camera and Inertial Measurement Unit (IMU). Second, it takes into account camera motion and the moving objects to improve the precision rate in localization and tracking. Third, proposed tracking system based on integration of single tracker as Deep Particle Filter and Object Detection as Yolov3. The overall system is tested on the dataset KITTI 2012, and it has achieved a good accuracy rate in real time. Keywords—Moving object tracking; object detection; camera localization; 3D environment reconstruction; tracking by detection


I. INTRODUCTION
In Object Tracking, it is necessary to predict the position of object being tracked in the current frame and match them with the previous ones to achieve its precise position in the current frame. Many significant works have dealt with appearance changes overtime such as color histogram [1], HoG feature [2], SIFT or SURF feature [3], or texture features like LBP [4].The single tracker based on the popular filters such as Correlation Filter, Kalman Filter or Particle Filter. Correlation filter [5] [6] [7] is also used and it acquired high speed and accuracy. Other filters such as Kalman Filter or Particle Filter are used because they could predict the position of objects and then match the predicted position with the previous one. Kalman Filter [8]- [10] could not deal with nonlinearity in the measurements because the filter tries to linearize it using approximation method. Particle filter [11], [12] are used as it could solve the drawback of Kalman filter. Recently, deep neural networks have been applied in tracking problems. S. Chen and W. Liang [13] used a CNN to distinguish the background from objects and then track the objects according to their position. CNN is integrated with correlation filter [14] or with particle filter [15], [16]. But these approaches do not take into account the challenges of moving camera. J. S. Lim and W. H. Kim [17], Y. Chen et al. [18] tried to calculate translation vector between two consecutive frames (or two frames from stereo camera).
Based on data acquired by IMU and stereo camera, the paper proposed a solution by integration of a single tracker as Deep Particle Filter and an object detection method as YOLOv3 [19], however, the object would be tracked by its three-dimensional center. In traditional object tracking from static camera, two-dimensional position of tracked object is enough but in MOT, three-dimensional position of tracked object must be considered. The challenges must be taken into account as the vibration of the camera and the movement of the object. YOLOv3 is the right solution because it can detect objects very quickly and then this result can be used to support the single tracker be more robust, and the most important thing is that it is suitable for real-time applications. In addition, in the localization and three-dimensional environment reconstruction, the removal of moving objects is considered to increase accuracy rate. To do that, the paper does not rely on estimating 6 degrees of freedom to find out the robot position, but inspired from [20], the paper splits it into two separate transformations including a rotation transformation and a translation transformation. Rotation transformation is calculated based on IMU and the translation transformation is estimated from the stereo camera. Robot can locate by itself based on these two transformations in real environments. Data which is observed from stereo camerabased environments includes two kinds of object: moving objects and static objects. If the feature points of moving objects are used to estimate the robot position and 3D point cloud of environment, the estimated error will increase over time. Therefore, the paper considers to eliminate feature points of moving objects to increase accuracy rate. This is an improvement of the paper to increase the accuracy rate of localization and 3D environment reconstruction. Most of the solutions be published have not yet considered the feature points of moving objects. But in experimental results of the paper, the accuracy rate with removal of moving features has yielded better results than the opposite. To remove moving objects, the paper uses the background subtraction method with camera motion compensation to detect moving objects proposed in [21], [22], the advantage is fast and accurate www.ijacsa.thesai.org detection of moving objects. Meanwhile in the research [23] it is assumed that moving objects are identified as belonging to movable categories which are likely to move currently or in the time coming, such as people, dog, cat, and car. For instance, once a person is detected, no matter walking or standing, it is considered as a potentially moving object and remove the features belonging to the region in the image where the person was detected. The limitation of the proposed method in [23] is that it is impossible to distinguish moving objects or static objects.
In the MOT problem, the paper tries to use stereo camera and IMU without GPS for the following reasons: The paper would like to test the power of visual information acquired from stereo camera in estimating the position of the robot. The IMU data will provide rotation transformation of the robot motion. Stereo camera integrated with IMU can work better than GPS in many environments such as indoors, radio interference, noisy GPS and in the cases that the input is only visual information of tracked object.
In Section II, the paper reviews the previous work in visual tracking on a fixed camera as well as moving camera. Section III describes the proposed methods such as object localization, 3D environment reconstruction and tracking algorithm based on stereo camera and IMU. Section IV shows experimental results of localization and tracking. The paper discusses about the pros and cons of the proposed methods in Section V. Conclusion and future works will be presented in Section VI.

A. Camera Localization
Robot localization is crucial for many high-level tasks such as object tracking, obstacle detection and avoidance, motion planning, autonomous navigation, local path planning and a waypoint follower, etc. Over the years, many researchers have been working on the problem of robot localization and made certain contributions. David Nistér et al. [24] proposed a system for real-time ego-motion estimation of a single camera or stereo camera. Bernd Kitt et al. [25] proposed another visual odometry algorithm based on RANSAC outlier rejection technique. Shaojie Shen et al. [20] used the feature points from stereo images and IMU information to estimate robot position. S.Prabu and G. Hu [12] proposed a vision based on localization algorithm which combinesthe partial depth estimation and particle filter techniques. Yanqing Liu et al. [26] present a robust stereo visual odometry using an improved RANSAC based method (PASAC) that makes the procedure of motion estimation much faster and more accurate than standard RANSAC. Yuquan Xu et al. [27] propose a novel algorithm for the problem of three-dimensional point cloud map based on localization using a stereo camera. S. Hong et al. [28] proposed the real-time autonomous navigation system using only a stereo camera and a low-cost GPS. All the aforementioned research works have provided the fundamental background knowledge to solve the localization problem in this paper. Here, the paper proposes a novel method to localize a robot using a stereo camera and IMU sensor, especially it takes into account moving objects to increase accuracy rate.

B. Moving Object Tracking
The moving camera could solve the disadvantages of fixed camera. The fixed camera can only track objects within their range, if the objects come out of field of view (FOV) of camera, it cannot monitor the objects and for realize this matter, it should be mounted on the moving framework such as robot, drone or an autonomous-driving car.
Y. Chen et al. [18] used features such as SIFT, SURF to match the features between two consecutive frames to find out the translation vector of camera and uses it to predict the position of objects in the frame. J. S. Lim and W. H. Kim [17] estimated motion by distinguishing 16x16 patches between the two frames. Each patch has a vector that is the main motion in this area and after traverse all the 16x16 patches in two consecutive frames, the vector with the highest frequency is selected as camera motion vector. These frameworks partly alleviate the effects of fast moving, rotation, vibration of the cameras.
There are also several ways to matching objects between two images. Q. Zhao et al. [1] matched objects by comparing color histograms but this is easy to fail in case there are regions which have the same colors with the objects. C. Ma et al. [14] applied CNN to extract features and compared objects by a correlation filter. R. J. Mozhdehi and H. Medeiros [15], T. Zhang et al. [16] inherited the previous framework and integrated it with Particle Filter.
The tracking part will inherit Particle filter to track objects and improve the performance in its prediction and measurement steps. Firstly, the paper will find out a translation vector by using feature matching algorithm, and then the position of the tracked object will be solved by applying deep neural network in conjunction with correlation filter.
Moreover, the paper inherits a deep CNN-based object detection algorithm named YOLOv3 [19] which is very fast and quite accurate to detect objects. By combining these methods, the tracking part has developed an algorithm called Tracking by Detection.
However, to track the object in the context of moving camera and moving object, the tracking part has to track object in 3D environment (by IMU and stereo cameras) so that the tracking system is realistic.

III. METHOD
The paper proposes a Unified Visual Based-MOT system can do the tasks such as Camera Localization, 3D Environment Reconstruction and Object Tracking.

A. Camera Localization
Inspired from the method of [20], the significant improvement is proposed in feature detection stage with removal of moving feature points. In addition, there are some differences between [20] and the paper. Specifically, instead using a built-in system [20] to get the camera position as www.ijacsa.thesai.org ground truth, the paper used the ground truth GPS of KITTI dataset.
To locate the position of the camera, the paper estimates the camera motion at time t, consists of the translation transformation and the rotation transformation of camera coordinate system between two consecutive frames based on the stereo camera and IMU sensor. The IMU data provides the rotation matrix for rotation transformation. The feature points of the image used to estimate the translation, these features include moving and static features. In this case, moving features are noise. Therefore, the paper removes the moving feature points to reduce error rate in estimating camera position. It is a new point in improving the robot localization process. The camera location estimation steps are shown in Fig. 1.

1) Camera model, feature detection and feature tracking:
Both cameras in the system are calibrated using the Camera Calibration Toolbox. Both cameras are divided into two systems and play different roles:  In the model, at each moment, the paper gets two images from the stereo camera (see Fig. 2). These two images are used for feature detection and reproduce the 3D positions of the features in WCS. However, feature detection and 3D position reconstruction will not be performed consecutively in pairs of successive images, but they are performed in a given cycle, which corresponds to 25 consecutive frames (depending on the device) (see Fig. 2). This means that at the beginning the feature detection is made from the two images of stereo camera and reconstructed the 3D position of the features in the world coordinate system, and after 25 consecutive frames of cycle (including frames used for feature detection), the above calculation process will be performed again. For 25 consecutive frames of cycle, the feature detection process and estimate the 3D positions are not performed, instead the features will be kept track on the successive image frames until a new cycle be done. The purpose of this solution is to reduce computational time but still retain the required accuracy rate.
In feature detection stage, the image features play an important role in locating robot positions. The SURF feature (Speeded Up Robust Features) [29] is extracted from pairs of images of the left and right cameras. FLANN matching algorithm (Fast Library for Approximate Nearest Neighbor) [30] is used for matching the features of two images. In order to remove outliers, Lowe outlier rejection method [31] is used. This outlier removal supports significant improvement the accuracy of localization.  In the feature tracking stage, the KLT algorithm [32] is used to track features through the consecutive image frames. This feature tracking is only performed in the left camera (Monocular Camera System). Tracked features are features that are detected and estimated at the 3D position at the stereo camera system at the start of each new cycle. In the tracking process, the moving features are detected and removed in the image (see Fig. 3).
In traditional methods, to estimate camera position, the moving and static features are all used. In order to increase the accuracy rate of camera position, the paper proposes removal of moving features. Because these moving features will have a position that changes over time, so using the features to calculate the camera position, the error of the predicted position will increase over time. The process of detecting and removing moving features is performed before estimating the camera position. Here, the paper proposes using the background subtraction method for two consecutive images to detect and remove moving objects as suggested in [21]. This method will find a transformation matrix (called a homography matrix) between two consecutive images and then use this homography matrix to transform two consecutive images into the same coordinate system. Then background subtraction method for these two images is performed to find the regions of moving objects on the image. Finally, the removal of features is performed in moving regions.
Outline of the steps of the removal of moving features process (see Fig. 3). At time t, there are two consecutive images from the left camera at the time t-1 and t, are called −1 and . Besides, at this time, in the feature tracking step between successive frames, two feature sets of tracking −1 and are obtained, respectively, for two images −1 and . Assuming that −1 = [p −1 1 , … , p −1 ] be the set of N key points found at time t -1 and = [p 1 , … , p ] be the set of the tracked points at time t. Here, p = [ , ] while and represent its 2D position in the image. Two sets of −1 and are used to find the coordinate transformation between the two images. Then convert two images −1 and to the same coordinate system and perform background subtraction to find the moving regions. Finally, remove the features located in the moving regions. Steps to remove moving features at time t: Step 1: Image registration: To find the transformation between two frames at the time t-1 and t, homography transformation is used. The relationships between these frames are shown as follow.
where the transform matrix H (homography matrix) is a 3by-3 matrix which describes the spatial relationship between two consecutive image frames.
As in [33], H can be solved by least square criteria with: where (•) represents the matrix transpose and(•) −1 represents the matrix inverse.
By multiplying the estimated transform matrix H on the pixel positions on current image, they are warped onto the image plane at the previous time instance and the same background scenes in consecutive frames can approximately overlap with each other, it performs camera motion compensation. Therefore, a new image will get at time t, ( ) , this image has the same coordinate system as the image at time t-1, −1 .
Step 2: Background subtraction: Perform background subtraction between image ( ) and −1 with a certain threshold and the moving regions be detected. In addition, using Morphology operators to refine the result image to increase the accuracy of moving regions.
Step 3: Removal of moving features: The feature points of moving regions are moving features, so they are excluded (The feature points are converted to the same coordinate system with the result image).  2) 3D Feature location via stereo correspondences: The 3D pose of corresponding feature points is estimated by stereo correspondences [34] (see Fig. 4). The 3D camera coordinates of the feature points are obtained based on the following equations: where is the focal length of the stereo camera. represents the baseline between the stereo cameras. , represents and coordinate of the principal point. is the disparity between the feature points in the left and right images. ( , ) and ( , ) ∈ ℝ 2 are the coordinates in the left and right images of the feature point, respectively.
Thus, ( , , ) is 3D camera coordinates of the feature points and represents the depth of the feature point.

3) Estimate camera position via 2D-3D correspondences:
Inspired from the camera location estimation method presented in [20], the paper improves precision rate of localization by removal of moving features. Assume that the 3D local feature map is known. Details of local map initialization and maintenance will be presented in the following sections. The robot position is assumed that the  The observation of the ℎ feature point from the homogeneous image coordinate system (ICS) ( , , 1) will be transformed into an observation vector in the camera coordinate system (CCS) at time and is denoted as, where −1 is the inverse matrix of K, this matrix will convert a point on the image plane into a directional vector , starting from the camera position to (or from the feature point to ), in the CCS. is a matrix transform from the CCS to the ICS.
Then, the observation vector is normalized to unit vector ‖ ‖ .
Transforming the observation vectors from the CCS to the WCS, and is given by, = where is rotation matrix from IMU coordinate system (IMUCS) to WCS, is rotation matrix from CCS to IMUCS, is obtained from offline camera calibration.
is obtained from IMU data at each time . Assume that the camera motion between two consecutive images is small, formula (7) can be approximated as: where = ‖ − ‖ ≈ ‖ −1 − ‖ are known quantities. By taking the derivative of formula (10) and setting it to zero, a linear system are obtained with the optimal camera position is the unknown: where is the 3D position of the camera at time in WCS, is the observation vector of the ℎ feature point at time in WCS, is the 3D position of the ℎ feature in WCS, ℑ represents the set of features observed in the image at time t, = ‖ −1 − ‖ is the known value, (⋅) represents the matrix transpose. 3 is a 3 × 3 matrix unit.
Equation (11) consisted of three equations corresponding to three unknowns which are the 3D position of the camera in WCS, these three equations will not change regardless of the number of observed features. Therefore, the camera position estimation can be solved efficiently in constant time. The observed features used to calculate camera position are features that are not in moving regions. Suppose, if using the features of the moving regions, the error of the estimated camera position will increase. Since the camera position is estimated based on 3D feature points at time t-1 and the corresponding feature observation vectors at time t, if you consider a feature of moving regions, the 3D position of features in the environment will be different at the time t-1 and t in the same WCS, and the feature observation vector at time t will not match the truth vector of the 3D feature point at time t-1 (i.e. the observation vector will not match the truth vector ) will increase the error for camera location estimation. In this case, if it is a static feature, the error level will be 0 or very small. Equation (11) can use at least two features to calculate camera position . As such, an efficient 2-point RANSAC (Random Sample Consensus) can be applied for outlier rejection. Using this algorithm will help reduce computational time compared to the traditional 3-point algorithm [35] and 5point algorithm [36].
As mentioned above, equation (11) is solved by the 2-point RANSAC algorithm. This algorithm includes the following steps: Firstly, determine the number of iterations for the algorithm. Secondly, at each iteration, define a random sample set of two elements which are two random points from the 3D features point set. Then, the estimation results based on this sample will be evaluated by an error function. The steps above are done several times to find the best robot position. Finally, after all iterations, RANSAC will converge at a good robot position, but not sure if this is the best position. This RANSAC algorithm ensures fast processing time, the ability to estimate a good enough model and eliminate noise in the data set.

B. 3D Environment Reconstruction
In this section, 3D environment reconstruction task is presented (see Fig. 7). The environmental map is a local map. The local map is defined as the set of currently tracked 3D features. The 3D points are calculated from two different methods, one from the stereo camera, the other from the monocular camera. These 3D points will be transferred from the CCS to the WCS of robots at the start and is added to the local map. The 3D features added to the local map are static feature points, because these will be used to estimate robot positions at different times. For moving feature points, it will cause an error when estimating the robot position. At the initial time = 0, the robot position is initialized. The 3D positions of the feature points will be estimated in the world coordinate from stereo camera. The 3D points are used to initialize the local map.
At the time ( ≠ 0), with given robot position, the local map is updated according to the following two systems: 1) Stereo camera system: After a given period of time, the system will be restarted to update the local map. The 3D points are calculated by stereo camera.
2) Monocular camera system: During the feature tracking process, some features will be lost and lost features will be removed from the map. New features are added to the local map if the current number of features is smaller than the minimum allowable feature count. The 3D feature location of the new feature point is estimated based on a set of observation of the ℎ feature at different camera positions and is given by: where, is the 3D position of the camera at time in WCS, is the observation vector of the ℎ feature point at time , is the 3D position of the ℎ feature in WCS.
And solve equation (12) via the following linear system: where is the 3D position of the camera at time in WCS, is the observation vector of the ℎ feature point at time in WCS, is the 3D position of the ℎ feature in WCS, is the set of times t that the ℎ feature is observed on the left camera, (⋅) represents the matrix transpose, 3 is a 3 × 3 matrix unit.
Equation (13) is solved by basic matrix algebra, is defined as 2 consecutive times t-1 and t. www.ijacsa.thesai.org The 3D positions of the feature points are calculated from monocular camera will be updated in the following two cases:  The feature points have been restored the 3D position from stereo camera system: Add the 3D position from monocular camera system to the local map.
 The feature points have not had the 3D position from stereo camera system: The feature points are added to the local map if the current number of feature points in local map is smaller than the minimum allowable feature points count.

Failure Detection and Recovery
Because the 3D positions of the feature points in local map are calculated from two systems: feature tracking by monocular camera, feature detection and matching by stereo camera, therefore, it will accumulate errors. The error of the local map at time is calculated as: where, is the location of feature obtained by stereo correspondence, is the location of feature obtained by monocular camera, is the set of 3D points to calculate the errors, is the camera position at time .
The system works well when ≅ 1 and otherwise it is error. When the system has error, the system will remove all features from the monocular camera system and restart the local map based on the 3D positions of feature points from stereo camera .

C. Tracking by Detection
The goal of the paper's tracking algorithm is to solve the challenges of a moving camera which are vibration of the camera and motion of tracked object. The basic essence of the paper's tracking algorithm is the integration of single tracker and object detection to become a method called tracking by detection. The single tracker will give the state of the tracked object at every time step, moreover, its 3D position by reconstructing the environment can provide the necessary information for tracking process. A single tracker is integrated to object detection as YOLOv3 because YOLOv3 can detect objects very quickly and accurately, it can support the single tracker to find the tracked object more quickly and when the single tracker fails to track down the tracked object, YOLOv3 is a useful helper to find the object being tracked. And Tracking by Detection with 3D Environment Reconstruction can estimate the three-dimensional position of the tracked object.
This section describes the workflow of the paper's tracking algorithm, it includes four steps: Initialization, Prediction, Matching and Resampling.
Firstly, an object could be selected in a frame to track, however, the system might be given an image of object and it would be found out before tracking.
After having the bounding box of the tracked object at time t, in prediction step at time (t+1), the particle filter will generate some particles and each particle will be guided by a motion vector of the tracked object. An object detector will come in handy in this step, it will detect objects that is the same kind with the tracking object and then, some of detected objects will be chosen to add into the particle set. Object detection based on deep learning can be robust to the vibration of camera.
In the matching step, YOLOv3 will detect a number of objects of the same type as the tracked object and then select a detected object with highest matching rate with the tracked object as the current tracked object. If matching is failed, each particle will be matched with the previous object by an observation model. The correlation filter will be integrated to a deep neural net to match objects. After that, resampling could be performed if it is necessary.
The tracking pipeline of the paper is as Fig. 8: Using a matching feature algorithm, feature points which are not matched are eliminated and the remaining K pairs of feature points are treated as a set of K motion vectors.
With each , i ∈ K, is a camera's motion vector of a pair of feature points. This set is used to compute camera's motion vector of the object at time t. (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 4, 2020 42 | P a g e www.ijacsa.thesai.org 2) Prediction: In this step, the state of the object will be predicted in the next frame.
The particle filter generates some particles around the previous object and each object has a weight which is the importance of a particle.
For example, in Fig. 9, the motorbike is being tracked: The yellow box indicates the object is being tracked. The green box indicates a particle around the object.
The particle is guided by a camera's motion vector which is obtained by matching SURF features in two consecutive frames got from the left or right camera. Stereo camera is used to get two current frames instead of using two consecutive frames as [38]. After matching features, The predicted state of tracked object is ( ): ( − 1) is the state of the tracked object in previous frame (t-1).
( ) is a camera's motion vector at time t computed by feature matching.

( ) is Gaussian noise added.
After that, a number of particles will be generated around ( ) and each particle is called ̂( ). With, YOLOv3 [19] is used to detect objects in this step, after detecting, some objects will be selected by their IoU with the tracked object, if this value is higher than a threshold, this object will be kept and added into the particle set. 3) Matching: Each particle will be compared with the previous object by an observation (matching) model, after comparing, each particle will have a weight.
The correlation filter is presented in [5] [6] [7] will be used in observation model. This filter is obtained at time (t-1), next, it will traverse the frame at time (t), this means the filter will convolve with each region of the image from left to right, up to down. This result of a value which determine the correlation between the object at time (t-1) and the region at time (t). The higher the value is, the higher the region will be the state of the tracked object at time (t).
The correlation filter is integrated with particle filter because it does not need to traverse the whole frame, it only needs to convolve with each particle. This will reduce computational cost.
But, to use this correlation filter, it has been learned in frame (t-1). It is called as and the region of the tracked object at time (t-1) is and M, N is width and height of this region, and , with , ∈ {0,1, … , − 1} × {0,1, … , − 1} is the region of the tracked object at time (t-1) after translating to the right pixel and to below pixel.
Each , is corresponding with a value called , = ⅇ . This value is a Gaussian value because Gaussian value expresses how far a pixel from its center.
The correlation filter is then learned through this expression: After find out * , the weight also known as the correlation of each particle calculated by consoling * with the region of a particle.
In this step, a deep neural network called VGG-19 is applied [14] [16]. This network is trained and has a optimal parameters. , and the region of each particle in this step have to go through this network to obtain three feature maps called conv-3, conv-4, conv-5. And with each convolutional map, a corresponding * will be learned. Three weights of a particle called 1 , 2 , 3 and then the final weight of a particle is: Finally, the weight of each particle will be normalized. Fig. 10 is an example when a particle is passed through a deep network and feature maps are obtained. After that, each feature maps are convolved with a correlation filter to get weight values. Finally, the weight of the particle is the sum of three weight values. www.ijacsa.thesai.org

4) Resampling:
After matching step, the effective sample size is computed and it is compared with a threshold to perform resampling.
In this step, YOLOv3 [19] will be used to detect objects and choose the candidate which has the highest IoU value with the tracking object to become the one which is tracked.
Object Detector is used to overcome the degeneracy problem of particle filter.
If Object Detector does not find any objects, the Roullette table will be used to resample.
The Roullet table is described as Fig. 11, all the weights of particles will be put on a Roullet table: The wheel will be spinned N times (N is the number of particles), and N particles will be got with different weights.

5) Estimation
: After the C.3 step, the estimated state of tracked object is calculated by this expression: Recently, there are various of object detection algorithms based on neural network. The paper explored several popular object detection DNN architecture: Faster-RCNN [39], RetinaNet [40] achieved the high accuracy rate but they are not suitable to apply in real-time applications because their computational time are still slow.
SSD [41] achieved the accuracy rate lower than Faster-RCNN and RetinaNet and it is not robust to detect small objects. Fig. 11. Describe the Process of Spinning the Roullette Table. Inspired from [22]. YOLOv3 [19] is a suitable solution to detect objects in real-time applications, it achieves the accuracy rate ranked between the groups listed below (see Table I) but its processing speed is unsurpassed.

IV. EXPERIMENTAL RESULTS
The proposed method will be evaluated on the accuracy rate of camera location, tracked object location.
KITTI data set is used to evaluate the accuracy of the proposed method. KITTI dataset has many different data sets, two data sets "Raw Data" and "Object Tracking Evaluation 2012" are selected to experiment. The accuracy of the camera position estimation algorithm (camera position) is evaluated on "Raw Data" dataset. Datasets 0091, 0060, 0095, 0113, 0106 and 0005 are selected in dataset "Raw Data". Tracked object position estimation is evaluated based on the proposed tracking by detection method on "Object Tracking Evaluation 2012" dataset. These following datasets 0000, 0004, 0005, 0010, 0011, 0020 are used because they have specific objects to tracking.
The experiments are performed based on Python programming language (version 3.7.0) on computers with the following specifications: Intel Corei5 5200U 2.7GHz/ RAM 4GB / Windows 10 operating system (for camera localization); For the tasks such as tracking and 3D reconstruction, 2080 Titan GPU, Ubuntu 16.04 are used.

A. Accuracy of Camera Location Estimation
In this experiment, the data set KITTI is selected. The camera location is estimated with and without removal of moving objects.
The solution to remove moving objects in the image is described in the section 3.A.1 and Fig. 3. Distance error between estimated camera position and ground-truth camera position is estimated by formula: where, is the estimated camera location at the time t; , , and are the coordinates x, y and z of camera www.ijacsa.thesai.org location , GPS is camera location from GPS at time t; GPS , GPS , and GPS are the coordinates x, y and z of camera location from GPS , N is the number of frames.
From the experiment results (see Table II), estimated camera position is quite good and if the removal of moving object is considered, the better results can be achieved compared to the opposite case.

B. Accuracy of the Tracked Object Center in 3D
Object center is estimated as the average of all 3D points of the object being considered.
The following distances are used to evaluate errors (see Table III) between tracked object center and its ground truth object center: where ( ) is the center of the estimated 3D bounding box at the time t; , ( ) , , ( ) and , ( ) are the coordinates x, y and z of center ( ); ( ) is the center of 3D bounding box from ground-truth data at time t; , ( ), , ( ) and , ( ) are the coordinates x, y and z of center ( ). N is number of frames.
In this Section 4.B, a different dataset will be used to evaluate the accuracy of the tracked object center because it has ground truth data about the object's center.
From Table III, the center error in case of using YOLO has achieved better results than the opposite case.

C. Accuracy of the Tracking Object Position based on Object Center and IoU in 2D
The IoU metric is used to calculate the IoU between predicted boxes with its ground truth boxes in 2D.
In Section 4.C, the following datasets are used because of some objects exist to track, others do not have a consistent object to follow.
In Table IV, Euclide distance is used to estimate error between predicted box center and its ground truth box center. The table has shown that the IoU metric of proposed tracking method in case of using YOLOv3 is better than the opposite case.
In Table V, Euclidean distance is used to estimate the errors between the ground truth centers and the predicted centers. The table has shown that when combines YOLOv3, the Euclid distances of proposed tracking method in case of using YOLOv3 is smaller than the opposite case.

D. Speed of the Tracking Algorithm
Number of frames are estimated per second (FPS).
In Table VI, the paper measured the time taken to process a frame and then invert this time to get the given FPS. The table has shown that FPS of proposed tracking method in case of using YOLOv3 is about more three times faster than the opposite case.

V. DISCUSSION
The experimental results have shown that camera position is estimated quite well because the moving features are removed when estimating camera position. However, the moving features removal algorithm is still limited and should be improved in the future, though it has shown an improvement in the accuracy of the camera position when the moving features are removed. Though the error between the estimated camera location and ground truth camera location is still remaining, but it is useful in the environments such as indoors, noisy GPS and in the cases where input of tracked object is image. The experimental results has shown the important role of visual information in MOT.
The experimental results have also shown that the processing speed is suitable for real-time applications. The speed of tracking algorithm is increased when Particle filter is integrated to YOLOv3, this is because the tracking algorithm could use either of these methods to track the object. And when it uses YOLOv3, the algorithm can run very fast. Because of this, the speed increases significantly, it achieves more three times faster than the conventional method. In comparison to accuracy, the tracking by detection method is also more effective than the single tracker method based on metric of IoU in 2D or the tracked object center.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, a unified system consisted of robot localization, environment reconstruction and object tracking based on stereo camera and IMU has been proposed.
In localization, the paper's contribution is a solution to estimate camera position and tracked object position based on stereo camera and IMU with the removal of moving features. It has shown that the accuracy rate of locatization is improved and the computational time is suitable to real-time applications.
In tracking, the paper's contributions are: (1) particles guided by a motion vectors are calculated using pairs of SURF feature points, so the direction the object is pointing to is gathered; (2) an observation (matching) model contains correlation filter and a deep neural network (VGG- 19), it can deal with changes of translation of the object; (3) Tracking by detection with an object detection algorithm (YOLOv3) can support the single tracker become more accurate because it can detect more participants for particle filter, besides, it also can detect objects very fast and accurately.
Although the framework works well on the KITTI dataset, it is necessary to improve both localization and tracking algorithms.
In the localization algorithm, the algorithm should be tested on a real robot. The localization algorithm can also pave the way for the dynamic obstacle avoidance. And combination lidar with stereo camera is a novel way to explore.
In the tracking algorithm, the most time-consuming step is the matching step of the Particle Filter. In the matching step, Correlation Filter trained at each frame so the matching has increased the computational time. In the future, the Correlation Filter should be replaced by a pre-trained model such as Siamese net that can compare the features of the target at the consecutive times in real time.