Multi-Target Tracking Using Hierarchical Convolutional Features and Motion Cues

This paper addresses the problem of multi-target tracking with a single camera in complex scenes. A new approach to multi-target tracking is proposed that learns from a hierarchy of convolutional features. First, a Fast Region-based Convolutional Neural Network (Fast RCNN) is trained to detect pedestrians in each frame. The detector is then combined with a correlation filter tracker that learns each target's appearance from a pretrained convolutional neural network. The correlation filters learn from the middle and last convolutional layers to enhance target localization. However, correlation filters fail when a target is fully occluded, which leads to separated tracklets (mini-trajectories). A post-processing step is therefore added to link separated tracklets with a minimum-cost network flow, using a cost function that depends on motion cues to associate short tracklets. Experimental results on the MOT2015 benchmark show that the proposed approach produces results comparable to state-of-the-art approaches. It shows an increase of 4.5% in multiple object tracking accuracy, and the percentage of mostly tracked targets is 12.9% versus 7.5% for the state-of-the-art minimum-cost network flow tracker.

Keywords: Multi-target tracking; correlation filters; convolutional neural networks


I. INTRODUCTION
The task of multi-target tracking is to estimate the number of targets and their trajectories across multiple frames. It is a crucial problem in computer vision and is in high demand in applications such as surveillance, human behavior analysis and augmented reality. It consists mainly of two components: detection, and data association between detections across frames. The data association step is challenging for many reasons, such as missed or faulty detections, short- and long-term occlusions, and interactions between targets in crowded scenes. Most recent approaches to multi-target tracking follow the tracking-by-detection paradigm, in which object detector outputs are linked to build target trajectories.
Recently, convolutional neural networks (CNNs) have gained a lot of attention. CNNs have demonstrated state-of-the-art results in various computer vision tasks such as object recognition, semantic segmentation and object detection, owing to their ability to capture a generic feature representation from visual data. However, CNNs are rarely used in multi-target tracking, for two reasons. First, a CNN requires a large number of positive and negative training samples, which are not always available. Second, the decision boundary between positive and negative samples is ambiguous, because sampling is done near the target, which leads to high correlation between positive and negative samples. On the other hand, CNN-learned features have outperformed hand-crafted features in many vision problems. As stated in previous work on single-object tracking [1], features from the last convolutional layer encode the target's semantic information and are more robust to changes in its appearance. However, they have low spatial detail, which is necessary for target localization. Features in earlier layers, in contrast, have high spatial detail, so they are more helpful for localization but less invariant to changes in target appearance.
In this paper, a Fast Region-based Convolutional Neural Network (Fast RCNN) detector is integrated with a correlation filter tracker. The proposed multi-target tracking algorithm is based on correlation filters that learn from the hierarchical convolutional features of a pretrained CNN. Cosine similarity between convolutional features is used, along with Euclidean distance, as the association measure between the tracklets of the previous frame and the current detections. A post-processing step with a minimum-cost network flow tracker is then added to recover targets after long occlusions. An overview of the proposed approach is shown in Fig. 1.
The following three contributions are made. First, Fast RCNN is integrated with a correlation filter tracker: Fast RCNN detects pedestrians in each frame and cooperates with the correlation filters to handle target disappearance. Second, a new data association metric is proposed between detections and trajectories, which describe the paths of target instances over time; it is based on the cosine similarity between middle convolutional layer features. Third, a fix for the occlusion problem of correlation filters is proposed, using minimum-cost network flow to avoid short tracklets and a high identity switch rate.
This paper is organized as follows. Section II discusses previous work. Section III details the proposed approach, including the basic idea of Fast RCNN, the mathematical concept of the correlation filter tracker, the data association strategy and the minimum-cost network flow. Section IV presents the evaluation of the proposed approach on the MOT2015 benchmark and a comparison against state-of-the-art approaches. Section V discusses the advantages and limitations of the proposed approach and future work. Section VI concludes the paper.

II. RELATED WORK

A. Object Detection
Recently, deep convolutional neural networks have made huge progress in object detection. RCNN [2] is a detector that classifies region proposals with a deep convolutional neural network. It first computes region proposals using a separate algorithm such as selective search [3], then feeds the candidate regions to a convolutional neural network for classification. However, RCNN is slow because it does not share computation during the forward pass; it processes each proposal separately. Girshick proposed Fast RCNN [4], an end-to-end architecture with shared convolutional layers. Fast RCNN improves training and testing speed while also achieving higher detection quality.

B. Multi-target Tracking
Owing to the importance of multi-target tracking in computer vision, a large number of sophisticated approaches have been developed for this challenging task, especially for crowded scenes where occlusions and false positives are common. Most work in multi-target tracking follows the tracking-by-detection approach [5,6,7], which can be divided into two steps: first, all targets are detected in each frame; then these detections are linked to form trajectories. Detections can be processed online, where only past and current frames are considered in building tracklets. Different approaches have been proposed for data association in online tracking. Early approaches used recursive Bayesian filters such as the Kalman filter [8] and the particle filter [9], which depend on a first-order Markov assumption. Another direction is to match objects in consecutive frames using a similarity measure, where only local features such as object appearance, distance between detections and size are considered. However, local association that considers only consecutive frames is limited in handling false positives and missed detections.
Other approaches in multi-target tracking adopt a batch approach [10,11,12,13,14], in which future detections are also considered and data association constructs target trajectories globally. The association between detections is then formulated as the minimization of a cost function. The data association problem can be formulated to reach a global optimum by using linear programming relaxation [10,15] or minimum-cost flow [11,16,17].

C. Deep Learning Multi-Target Tracking
Multi-target tracking algorithms based on CNNs [18] or recurrent neural networks have recently been proposed. They show higher performance than approaches based on handcrafted features. A Siamese CNN [18] was used to estimate the likelihood that two pedestrian detections belong to the same entity, using images and optical flow as model input; gradient boosting was then used to combine the Siamese CNN features with contextual features. End-to-end learning with Long Short-Term Memory (LSTM) was proposed in [19] for online tracking. Although this was the first fully end-to-end learning method based on deep learning, its performance did not reach the accuracy of state-of-the-art methods.

D. Tracking with Correlation Filters
Another recent approach to tracking is based on correlation filters. The tracker is initialized with an image patch of the target cropped at a given position. After initialization, an image patch is cropped from the estimated position in every new frame. Features are extracted from the cropped patch and a cosine window is applied to smooth the discontinuities at the boundary. A correlation between the input and the trained filter is then computed in the frequency domain, and the inverse Fourier transform of the correlation gives a confidence map with high values at the estimated target position and low values on the background. These filters can be considered simple linear classifiers.
A new approach was proposed in [20], in which all translated samples collected from the target are used to train the classifier. This enhanced training performance without sacrificing much speed, by exploiting the circulant structure of the kernel matrix. An extended Kernel Correlation Filter using both depth and color features was proposed in [21]. The depth distribution was used to identify scale changes and reflect them in the Fourier domain, and occlusion was detected by checking for sudden changes in the target depth histogram. This approach achieved real-time performance, running at 35 fps. A fast Scalable Kernel Correlation Filter was introduced in [22]. It used a Gaussian window function to overcome the fixed-size limitation of the kernelized correlation filter, which allowed target scale changes and provided better separation around the target. An extensive survey of correlation filters is available in [23], where experiments were conducted to evaluate the effectiveness and efficiency of different algorithms.

III. PROPOSED ALGORITHM
Taking inspiration from previous approaches to multi-target tracking, the proposed approach is subdivided into two modules: multi-target detection and two-step tracking. The first tracking step runs a correlation filter tracker for each target in the scene; each correlation filter learns from a hierarchy of convolutional features. In the second step, minimum-cost network flow is applied to link the short tracklets from the previous step. Our goal is to obtain the entire trajectory of each target in the scene, with each target associated with a unique ID. An overview of our approach is shown in Fig. 2, and Algorithm 1 summarizes the proposed approach.

A. Fine-Tune Multi-Target Detector
An end-to-end multi-object detector, Fast RCNN, is adopted. Fast RCNN was trained on the PASCAL VOC dataset. Since the 2DMOT2015 benchmark is based on pedestrians, a fine-tuning step is applied to consider pedestrians only: the softmax layer is changed to consider two classes, pedestrian and non-pedestrian. A selective search algorithm is used to generate region proposals for the network. The object detector evaluates each proposal region and gives each detection a score. Only detections with a score higher than a threshold are considered valid. A high predefined threshold of 0.9 causes most pedestrians to remain undetected, whereas decreasing the detection threshold increases false positives, which are more severe than missed detections. As a final step, non-maximal suppression based on the bounding box overlap between detections is applied to suppress redundant boxes.
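The score filtering and non-maximal suppression steps can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2), intersection-over-union as the overlap measure, and a greedy suppression scheme; the function names and the threshold values 0.5 and 0.3 are illustrative assumptions, not values taken from the paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes: Sequence[Box], scores: Sequence[float],
                      score_thresh: float = 0.5,   # hypothetical value
                      nms_thresh: float = 0.3      # hypothetical value
                      ) -> List[int]:
    """Return indices of detections kept after thresholding and greedy NMS."""
    # Keep only detections whose score is higher than the threshold,
    # ordered from the most to the least confident.
    order = sorted((i for i, s in enumerate(scores) if s > score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # A box survives only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    return keep
```

Lowering `score_thresh` in this sketch admits more boxes before suppression, which mirrors the trade-off between missed detections and false positives described above.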

B. Correlation Filters Tracking
As mentioned above, many possible target states remain undetected, so correlation filter tracking is applied to improve detection. Correlation filters (CF) have attracted a lot of attention in recent years for their speed and accuracy: frames are analyzed in the Fourier domain, which leads to faster processing, and the appearance model can be updated at every frame. CF is a discriminative classifier that learns to separate the target from its background. The main idea behind the CF tracker is that a learned filter is used to predict the target position by searching for the maximum value in the correlation response map. We follow the same mathematical model for model learning as [1]. CF learns from features extracted from VGG-Net-19 [24], which was trained on ImageNet [25]. The algorithm proposed in [1] used the outputs of conv3-4, conv4-4 and conv5-4. The pooling operations in VGG-Net-19 cause a gradual decrease in spatial resolution, which leads to imprecise target localization; for example, the convolutional feature size is 14x14 at pool4 and 7x7 at pool5. To solve this problem, each convolutional feature map is resized to a fixed larger size using bilinear interpolation.
A correlation filter $W^l$ is learned from each convolutional map to generate a response map, where $l$ indexes the convolutional maps. The feature tensor of layer $l$, of size $M \times N \times D$, is denoted $x$, where $M$, $N$ and $D$ are its width, height and depth (number of channels, indexed by $d$). The output of applying a Gaussian function to the circular shifts of $x$ along the $M$ and $N$ dimensions is denoted $y$. In frame $t$, each $W^l$ is updated from the previous frame $t-1$ using the numerator $A^l$ and the denominator $B^l$ through the following equations, as in [1]:

$$A_t^{l} = (1-\eta)\,A_{t-1}^{l} + \eta\, Y \odot \bar{X}_t^{l}, \qquad B_t^{l} = (1-\eta)\,B_{t-1}^{l} + \eta \sum_{d=1}^{D} X_t^{l,d} \odot \bar{X}_t^{l,d}, \qquad W_t^{l} = \frac{A_t^{l}}{B_t^{l} + \lambda} \tag{2}$$

Capital letters denote Fourier-transformed signals, the bar denotes complex conjugation, $\odot$ denotes element-wise multiplication, $\eta$ is the learning rate and $\lambda$ is a regularization parameter.
The correlation response map is then calculated for a new image patch $z$ that contains the target:

$$f^{l}(z) = \mathrm{IFFT}\Big(\sum_{d=1}^{D} W^{l,d} \odot Z^{l,d}\Big)$$

where IFFT denotes the inverse fast Fourier transform. The new target position is then found by searching for the maximum value of the correlation response map.
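As a minimal sketch of this learning and detection cycle for a single layer, the following NumPy code keeps a running numerator A and denominator B in the Fourier domain and locates the target at the peak of the response map. It assumes the formulation of [1] with the conjugate placed on the training features; the function names, the Gaussian width `sigma`, and the values of `eta` and `lam` are illustrative assumptions, and the cosine window and the multi-layer fusion are omitted.

```python
import numpy as np


def gaussian_label(M, N, sigma=2.0):
    """Gaussian-shaped regression target y, peaked at the patch centre."""
    rows, cols = np.mgrid[0:M, 0:N]
    return np.exp(-((rows - M // 2) ** 2 + (cols - N // 2) ** 2) / (2 * sigma ** 2))


def train_terms(x, y):
    """Numerator A (one slice per channel) and denominator B for features x of shape (M, N, D)."""
    X = np.fft.fft2(x, axes=(0, 1))
    Y = np.fft.fft2(y)
    A = Y[..., None] * np.conj(X)            # Y (.) conj(X^d) for every channel d
    B = np.sum(X * np.conj(X), axis=2).real  # sum over channels of |X^d|^2
    return A, B


def update(A_prev, B_prev, x, y, eta=0.01):
    """Running-average update of A and B with learning rate eta (hypothetical value)."""
    A_new, B_new = train_terms(x, y)
    return (1 - eta) * A_prev + eta * A_new, (1 - eta) * B_prev + eta * B_new


def response(A, B, z, lam=1e-4):
    """Correlation response map for a new feature patch z of shape (M, N, D)."""
    Z = np.fft.fft2(z, axes=(0, 1))
    W = A / (B[..., None] + lam)             # filter with regularization lam
    return np.real(np.fft.ifft2(np.sum(W * Z, axis=2)))
```

In this sketch a patch that is a circular shift of the training patch produces a response whose peak moves by the same shift, which is how the new target position would be read off with `np.unravel_index(np.argmax(r), r.shape)`.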

C. Detection Guidance
We propose cooperation between the Fast RCNN detector and the CF tracker to handle CF disadvantages such as scale variation and model drift. Because CF does not handle scale variation well, the target bounding boxes predicted by CF are added to the region proposals of Fast RCNN, and the predicted scales are then used to update the targets' appearance models. In this way the detector validates the CF predictions. The Fast RCNN detector is also used to discover whether a target has left the field of view: if the detector score is below a predefined threshold, the target state is considered inactive and its appearance model is no longer updated. This step prevents the model drift that may otherwise occur in the tracker. Figs. 3 and 4 show an example of detection guidance in our model. At frame i-1 there are three targets, each assigned an ID, and the target with ID '1' appears to be leaving the view. At frame i, target '1' disappears, so its bounding box disappears.
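The guidance rule can be sketched as a small state update. This is only an illustration under stated assumptions: `score_box` is a hypothetical callable standing in for the Fast RCNN scoring of a proposal (it returns a score and a refined box), the `Track` fields and the threshold 0.5 are invented for the example, and the counter `model_updates` stands in for the actual appearance-model update.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


@dataclass
class Track:
    track_id: int
    box: Box
    active: bool = True
    model_updates: int = 0  # placeholder for the CF appearance model


def guide_with_detector(track: Track, cf_box: Box,
                        score_box: Callable[[Box], Tuple[float, Box]],
                        score_thresh: float = 0.5) -> Track:
    """Validate a CF prediction with the detector and update the track state."""
    # The CF-predicted box is submitted to the detector as an extra proposal.
    score, refined_box = score_box(cf_box)
    if score < score_thresh:
        # Target is treated as gone or occluded: freeze the appearance model.
        track.active = False
        return track
    # Detector confirms the target: adopt its box (and scale), update the model.
    track.active = True
    track.box = refined_box
    track.model_updates += 1
    return track
```

Freezing the model when the score drops is what keeps background or occluder pixels from being learned as target appearance.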

D. Data Association
The goal of data association is to associate the detections of the current frame with the tracks that describe target paths. This updates the state of each target and identifies new detections. Each target is in exactly one of the following states: assigned, unassigned, lost or new.
The Hungarian assignment algorithm [26] is used for this task. From its results we determine the detections assigned to tracks, the unassigned tracks and the new detections. In each frame, the Hungarian algorithm is applied with cosine similarity as the cost measure between the current frame detections and the target positions predicted by CF. Only highly overlapping bounding boxes are considered in the association. The cosine similarity is calculated between middle convolutional features taken from the output of the conv3-4 layer, since the middle layer has higher spatial detail, which is useful for differentiating between targets. The overlap function between two bounding boxes A and B is calculated as follows:
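A minimal sketch of this association step, assuming the overlap function is the usual intersection-over-union ratio |A ∩ B| / |A ∪ B| and that the assignment cost is one minus the cosine similarity. It uses `scipy.optimize.linear_sum_assignment` as the assignment solver; the function names, the gating value `min_iou` and the rejection value `max_cost` are illustrative assumptions rather than values from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u, v):
    """Cosine similarity between two (flattened) feature arrays."""
    u, v = np.ravel(u), np.ravel(v)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def associate(track_boxes, track_feats, det_boxes, det_feats,
              min_iou=0.3, max_cost=0.5):  # hypothetical thresholds
    """Return (matches, unassigned_tracks, new_detections) as index lists."""
    n_t, n_d = len(track_boxes), len(det_boxes)
    big = 1e6  # cost for pairs that fail the overlap gate
    cost = np.full((n_t, n_d), big)
    for i in range(n_t):
        for j in range(n_d):
            if iou(track_boxes[i], det_boxes[j]) >= min_iou:
                cost[i, j] = 1.0 - cosine_similarity(track_feats[i], det_feats[j])
    rows, cols = linear_sum_assignment(cost)
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unassigned_tracks = [i for i in range(n_t) if i not in matched_t]
    new_detections = [j for j in range(n_d) if j not in matched_d]
    return matches, unassigned_tracks, new_detections
```

The three returned lists correspond to the assigned, unassigned and new outcomes described above; how a track moves to the lost state is not covered by this sketch.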

E. Minimum-Cost Network Flow
Minimum-cost network flow is applied as a post-processing step after the proposed multi-target correlation filter tracking. Integrating Fast RCNN with correlation filters yields a low missed-detection rate and increases tracker precision. The proposed approach can also handle some cases of occlusion, such as targets occluded by non-pedestrian objects or targets leaving the scene. However, targets still need to be recovered after long occlusions, and the case where a target is occluded by another pedestrian must be handled. These issues may cause wrong associations and false target model updates. To handle them we follow a batch approach, in which future detections can be used. Multi-target tracking is formulated as a minimum-cost network flow problem, and the matching between detections is solved jointly for all tracklets.
We use a modified version of the minimum-cost network flow in [12] to refine the resulting tracklets, since it relies on motion cues when associating detections. Given an initial set of tracklets, each consisting of an ordered set of detections, motion cues are used to refine the tracklets and link separated short tracklets.
For every detection d in a tracklet, two sets of detections are defined: the first contains all tracklet detections before d, and the second contains all tracklet detections after d. These two sets are used to determine linear regression coefficients that predict the target position forward and backward in time. A cost function is then computed from the residual between the predicted and actual tracklet positions. Finally, a minimum-cost network flow solution is computed to produce the final tracklets.
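The forward and backward prediction can be sketched with a per-coordinate least-squares line fit. This is an illustration only: averaging the forward and backward residuals, the `min_points` guard and the function names are assumptions made for the example, and the way [12] turns residuals into edge costs of the flow network is not reproduced here.

```python
import numpy as np


def predict_linear(times, positions, t_query):
    """Fit position = slope * t + intercept per coordinate and evaluate at t_query."""
    coeffs = np.polyfit(times, positions, deg=1)  # row 0: slopes, row 1: intercepts
    return coeffs[0] * t_query + coeffs[1]


def motion_residuals(times, positions, min_points=2):
    """Residual between the linearly predicted and actual position of every detection."""
    times = np.asarray(times, dtype=float)
    positions = np.asarray(positions, dtype=float)
    residuals = np.zeros(len(times))
    for k in range(len(times)):
        errors = []
        if k >= min_points:
            # Forward prediction from the detections before k.
            pred = predict_linear(times[:k], positions[:k], times[k])
            errors.append(np.linalg.norm(pred - positions[k]))
        if len(times) - k - 1 >= min_points:
            # Backward prediction from the detections after k.
            pred = predict_linear(times[k + 1:], positions[k + 1:], times[k])
            errors.append(np.linalg.norm(pred - positions[k]))
        if errors:
            residuals[k] = np.mean(errors)  # assumed way of combining both directions
    return residuals
```

A detection that deviates from the straight-line motion of its neighbours receives a large residual in this sketch, which is the kind of signal a motion-based cost can use to break or re-link tracklets.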

IV. EXPERIMENTS
In the evaluation step, the 2DMOT2015 [29] benchmark, a common reference in the multi-target tracking task, is used to evaluate our multi-target tracking algorithm. The benchmark is composed of training and testing sets. The training set consists of 11 sequences with 40,000 bounding boxes, while the testing set consists of 11 sequences with 60,000 boxes. The benchmark contains sequences with high variation in target motion, camera motion, viewpoint and person density. For fair comparison in the tracking task, 2DMOT2015 also provides public detections given by the Aggregate Channel Features (ACF) pedestrian detector [30]. To allow comparison of the proposed work with others, the public detections are used as region proposals for the Fast RCNN detector.

Algorithm 1
  ...
    Detect pedestrians using Fast RCNN
    Apply non-maximal suppression
    for each correlation filter tracker do
      ...
    for each detection in unassigned detections do
      Initialize new correlation filter tracklet
    end for
    for each tracklet in assigned tracklets do
      Update target model from convolutional maps
    end for
    for each tracklet in unassigned tracklets do
      ...
    end for
  end for
  for each tracklet do
    Refine tracklets using motion cue LP optimization
  end for
The widely accepted CLEAR MOT evaluation metrics [31] are employed by 2DMOT2015. To summarize multi-object tracking (MOT) performance, the following measures are reported. MOT accuracy (MOTA) jointly measures three errors: false positives (FP), false negatives (FN) and identity switches (IDSwitch). MOT precision (MOTP) measures the misalignment between the detected target locations and the ground truth. The percentages of mostly tracked and mostly lost targets (MT and ML) are also reported, as is the IDSwitch ratio between targets.
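For reference, MOTA in the CLEAR MOT metrics combines the three error types over all frames as MOTA = 1 - (FN + FP + IDSwitch) / GT, where GT is the total number of ground-truth objects. The sketch below computes it from per-frame counts; the function name and argument layout are illustrative assumptions.

```python
from typing import Sequence


def mota(fn: Sequence[int], fp: Sequence[int],
         idsw: Sequence[int], gt: Sequence[int]) -> float:
    """MOTA from per-frame false negatives, false positives, identity switches and ground-truth counts."""
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    # Errors are summed over all frames before normalizing, so MOTA can be negative.
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / total_gt
```

For example, two frames with five ground-truth objects each and a total of one miss, one false positive and one identity switch give a MOTA of 0.7.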

B. Evaluation on MOT Testing Data
To be fair in comparing with other approaches, the public detections provided by the benchmark are used as region proposals for the Fast RCNN detector during the testing phase. Fast RCNN therefore only filters the 2DMOT2015 public detections and eliminates false positives.
Baseline comparison. The proposed approach is compared with the minimum-cost network flow in [12], to show the progress achieved by the proposed multi-target tracking algorithm in improving precision and restoring undetected target states.
The proposed approach is compared against state-of-the-art deep learning based approaches, namely the Siamese CNN in [18] and the recurrent neural network in [19], as shown in Table I. It is also compared against approaches based on handcrafted features in [27,28].
The results show that combining Fast RCNN with the multi-correlation filter tracker produces high precision and a low missed-detection rate. In addition, using motion cues with the minimum-cost network flow lowers the number of identity switches, which improves the mostly tracked targets rate.

V. DISCUSSION
Training correlation filters with a hierarchy of convolutional features improves tracker robustness and accuracy. Combining correlation filters with Fast RCNN also protects the tracker from drift and from scale estimation problems. Together these lead to a low missed-detection rate and high tracker precision. The last step of the proposed approach refines the tracklets while considering motion similarity. However, refining tracklets under a linear velocity assumption may fail for random motion patterns, which leads to false associations and more identity switches between targets. Nonlinear motion patterns will be considered in future work.

VI. CONCLUSION
In this paper, a multi-target tracking algorithm is proposed that exploits features from a pretrained convolutional neural network. First, Fast RCNN is trained to detect all pedestrians in the scene. Then a correlation filter tracker is proposed to learn target appearance from a hierarchy of convolutional features, since middle convolutional layers are useful for target localization while the last convolutional layer is more robust to changes in target appearance. Cosine similarity between convolutional features is also used in the data assignment between tracklets and detections.
Finally, to handle the failure of correlation filters under occlusion, a minimum-cost network flow is proposed to link short tracklets. Experimental results demonstrate that the proposed algorithm provides competitive performance on the 2DMOT2015 benchmark.

Fig. 2. Detailed description of our system using correlation filters and minimum-cost network flow.

TABLE I.
COMPARISON WITH STATE-OF-THE-ART APPROACHES. THE BEST SCORES ARE BOLDFACED. AN UP ARROW INDICATES THAT HIGHER IS BETTER, WHILE A DOWN ARROW INDICATES THAT LOWER IS BETTER.