Compact Scrutiny of Current Video Tracking System and its Associated Standard Approaches

With the increasing demand for video tracking systems with object detection across a wide range of computer vision applications, it is necessary to understand the strengths and weaknesses of the present approaches. There are numerous publications on different techniques for visual tracking associated with video surveillance applications. These fall into three prime classes of approaches, viz. point-based tracking, kernel-based tracking, and silhouette-based tracking. Therefore, this paper studies the literature published in the last decade to highlight the techniques adopted and to summarize the tracking performance they yield. The paper also highlights the present research trend towards these three core approaches and the open research issues associated with them. The prime aim of this paper is to study the prominent approaches to video tracking that have evolved to date in the literature. The idea is to understand the strengths and weaknesses associated with the standard approaches so that the findings can act as a guideline for constructing new models. The prominent challenge in reviewing the existing approaches is that nearly all of them target accuracy, whereas various connected problems in the internal process have not been considered, e.g., feature extraction, processing time, dimensional problems, and non-inclusion of contextual factors; this is an outcome of the review findings. The paper concludes by highlighting this research gap as the contribution of this review and further states that there are good possibilities for new work if these issues are considered prior to developing any video tracking system. Overall, this paper offers an unbiased picture of the current state of video tracking approaches to be considered for developing any


I. INTRODUCTION
With the advancement of computer vision and video surveillance systems, video tracking has gained immense popularity in both domestic and commercial applications [1]. Fundamentally, video tracking is a mechanism for identifying, recognizing, and tracking a mobile object over time [2]. Apart from video surveillance, video tracking is now used in various applications, viz. video editing, medical imaging, traffic control, augmented reality, communication, video compression, and human-computer interaction [3][4][5]. The complete video tracking pipeline can be time-consuming owing to its dependency on the massive amount of data within a video sequence [6]. Recognizing an object accurately adds further operational complexity [7]. Essentially, a video tracking system aims to associate the mobile target object (or multiple objects) across a sequence of video frames. This can be highly difficult, especially when the object moves fast relative to the frame rate or when its speed is uncertain. A change in the orientation of the mobile object over time is another complicated scenario, as it makes it harder to analyze the presence of the object in a given scene. To address this issue, a motion model is adopted in the video tracking system [8]. The motion model describes how the image of the target object changes under the different possible motions of the object. Regarding the motion model, a homography or affine transformation is generally used for two-dimensional models when tracking planar objects [9]. The motion model for a three-dimensional object is usually related to the position and orientation of the object [10].
In video compression, keyframes are divided into macroblocks, and the motion model is treated as a disruption of a keyframe governed by the motion parameters [11]. In the case of a deformable object, the motion model generally considers the position of the target object over a mesh [12]. At present, there is a large body of literature on video tracking systems, which mainly analyzes sequential frames of a video to identify the target object across frame transitions [13][14][15][16]. Under a generalized classification, existing video tracking algorithms are of two types, i.e., representation along with localization of a target object, and filtering of data. Algorithms of the first kind are generally known for their low computational complexity and are further classified into contour-based tracking and kernel-based tracking. Algorithms of the second kind mainly deal with the dynamics of the target object and perform assessment based on multiple hypotheses. Such algorithms therefore offer enhanced capability for tracking mobile objects of complex form. However, they are also computationally complex and depend on different parameters, e.g., stability, redundancy, and quality. The Kalman filter and the particle filter fall under this category. Therefore, the prime research problem considered in this work is that, although there are various implementation- and discussion-based papers on video tracking systems, there is no global viewpoint from which to standardize the effectiveness of all the exercised approaches. The actual state of existing video tracking approaches remains vague because the taxonomies are not well discussed. Apart from this, it is a known fact that there is an increasing demand for video surveillance systems with sophisticated features.
There is also a consistent evolution of approaches that assist the internal processing of video frames in image processing. This motivates the present work: the topic is worth researching owing to its abundant scope of application in the coming days, as well as the trade-offs involved in finding any potential standardized model. Therefore, the significant objective and contribution of this paper is to discuss the techniques applied in major implementation work under the significant classes of video tracking approaches, i.e., point-based tracking, kernel-based tracking, and silhouette-based tracking. The study also contributes a discussion of open research problems. The organization of this manuscript is as follows: Section II briefs the taxonomy of video tracking approaches, followed by a discussion of point tracking approaches in Section III. Existing kernel tracking and silhouette tracking approaches are discussed in Section IV and Section V, respectively. The findings of the study are discussed in Section VI, and a summary of the paper is given in Section VII.

II. TAXONOMY OF VIDEO TRACKING APPROACHES
Basically, a video tracking mechanism targets trajectory generation for an object by identifying its position in every frame of the video sequence over time. The taxonomy of the existing video tracking approaches is shown pictorially in Fig. 1. Existing video tracking approaches are broadly classified into three forms, i.e., point-based, kernel-based, and silhouette-based approaches. The point-based approach makes use of points for tracking and is further classified into deterministic and probabilistic approaches. In a deterministic approach, a cost is evaluated for associating each object across the video sequence under motion constraints (i.e., common motion, small velocity change, maximum velocity, rigidity, and proximity). In the probabilistic method, a state-space approach is usually used to model the properties of an object and its associated parameters (i.e., acceleration, velocity, position, etc.). At present, studies on point-based tracking mainly use Kalman filtering, particle filtering, and hypothesis-based tracking. The kernel-based tracking system uses the motion of an object between consecutive frames and is broadly classified into two forms, i.e., multi-view tracking and template-based tracking. In the multi-view tracking system, modeling is carried out without considering the interaction between object and background. This model represents the aggregated information corresponding to the objects in the given scene of the video sequence. The possibility of the same object appearing with different shapes and sizes is quite high in this approach. It is further categorized into subspace and classifier forms of tracking. Template-based matching is used mainly for tracking single objects, and it keeps the cost of computation minimal. It has been found that kernel-based tracking dominantly uses the mean-shift approach, the support vector machine, and the template-based approach.
Such approaches are reported to handle occlusion better, although they require more training to attain good performance. The final class of video tracking system is the silhouette tracking mechanism, which is meant for objects with complex shapes, e.g., shoulders, head, and hands, that cannot be described well by simple geometric shapes. Therefore, in this form of tracking, the region of the object is explored in each frame to offer a detailed description of the target object. Modeling in this form of tracking is feasible using the contours of an object, the edges of an object, or a color histogram. Fundamentally, there are two classes of silhouette tracking systems, i.e., shape-based and contour-based. The shape is also used to define the state space of an object. In order to maximize the posterior probability, the system updates the state at each time instant. In such a case, the posterior probability depends upon the prior state and the likelihood of the current state, which generally uses the spatial distance between two contours. It is to be noted that the classification of silhouette-based approaches has remained more or less unchanged, i.e., contour-based and shape-based, and no new forms have surfaced. In all the approaches mentioned above, certain issues have been consistently under observation, i.e., occlusion and tracking using multiple cameras. There are three classes of occlusion, i.e., i) occlusion due to structures in the background scene, ii) inter-object occlusion, and iii) self-occlusion. Developing a uniform video tracking algorithm that handles all three occlusion cases is itself a major challenge. Similarly, variation in the shape and size of the same object is a major challenge when applying a multi-view tracking system. The next section briefs the existing research work in this direction.

III. POINT TRACKING MECHANISM
Point tracking is the first form of object tracking method, where various forms of points are used across the target frames to represent the identified object over the video sequence. However, mapping a specific point to an identified object is quite complex, especially when a target object exits the scene, is misdetected, or is occluded. Basically, this system is of two types, i.e., the deterministic approach and the probabilistic approach. However, a closer look at the trends shows that point-based tracking systems mainly use the Kalman filter, particle filtering, and hypothesis-based tracking.
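As an illustration of the deterministic variant, the following minimal sketch associates points between two consecutive frames by ascending Euclidean cost under a proximity constraint. It is not taken from any of the surveyed works; the function name `greedy_correspondence`, the threshold `max_dist`, and the sample coordinates are assumptions made purely for illustration.

```python
import math

def greedy_correspondence(prev_points, curr_points, max_dist):
    """Greedily pair points of two frames, cheapest pairs first (hypothetical helper)."""
    # Collect every candidate pairing that satisfies the proximity constraint.
    candidates = []
    for i, p in enumerate(prev_points):
        for j, q in enumerate(curr_points):
            cost = math.dist(p, q)
            if cost <= max_dist:
                candidates.append((cost, i, j))
    candidates.sort()

    # Accept pairs in order of cost, using each point at most once.
    matches = {}
    used_curr = set()
    for cost, i, j in candidates:
        if i not in matches and j not in used_curr:
            matches[i] = j
            used_curr.add(j)
    return matches

# Illustrative coordinates: two tracked points and three detections in the next frame.
prev_points = [(10.0, 10.0), (50.0, 40.0)]
curr_points = [(52.0, 41.0), (11.0, 12.0), (200.0, 200.0)]
matches = greedy_correspondence(prev_points, curr_points, max_dist=15.0)
print(matches)
```

The third detection lies outside the proximity gate and therefore stays unmatched; in a full tracker it would be a candidate for a new track. A greedy pass is only one possible way to minimize the correspondence cost; an optimal assignment method could be substituted.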

A. Kalman Filter-based Methods
In recent years, the usage of the Kalman filter for object detection and tracking under various environmental conditions has increased significantly. In most of the work, the Kalman filter is reported to offer significant accuracy when tracking at higher speed. Even for complex video (e.g., satellite videos), the Kalman filter is reported to offer better tracking performance (Guo et al. [17]). In that work, global motion attributes characterize the moving object so that tracking performance with the Kalman filter can be scored measurably. The core part of the tracking system is developed using a correlation filter, which uses raw pixels and HOG to represent features. The study formulates an objective function that minimizes the squared error between the desired response map and the correlation response. The study integrates the correlation filter and the Kalman filter to facilitate higher tracking speed with accuracy and fault tolerance. However, the limitation is that this approach requires more robustness while performing dynamic tracking operations. A study towards achieving robustness is carried out by Gupta et al. [18], where depth of interest is used to perform object tracking in a mobile environment. The study uses an unscented Kalman filter in an experimental approach to forecast the location of a mobile object. A similar direction of work towards tracking the position of a mobile object is discussed by del Rincon et al. [19]. In this work, the authors consider the use case of tracking different parts of the human body using multiple trackers and a two-dimensional articulated model. The interesting part of this study is its support for identifying and tracking various rotational aspects of the human body. A study on tracking multiple targets has been carried out by Wang and Nguang [20].
The uniqueness of this study lies in integrating the associated data, using a probabilistic model, with Kalman filtering (Fig. 2).
A slightly unconventional mechanism of object tracking is carried out by Yang et al. [21], considering the use case of tracking aircraft. The study uses a deep learning mechanism to improve the accuracy of the tracking system, where the model integrates the Kalman filter and the extended Kalman filter to forecast the trajectory. Based on a region-based deep neural network, the presented scheme uses a shared convolutional structure to encode the positional information of the flying object. The region of interest is then attached to the pooling layers on top of the deep neural network, where the response for bin (i, j) is mathematically depicted in expression (1) of that work; there, the variable $r_c$ represents the mapping score associated with the region of interest. Each class is further subjected to a softmax response in deep learning. The mobility model of the presented deep learning mechanism is as follows:

$$x_k = A x_{k-1} + B u_k + w_k, \qquad z_k = H x_k + v_k \qquad (2)$$

Expression (2) is used for tracking the mobile target, where the state vector $x$ defines the system, $z$ is the measured value, and $A$ is the state transition matrix. The variables $B$ and $u$ represent the control parameters, while the transformation matrix is represented as $H$. The model also considers the presence of the noises $w$ and $v$. The study outcome (Fig. 3) shows that the model is capable of identifying the flying object against different backgrounds. Irrespective of the direction in which the airborne object moves, the model can successfully perform identification. The study outcome has finally been verified by comparison with multiple existing schemes, e.g., mean strategy, cropping with correction, cropping with estimation, normal cropping, and estimation. Table I highlights that Kalman filtering with deep learning offers a higher capability to track dynamic mobile objects.
The study also claims to reduce the detection time; however, the tracking duration may vary with the background of the scene, which is not discussed in the presented system and thereby acts as a limiting factor.
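The linear state-space model of expression (2) can be exercised with a standard Kalman predict/update cycle. The sketch below is a generic textbook filter, not the deep-learning pipeline of Yang et al. [21]; the constant-velocity state layout, the noise covariances `Q` and `R`, and the synthetic measurements are all assumptions chosen for illustration.

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R, B=None, u=None):
    """One predict/update cycle for the linear model of expression (2)."""
    # Prediction: x_k = A x_{k-1} + B u_k (process noise w_k enters through Q).
    x_pred = A @ x
    if B is not None and u is not None:
        x_pred = x_pred + B @ u
    P_pred = A @ P @ A.T + Q

    # Update: compare the measurement z_k with the predicted measurement H x_k.
    innovation = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ innovation
    P_new = (np.eye(x.size) - K @ H) @ P_pred
    return x_new, P_new

# Assumed constant-velocity target: state = [px, py, vx, vy], one frame per step.
A = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = 1e-3 * np.eye(4)   # assumed process noise covariance
R = 0.25 * np.eye(2)   # assumed measurement noise covariance

rng = np.random.default_rng(0)
true_velocity = np.array([2.0, 1.0])
x_est = np.zeros(4)
P_est = 100.0 * np.eye(4)
for k in range(1, 31):
    true_position = k * true_velocity
    z = true_position + rng.normal(0.0, 0.5, size=2)  # noisy position measurement
    x_est, P_est = kalman_step(x_est, P_est, z, A, H, Q, R)

print("estimated state:", np.round(x_est, 2))
```

Only positions are measured here, yet the filter also recovers the velocity components because the transition matrix couples them to position; this is the property that lets such a filter forecast a trajectory.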

B. Particle Filtering Method
When there is a large set of information, a sample of the data is required for any further processing. Such a sample, which is also called a particle, is used to represent the data distribution associated with various stochastic processes. The particle filtering process is used to extract such filtered samples in the presence of noisy information. There is also a high possibility of partial information and of non-linear states of varied form. From the perspective of a video tracking system, there is a need to track the object under various nonlinearities and uncertainties. Hence, the concept of particle filtering suits the design of a video tracking system well. One of the significant advantages of the particle filtering process is that it accommodates non-Gaussian distributions and nonlinearities. Apart from this, the particle filter acts as a better alternative to the existing Kalman filter. The critical issue associated with Kalman filtering is that it assumes a normal distribution of the state variables, which is less practical in nature and is therefore its limitation. Such issues can be addressed using particle filtering, where the state density at a specific time is represented by a set of weighted samples $\{(s_t^{(n)}, \pi_t^{(n)})\}$. The variable $n$ indexes the particles, and the sampling probability $\pi_t^{(n)}$ acts as a weight that indicates the significance of the considered sample. This mechanism can also address the computational complexity by storing the cumulative weight $c_t^{(n)}$ for each tuple. The frequently used sampling process includes selection, prediction, and correction. In the selection method, a random sample $s_t^{(n)}$ is selected from $S_{t-1}$ by generating an arbitrary number $r$ in the range [0, 1], finding the smallest index $j$ such that the cumulative weight $c_{t-1}^{(j)}$ exceeds $r$, and setting $s_t^{(n)} = s_{t-1}^{(j)}$.
In the prediction method, a new sample is generated as $s_t^{(n)} = f(s_t^{(n)}, W_t^{(n)})$, where $f$ is a non-linear function and $W_t^{(n)}$ is a zero-mean Gaussian error. In the correction method, the weights are estimated as $\pi_t^{(n)} = g(z_t \mid x_t = s_t^{(n)})$, where $g$ is a Gaussian density function. Adopting these methods offers more comprehensive tracking performance, even with a varying number of features. One recent work uses a similar approach in which multiple features are used to facilitate video tracking (Bhat et al. [22]). The authors comment that features are exclusive of environmental variables and that various attributes of the color distribution can be used as the feature space. The choice is highly application-specific; the study considers the KAZE feature, noting that blurring smooths the underlying information along with the noise. This challenge is addressed by using additive operator splitting to retain sharpness. According to this model (Fig. 4), the system takes video sequences as input, followed by selecting a target from a specific frame. The particles are generated around the centroid of the blob and are then updated, which can be carried out using the motion model. Finally, the particles are weighted based on the spatial score and then resampled to obtain a new centroid, which gives the filtered location of the target and thereby assists video tracking. Particle filtering is also used to address the issues of the appearance model, which suffers from various extrinsic limiting factors, e.g., background clutter, occlusion, and variation in illumination (Wang et al. [23]) (Fig. 5). This system uses particle filtering to generate information about the state of the target in the current frame immediately after updating the template.
It should be noted that the study mentioned above also relies on tracking with a matching mechanism that depends on local interest points, which reduces robustness. This problem is addressed by Zhang et al. [24], where basis matching is used as a substitute for point matching. A Gabor filter is used to learn the target model, while a particle filter identifies the object in the dynamic system. The study outcome was assessed in various test environments involving occlusion, variation in illumination, changes in object pose, and background clutter.
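The selection, prediction, and correction steps described in this subsection can be sketched as a single filtering step. This is a generic sampling-importance-resampling sketch under simplifying assumptions, not the implementation of any cited study: the motion function is taken as the identity plus zero-mean Gaussian noise, the measurement is a noisy 2-D centroid, and `motion_std`, `meas_std`, and the synthetic trajectory are illustrative values.

```python
import numpy as np

def particle_filter_step(samples, weights, z, rng, motion_std=2.0, meas_std=2.0):
    """One selection/prediction/correction cycle over 2-D position particles."""
    n = len(samples)

    # Selection: draw r in [0, 1) and pick the smallest j whose cumulative
    # weight exceeds r (inverse-CDF sampling over the stored cumulative weights).
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    r = rng.random(n)
    idx = np.searchsorted(cumulative, r, side="right")
    selected = samples[idx]

    # Prediction: propagate each particle; f is assumed to be the identity
    # with additive zero-mean Gaussian error W.
    predicted = selected + rng.normal(0.0, motion_std, size=selected.shape)

    # Correction: weight each particle by a Gaussian likelihood of measurement z.
    sq_dist = np.sum((predicted - z) ** 2, axis=1)
    new_weights = np.exp(-0.5 * sq_dist / meas_std**2) + 1e-300
    new_weights /= new_weights.sum()
    return predicted, new_weights

rng = np.random.default_rng(0)
n_particles = 500
# Particles are generated around an assumed initial blob centroid at the origin.
samples = rng.normal(0.0, 3.0, size=(n_particles, 2))
weights = np.full(n_particles, 1.0 / n_particles)

true_velocity = np.array([1.0, 0.5])
for k in range(1, 21):
    true_position = k * true_velocity
    z = true_position + rng.normal(0.0, 0.5, size=2)  # synthetic noisy centroid
    samples, weights = particle_filter_step(samples, weights, z, rng)

estimate = weights @ samples  # weighted mean gives the filtered target location
print("filtered location:", np.round(estimate, 2))
```

Because the likelihood can be replaced by any non-negative score (for example a color-histogram similarity), the same loop accommodates non-Gaussian measurement models, which is the flexibility the text attributes to particle filtering.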

C. Hypothesis based Method
Most video tracking systems establish correspondence using two video frames, where there is always a finite likelihood of incorrect correspondence if the correspondence is created between two frames only. Better tracking outcomes are obtained when the correspondence decision is deferred until further video frames have been read. In this regard, the multiple hypothesis process maintains several such correspondence hypotheses for all the objects at each frame instant. The final track of an object is the most likely set of correspondences over the period of its observation, and the approach is capable of creating new tracks for incoming objects while eliminating tracks that no longer apply. It should be noted that the multiple hypothesis-based approach is essentially an iterative process that starts from the set of current track hypotheses, where each hypothesis is a collection of multiple disjoint tracks. The system then predicts the position of each object in the consecutive frame for each hypothesis. These predictions are compared with the actual measurements by evaluating a spatial distance measure. Depending on this measure, the system establishes correspondences for each hypothesis, which in turn introduce new hypotheses for the next round of iteration. However, owing to this iterative operation, the approach leads to a computational burden. This complexity can be addressed by probabilistic modeling, where the correspondences are treated as random variables that are statistically independent of each other. Particle filtering can also be used to address this issue; however, it is less likely to enumerate all the possible correspondences. Hence, multiple hypotheses are a better option when all the possibilities must be checked.
Another advantageous feature of multiple hypothesis-based tracking systems is their ability to track smaller targets. However, in existing approaches this comes with a large tree structure containing a large number of branches. This issue can be addressed by applying an optimization approach. Work towards this goal is carried out by Ahmadi and Salari [25], where particle swarm optimization is used to explore the optimal number of tracks from the video sequence. The implemented steps in this work are i) exploring the preliminary tracks with the aid of a multiple hypothesis approach, ii) fine-tuning and adjusting the observed track information using particle swarm optimization, and iii) merging all the collected track information that maps to a single target object. However, the limitation of this approach is that it can track only a single object.
This limitation is overcome in the work of Kutschbach et al. [26], which extends tracking to multiple objects using a Gaussian mixture probability model with multiple hypotheses. The study also makes use of a kernelized correlation filter for better accuracy. It is to be noted that the iterative nature of this approach is also discussed in the existing literature for obtaining optimal outcomes. However, most of the existing approaches do not account for the relevance between two video frames, which is a major limitation. Apart from this, there is no optimized approach to utilize the preliminary information from the individual frames (Sheng et al. [27][28]). The optimization carried out in the approach of Sheng et al. is formulated in terms of maximum-weight independent sets. The study constructs the hypotheses between consecutive video frames using a hypothesis transfer model. Also, the complexity associated with the iterative process is addressed using a polynomial-time approximation algorithm, which indirectly improves the efficiency of the system. The upper bound UB of the tracklet is also derived mathematically in that work. Fig. 6 highlights the visual outcomes, showing that this model can track different objects at the same time with windows of different sizes. However, this approach is limited to multiple object tracking with a single camera. Further, the authors have developed a graph model with the distance and time information connected with the trajectories. The model uses a temporal graph to assess the presented tracklet generation, resulting in connectivity among hypotheses and benchmarking. The video tracking operation is further improved when this model is integrated with network-flow-based tracking. At the same time, similar network flow parameters are used to assess the validity of the model.
The test environment used in this study is further extended to a setting where multiple similar targets are tracked using multiple cameras (Yoo et al. [29]). The tracking is carried out over multiple tracks that are completely unknown and are obtained from time and distance relationships. The multiple tracks are realized by solving a maximum-weight clique problem associated with each frame. The study makes use of online feedback information obtained from the results of the preliminary frames. With the adoption of the tracklet, the presented approach is capable of generating a much more fine-tuned set of candidate tracks and filtering out all the candidate tracks found to be unreliable. Hence, various point tracking systems exist at present for video tracking.
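The hypothesis-generation step described in this subsection, and the computational burden it brings, can be illustrated with a brute-force enumeration for a single frame. This is a didactic sketch rather than any of the cited algorithms: the gate radius `gate`, the penalty `miss_cost` for a missed detection, and the pruning size `keep` are assumed parameters, and the coordinates are synthetic.

```python
import itertools
import math

def enumerate_hypotheses(predictions, detections, gate, miss_cost, keep=3):
    """Enumerate track-to-detection hypotheses for one frame and keep the best few."""
    # Each track may take any detection index, or None for a missed detection.
    options = list(range(len(detections))) + [None]
    hypotheses = []
    for assignment in itertools.product(options, repeat=len(predictions)):
        used = [d for d in assignment if d is not None]
        if len(used) != len(set(used)):
            continue  # one detection cannot support two tracks
        cost = 0.0
        feasible = True
        for track, det in enumerate(assignment):
            if det is None:
                cost += miss_cost
                continue
            distance = math.dist(predictions[track], detections[det])
            if distance > gate:
                feasible = False  # measurement falls outside the spatial gate
                break
            cost += distance
        if feasible:
            hypotheses.append((cost, assignment))
    # Pruning: only the lowest-cost hypotheses are carried to the next iteration.
    hypotheses.sort(key=lambda h: h[0])
    return hypotheses[:keep]

# Predicted positions of two tracks and three detections in the next frame.
predictions = [(0.0, 0.0), (10.0, 0.0)]
detections = [(1.0, 0.0), (9.0, 1.0), (5.0, 0.0)]
hypotheses = enumerate_hypotheses(predictions, detections, gate=6.0, miss_cost=8.0)
for cost, assignment in hypotheses:
    print(round(cost, 3), assignment)
```

The number of candidate assignments grows exponentially with the number of tracks, which is why the surveyed works rely on pruning, optimization, or approximation rather than exhaustive enumeration.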

IV. KERNEL-BASED TRACKING MECHANISM
Kernel-based tracking typically computes the motion of a mobile object, represented by a primitive object region, from one frame of the video sequence to the next. Normally, the object motion takes a parametric form, i.e., affine, conformal, or translational. A dense flow field can also be computed to represent the motion of an object. The approaches under this methodology differ in the technique used for motion estimation and in the number of objects tracked. The existing literature mainly adopts three essential approaches under this method, viz. i) Mean Shift Method, ii) Support Vector Machine, and iii) Template-based Method.

A. Mean Shift-based Method
Although this is one form of video tracking mechanism, its core principle is based on the segmentation of video sequences. Existing literature discusses techniques where the mean shift approach is combined with other techniques to improve tracking performance. The work carried out by Baheti et al. [30] uses an enhanced Lucas-Kanade algorithm to control the computational complexity of object tracking. The objective function stated for this purpose is:

$$\sum_{x} \left[ I\big(W(x; p + \Delta p)\big) - T(x) \right]^2 \qquad (5)$$

In expression (5), the error of aligning the template with the reference is estimated by considering the input image $I$ and the reference image $T(x)$. $W$ is a warping function with warping parameters $p = (p_1, p_2, \dots, p_n)^T$, while $\Delta p$ is the increment of the warping parameters. The preliminary set of warping parameters is obtained from homography estimation using the RANSAC algorithm. The warped image is subtracted from the template to obtain the error image, followed by computing the gradient of the template image along each direction and extracting the Jacobian related to it. The steepest descent images are computed by matrix multiplication, followed by computing the Hessian matrix and combining it with the error. The resulting variation $\Delta p$ is the computed parameter update, which minimizes the objective function and leads to good tracking accuracy. Fig. 7 highlights the visual performance of both methods, showing that adding the Lucas-Kanade algorithm to mean shift offers more accuracy than conventional mean shift. However, it should be noted that this approach does not place much emphasis on complex backgrounds, which is necessary for adaptive tracking. Existing work reports that a kernel correlation filter, when used alongside mean shift, can address this issue of background complexity (Feng et al. [31]).
In this method, the training image with its positional information is taken as input, followed by tracking based on the kernel correlation filter. Further, with the inclusion of each new frame, the confidence map of the filter is obtained, which is followed by assessing whether mean shift needs to be used. For this purpose, the histogram feature for mean shift is obtained, which finally leads to the tracking outcome (Fig. 8). However, the method does not include occlusion mitigation, which is required for cluttered scene analysis. The problems of occlusion and complex background are more likely to be solved if more emphasis is placed on the data distribution with a multidimensional approach. The conventional mean-shift approach can be extended to three dimensions, making it more suitable for tracking dynamic objects (Liu et al. [32]). Such a mechanism performs two steps, viz. i) all the significant mobility points are tracked and the appearance model is fine-tuned as necessary, and ii) the detection process is initiated along with compensation for tracking errors caused by complete occlusion. The study outcomes show robust tracking performance on color videos compared to several existing variants of kernel correlation filters. The technique involves preprocessing the infrared video sequence, followed by target identification. For the first image, the detected target region is captured, followed by a multiscale transform and fusion of the target region. Applying the multiscale transform yields one part of the fused image. A similar sequence of processes is carried out for the second image. The background is captured from the identified target, followed by the same set of processing steps as carried out for the first image, to give a second set of fused images. Both sets of fused images are then organized as a sequence to perform tracking.
A recent work carried out by Peng and Zhang [33] presents a unique implementation of mean shift, where detection and tracking of the target region are performed using the mean shift method, while the root mean square is estimated between two frames to assess the error score. Other associated studies using similar techniques with slight variations of mean shift are seen in the work of Shu et al. [34], Tan et al. [35], Wang et al. [36], and Chen et al. [37].
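The basic mean shift iteration underlying the methods in this subsection can be sketched as repeatedly moving a window to the centroid of the weights it covers. The sketch below is a plain, uniform-kernel version and does not reproduce any cited variant; the weight map is assumed to be precomputed (for example by histogram back-projection) and is replaced here by a synthetic Gaussian blob, and `half_size`, `max_iter`, and `eps` are illustrative parameters.

```python
import numpy as np

def mean_shift(weight_map, center, half_size, max_iter=20, eps=0.5):
    """Shift a square window towards the local mode of a 2-D weight map."""
    height, width = weight_map.shape
    cy, cx = float(center[0]), float(center[1])
    for _ in range(max_iter):
        # Clip the window to the image borders.
        y0 = max(int(round(cy)) - half_size, 0)
        y1 = min(int(round(cy)) + half_size + 1, height)
        x0 = max(int(round(cx)) - half_size, 0)
        x1 = min(int(round(cx)) + half_size + 1, width)
        window = weight_map[y0:y1, x0:x1]
        total = window.sum()
        if total <= 0:
            break  # no evidence inside the window; keep the last position
        # The new center is the weighted centroid of the window.
        ys, xs = np.mgrid[y0:y1, x0:x1]
        new_cy = (ys * window).sum() / total
        new_cx = (xs * window).sum() / total
        shift = np.hypot(new_cy - cy, new_cx - cx)
        cy, cx = new_cy, new_cx
        if shift < eps:
            break  # converged: the window moved less than eps pixels
    return cy, cx

# Synthetic weight map: a Gaussian blob centered at row 40, column 60.
yy, xx = np.mgrid[0:100, 0:100]
weight_map = np.exp(-((yy - 40) ** 2 + (xx - 60) ** 2) / (2 * 6.0 ** 2))

# Start from the target position of the previous frame (assumed at row 30, column 50).
cy, cx = mean_shift(weight_map, center=(30, 50), half_size=15)
print("converged center:", round(cy, 2), round(cx, 2))
```

The iteration only converges when the initial window overlaps the target, which is one reason the surveyed works pair mean shift with a complementary detector or correlation filter for fast motion and occlusion.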

B. Support Vector Machine based Approach
In the area of learning algorithms, the Support Vector Machine (SVM) is a supervised model capable of performing both linear and non-linear classification. This characteristic makes it suitable for improving the accuracy of a video tracking system. The SVM approach has proven effective for object recognition and tracking. SVM offers better performance when combined with the Scale Invariant Feature Transform (SIFT) (Dardas and Georganas [38]). The technique applies vector quantization to map key points to the training image, followed by k-means clustering and bag-of-words. However, better classification performance is seen when a one-class SVM is used with a Markov chain Monte Carlo implementation (Feng et al. [39]). The dynamics in tracking are addressed using the probability hypothesis density. A further enhancement of SVM in tracking excludes the coupled label prediction step by using a kernelized SVM for adaptive tracking. The complexity owing to unbounded growth in the number of support vectors is controlled using a budgeting mechanism. This fact was also verified by Yuan et al. [40] using multiplicative kernels. Such approaches, however, omit contextual modeling, which is otherwise reported to offer better SVM predictability (Sun et al. [41]). The latter approach decomposes the spatial context into foreground and background to obtain a robust appearance model that deals with deformation and occlusion issues in video tracking. Sun et al. [42] have also used SVM for categorizing a scene from its sophisticated surroundings. Such an approach is reported to encode the perception of human vision using the gaze shifting path. Fig. 9 highlights the process, where an aggregated convolutional neural network is applied over the gaze shifting path and its output is further subjected to SVM for effective classification.
Such idea of the combined process, i.e., identifying an object, learning, and tracking, is also carried out by Yin et al. [43]. In this work, SVM is used for dual purpose, i.e., performing linear classification and state-based structure classification where applicability increases over complex video scenes. SVM is also proven to reliable modeling (Sun et al. [41] [42]) where learning is carried out over multiple views and harnesses the geometrical structures of the tracked outcome. Overall, SVM has a fruitful performance when it comes to video tracking from complex video sequences. However, the approaches don't offer many solutions towards computational complexity associated with classification performance.  Fig. 9. Categorization of the Scene for Tracking Sun et al. [42].
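To make the role of the SVM as a target/background classifier concrete, the following is a minimal sketch of a linear SVM trained by sub-gradient descent on the regularized hinge loss. It does not reproduce any of the methods in [38]-[43]: the function names, the hyperparameters (`lam`, `epochs`, `lr`), and the use of plain 2D feature vectors in place of SIFT or CNN descriptors are assumptions for illustration only.

```python
def train_linear_svm(samples, labels, lam=0.01, epochs=200, lr=0.1):
    """Fit weights w and bias b of a linear SVM with hinge loss.

    `samples` are feature vectors, `labels` are +1 (target) or -1 (background).
    All hyperparameter values here are illustrative assumptions.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Shrink the weights (gradient of the L2 regularization term).
            w = [wi - lr * lam * wi for wi in w]
            if margin < 1:
                # Sample lies inside the margin: push the boundary away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 for the target class and -1 for the background class."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

A tracker built on this idea would retrain or update `w` and `b` online as new target and background patches arrive, which is where the budgeting and complexity concerns raised above become relevant.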

C. Template-based Approach
A template is usually formed using simple geometrical structures. It can carry both appearance information and spatial data from the given scene. However, one pitfall of this approach is that the appearance of an object can only be encoded from a single view. This narrows its applicability to videos with little variation in object pose during tracking. Hence, there have been various recent attempts to circumvent this issue. Guo et al. [44] have used an adversarial network guided by a generative task to perform dynamic learning. Templates are selected through online adaptation from image sources with ground truth along with an arbitrary vector. However, this method does not perform dynamic template matching. Huang et al. [45] address this problem by segmenting an object using an aggregation network with temporal attributes and a Hungarian matching scheme over a template bank (Fig. 10).
Existing studies have also hybridized the template matching process with methods from other categories of video tracking. Lin and Chen [46] and Mutsam and Pernkopf [47] have discussed the use of the particle filter with template matching. Pei et al. [48] have applied graph matching over the template to establish the connection between objects and trajectories. Rehman et al. [49] have used template matching and deep learning over multiple regions of interest to improve the scope and accuracy of tracking. Su et al. [50] have integrated a color histogram with a template to maximize tracking performance, and have emphasized the significance of the update process over selected regions of interest. Xiu et al. [51] have extracted differential information, where the initial target region is located using rough matching and the search region is then magnified. The study reports better tracking performance than the existing template matching algorithm. In addition, various artifacts associated with interference are resolved by this approach, and the reduction in outliers improves video tracking performance.
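The basic operation underlying the template-based methods above can be sketched as an exhaustive search that slides the template over the frame and keeps the position with the lowest sum of squared differences (SSD). This is a generic baseline shown under stated assumptions, not the method of any cited study: the function name `match_template`, the SSD score, and the 2D-list grayscale image format are choices made for this example.

```python
def match_template(image, template):
    """Return ((row, col), score) of the best SSD match of `template` in `image`.

    Both arguments are 2D lists of grayscale intensities; the template must
    not be larger than the image. A lower score means a closer match.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best_score, best_pos = float("inf"), (0, 0)
    for y in range(img_h - tpl_h + 1):
        for x in range(img_w - tpl_w + 1):
            # Sum of squared differences over the window anchored at (y, x).
            score = sum(
                (image[y + dy][x + dx] - template[dy][dx]) ** 2
                for dy in range(tpl_h)
                for dx in range(tpl_w)
            )
            if score < best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```

Because the template encodes a single view, the score degrades as the object changes pose, which is why the studies above add template updates, template banks, or hybrid filtering on top of this basic search.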

V. SILHOUETTE TRACKING MECHANISM
Silhouette tracking is a simpler mechanism for tracking objects of a non-rigid nature. In this case, the region of an object is estimated in every frame to facilitate tracking. The information encoded within the object region is used for tracking; this information could be an edge map or a shape model. Typically, contours and shape factors are used in silhouette tracking.
Object tracking can be complicated when the video is multidimensional, and multidimensional video is proliferating with the advancement of digital technologies. Kim et al. [52] have addressed this problem by performing contour tracking using the graph-cut method. The study considers the distance of the angular radial factor and its variation as the essential constraint in the presence of shape deformation. As the contours are refined, ambiguous seeds are eliminated so that graph cut yields a precise segmentation. The performance of a contour-based tracking system can be enhanced by combining it with a neural network (Kishore et al. [53]). This strategy uses the Horn-Schunck optical flow method to obtain features for tracking, while the shape features are extracted from active contours. Different events are classified using the backpropagation approach in the form of words and then converted into signals for matching. Luo et al. [54] have implemented a silhouette-based tracking system using a segmentation approach with a block-based technique, in which the motion information produced during video encoding is utilized for tracking. The Kalman filter is also reported to enhance such video tracking systems (Pokheriya and Pradhan [55]); that study makes use of an adaptive background subtraction method.
Another mechanism for silhouette tracking uses the CamShift algorithm (Zou et al. [56]). It is mainly used to compute the color probability distribution, thereby facilitating video tracking. In addition, inference with respect to the background becomes simpler. This algorithm is considered useful for dealing with occlusion and target deformation. Fig. 11 shows the outcome, where the original image (Fig. 11(a)) is processed to obtain the color probability distribution (Fig. 11(b)), followed by extraction of the motion probability distribution (Fig. 11(c)) and the cumulative probability distribution (Fig. 11(d)).

The core findings of this study are discussed with respect to existing research trends, followed by a brief account of open-end research problems.
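The adaptive background subtraction mentioned above can be illustrated with a running-average background model. This is a hedged sketch rather than the procedure of [55]: the function names, the learning rate `alpha`, and the intensity `threshold` are assumed values chosen for illustration.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background model (running average).

    A larger `alpha` (an assumed learning rate) adapts faster to scene changes.
    """
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(row_b, row_f)]
        for row_b, row_f in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=25):
    """Mark pixels whose intensity differs from the background by more than `threshold`."""
    return [
        [1 if abs(f - b) > threshold else 0 for b, f in zip(row_b, row_f)]
        for row_b, row_f in zip(background, frame)
    ]
```

The connected foreground pixels in the mask give the silhouette region in each frame; a filter such as the Kalman filter can then smooth the region's position across frames.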

A. Existing Research Trend
To visualize existing research trends, this study collected papers published in the IEEE Xplore digital library between 2010 and 2020. The findings are shown graphically in Fig. 12. From Fig. 12, it can be seen that there is little survey work in this area and that more emphasis is given to the conventional point-based and kernel-based tracking systems. Contributions towards silhouette-based tracking are very few. Apart from this, the number of journal publications on kernel-based tracking is significantly lower than that on point-based tracking. This shows that equal emphasis has not been given to all the taxonomies of the video tracking system.

B. Research Gap
Different variants of research work are being carried out on video tracking systems with a primary focus on accuracy. Every implementation offers a productive guideline for adopting an effective methodology to address the problems, while also being associated with specific limitations and issues. The following open-end research issues demand attention:
 Less Simplified Feature Extraction Process: Apart from extracting unique features, it is essential to ensure adherence to cost-effective modeling. The majority of the existing approaches are highly inclined towards extraction of local-level features, which limits their applicability when the visual and scene context changes. There is an emergent need to include global-level features, so that both low-level and high-level attributes are available for effective modeling of the video tracking system.
 Less Focus on Processing Time: An effective video tracking algorithm and system demands an almost instantaneous response time. Without this, its practicability cannot be defined precisely. Existing systems use iterative and complete modeling for video tracking because of their sole focus on achieving accuracy. With the inclusion of challenges such as different variants of occlusion, multi-view tracking, and sophisticated algorithm operation, the system must still offer an instantaneous response for any dynamic video sequence.
 Need to Emphasize on Dimension Reduction: While performing video tracking, the system extracts various informative contents, which must be stored and processed to improve the precision of tracking. This is mainly the case with learning-based algorithms, which demand a higher dimension of trained data. A higher dimension of trained data increases both the memory complexity and the processing time needed to yield an appropriate response. Hence, there is a need to evolve an approach that offers better dimension reduction of the features for complex imagery in the video sequence, e.g., aerial images. There is also less focus on optimization-based approaches, which have good potential to deal with this open-end problem.
 Need to Include Contextual Scene Information: Existing approaches are built primarily on object detection followed by tracking. In the process of detection, the emphasis is placed on the foreground object and less on the contextual information of the object and the given scene. Without a context-based approach, video tracking will have a limited scope of operation when exposed to uneven and dynamic mobility of an object whose heuristics are not present in the ruleset, the ground truth, or the trained images. Hence, contextual information demands enhanced scope.

VII. CONCLUSION
Despite the body of archived work on video tracking systems, there is no evidence of any standardized model that serves as a benchmark. Therefore, this article presents an insight into the three identified classes of video tracking that are most frequently used: point-based tracking, kernel-based tracking, and silhouette-based tracking. It also briefly describes the standard methods implemented within these three classes. However, a closer look at the existing approaches shows that they can be further classified into three more classes, i.e., contour-based tracking, tracking using a native geometric model, and representation of a target object. The accuracy of all these models depends strongly on how accurate the process of object detection and recognition is over a challenging scene of a video sequence. The study also concludes that each of the three standard categories discussed in this paper has both advantages and limiting factors, which should be improved upon to arrive at a novel and effective video tracking scheme. Therefore, our future work will address the open-end research issues discussed in the prior section. To do this, the future direction of work will emphasize modeling global features for the extraction process, along with an emphasis on precision. Future work will also include a policy to balance the demands of higher accuracy and optimal processing time, which is lacking in existing approaches. Finally, an optimization-based approach could be implemented to address the issues connected with computational complexity.