A Biologically Inspired Appearance Modeling and Sample Feature-based Approach for Visual Target Tracking in Aerial Images

I. INTRODUCTION
Visual tracking is an active research topic because of its wide range of applications, including smart surveillance systems, intelligent remote sensing technologies, action recognition, robotics, and human-computer interaction [1]. In particular, visual target tracking is broadly studied on uncrewed aerial vehicles (UAVs) to detect and track targets in aerial images. As opposed to applications with fixed cameras, such as traffic monitoring, aerial videos offer higher portability and superior reconnaissance and surveillance [2]. However, visual tracking systems still suffer from various challenges and difficulties. Target appearance variations and tracking under uncontrollable and unpredictable conditions are the main challenges of these systems [3,4]. In this regard, online learning-based tracking methods that can incrementally update feature representations have received more attention as a way to achieve reliable and robust visual tracking [5]. Therefore, an online target feature representation is crucial for maintaining an efficient appearance model that describes the target and distinguishes it from the background [3,6].
Recently, biologically inspired and cognition-based approaches have become a topic of interest to many researchers [7]. These approaches are inspired by human biological mechanisms, which indicate that human perception is sensitive to attentional regions (ARs) [5,8]. By adopting biologically inspired approaches, visual saliency detection methods have been presented to detect attentional regions based on spatial and temporal information, resulting in a significant number of saliency-based object detection methods [2]. Hence, a natural question is how these approaches can be exploited to develop a more powerful visual tracking algorithm [9]. Current visual saliency detection approaches can be categorized as task-driven attention (top-down) and data-driven attention (bottom-up) [10][11][12][13]. The top-down approach is a result of long-term visual simulation with prior knowledge [10]. Top-down approaches focus on high-level information, such as the sky, faces, and humans [14]. Their drawback is that they are hard to generalize, because such information is not simply obtained from images [15]. Furthermore, they are slow and computationally expensive [10]. On the other hand, bottom-up approaches are based on low-level visual features, simulating the formation of short-term visual attention [16]. In contrast to top-down methods, bottom-up approaches are rapid [14,17]. Given the promising results reported for bottom-up approaches, this study focuses on the bottom-up approach.

II. RELATED WORKS
In this section, existing related works are discussed in two categories: visual saliency-based and appearance modeling-based methods.

A. Visual Saliency-Based Approach
As discussed previously, saliency-based methods are categorized as bottom-up and top-down methods. The bottom-up methods are reviewed below in terms of temporal, spatial, and spatiotemporal saliency.
1) Temporal saliency: Detection of salient regions is highly dependent on the recognition of moving objects, since motion attracts more attention [18]. Methods for detecting moving objects in video are mainly based on temporal information, such as background subtraction [19,20], frame difference [21,22], and optical flow [23].
Background subtraction is based on background modeling and is widely used for moving object detection. It segments foreground objects from the image background to detect objects that are moving [24]. However, background subtraction suffers from some limitations. These methods depend on a fixed background, and object extraction fails when the background changes [10].
Optical flow is also used for moving object detection, even under moving camera conditions. However, optical flow is computationally expensive and sensitive to noise. Thus, optical flow algorithms are not robust in real-time visual tracking systems [2,25].
Frame differencing is a practical approach for moving object detection, based on extracting pixel-wise differences between image frames. It does not require background modeling and, unlike background subtraction methods, does not depend on a fixed background. Frame differencing is adaptive to dynamic backgrounds and has a low computational cost [2]. However, one of its major drawbacks is that it relies on the objects moving while the frames are captured. Because a target may have unpredictable motions, such as stop-and-go periods, the frame difference method is not robust under uncontrollable and unpredictable target movement conditions [26]. Therefore, a temporal saliency detection method that overcomes these challenges is required [25].
2) Spatial saliency: Spatial saliency detection is based on low-level feature representation and focuses on salient region extraction from images. These features are investigated to describe and identify the region of interest as salient regions [2,40]. Several spatial-based methods have been developed, such as region low-rank matrix recovery [27], region covariance [28,29], color-based [41], contrast-based [30], frequency domain [31], and graph [32] methods. As mentioned earlier, visual object tracking in aerial images faces difficulties from target appearance variations and background changes. To deal with these difficulties, spatial saliency detection can extract the salient regions that correspond to moving targets. This method is not sensitive to background changes and abrupt motion.
3) Spatiotemporal saliency: Spatiotemporal saliency detection methods calculate the temporal and spatial saliencies separately [10]. They use motion cues for moving object detection. Generally, results based only on motion cues are not desirable because of the lack of spatial distribution information [2]. Therefore, it is necessary to integrate the temporal and spatial features to detect the salient regions more accurately [9,10,46].

B. Appearance Modeling-based Approach
Appearance modeling approaches are normally used to deal with appearance variations challenges. These approaches are categorized into generative and discriminative-based methods [5].
1) Generative-based methods: Generative-based methods are used to generate a model of an object during appearance changes in scenes. The generated model exploits the discriminative features to handle the target's appearance variations. The mechanism of appearance model generation is frequently updated online to describe the appearance variations. Some generative-based methods are as follows. Lee and Kriegman [33] proposed a generative-based algorithm to update a model for target detection dynamically. Wu and Wang [34] proposed a real-time generative method integrated with an incrementally updating covariance modeling approach for visual tracking. Tianxiang and Li [35] presented an appearance modeling approach based on a generative method and structured sparse representation for tracking an object in a video. However, even though various generative-based methods have been proposed, these methods still have not fully exploited spatial identification within the images efficiently.
2) Discriminative-based methods: Discriminative-based methods have also been utilized to overcome the challenges related to appearance changes during visual tracking. Discriminative methods are also called tracking-by-detection. Their mechanism is to extract a separate set of features that distinguish the target from the background image. A binary separation approach is used to identify the target against the background in successive frames. Various studies have been conducted based on discriminative-based methods, such as the discriminative learning method based on graph embedding proposed by Zhang et al. [37]. Fan et al. [38] presented a discriminative region attention approach that distinguishes the target from the background in terms of spatial features. Their method aimed to overcome spatial distraction under visual appearance changes. Tang et al. [39] proposed a robust visual tracking method, DRLTracker, based on a discriminative ranking list approach. DRLTracker utilizes ranking lists and two-scale features to generate a model of the target and recognize it from the background based on the ranking lists of generated patches. However, DRLTracker is limited to two scales and suffers from high processing time.
This study investigates an enhanced discriminative-based appearance modeling approach to overcome the noise shortcoming and the difficulty of appearance variations. For brevity, the term appearance modeling is used in place of discriminative-based appearance modeling in this paper. The details of the appearance modeling approach are discussed in the Material and Methods section.
Finally, the core of this paper is the proposal of visual object tracking based on a combination of spatiotemporal saliency appearance modeling and sample feature-based approaches. The proposed method is based on a tracking-by-detection approach to provide a robust visual tracking system under appearance variation conditions. Correspondingly, a semi-automatic trigger-based algorithm is proposed to handle the phases' operation. Furthermore, an automatic algorithm is proposed to detect the region saliencies in a parallel implementation that leads to low processing time. Consequently, the temporal and spatial saliencies are integrated to generate the final saliency and sample features. Our contributions can be summarized as follows:
• A visual tracking method based on spatiotemporal saliency-based appearance modeling (SSAM) and sample feature-based target detection (SFTD) that keeps visual target tracking robust under appearance variations and unpredictable motion while maintaining low processing time. The proposed method is efficient on both moving-camera and moving-target platforms.
• A novel algorithm that switches automatically between phases to handle their operation based on trigger activation.
• An algorithm for multiple target detection based on a dynamic multithreading implementation of the SLIC segmentation algorithm.
The remainder of this paper is organized as follows. The Material and Methods section discusses the proposed framework and the details of the material and methods. The Results and Discussion section presents our experimental and performance analysis. Finally, the Conclusion section concludes this study.

III. MATERIAL AND METHODS
This section presents an overview and the details of the proposed approach. The underlying goal of the proposed approach is to take advantage of saliency values and appearance modeling in an efficient manner for target detection.
In this study, the proposed method consists of two main phases, spatiotemporal saliency appearance modeling (SSAM) and sample feature-based target detection (SFTD). To handle the phases' operation, a semi-automatic trigger-based algorithm is proposed to switch between the two phases; a phase operation is started when that phase receives a trigger activation. For example, when the saliency-slot time is reached, the SSAM phase activates a trigger to switch to the SFTD phase. The proposed method defines the saliency slot to activate the trigger. The SFTD phase activates a trigger when it cannot detect any objects. The overall architecture of the proposed framework is shown in Fig. 1. The details of the proposed approach are presented in the following sections.
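The trigger-based switching described above can be illustrated with a small state machine. This is only a sketch under stated assumptions: the class `PhaseSwitcher`, the frame-count interpretation of the saliency slot, and the `detected_any` flag are hypothetical names introduced for illustration, and the slot length used is arbitrary rather than a value from this study.

```python
from enum import Enum


class Phase(Enum):
    SSAM = "SSAM"  # spatiotemporal saliency appearance modeling
    SFTD = "SFTD"  # sample feature-based target detection


class PhaseSwitcher:
    """Hypothetical helper that switches phases when a trigger fires."""

    def __init__(self, saliency_slot=3):
        # Assumption: the saliency slot is counted in frames.
        self.saliency_slot = saliency_slot
        self.phase = Phase.SSAM
        self.frames_in_ssam = 0

    def step(self, detected_any):
        """Process one frame and return the phase that handled it.

        detected_any: whether SFTD found at least one object in this
        frame (ignored while the SSAM phase is active).
        """
        handled_by = self.phase
        if self.phase is Phase.SSAM:
            self.frames_in_ssam += 1
            if self.frames_in_ssam >= self.saliency_slot:
                # Saliency slot reached: trigger the switch to SFTD.
                self.phase = Phase.SFTD
        elif not detected_any:
            # SFTD detected nothing: trigger a return to SSAM.
            self.phase = Phase.SSAM
            self.frames_in_ssam = 0
        return handled_by


switcher = PhaseSwitcher(saliency_slot=2)
detections = [False, False, True, True, False, False]
trace = [switcher.step(d).value for d in detections]
print(trace)
```

In this toy run, the first two frames are handled by SSAM, the next three by SFTD, and the last frame again by SSAM, because SFTD reported no detection on the fifth frame.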

A. Spatiotemporal Saliency Appearance Modeling (SSAM) Phase
This phase involves three stages, temporal saliency and localization detection, spatial and final saliency detection, and, finally, sample feature generation and target detection stages.

1) Temporal saliency and localization detection (TSLD) stage: This stage consists of temporal saliency detection and localization modules. To extract the moving target, salient regions are extracted using motion cue detection. For motion cue detection, temporal saliency is investigated as detailed below.
2) Temporal saliency detection module: The purpose of temporal saliency detection is attentional region (AR) extraction and coarse segmentation. The extracted attentional regions are called Candidate Motion Regions (CMRs). To extract the CMRs, we propose the following steps.
Frame differencing. Frame differencing is used for temporal saliency detection; it identifies moving objects in consecutive frames. This technique employs the image subtraction operator, which takes two images (or frames) as input and produces their pixel-wise difference as output [42].
Image Enhancement. Morphological operations are generally applied for image enhancement [43]. The proposed method uses dilation, erosion, opening, and closing; the morphological operators are inspired by [44,45], with adapted structuring element parameters.
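The two steps of this module can be sketched on a toy example as follows. The sketch uses plain Python lists as grayscale frames; the 3x3 square structuring element and the fixed binarization threshold of 30 are assumptions made for illustration, not the adapted parameters used in this study.

```python
def frame_difference(prev, curr):
    """Absolute pixel-wise difference of two equally sized gray frames."""
    return [[abs(c - p) for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def binarize(img, thresh):
    """Mark pixels whose difference exceeds thresh as motion (1)."""
    return [[1 if v > thresh else 0 for v in row] for row in img]


def _morph(mask, combine):
    """Apply a 3x3 square structuring element (assumed shape).

    Pixels outside the frame are ignored at the borders.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = combine(window)
    return out


def dilate(mask):
    return _morph(mask, max)


def erode(mask):
    return _morph(mask, min)


def opening(mask):
    """Erosion followed by dilation: removes small isolated responses."""
    return dilate(erode(mask))


def closing(mask):
    """Dilation followed by erosion: fills small holes."""
    return erode(dilate(mask))


# Toy frames: a 3x3 object appears, plus one isolated noise pixel.
prev = [[10] * 6 for _ in range(6)]
curr = [row[:] for row in prev]
for y in range(1, 4):
    for x in range(1, 4):
        curr[y][x] = 200
curr[5][5] = 90

motion = binarize(frame_difference(prev, curr), thresh=30)
cleaned = opening(motion)
print(sum(map(sum, motion)), sum(map(sum, cleaned)))
```

Here the opening keeps the 3x3 moving region and removes the isolated noise pixel, which is the kind of enhancement the morphological step is meant to provide.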

3) Localization module:
Once the temporal saliency detects the moving target region and enhances the CMRs, a localization module is applied to localize the extracted CMRs based on connected components and blob identification methods [47]. This module involves the following steps.
Thresholding. Thresholding helps reduce the number of false positives while avoiding missed valid objects. It is based on the intensity variation between the object pixels and the background pixels, as inspired by [48]. Thresholding is implemented by setting a value that separates these pixels. Here, THRESH_OTSU is used to determine the optimal threshold value using Otsu's algorithm [49].
Edge segmentation. Canny edge segmentation is then run on the binarized image for further improvement of the extracted region.
Blob Identification. After all the region edges are extracted, we need to detect the blobs (connected components). To identify the blobs, active contour features, such as the ones proposed in [50], are utilized to detect regions of interest and localize them. In this paper, we also use active contour features to detect the contours of regions. The contour features detected from the CMRs are used for connected component detection, blob area computation, and bounding box determination. Moreover, unwanted blobs with a pixel area smaller than A_low or a bounding box with dimensions larger than B_max are removed.
Candidate Mask Generation. In this step, geometrical features are extracted from the regions to recognize their location in each frame. X_pos and Y_pos are extracted as the centroid of each object based on moment features, as described later in the spatial saliency detection module section. Furthermore, we experimentally found the appropriate width and height values to generate the candidate mask (CM), extracting the regions based on the region of interest (ROI) function. Fig. 2 shows the candidate mask generated by the proposed temporal saliency and localization stage.
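The localization steps above can be sketched in a simplified form. This sketch makes several substitutions that should be read as assumptions: Otsu's threshold is computed directly from the histogram, the Canny and contour steps are replaced by a plain 4-connected flood fill, and the values `a_low=4` and `b_max=6` are arbitrary illustrations of A_low and B_max rather than the values used in this study.

```python
from collections import deque


def otsu_threshold(gray):
    """Return the level t that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        between = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def find_blobs(mask):
    """Group foreground pixels into 4-connected components."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(pixels)
    return blobs


def describe(pixels):
    """Area, moment-based centroid, and bounding box (x, y, w, h)."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    area = len(pixels)  # zeroth-order moment of a binary blob
    return {"area": area,
            "x_pos": sum(xs) / area, "y_pos": sum(ys) / area,
            "bbox": (min(xs), min(ys),
                     max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)}


def keep_blob(info, a_low, b_max):
    _, _, bw, bh = info["bbox"]
    return info["area"] >= a_low and bw <= b_max and bh <= b_max


# Toy image: dark background, one 3x3 bright object, one noise pixel.
gray = [[10] * 8 for _ in range(8)]
for y in range(2, 5):
    for x in range(3, 6):
        gray[y][x] = 200
gray[7][0] = 200

t = otsu_threshold(gray)
mask = [[1 if v > t else 0 for v in row] for row in gray]
kept = [info for info in map(describe, find_blobs(mask))
        if keep_blob(info, a_low=4, b_max=6)]
print(t, kept)
```

The single-pixel blob is discarded by the area test, and the remaining blob yields the centroid and bounding box from which a candidate mask could be cut.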

4) Spatial saliency detection (SSD) stage:
The result of the TSLD stage is one or multiple CM regions, which are ARs. However, as shown in Fig. 2, some extracted regions are unrelated to a target object or include useless areas. Spatial saliency detection (SSD) is used to overcome these false detections. Furthermore, because candidate masks are compact and informative, we also use SSD to extract the saliency over them to provide further information and generate sample features. Our proposed SSD algorithm integrates the methods proposed in [2,29], with several modifications in feature extraction, feature representation, and spatial distribution measurement to improve the efficiency of spatial saliency extraction.
In brief, the input image is first decomposed into perceptually homogeneous segments as patches based on a SLIC superpixel algorithm presented in [51]. Second, we extract visual features, including color and moment, to measure the uniqueness and compactness of the spatial distribution. Finally, the temporal and spatial information are integrated to generate a final saliency map named spatiotemporal saliency.
However, since SLIC is time-consuming for spatial saliency, we implement spatial saliency detection via parallel processing based on multi-threaded programming. Multi-threading allows all CM regions to be processed in parallel, which can substantially decrease the overall processing time of the SSD stage. In this regard, each thread takes a candidate mask and performs the following processes to determine the spatial saliency and generate sample features. For instance, if the result of the TSLD stage includes four objects, we assign each object to its own thread, for a total of four threads. OpenMP multi-threading is used as the tool to implement our spatial saliency detection algorithm. The steps for the SSD stage are as follows.
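The one-thread-per-candidate-mask arrangement can be sketched as below. Note the substitution: this study uses OpenMP, whereas the sketch uses Python's standard `concurrent.futures.ThreadPoolExecutor` purely to show the structure. The function `process_candidate_mask` is a hypothetical stand-in for the per-mask SSD work and simply returns the mean intensity.

```python
from concurrent.futures import ThreadPoolExecutor


def process_candidate_mask(mask):
    """Hypothetical placeholder for per-mask SSD work: mean intensity."""
    pixels = [v for row in mask for v in row]
    return sum(pixels) / len(pixels)


def process_all(masks):
    """Process every candidate mask in its own worker thread."""
    # One worker per candidate mask, mirroring the four-object example.
    with ThreadPoolExecutor(max_workers=max(1, len(masks))) as pool:
        # map() returns results in the same order as the input masks.
        return list(pool.map(process_candidate_mask, masks))


# Four toy candidate masks of 4x4 pixels with constant intensity.
masks = [[[i] * 4 for _ in range(4)] for i in (10, 20, 30, 40)]
results = process_all(masks)
print(results)
```

In standard CPython builds, threads give little speed-up for pure-Python CPU-bound work, so this shows only the task decomposition; the timing benefit reported in this study comes from its native multi-threaded implementation.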

5) Patch generation module:
Superpixel segmentation, an effective region-based analysis technique, is increasingly investigated in the computer vision community [36,52]. As proposed in [51], this study uses the SLIC algorithm to segment the CM regions into homogeneous regions. Fig. 3 shows patch generation for a moving object using the SLIC superpixel algorithm.
6) Spatial saliency module: In this study, we use spatial uniqueness and compactness to compute the spatial saliency, inspired by Perazzi et al. [53]. Moreover, we take advantage of other features, such as image moments and different metrics, to improve efficiency. In our method, we use pixel intensity to measure the dissimilarity of a patch compared to other regions. The compactness of the spatial distribution, measured with image moments, also contributes to detecting salient objects. Details of the proposed spatial saliency detection are explained in the following.
Spatial uniqueness measurement. Similar to [54], each region's color similarity with other regions is measured. However, the color feature in [54] was implemented for static images, whereas we investigate saliency detection in a dynamic environment. Furthermore, because Earth Mover's Distance (EMD) was reported in [55] to yield excellent retrieval performance for small sample sizes, we use the EMD distance metric instead of the Euclidean distance.
Spatial compactness measurement. Because the salient patches are spatially compact, the pixels with high saliency values are also expected to be spatially close [56]. Spatial moments are efficient and powerful in describing spatial distribution and compactness. In this study, we use spatial moments to estimate spatial compactness. Our work employs first- and second-order spatial moments.
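To illustrate why EMD can be preferable to a bin-wise distance, the sketch below computes EMD between one-dimensional intensity histograms. It relies on a standard property: for 1-D histograms of equal total mass with unit distance between adjacent bins, EMD equals the sum of absolute differences of the cumulative distributions. Representing each patch by a 1-D intensity histogram and summing the distances to all other patches are assumptions for illustration, not the exact formulation of this study.

```python
def emd_1d(hist_a, hist_b):
    """EMD between two 1-D histograms, each normalized to unit mass."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    mass_a, mass_b = sum(hist_a), sum(hist_b)
    if mass_a == 0 or mass_b == 0:
        raise ValueError("histograms must not be empty")
    carried = 0.0  # mass that still has to move past the current bin
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        carried += a / mass_a - b / mass_b
        total += abs(carried)
    return total


def uniqueness(patch_hists):
    """Assumed uniqueness score: summed EMD to every other patch."""
    return [sum(emd_1d(h, other) for other in patch_hists)
            for h in patch_hists]


near = emd_1d([1, 0, 0], [0, 1, 0])  # mass moves one bin
far = emd_1d([1, 0, 0], [0, 0, 1])   # mass moves two bins
scores = uniqueness([[4, 0, 0, 0], [4, 0, 0, 0], [0, 0, 0, 4]])
print(near, far, scores)
```

The Euclidean distance between the two histogram pairs in `near` and `far` is identical, while EMD grows with how far the intensity mass has to move, so the third patch in `scores` stands out as the most unique.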
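A minimal sketch of compactness estimation with first- and second-order moments is given below. It treats a saliency map as a grid of non-negative weights and uses the sum of the weighted variances in x and y as the spread, so a lower spread means a more compact distribution. Using this particular spread as the compactness measure is an assumption for illustration.

```python
def spatial_spread(weights):
    """Return the weighted centroid and the spatial spread of a map.

    The centroid comes from the first-order moments; the spread is the
    sum of the second-order central moments (variances) in x and y.
    """
    m00 = m10 = m01 = m20 = m02 = 0.0
    for y, row in enumerate(weights):
        for x, w in enumerate(row):
            m00 += w
            m10 += w * x
            m01 += w * y
            m20 += w * x * x
            m02 += w * y * y
    if m00 == 0:
        raise ValueError("the map has no mass")
    cx, cy = m10 / m00, m01 / m00
    var_x = m20 / m00 - cx * cx
    var_y = m02 / m00 - cy * cy
    return (cx, cy), var_x + var_y


def empty_map(size):
    return [[0.0] * size for _ in range(size)]


# Compact map: four adjacent salient pixels.
compact = empty_map(5)
for y, x in ((2, 2), (2, 3), (3, 2), (3, 3)):
    compact[y][x] = 1.0

# Scattered map: four salient pixels at the corners.
scattered = empty_map(5)
for y, x in ((0, 0), (0, 4), (4, 0), (4, 4)):
    scattered[y][x] = 1.0

centroid_c, spread_c = spatial_spread(compact)
centroid_s, spread_s = spatial_spread(scattered)
print(centroid_c, spread_c, centroid_s, spread_s)
```

Both maps carry the same total saliency, but the compact one has a much smaller spread, which is the property a compactness term is intended to reward.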

7) Final saliency map generation:
Generally, it is necessary to combine the temporal and spatial saliencies in a meaningful way to produce the final spatiotemporal saliency maps [10]. Therefore, the temporal and spatial information are integrated to generate a final saliency map named spatiotemporal saliency.
As shown in Fig. 1, the sample feature generation and target detection stage involves feature extraction, sample feature generation, and target detection. In the feature extraction step, appropriate features, such as color contrast and region compactness, are collected from the result of the SSD stage. As mentioned previously, these features are dynamically updated per frame and normalized to generate the sample features. Based on the sample features, we can detect the target.

B. Sample Feature-based Target Detection (SFTD) Phase
A trigger is activated upon the sample features being generated, and the sample features are transferred from the SSAM phase to this phase. The advantage of this phase is that it covers both moving and non-moving object detection conditions to detect objects with uncontrollable and unpredictable target movement conditions and overcome the difficulty of frame difference. The steps for this phase are mostly similar to the previous operation's steps, i.e., frame differencing, Image Enhancement, Feature Matching, Object Segmentation, and, finally, Target Detection.

IV. RESULTS AND DISCUSSION
This section presents the implementation details and experimental results. Additionally, we compare the results with existing methods based on qualitative and quantitative performance evaluations to test and evaluate the proposed method. The qualitative analysis presents the image results from the proposed and other methods. In contrast, quantitative analysis involves precision and recall calculation and processing time. To validate the efficacy of the proposed method, the experiment was conducted on the VIVID public dataset [57]. The VIVID dataset was collected at Eglin during DARPA VIVID and involves aerial images in video sequences. Several videos have been collected in VIVID, of which we use the EgTest01 and EgTest02 videos. The EgTest01 video involves moving cars that pass each other, with an image size of 640 × 480 pixels and 1800 frames, whereas the EgTest02 video involves 1300 frames with two sets of three civilian vehicles passing each other on a runway.

A. Qualitative Analysis
Qualitative analysis is implemented to demonstrate the result of each phase and compare the proposed method with others. Fig. 4 shows the results of the qualitative analysis. The saliency-based methods considered for comparison are Itti [58], MD [21], GBVS [59], and SD [2]. Fig. 5 shows the comparison of the proposed method with other existing methods. The first row is the original raw images (Raw), the second, third and fourth are the results for the TSLD, SSD, and MOED phases, respectively, and the final row represents the feature-based object detection phase.
In the following sections, quantitative analysis for precision, recall, and f-measure calculation is discussed.

B. Precision and Recall Measurement
Similar to Refs. [2,31,60], precision and recall measures are used to evaluate the performance of the proposed method. In our evaluation, the target is the object of interest, whether moving or not. To measure the precision, recall, and f-measure, we use the terms defined in Table I.
The precision and recall rates for different numbers of frames are illustrated in Figs. 6 and 7. As shown there, the precision and recall rates increase as the number of frames increases.
Furthermore, to validate the proposed method, we compare our model with state-of-the-art visual object tracking methods, such as the FMD [2], DMM [60], HSC [10], RD [2], and SD [2]. The comparison was conducted based on precision, recall (PR), and F1-score. Table II and Fig. 8 show the comparison results. Based on the obtained experimental results, we show that the proposed approach can be effectively employed for the extraction of moving objects.
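For reference, the measures used in this subsection follow the standard definitions based on true positives (TP), false positives (FP), and false negatives (FN). The sketch below computes them; the counts in the example are hypothetical and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Returning 0.0 when a denominator is zero is a convention chosen here to keep the function total; other conventions are possible.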

C. Processing Time
Our experiment was implemented in Visual Studio and performed on a Windows 8 platform with an Intel 2.6 GHz CPU and 4 GB of memory. Processing time is measured as wall-clock time, because wall-clock time is the relevant measure when evaluating the performance of parallel programs. It is obtained using the tick_count class, which is declared in tbb/tick_count.h. A tick_count is an absolute timestamp. The average processing time for the proposed method is approximately 78 and 24 milliseconds for SSAM and SFTD, respectively, which is suitable for near-real-time visual tracking applications.

V. CONCLUSION
This paper addresses significant problems facing visual tracking in aerial images, such as appearance variations and unpredictably moving targets. The proposed method uses spatial and temporal saliencies to address these challenges by adopting biologically inspired approaches to detect the attentional regions (ARs). Furthermore, a biologically inspired approach integrated with an appearance modeling-based approach is investigated to overcome visual tracking challenges. In this regard, the proposed method consists of two main phases, spatiotemporal saliency-based appearance modeling (SSAM) and sample feature-based target detection (SFTD). The proposed method uses a tracking-by-detection approach to provide a robust visual tracking system under appearance variation conditions. Correspondingly, a semi-automatic trigger-based algorithm is proposed to handle the phases' operation, and a discriminative-based method is utilized for appearance modeling. In the spatiotemporal saliency phase, temporal saliency is used for attentional region (AR) extraction and coarse segmentation. Spatial saliency is utilized to obtain the object's appearance details in the ARs. By combining temporal and spatial saliencies, we can obtain refined detection results and track the target. During the spatial saliency detection, prominent features are collected, and a sample feature is generated to describe the target. Consequently, a target detection process is performed to recognize the target in images. Experiments were conducted on the VIVID dataset. Moreover, the proposed method was compared with other state-of-the-art methods. The analyses demonstrate that the proposed method is superior to most state-of-the-art methods and is an effective visual tracking method that is robust to appearance variations.
Future work can address other difficulties and challenges in visual tracking, such as complicated backgrounds and partial or full occlusion.