Artificial Bee Colony Algorithm Optimization for Video Summarization on VSUMM Dataset

This paper shows that the Artificial Bee Colony (ABC) algorithm can be used as the optimization algorithm in a sparse-land setup to solve video summarization (VS). Quasi real-time video summarization remains a critical challenge for ANN-based methods, because these methods require training time. Performing video summarization in quasi real-time also makes it possible to address related challenges such as anomaly detection and online video highlighting. A simple threshold function is applied to the reconstruction error of the current frame given the previous 50 frames from the dictionary; the frames with higher reconstruction errors form the video summary. In this work, we use image histogram, HOG, HOOF, and Canny edge features as inputs to the ABC algorithm. We used Matlab 2014a for feature extraction and for the ABC algorithm for VS. The results are compared with existing methods. The evaluation scores are calculated on the VSUMM dataset for all 50 videos against two user summaries. This research answers how the ABC algorithm can be used in a sparse-land setup to solve video summarization. Further studies are required to understand how the performance evaluation scores change as the threshold function changes.

Keywords: Artificial Bee Colony optimization; video summarization; online video highlighting; sparse-land; anomaly detection; image histogram; HOG; HOOF; Canny edge


I. INTRODUCTION
Since campuses, roads, and public places are monitored constantly by video surveillance, the adoption of VS will be imperative. Skimming through a huge corpus of video data to derive a meaningful summary requires efficient VS techniques. The need of the hour is for techniques that can be deployed easily and that require less training than ANN methods. Some existing frameworks work well only in particular environments, such as object tracking. In this framework, we take a common approach to VS, which is evaluated across multiple genres of videos in the results section (Tables I and II). The motivation behind this work is three-fold. Firstly, to demonstrate the use of the ABC optimization algorithm in a sparse-land setup. Secondly, to apply this approach in a quasi real-time framework similar to [1]. Thirdly, to adapt to online video content from any domain, so that the approach can be used to solve other real-time challenges such as anomaly detection [2].
The challenge for any video summarization method is to adapt to any domain. Some frameworks work well only on a certain domain because their methods are restricted to, or concentrated on, a particular purpose such as detecting humans and vehicles [3]. Methods such as the sparse-land approach give the freedom to adapt to videos from any domain, which is also supported in this work by the evaluation scores across videos of multiple genres in the VSUMM dataset (Tables I and II). There has been keen interest in the sparse-land approach in the literature [4,1,5,6,7,8], which motivates adopting this approach in this paper.
The rest of the paper is organized as follows. Section II reviews related work in VS. Section III describes the ABC optimization method for the VS framework. Section IV presents the proposed methodology. Section V discusses the experimental results. Finally, Section VI concludes the paper.

II. RELATED WORKS
Optimization algorithms play a vital role in selecting the right frames for video summarization and in updating the dictionary D. Studies of optimization algorithms for VS have measured performance in terms of storage reduction and computation time, as discussed in [9]. In this paper, we evaluate the ABC algorithm on VSUMM [10], a well-known benchmark dataset. In this section, we review the selection of the optimization algorithm and the different strategies for VS. The literature offers many methods for VS: clustering-based methods [11], saliency-based methods that calculate a frame importance score for egocentric VS [lee2012discovering], and traditional approaches based on SVD [12]. In recent years, a large number of ANN-based papers have appeared [13,14,15,16,17,18,19]. ANN methods involve supervised or unsupervised training, which may not be suitable for near real-time VS. Graph-based methods [20,21,22,23], which specialize in keyframe retrieval and browsing systems, also require the data to be ported into a graph database before computation. Among all of these methods, the sparse-land approach stands out for its simplicity in posing VS as an optimization problem. Its other advantages are that features can be plugged into any optimization algorithm, as demonstrated in [9]; that it is flexible in the choice of dictionary shapes and elements; and that it supports quasi real-time VS [1], followed by anomaly detection [2]. Recent advances in the sparse-land approach using the Convolutional Sparse Coding (CSC) model are on par with current ANN methods [24].
Evolutionary and swarm methods such as ABC [25], PSO [26], and GA [27], as well as ADMM [28] and RMSprop [29], are commonly used optimization algorithms. In this paper, we use the ABC method for optimization. ABC has recently been used for VS in [30], where the authors worked on another well-known dataset, SumMe [31], using segment-level data from the video. The global effects of the entire video may not be captured well by such approaches. The authors of [30] used the ABC algorithm to identify key video segments and then applied clustering techniques to arrive at keyframes, which are taken from the cluster centers. A region-of-interest approach is used to identify important frames, similar to the camshift algorithm used in our work to reduce unwanted frames. The final reduction of keyframes is done via hue histogram comparison. The ABC algorithm has also shown better convergence than other algorithms such as PSO.
In our previous work [9], we proposed four algorithms and tested their optimization time and storage reduction for video summarization. That test was performed on random videos from YouTube, whereas this paper evaluates the ABC optimization algorithm on the known VSUMM dataset and also reports the performance evaluation scores given in the experimental results section.
The authors of [1] used ADMM optimization in a sparse-land setup to solve the VS challenge. These ideas, together with dictionary initialization and sparse modeling, are key foundations of the VS framework. References for image restoration can be found in [32]. Image reconstruction is done with the current frame and the frames from the dictionary. A high reconstruction error α denotes larger changes between frames; when the reconstruction error α is high, the frame is included in the summary [33,5,34,35].

A. Summary of the Contribution
Our contribution in this work is the use of the ABC optimization algorithm in a sparse-land setup for VS. The evaluation metrics (precision, recall, and F1-score) are obtained for each individual video to show that the ABC algorithm works on par with other methods, as compared in Table III with the earlier reference work [10]. Tables I and II give the precision, recall, and F1-score for all 50 individual videos in the VSUMM dataset. This framework works as a near real-time (quasi real-time) summarization and anomaly detection framework. The framework can also be extended easily to other advanced sparse-land setups such as CSC [24].

III. THE ABC OPTIMIZATION FOR VS FRAMEWORK
The artificial bee colony (ABC) algorithm belongs to the swarm intelligence branch. It is modeled on the intelligent behavior of honey bees in efficiently locating target food sources [25,36]. There are three main phases: the employed bee, onlooker bee, and scout bee phases. The employed bees are responsible for visiting the existing food sources; the onlooker bees wait for the dance ceremony and select the next food source depending on the performance of the employed bees; the scout bees pick food sources at random. The main function of the employed phase is to find a suitable partner solution X_p and to update the new position variable X_new. The update equation for the new position is given in equation 1, where X is the current solution, X_p is the partner solution, and φ is a random value in the range [-1, 1].
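The employed-bee update described above is commonly written as X_new = X + φ (X − X_p). The following minimal sketch illustrates that form under stated assumptions: perturbing a single randomly chosen dimension and the greedy acceptance step follow the common ABC formulation, and the function names employed_update and greedy_select are hypothetical, not taken from the Matlab implementation.

```python
import random


def employed_update(x, partner, rng=random):
    """Return a candidate solution obtained by perturbing one dimension of x.

    x and partner are equal-length lists of floats; partner plays the role
    of X_p. Perturbing a single dimension is an assumption borrowed from
    the common ABC formulation.
    """
    j = rng.randrange(len(x))        # dimension to perturb
    phi = rng.uniform(-1.0, 1.0)     # phi drawn from [-1, 1]
    x_new = list(x)
    x_new[j] = x[j] + phi * (x[j] - partner[j])
    return x_new


def greedy_select(x, x_new, cost):
    """Keep the candidate only if it lowers the cost (greedy selection)."""
    return x_new if cost(x_new) < cost(x) else x
```

Here cost would be the reconstruction objective being minimized; any callable that maps a solution to a number can be passed in.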
The onlooker bees are responsible for selecting the food sources with the highest nectar value F(θ_i), where θ_i is the i-th food source. The set of food sources at cycle c is P(c) = {θ_i(c) | i = 1, 2, ..., S}, where c is the cycle and S is the number of food sources. The probability function p(X_i) for choosing a food source is given in equation 2.
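In the standard ABC formulation, the selection probability is fitness-proportional, p(X_i) = F(θ_i) / Σ_k F(θ_k). The sketch below assumes that form and non-negative nectar values; the roulette-wheel draw and the names selection_probabilities and pick_food_source are illustrative assumptions.

```python
import random


def selection_probabilities(fitness):
    """Fitness-proportional probabilities: p_i = F_i / sum_k F_k."""
    total = sum(fitness)
    if total <= 0:
        # Degenerate case: fall back to a uniform choice.
        return [1.0 / len(fitness)] * len(fitness)
    return [f / total for f in fitness]


def pick_food_source(fitness, rng=random):
    """Roulette-wheel pick of a food-source index for one onlooker bee."""
    probs = selection_probabilities(fitness)
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off
```

Food sources with higher nectar values are therefore visited more often by onlookers, while poorer sources still keep a non-zero chance of being explored.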
The scout bees discover food sources at random within the predefined search space limits [X_min, X_max]. The random position of a new food source is determined by equation 3.
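The scout step is usually written as X = X_min + rand(0, 1) (X_max − X_min), applied per dimension. A minimal sketch of that form follows; the function name scout_reinitialize is hypothetical, and drawing an independent random number for each dimension is an assumption.

```python
import random


def scout_reinitialize(x_min, x_max, rng=random):
    """Draw a fresh food source uniformly inside [x_min, x_max] per dimension."""
    return [lo + rng.random() * (hi - lo) for lo, hi in zip(x_min, x_max)]
```

In a typical ABC loop this is called for a food source that has not improved for a fixed number of trials, which replaces the abandoned source with a new random one.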

IV. PROPOSED METHODOLOGY
The architectural flow for VS is similar to that of our previous work [9]. The features used are HOG (Histogram of Oriented Gradients) with nine bins of 20 degrees each, HOF (Histogram of Optical Flow), HOOF (Histogram of Oriented Optical Flow), and Canny edge detection. The sample feature output for a frame is shown in Fig. 1.
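To make the nine-bin, 20-degree layout concrete, the sketch below builds a single magnitude-weighted histogram of unsigned gradient orientations over [0, 180) degrees for a grayscale frame. It is a simplification under stated assumptions: a full HOG descriptor also uses cells and block normalization, the central-difference gradient is an illustrative choice, and orientation_histogram is a hypothetical name rather than the Matlab routine used in this work.

```python
import math


def orientation_histogram(gray, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    gray is a list of rows of intensities. With n_bins=9 over [0, 180)
    degrees, each bin spans 20 degrees. Border pixels are skipped.
    """
    rows, cols = len(gray), len(gray[0])
    bin_width = 180.0 / n_bins
    hist = [0.0] * n_bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = (gray[r][c + 1] - gray[r][c - 1]) / 2.0   # horizontal derivative
            gy = (gray[r + 1][c] - gray[r - 1][c]) / 2.0   # vertical derivative
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle // bin_width), n_bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist
```

The normalized histogram can then be used as one feature vector per frame, alongside the other features.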

A. Preprocessing of Video Using the Camshift Algorithm
The camshift algorithm is used to preprocess the frames. A wide variety of applications can be found for the camshift algorithm [37,38,39], including object tracking and the reduction of frame rate and size by capturing only the ROI areas. In our approach, we use the camshift algorithm to reduce the number of frames, which is an important step in filtering keyframes. The use of the camshift algorithm is depicted in Fig. 2; similar methods can be seen in the literature [40].

B. Features Used
The features used are computed in the Matlab code listing, and their values are depicted in Fig. 1. In the code, currF is the current frame read, the Canny-edge variable contains the Canny edges, HI is the image histogram, HOG is the histogram of oriented gradients [41], and HOOF is the histogram of oriented optical flow.

C. Dictionary of Key Frames
Atom selection for the dictionary follows an approach similar to that of [2,1]. We select 50 frames for dictionary comparison; the figure of 50 frames is chosen as a computational limit. The feature values of the current frame are compared against those of the previous frames as indicated in equation 4, where pre is the feature value of the previous frames, cu is the feature value of the current frame, and α is calculated by the ABC algorithm as described in the algorithm section. λ is initialized to a small value of 0.01, and the dictionary holds 50 atoms (k = 50). A good dictionary gives the summarization a representative starting point in the video data; dictionary initialization is discussed in [42,43,44].
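A common sparse-coding objective of this kind is the squared reconstruction error of the current frame from the dictionary atoms plus an L1 penalty, ||cu − Σ_i α_i pre_i||² + λ ||α||_1. The sketch below evaluates that objective for one candidate α. This exact form is an assumption modeled on the sparse-land literature, not a transcription of equation 4, and reconstruction_cost is a hypothetical name.

```python
def reconstruction_cost(cu, pre, alpha, lam=0.01):
    """Squared reconstruction error of cu from the atoms in pre, plus an L1 penalty.

    cu:    feature vector of the current frame (length d)
    pre:   list of k feature vectors of previous frames (each of length d)
    alpha: list of k coefficients, e.g. one candidate food source in ABC
    lam:   sparsity weight; 0.01 as stated in the text
    """
    approx = [sum(a * atom[j] for a, atom in zip(alpha, pre))
              for j in range(len(cu))]
    residual = sum((c - v) ** 2 for c, v in zip(cu, approx))
    return residual + lam * sum(abs(a) for a in alpha)
```

In an ABC search, each food source would be one candidate α, and this function would serve as the cost that the employed, onlooker, and scout phases try to minimize.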

D. Threshold as the Reconstruction Error
The threshold α is calculated as the mean over the 50 frames of the current comparison cycle from the dictionary; the window then advances by 50 frames for the next comparison. The value from equation 4 is compared with this threshold: when a frame has a higher reconstruction error (a higher value of α), it is included in the summary.
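The thresholding rule above can be sketched as follows, assuming the per-frame reconstruction errors are already available as a list. Non-overlapping windows and a strict "greater than the window mean" test are assumptions made for illustration, and select_keyframes is a hypothetical name.

```python
def select_keyframes(errors, window=50):
    """Return indices of frames whose error exceeds the mean of their window."""
    selected = []
    for start in range(0, len(errors), window):
        block = errors[start:start + window]
        threshold = sum(block) / len(block)   # mean error of this window
        selected.extend(start + i for i, e in enumerate(block) if e > threshold)
    return selected
```

Because the threshold is recomputed for every window, it adapts to the local level of change in the video; replacing the mean with another statistic changes how many frames are kept.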

V. EXPERIMENTAL RESULT
In this section, we discuss the results obtained using the ABC optimization algorithm on the well-known VSUMM dataset [10]. The dataset consists of 50 videos from different genres, together with user summary keyframes for each video. In this experiment, we compare the automatically generated summary against two of the user summaries available in the VSUMM dataset [10] and report the evaluation for each user summary.
The average evaluation scores in Table III indicate that the ABC algorithm in a sparse-land approach achieves results close to those reported in [10].
Fig. 3 depicts the results for video #30 from the VSUMM dataset, giving a clear indication of the exact frame number matches and the +/-1 frame matches. The results obtained thus demonstrate the sparse-land approach as a full framework for VS, anomaly detection, and online highlighting. This approach is open to including other Text/NLP-based feature inputs [45,46,47,48,49,50]. Frame importance rankings [45] and NLP caption generation methods [51,52], combined with other video features, are recent advances in video summarization features [53,50].

A. Evaluation of Video Summary
The evaluation is based on the approach proposed in [54,10], called Comparison of User Summaries (CUS). In the VSUMM dataset, each video has several user summaries, and a common scoring approach is applied across them. The result of our approach, called the automatic summary, is compared with two user summaries as depicted in Fig. 3. Precision, recall, and F1-score are the common metrics for measuring the performance of a VS framework; the formulas follow [54,10]. The precision, recall, and F1-score against both user summaries in the VSUMM dataset are given in Tables I and II. Equations 5, 6, and 7 are used to compute the evaluation metrics for the automatic summary generated by our approach; the comparison scores are the mean accuracy rate CUSA (precision), the error rate CUSE (recall), and the F1-score. The F1-score obtained by our approach is close to that of other methods [10]. Balancing the threshold parameters in the ABC algorithm can improve the F1-score, but care is needed because precision and recall are affected as well. Finding the right balance among all the parameters of our ABC approach for video summarization, as evaluated by the F1-score, remains an open challenge.
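The metrics can be illustrated with the standard definitions: precision is the number of matched keyframes divided by the size of the automatic summary, recall is the number of matched keyframes divided by the size of the user summary, and F1 = 2PR / (P + R). The sketch below assumes matching by frame number with a +/-1 frame tolerance, as suggested by the description of Fig. 3, and lets each user keyframe be matched at most once. The greedy matching order and the name evaluate_summary are illustrative assumptions.

```python
def evaluate_summary(auto_frames, user_frames, tolerance=1):
    """Precision, recall and F1 of an automatic summary against one user summary.

    Frames are given as frame numbers. An automatic keyframe matches a user
    keyframe if they differ by at most `tolerance` frames, and each user
    keyframe can be matched only once.
    """
    unmatched = sorted(user_frames)
    matches = 0
    for frame in sorted(auto_frames):
        hit = next((u for u in unmatched if abs(u - frame) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)
            matches += 1
    precision = matches / len(auto_frames) if auto_frames else 0.0
    recall = matches / len(user_frames) if user_frames else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1
```

For example, an automatic summary [10, 50, 90, 200] compared with a user summary [11, 52, 90] yields two matches (10 with 11, and 90 with 90), since frame 50 is two frames away from 52 and frame 200 has no counterpart.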

VI. CONCLUSION
In this work, we propose the ABC optimization algorithm for video summarization, which reduces a long video to a short one by removing redundant frames. We have evaluated the performance metrics on the known VSUMM dataset. The evaluation scores are close to those of other methods, with reasonable performance. This method can be used easily for quasi real-time VS and anomaly detection, and it can be extended with other advanced sparse-land approaches such as CSC (Convolutional Sparse Coding Model) [24] and K-SVD [55,56]. Finding an optimal threshold function or value for summarization remains open, as the performance measures are affected when the threshold is decreased or increased.