An Optimized Deep Learning Method for Video Summarization Based on the User Object of Interest

Abstract—Thanks to advances in digital video technology, surveillance video now plays a vital role in maintaining security and protection. Businesses, both private and public, employ surveillance systems to monitor and track their daily operations. As a result, video generates a significant volume of data that must be processed further to satisfy security protocol requirements. Analyzing video manually requires considerable effort and time, as well as fast equipment. To work past these limitations, the concept of video summarization has emerged. In this study, a deep learning-based method for customized video summarization is presented. The method enables users to produce a video summary in accordance with the User Object of Interest (UOoI), such as a car, airplane, person, or bicycle. Several experiments have been conducted on two datasets, SumMe and a self-created dataset, to assess the efficiency of the proposed method. On SumMe and the self-created dataset, the overall accuracy is 98.7% and 97.5%, respectively, with summarization rates of 93.5% and 67.3%. Furthermore, a comparative study demonstrates that the proposed method is superior to other existing methods in terms of video summarization accuracy and robustness. Additionally, a graphical user interface is created to assist the user in summarizing the video using the UOoI.


I. INTRODUCTION
Globally, security is everyone's first priority. To address this challenge, video surveillance cameras have been deployed on both private and public property, alongside other security measures. A variety of security surveillance cameras, both stationary and mobile, have been installed at homes, businesses, airports, banks, and other public locations. These cameras are extremely important for monitoring and spotting anomalous activities. They are also useful in the investigation of incidents and crime scenes, such as car accidents, robberies, murders, and terrorist activities. There are presently estimated to be over 770 million cameras in use worldwide [1]. These cameras, which are typically always on, produce over 2,500 petabytes of video data each day [2]. It is also estimated that this growth will exceed 120 zettabytes in 2023 [3]. Every minute, 500 hours of video are posted to YouTube [4]. Fig. 1 displays daily statistics on the actual data generated by video surveillance cameras around the world.
Content-based video analysis technologies such as motion detection, time monitoring, facial recognition, and license plate recognition have already made great progress. The issue is that analysis of video recordings still requires human intervention (camera operators, security personnel, etc.). Because visual inspection demands concentration and watching the entire video, extracting useful information from video footage is challenging and time-consuming; for lengthy videos, it can also lead to false negatives. Therefore, it is imperative to find a solution that reduces the human time and effort required for manual analysis. To solve this issue, attempts are being made to create a video summary that quickly conveys the essence of the entire video [5]. Video summarization (VS) creates a summary of substantial video content by identifying and presenting the most interesting and relevant content to potential consumers. Security surveillance systems use VS to detect and analyze suspicious or anomalous activity. Individuals also use VS to share short videos on social media, create highlights of sports events, produce movie and television trailers, and index video content to enable quick browsing of large amounts of video through video search engines [6,7]. Researchers have made various attempts to propose automated VS. The majority of VS approaches produce a summary by choosing the key frames that best depict the video during the skimming procedure. Shot boundary detection techniques [8]-[15] are widely known for video summarization. Instead of concentrating on a single item, feature-based techniques [16]-[34] for VS provide a generalized video summary. These methods have trouble accurately recognizing a specific object, which prevents them from meeting the user's needs. Trajectory-based [6,25] and clustering [17,22,35-38] algorithms distill the video by highlighting related objects, actions, and events. These methods, however, do not produce a summary that provides information based on the user's interests. As a result, they restrict the usage of retrieval tasks and do little to improve users' viewing experience. Video summarization may also be accomplished during the skimming process by choosing shot portions with video editing software such as Filmora [38], OpenShot [39], and DaVinci [40].
The aforementioned tools are expensive, need extensive storage, and require user skill. To capture the user's attention, it is also important to carefully choose segments that accurately portray the complete video. The key-frame extraction process, however, appears appropriate for bandwidth-constrained devices and conveys the video's core subject in a few frames. Most existing techniques work on the principle of key-frame selection by eliminating redundant frames, which may result in the loss of important information related to a user's interest and create vagueness. Many surveillance cameras have been erected in public locations to monitor suspicious actions such as mobile phone snatching, terrorism, and robbery, where the information contained in every single frame is critical. As a result, these strategies restrict the usage of retrieval tasks and do not contribute to improving the users' viewing experience. Due to this limitation (the disappearance of objects and events), these techniques are unable to produce significant results. Although various methods summarize a video based on the user's interest, their fundamental issues are high processing power requirements and limited accuracy. This paper proposes a powerful VS method built on the User Object of Interest (UOoI) to address these challenges of video summarization. The UOoI is the object that a user selects; all of the frames in which the selected object appears are collected to summarize the video. Examples of such objects are people, purses, mobile phones, and motorcycles. The proposed VS method has three main steps: i) selection of the UOoI; ii) detection of the object; and iii) summarization of the video based on the UOoI. First, the UOoI is selected from a repository in order to exclude unneeded noisy items (other than the UOoI) that would interfere with object segmentation. Then, the YOLOv3 detector is used to detect the object designated as the UOoI. After the objects are detected, the VS algorithm summarizes the video based on the UOoI. Based on the discussion above, the contributions of the proposed method may be summed up as follows:
• Selection of the UOoI is performed first. The proposed technique chooses the object from the repository and automatically discards any unnecessary objects; YOLOv3 is then utilized to detect the needed object.
• The VS technique can identify a single object as well as several objects in a video clip.
• The proposed technique effectively summarizes the video and handles all the challenging cases demonstrated in the SumMe [40] and self-created datasets.
• The experimental analysis demonstrates that the proposed method works better than cutting-edge techniques in the field of VS.
The remainder of the article is structured as follows: Section II describes the literature review of existing strategies. Section III presents the proposed VS method. Section IV discusses the experimental analysis and results. Section V performs the comparative analysis, and Section VI provides an overview of the graphical user interface. Finally, Section VII addresses the conclusion and future work.

II. LITERATURE REVIEW
Several VS approaches have been put forth in the literature. A technique for summarizing video developed by Ngo et al. [33] is based on content balance and perceptual quality. The task was completed by immediately identifying moving objects, which were then used to apply video optimization. An event-based video summarization approach has been presented by Damnjanovic et al. [19]. The method first totals the absolute difference in pixel values between the current frame and a reference frame and then calculates each frame's energy. All of the events present in the frames are identified in this manner. Keyframes are then extracted to summarize the video. Three processes produce the video summary: extracting visual elements, summarizing the video, and filtering it.
A two-stage technique is presented by Miniakhmetova and Zymbler [10] to produce a personalized summary of a video. The first stage is video structuring, which involves using different scene identification algorithms to produce a video summary. In the second stage, items are picked out of the video frames using a detection bank. The most influential sequences in which items are recognized, which later form a region of the user's interest, are included in the produced video summary. According to a method suggested by Varghese and Nair [8], video can be summarized using three primary steps: shot boundary identification, redundant frame reduction, and stroboscopic imaging.
The neighboring frame is compared with the current frame to determine the shot boundary. After that, the Structural Similarity Index (SSI) is adopted to eliminate repeating frames. The strobe is also used to display the activities already taking place in the video and to grasp the common backdrop. In comparison to the original video, the summarized video's overall volume decreased by 55%. Lai et al. [15] developed a frame re-composition-based technique utilizing a clustering algorithm, optical flow, and background subtraction with the goal of recognizing foreground elements. The foreground object is identified through the fusion of several pixels. Once the objects or actions have been detected, a sliding window is utilized to integrate the recognized elements in succeeding frames to produce a spatiotemporal trajectory. The full spatiotemporal trajectory is combined to produce the video summary, and the algorithm has a 97% accuracy rate.
According to Srinivas et al. [17], three factors may be employed to determine a video's summary. First, each frame is given a score based on a variety of factors, such as color, statistical attention, quality, demonstration, temporal segment, and uniformity. A weight is then assigned to each value based on the position of the attribute for producing key frames; the weighting is determined using the standard deviation. Lastly, repeated frames are removed, with frames being gathered in ascending order of score. In comparison to previous strategies, this keyframe selection method yields results that are 1.8% better. Frame selection in lecture clips has been studied by Davila and Zanibbi [32], who focused on segmentation by reducing content section conflicts, deleting objects, and rebuilding every frame to produce a summary of the video. The Kalman filter has been used to monitor human movements in Ajmal et al.'s [27] approach to determining the trajectory. Color properties are useful for video because the color histogram may be utilized to identify shots and provide a synopsis of the video.
To identify aberrant frames and eliminate noisy information from the video, Ma et al. [9] developed a shared representation of neighboring frames. Keyframes are chosen using minimal sparse reconstruction to minimize noise and preserve critical information; a keyframe is a frame with a significant aberrant representation error. The average percentage of reconstruction (APOR) and the sparse border are used to manage the keyframe count in a greedy iterative technique for model optimization. A cloud-based system called HOMER was introduced by Meyer et al. [41] for the creation of video highlights. With this technology, the user's emotions may be detected in order to provide a video summary. A dataset captured using a dual-camera system and a home video randomly chosen from Microsoft's Video Titles in the Wild (VTW) dataset are both used for the experimental research. As a result, HOMER improved by 38% above the baseline. Uncertainty detection and image processing techniques in decision making have also been presented [42,43]. According to Afzal and Tahir [44], ResNet-152 and a Gated Recurrent Unit (GRU) were used in tandem to summarize a video. In this technique, the deep features present in the video are extracted using ResNet-152, and a GRU is utilized to increase the approach's performance and resilience. An experimental study is conducted on the SumMe dataset, achieving an F-measure of 43.7. A brief overview of current VS approaches is given in Table I.
The majority of current solutions remove redundant frames and keep a few key frames, which can lose crucial information pertaining to a user's interest. Numerous surveillance cameras have been erected in public spaces to monitor suspicious actions like mobile theft, terrorism, and robbery, where the information contained in each single frame is crucial. These methods are unable to yield meaningful results because of this restriction (the disappearance of objects and events). Additionally, no method produces a summary of a video according to the user's specifications, such as one based on a single item (person, car, etc.). The proposed VS method is straightforward and highly reliable; it quickly and accurately generates a summary of a video depending on the user's requirements. The user chooses the UOoI as an input, and the algorithm generates the output in accordance with the user's requirements.
TABLE I. A BRIEF OVERVIEW OF CURRENT VS APPROACHES

1. Varghese and Nair [8]. Method: a stroboscopic effect is used to inspect the common backdrop frames. Remark: reduces video duration by 55%.

2. Lai et al. [15]. Method: deletes irrelevant spatio-temporal segments via frame re-composition; the extracted items are reconnected along the spatiotemporal trajectory to create the video summary. Limitation: objects can be detected only with a stationary camera.

3. Ma et al. [9]. Method: optimizes the model based on adjacent frames with an iterative method that uses the average percentage of frame reconstruction. Limitation: dedicated only to fixed-size frames.

4. Davila and Zanibbi [32]. Method: focuses on the hand-written whiteboard material in lecture videos and summarizes the video by removing ambiguity between topic sections. Limitation: lower accuracy.

5. Damnjanovic et al. [19]. Method: identifies and groups the events shown in CCTV footage; two summary types, static and dynamic, are provided. Limitation: events may be falsely detected when the backdrop environment changes.

6. Ngo et al. [33]. Method: captures both attention values and the structure of the video; the video is organized in a hierarchical tree based on scenes, groups, etc. to eliminate redundancy. Limitation: low summarization rate of about 10-15%.

7. Miniakhmetova and Zymbler [10]. Method: builds a description of the video from user comments (likes, dislikes, and neutral criticism) in light of aspects influencing the scenario, including item appearance. Limitation: no prototype is available.

8. Ajmal et al. [27]. Method: a Support Vector Machine (SVM) classifier recognizes individuals using a Histogram of Oriented Gradients (HOG), and a Kalman filter monitors mobility. Remark: by making browsing quick, the technique decreases video storage and saves time.

III. PROPOSED METHODOLOGY
Fig. 2 depicts the architecture of the proposed VS system based on the UOoI. The following key modules make up the proposed system:
• UOoI detection module: detects the UOoI in videos using deep learning.
• Video summarization module: produces the video summary using the frames containing a UOoI.

A. User Object of Interest (Repository)
Defining the UOoI is the initial stage of VS. An 80-item UOoI dictionary is created specifically for this purpose. The COCO dataset, which comprises 330 thousand images, more than 200 thousand of them labeled, is used to define the UOoI. It offers 80 object categories, such as cars, people, and handbags [45].
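As a concrete illustration, a minimal sketch of such a repository is given below. It assumes the 80 COCO category names are available in a plain-text file, one label per line (e.g., the coco.names file distributed with Darknet/YOLOv3); the file name and helper function are illustrative, not part of the original implementation.

```python
def load_uooi_repository(path="coco.names"):
    """Return the list of selectable object categories (the UOoI dictionary)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# The user picks one category from the repository as the object of interest,
# e.g. "handbag"; anything not in the repository is rejected up front.
uooi_repository = load_uooi_repository()
user_object = "handbag"
assert user_object in uooi_repository
```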

B. Object Type Detection
YOLO (You Only Look Once) v3 is employed in the proposed method to identify the UOoI. Object detection determines the position in the scene and image where the UOoI appears and categorizes it, e.g., as a person, automobile, or bicycle. YOLOv3 employs a 53-layer modified Darknet trained on ImageNet. In addition, 53 more layers are added for the detection task, giving YOLOv3's underlying architecture a total of 106 layers. To avoid losing low-level information, there is no pooling layer; instead, the feature maps are down-sampled using a convolutional layer with stride 2. YOLOv3 is substantially quicker at identifying objects than other object recognition methods [46]. YOLOv3 processes the entire video using just one neural network. The network divides each image into regions and generates bounding boxes and probabilities for each region. Logistic regression is used in YOLOv3 to predict each class score, and a threshold may be used to predict multiple labels for an object; the classes with scores over the threshold are assigned to the box [47]. Fig. 3 describes the prediction of the bounding box.
YOLOv3 predicts four coordinates, t_x, t_y, t_w, and t_h, for each bounding box, where (b_x, b_y) is the box center and (b_w, b_h) its size. If the cell is offset by (C_x, C_y) from the image's top-left corner and the bounding box prior has dimensions p_w and p_h, the predictions are:

b_x = σ(t_x) + C_x
b_y = σ(t_y) + C_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)

The box confidence is Pr(obj) × IOU, where Pr(obj) is the probability that an object exists in the grid cell. Both Pr(obj) and the confidence depend on the presence of an object in the grid: Pr(obj) is 1 if an object is in the cell, and the confidence is 0 when no object is present. Eq. (6) describes the IOU, which is the ratio between the predicted and real objects:

IOU = Area(B_pred ∩ B_gt) / Area(B_pred ∪ B_gt)    (6)

Here, Area(B_pred ∩ B_gt) is the area of the intersection between the predicted and real objects, whereas Area(B_pred ∪ B_gt) is the combined area of the predicted and real objects. Similarly, the object class is predicted when it appears in the grid cell. In that case, the class-specific confidence is measured by multiplying the predicted class probability by the box confidence, as given in the following equations:

Pr(class_i | obj) × Pr(obj) × IOU    (7)
= Pr(class_i) × IOU    (8)
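The decoding above can be made concrete with a short sketch. The following Python/NumPy functions mirror the box-decoding equations and Eq. (6); the function names and structure are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Raw YOLOv3 outputs (t_*) -> box center (b_x, b_y) and size (b_w, b_h)."""
    b_x = sigmoid(t_x) + c_x      # center x, offset inside the grid cell
    b_y = sigmoid(t_y) + c_y      # center y
    b_w = p_w * np.exp(t_w)       # width, scaled from the anchor prior
    b_h = p_h * np.exp(t_h)       # height
    return b_x, b_y, b_w, b_h

def iou(a, b):
    """Eq. (6) for boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```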

C. Comparison in Terms of Speed and Accuracy
In terms of speed, YOLOv3 is the best of the compared object detection models. YOLOv3 processes data at a rate of 45 fps, which is fast in contrast to Single-Shot Detectors (SSD), Faster R-CNN, and R-FCN. YOLOv3's speed against other object detection models is shown in Fig. 4. Accuracy, another crucial element, is considered for the comparison in Fig. 5. Faster R-CNN executes with an accuracy rate greater than YOLOv3, but YOLOv3 is significantly more accurate than most of the other models, and Faster R-CNN operates considerably more slowly than YOLOv3 in realistic situations. Many other algorithms, such as the R-CNN family and SSDs, perform similarly but take longer to complete because of their numerous, intricate stages. YOLOv3, on the other hand, uses single-stage detection to complete the same task with a single neural network. Compared to other models, YOLOv3 operates precisely and executes more quickly, detecting 45 frames per second as opposed to the Faster R-CNN family's five frames per second [48,49].

D. Video Summary Generation Algorithm
The key-frame collection process based on objects of interest in videos using the YOLOv3 deep learning model can be described mathematically as follows. Let VS be the collection of videos, denoted by vs_i, i ∈ [1, n]; let UO be the collection of desired objects, denoted by uo_j, j ∈ [1, m]; and let Fm(vs_i) be the set of frames in video vs_i, where each frame is denoted by fm_ik, k ∈ [1, p]. The YOLOv3 model calculates the bounding boxes b_ik and the associated class probabilities p_ik for each frame fm_ik in video vs_i. The bounding box coordinates (x, y, w, h) give the item's location and size, and the class probabilities give the likelihood that it belongs to a certain class. Collecting the key frames requires confidently selecting the frames that include the objects of interest, which is achieved by setting a threshold on the class probability. This threshold is denoted by α, where 0 ≤ α ≤ 1.

The key-frame collection rule using YOLOv3 for the desired objects is then defined as:

Key = {(vs_i, fm_ik) : p_ik(uo_j) ≥ α for some uo_j ∈ UO}

where Key is the set of key frames containing user objects of interest and p_ik(uo_j) represents the class probability of user object uo_j in frame fm_ik. The rule iterates over each video vs_i and its frames fm_ik, and includes the video-frame pair (vs_i, fm_ik) in Key if an object uo_j appears in frame fm_ik with a class probability p_ik(uo_j) greater than or equal to the threshold. Additionally, to summarize the video using all crucial frames, the rule can be modified as follows:

Smv = {fm_ik : p_ik(uo_j) ≥ β for some uo_j ∈ UO}

where Smv is the collection of frames containing objects of interest whose class probabilities are greater than or equal to the summarization threshold β, 0 ≤ β ≤ 1. With these thresholds on the class probabilities, important frames may be collected, and the video may be summarized from all frames that meet the threshold and contain the relevant objects. With these adjustments, the key-frame collection procedure is more adaptable, and YOLOv3-based video summaries become possible. Utilizing the proposed architecture, the procedure for creating video summaries is given below.

Input: a video vs_i and the set of desired objects UO.
Output: Key, the set of key frames containing the desired objects, and Smv, the summarized collection of frames containing the desired objects.
Algorithm:
1. Initialize an empty set Key to store key frames.
2. For each frame fm_ik in Fm(vs_i), run YOLOv3 to obtain the class probabilities p_ik.
3. If p_ik(uo_j) ≥ α for some uo_j ∈ UO, add (vs_i, fm_ik) to Key.
4. Build the summary Smv from the frames whose class probabilities meet the threshold β.
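A minimal Python sketch of this procedure is given below, assuming OpenCV's DNN module and the standard Darknet files (yolov3.cfg, yolov3.weights). The file names, the 0.5 default threshold, and the helper functions are illustrative assumptions rather than the paper's exact implementation.

```python
import cv2

def frame_has_object(net, out_names, frame, class_id, alpha):
    """Return True if the frame contains the UOoI with probability >= alpha."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    for output in net.forward(out_names):
        for det in output:          # det = [x, y, w, h, objectness, 80 scores]
            if det[5 + class_id] >= alpha:
                return True
    return False

def collect_key_frames(video_path, class_id, alpha=0.5,
                       cfg="yolov3.cfg", weights="yolov3.weights"):
    """Indices of frames in which the selected UOoI appears (the set Key).
    The summarization threshold beta can be applied the same way for Smv."""
    net = cv2.dnn.readNetFromDarknet(cfg, weights)
    out_names = net.getUnconnectedOutLayersNames()
    cap = cv2.VideoCapture(video_path)
    key_frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_has_object(net, out_names, frame, class_id, alpha):
            key_frames.append(idx)
        idx += 1
    cap.release()
    return key_frames
```

The summary video can then be produced by writing only the listed frames with cv2.VideoWriter.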

IV. EXPERIMENTAL ANALYSIS AND RESULTS
Python is used as the programming language, and all experiments are performed on a computer with an Intel Core i5 6th-generation processor and 8 GB of RAM.
In this study, the effectiveness of the proposed method is assessed using a subjective technique. For each test stream, a summarized video is produced both manually (using the video editing application DaVinci) and automatically using the proposed scheme; in simple terms, this is a frame-level comparison. Precision, recall, F1-score, and accuracy are used to assess the performance of the proposed method, with the following mathematical expressions:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Two distinct datasets, the SumMe dataset and a self-created dataset, are utilized to evaluate and compare the effectiveness of our approach against the manual method. SumMe is a video summarization dataset of 25 videos, each annotated with at least 15 human video summaries. Our own dataset includes videos gathered from various sources. The videos come in several resolutions, including 320 x 240, 352 x 240, 640 x 360, 854 x 480, and 1920 x 1080, and are in the AVI and MP4 formats. The example test video sequences from the SumMe dataset and our self-created dataset are listed in Tables II and III, along with their parameters. The efficiency of our approach is assessed through a number of tests on videos of various lengths and resolutions. The following sections discuss the evaluation on both datasets.
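As an illustration of the frame-level comparison, the following sketch computes the four metrics from the set of frames kept by the manual summary and the set kept by the method; the function and variable names are hypothetical, not the paper's evaluation code.

```python
def frame_level_metrics(manual_frames, predicted_frames, total_frames):
    """Precision, recall, F1-score, and accuracy for a frame-level comparison."""
    manual, predicted = set(manual_frames), set(predicted_frames)
    tp = len(manual & predicted)        # frames kept by both summaries
    fp = len(predicted - manual)        # frames kept only by the method
    fn = len(manual - predicted)        # frames the method missed
    tn = total_frames - tp - fp - fn    # frames discarded by both
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total_frames
    return precision, recall, f1, accuracy
```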

A. Evaluation of the SumMe Dataset
For the evaluation of the SumMe dataset, different scenarios are taken from it, as listed in Table II. The first scenario is a river crossing, in which several people cross a river and some of them carry a handbag; in this scenario, all the scenes where a handbag (the user object of interest) appears are collected. The second scenario involves a dog playing with a ball, so all the movements of the dog are tracked in this video. In the kids scenario, a bicycle appears for a limited duration, so it is taken as the object of interest. In the next video, St. Martin is taken as the UOoI, and the final video is from the documentary Under Water, where people are searching for different things, so the person is taken as the UOoI. Fig. 6 describes these scenarios, and Fig. 7 shows the efficiency of our approach by presenting the UOoI-detection shots.

1) Results of the SumMe dataset: As can be seen, the proposed method shows strong results on the SumMe dataset. However, some frames can be falsely predicted or missed, as in Scenario 2 of Fig. 7; this is because of distortion in the video, so such frames can be identified only by the naked eye. In the best case, such as Scenario 5, all the frames are properly detected by the proposed method, achieving the highest accuracy. Fig. 8 shows the confusion matrices, and the SumMe dataset results are given in Table IV.

B. Evaluation of the Self-Created Dataset
For the evaluation of the self-created dataset, different scenarios are taken from online repositories, as listed in Table III. The first scenario involves a car mirror being broken, in which a person breaks the car's mirror and takes a handbag from the car, so the handbag is considered the object of interest. The second scene is a robbery, so the person is taken as the UOoI. The third scene is related to monitoring dog activity, so the dog is the UOoI. In the fourth and fifth scenarios, a bicycle and a person are taken as the UOoI, respectively. Fig. 9 describes these scenarios along with the UOoI-detection shots, and Fig. 10 shows the UOoI-based shot detection, demonstrating the efficiency of the proposed method.
1) Results of the self-created dataset: As can be seen, the proposed strategy yields superior results on the self-created dataset. However, some frames can be falsely predicted or missed because of low resolution or low light in the video, so such frames can be identified only by the naked eye. In the best cases, such as Scenarios 1, 3, and 5, all the frames are accurately detected by our method with the highest accuracy. The confusion matrices are shown in Fig. 11, and Table V describes the results on the self-created dataset.

V. COMPARATIVE ANALYSIS
The comparative analysis of the proposed method with existing state-of-the-art methods is presented in this section. The following core characteristics serve as the foundation for the comparison:
• C1: Customized user object type (UOoI)
• C2: Frame extraction based on the UOoI
• C3: Accuracy
• C4: Rate of summarization
Table VI demonstrates that the majority of the strategies now in use focus on the general detection of objects rather than on one particular, specific object (UOoI). Similarly, numerous algorithms extract the video summary by eliminating unnecessary frames and scenes rather than concentrating on the objects. This comparison demonstrates that our method is distinctive in that it includes the most important qualities for VS: it considers the user's input to summarize the video and produces the output accordingly, extracting the frames that fall within the region of the user's interest. Furthermore, the proposed method is more accurate, achieving 98.7% accuracy with the highest summarization rate of 93.5% compared to existing state-of-the-art methods.

VI. GRAPHIC USER INTERFACE (GUI APPLICATION)
In the current study, a desktop application is also created utilizing PyQt5, a Python-based GUI framework, to give users an interactive interface for performing VS after detecting objects. The interface of the application created for the selection of the input video is shown in Fig. 12. The application also validates the supplied input's format and displays information about the input. The system requires a video file in MP4 or AVI format as input, since MP4 and AVI are both standardized file types. If the input has fewer than two frames, the application does not regard it as a video and issues a warning notice to the user. After the video has been chosen, the next step is choosing the object type (UOoI) to be recognized in the input video. Several options are available for choosing an object in this section, so a user may easily choose the UOoI based on his or her preferences.
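A minimal sketch of this validation step is given below, assuming PyQt5 and OpenCV; the widget wiring and names are illustrative and not the application's actual code.

```python
import os
import cv2
from PyQt5.QtWidgets import QFileDialog, QMessageBox, QWidget

class VideoPicker(QWidget):
    def pick_video(self):
        """Let the user choose an input video and validate it."""
        path, _ = QFileDialog.getOpenFileName(
            self, "Select input video", "", "Videos (*.mp4 *.avi)")
        if not path:
            return None
        # Only the standardized MP4 and AVI formats are accepted.
        if os.path.splitext(path)[1].lower() not in (".mp4", ".avi"):
            QMessageBox.warning(self, "Invalid format",
                                "Please select an MP4 or AVI file.")
            return None
        cap = cv2.VideoCapture(path)
        frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
        # Fewer than two frames is not treated as a video.
        if frames < 2:
            QMessageBox.warning(self, "Invalid video",
                                "The file contains fewer than two frames.")
            return None
        return path
```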

VII. CONCLUSION AND FUTURE DIRECTION
This article provides a useful VS technique for summarizing videos using the UOoI. The proposed approach is notably more efficient, optimal, and quick compared to current state-of-the-art techniques for summarizing video.
The UOoI-based solution increases the user's ability to reliably and flexibly construct the pertinent video summary. The proposed method can detect diverse object types accurately and efficiently using YOLOv3. The approach is extensively tested on two different datasets: the SumMe dataset and a self-created dataset. On the SumMe dataset, the proposed approach achieves an accuracy of 98.7% with a quick processing rate and a time saving of 93.5% compared to viewing the complete video to detect the UOoI. On the self-created dataset, the accuracy is 97.5% and the overall time reduction is 67.3%. Similarly, a comparative analysis shows that the proposed work is novel, with the highest accuracy as well as the highest summarization rate. Furthermore, a GUI that provides ease of use and configurable object selection is also developed. Future work will expand this project to include multiple objects of interest and concentrate on improving its accuracy and summarization rate.

Fig. 4. Comparison of object detection models in terms of speed.

TABLE III. SELF-CREATED DATASET STATISTICS

TABLE IV. RESULTS OF THE PROPOSED METHOD ON THE SUMME DATASET

Table VII provides another comparison of the proposed work with the existing methods.

TABLE VI. COMPARATIVE ANALYSIS WITH EXISTING METHODS BASED ON FACTORS

TABLE VII. ANALYSIS OF THE PROPOSED METHOD'S EFFICIENCY IN COMPARISON TO MODERN TECHNIQUES