Moment Features based Violence Action Detection using Optical Flow

Instantaneous detection of violence remains an unsolved research problem despite recent progress in artificial intelligence. The severity of injuries caused by violence can be reduced by detecting violence in real time, which demands effective violence detection. Various methods have been proposed for violence detection, but they could not provide robust results due to many challenges, i.e. noise, motion estimation, lack of appropriate feature selection, lack of an effective classification approach, complex backgrounds and variations in illumination. This research proposes an efficient method for violence detection that extracts moment features to exploit motion patterns, which facilitates detection in each frame and provides a smaller area as the region of interest. This reduces the probability that motion intensity is lost because of a similarly colored object in the background, and thus minimizes background complexity. The proposed method then uses optical flow to calculate angles and linear distances in each frame. If any frame is lost due to noise or illumination variation, the proposed method uses a Kalman filter to process that frame by eliminating noise. Finally, the decision for violence is determined using a random forest classifier from a single feature vector by generating a set of probabilities for each class. Extensive experimentation was performed, in which an accuracy rate of 99.12% was achieved at a frame rate of 35 fps, which is higher than previous research results. Experimental results reveal the effectiveness of the proposed methodology.
Keywords: Violence detection; feature extraction; classification; optical flow


I. INTRODUCTION
Surveillance applications are used to monitor public and private areas, yet intelligent violence detection is still an unsolved research problem. A surveillance system with violence detection capability can be used to monitor various places, i.e. airports [1], football matches and protests [2], parks [3], stadiums [4], internet video filtration [5], markets [6] and government offices [7]. Previous researchers proposed various methods for intelligent violence detection, which could not provide satisfactory results due to many challenges, i.e. lack of efficient feature extraction [8,9], lack of an appropriate segmentation method [10], motion estimation [11,12], variation of brightness or illumination [13], cluttered backgrounds [14], scene complexities [15], low-level images [16], lack of an efficient noise reduction method [17] and occlusions [18,19]. The method proposed in this research uses motion intensity characteristics by extracting moment features to facilitate violence detection in each frame effectively.
Many researchers have studied violence detection. Research in [20] used a 3D-based convolutional neural network (C3D) and a CNN with long short-term memory (CNN-LSTM), combined through a shallow neural network, to learn high-level spatial-temporal information from raw image data for violence detection. However, their proposed method was suitable for still images. Research in [21] used a set of selectively distributed frames and spatio-temporal features, built from both space and time dimensions, as input to a fully connected neural network to classify violent or non-violent actions, which required further validation in terms of complexity. Research in [22] distinguished physical violence by designing a DT-SVM (Decision Tree-SVM) two-layer classifier. However, their overall methodology requires further improvement for more complex scenes with both nearby and distant objects. Research in [23] modeled crowd dynamics using temporal summaries of grey level co-occurrence matrix (GLCM) features. However, their proposed methodology required improvement towards adaptive selection of optimal parameters based on the given data.
This research extracts moment features as the basis of the motion pattern to facilitate violence detection. In this context, linear distances and angles are calculated using optical flow. Besides, if any frame is lost due to noise or illumination variation, the proposed method uses a Kalman filter to recover that frame for further processing by eliminating noise. The overall contributions of this research are stated below:
• The proposed method uses moment features, computed as a weighted average of pictorial intensities, to exploit motion patterns for detection in each frame. This reduces background complexity and provides a smaller area as the region of interest, which reduces the overall computation time per frame.
• After calculating linear distances and angles on the basis of optical flow, the proposed method uses a Kalman filter to rectify frame loss due to noise or illumination variation, which plays a significant role in optimally estimating distances and angles for higher accuracy.
• As part of the overall methodology, this research uses a random forest classifier on a single feature type to avoid the complication of using multiple feature types, which results in lower processing time per frame compared with previous research methods.
The rest of this paper is organized as follows. Section II presents a comprehensive and critical review of the existing research, Section III illustrates the proposed methodology for violence detection, Section IV presents extensive experimental validation of the proposed method and finally Section V presents concluding remarks.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, 2020

II. BACKGROUND STUDY
Various methods aiming to solve the violence detection problem have been proposed, as summarized in Fig. 1.
Research in [20] applied two deep neural networks (DNNs), i.e. a 3D-based convolutional neural network (C3D) and a CNN with long short-term memory (CNN-LSTM), to learn high-level spatial-temporal information from raw image data. They combined the feature maps obtained from C3D and CNN-LSTM by designing a shallow neural network. However, the combination of feature maps from C3D and CNN-LSTM is suited to object detection from still images, which causes high complexity in their overall research. In this context, computation time was not estimated during validation. Research in [21] illustrated an end-to-end deep neural network for violence detection using surveillance cameras. They extracted a set of selectively distributed frames from the video and passed spatio-temporal features to a fully connected neural network in order to classify violent or non-violent actions. Although they created the spatio-temporal features by extracting features over both space and time dimensions through a custom-built convolutional neural network and a long short-term memory (LSTM) recurrent neural network, validation against computation time or processing time per frame was ignored in their research. Research in [22] selected some prior features to distinguish physical violence from daily-life activities. They designed a DT-SVM (Decision Tree-SVM) two-layer classifier, i.e. the first layer acted as a decision tree to exploit the previously selected features and the second layer was an SVM classifier that used the features for classification. However, for nearby and distant objects under complex scenarios, their research could not provide satisfactory results. In addition, their proposed approach produced significant misclassification for frames with significant changes in light and shadow.
Research in [23] proposed a real-time descriptor to model crowd dynamics by encoding variations in crowd texture using temporal summaries of grey level co-occurrence matrix (GLCM) features. They measured inter-frame uniformity and illustrated that violent behavior varies in a less uniform manner. In addition, they discriminated between abnormal and normal scenes by generating a scene description. However, adaptive selection of optimal parameters based on the given data requires further improvement in their research. Research in [24] extracted key features, i.e. speed, direction, centroid and dimensions, and used a linear SVM to classify the input video as violent or non-violent. Their proposed method considered two feature vectors, i.e. Local Binary Pattern (LBP) and Violent Flows (ViF). Because LBP or ViF alone takes less time to calculate than the two feature vectors applied together, the combination of LBP and ViF in their research did not provide a significant direction for future improvement. Research in [25] tested the Bag-of-Words framework for detecting fights or violence by constructing a versatile and accurate fight detector using local descriptors. Although they achieved an encouraging accuracy rate, the computation cost of extracting local descriptors is prohibitive for practical applications, particularly in surveillance and media rating systems. Research in [26] proposed a method using extreme acceleration patterns estimated by applying the Radon transform to the power spectrum of consecutive frames. Their method assumed that kinematic cues that represent violent motion and strokes can be used to detect fights. In addition, they hypothesized that if motion is considered a sufficient characteristic for recognition, their overall methodology requires significant additional computation and may also confuse the detector.
However, global motion estimation in their research did not seem to improve results significantly. Their proposed method required further refinement by approximating the Radon transform, which was the most time-consuming stage. Research in [27] proposed Oriented VIolent Flows (OViF) for feature extraction to take full advantage of motion magnitude variations in statistical motion orientations. In addition, they applied feature combination and multi-classifier combination strategies. However, their proposed method could detect violence only in crowded scenes. Research in [28] proposed a method in which corner points of images are detected using the Shi-Tomasi corner detection algorithm. They used an optical flow parameter calculated with the Lucas-Kanade pyramid optical flow algorithm for violence detection. However, their proposed method did not provide robust performance for discontinuous and fast motion.
The method proposed in this research calculates distances and angles on the basis of optical flow. In addition, it uses a Kalman filter to handle frame loss due to noise or illumination variation. Besides, a motion pattern based on moment features extraction, computed as a weighted average of pictorial intensities, is used to reduce background complexity in case of similarly colored objects.
Fig. 1. Previous methods: 3D-based convolutional neural network (C3D) [20], CNN with long short-term memory (CNN-LSTM) [20], deep neural network [21], Decision Tree-SVM (DT-SVM) [22], grey level co-occurrence matrix (GLCM) [23], Local Binary Pattern (LBP) [24], Bag-of-Words framework [25], Acceleration Measure Vectors (AMV) [26], Oriented VIolent Flows (OViF) [27] and optical flow energy characteristics [28].

III. PROPOSED METHODOLOGY
This research follows non-tracking based action recognition in order to achieve computationally sound and effective violence detection. The proposed research uses object motion to identify whether an event is violent or non-violent. It uses moment features together with optical flow to calculate linear distances and angles, which operate on the pixel intensities of objects. The overall proposed method is depicted in Fig. 2.

A. Input Image and Preprocessing
Preprocessing is an essential part of extracting features efficiently to classify violent actions. This step is crucial, as any inconsistency arising from it may lead to misclassification of violent actions. The proposed research uses a monocular video camera at a frame rate of 35 fps to collect input video frames. A median filter [29,30] is used to remove noise from the collected video frames. In this context, each frame is resized to 300x250 and morphological operations, i.e. erosion and dilation, are applied to pass noise-free frames to the subsequent steps. In addition, this research uses a two-frame differential approach to find the initial change between consecutive frames.
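
The median filtering and two-frame differencing steps can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists and a 3x3 filter window; the window size, the border handling and the function names `median_filter_3x3` and `frame_difference` are illustrative assumptions, not details given in this paper.

```python
from statistics import median


def median_filter_3x3(frame):
    """Return a 3x3 median-filtered copy of a grayscale frame (list of rows).

    Border pixels use only the neighbours that fall inside the frame
    (an assumption; other border conventions are possible).
    """
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [frame[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out


def frame_difference(prev, curr):
    """Absolute per-pixel difference between two consecutive frames."""
    return [[abs(c - p) for p, c in zip(row_p, row_c)]
            for row_p, row_c in zip(prev, curr)]


# Toy frames: the isolated noisy pixel (255) is suppressed by the filter,
# while the genuine change in the second frame survives the differencing.
noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
clean = median_filter_3x3(noisy)
moved = [[10, 10, 10], [10, 10, 60], [10, 10, 10]]
diff = frame_difference(clean, moved)
```

In this toy case only the pixel that actually changed between the two frames remains non-zero in `diff`, which is the initial change that the later steps operate on.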

B. Moment Features Extraction
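The proposed method extracts moment features from the median filtered image. If $x$ and $y$ are the coordinates of the median filtered frame $I(x, y)$, the raw moment of $I$ of order $(p + q)$ is defined as in (1). When considering $I$ as a 2D continuous function $f(x, y)$, (1) can be expressed as (2).
$$M_{pq} = \sum_{x} \sum_{y} x^{p} y^{q} I(x, y) \quad (1)$$
$$M_{pq} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^{p} y^{q} f(x, y) \, dx \, dy \quad (2)$$
Here, $M_{pq}$ denotes the raw moments to be used for calculating distances and angles on the basis of optical flow in further processing.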
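
A minimal sketch of the discrete raw moment computation follows, assuming the standard definition in which the moment of order (p + q) is the sum of x^p · y^q · intensity over all pixels. The helper `intensity_centroid`, which derives an intensity-weighted centre from the zeroth and first order moments, is an illustrative assumption about how a smaller region of interest could be located and is not taken from this paper.

```python
def raw_moment(frame, p, q):
    """Raw moment of order (p + q): sum over pixels of x**p * y**q * intensity."""
    return sum((x ** p) * (y ** q) * value
               for y, row in enumerate(frame)
               for x, value in enumerate(row))


def intensity_centroid(frame):
    """Intensity-weighted centroid (M10 / M00, M01 / M00), or None if empty."""
    m00 = raw_moment(frame, 0, 0)
    if m00 == 0:
        return None  # all-zero frame: centroid is undefined
    return raw_moment(frame, 1, 0) / m00, raw_moment(frame, 0, 1) / m00


# Toy frame with a 2x2 bright block; x indexes columns and y indexes rows.
frame = [
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [0, 2, 2, 0],
]
m00 = raw_moment(frame, 0, 0)          # total intensity
centroid = intensity_centroid(frame)   # weighted average position
```

The centroid is a weighted average of pixel positions with intensities as weights, which is one way to read the "weighted average of pictorial intensities" used to narrow the region of interest.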

C. Calculation of Distance and Angle
The proposed method determines the motion pattern among consecutive frames from the extracted moment features using the two-frame differential approach, where angles and linear distances are calculated for each frame of the input video. Parameters such as pyramid scale, levels, window size and iterations are used to calculate motion. A pyramid scale of 0.5 is used, which gives a classical pyramid where each layer is half the size of the previous one. The number of levels is set to 5, which is the number of pyramid layers including the initial frame. The proposed method uses a larger window size of 20 to improve the efficiency of the classification in terms of detecting motion. This research uses a fixed window around the foreground object, which limits the scope of motion segmentation to a smaller area. This reduces the probability that the extracted motion is lost because of a similarly colored object in the background, and also improves processing speed.
The proposed method calculates a combination of distances and angles on the basis of optical flow for each frame of the input video to classify whether an action is violent or not. It calculates image gradients in the horizontal and vertical directions along with the gradient along time, and uses optical flow to calculate the angle and linear distance for every pixel between consecutive frames. From the calculated angles, the linear distances are categorized into an oriented histogram of 6 bins, equally divided at 60° intervals, and the distances in each bin are summed. Distances and angles are calculated as in (3) and (4).
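
As a worked illustration of the pyramid parameters, the sketch below lists the layer sizes produced by a pyramid scale of 0.5 with 5 levels, starting from the 300x250 frame size used in this paper. Rounding each dimension down to an integer is an assumption; actual optical flow implementations may round differently.

```python
def pyramid_sizes(width, height, scale=0.5, levels=5):
    """Layer sizes of an image pyramid, including the initial frame."""
    sizes = []
    w, h = width, height
    for _ in range(levels):
        sizes.append((w, h))
        # Each next layer is `scale` times the previous one (floored, min 1).
        w, h = max(1, int(w * scale)), max(1, int(h * scale))
    return sizes


sizes = pyramid_sizes(300, 250)
```

With these settings the coarsest layer is only 18x15 pixels, so the window size of 20 covers essentially the whole frame at that level.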

$$K = \sqrt{(\Delta x)^{2} + (\Delta y)^{2}} \quad (3)$$
$$L = \tan^{-1}\left(\frac{\Delta y}{\Delta x}\right) \quad (4)$$
Here, $K$ denotes distance, $L$ denotes angle, $\Delta x$ denotes the distance change in the horizontal direction and $\Delta y$ denotes the distance change in the vertical direction. Distance and angle are obtained for a pixel position $(x_{i-1}, y_{i-1})$ in the $(i-1)$th frame and the same pixel position $(x_{i}, y_{i})$ in the $i$th frame. The Pythagorean theorem [31] is used to calculate angles and linear distances, where the angle $L$, in degrees, is defined from the $(i-1)$th frame to the $i$th frame for every pixel. Linear distances of pixels between the $(i-1)$th and $i$th frames are defined by $K$. Finally, the proposed method uses a random forest classifier to classify whether the video scene contains violence or not.
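
The distance, angle and 6-bin oriented histogram can be sketched as follows. The sketch assumes that the per-pixel horizontal and vertical changes (here `dx`, `dy`) are already available from an optical flow step; the names `distance_and_angle` and `oriented_histogram` are hypothetical. It uses `atan2` rather than a plain arctangent of the ratio so that a zero horizontal change does not cause a division by zero, which is a design choice of this sketch and not a detail stated in the paper.

```python
import math


def distance_and_angle(dx, dy):
    """Return (K, L): linear distance and angle in degrees within [0, 360)."""
    k = math.hypot(dx, dy)                         # sqrt(dx**2 + dy**2)
    l = math.degrees(math.atan2(dy, dx)) % 360.0   # full-circle angle
    return k, l


def oriented_histogram(flow, bin_width=60.0):
    """Sum distances K into angle bins of `bin_width` degrees (6 bins for 60)."""
    n_bins = int(360.0 / bin_width)
    hist = [0.0] * n_bins
    for dx, dy in flow:
        k, l = distance_and_angle(dx, dy)
        hist[int(l // bin_width) % n_bins] += k
    return hist


# Toy flow vectors (dx, dy) for four pixels between two consecutive frames.
flow = [(3.0, 4.0), (0.0, 2.0), (-1.0, 0.0), (0.0, -5.0)]
hist = oriented_histogram(flow)
```

The resulting six sums form one feature vector per frame, which matches the single feature type passed to the classifier.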

D. Kalman Filter
A Kalman filter is used to provide the best estimate of states in the presence of noise. The proposed method uses a Kalman filter to optimally estimate distances and angles for a higher accuracy rate. During the calculation of distances and angles, if any frame is lost due to noise or illumination variation, the Kalman filter is used to process that frame by eliminating noise. Although a median filter is applied during the preprocessing step, some frames can still contain noise, which may cause deviation in performance.
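
A minimal scalar Kalman filter sketch shows how a lost frame can be bridged. It assumes a constant-value state model applied to a single distance (or angle) sequence, with `None` marking a lost frame; the noise variances and the function name `kalman_1d` are illustrative assumptions, since the paper does not specify the state model or its parameters.

```python
def kalman_1d(measurements, process_var=1e-2, meas_var=1.0):
    """Smooth a sequence with a scalar Kalman filter; None marks a lost frame."""
    estimate, error = None, 1.0
    out = []
    for z in measurements:
        if estimate is None:
            # Initialise from the first available measurement.
            if z is not None:
                estimate = z
            out.append(estimate)
            continue
        # Predict: the state is assumed constant, so only uncertainty grows.
        error += process_var
        if z is not None:
            # Update: blend prediction and measurement using the Kalman gain.
            gain = error / (error + meas_var)
            estimate += gain * (z - estimate)
            error *= (1.0 - gain)
        # For a lost frame the prediction alone is used as the estimate.
        out.append(estimate)
    return out


# Toy distance sequence: frame 2 is lost and frame 4 is a noisy outlier.
distances = [5.0, 5.2, None, 4.9, 9.0, 5.1]
smoothed = kalman_1d(distances)
```

For the lost frame the filter carries the previous estimate forward, and the outlier is pulled towards the running estimate instead of being taken at face value.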

E. Classification
Finally, the decision for violence is determined using a random forest classifier from a single feature vector by generating a set of probabilities for each class. The probabilities are estimated as the mean predicted class probabilities of the trees in the forest, where the class probability of a single tree is the fraction of samples of the same class in a leaf of the tree. The class with the highest probability is assigned to the frame as the "decision". The ratio of the highest probability to the second highest probability is referred to as the "confidence" of the decision. In this regard, the proposed method uses an adaptive threshold given by (5) [32,33]. Any decision with confidence above the threshold is considered "violence" and all others are "non-violence".

T= (5)
Here, $T$ denotes the threshold value, which depends on the total number of frames and the mean value of pictorial intensities in a video frame.
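
The decision and confidence rule can be sketched as follows, assuming that each tree's class probabilities are already available as a dictionary. The threshold is passed in as a plain number because the adaptive threshold formula is not reproduced here; the value 2.0 in the example is purely hypothetical. Treating a frame as violent only when "violence" is the top class and its confidence exceeds the threshold is one interpretation of the rule, and the function name `forest_decision` is illustrative.

```python
def forest_decision(tree_probs, threshold):
    """Average per-tree class probabilities and apply the confidence rule.

    tree_probs: list of dicts mapping class name to probability (one per tree).
    Returns (label, confidence), where confidence is the ratio of the highest
    to the second highest mean class probability.
    """
    classes = list(tree_probs[0])
    n_trees = len(tree_probs)
    mean_probs = {c: sum(t[c] for t in tree_probs) / n_trees for c in classes}
    ranked = sorted(mean_probs, key=mean_probs.get, reverse=True)
    p_best, p_second = mean_probs[ranked[0]], mean_probs[ranked[1]]
    confidence = float("inf") if p_second == 0 else p_best / p_second
    # Assumed interpretation: violent only if "violence" wins confidently.
    is_violent = ranked[0] == "violence" and confidence > threshold
    return ("violence" if is_violent else "non_violence"), confidence


trees = [
    {"violence": 0.9, "non_violence": 0.1},
    {"violence": 0.7, "non_violence": 0.3},
    {"violence": 0.8, "non_violence": 0.2},
]
label, confidence = forest_decision(trees, threshold=2.0)  # hypothetical T
weak_label, weak_conf = forest_decision(
    [{"violence": 0.55, "non_violence": 0.45}], threshold=2.0)
```

In the first case the mean probabilities are about 0.8 and 0.2, so the confidence is about 4 and the frame is labelled as violence; in the second case the confidence is close to 1.2, below the hypothetical threshold, so the frame is labelled as non-violence.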

B. Datasets
The proposed method is validated using the Hockey Fight and Movies datasets [24,25,38,26,27]. A total of 1000 videos are taken from the Hockey Fight dataset, of which 500 contain violent sequences and the remaining 500 contain non-violent sequences. Besides, a total of 200 videos are taken from the Movies dataset, of which 100 contain violent sequences and the remaining 100 contain non-violent sequences. The whole datasets are divided into five sets for five-fold cross validation. For both the Hockey Fight and Movies datasets, the frame resolution is fixed to 300x250.
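
The five-way split can be sketched as below for the Hockey Fight counts (500 violent and 500 non-violent videos). Keeping the class balance equal in every fold (a stratified split) and the fixed random seed are assumptions of this sketch; the paper only states that the data are divided into five sets.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# 1 = violent, 0 = non-violent, matching the Hockey Fight video counts.
labels = [1] * 500 + [0] * 500
folds = stratified_folds(labels)
```

Each fold then holds 200 videos (100 violent, 100 non-violent); in each round one fold is held out for testing and the other four are used for training.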

C. Experimental Results
The proposed method achieved an accuracy rate of 99.12% on the Movies dataset and 94.82% on the Hockey Fight dataset, as shown in Table I.

D. Comparison with Previous Research Results
Research in [20] received an accuracy rate of 63% using a 3D-based convolutional neural network (C3D) and 61% using a CNN with long short-term memory (CNN-LSTM) at a frame rate of 25 fps. The additional step of designing a shallow neural network to combine the feature maps obtained from C3D and CNN-LSTM can increase the computational complexity of the overall methodology. Research in [21] received an accuracy rate of 94.5% by passing spatio-temporal features to a fully connected neural network to classify the video as a violent or non-violent action. Although they extracted features over both space and time dimensions to create spatio-temporal features through a custom-built convolutional neural network, estimation of processing time per frame was ignored in their research. Research in [22] received an accuracy rate of 97.6% by designing a DT-SVM (Decision Tree-SVM) classifier using prior determined features to distinguish physical violence from daily-life activities. However, in case of complex scenes with both nearby and distant objects, their research could not provide satisfactory validation results. Research in [23] received an accuracy rate of 90.5% at a frame rate of 30 fps by modeling crowd dynamics through encoding changes in crowd texture with temporal summaries of grey level co-occurrence matrix (GLCM) features. However, their research requires an adaptive method for choosing optimal parameters based on the given data. Research in [24] tested their proposed method in non-crowded and crowded scenarios to verify the effectiveness of Local Binary Pattern (LBP) features. They received an accuracy rate of 89.1% with an error rate of 10.9%. However, they did not validate their method in terms of computation time and frame rate. Research in [25] received an approximate accuracy rate of 90% using the popular bag-of-words approach, which can accurately recognize fight sequences.
However, the computational cost of extracting features in their research is not encouraging for practical applications. Research in [27] received an accuracy of 88% using statistical motion orientation information. However, they received an error rate of 12%, and their proposed Oriented VIolent Flows (OViF) could detect violence in crowded scenarios only, which demands more robust validation. Research in [28] received an accuracy rate of 72% using a histogram of the computed optical flow energy values. However, they received a high error rate of 28% due to the inability of their method to perform under discontinuous and fast motion. Research in [26] received an accuracy of 98.9% using extreme acceleration patterns as the main feature, with a required processing time of 0.0419 second. However, their method required further refinement by approximating the Radon transform, which is the most time-consuming stage. Among all the previous methods stated above, the method proposed in this research received a higher accuracy rate of 99.12%, with a lower computation time per frame of 0.0010 second and a low error rate of 0.88% at a frame rate of 35 fps, as shown in Table II, Table III, Table IV and Table V.
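
The relation between the reported figures can be checked with a short calculation. The sketch assumes that the error rate is simply 100% minus the accuracy rate and that real-time operation at a given frame rate leaves 1/fps seconds per frame; both helper names are illustrative.

```python
def error_rate(accuracy_percent):
    """Error rate in percent, assuming error = 100 - accuracy."""
    return round(100.0 - accuracy_percent, 2)


def frame_budget(fps):
    """Time available per frame, in seconds, at a given frame rate."""
    return 1.0 / fps


proposed_error = error_rate(99.12)   # proposed method
amv_error = error_rate(98.9)         # research in [26]
budget_35fps = frame_budget(35)      # seconds available per frame at 35 fps
time_ratio = 0.0419 / 0.0010         # per-frame time of [26] vs. proposed
```

Under these assumptions the reported 0.0010 second per frame is well inside the roughly 0.0286 second available at 35 fps, and is about 42 times shorter than the 0.0419 second reported for research in [26].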

E. Analysis and Discussion
Research in [20] utilized a 3D-based convolutional neural network (C3D) and a CNN with long short-term memory (CNN-LSTM). For C3D and CNN-LSTM they achieved accuracy rates of 63% and 61% respectively at a frame rate of 25 fps. Although their accuracy was not promising, the use of two deep neural networks (DNNs) was robust in learning high-level spatial-temporal information from raw image data. They combined the feature maps obtained from the C3D and CNN-LSTM networks by designing a shallow neural network, which acted as a third scenario in their research and demands further validation of their overall research in terms of computational time. The method proposed in this research has an approximate computation time of 0.0010 sec per frame along with an accuracy rate of 99.12%, using moment features instead of combining other feature measurements, which indicates better validation performance than research in [20]. Research in [21] received an accuracy rate of 94.5% by extracting a set of selectively distributed frames from the video clip and passing spatio-temporal features to a fully connected neural architecture in order to classify the video as a violent or non-violent action. However, although a custom-built convolutional neural network and a long short-term memory (LSTM) recurrent neural network were used to process spatio-temporal features over space and time dimensions for feature extraction, validation against computation time or processing time was totally ignored in their research. In this context, the method proposed in this research uses moment features to extract motion characteristics for classifying violent actions and estimates the processing time per frame to validate the overall methodology. Research in [22] received an accuracy rate of 97.6% by designing a DT-SVM (Decision Tree-SVM) two-layer classifier, i.e. the first layer was a decision tree to exploit prior determined features and the second layer was an SVM classifier that used the features for classification.
However, frames with significant variations in light and shadow caused misclassification in their research. In addition, their research requires further investigation for nearby and distant objects in complex scenarios. The method proposed in this research is validated using sufficient datasets compared with research in [22]. In addition, it uses a random forest classifier on a single feature type to avoid the complication of using multiple feature types, which results in lower processing time per frame compared with previous research methods. Research in [23] received an accuracy rate of 90.5% at a frame rate of 30 fps by introducing a measurement of inter-frame uniformity and demonstrating that violent behavior changes in a less uniform manner. Although their proposed method discriminated between abnormal and normal scenes, choosing optimal parameters based on the given data requires an adaptive method to be integrated with their overall methodology, which will demand further validation. The method proposed in this research uses an adaptive threshold during classification and extracts a single feature type, which removes the need to choose additional parameters to achieve optimal performance as in research [23] and results in better performance in terms of accuracy and frame rate. Research in [24] achieved an accuracy rate of 89.1% using a linear SVM to classify video as violent or non-violent. They received an error rate of 10.9%, which indicates lower performance compared with the method proposed in this research. In their research, Local Binary Pattern (LBP) or Violent Flows (ViF) alone takes less time to calculate than LBP and ViF applied together.
However, the need to apply LBP and ViF together instead of separately did not provide any future direction for improvement in terms of performance. In addition, their proposed method was not validated based on computation time to indicate efficiency in terms of processing duration per frame. In this regard, the method proposed in this research received a higher accuracy rate of 99.12% and an error rate of 0.88% using a random forest classifier on a single feature vector by generating a set of probabilities for each class, which indicates higher efficiency than research in [24]. Research in [25] achieved an accuracy rate of 90% by constructing a versatile and accurate fight detector using local feature descriptors. Although their accuracy rate was impressive, the computational cost of constructing local feature vectors was impractical, particularly in surveillance and media rating systems. The method proposed in this research received an accuracy rate of 99.12% by using moment features as means of the pixel intensity distribution to limit segmentation tasks and reduce computational complexity. Oriented VIolent Flows (OViF) by research in [27] received an accuracy rate of 88% by taking full advantage of motion magnitude change information in statistical motion orientations. They received an error rate of 12% due to the adoption of feature combination and multi-classifier strategies. In addition, their method was applicable only to crowded scenarios. In this context, the method proposed in this research achieved an accuracy rate of 99.12% with a low error rate of 0.88% by measuring distances and angles in the horizontal and vertical directions on the basis of optical flow, which indicates better performance than research in [27].
Research in [26] received an accuracy rate of 98.9% due to efficient estimation of extreme acceleration patterns as the main features by applying the Radon transform to the power spectrum of consecutive frames, which resulted in a low error rate of 1.1% with a required computation time of 0.0419 second. Although they hypothesized that motion was sufficient for recognition, global motion estimation experiments did not seem to improve results significantly. In addition, their proposed method needs further investigation to estimate the relative importance of motion and appearance information for the recognition of violent or non-violent actions. The method proposed in this research performs motion estimation using the two-frame differential approach and moment features extraction, achieving an accuracy rate of 99.12% and an error rate of 0.88% at 35 fps with a computation time of 0.0010 second per frame, which indicates better performance than research in [26], as shown in Fig. 3, Fig. 4, Fig. 5 and Fig. 6.
Research in [28] received an accuracy rate of 72% using Shi-Tomasi corner detection and a histogram of the computed optical flow energy values. They received an error rate of 28% with a computation time of 0.025 second due to the use of corner features. Although the use of corner features increases computation time in most computer vision research, one of their significant achievements was that their method reduced the time cost.
However, their method was not robust in case of discontinuous and fast motion. The method proposed in this research showed better performance than research in [28] by using moment features as the basis of uncertainty measurement of pictorial intensities to exploit the motion pattern, after which a random forest classifier was used to classify violent and non-violent behavior.
[Figures: comparison of the proposed method with C3D [20], DNN [21], DT-SVM [22], GLCM [23], Local Binary Pattern (LBP) [24], Bag-of-Words framework [25], Acceleration Measure Vectors (AMV) [26], Oriented VIolent Flows (OViF) [27] and optical flow energy characteristics [28].]

V. CONCLUSION
The proposed method used motion patterns by extracting moment features for uncertainty measurement of the pictorial intensity distribution based on efficient scene interpretation. In case of a similarly colored object in the background, moment properties provide a particular weighted average of pictorial intensities, which played a vital role in minimizing background complexity. After that, linear distances and angles are calculated for each frame on the basis of optical flow, followed by a Kalman filter to rectify frame loss due to noise or illumination variation, which plays a significant role in optimally estimating distances and angles for a higher accuracy rate. Finally, the proposed method used a random forest classifier on a single feature type to avoid the complication of using multiple feature types, which results in lower processing time compared with previous research methods. Experimental results for the proposed method reveal higher efficiency compared with previous research results in terms of accuracy rate, computation time and frame rate. The proposed method achieved a maximum accuracy of 99.12% at a frame rate of 35 fps, where the required computation time per frame was 0.0010 sec. The performance of the proposed method reveals its potential to provide significant capability to surveillance applications for monitoring violence efficiently and reducing the impact of violence-related injuries. In future, this research intends to address more complex activities, i.e. recognition of violence for distant objects and improvement of recognition performance for the misclassified samples.
[Figures: computation time (second) of the proposed method compared with optical flow energy characteristics [28]; frame rate of the proposed method compared with GLCM [23].]