Real Time Analysis of Crowd Behaviour for Automatic and Accurate Surveillance

Surveillance in this modern era is a necessity. Creating an alert in case of emergencies and disturbances is of great importance. As the number of simultaneous camera feeds increases, the burden on the human supervisor also increases. The proposed system is a way to aid the human supervisor in the surveillance job. Creating alerts in real time helps in responding quickly to crucial situations. With this in mind, we propose the following: (1) generation of ViF (Violent Flow Descriptors) as high-level features in real time; (2) using the generated ViFs of a video dataset for training a neural net and testing its accuracy; (3) developing a system that can detect the signs of disturbance among a crowd in real time and can learn from the decisions it makes.

Keywords—Real time surveillance; violent flow descriptors; neural network


I. INTRODUCTION
The cost of surveillance equipment in this digital era is minimal. In the interest of public safety, CCTV cameras are installed in crowded and densely populated areas. The footage from CCTV cameras is continuously monitored by humans in order to respond in case of emergencies. It is a routine and tedious job for a human to continuously pay attention to multiple screens. Surveillance by humans is inefficient, as it is limited by human capacity and is not error free. If computers, rather than humans, perform surveillance and generate alerts, they can help humans respond quickly to those alerts. Considering the amount of video footage generated simultaneously, we need a solution that can handle input at this scale.
The research done until now focuses on increasing accuracy but makes a significant trade-off with speed. We focus here on a scalable and efficient algorithm that delivers both accuracy and alerts in real time. The ViF approach [1] has been experimented with previously; generating ViFs requires a detailed process, described below. The videos under consideration have a standard aspect ratio of 3:4 and are of very low quality. As crowd behavior is completely random, detecting outbreaks of violence in a crowd is a real challenge. Also, since the content is assumed to originate from a CCTV camera, no other source of information such as subtitles or audio can be used. Continuous surveillance is of great importance, yet it has received comparatively little attention. In this proposed system, we implement an algorithm that accurately detects violence in real time. Through this algorithm we aim to obtain safer surroundings and a quick response time to violent incidents.

II. BACKGROUND

A. Optical Flow
Optical flow is the core part of violence detection. Optical flow is the relative motion between two image frames, taken at times t and t+∆t, computed at every pixel position. Methods for determining optical flow include phase correlation [8], block-based methods [9], differential methods, and discrete optimization methods. The most commonly used are the Lucas-Kanade and Horn-Schunck optical flow methods [6], which are differential methods based on solving first-order derivatives. We used C. Liu's optical flow algorithm [2] for our task, from which we further obtain the flow vector magnitude. If $V_{x,t}$ and $V_{y,t}$ are the velocities of a pixel along the x and y axes obtained through the optical flow algorithm, then the flow vector magnitude is $m_t = \sqrt{V_{x,t}^2 + V_{y,t}^2}$. C. Liu's optical flow algorithm was originally written in C, and MEX files were written for compatibility with MATLAB. We used the bob package [3] to access this algorithm from Python.
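As an illustration, a minimal sketch of the magnitude computation in Python might look as follows. Note that OpenCV's Farneback method (another differential approach) stands in here for C. Liu's algorithm, which we access through the bob package; the function name is ours.

```python
import cv2
import numpy as np

# Minimal sketch: per-pixel flow vector magnitude between two gray frames.
# OpenCV's Farneback flow substitutes for C. Liu's algorithm purely to
# illustrate the magnitude computation m_t = sqrt(Vx^2 + Vy^2).
def flow_magnitude(prev_gray, next_gray):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vx, vy = flow[..., 0], flow[..., 1]   # velocities along x and y
    return np.sqrt(vx ** 2 + vy ** 2)
```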

B. Violent Flow Descriptors
ViF (Violent Flow Descriptors) [1] have been used previously to obtain global-level features of a video. After obtaining the flow vector magnitude m for each frame, we calculate a binary vector. This binary vector is calculated for each pixel and reflects the change in magnitude: following [1], a pixel is set to 1 when the change in its magnitude between consecutive frames meets or exceeds a threshold (the mean magnitude change of that frame), and to 0 otherwise.
After obtaining the binary vector for each frame, we add the binary vectors obtained for all the frames and normalize by the number of frames under consideration, which yields a mean map b̄. This b̄ is divided into M×N non-overlapping cells, and the magnitude changes of each cell are collected separately. These magnitude changes are then represented by a fixed-size histogram. The M×N histograms are concatenated to obtain a single descriptor vector, which is the ViF.
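A minimal sketch of the binary map and its temporal average, following our reading of [1] (the per-frame mean change is an assumed threshold choice):

```python
import numpy as np

# Sketch of the binary change map and its temporal average b_bar: a pixel
# is 1 where the frame-to-frame change in flow magnitude meets or exceeds
# a threshold (here, the mean change of that frame).
def mean_binary_map(magnitudes):          # list of HxW magnitude arrays m_t
    maps = []
    for prev_m, cur_m in zip(magnitudes, magnitudes[1:]):
        change = np.abs(cur_m - prev_m)
        maps.append((change >= change.mean()).astype(np.float32))
    return np.mean(maps, axis=0)          # b_bar, values in [0, 1]
```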

III. METHODOLOGY
Performing real-time automatic surveillance on CCTV footage has many challenges and limitations. We assume a fixed camera kept away from the area under surveillance, at a standard distance from the area to be monitored. The main challenge is to keep the processing real-time, which means all the processing for a frame has to be done in less than 1/25th of a second. The system should also be able to handle multiple video sources at a time; this is achieved by continuously accepting frames through multiprocessing and threads. First we calculate the optical flow, which is the most time-consuming of all the steps. Then we use the calculated optical flow to obtain the flow vector magnitude. Finally we generate the ViFs, which are further used for training and classification.
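A minimal sketch of the multi-source ingestion, assuming one reader thread per camera feeding a shared queue; the source URL and function names are illustrative:

```python
import threading
import queue
import cv2

# One reader thread per camera pushes frames into a queue; downstream
# workers drain the queue in one-second batches for ViF extraction.
def read_frames(source_url, frame_queue):
    cap = cv2.VideoCapture(source_url)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        frame_queue.put(frame)
    cap.release()

frame_queue = queue.Queue(maxsize=256)
reader = threading.Thread(target=read_frames,
                          args=("rtsp://camera-1/stream", frame_queue),
                          daemon=True)
reader.start()
```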

A. Algorithm
The main part of building the system is a well-designed algorithm. First the video is preprocessed; then we calculate the optical flow, which is the most time-consuming step. Feature extraction is done next, and the extracted features are used for training a multilayer perceptron [5]. To make the system scalable to multiple video sources, we employ threading and multiprocessing. The resulting algorithm is robust and can handle faulty video sources.

B. Global Feature Extraction
1) Video Preprocessing: Video coming from the source is preprocessed. The video aspect ratio is taken as 3:4, and the surveillance footage is considered to be standard definition (scale = 240:320). Incoming frames are resized to 75:100 and then converted to gray-scale, so the length and breadth of the frames are reduced to roughly one-third of their original size. The OpenCV [7] package is used for video processing.
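A minimal sketch of this preprocessing step:

```python
import cv2

# Downscale a 240x320 frame to 75x100 and convert to gray-scale.
# Note that cv2.resize takes (width, height), hence (100, 75).
def preprocess(frame):
    small = cv2.resize(frame, (100, 75))
    return cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
```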
2) Optical Flow: C. Liu's [2] optical flow algorithm is used to calculate the optical flow. This algorithm was chosen because it is highly efficient and robust. It returns three values: v_x (velocity along the x-axis), v_y (velocity along the y-axis), and w (the warped image). The velocity fields have the same width and height as the resized frame.
3) Violent Flow Descriptors: The Violent Flow Descriptor [1] (global features) algorithm is used; it has already been implemented previously in MATLAB. To increase the scalability of the algorithm, the system is implemented with basic multiprocessing and threads. The Violent Flow Descriptors take the flow vector magnitudes as input. Histograms of the normalized binary map for the whole set of frames under consideration are computed to obtain a single feature vector. The ViFs obtained are then used for training and classification. If the number of bins is fixed at 21 (0.0 to 1.0, at intervals of 0.05) and both M and N are 4, then for a standard-definition video of scale 240:320, exactly 336 features are obtained (21 × 4 × 4). The values of these 336 features range from 0.0 to 1.0.
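A sketch of the descriptor assembly under these parameters; the per-cell normalization is an assumption on our part:

```python
import numpy as np

# Split the mean map b_bar into 4x4 non-overlapping cells and histogram
# each cell into 21 bins over [0, 1], giving 21 * 4 * 4 = 336 features.
def vif_descriptor(b_bar, M=4, N=4, bins=21):
    hists = []
    for row in np.array_split(b_bar, M, axis=0):
        for cell in np.array_split(row, N, axis=1):
            h, _ = np.histogram(cell, bins=bins, range=(0.0, 1.0))
            hists.append(h / max(h.sum(), 1))   # per-cell normalization (assumed)
    return np.concatenate(hists)                # shape (336,)
```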

C. Neural Network
1) Structure: Once Violent Flow Descriptors are generated for the given dataset, a neural net is trained on these features. We built a four-layered neural net: one input layer, two dense hidden layers, and one output layer. The input layer accepts 350 inputs and gives 336 outputs. The middle dense layers accept 336 inputs and give 336 outputs. The output layer accepts 336 inputs and gives one output. For the input and dense layers the ReLU (Rectified Linear Unit) activation function is used, and for the output layer the sigmoid function is used.
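A minimal sketch of this structure in Keras. The text lists 350 units at the input layer, but since each ViF is 336-dimensional, 336 is used here as an assumption:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Four-layer network as described: input, two dense hidden layers, output.
model = Sequential([
    Dense(336, activation='relu', input_shape=(336,)),  # input layer (336 assumed; text says 350)
    Dense(336, activation='relu'),                      # dense layer 1
    Dense(336, activation='relu'),                      # dense layer 2
    Dense(1, activation='sigmoid'),                     # output layer
])
```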
2) Training: Keras [10] and TensorFlow are used for building the neural net. Initially the data needs to be formatted so that it fits the input layer of the neural net. The Violent Flow features generated for each video are numpy arrays of dimensions 336×1. Each array is reshaped to 1×336 so that the features of a single video fit into the input layer as one data tuple. The reshaped feature arrays of all videos are then concatenated into a single array, so the final input to the neural net has dimensions 246×336, as there are 246 videos (violent and non-violent) in the dataset. The known outputs (0 for non-violent, 1 for violent) of each video are stored in an array of dimensions 246×1 to train the neural net and to calculate accuracy. Once the data is ready, the neural net can be built with the structure mentioned above; using Keras, a Sequential neural net with dense layers can be constructed.
The model is compiled with a logarithmic loss function, which evaluates a set of weights, and an optimizer, which sets the learning rate. Keras provides a logarithmic loss for binary classification problems, binary-crossentropy, and the Adam optimizer, an efficient optimizer of choice. The number of epochs for which training is carried out, the batch size (the number of instances evaluated before performing weight updates), and the input data for training the model are provided as parameters. The trained model is stored in a file in HDF5 format using the h5py Python package.
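Continuing the structure sketch above, a minimal compile/train/save sequence might look as follows; the epoch count and batch size are illustrative, as the paper does not state them, and X and y are placeholders for the real arrays:

```python
import numpy as np

# Placeholders standing in for the real 246x336 ViF array and 246x1 labels.
X = np.random.rand(246, 336).astype(np.float32)
y = np.random.randint(0, 2, size=(246, 1)).astype(np.float32)

# Binary-crossentropy loss with the Adam optimizer, as described above.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=100, batch_size=10)   # epoch/batch values are illustrative
model.save('vif_model.h5')                   # HDF5 file, written via h5py
```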
3) Violence Detection: This phase involves the detection of disturbance or violence in live crowd surveillance videos in real time. The input surveillance video is preprocessed and Violent Flow Descriptors are generated dynamically in real time. For each second of video, features are extracted and given as input to the trained model for classification. If some disturbance or violence is detected, an alert is reported, stating that violence has occurred along with its time, within a second of the occurrence.
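Tying the earlier sketches together, a per-second detection loop might look as follows; `preprocess`, `flow_magnitude`, `mean_binary_map`, and `vif_descriptor` are the helpers sketched above, and the alert threshold is an assumed value:

```python
# Classify each one-second window of frames from a live source.
def monitor(frames, model, fps=25, threshold=0.5):
    window = []
    for i, frame in enumerate(frames):
        window.append(preprocess(frame))
        if len(window) == fps:                 # one second of footage
            mags = [flow_magnitude(a, b) for a, b in zip(window, window[1:])]
            vif = vif_descriptor(mean_binary_map(mags))
            score = float(model.predict(vif.reshape(1, -1), verbose=0)[0, 0])
            if score > threshold:
                print(f"ALERT: violence detected at second {i // fps}")
            window = []
```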

4) Feedback:
In the violence detection phase, as real-time surveillance takes place, the features generated for every second are tested against the trained model for classification and violence detection. These features, together with their actual output (provided by a human), are given as feedback to the model. This allows continuous training of the neural net model, which helps increase the accuracy of classification and speeds up the detection of violence.
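A minimal sketch of this feedback step, assuming a human-confirmed label per alert:

```python
import numpy as np

# After a human supervisor confirms or corrects an alert, the labelled
# descriptor is used for one further training update on the live model.
def feedback(model, vif, human_label):
    model.train_on_batch(vif.reshape(1, -1), np.array([[float(human_label)]]))
```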

D. Extraction of Interesting Features
AdaBoost is an ensemble of weak classifiers, and it can tell us which set of features is most helpful for classification. For this, the feature selection algorithm based on AdaBoost [4] is used. The weak classifiers used here are decision stumps (decision trees of height 1). Once the features are arranged in increasing order of their error rates, we can identify, among the 336 features in total, those which are most effective at classifying videos.
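A sketch of this feature ranking using scikit-learn, which is an assumed substitute for the implementation in [4]; X and y are the feature and label arrays from the training section:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost over decision stumps: scikit-learn's default weak learner is a
# depth-1 decision tree, i.e. a stump. Features are then ranked by the
# importance the ensemble assigns them.
ada = AdaBoostClassifier(n_estimators=100)
ada.fit(X, y.ravel())
ranked = np.argsort(ada.feature_importances_)[::-1]   # most useful first
print("Top 10 features by importance:", ranked[:10])
```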

IV. IMPLEMENTATION
The subsections below provide the implementation details and an analysis of the outputs. A detailed analysis of the system is given in the next section.

A. Continuous Surveillance
We used two configurations for measuring the processing speed of the system. The first configuration consists of 4 GB RAM and an Intel Centrino processor running Ubuntu. The second consists of 8 GB RAM and an Intel i5 processor running Debian.
For a video which is not initially violent but later becomes violent, the proposed system is able to detect the exact instant where the frames go from non-violent to violent. For a real-time CCTV feed, the system detects the violence and raises an alert within a second of its occurrence. Fig. 1 and Fig. 2 show the running time of the system; the total length of the video under consideration is nearly 200 seconds. With the Intel Centrino processor (Fig. 1), the processing completes in nearly 180 seconds (20 seconds faster than the run time of the video). With the Intel i5 processor (Fig. 2), the entire video is processed in nearly 140 seconds (one minute faster than the run time of the video). For each detected violence outbreak, the corresponding detection time of the system is shown.

B. Accuracy
The accuracy obtained with ViFs [1] as global features and a linear SVM is 81.30% for the existing system; the proposed system has an accuracy of nearly 85%. The bar graph in Fig. 3 shows the results of N-fold cross-validation with N = 7. There are 5 runs in total (one execution of the N folds per run), and each run uses 7 heaps (folds).
Each heap contains an equal number of videos, with violent and non-violent videos distributed evenly and placed randomly among the heaps. This gives an idea of how robust the proposed system is: the minimum accuracy obtained for any heap in any run is greater than 70%. Fig. 4 shows a line graph of the same data as Fig. 3, giving a clearer picture of the accuracy of each set in its corresponding run; each run is assigned a different color. From this we can clearly identify the minimum and maximum accuracy: the maximum obtained accuracy is nearly 96% and the minimum is 73%. The dataset, which contains 246 videos, is split in the ratio 70:30; 70% of the data is used to train the neural net and 30% to test the accuracy of the generated model. The output in Fig. 5 shows that the accuracy obtained is 90.27%.
As the confusion matrix in Table I shows, the number of false negatives is just 2; that is, there are only 2 cases in the test set which are actually violent but which our system was unable to detect. There are 5 cases in which the videos were not violent but our system detected violence. The result obtained is:

• Accuracy = (TP + TN) / total = 0.9027

The above accuracy tests were done on a dataset [1] containing 246 videos. The shortest video is 1 second long and the longest is 6 seconds. The collection has equal numbers of violent and non-violent videos; this kind of dataset is known as an "in the wild" dataset. The videos in the dataset are of standard CCTV resolution (scale = 240:320) and share the same aspect ratio (3:4).

V. RESULTS AND DISCUSSION
Consider the following scenes obtained from surveillance footage. Fig. 10 shows the output of the system for the video shown in the preceding figures. For the initial frames the output value is very low; as the scene becomes tense (Fig. 7), the output value increases. When the violence starts (Fig. 9), the output value rises to 0.999, indicating violence. Later in the video the violence gradually subsides, and the output value falls to 0.06.

VI. CONCLUSION AND FURTHER WORK
Timely detection of violence in real time is of great importance. The system is able to accurately detect violent scenarios in real time, and it is scalable enough to process three to four parallel video streams at a time on a household PC setup. This work is intended to draw attention to the importance of accurate real-time surveillance.
The accuracy can be further improved by using the results of the AdaBoost feature selection algorithm. As explained above, once the features are arranged in increasing order of their error rates, we obtain an ordering of the 336 features by importance. Following this order, the weights of the input layer of the neural net can be adjusted: highly important features can be given a higher weight at their input node, with the weights decreasing as importance decreases. This may help increase the accuracy of the neural net.

Fig. 3. Bar Plot of Obtained Accuracy Values in N Folds

Fig. 4. Line Plot of Obtained Accuracy Values in N Folds

Fig. 5. Accuracy Obtained by training with 70% of data and testing with 30% of data

Fig. 10. System's output for the above video