Open Challenges for Crowd Density Estimation

Many emergency and surveillance systems are concerned with crowd management. Supervising a crowded area is a major challenge, especially when the size of the crowd is unknown. This issue is the starting point for the field of crowd estimation based on density or counts. Crowd density is an important topic in many applications such as surveillance, security, biology, and traffic. In this paper, we present an in-depth review of the approaches and techniques used in previous works to estimate the size of a crowd, and we describe the datasets they use. A comparison of related works, based on the strengths and weaknesses of each approach, highlights the key research issues in crowd estimation.


I. INTRODUCTION
Surveillance systems are widely used in daily life. They are mounted in banks [1], on roads [2], in malls [3], etc., and their use differs from one application to another. This paper focuses on the use of surveillance systems to estimate the size of a crowd.
Crowd density estimation aims to understand the behavior of a crowded scene and to analyze it for better security, management, and safety. All such systems rely on computer vision techniques. Some operate on a single image, but most process frames from a video stream. Crowd analysis is associated with multiple research disciplines such as computer vision, public safety, biology [4], physics [5], and psychology [6].
Crowd density estimation can be classified into five research topics.

1) Disaster management: Systems based on crowd estimation aim to supervise crowd behavior to avoid disasters in situations such as music concerts, sports events, political rallies, and public demonstrations.
2) Safety monitoring: Crowd analysis helps in understanding behavior, congestion, anomalies, and events [7]. These analyses are applied in many video surveillance settings such as shopping malls, airports, and sports events.
3) Design of public spaces: Crowd analysis helps optimize the design of public spaces to ensure more safety in crowded situations.
4) Intelligence gathering and analysis: Crowd analysis is used to identify products and places that attract interest, and it adds intelligence to queuing systems. Such analysis therefore improves knowledge of the system and supports improvement or optimization strategies.
5) Forensic search: Crowd analysis locates particular information in a crowded scene, such as suspicious behavior or suspects [7], [8].
These topics have encouraged researchers from different specialties to improve crowd estimation through various methods and related tasks such as density estimation [9], counting [9], [10], [11], tracking [11], and behavior analysis [12]. All these tasks can be performed on a crowded scene and used in different applications. The challenge increases when the scene is very dense. Previous studies use a variety of techniques such as regression [13], clustering [14], and detection [15] to count or estimate crowds. These approaches require standard datasets to evaluate the performance of crowd density analysis. This paper is organized as follows: the datasets used by researchers to evaluate their approaches are described in Section II. The evaluation metrics are presented in Section III. A review of crowd density estimation approaches is presented in Section IV. A comparison between previous approaches is performed in Section V. Finally, a conclusion based on open challenges is presented in the last section.

II. DATASETS
In vision processing systems, datasets are essential for evaluating a proposed design. This section lists the datasets used in previous works to assess crowd estimation approaches. Some works build their own dataset, but most studies use standard, widely available datasets. This section focuses on the standard datasets used in the field.
• WorldExpo'10 dataset [16]: This dataset is characterized by its size. It is composed of a large number of scenes captured for crowd counting. Its number of video clips (1132), number of scenes (108), and resolution (576*720) are larger than those of other datasets. The videos are captured by more than one hundred cameras.
• ShanghaiTech dataset [19]: This dataset covers large-scale crowds. It contains 1198 images with a total of 330165 labeled pedestrians. The images form two groups: Part A is collected randomly from images found on the Internet, and Part B is composed of images captured in Shanghai streets.
• Make3D [18]: The resolution of this dataset is 2272*1704. It is used to learn features and to estimate scene depth from a single frame. The Make3D dataset provides more than 1000 outdoor and indoor scenes.
In light of this brief review, we note that each dataset suits a particular use case. Table I summarizes the characteristics of the most used datasets in the literature.

III. EVALUATION METRICS
This section discusses the metrics used to evaluate crowd estimation systems. Four metrics appear in the literature. The two most commonly used are:
• The Mean Absolute Error (MAE) [20] is the average of all absolute errors. The Absolute Error (AE) is the difference between the measured value ($x_i$) and the true value ($x$) over $n$ frames:
$$AE_i = |x_i - x|$$
The MAE is the average of all absolute errors:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |x_i - x|$$
• The Mean Squared Error (MSE) [20] is the average of the squared errors, that is, of the squared distances between the regression line and the values. The regression line is the line that best fits the measured data. The smaller the MSE, the higher the accuracy:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (x_i - x)^2$$
Other studies use the following metrics to evaluate crowd estimation systems:
• The Mean Windowed Relative Absolute Error (MWRAE) [21] is the average of the relative errors between the real counts and the estimated counts. This metric is defined by the following formula:
$$MWRAE = \frac{1}{n} \sum_{i=1}^{n} \frac{|C_i - \tilde{C}_i|}{C_i} \times 100\%$$
where $C_i$ is the real count of the crowd in the $i$th test video stream, $\tilde{C}_i$ is the estimated count of the crowd in the $i$th test video stream, and $n$ is the total number of test streams.
• The Root MSE (RMSE) [17] is the square root of the MSE and corresponds to the standard deviation of the errors. It is defined by the following formula:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - x)^2}$$
where $x_i$ is the measured value and $x$ is the true value over $n$ frames.
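As an illustration, the four metrics of this section can be computed with a few lines of Python. This is a minimal sketch, not code from the cited works: the function names and the sample counts are hypothetical, each frame is assumed to have its own true value, and the MWRAE is assumed to be expressed as a percentage of the real count.

```python
import math


def mae(measured, true):
    """Mean Absolute Error: average of |measured - true| over all frames."""
    return sum(abs(m - t) for m, t in zip(measured, true)) / len(measured)


def mse(measured, true):
    """Mean Squared Error: average of the squared errors over all frames."""
    return sum((m - t) ** 2 for m, t in zip(measured, true)) / len(measured)


def rmse(measured, true):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(measured, true))


def mwrae(estimated, real):
    """Mean relative absolute error, in percent (assumed form of the MWRAE).

    Every real count must be non-zero, since it is used as the denominator.
    """
    return 100.0 * sum(abs(c - e) / c for e, c in zip(estimated, real)) / len(real)


# Hypothetical per-frame counts, for illustration only.
estimated_counts = [98, 205, 310]
true_counts = [100, 200, 300]

print(mae(estimated_counts, true_counts))
print(mse(estimated_counts, true_counts))
print(rmse(estimated_counts, true_counts))
print(mwrae(estimated_counts, true_counts))
```

Because the MSE and the RMSE square the errors, a single badly estimated frame weighs more on them than on the MAE, which is why works often report both kinds of metric.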

IV. REVIEW OF CROWD DENSITY ESTIMATION METHODS
This study is based on an in-depth search of the Web of Science database since December 2017. The main search keyword is 'Crowd density estimation', which describes the scope of this paper. During collection, we kept only papers written in English that deal with the density or count estimation of a crowd.
During the search, we used combinations of the following terms: "Crowd", "Density Estimation", and "Crowd Count" to find papers related to the scope. Logical operators are used between keywords. Only journal and conference articles are accepted. The paper selection steps can be summarized as follows:
First step: Filter papers according to the title and the abstract.
Second step: Eliminate duplicate papers that use the same methods on the same dataset.
Third step: Keep the papers that meet the necessary criteria: articles in English, known authors, reviewed paper, developing paper, and discussed paper.
In the literature, a crowded scene is characterized either by its level of density or by the number of people counted.
www.ijacsa.thesai.org
S. Lin et al. [22] propose an intelligent algorithm based on an SVM classifier to detect heads. The frame first goes through a preprocessing phase to reduce noise. Features are then extracted using a Haar wavelet transform followed by a normalization step. Finally, the matching phase is performed by the SVM classifier, which decides whether the extracted features belong to the head class or not. The estimation accuracy is between 90% and 95% (for about 125 persons per image). The experiments show that the camera has to be set to the optimal angle (72.5 degrees), defined between the camera sensor position and the plane of the crowd. The method assumes a single size for all human heads and a uniform distribution of the crowd over the horizontal plane.
JH. Yin et al. [23] evaluate five methods to estimate the size of the crowd. The first method removes the background by subtracting a reference image and computes the occupied surface. The error associated with this method is about 15%. The second method computes the total perimeter of the occupied area by applying an edge detection algorithm. Its accuracy decreases as the number of people increases, and the error is about 23%. The third method combines the first two methods to improve estimation accuracy, which reduces the crowd density estimation error to 8%. These three methods still suffer from the near-far effect: persons who are near the camera occupy more area than distant persons. The authors therefore propose a fourth method based on geometric distortion to compensate for the near-far effect. The fifth method attempts to detect movement without identifying objects in the video stream. It relies on optical flow, which is derived from the change in brightness from one image to the next, and the motion is measured with Horn's optical flow algorithm. These methods do not take real-time execution constraints into account.
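The first method above (background removal followed by a measure of the occupied surface) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of [23]: images are plain lists of gray levels, the threshold value is arbitrary, and the linear conversion from occupied area to a person count (`pixels_per_person`) is a hypothetical calibration that ignores the near-far effect.

```python
def foreground_ratio(frame, reference, threshold=30):
    """Fraction of pixels that differ from the reference background image.

    A pixel is counted as occupied when its gray level differs from the
    reference by more than `threshold` (an arbitrary, assumed value).
    """
    total = 0
    changed = 0
    for frame_row, ref_row in zip(frame, reference):
        for pixel, ref_pixel in zip(frame_row, ref_row):
            total += 1
            if abs(pixel - ref_pixel) > threshold:
                changed += 1
    return changed / total


def estimate_count(ratio, image_pixels, pixels_per_person):
    """Convert the occupied area to a count with a hypothetical linear calibration."""
    return ratio * image_pixels / pixels_per_person


# Toy 4x4 example: a uniform background and a 2x2 bright blob in the frame.
reference = [[10] * 4 for _ in range(4)]
frame = [row[:] for row in reference]
frame[1][1] = frame[1][2] = frame[2][1] = frame[2][2] = 200

ratio = foreground_ratio(frame, reference)
count = estimate_count(ratio, image_pixels=16, pixels_per_person=2)
print(ratio, count)
```

A fixed `pixels_per_person` treats near and distant persons alike, which is exactly the limitation that the geometric compensation of the fourth method is meant to address.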
CS. Regazzoni et al. [24] propose an estimation approach that uses the temporal information of an image sequence by means of a distributed Kalman filter network. The approach synchronizes multiple sensors based on modularity and data fusion. The Distributed Extended Kalman Filter (DEKF) algorithm implements both static models and status history. The static models are defined by features of the edge function, such as the number of vertical edges, the number of edge points, and the sum of the amplitudes of the maxima detected in the shape. The status history is defined by the decrease, the increase, and the steady state of the number of people. The algorithms are chosen to improve density accuracy while meeting real-time requirements. The experiments compare the proposed DEKF with a Bayesian belief network. The error is less than 20%.
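To illustrate how a Kalman filter exploits temporal information, the sketch below smooths a sequence of noisy per-frame counts. It is a deliberately simplified stand-in: a scalar linear Kalman filter with a constant-count model, not the distributed extended filter of [24]. The function name, the variance values, and the sample counts are assumptions chosen for illustration.

```python
def kalman_smooth(measurements, process_var=1.0, measurement_var=25.0,
                  initial_estimate=0.0, initial_var=1000.0):
    """Scalar Kalman filter assuming the true count changes slowly between frames."""
    estimate = initial_estimate
    variance = initial_var
    smoothed = []
    for z in measurements:
        # Predict: the count is assumed unchanged, but the uncertainty grows.
        variance += process_var
        # Update: blend the prediction with the new measurement.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed


# Hypothetical noisy counts around 50 persons.
noisy_counts = [48, 55, 47, 53, 50, 46, 54]
print(kalman_smooth(noisy_counts))
```

A larger `measurement_var` makes the filter trust its history more and react more slowly to a real change in the crowd, so the two variances have to be tuned to the scene.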
AN. Marana et al. [25] use the Minkowski fractal dimension to estimate the number of people. The method is evaluated on a railway station. Edge detection is applied to the input image, followed by a binarization step and a dilation step (which enlarges the boundaries of regions). Finally, the fractal dimension classifies the image according to its density into very high, high, moderate, low, and very low classes. The authors evaluate their method by comparing the results of the Minkowski method with those of the Gray Level Dependence Matrix (GLDM) [26]. The latter method is not able to distinguish between areas with very high density and areas with high density.
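The idea of summarizing a binary edge image by a fractal dimension can be sketched as follows. Note the substitution: [25] estimates the dimension from successive dilations, whereas this sketch uses box counting, a common alternative estimator of the Minkowski (box-counting) dimension. The function names, the box sizes, and the toy images are assumptions.

```python
import math


def box_count(binary, size):
    """Number of size x size boxes that contain at least one foreground pixel."""
    occupied = set()
    for y, row in enumerate(binary):
        for x, value in enumerate(row):
            if value:
                occupied.add((y // size, x // size))
    return len(occupied)


def box_counting_dimension(binary, sizes=(1, 2, 4, 8)):
    """Least-squares slope of log N(s) against log(1/s).

    Assumes the image contains at least one foreground pixel.
    """
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(binary, s)) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Toy 16x16 binary images: a single edge line and a fully filled region.
line_image = [[1] * 16] + [[0] * 16 for _ in range(15)]
filled_image = [[1] * 16 for _ in range(16)]

print(box_counting_dimension(line_image))
print(box_counting_dimension(filled_image))
```

The two toy images bracket the range of values: a sparse edge map behaves like a line, while a dense edge map fills the plane, and intermediate dimensions would then be mapped to density classes.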
SY. Cho et al. [27] apply a neural network to obtain an accurate estimate of the crowd's size. The method is based on background removal. The neural network is used to identify whether a mask belongs to black features, white features, or edge features. The case study is a railway station. The paper proposes a novel processing pipeline. A fast edge detection is proposed to meet real-time requirements: a binarization step is applied instead of the Sobel filter, and then the edge algorithm is used. Next, the crowd objects are extracted by removing the background, which estimates the undesired region. Finally, a Hybrid Global Learning (HGL) algorithm associated with a neural model is implemented. Three HGL variants are evaluated. The first is a hybrid of least squares and random search. It is the fastest, with 2.02 min of CPU time for learning, but it has the lowest estimation accuracy (90.72%). The second is a hybrid of least squares and Simulated Annealing (SA). It obtains the best estimation accuracy (94.36%) but is the slowest (197.5 min). The third is a hybrid of least squares and a genetic algorithm (GA). It requires 75.3 min of learning time and reaches an estimation accuracy of 93.89%.
C. Wang et al. [28] apply an end-to-end deep CNN regression model to approximate the number of people in a dense crowd. The authors reduce the influence of the background by adding negative samples to the training data, whose ground-truth count is set to zero. The proposed method improves the estimated person count. The results show that the mean absolute difference and the mean normalized absolute difference decrease by 16.7% and 27%, respectively. The error rate is about 10%, which is still significant. Moreover, this method is limited to 1300 persons per image.
F. Min [29] presents an optimized CNN, named ConvNet, to improve the accuracy and the speed of crowd density estimation. The author implements a two-stage cascade of CNNs and removes some network connections from the CNN design to speed up the computation. The experiments are based on the PETS_2009 dataset. The method is limited to an image size of 42x40. The results show that the error rate decreases to 3.2%. However, the author does not discuss the speed-up achieved by the method.
C. Zhang et al. [16] estimate the crowd's density and count with a deep CNN, in particular for unseen scenes. The authors describe a data-driven method to fine-tune the trained CNN model for the target scene. They also build a new dataset of 108 scenes containing about 200,000 persons. A comparative study on other datasets shows the reliability and effectiveness of their method.
E. Wolf et al. [18] [...] the mean absolute error from 20% to 35% and the training time is decreased by 50%.
Z. Zhao et al. [21] present a CNN-based method to count the persons crossing a line of interest. The method takes pairs of video frames as input and is trained with pixel-level supervision maps. The proposed enhancement lets the CNN learn better features by decomposing the training phase into two steps: (1) estimating the crowd density map, and (2) estimating the crowd velocity map. Solving these two simpler sub-problems first yields a more accurate solution to the original problem. The authors build a new dataset based on pedestrian trajectory annotations to show the robustness of the method, and they introduce a new metric, the Mean Windowed Relative Absolute Error (MWRAE), for which they report 6%.
Y. Zhang et al. [19] estimate the crowd from a single image using a Multi-column CNN (MCNN) architecture. The MCNN supports any size or resolution of the input image. Each column uses filters of a different size so that the CNN learns features at a different scale. A geometry-adaptive kernel is then used to compute the true density map associated with the input image. The authors introduce a new dataset of 1198 images to cover challenging situations. The experiments report a mean absolute error of 1.07%.
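The construction of a ground-truth density map with a geometry-adaptive kernel can be sketched as follows. This is a minimal sketch under stated assumptions rather than the code of [19]: each annotated head is spread with a Gaussian whose width is proportional to the mean distance to its nearest neighbors, the factor `beta = 0.3`, the number of neighbors `k`, the fallback width for an isolated head, and the normalization over the image grid are all assumptions, and the head positions are hypothetical.

```python
import math


def density_map(heads, height, width, beta=0.3, k=3, default_sigma=2.0):
    """Sum of per-head Gaussians whose width adapts to the local head spacing.

    `heads` is a list of (row, column) positions assumed to lie inside the image.
    Each kernel is normalized over the grid, so the map sums to the head count.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for i, (hy, hx) in enumerate(heads):
        # Distances from this head to every other annotated head.
        dists = sorted(math.hypot(hy - oy, hx - ox)
                       for j, (oy, ox) in enumerate(heads) if j != i)
        nearest = dists[:k]
        sigma = beta * sum(nearest) / len(nearest) if nearest else default_sigma
        sigma = max(sigma, 0.5)  # guard against overlapping annotations
        weights = [[math.exp(-((y - hy) ** 2 + (x - hx) ** 2) / (2.0 * sigma ** 2))
                    for x in range(width)] for y in range(height)]
        total = sum(sum(row) for row in weights)
        for y in range(height):
            for x in range(width):
                dmap[y][x] += weights[y][x] / total
    return dmap


# Hypothetical head annotations in a 32x32 image.
heads = [(5, 5), (6, 8), (20, 20)]
dmap = density_map(heads, 32, 32)
print(sum(sum(row) for row in dmap))
```

Because every kernel integrates to one, summing the map over any region gives the expected number of persons in that region, which is what lets a network trained to regress the map also produce a count.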
C. Shang et al. [30] count the crowd directly from an input image using an end-to-end CNN. The method estimates the crowd from both global and local features by applying a pre-trained CNN to the image. Recurrent network layers produce the local counts by mapping features. The local counts reduce the training time, and the global count improves the accuracy of the results obtained for the local regions. A comparative study on several datasets demonstrates the effectiveness of the approach.
T. Mundhenk et al. [31] apply deep learning to count cars. The authors build a large contextual dataset intended to help drivers choose the best target and avoid bottlenecks. The proposed method combines residual learning with inception-style layers. It offers a new way to count objects, as an alternative to the established methods based on density estimation and localization. The experiments show that the results are more accurate and the processing is faster.
L. Boominathan et al. [32] introduce the "CrowdNet" framework, based on a deep CNN, to estimate the density of the crowd. CrowdNet combines a deep network and a shallow network applied to a static image. This combination provides effective results by capturing both semantic information and features. To improve accuracy, the authors propose to enlarge the training dataset beyond 100 samples. The results are discussed using the UCF_CC_50 dataset.
A. Vishwanath et al. [33] count the crowd using an end-to-end cascaded CNN together with density map estimation. The method classifies the crowd count into groups, which lets the network learn global features that refine the density maps and reduce the count error. A comparative study is presented to show the accuracy of the density maps and the low count error.
S. Deepak et al. [34] present a method that relates crowd counting to crowd density. A multi-scale CNN is described to reduce the adverse effects of factors such as inter-occlusion, high similarity of appearance, and viewpoint. The method switches between independent CNN regressors to improve the accuracy of the estimation: a switch classifier selects the best CNN regressor. The results show that the switch relays each patch to a particular CNN column to identify the crowd density of the input image. The comparative study shows that the proposed method improves accuracy and that the mean error decreases to nearly 2%.
X. Yang et al. [35] present emergency evacuation as a case study related to crowded areas. The authors apply a clustering algorithm to separate informed and uninformed walkers. The goal of their study is to find the optimal guidance during evacuation, for which the density of the crowd is an important criterion. The informed method with an exponent model reaches an acceptable accuracy.
Z. Zhikang et al. [9] propose to count the crowd with a multi-part structure. Their method, named the Adaptive Capacity Multi-scale CNN, assigns different capacities to different portions of the image. It focuses on important regions instead of the whole image to optimize the allocation. The method is composed of a coarse network, a fine network, and a smooth network. The coarse network finds the region to focus on and produces the rough feature map. The fine network extracts a fine feature map from the region of interest. The smooth network improves the results by aggregating the two feature maps to reduce the effect of the division. The method is validated on five datasets.
Z. Liping et al. [36] introduce a deep learning technique to compute the crowd's density under non-uniform density and variations. The authors apply a pooling operation to the density map to overcome the loss of local spatial information. This pooling is performed with a dilated CNN that preserves details related to person position. The latter is provided by global context guidance. The method is validated on several datasets.
X. Zeng et al. [37] address the problem of scale variation in crowd estimation. The authors provide more accurate contextual information by using a deep scale purifier network (DSPNet). The method encodes multi-scale features and consists of a frontend and a backend model. A cross-scene evaluation is applied to the approach, and several datasets are used to evaluate the accuracy of DSPNet.
This brief review shows that most techniques are applied to a single image only. The authors in [18], [21], and [33] attempt to process video instead of images to estimate the density of the crowd. These attempts should be extended to support any input. Recent works adopt deep learning methods to compute the density. These methods require a learning phase based on a dataset. Their high accuracy is their strong point, but they suffer from long processing times. Datasets serve to evaluate the performance of the methods proposed by researchers to estimate the crowd's size. Results are more convincing when the evaluation is made on several datasets.
Real-time constraints are not well studied in the cited works. Only the authors in [18], [21] propose methods that take real-time constraints into account. This field calls for hardware architectures that can be embedded in a camera to estimate the density of the crowd.
This section has discussed important studies on estimating the density or the count of crowds. The review covers many techniques based on video or image processing. Some methods extract the density from the spatial information of the frame. These methods are accurate only for small crowds (fewer than 50 people). Other methods, based on deep learning techniques, give more accurate results, especially for the largest crowds.

V. SYNTHESIS
This section discusses the most important studies to extract the benefits and limits of each work. A comparison based on the different result metrics then highlights their accuracy. At the end of this section, the evolution of this field is presented according to the number of publications during the last five years.

VI. CONCLUSIONS
Estimating the density of a crowded area is still a challenge for researchers. In this paper, we have presented the research directions related to crowd estimation, in particular the datasets, metrics, and approaches. Datasets should be improved by increasing the number of scenes and reaching more than half a million pedestrians. These enhancements are required to improve the results obtained in the evaluation phase.
Approaches still need improvement to reduce the error between the real size of the crowd and the estimated count. Further methods should be developed to support camera movement and the fusion of data received from multiple sources. In addition, the real-time constraint has to be addressed by future research in this domain.
Crowd density estimation is employed in different types of applications, such as the estimation of pedestrians, crowded car traffic, crowds in malls, and bacterial cell microscopy.
Besides, this domain remains an open challenge, as shown by the number of articles published in the Web of Science database: publications grew from 85 articles in 2015 to 148 articles in 2019. The IEEE Xplore database shows that the number of published papers increased by 25% between 2018 and 2019. In the ScienceDirect database, the number of published papers has grown by about 20% since 2018.
These statistics show not only the importance of the domain but also that the challenges facing crowd estimation persist.
According to this study, crowd size estimation still needs improvement in accuracy and under real-time constraints.