A Review on Scream Classification for Situation Understanding

Abstract: In our living environment, non-speech audio signals provide significant evidence for situation awareness and complement the information obtained from video signals. Among non-speech audio events, screaming is of particular interest to people such as security guards, caretakers, and family members for care and surveillance, because screams are automatically considered a sign of danger. Contrary to this assumption, this review targets automated acoustic systems built on the non-speech class of screams, on the premise that screams can be further classified into classes such as happiness, sadness, fear, and danger. Drawing on the existing scream detection and classification literature, a taxonomy is proposed that highlights the target applications, significant sound features, classification techniques, and their impact on classification problems over the last few decades. This review will help researchers select the most appropriate scream detection and classification techniques and acoustic parameters for understanding the vocalization condition of the speaker.


I. INTRODUCTION
In the past few decades, there have been several efforts to classify acoustic data into classes. Audio data is highly informative and a rich source for content-based classification of acoustic signals. Human beings use the vocal tract to produce sounds such as talking, singing, crying, and laughing. These sounds are further classified as speech or non-speech vocalizations. Speech consists of voiced sentences that can be understood using different Natural Language Processing (NLP) techniques. Non-speech sounds include laughs, sneezes, coughs, snores, and screams. These non-speech vocalizations are sometimes separated from speech signals to extract additional information about the context, situation, or emotional state of the speaker. A scream is a non-speech signal produced by a loud vocalization in which air passes through the vocal folds with greater force than in regular vocalizations. Most often, a scream is a reflex action or a response to an unexpected situation, and it is strongly associated with the emotional behavior of the speaker. It can take many forms, such as a scream of joy, danger, pain, or surprise.
Scream sound event detection and classification has a wide range of applications, which has given it significant importance in the literature. Many real-life acoustic systems use scream detection in areas such as speaker identification [1], audio surveillance systems [2], and home applications [3]. These systems process the knowledge extracted from scream detection and classification. In this field, recent progress has come from combining time-frequency features with machine learning classifiers. Different techniques and methodologies have been established to differentiate speech and non-speech sounds. These include Support Vector Machines [3], band-limited spectral entropy [4], Deep Neural Networks (DNN) [5], Hidden Markov Models (HMM), sound event partitioning [6], and the modulation power spectrum [7].
Most works on scream detection and classification emphasize a few crucial acoustic events, and none covers the overall state of the art for scream classification and detection. The current work differs from all preceding efforts in terms of emphasis, correctness, and suitability. The aim of this review is to highlight the concerns and challenges of scream classification and to analyze and classify screams from a variety of perspectives. Additionally, a comparative study is presented that is based on the problem domain, sound features, and classification techniques. From this review, one can readily determine the problem domains where scream research efforts are best directed, and the most suitable sound parameters and scream classification techniques for situation understanding.
This review is organized as follows. Section II covers the data collection techniques and research methodology. Section III contains an overview of different classes of problem domains, sound features, and classification techniques. Section IV evaluates the various data classes and discusses the comparison and accuracy rates. Finally, Section V concludes the key points of this review.

II. DATA COLLECTION
A review of 30 research articles associated with scream detection and classification in various environments is presented. Highly cited and credible publications from different digital libraries are used as research sources. All the articles are thoroughly analyzed to make sure that their content is pertinent to the research interests. The classification problems that have hindered further development and exploration in screaming environments are discussed. The selected data has been divided into different categories to carve out possible alternatives in several directions. Table I presents the literature related to scenarios that use scream detection or classification as a major component. The problem addressed by each research article is described along with its ability to detect or classify screams. Most of the authors focus on using scream detection in surveillance systems, since screams are commonly understood as a sign of danger. Others focus either on enhancing the sound features of the systems under study or on identifying speakers from their vocal scream samples. Only one author has worked, indirectly, on scream classification along with detection, and that work concerns animal screams.
These research studies analyze and compare the crucial aspects of different scream detection and classification methods. The main concern is the accuracy of the detection and classification stages, together with minimizing the error rates and choosing the best possible sound features. In this review, the emphasis is on the proficiency and accuracy of scream classification techniques.
At first glance, Table I makes it difficult for a researcher who is new to this field to identify the loose ends and research gaps. For this reason, each source is separated in terms of its problem domain, sound parameters, type of classification technique used, and the results obtained in each case. All these categories are later divided into further classes for an even broader understanding. Furthermore, tables and graphs are used in each class to compare one source with another and find out which domain, parameter, or technique is the best one to pursue in future work.

III. DATA CLASSIFICATION
The process of organizing data into groups and categories for its most effective and efficient use is broadly defined as data classification. As described above the collected data samples from different sources are analyzed based on the parameters discussed in Table II.

A. Problem Domain
A problem domain is the area of knowledge or application that needs to be analyzed and examined to solve a problem. Converging on a problem domain simply means focusing only on the topics of interest and setting everything else aside. Based on the observations from various research sources, the problem domain has been divided into three categories: i) Surveillance, ii) Speaker Identification, and iii) Acoustic Feature Enhancement. These categories are discussed in detail below.
1) Surveillance: Surveillance means managing, protecting, influencing, or directing people by monitoring abnormal activities or changing information in their surroundings [32]. Surveillance systems enable the remote observation of society for public safety and integrity. These observations can be made with electronic means such as audio/video recordings or phone calls. Sound-based surveillance systems enable remote public protection by analyzing sound samples collected from the target location or the target person. Screams play a major role in analyzing a situation for any signs of danger.
2) Speaker Identification: Speaker identification systems are used to identify a person from voice biometrics. These systems use the human voice features that differ between individuals. Screams can be used very effectively for text-independent speaker identification.
3) Acoustic Feature Enhancement: A large part of the scream literature is based on techniques that enhance the acoustic features used for scream detection and classification. These techniques help increase the robustness of detection and classification for several different kinds of sound-based, scream-dependent systems.

B. Feature Extraction
Feature extraction plays a vital role in evaluating and characterizing the contents of an audio stream. To analyze a scream audio stream, the first step is extracting the relevant acoustic features from the audio frames. Table III presents different kinds of acoustic features, including Temporal, Spectral, and Prosodic. This categorization is based on the diverse behaviour of the acoustic parameters. These features are extracted from audio signals for their easy adaptability, robustness against noise, and ease of implementation.

1) Temporal: The fluctuation of the amplitude of a sound signal over time (the waveform) is represented by temporal, or time-amplitude, features. These acoustic features can be extracted directly from raw sound signals, and no prior data is required. Typical temporal features include amplitude-based features, zero-crossing rate (ZCR), and power-based features. Such features usually offer a simple way to examine acoustic signals.
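To make the temporal features above more concrete, the following is a minimal sketch of the zero-crossing rate and a power-based feature (short-time energy) computed per frame. The function names, the frame length, and the toy signal are assumptions made for illustration and do not come from any of the reviewed articles.

```python
def frame_signal(samples, frame_len, hop):
    """Split a list of samples into frames of frame_len, advancing by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame (a simple power-based feature)."""
    return sum(x * x for x in frame) / len(frame)


# Toy signal (hypothetical): a quiet, constant part followed by a loud,
# rapidly alternating part.
signal = [0.1, 0.1, 0.1, 0.1, 1.0, -1.0, 1.0, -1.0]
frames = frame_signal(signal, frame_len=4, hop=4)
features = [(zero_crossing_rate(f), short_time_energy(f)) for f in frames]
```

In this toy case the second frame has both a higher zero-crossing rate and a higher energy than the first, which is the kind of contrast such features are meant to expose.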
2) Spectral: Spectral/cepstral features are derived from the short-term spectrum of the signal. Speech and non-speech discrimination, as well as speaker and language recognition, mostly rely on cepstral features. The computation of the cepstrum is composed of three processes, namely the Fourier transform, the logarithm, and the inverse Fourier transform [33]. These processes allow the identification of the base frequency of the audio signal. The different variants of spectral features include Mel-frequency Cepstral Coefficients, Spectral Centroid, Spectral Flux, Spectral Roll-off, Spectral Tilt, Spectral Entropy, Signal Bandwidth, Sub-Band Energy Ratio, and Linear Prediction. Generally, the temporal features need to be combined with spectral features for in-depth audio analysis. The computational complexity of spectral features is higher than that of temporal features.
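As an illustration of the spectral features discussed above, the sketch below computes a real cepstrum (Fourier transform, then log magnitude, then inverse Fourier transform) and a spectral centroid. It uses a naive O(N^2) transform so that it depends only on the standard library; the function names, the small `floor` constant that avoids taking the logarithm of zero, and the test signals are all assumptions for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def real_cepstrum(x, floor=1e-12):
    """Fourier transform -> log magnitude -> inverse Fourier transform."""
    log_mag = [math.log(abs(c) + floor) for c in dft(x)]
    return [c.real for c in inverse_dft(log_mag)]


def spectral_centroid(x, sample_rate):
    """Magnitude-weighted mean frequency over the non-negative frequency bins."""
    n = len(x)
    mags = [abs(c) for c in dft(x)[: n // 2 + 1]]
    freqs = [k * sample_rate / n for k in range(n // 2 + 1)]
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)


# Hypothetical test signals: an 8 Hz tone sampled at 64 Hz, and a unit impulse.
sample_rate = 64
tone = [math.sin(2 * math.pi * 8 * t / sample_rate) for t in range(64)]
centroid = spectral_centroid(tone, sample_rate)

impulse = [1.0] + [0.0] * 15
cepstrum = real_cepstrum(impulse)
```

For the pure tone the centroid sits at the tone frequency, and for the impulse (whose magnitude spectrum is flat) the cepstrum is essentially zero everywhere. A practical system would use a fast Fourier transform and windowed frames instead of this naive transform.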
3) Prosodic: Prosodic, or perceptual, features are used to describe sound in terms that carry semantic meaning for human listeners. Physical features, on the other hand, define audio signals in terms of mathematical and physical properties. Prosodic features are organized around semantically meaningful characteristics of sounds. These characteristics include loudness/intensity, fundamental frequency, and rhythm.
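Two of the prosodic characteristics named above, loudness and fundamental frequency, can be approximated with very simple estimators. The sketch below uses root-mean-square amplitude as a loudness proxy and an autocorrelation peak search for the fundamental frequency. The pitch search range of 50 to 500 Hz, the function names, and the synthetic 200 Hz test tone are assumptions; the reviewed articles may use different estimators.

```python
import math


def rms_loudness(frame):
    """Root-mean-square amplitude, used here as a simple loudness proxy."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_f0(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Pick the lag with the largest autocorrelation inside the pitch range."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Hypothetical test frame: a 200 Hz sine with amplitude 0.5, sampled at 8 kHz.
sample_rate = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
f0 = estimate_f0(frame, sample_rate)
loudness = rms_loudness(frame)
```

The frame must be longer than the largest lag searched (here 160 samples), which is why the example uses 800 samples.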

C. Scream Classification Techniques
Scream classification can be performed using traditional classification approaches. One example is manual classification by human experts. The experience and skill of a good analyst make this method reliable, but it is time-consuming and arduous in spite of the precise results. To reduce human involvement and automate the detection and classification process, two approaches are widely applied to scream detection and classification. These two approaches, supervised and unsupervised, are highlighted in Table IV along with their sub-techniques. Semi-supervised learning algorithms are rarely used for scream classification and are therefore not part of this review.
1) Supervised Learning Algorithms: Supervised learning is extensively used in scream audio event detection systems. These techniques include k-nearest neighbors (k-NN), linear discriminant analysis, Bayesian networks, support vector machines, and rule-based machine learning algorithms. The defining characteristic of these algorithms is that the behavioral models are trained with labelled data. This method places high demands on resources.

a) Instance-Based or K-Nearest-Neighbors (KNN)
The k-nearest neighbors algorithm (KNN) is one of the simplest non-parametric algorithms in the family of instance-based learning [34]. The output of this algorithm depends on whether it is used for regression or classification. KNN is a robust method that is capable of organizing and segmenting audio streams into music, speech, environmental sounds, and silence [35]. The author in [11] used KNN for scream classification. Classification is based on the majority vote of the neighbors: the object is assigned to the class that is most common among its k nearest neighbors, where k is a positive integer. With k=1, the object is assigned to the class of its single nearest neighbor. Although KNN is quite easy to implement, it has high memory and computational requirements. To overcome this problem, [36] and many other techniques have been developed.

b) Neural Networks
The Artificial Neural Network (ANN) is a data processing system that is loosely inspired by the way biological neural networks, such as the animal or human brain, process information. For audio events, the Radial Basis Function (RBF) network and the Multi-Layer Perceptron (MLP) have been applied for supervised audio classification to decrease misclassification errors. In an MLP, input datasets are mapped onto appropriate output sets. The most common use of MLP is in automatic phoneme recognition tasks [37]. A particular case [38] of the feed-forward network is the Radial Basis Function (RBF) network, which creates a linear map from the hidden space to the output space.
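The idea that an RBF network maps the hidden space linearly to the output space can be sketched as follows. This is a minimal exact-interpolation variant, assuming one Gaussian hidden unit per training point and output weights obtained by solving a small linear system; the kernel width, the helper names, and the XOR-style toy data are assumptions for illustration and are not taken from [38].

```python
import math


def gaussian(x, center, width):
    """Gaussian radial basis activation of one hidden unit."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-dist_sq / (2.0 * width ** 2))


def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def train_rbf(inputs, targets, width=0.5):
    """One hidden unit per training point; output weights solve a linear system."""
    hidden = [[gaussian(x, c, width) for c in inputs] for x in inputs]
    return solve_linear(hidden, targets)


def predict_rbf(x, centers, weights, width=0.5):
    """Output is a linear combination of the hidden-unit activations."""
    return sum(w * gaussian(x, c, width) for w, c in zip(weights, centers))


# Toy data (hypothetical): the XOR pattern, which is not linearly separable
# in the input space but becomes so in the hidden space.
inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 0.0]
weights = train_rbf(inputs, targets)
outputs = [predict_rbf(x, inputs, weights) for x in inputs]
```

Practical RBF networks typically use far fewer centers than training points and fit the output weights by least squares; the exact-interpolation form is used here only because it keeps the sketch short.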

c) Rule-Based Classifiers
A rule-based machine learner identifies and utilizes a set of relational rules that collectively represent the knowledge captured by the system. This contrasts with other machine learners, which commonly identify a single model that can be applied universally to any instance to make a prediction. A variation of this classifier is the fuzzy rule-based classifier (FRBC), which is used efficiently for numerous classification tasks. In fuzzy set-oriented auditory event detection, the fuzzy rule base holds a set of rules that classify the characteristics found in the training data [39]. The disadvantage of fuzzy operators is that there is no specific way to define them, especially for symbolic variables. The classification problem of non-speech human voice was solved in [40] using the fuzzy integral and some of the associated fuzzy measures.

d) Bayesian Networks
A Bayesian network (also called a Bayes or belief network) is a graphical model that probabilistically represents a set of variables and their interdependencies using a directed acyclic graph (DAG). The variants of the Bayesian network include: 1) serial, 2) divergent, and 3) convergent. It performs fast supervised classification, which makes it appropriate for forecasting and classification tasks on complex large-scale datasets. Various multimodal models [41]-[43] have been proposed to resolve problems in acoustic and speech segmentation in movies or robot speech under noisy conditions.

e) Linear Discriminants
Linear discriminant analysis (LDA) is used to find a linear combination of features that separates two or more classes of objects or events. The resulting combination can then be used as a classifier or for dimensionality reduction. LDA essentially transforms raw data into a feature space [44] that supports a more robust classification.
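A minimal two-class sketch of the linear combination described above is given below, assuming Fisher's formulation in which the projection vector is the inverse within-class scatter matrix applied to the difference of the class means. It is restricted to two features so that the 2x2 inverse can be written out by hand, and it assumes the scatter matrix is invertible; the function names and the toy points are hypothetical.

```python
def mean_vector(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter_matrix(points, mean):
    """Sum of outer products of deviations from the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_lda(class0, class1):
    """Return the projection vector w and a midpoint decision threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, m0), scatter_matrix(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumed non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def classify(point, w, threshold):
    """Project the point onto w and compare with the threshold."""
    return 1 if w[0] * point[0] + w[1] * point[1] > threshold else 0


# Toy two-feature data (hypothetical), e.g. two acoustic features per sample.
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(6.0, 5.0), (7.0, 6.0), (7.0, 4.0), (8.0, 5.0)]
w, threshold = fit_lda(class0, class1)
predictions = [classify(p, w, threshold) for p in class0 + class1]
```

The same vector `w` can be used for dimensionality reduction by keeping only the projected value of each sample.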

f) Support Vector Machines (SVM)
Support vector machines (SVMs) are a valuable machine learning method for complicated data classification problems [45]. The training set is provided to the SVM as a set of input vectors. SVMs separate two classes by maximizing the margin between the class boundary and the nearest sample to it.
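The margin criterion described above can be illustrated without a full training procedure. The sketch below does not train an SVM; it only computes the geometric margin of two hand-picked candidate hyperplanes on toy data and keeps the one with the larger margin, which is the quantity an SVM maximizes through optimization. The candidate hyperplanes, their names, and the data points are assumptions for illustration.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(points, labels))


# Toy, linearly separable data (hypothetical) with labels +1 and -1.
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [1, 1, 1, -1, -1, -1]

# Two hand-picked separating hyperplanes, given as (w, b).
candidates = {
    "diagonal": ((1.0, 1.0), -2.5),
    "vertical": ((1.0, 0.0), -1.5),
}
margins = {name: geometric_margin(w, b, points, labels)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
```

Both candidates classify every sample correctly, but the diagonal boundary leaves more room to the nearest sample, so it is the one a maximum-margin classifier would prefer.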
2) Unsupervised Learning Algorithms: Unsupervised learning algorithms are applied to infer a function or draw conclusions from unlabeled input data. As the data is unlabeled, the process involves finding and correlating the labels. The main objective of unsupervised learning is to examine the information and discover similarities between the objects.
In unsupervised learning, the most common method is cluster analysis, which uses heuristics to analyze and find hidden classes and patterns in audio data. Clustering relies on similarity measurements based on metrics such as Euclidean distance and probabilistic distance [46]. Some common unsupervised algorithms are: 1) Gaussian Mixture Models, 2) Clustering, 3) Hidden Markov Models, and 4) Neural Networks.

a) Gaussian Mixture Models (GMM)
Gaussian mixture models (GMMs) are unsupervised classification methods. These methods are extensively used in speech/voice recognition and sensing and hence can be applied to scream detection and classification. A GMM assumes that all the data points are generated from a mixture of several Gaussian distributions with unknown parameters.
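The assumption that the data come from a mixture of Gaussians with unknown parameters is usually handled with the expectation-maximization (EM) algorithm. The following is a minimal one-dimensional sketch, assuming hand-chosen initial means, unit initial variances, and a small variance floor for numerical safety; none of these choices, nor the toy data, come from the reviewed articles.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, initial_means, iterations=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            likes = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_floor, spread)
            weights[j] = nj / len(data)
    return weights, means, variances


# Toy one-dimensional feature values (hypothetical), forming two groups.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances = fit_gmm_1d(data, initial_means=[min(data), max(data)])
```

For classification, one mixture is typically fitted per class and a new sample is assigned to the class whose mixture gives it the highest likelihood.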

b) Clustering
Hierarchical clustering (HC), also called hierarchical cluster analysis, is a cluster analysis technique that builds a hierarchy of clusters by recursively merging or dividing the patterns [47], [48]. It uses two kinds of strategies. The first constructs the hierarchy from the bottom up (agglomerative) by calculating the similarities among all pairs of clusters and iteratively merging the most similar pair. The second, top-down (divisive) approach performs splits recursively while moving down the hierarchy.
In Partitioning approaches, samples are repositioned by transferring from one cluster to the other. This method initially requires the total number of clusters that will be pre-set by the user. The well-known methods in this field include K-means and its variants [48], [49].
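The partitioning idea above, in which samples move between a pre-set number of clusters, can be sketched with a plain K-means loop. The initial centroids are passed in explicitly so that the result is deterministic; the function names, the iteration count, and the toy points are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, iterations=20):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point moves to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy two-feature data (hypothetical) with two visible groups; the number of
# clusters (2) is fixed in advance through the initial centroids.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
```

In practice the initial centroids are usually chosen at random or with a seeding heuristic, and the result can depend on that choice.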

c) Hidden Markov Models (HMM)
A hidden Markov model (HMM) is a statistical Markov model whose states are unobserved (hidden). The hidden states are inferred from observable symbols through an emission function [50]. The hidden Markov model can be considered the simplest dynamic Bayesian network. C. Chan et al. [22] and M. Vacher et al. [16] used HMMs for scream classification.
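How hidden states are recovered from observable symbols can be sketched with the Viterbi algorithm, which finds the most likely state sequence. The two-state model below (background versus scream, observed through low or high frame energy) and all of its probabilities are invented for illustration; they are not taken from [22] or [16].

```python
def viterbi(observations, start, trans, emit):
    """Return the most likely hidden state sequence for the observations."""
    n_states = len(start)
    delta = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = [], []
        for s in range(n_states):
            # Best predecessor state for arriving in state s.
            prev = max(range(n_states), key=lambda p: delta[p] * trans[p][s])
            pointers.append(prev)
            new_delta.append(delta[prev] * trans[prev][s] * emit[s][obs])
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the most probable final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical model: state 0 = background, state 1 = scream;
# observation 0 = low frame energy, observation 1 = high frame energy.
start = [0.8, 0.2]
trans = [[0.9, 0.1],
         [0.2, 0.8]]
emit = [[0.9, 0.1],
        [0.2, 0.8]]
observations = [0, 0, 1, 1, 1]
path = viterbi(observations, start, trans, emit)
```

For long sequences the products of probabilities underflow, so practical implementations work with log probabilities instead.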

d) Neural Networks
Artificial neural networks (ANNs) are large computing systems consisting of a huge number of simple processors and their interconnections working together. ANNs can solve classification problems reliably and efficiently, with high fault tolerance and adaptability [51]. The most commonly used neural network models for unsupervised learning are the Self-Organizing Map (SOM) and Adaptive Resonance Theory (ART).

IV. RESULTS AND DISCUSSION
A total of 30 research articles on scream classification and detection are reviewed and compared based on problem domains, sound features, and classification techniques. A quick analysis of the review for each case is presented below:

A. Analysis of Problem Domain
The three main problem domains for scream classification are Surveillance, Speaker Identification, and Feature Enhancement. The relevant research articles are separated by problem domain, and the division is shown in Table V. Out of 30 articles, 19 address personal or public surveillance, 3 address the identification of the speaker, and 8 discuss methods and mechanisms to enhance the experimental results for scream sounds. In the next step, the overall percentages are calculated for these problem domains to find out which one is lagging and needs further exploration (Fig. 1). With the increasing rate of public crime (for example on streets and public transport) and the danger to human lives, surveillance systems based on audio analysis of screams are rapidly becoming popular. This is because screams are usually interpreted as survival signals in humans. Such systems can be of major help in medical surveys, audio scene classification, embedded transport environments such as buses and trains, and 24x7 monitoring for signs of distress in people's daily routine. Fig. 1 indicates that the Surveillance domain is richer in scream detection and classification work for two reasons: 1) the increasing number of health and safety issues, and 2) screams being a sign of danger.
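The percentage calculation behind this kind of comparison is simple arithmetic on the article counts stated above (19, 3, and 8 out of 30). A minimal sketch follows; the dictionary keys are labels chosen here for readability, and the rounding to one decimal place is an assumption about presentation.

```python
# Article counts per problem domain, as stated in the text above.
domain_counts = {
    "Surveillance": 19,
    "Speaker Identification": 3,
    "Acoustic Feature Enhancement": 8,
}

total = sum(domain_counts.values())
shares = {name: round(100.0 * count / total, 1)
          for name, count in domain_counts.items()}
# shares maps each domain to its percentage of the 30 reviewed articles.
```

With these counts, surveillance accounts for roughly 63% of the reviewed articles, feature enhancement for roughly 27%, and speaker identification for 10%.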

B. Analysis of Scream Sound Features
It is computationally expensive to use all the sound features for scream classification, so it is common practice to combine one or two types of features to achieve the best results in conjunction with classification techniques. While exploring the sound features, it can be observed that some of the articles use this combined feature approach. Following this, a taxonomy has been developed (described in Table VI). Temporal features cannot be used effectively on their own, so no article has used them independently; they appear only in combination with other types.
Spectral and Prosodic features are used independently as well as in combination. Table VI describes all of the articles under consideration and the type of sound features they have used or recommended for scream classification. The results of this step are shown in Fig. 2 and 3.
In Fig. 2, S=Spectral, P=Prosodic, T=Temporal, TS=Temporal and Spectral, SP=Spectral and Prosodic, TP=Temporal and Prosodic, and TSP=Temporal, Spectral and Prosodic. It shows that the most commonly used sound features are spectral. Out of 30 studies, 12 used spectral features independently. The second most used features are the combinations TS and SP, while no study recommended T or TP.
The results are presented by calculating the percentages for each type or combination. The percentage evaluation is shown in Fig. 3, which clearly shows that spectral parameters are the most recommended for scream classification, being used in 40% of the studies.
Each category of scream sound feature in turn has many forms. Table VII describes in detail the type of sound feature used by each of the considered scream articles.
From the previous step, it has been concluded that spectral features are highly recommended in the literature for scream classification. The purpose of this last step is to determine which of the many forms of spectral features shows the best performance.
Fourier transform is used to convert time-domain signal into frequency domain for obtaining spectral features. These features are quite helpful in identifying the notes, pitch, rhythms and melody.
The results of this step are shown in Fig. 4. It can be clearly observed that Mel-frequency Cepstral Coefficients (MFCC) are the most used and most highly recommended sound feature for scream classification. They can be used either individually or in combination with other sound features. MFCC are extensively applied in voice recognition because these features closely approximate human hearing. In more complex signals such as speech or music, where the signal changes its properties over time, it is more meaningful to refer to the changing frequency content over a short time interval than over an infinite time interval.
Table VII (fragment):
(feature label missing): Z. Zaheer et al. [14], K. Kato [19], B. Uzkent et al. [20], W. Liao et al. [23]
Loudness/Intensity: L. Gerosa et al. [2], K. Kato [19], C. Zhang et al. [25]
Rhythm/Duration: C. Chan et al. [22], K. Kato [19], C. Zhang et al. [25]
Log Energy: N. Hayasaka et al. [4], W. Huang et al. [3]

C. Analysis of Classification Techniques
Sound event detection approaches fall into two clear divisions, supervised and unsupervised, which can also be used in combination. These approaches have been studied noticeably, yet thorough and complete analyses of classification approaches are still scarce, particularly for scream signal classification. This review documents scream classification under these two subclasses, together with a close review of each class.
This taxonomy has been shown in Table VIII. The referenced articles in each category are carefully observed and assigned to the relevant class. Some of the techniques are using supervised and unsupervised approach independently while the others are using a combination of both approaches (separately). This table is not for comparison as the datasets and the sound features are used differently. It is just providing a review of the current illustrative approaches.
A more precise view is presented in Fig. 5, where 11 studies used supervised, 13 used unsupervised, and 4 used combined scream classification approaches. Furthermore, a generic analytical view of the classification approaches is shown in Fig. 6, where the percentages are calculated for each case. It can clearly be seen that unsupervised approaches have been applied more often than the other approaches for scream detection and classification in the last 18 years. For this purpose, the supervised and unsupervised scream classification techniques are further explained and analyzed in the next section. The primary purpose of this review is to present supervised learning approaches to scream classification, so that future researchers can find ways to explore automated acoustic environments and systems. The most recent experimental research works related to scream classification and detection are summarized in Table IX. It presents the latest methods for addressing scream classification and detection issues with supervised learning methods.
The accuracies of the classifiers are statistically compared by collecting the total number of studies along with their classification results. From the individual accuracy of each supervised learning classification technique reported in the literature, average accuracies are calculated to find out which technique provides the best results.
2) Unsupervised Learning Algorithms: Unsupervised learning algorithms constitute a major learning paradigm and have drawn considerable attention in the past few decades, as shown by the growing number of research publications in this field. The unsupervised methods for scream detection and classification are divided into four classes: Clustering, GMM, HMM, and NN. Table X lists the most significant research works and their average accuracies for scream detection and classification with unsupervised approaches, presenting some solutions to the problems that restrain the performance of scream classification systems for situation understanding. Zaheer et al. [14] achieved 100% scream detection accuracy with the GMM technique. N. Hayasaka et al. [4] achieved an accuracy of 99%, again with GMM.
The overall average accuracies of the four unsupervised scream classifiers are calculated and plotted in Fig. 10. It can clearly be observed that GMMs produce the best results, with an average classification accuracy of 86%.

a) Combining Results
The results of the overall review are brought together in Table XI. It lists all the research articles of the last 18 years (2000-2018) with their sound parameters and classification techniques. The accuracy percentage and the effective error rate (ERR) of each article are also given.
[Fig. 11]
It can be clearly observed that only a single study, conducted by P. C. Schön et al. [27] in 2004, has focused on scream classification as well as detection. However, this research is based on chimpanzee screams. The authors worked out ways in which chimpanzees can be understood and what different kinds of meaning can be derived from their screams. Two studies, K. Kato [19] and E. R. Siebert et al. [29], did not use machine learning techniques and instead developed their own methods for scream detection and classification.
There is therefore a clear and wide scope for scream classification in understanding the situations in which screams occur, and in supporting embedded sound-based systems, especially surveillance systems, to keep humans and animals out of danger.

V. CONCLUSION
A thorough analysis is presented on researchers' attempts related to scream detection and classification techniques. An in-depth taxonomy of scream detection and classification systems was presented in this review. The concerning efforts are expected to maximize scream signal detection and classification accuracy and understanding the surrounding situation of a speaker. The focus of this review is on machine learning and classification methods as well as essential sound parameters for scream-based audio embedded systems.
The best combination that can be concluded for scream classification is an unsupervised learning technique, namely GMM, applied to spectral sound features that include MFCC, in the field of surveillance. Because scream detection and sound classification have been implemented in surveillance in a remarkably high percentage of studies, there is a chance that surveillance systems based on scream detection and classification are causing a higher risk to humanity. However, these conclusions rest on information and statistics drawn from different kinds of datasets using various combinations of sound parameters and classification techniques. The results may vary with the datasets used and the background noise level.
In the future, this review can help researchers design a mechanism for scream classification and understand the best possible alternatives in terms of classification techniques and sound parameters. A system can be developed from these conclusions to distinguish different classes of screams, such as joy, fear, and sadness, and to find out how such research can help in understanding the surroundings of a speaker.