Static vs. Dynamic Modelling of Acoustic Speech Features for Detection of Dementia

Dementia is a chronic neurological disease that causes cognitive disabilities and significantly impacts the daily activities of affected individuals. Early detection of dementia is known to improve patients' quality of life through specialized care programs. Recently, there has been growing interest in speech-based screening of neurological diseases such as dementia. The focus is on continuous monitoring of changes in the speech of dementia patients, with the aim of identifying the early onset of the disease, which could facilitate the development of preventative treatment and care. In this work, we propose dynamic (temporal) modelling of acoustic speech characteristics to identify signs of dementia. The classification performance of the proposed framework is compared with a baseline static modelling of acoustic speech features. Experimental results show that the proposed dynamic approach outperforms the static method: it achieves a classification accuracy of 74.55%, compared to 66.92% obtained using the static models.

Keywords: Dementia detection; speech classification; neural networks; recurrent neural networks


I. INTRODUCTION
Dementia affects a large number of people worldwide. It is estimated that more than 50 million individuals currently suffer from this disease, and the number is expected to grow to 75.63 million by the end of 2030 [1]. Dementia is an umbrella term for a set of progressive neurological diseases that lead to impairment or even complete loss of language, memory, thought processes, and problem-solving abilities, which compromises the quality of life of affected individuals. There are various types of dementia. In Alzheimer's dementia, nerve cell connections and communication are impaired, eventually leading to nerve cell death and tissue loss throughout the brain. Over time, the brain shrinks dramatically, affecting nearly all its functions. Dementia can also be caused by prolonged high blood pressure as well as strokes (known as vascular dementia) [2], [3]. Alzheimer's disease disrupts both the way electrical charges travel within nerve cells and the activity of neurotransmitters, thus affecting memory, movement, and thinking ability, depending on the region of the brain that is affected.
Traditional methods for diagnosis are based on neurophysiological tests [4], [5] and neuroimaging (MRI) [6]. However, in recent years there has been a growing focus on less invasive sensing technologies, in particular speech-based diagnosis and monitoring of dementia [7]. This year, the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) Challenge was organized at the Interspeech 2020 conference, encouraging researchers to develop automated methods for detecting Alzheimer's dementia from speech recordings of patients [8].
This study focuses on developing a method for automated detection of dementia using the dataset provided as part of the ADReSS challenge. We propose a framework based on temporal modelling of acoustic features and demonstrate its effectiveness for the task of identifying individuals with dementia. The performance of the temporal models is benchmarked against static models, which are based on functionals of descriptive statistics. The paper is organized as follows. In section II, we present a brief literature review, and in section III we provide a summary of the ADReSS challenge dataset. In section IV, we detail the methodology of the temporal modelling framework. Experimental results and discussion are provided in sections V and VI, and a summary and future outlook are presented in section VII.

II. RELATED WORK
The work most directly relevant to this study is the baseline paper of the ADReSS challenge by Luz et al. [8], which introduced the ADReSS dataset. The dataset consists of speech recordings of subjects from two groups: healthy individuals and Alzheimer's dementia sufferers. For the classification baseline, Luz et al. computed four types of acoustic feature sets to represent the speech characteristics of subjects from the two groups: (i) emobase [9], (ii) Computational Paralinguistics Challenge (ComParE) [10], (iii) Extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [11], and (iv) Multi-Resolution Cochleagram (MRCG) [12]. In that work, functionals-based static modelling was used to generate a representation of each speech recording from the above-mentioned acoustic feature sets. Results showed that ComParE provided the best classification performance amongst the four feature sets. In [13], Haider et al. investigated the feasibility of paralinguistic features for recognizing Alzheimer's dementia from recordings of spontaneous speech. Notably, while the authors used the same feature sets as [8], they found that the eGeMAPS feature set provided the best classification performance. This suggests that the efficacy of a feature set largely depends on the dataset itself.
The literature shows that linguistic features are commonly used for the task of recognizing dementia from speech. Fraser et al. [14] computed a large number of features to capture linguistic phenomena, such as grammatical constituents, information content, part-of-speech, psycholinguistics, richness of vocabulary, and the syntactic complexity of speech transcripts. In addition to these features, Fraser et al. investigated the existence of acoustic abnormality using features derived from the Mel Frequency Cepstral Coefficients (MFCCs) [15], which provide information about the spectral characteristics of speech. Based on their investigation, they reported that the language of individuals with dementia is semantically impoverished and deficient in syntax and information, in addition to showing abnormal speech. In another notable study, Mirheidari et al. [16] constructed a feature vector consisting of acoustic and linguistic features for the same task. In addition, they computed conversational features that are tuned towards identifying individuals with memory disorders [17]. Their findings suggest that conversational features provide the best classification performance when transcripts are annotated manually. However, when automated speech recognition was used, the classification accuracy decreased significantly, from 96.70% to 76.70%. A decrease in accuracy was also observed for linguistic features with automatically generated transcripts. These results highlight the limitations of automated screening methods based on transcripts, especially when high-fidelity speech transcription is not possible, which can be the case for people suffering from diseases that affect their speaking capability. In the current work, therefore, we focus only on the audio modality.

III. DATASET
To validate the proposed methodology, we used the dataset provided by the ADReSS challenge at Interspeech 2020. This is advantageous because the dataset is available in the public domain, so our experiments can be reproduced by other researchers. The dataset includes speech recordings from 144 subjects in total; one half are individuals with dementia, whereas the other half are healthy individuals. The recordings have an average duration of 75.30 seconds, a standard deviation of 38.38 seconds, and a maximum duration of 268.48 seconds. Given the significant variation in the duration of the speech recordings, we segment each recording into non-overlapping 10-second chunks so that classifiers are trained on speech of equal duration. At the evaluation stage, majority voting over the chunks of a recording was used for the classification performance analysis. This means that the class supported by the larger share of chunk-level decisions was assigned to the input sample. A summary of the dataset distribution is provided in Table I.
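The segmentation and voting steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `split_into_chunks` and `majority_vote` are hypothetical, and dropping a trailing remainder shorter than 10 seconds is an assumption, since the handling of the remainder is not stated.

```python
from collections import Counter


def split_into_chunks(samples, sample_rate, chunk_seconds=10):
    """Split a 1-D sequence of audio samples into non-overlapping chunks.

    Assumption: a trailing remainder shorter than chunk_seconds is dropped.
    """
    chunk_len = int(sample_rate * chunk_seconds)
    n_chunks = len(samples) // chunk_len
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


def majority_vote(chunk_labels):
    """Return the label predicted for most chunks of one recording.

    Ties go to the label that was seen first (an arbitrary choice).
    """
    return Counter(chunk_labels).most_common(1)[0][0]


# Toy example: a 35-second "recording" sampled at 100 samples per second.
recording = list(range(3500))
chunks = split_into_chunks(recording, sample_rate=100)

# Hypothetical chunk-level predictions for the three full chunks.
recording_label = majority_vote(["dementia", "healthy", "dementia"])
```

In this toy case the recording yields three full chunks, and the recording-level label follows the two chunks that agree.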

IV. METHODOLOGY
We hypothesize that the temporal characteristics of speech acoustics can be useful for distinguishing between healthy individuals and individuals with dementia. The hypothesis is based on the fact that patients with dementia show a lack of speech fluency and exhibit other rhythmic issues (pausing and forgetting, difficulty joining or following a conversation).
For this analysis, we start by computing low-level acoustic speech descriptors (LLDs) that preserve the paralinguistic aspects of speech. The LLDs are extracted from the speech waveforms segment by segment, thus preserving the temporal characteristics. We normalize the LLD features using standard scaling (z-scores) and then pass them to a recurrent sequence network, which generates an embedding that preserves the temporal characteristics of speech. Finally, a fully connected dense classifier is applied to distinguish subjects with dementia from healthy subjects. The functional block diagram of the proposed temporal modelling framework is shown in Fig. 1.
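As a small illustration of the standard scaling step, the sketch below standardizes each LLD dimension to zero mean and unit variance. It uses only the Python standard library; the helper name `zscore_columns` and the toy two-dimensional frames are assumptions made for the example, and the population standard deviation is used, which matches the usual definition of a z-score.

```python
from statistics import mean, pstdev


def zscore_columns(frames):
    """Standardize each feature column of a (frames x features) list of lists."""
    columns = list(zip(*frames))
    means = [mean(col) for col in columns]
    # Guard against constant features, whose standard deviation is zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in frames
    ]


# Toy LLD matrix: three frames, two features (the second one is constant).
lld_frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = zscore_columns(lld_frames)
```

After scaling, the first feature has zero mean and unit variance across frames, while the constant feature maps to zeros.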

A. Audio Features for Speech Paralinguistics
The term speech paralinguistics refers to the non-verbal and non-linguistic aspects of speech. As per Schuller et al. [18], paralinguistics are important facets of communication, through which human beings naturally communicate their underlying emotional states without explicitly describing them. It has been shown that audio (acoustic) features representing speech paralinguistics can effectively characterize speech manifestations of mental and neurological disorders, and can therefore be used to screen individuals for various disorders such as autism, bipolar disorder, depression, dementia, and Parkinson's disease [19], [20], [21], [13]. Here, we utilize three expert-knowledge based acoustic feature sets that are known to adequately represent the characteristics of speech paralinguistics. These feature sets are the Interspeech 2010 Paralinguistics Challenge feature set (IS10-Paralinguistics) [22], the Interspeech 2013 Computational Paralinguistics Challenge (ComParE) feature set [18], and the Extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [11]. Given that these feature sets are well known, we refer the reader to [22], [18], [11] for further details.

B. Temporal Modeling of Acoustic Features
Recurrent neural networks (RNNs) are a special class of deep neural networks that can learn temporal characteristics from time-series data. In the context of this work, RNNs are used to model temporal variations in speech paralinguistics within the utterances of subjects from the two groups. Although there are various types of recurrent models, the two most popular structures are the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). Both provide improvements over the legacy recurrent neural network structure (also known as the vanilla RNN), which has historically been difficult to train [23], [24]. This study makes use of four variations of the LSTM and one variation of the GRU for the temporal modelling.

1) Long Short Term Memory (LSTM):
The LSTM was invented by Hochreiter and Schmidhuber [25] with the aim of alleviating the vanishing/exploding gradient problem of vanilla RNNs. An LSTM cell, shown in Fig. 2, consists of four interacting layers: the forget gate, the input gate, the update layer, and the output gate. Each of these can be considered a fully connected network in its own right and is therefore trainable. These layers enable the LSTM cell to learn temporal patterns by managing the hidden state $h_t$, the cell state $C_t$, and the output of the LSTM cell $y_t$.

Fig. 2. Illustration of an LSTM Cell (Adapted from [26]).
The first layer in an LSTM cell is the forget gate, which is responsible for identifying the parts of the previous cell state ($C_{t-1}$) that should be removed during the forthcoming update. If this is the first time step of the training process, the previous cell state is generated through random initialization. Next, the cell state is updated with new information. Here, the input gate first identifies which values within the cell state should be updated (and by how much). Then, a candidate vector of cell state values ($\tilde{C}_t$) is prepared by passing the combination of the input and the hidden state through a tanh activation, which squashes the values to between -1 and 1. The cell state is then updated by the summation of two products: (1) the output of the forget gate and the cell state from the previous time step ($f_t * C_{t-1}$) and (2) the input gate output and the candidate cell state ($i_t * \tilde{C}_t$). The output gate decides which parts of the cell state should be produced as the output of the LSTM cell. Finally, the hidden state of the LSTM cell is updated for the next time step by multiplying the cell output by the tanh-squashed cell state. In the mathematical formulation of this process flow, $x_t$ represents the input $N$-dimensional acoustic features.

A bidirectional LSTM (BLSTM) is similar to an LSTM network, except that instead of processing the input only in the forward direction with one LSTM cell, a BLSTM has a second cell, side by side with the first, which is fed the input in reverse. In this way, a BLSTM is able to learn from future occurrences in sequences and can form more complex models than the simple LSTM. BLSTMs have been used successfully in a variety of applications (automatic speech recognition [27], voice conversion [28], etc.).
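For reference, the LSTM cell update described in this subsection is commonly written in the following form. This is the widely used textbook formulation and not necessarily the exact notation of the original equations: the weight matrices $W_f, W_i, W_C, W_o$, the bias vectors $b_f, b_i, b_C, b_o$, the logistic sigmoid $\sigma$, and the concatenation $[h_{t-1}, x_t]$ are assumed symbols, and $*$ denotes element-wise multiplication.

```latex
\begin{align}
f_t         &= \sigma\left(W_f \, [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t         &= \sigma\left(W_i \, [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C \, [h_{t-1}, x_t] + b_C\right)  && \text{(candidate cell state)} \\
C_t         &= f_t * C_{t-1} + i_t * \tilde{C}_t              && \text{(cell state update)} \\
o_t         &= \sigma\left(W_o \, [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t         &= o_t * \tanh\left(C_t\right)                    && \text{(hidden state)}
\end{align}
```

The fourth line is the summation of the two products named in the text, and the last line is the product of the output gate activation and the tanh-squashed cell state.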

2) Gated Recurrent Unit (GRU):
The GRU is a temporal sequence network proposed by Cho et al. [29] with a cell structure that is simpler than that of the LSTM and has fewer parameters. The GRU cell, as shown in Fig. 2, achieves the reduced complexity by combining the functions of the forget and input gates of the LSTM cell into a single update gate.
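For reference, one common way of writing the GRU cell update is given below. It is a sketch in assumed notation rather than the original equations: $z_t$ is the update gate, $r_t$ the reset gate, $\tilde{h}_t$ the candidate hidden state, $W_z, W_r, W_h$ and $b_z, b_r, b_h$ are assumed weight and bias symbols, and $*$ denotes element-wise multiplication. References differ on whether $z_t$ or $1 - z_t$ weights the previous hidden state.

```latex
\begin{align}
z_t         &= \sigma\left(W_z \, [h_{t-1}, x_t] + b_z\right)       && \text{(update gate)} \\
r_t         &= \sigma\left(W_r \, [h_{t-1}, x_t] + b_r\right)       && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h \, [r_t * h_{t-1}, x_t] + b_h\right)  && \text{(candidate hidden state)} \\
h_t         &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t              && \text{(hidden state)}
\end{align}
```

The single gate $z_t$ controls both how much of the previous state is kept and how much new information is written, which is the merging of the forget and input gates mentioned in the text.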
The mathematical formulation of the process flow within the GRU cell is given in [29].

3) Temporal models for recognizing dementia speech:
Determining a particular neural network architecture is usually a trial-and-error process based on cross-validation and guided by intuition. To overcome this, we investigate the efficacy of five temporal models to identify the best performing one for the task at hand. A summary of the structure of these models is provided in Table II.

The first model consists of a three-layer stacked LSTM with two dense layers, including the final layer, which serves as the classifier. The first and last LSTM layers have a number of recurrent units equal to the dimensionality ($N$) of the input acoustic features. For example, since ComParE LLDs have a dimensionality of 65, the first and last LSTM layers in Model 1 have 65 units. The middle layer is twice the size of the first and last LSTM layers. We chose these settings on the assumption that a larger number of recurrent units may assist in learning complex temporal patterns from acoustic features. However, this is not straightforward, since the limited amount of training data may impact the learning ability of deeper LSTM models. Moreover, a dense layer was added to investigate whether a fully connected layer can aid in learning a more meaningful representation from the acoustic embedding learned by the networks. Model 2 is identical to Model 1 except that it consists of GRU blocks rather than LSTM cells.
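To give a sense of model size, the sketch below counts the recurrent parameters of the N, 2N, N stack used in Model 1 and Model 2 for N = 65. It relies on the textbook parameter counts for LSTM and GRU layers (four and three gate blocks respectively, each with input weights, recurrent weights, and one bias vector); the function names are hypothetical, the dense layers are not counted, and some library implementations of the GRU add a second bias vector per block, which would change the GRU total slightly.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: four blocks of input weights,
    recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    """Parameters of one classic GRU layer: three blocks of input weights,
    recurrent weights, and a bias vector."""
    return 3 * (units * (input_dim + units) + units)


def stacked_params(layer_params, n_features):
    """Sum parameters over the N -> 2N -> N recurrent stack."""
    sizes = [n_features, 2 * n_features, n_features]
    total, input_dim = 0, n_features
    for units in sizes:
        total += layer_params(input_dim, units)
        input_dim = units  # the next layer consumes this layer's output
    return total


n = 65  # dimensionality used in the text's ComParE example
lstm_total = stacked_params(lstm_params, n)
gru_total = stacked_params(gru_params, n)
```

Under these assumptions the GRU stack has exactly three quarters of the recurrent parameters of the LSTM stack, which is one way of quantifying the reduced complexity of the GRU cell.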
Model 3 is based on bidirectional LSTMs (BLSTMs). It consists of two BLSTM layers, each with a number of units equal to the dimensionality of the input acoustic LLDs, and two dense layers. Here, we used a smaller number of BLSTM layers, since a BLSTM naturally trains two LSTMs, one in the forward direction and the other in the reverse direction of the sequence.
Model 4 and Model 5 are similar to Model 1 and Model 3, respectively, except that an attention layer [30] is added to the networks. The attention layer takes into consideration the effect of long-term speech characteristics that may be learned by the LSTMs, and attention has been found to improve LSTM performance [31].
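The idea behind an attention layer placed on top of a recurrent network can be illustrated with a simple attention pooling sketch. The exact attention variant of [30] is not specified here, so the dot-product scoring below is an assumption, as are the function name `attention_pool` and the fixed `query` vector, which would be a learned parameter in a trained model.

```python
import math


def attention_pool(hidden_states, query):
    """Collapse a sequence of hidden states into one context vector.

    hidden_states: list of T vectors (one per time step), each of length D.
    query: length-D vector that would normally be learned; fixed here.
    """
    # One relevance score per time step (dot product with the query).
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Softmax over time steps, shifted by the max score for numerical stability.
    peak = max(scores)
    exp_scores = [math.exp(s - peak) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    # Weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return context, weights


# Toy sequence of three 2-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attention_pool(states, query=[1.0, 1.0])
```

The weights sum to one, and the time step that best matches the query contributes most to the context vector, so informative frames anywhere in the sequence can influence the final representation.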

C. Static Modelling of Acoustic Features
Whereas temporal models seek to learn a global representation for speech recordings by explicitly considering inter-frame changes, static modelling generates a global representation of speech through feature aggregation. The simplest method of static modelling is to compute functionals of descriptive statistics for acoustic features. While this method may appear trivial, it has often produced state-of-the-art classification performance for paralinguistics tasks [32], [33]. Further, we found static modelling of acoustic features to be useful in our previous research on recognition of perceived trustworthiness [?]. Therefore, we benchmark the classification performance of temporal models against the static models for the task of speech-based detection of dementia.
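A minimal sketch of static modelling through functionals is given below. The choice of four statistics (mean, standard deviation, minimum, maximum) and the name `functionals` are assumptions made for illustration; the standard feature set configurations apply a considerably larger set of functionals.

```python
from statistics import mean, pstdev


def functionals(frames):
    """Aggregate a (frames x features) list of LLD vectors into one
    fixed-length vector: mean, std, min, and max of each feature."""
    vector = []
    for column in zip(*frames):
        vector.extend([mean(column), pstdev(column), min(column), max(column)])
    return vector


# Two toy recordings of different lengths, each with two LLD features.
short_recording = [[0.0, 2.0], [2.0, 4.0]]
long_recording = [[0.0, 2.0], [1.0, 3.0], [2.0, 4.0], [1.0, 3.0]]

short_vec = functionals(short_recording)
long_vec = functionals(long_recording)
```

Both recordings map to vectors of the same length regardless of their duration, which is what allows a standard classifier to be applied; the order of the frames, however, no longer affects the representation.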

V. EXPERIMENTS AND RESULTS
The classification experiments were conducted using stratified 5-fold cross-validation. Stratification ensures that the distribution of labels in the training partition matches the distribution in the test partition. The performance of the temporal models was benchmarked against the static models using the same LLD feature sets under the same cross-validation settings. For the static models, two types of classifiers, the support vector machine classifier (SVC) and the random forest classifier (RFC), are used with the default settings of the Scikit-learn toolkit [34]. The results are presented in Table III, where it can be seen that the best performing model, ComParE-SVC, yields an accuracy of 66.92%.

In Table IV, we summarize the classification performance of the static and temporal models for the IS10-Paralinguistics acoustic features. As can be seen, the best performance for the static modelling of features is 61.92%, whereas the best performing temporal model, Model 5, which uses BLSTM-Attention, achieves an accuracy of 74.55%. It is closely followed by Model 4, which is based on LSTM-Attention and achieves an accuracy of 72.50%. While these results are preliminary, they indicate that the attention mechanism can assist in improving the classification models.
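The effect of stratification can be illustrated with a simplified, standard-library sketch. It is not the procedure used in the experiments: the function name `stratified_folds` is hypothetical, the indices are dealt round-robin without shuffling, and in practice a library routine such as Scikit-learn's `StratifiedKFold` would normally be used.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=5):
    """Assign each sample index to a fold so that every fold keeps roughly
    the same class proportions as the full label list."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy balanced label list: 10 samples per class.
labels = ["dementia"] * 10 + ["healthy"] * 10
folds = stratified_folds(labels)

# Each fold serves once as the test partition; the rest form the training set.
splits = []
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    splits.append((train, test_fold))
```

With a balanced label list, every test fold in this toy example contains the same number of samples from each class, mirroring the balance of the training partition.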
The classification performance using the ComParE feature set is summarized in Table V. Here, the best static model achieves a classification accuracy of 62.31%. In contrast, the best model amongst the temporal modelling approaches achieves an accuracy of 69.60%. Interestingly, the best sequence modelling result is significantly lower than that of the previous experiment. We suggest this is because the dimensionality of ComParE is much larger than that of the IS10-Paralinguistics features (130 versus 76), and there are not enough examples in the dataset to adequately train temporal models with ComParE features.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 10, 2020

Finally, Table VI summarizes the classification performance of the static and temporal modelling using eGeMAPS features. As can be seen, the best performing model for the static features is eGeMAPS-RF, which achieves an accuracy of 62.31%, whereas the best performing temporal model is Model 5, which provides an accuracy of 73.84%. The second-placed temporal model, Model 4, achieves an accuracy of 71.70%. Both of these results are superior to the accuracy achieved through static modelling.

VI. DISCUSSION
The results presented above lead to some interesting observations. Firstly, we note that the temporal models provide higher classification performance than the static models. A caveat is the performance of the temporal models with ComParE features, which offer relatively smaller improvements than the IS10-Paralinguistics and eGeMAPS feature sets. We suggest this is due to the large dimensionality of the ComParE LLDs (130 for ComParE versus 76 for IS10-Paralinguistics and 23 for eGeMAPS LLDs, respectively). Another observation is that the attention mechanism consistently improved the classification performance of the temporal models. We believe this is because the attention mechanism assists the temporal models in focusing on the time-dependent characteristics of acoustic LLDs that are unique to individuals with dementia.
Table VII summarizes the classification results of the top three models. All of these models are based on temporal modelling with an attention mechanism, and two of them make use of the IS10-Paralinguistics feature set. As can be seen, a significant improvement in classification accuracy is achieved compared to the best performing static model (which achieved 66.92%).

VII. CONCLUSION
In this paper, we investigated the effectiveness of temporal modelling versus static modelling of speech acoustics for the detection of individuals with dementia. We benchmarked the proposed temporal models against static models of acoustic features. Experimental results showed that temporal modelling is a more effective approach for the intended classification task, with a best-case accuracy of 74.55% in a 5-fold cross-validation setup. This accuracy may not be sufficient to support a medical diagnosis; however, it is sufficiently high to conduct low-cost rapid screening for dementia that could be followed up by a professional assessment. The proposed acoustic speech classification could be used either alone or in combination with transcript analysis, questionnaires, and other standard screening techniques.
Our future work will investigate application of other deep learning models and integration of dynamic and static approaches into a combined decision-making system.