Artificial Neural Networks and Support Vector Machine for Voice Disorders Identification

The diagnosis of voice diseases through invasive medical techniques is efficient but often uncomfortable for patients. Automatic speech recognition methods have therefore attracted growing interest in recent years and have proved successful in identifying voice impairments. In this context, this paper proposes a reliable algorithm for voice disorders identification based on two classification algorithms: Artificial Neural Networks (ANN) and the Support Vector Machine (SVM). Feature extraction is performed with the Mel Frequency Cepstral Coefficients (MFCC) and their first and second derivatives. In addition, Linear Discriminant Analysis (LDA) is proposed as a feature selection procedure to enhance the discriminative ability of the algorithm and reduce its complexity. The proposed voice disorders identification system is evaluated with widespread performance measures: accuracy, sensitivity, specificity, precision and Area Under the Curve (AUC).

Keywords: Automatic Speech Recognition (ASR); Pathological voices; Artificial Neural Networks (ANN); Support Vector Machine (SVM); Linear Discriminant Analysis (LDA); Mel Frequency Cepstral Coefficients (MFCC)


I. INTRODUCTION
When the mechanism of voice production is affected, the voice becomes pathological and sometimes unintelligible, which makes social integration and everyday communication between members of the same community difficult. The diagnosis of voice impairments is therefore essential. Voice disorders can be classified into three main categories: organic, functional, or a combination of both [1]. This study focuses on organic voice disorders. A voice disorder is organic if it is caused by a structural (anatomic) or physiologic disease, either a disease of the larynx itself or a remote systemic or neurologic disease that alters laryngeal structure or function [2]. In this research, we have worked on both structural and neurogenic disorders. Four pathologies are examined: chronic laryngitis, cyst, Reinke's edema and spasmodic dysphonia, since they are widespread and their medical analysis remains tricky to date.

Among the many techniques used to identify voice diseases, automatic acoustic analysis has proven its efficiency in recent years and has attracted growing interest. The advantage of acoustic analysis is its nonintrusive nature and its potential for providing quantitative data with reasonable expenditure of analysis time [3]. Several techniques have therefore been introduced and many studies have been conducted in the literature. Some of these studies indicate that voice disorders can be identified by combining the Mel Frequency Cepstral Coefficients (MFCC) with the harmonics-to-noise ratio, normalized noise energy and glottal-to-noise excitation ratio, using a Gaussian mixture model as classifier [4]. Daubechies' discrete wavelet transform, linear prediction coefficients and the least-squares Support Vector Machine (LS-SVM) were investigated in [5]. In addition, a voice recognition algorithm was proposed in [6] based on the MFCC coefficients and their first and second derivatives, the F-ratio and Fisher's discriminant ratio as feature reduction methods, and a Gaussian Mixture Model (GMM) as classifier; its main idea is to demonstrate that voice impairments can be detected using both the mel cepstral vectors and their first derivative, ignoring the second derivative.

In this paper, we show that the contribution of the first and second derivatives of the MFCC features depends mainly on the classifier. To this end, Artificial Neural Networks (ANN) and the Support Vector Machine (SVM) are investigated as classifiers and their respective performances are compared. Three combinations of the MFCC features and their first and second derivatives are proposed for the feature extraction task. To select the most relevant parameters from the resulting feature vector, Linear Discriminant Analysis (LDA) is suggested as a feature selection procedure. System performance is assessed in terms of accuracy, sensitivity, specificity, precision and Area Under the Curve (AUC). The next section describes the methodology, the database and the performance measures used in this work. Section 3 presents the experimental results and Section 4 discusses them. Finally, Section 5 concludes the paper.

A. Database
In this research, the voice samples were selected from the "Saarbrucken Voice Database" (SVD) [7], [8], a German database of pathological voices collected in collaboration with the Department of Phonetics and ENT at the Caritas clinic St. Theresia in Saarbrucken and the Institute of Phonetics of the University of the Saarland. It contains 2225 voice samples recorded at a sampling rate of 50 kHz with a 16-bit amplitude resolution. Subjects sustained the vowels [i], [a] and [u] for 1 s. In this study, sustained phonations of the vowel [a] produced by 50 normal speakers and 70 patients were examined. Four pathologies are investigated: chronic laryngitis (24), cyst (6), Reinke's edema (19) and spasmodic dysphonia (21).

B. The Proposed Algorithm
In this paper, acoustic features are extracted from the speech signal with the MFCC parameterization method. In addition, the first and second derivatives, which describe how the original MFCC features vary over time, were investigated to verify their contribution to the proposed algorithm. To optimize voice disorders detection, a projection-based Linear Discriminant Analysis (LDA) is suggested as feature selection method, and optimized and non-optimized features are compared for every tested combination. For the classification task, Artificial Neural Networks (ANN) are used as an unconventional approach alongside the Support Vector Machine (SVM), a newer method successfully exploited in recent years (Fig. 1).

C. Feature Extraction Method
Feature extraction is one of the most crucial tasks in the speech recognition process. In this research, the Mel Frequency Cepstral Coefficients (MFCC) were chosen because they are robust, commonly used, and have proven their efficiency in speech recognition.
The MFCC method is a nonparametric frequency-domain approach based on the human auditory perception system. As presented in Fig. 2, MFCC extraction starts by splitting the speech signal into short frames, since the signal varies slowly in time and can be treated as a stationary random process over a short time frame [9]. Each frame is then windowed with a 30 ms Hamming window, with no pre-emphasis, and the frames are extracted with a 50% frame shift. The spectral coefficients of each frame are estimated with the nonparametric fast Fourier transform (FFT). Because the human auditory system perceives sound on a nonlinear frequency scale, Mel filtering is then performed: the spectrum is filtered by a bank of triangular bandpass filters that simulate the characteristics of the human ear [9], [10]. The following equation gives the Mel frequency $f_{Mel}$ corresponding to a linear frequency $f_{Hz}$ in Hz:

$$f_{Mel} = 2595 \log_{10}\left(1 + \frac{f_{Hz}}{700}\right)$$

The Mel filtering procedure thus approximates the nonlinear frequency resolution of the human auditory system. At this stage, a natural logarithm is applied to each output of the Mel filter bank. Finally, the Discrete Cosine Transform (DCT) converts the log Mel spectrum back into the time (cepstral) domain, which yields the Mel Frequency Cepstral Coefficients (MFCC). There are several ways to approximate the first derivative of a cepstral coefficient; in this research, we use the following formula [11]:

$$\Delta x(t) = \frac{\sum_{m=-M}^{M} m \, x(t+m)}{\sum_{m=-M}^{M} m^{2}}$$

where $x(t)$ is the cepstral coefficient, $t$ is the frame number and $2M + 1$ is the number of frames considered in the evaluation. Applying the same formula to the first derivative produces the second derivative (acceleration).
For each time frame, the feature vector is composed of N original cepstral coefficients, N delta coefficients and N delta-delta coefficients, where N is the number of MFCC features chosen for a simulation. In this work, several experiments were conducted using 13 original MFCC features, their first derivatives and their accelerations in order to compare the different proposed combinations.
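The regression formula for the derivatives and the three feature combinations (13, 26 and 39 coefficients per frame) can be sketched as below. The window half-width M = 2 and the repetition of the first and last frames at the edges are assumptions, since the text does not state them; how the frame-level vectors are pooled per recording is not described here, so the sketch stops at the frame level.

```python
import numpy as np

def deltas(feats, M=2):
    # Regression estimate of the time derivative over 2M + 1 frames:
    # d(t) = sum_{m=-M..M} m * x(t + m) / sum_{m=-M..M} m^2
    # Edge frames are repeated (assumption) so the output keeps the input's frame count.
    T = feats.shape[0]
    padded = np.concatenate([np.repeat(feats[:1], M, axis=0), feats,
                             np.repeat(feats[-1:], M, axis=0)])
    denom = 2.0 * sum(m * m for m in range(1, M + 1))
    out = np.zeros_like(feats, dtype=float)
    for m in range(-M, M + 1):
        out += m * padded[M + m: M + m + T]
    return out / denom

def stack_features(mfcc_feats, use_delta1=True, use_delta2=True, M=2):
    # MFCC (N), MFCC_Delta1 (2N) or MFCC_Delta1&2 (3N) per frame
    parts = [mfcc_feats]
    d1 = deltas(mfcc_feats, M)
    if use_delta1:
        parts.append(d1)
    if use_delta2:
        parts.append(deltas(d1, M))   # acceleration = derivative of the derivative
    return np.hstack(parts)
```

For a coefficient that grows linearly by one unit per frame, the interior derivative estimates equal exactly 1, which is a quick sanity check of the formula.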

D. Feature Selection Procedure
In this research, Linear Discriminant Analysis (LDA), a supervised subspace learning method based on the Fisher criterion [12], is suggested as a feature selection procedure. It estimates a projection matrix that maps features from an h-dimensional space to a k-dimensional space (k < h) in which the between-class scatter is maximized while the within-class scatter is minimized. The within-class scatter measures the average variance of the data within each class, while the between-class scatter represents the average distance between the mean of each class and the global mean [13]. LDA is investigated in this research to optimize the proposed identification algorithm, since it is able to select the most relevant parameters from a feature vector, reducing the complexity of the system while improving recognition rates.
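The Fisher criterion described above can be sketched with NumPy as follows: build the within-class scatter $S_W$ and the between-class scatter $S_B$, then keep the leading eigenvectors of $S_W^{-1} S_B$. Using the pseudo-inverse guards against a singular $S_W$ when samples are few; this is an assumption, not a detail from the paper. Note that with two classes (normal and pathological), $S_B$ has rank one, so classical LDA yields at most one discriminant direction; the number of components the authors kept is not stated.

```python
import numpy as np

def fit_lda(X, y, k):
    # Fisher LDA: projection W maximizing between-class scatter relative to within-class scatter
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)        # within-class scatter
        diff = (mean_c - mean_all)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)           # between-class scatter
    # Eigenproblem of pinv(Sw) @ Sb (pseudo-inverse is an assumption for robustness)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    k = min(k, len(classes) - 1)                      # rank of Sb is at most C - 1
    return eigvecs[:, order[:k]].real                 # columns span the LDA subspace

# Projected features: X_reduced = X @ W
```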

E. Classification Algorithms
Two classification algorithms are proposed in this work, and their performance rates are compared in order to determine the most effective classifier for the identification of voice disorders.

1) Support Vector Machine:
Support Vector Machines are a class of learning techniques introduced by Vladimir Vapnik in the early 1990s [14], [15]. In binary classification, the training data come from only two classes, labeled +1 and -1. The idea of the SVM is to find the hyperplane that separates the two classes with the maximum margin. If the data are linearly separable, a "hard-margin SVM" can be used; otherwise, a "soft-margin SVM" tolerates some margin violations and misclassified samples. When the boundary between the classes is nonlinear, the data are mapped into a higher-dimensional space in which a linear separation becomes possible. This mapping is usually performed implicitly through a kernel function, and the new space is called the "feature space". The most widely used SVM kernel functions are the linear kernel, the polynomial kernel and the Gaussian Radial Basis Function (RBF) kernel.
Training the SVM classifier amounts to finding the hyperplane that maximizes the margin, called the "optimal separating hyperplane".
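For reference, the margin maximization described above is commonly written as the soft-margin primal problem below, where $\phi$ is the mapping into the feature space, the slack variables $\xi_i$ measure margin violations and $C$ sets their penalty; the margin width is $2 / \lVert \mathbf{w} \rVert$, and forcing all $\xi_i = 0$ recovers the hard-margin case. This is the standard textbook formulation (the C-SVC problem solved by LIBSVM), not an equation from this paper, and the value of $C$ used in the experiments is not reported.

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\left(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i,
\quad \xi_i \ge 0, \quad y_i \in \{-1, +1\}
```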
In this research, the classifier was trained with the Gaussian Radial Basis Function (RBF) kernel using LIBSVM, an SVM library [16].
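The Gaussian kernel used here is $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma \lVert \mathbf{u} - \mathbf{v} \rVert^{2})$, which is the form LIBSVM selects with the option `-t 2`. The sketch below computes the kernel matrix and the resulting decision function $f(\mathbf{x}) = \sum_i y_i \alpha_i K(\mathbf{s}_i, \mathbf{x}) + b$ for given support vectors $\mathbf{s}_i$. The helper names are hypothetical, and the values of $\gamma$ and of the trained coefficients are placeholders, since the paper does not report them; LIBSVM stores the products $y_i \alpha_i$ as `sv_coef` and the bias as $-\rho$.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2), the Gaussian (RBF) kernel
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negative round-off

def svm_decision(X, support_vectors, dual_coef, intercept, gamma):
    # f(x) = sum_i (y_i * alpha_i) K(s_i, x) + b ; sign(f) gives the predicted class (+1 / -1)
    return rbf_kernel(X, support_vectors, gamma) @ dual_coef + intercept
```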

2) Artificial Neural Networks:
Artificial Neural Networks are among the most effective approaches for speech recognition thanks to their numerous architectures and learning algorithms. In this paper, the proposed neural network is composed of three layers: an input layer that transmits the input features without distortion, a hidden layer of 250 neurons with a sigmoid activation function, and an output layer containing a single neuron with a linear activation. Each layer is fully connected to the next one. The network is trained with Bayesian regularization: the network weights are adjusted at every learning step so that the output comes as close as possible to the target data [17].
The Bayesian approach treats the network weights as random variables described by a probability distribution. Learning then consists in determining this distribution given the training data: after examination of the training data, the prior distribution assigned to the weights before learning is transformed into a posterior distribution through the application of Bayes' theorem [17].
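The architecture above (sigmoid hidden layer of 250 neurons, one linear output neuron) and the objective minimized under Bayesian regularization, $F = \beta E_D + \alpha E_W$ with $E_D$ the sum of squared errors and $E_W$ the sum of squared weights, can be sketched as follows. This is a simplified, hypothetical class `BayesRegMLP`: $\alpha$ and $\beta$ are held fixed instead of being re-estimated from the data, regularizing the biases is an assumption, and plain gradient descent stands in for the Levenberg-Marquardt optimizer used, for instance, by MATLAB's `trainbr`. Thresholding the linear output to obtain a class label is also an assumption.

```python
import numpy as np

class BayesRegMLP:
    """Hypothetical sketch: sigmoid hidden layer + one linear output neuron,
    trained on F = beta * E_D + alpha * E_W with alpha and beta held fixed."""

    def __init__(self, n_inputs, n_hidden=250, alpha=0.01, beta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), (n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), (n_hidden, 1))
        self.b2 = np.zeros(1)
        self.alpha, self.beta = alpha, beta

    def forward(self, X):
        self.H = 1.0 / (1.0 + np.exp(-(X @ self.W1 + self.b1)))  # sigmoid hidden layer
        return (self.H @ self.W2 + self.b2).ravel()                 # linear output neuron

    def objective(self, X, t):
        E_D = 0.5 * np.sum((self.forward(X) - t) ** 2)             # data misfit
        E_W = 0.5 * sum(np.sum(w ** 2) for w in (self.W1, self.b1, self.W2, self.b2))
        return self.beta * E_D + self.alpha * E_W

    def train_step(self, X, t, lr=5e-5):
        err = self.forward(X) - t
        g_out = self.beta * err[:, None]                           # dF / d(output)
        gW2 = self.H.T @ g_out + self.alpha * self.W2
        gb2 = g_out.sum(axis=0) + self.alpha * self.b2
        g_hid = (g_out @ self.W2.T) * self.H * (1.0 - self.H)      # back through the sigmoid
        gW1 = X.T @ g_hid + self.alpha * self.W1
        gb1 = g_hid.sum(axis=0) + self.alpha * self.b1
        for p, g in ((self.W1, gW1), (self.b1, gb1), (self.W2, gW2), (self.b2, gb2)):
            p -= lr * g                                            # in-place gradient step
```

The weight-decay term $\alpha E_W$ is what distinguishes this objective from plain least squares: it pulls the weights toward zero, which corresponds to a Gaussian prior on the weights in the Bayesian view described above.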

F. Evaluation Process
To judge the effectiveness and robustness of the proposed algorithm, it has to be assessed with different performance measures. In this research, five performance measures were used: accuracy, sensitivity, specificity, precision and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. Sensitivity measures the ability of the algorithm to recognize pathological samples, whereas specificity evaluates its ability to identify normal samples. Precision is the proportion of samples classified as pathological that are actually pathological. Accuracy is the overall rate of correct classification, and the AUC is an important statistical measure of how well the system discriminates between normal and pathological samples, which provides another way to assess the accuracy of the proposed system. With pathological samples taken as the positive class, these measures are computed from the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN},$$

$$\text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$
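The measures above can be sketched in plain Python as follows, with pathological samples as the positive class (label 1, an assumed encoding). The paper does not say how its ROC curve was built; this sketch computes the AUC through its equivalent pairwise form: the probability that a randomly chosen pathological sample receives a higher classifier score than a randomly chosen normal one, with ties counting one half.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Pathological = positive class, normal = negative class
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def performance(tp, tn, fp, fn):
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # pathological samples recognized
        "specificity": tn / (tn + fp),   # normal samples recognized
        "precision": tp / (tp + fp),     # predicted pathological that truly are
    }

def auc(y_true, scores, positive=1):
    # Pairwise (Mann-Whitney) form of the area under the empirical ROC curve
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, with `y_true = [1, 1, 1, 0, 0]` and `y_pred = [1, 1, 0, 0, 1]`, the counts are TP = 2, TN = 1, FP = 1, FN = 1, giving an accuracy of 0.6 and a specificity of 0.5.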

III. EXPERIMENTAL RESULTS
In this research, the dataset was divided into two parts: 70% of the data were used for training and 30% for validation. All simulations were run in MATLAB R2013a on an Intel Core i7 2.20 GHz CPU with 4 GB of RAM.
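A 70/30 split of the 120 recordings can be sketched as below. The paper does not say whether the split was stratified, repeated or seeded; this hypothetical helper assumes a single stratified split, so both parts keep the 50/70 ratio of normal to pathological recordings.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.7, seed=0):
    # Group sample indices by class, shuffle each group, cut it at train_frac
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        n_train = int(round(train_frac * len(idx)))
        train_idx += idx[:n_train]
        test_idx += idx[n_train:]
    return sorted(train_idx), sorted(test_idx)
```

With 50 normal (label 0) and 70 pathological (label 1) recordings, this gives 35 + 49 = 84 training and 15 + 21 = 36 validation samples.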

A. Evaluation Based on the SVM Performance
In this part of the article, we present the SVM performance rates for different combinations of the MFCC coefficients before and after applying the LDA feature selection procedure.Table 1 shows the SVM performance in terms of accuracy (Acc %), sensitivity (Sens %), specificity (Spec %), precision (Prec %) and AUC (%) for the different MFCC feature vectors.
The experimental results show only a slight increase in the SVM performance rates between the MFCC and MFCC_Delta1 combinations: 0.04 percentage points in accuracy, 0.03 in AUC, 0.04 in sensitivity, 0.05 in specificity and 0.07 in precision. The performances of the MFCC_Delta1 and MFCC_Delta1&2 combinations are exactly equal, with an accuracy of 80.4%, sensitivity of 87.83%, specificity of 73.58%, AUC of 80.7% and precision of 72.29%. The first and second derivatives therefore do not significantly improve system performance when the SVM is used as classifier, which indicates that the SVM is not sensitive to the information these features carry about the time variation of the original MFCC vector. After applying the LDA procedure, the SVM performance rates of the three combinations differ slightly more, but not enough to change this analysis of the contribution of the first and second derivatives when the SVM is the classifier. In the literature, Godino-Llorente et al. [6] showed that voice impairments can be detected using both the mel cepstral vectors and their first derivative, ignoring the second derivative, when Gaussian Mixture Models are applied as classifier. Our findings, however, indicate that with the SVM classifier even the first derivative can be ignored and only the original Mel Frequency Cepstral Coefficients are significant.
The LDA feature selection method was also applied to each MFCC feature vector, and the experimental results show a significant improvement in system performance. Fig. 3 shows an improvement of 5.92 percentage points for the MFCC features, which reach an accuracy of 86.28%. Similarly, the optimized MFCC_Delta1 and MFCC_Delta1&2 combinations reach accuracies of 86.07% and 86.44%, increases of 5.67 and 6.04 points, respectively.
The AUC rates for the different MFCC combinations are presented in Fig. 4. The improvement between optimized and non-optimized features is substantial: 6.86 percentage points for the MFCC combination, 6.61 for MFCC_Delta1 and 6.94 for MFCC_Delta1&2. The LDA procedure can therefore be considered efficient at selecting the most relevant parameters and producing an optimized feature vector that achieves the best performance rates. The best performances were achieved by the optimized MFCC_Delta1&2 features, slightly ahead of the other optimized features, as shown in Table 1.
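The gains quoted in this subsection are differences between percentages (percentage points), and they can be cross-checked with a few lines of arithmetic. The values below are taken from the text; the non-optimized MFCC figures are derived from the stated 0.04-point (accuracy) and 0.03-point (AUC) gaps to MFCC_Delta1, and the optimized AUC values are derived by adding the reported gains.

```python
# SVM accuracy (%): (non-optimized, optimized), values from the text
svm_acc = {
    "MFCC": (80.40 - 0.04, 86.28),
    "MFCC_Delta1": (80.40, 86.07),
    "MFCC_Delta1&2": (80.40, 86.44),
}
# SVM AUC (%): non-optimized values and LDA gains (percentage points), from the text
svm_auc_nonopt = {"MFCC": 80.70 - 0.03, "MFCC_Delta1": 80.70, "MFCC_Delta1&2": 80.70}
svm_auc_gain = {"MFCC": 6.86, "MFCC_Delta1": 6.61, "MFCC_Delta1&2": 6.94}

for name, (before, after) in svm_acc.items():
    gain = after - before                                   # accuracy gain in points
    opt_auc = svm_auc_nonopt[name] + svm_auc_gain[name]     # derived optimized AUC
    print(f"{name:14s} accuracy gain {gain:5.2f} pts, optimized AUC {opt_auc:5.2f} %")
```

The printed accuracy gains reproduce the 5.92, 5.67 and 6.04 points reported above, and the derived optimized AUC of MFCC_Delta1&2 (87.64%) matches the value quoted in the discussion.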

B. Evaluation Based on the ANN Performance
Table 2 shows the ANN performance rates for the different combinations of MFCC coefficients before and after applying the LDA feature selection procedure, in terms of accuracy (Acc %), sensitivity (Sens %), specificity (Spec %), precision (Prec %) and AUC (%). The ANN performance clearly improves as the first and second derivatives are added to the MFCC features. The accuracy and AUC are 75.13% and 75.02%, respectively, for the original MFCC features, and rise to 81.19% and 81.74% when the first derivatives are added. The improvement continues when both the first and second derivatives are combined with the MFCC features, reaching an accuracy of 85.2% and an AUC of 85.21%. The first and second derivatives of the MFCC coefficients can therefore be considered significant when the ANN is applied as classifier, since they greatly improve system performance compared to the original MFCC features alone. This trend across the combinations is observed both before and after applying the LDA transformation.
The LDA method was applied to the different MFCC combinations to select the most significant parameters from the extracted features as the input vector of the ANN. This strategy improves the ANN performance measurements for all the optimized MFCC feature combinations. Fig. 5 compares the ANN accuracy rates of the optimized and non-optimized MFCC vectors: LDA raises the accuracy of the MFCC features by 5.12 percentage points, that of the MFCC features combined with their first derivatives by 2.87 points, and that of the MFCC features combined with their first and second derivatives by 2.62 points. The improvement is observed for all performance measures, including the AUC, which reaches 81.87% for the MFCC combination (an increase of 6.85 points), while the AUC increases by 3.85 and 2.75 points for MFCC_Delta1 and MFCC_Delta1&2, respectively (Fig. 6). Finally, the optimized MFCC_Delta1&2 combination achieves the best ANN performance, with an accuracy of 87.82%, sensitivity of 99.12%, specificity of 80.31%, AUC of 87.96% and precision of 81.42%, as shown in Table 2.
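The same arithmetic check applies to the ANN figures. Starting from the non-optimized accuracy and AUC values and the LDA gains given in the text, the sketch below derives the optimized values and the step-wise accuracy gains between successive optimized combinations.

```python
# name: (accuracy non-opt, accuracy gain, AUC non-opt, AUC gain), all from the text (%)
ann = {
    "MFCC": (75.13, 5.12, 75.02, 6.85),
    "MFCC_Delta1": (81.19, 2.87, 81.74, 3.85),
    "MFCC_Delta1&2": (85.20, 2.62, 85.21, 2.75),
}
for name, (acc, d_acc, auc_val, d_auc) in ann.items():
    print(f"{name:14s} optimized accuracy {acc + d_acc:5.2f} %, optimized AUC {auc_val + d_auc:5.2f} %")

names = list(ann)
opt_acc = [ann[n][0] + ann[n][1] for n in names]
steps = [b - a for a, b in zip(opt_acc, opt_acc[1:])]   # gain from adding each derivative
print("step-wise accuracy gains after LDA:", [round(s, 2) for s in steps])
```

The derived optimized accuracy (87.82%) and AUC (87.96%) of MFCC_Delta1&2 agree with the values reported above, and each added derivative contributes roughly 3.8 points of accuracy after LDA.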

IV. DISCUSSION
In this paper, the ANN is proposed as an unconventional approach alongside the SVM, a newer method successfully exploited in speech recognition. The main motivation for this research was to investigate the efficiency of each classifier in the identification of voice disorders, and to examine the contribution of the first and second derivatives of the MFCC features for each classifier. The experimental results demonstrate that the effect of these derivative features depends on the classifier. When the SVM is used, the first and second derivatives provide no improvement over the original MFCC features. When the ANN is used, however, these derivative features are important, since they contribute to improving system performance: there is an average improvement of about 4% across the MFCC, MFCC_Delta1 and MFCC_Delta1&2 combinations.
In addition, the LDA procedure is used to select the most relevant parameters from the resulting feature vector in order to reduce the system dimensionality without degrading its performance. Our findings show that the LDA method reduces the system complexity while improving the performance rates for every feature combination; it can therefore be considered an optimization procedure.
Table 3 compares the proposed algorithms with previous significant works; the proposed algorithm appears competitive for the detection of voice disorders from the Saarbrucken Voice Database (SVD). Finally, with an accuracy of 86.44%, sensitivity of 98.24%, specificity of 77.04%, AUC of 87.64% and precision of 74.42%, the SVM classifier can be judged efficient for voice disorders identification. Also, the ANN classifier offers an accuracy rate of 87.82%, sensitivity of 99.12%, specificity of

Fig. 1. Block diagram of the proposed system

Fig. 2. Block diagram of the MFCC procedure

TP (True Positive): identified as pathological when pathological samples are actually present
TN (True Negative): identified as normal when normal samples are actually present
FP (False Positive): identified as pathological when normal samples are actually present
FN (False Negative): identified as normal when pathological samples are actually present

Fig. 3. Comparison between the SVM accuracy rates of the optimized and non-optimized MFCC features

Fig. 4. Comparison between the SVM AUC rates of the optimized and non-optimized MFCC features

Fig. 5. Comparison between the ANN accuracy rates of the optimized and non-optimized MFCC features

Fig. 6. Comparison between the ANN AUC rates of the optimized and non-optimized MFCC features

TABLE I. THE SVM PERFORMANCE BASED ON THE MFCC COMBINATIONS BEFORE APPLYING THE LDA PROCEDURE (TABLE 1-1) AND AFTER APPLYING THE LDA PROCEDURE (TABLE 1-2)

TABLE II. THE ANN PERFORMANCE BASED ON THE MFCC COMBINATIONS BEFORE APPLYING THE LDA PROCEDURE (TABLE 2-1) AND AFTER APPLYING THE LDA PROCEDURE (TABLE 2-2)

TABLE III. COMPARISON BETWEEN THE PROPOSED ALGORITHM AND PREVIOUS WORKS