Physiologically Motivated Feature Extraction for Robust Automatic Speech Recognition

In this paper, a new method is presented for extracting robust speech features in the presence of external noise. The proposed method, based on two-dimensional Gabor filters, takes into account the spectro-temporal modulation frequencies and also limits redundancy at the feature level. The performance of the proposed feature extraction method was evaluated on isolated words extracted from the TIMIT corpus and corrupted by background noise. The evaluation results demonstrate that the proposed feature extraction method outperforms classic methods such as Perceptual Linear Prediction, Linear Predictive Coding, Linear Prediction Cepstral Coefficients and Mel Frequency Cepstral Coefficients.

Keywords: Feature extraction; Two-dimensional Gabor filters; Noisy speech recognition


I. INTRODUCTION
Over the last years, numerous feature extraction methods have been developed for noise-robust Automatic Speech Recognition (ASR) to improve the performance and robustness of the recognition task. Several of these methods exploit principles of human speech perception to overcome the lack of robustness against the variability of speech signals. Traditional feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCC) [1], Linear Predictive Coding (LPC) [2] and Perceptual Linear Prediction (PLP) [3] are based on auditory filter modeling. Further improvements were made by using various auditory models in other methods [4][5][6].
Recent physiological and psychoacoustic studies have additionally shown that neurons in the primary auditory cortex respond to spectro-temporal modulations; their response patterns, referred to as Spectro-Temporal Receptive Fields (STRFs), play an important role in speech perception. Two-dimensional spectro-temporal Gabor filters have been successfully used for modeling STRFs [7][8]. This has led to various approaches for extracting spectro-temporal features that achieve good noise robustness in ASR compared to traditional features [9][10][11]. In [12], Gabor features were obtained by processing a log Mel-spectrogram with a number of 2D Gabor filters organized in a filter bank, whereas in [16] these features were calculated from a time-frequency representation derived from Power-Normalized Cepstral Coefficients (PNCCs) [15].
In this study, a physiologically motivated method for extracting Gabor features for noisy speech recognition is presented. The proposed method is based on a set of 41 two-dimensional Gabor filters organized in a filter bank. It is applied to the recognition of TIMIT isolated words in noisy environments. The recognition task is performed using Hidden Markov Models (HMMs) built with the HTK toolkit [17]. This paper is organized as follows: Section 2 describes the proposed Gabor feature extraction method. The experimental framework and results are detailed in Section 3. Section 4 provides the conclusions of this paper.

II. THE PROPOSED FEATURE EXTRACTION BASED ON TWO-DIMENSIONAL GABOR FILTERS
A novel method based on two-dimensional Gabor filters is proposed to extract robust speech features for the recognition of isolated words. The various steps are illustrated in Figure 2.
After pre-emphasis of the input speech signal, the power spectrum is calculated by windowing the signal with a Hamming window (20 ms length with 10 ms overlap) and taking the squared magnitude of the Discrete Fourier Transform. It is then passed through a Bark-scale filter bank, which simulates the critical-band masking curves, in order to obtain a critical-band power spectrum [3].
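As an illustration of the steps just described, the following minimal Python sketch applies pre-emphasis, frames the signal with a Hamming window, computes the power spectrum, and gives the Bark warping used in PLP analysis [3]. The pre-emphasis coefficient of 0.97, the function names and the naive DFT (chosen only to keep the sketch free of external dependencies) are assumptions and are not taken from the proposed implementation.

```python
import cmath
import math


def pre_emphasize(signal, coeff=0.97):
    """First-order pre-emphasis; coeff=0.97 is an assumed, commonly used value."""
    if not signal:
        return []
    return [signal[0]] + [signal[i] - coeff * signal[i - 1]
                          for i in range(1, len(signal))]


def frame_power_spectra(signal, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Power spectrum of each 20 ms Hamming-windowed frame (10 ms frame shift)."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Naive DFT bin; an FFT routine would be used in practice.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc) ** 2)
        spectra.append(spectrum)
    return spectra


def hz_to_bark(freq_hz):
    """Bark warping of PLP analysis: 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(freq_hz / 600.0)
```

The Bark-scale filter bank itself (the critical-band masking curves) is not reproduced here; `hz_to_bark` only gives the frequency warping on which such a filter bank would be placed.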
Subsequently, equal-loudness pre-emphasis and intensity-loudness conversion (cube-root amplitude compression) are performed to reproduce two psychoacoustic properties of the human hearing system: the unequal sensitivity of hearing across frequency, and the power law of hearing, which models the relation between the intensity of the speech signal and its perceived loudness [3]. These two steps reduce the spectral amplitude variation of the resulting spectrum.
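The two psychoacoustic steps can be sketched as follows. The equal-loudness weight uses the analytic approximation from the PLP formulation [3]; the function names and the assumption that each critical band is weighted at a single centre frequency are illustrative choices, not details given in this paper.

```python
import math


def equal_loudness_weight(freq_hz):
    """Equal-loudness approximation of PLP analysis, evaluated at freq_hz."""
    w2 = (2.0 * math.pi * freq_hz) ** 2  # squared angular frequency
    return ((w2 + 56.8e6) * w2 ** 2) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))


def compress_band_powers(band_powers, center_freqs_hz):
    """Equal-loudness weighting followed by cube-root compression."""
    return [(equal_loudness_weight(f) * p) ** (1.0 / 3.0)
            for p, f in zip(band_powers, center_freqs_hz)]
```

The cube root compresses the dynamic range of the critical-band powers, which is what reduces the spectral amplitude variation mentioned above.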
Finally, the proposed features, named Gabor Bark Power Spectrum (GBPS) features, are extracted by applying a set of two-dimensional Gabor filters organized in a filter bank to the resulting spectral representation. This filter bank is composed of 41 two-dimensional Gabor filters [12]. These filters are among the most recent state-of-the-art methods that have been successfully applied as a front end for noise-robust speech recognition [12][16][18]. The Gabor features are obtained by calculating the 2D convolution of each filter with a time-frequency representation of speech in order to capture spectro-temporal modulations. Each two-dimensional Gabor filter is the product of two terms: a complex sinusoid and a Hanning envelope defined by its time and frequency window lengths [12][13][14]. The complex sinusoid is parameterized by the temporal modulation frequency and the spectral modulation frequency. These two parameters determine the periodicity of the Gabor function and allow it to be tuned to a wide range of directions of spectro-temporal modulation.
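To make the filter construction concrete, the sketch below builds one complex Gabor filter as the product of a complex sinusoid and a separable Hanning envelope, and convolves it with a time-frequency representation. The parameter names (`omega_n`, `omega_k`, `width_n`, `width_k`), the zero padding at the edges and the use of the real part of the output are assumptions made for illustration; the actual filter parameters of the 41-filter bank are those of [12] and are not reproduced here.

```python
import cmath
import math


def hann(length):
    """Hanning window of the given length, nonzero at every sample."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * (i + 1) / (length + 1))
            for i in range(length)]


def gabor_filter(omega_n, omega_k, width_n, width_k):
    """Complex 2D Gabor filter, indexed as filt[k][n] (channel k, frame n)."""
    env_n, env_k = hann(width_n), hann(width_k)
    n0, k0 = (width_n - 1) / 2, (width_k - 1) / 2  # filter centre
    return [[env_k[k] * env_n[n]
             * cmath.exp(1j * (omega_n * (n - n0) + omega_k * (k - k0)))
             for n in range(width_n)]
            for k in range(width_k)]


def convolve2d_same(spec, filt):
    """2D convolution with zero padding; output has the shape of spec."""
    rows, cols = len(spec), len(spec[0])
    f_rows, f_cols = len(filt), len(filt[0])
    ck, cn = f_rows // 2, f_cols // 2
    out = [[0j] * cols for _ in range(rows)]
    for k in range(rows):
        for n in range(cols):
            acc = 0j
            for i in range(f_rows):
                kk = k - (i - ck)
                if not 0 <= kk < rows:
                    continue
                for j in range(f_cols):
                    nn = n - (j - cn)
                    if 0 <= nn < cols:
                        acc += filt[i][j] * spec[kk][nn]
            out[k][n] = acc
    return out


def gabor_features(spec, filt):
    """Real part of the filtered representation (an assumed choice)."""
    return [[v.real for v in row] for row in convolve2d_same(spec, filt)]
```

Setting `omega_k` to zero gives a purely temporal filter and setting `omega_n` to zero a purely spectral one; nonzero values of both tune the filter to diagonal spectro-temporal patterns.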
The bank of 41 Gabor filters was selected so that the transfer functions of the filters have a constant overlap in the modulation frequency domain and cover a broad interval; this yields approximately orthogonal filters and limits the redundancy of the filter output signals. The temporal and spectral modulation frequencies of the 41 Gabor filters are illustrated in Figure 1.

III. EXPERIMENTAL FRAMEWORK AND RESULTS
A. Databases
The TIMIT database [19] was used for all ASR experiments reported in this paper. It is one of the standard databases used to evaluate the robustness and performance of new methods on ASR tasks because it covers a wide range of speakers and dialects. The database consists of speech signals sampled at 16 kHz from 630 different speakers (192 female and 438 male) representing eight major dialects of the United States, with ten sentences spoken by each speaker. In our experimental study, we used isolated words extracted from the TIMIT database. A total of 9240 isolated words were used in the training phase and 3294 isolated words in the recognition phase.
Furthermore, four background noises (restaurant, exhibition, babble and car) drawn from the AURORA database [20] are used to evaluate the robustness of the proposed method under additive noise. The noisy isolated words used in this work were obtained by adding each noise to the clean isolated words at various signal-to-noise ratio (SNR) levels.
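One common way of producing such noisy words is to scale the noise so that the ratio of average signal power to average noise power matches the target SNR. The sketch below follows that convention; the paper does not state how the scaling was performed, so the whole-signal power estimate and the function name are assumptions.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean word so that the result has the requested SNR in dB."""
    if len(noise) < len(clean):
        raise ValueError("noise segment must be at least as long as the word")
    noise = noise[:len(clean)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Gain that brings the noise power to p_clean / 10^(SNR/10).
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [c + gain * v for c, v in zip(clean, noise)]
```

Calling `mix_at_snr` for each clean word, each noise type and each SNR level yields the full set of test conditions.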

B. Speech Recognizer
The speech recognizer used in our experiments is based on HMMs built using the Hidden Markov Model Toolkit (HTK 3.4.1) [17]. This portable toolkit, developed at Cambridge University, is used to construct and manipulate HMMs optimized for speech recognition. An HMM is used to model a sequence of acoustic vectors. It consists of a collection of stationary states connected by the transitions of a Markov chain. At each state change, an observed acoustic vector $o_t$, described by an emission probability density $b_j(o_t)$, is generated. The transition from state $i$ to state $j$ is also probabilistic and has a discrete probability $a_{ij}$ associated with it [21][22]. An example of an HMM consisting of five states with non-emitting entry and exit states is shown in Figure 3.
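The left-to-right structure can be expressed as a transition matrix in which each emitting state may only remain in place or move to the next state. The sketch below assumes that the five states include the non-emitting entry and exit states, and uses a self-loop probability of 0.6 purely as an illustrative initial value; in practice the transition probabilities are re-estimated during training.

```python
def left_to_right_transitions(num_states=5, self_loop=0.6):
    """Transition matrix of a left-to-right HMM without skips.

    State 0 is the non-emitting entry state and the last state is the
    non-emitting exit state, whose row is left as all zeros.
    """
    a = [[0.0] * num_states for _ in range(num_states)]
    a[0][1] = 1.0  # the entry state always moves to the first emitting state
    for i in range(1, num_states - 1):
        a[i][i] = self_loop            # stay in the same state
        a[i][i + 1] = 1.0 - self_loop  # advance to the next state
    return a
```

With the default arguments this gives three emitting states between the entry and exit states.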
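In the case of a continuous-density HMM, the most widely used output probability density $b_j(o_t)$ is the Gaussian mixture density, which is defined as [17]

$$b_j(o_t) = \sum_{k=1}^{K} c_{jk}\, \mathcal{N}(o_t; \mu_{jk}, \Sigma_{jk})$$

where $K$ is the number of mixture components and $\mathcal{N}(o_t; \mu_{jk}, \Sigma_{jk})$ is the multivariate Gaussian density, with $\Sigma_{jk}$, $\mu_{jk}$ and $c_{jk}$ denoting the covariance matrix, the mean vector and the weight associated with the $k$-th Gaussian component at state $j$:

$$\mathcal{N}(o_t; \mu_{jk}, \Sigma_{jk}) = \frac{1}{\sqrt{(2\pi)^{n} |\Sigma_{jk}|}} \exp\left(-\frac{1}{2}(o_t - \mu_{jk})^{T} \Sigma_{jk}^{-1} (o_t - \mu_{jk})\right)$$

Here $n$ is the dimension of the vector $o_t$.
The HMM topology used in our experiments is a left-to-right five-state HMM with Gaussian mixture densities and diagonal covariance matrices. Each HMM state is represented by four Gaussian mixture components (HMM-4-GM).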
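For diagonal covariance matrices, the Gaussian mixture density reduces to sums over the vector dimensions. The following sketch evaluates it in the log domain for numerical stability; the function names are hypothetical and the code illustrates the density only, not HTK's internal implementation.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log of a multivariate Gaussian density with diagonal covariance."""
    n = len(o)
    log_det = sum(math.log(v) for v in var)
    maha = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + maha)


def log_gmm_density(o, weights, means, variances):
    """Log of the mixture density of one state (log-sum-exp over components)."""
    terms = [math.log(c) + log_gaussian_diag(o, m, v)
             for c, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

In the HMM-4-GM configuration, `weights`, `means` and `variances` would each hold four entries per state.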

C. Results and discussion
For all of our experiments, the proposed Gabor Bark Power Spectrum (GBPS) features are compared to four classic features combined with energy (E): Perceptual Linear Prediction (PLP_E), Linear Predictive Coding (LPC_E), Linear Prediction Cepstral Coefficients (LPCC_E) and Mel Frequency Cepstral Coefficients (MFCC_E). The recognition rates obtained with the proposed Gabor features and the four classic features using HMM-4-GM are summarized in Tables I, II, III and IV. Four noises (restaurant, exhibition, babble and car) drawn from the AURORA database and six signal-to-noise ratios (SNRs) ranging from 0 dB to 25 dB in 5 dB steps were considered.
As shown in these tables, the proposed Gabor features outperform the PLP_E, LPC_E, LPCC_E and MFCC_E features across the noise conditions. The highest recognition rates are obtained using our Gabor features at almost all SNR levels, particularly at low SNR values. For example, in the car-noise case at an SNR of 5 dB, the recognition rate of our Gabor features is higher than that of the PLP_E, LPC_E, LPCC_E and MFCC_E features by 52.03, 59.62, 51.52 and 50.79, respectively. As can also be seen in the tables, the performance of all features degrades as the SNR decreases, but the proposed features remain more robust and perform better than the classic features.

IV. CONCLUSION
A new physiologically motivated feature extraction method based on a Gabor filter bank for isolated-word speech recognition under noisy conditions is presented in this paper. The proposed method takes into account the extraction of spectro-temporal modulation frequencies and the limitation of redundancy at the feature level. The robustness of our Gabor Bark Power Spectrum (GBPS) features was evaluated on isolated words taken from the TIMIT database using HMMs. The results show that our Gabor features give the best results at almost all SNR levels compared to four classical features combined with energy: PLP_E, LPC_E, LPCC_E and MFCC_E.

Fig. 1. The real components of the set of 41 Gabor filters employed in the proposed method

Fig. 3. Illustration of a hidden Markov model with five left-to-right states

TABLE I. THE RECOGNITION RATE OF THE PROPOSED FEATURES, MFCC, PLP, LPC, AND LPCC OBTAINED USING HMM-4-GM IN THE RESTAURANT NOISE CASE

TABLE II. THE RECOGNITION RATE OF THE PROPOSED FEATURES, MFCC, PLP, LPC, AND LPCC OBTAINED USING HMM-4-GM IN THE EXHIBITION NOISE CASE

TABLE III. THE RECOGNITION RATE OF THE PROPOSED FEATURES, MFCC, PLP, LPC, AND LPCC OBTAINED USING HMM-4-GM IN THE BABBLE NOISE CASE

TABLE IV. THE RECOGNITION RATE OF THE PROPOSED FEATURES, MFCC, PLP, LPC, AND LPCC OBTAINED USING HMM-4-GM IN THE CAR NOISE CASE