Isolated Automatic Speech Recognition of Quechua Numbers using MFCC, DTW and KNN

Automatic Speech Recognition (ASR) is defined as the transformation of acoustic signals into strings of words. The area has been under development for many years and, because it makes people's lives easier, it has been implemented in several languages. However, ASR development remains very low for languages that have few database resources despite being spoken by large populations. ASR development for the Quechua language is almost nonexistent, which isolates its culture and population from technology and information. In this work an ASR system for isolated Quechua numbers is developed. Mel-Frequency Cepstral Coefficients (MFCC), Dynamic Time Warping (DTW) and K-Nearest Neighbor (KNN) are implemented using a database of recorded audio of the numbers from one to ten in Quechua. The recordings that feed the database were uttered by native Quechua speakers, both men and women. The recognition accuracy reached in this research work was 91.1%.
Keywords: Automatic Speech Recognition; MFCC; DTW; KNN


I. INTRODUCTION
Technology has been facilitating people's lives since it became an integral part of them. It makes communication with computers easy, and one way to achieve this is to emulate human intelligence in order to understand what a person says aloud [1]. Interaction between a person and a computer through voice is simpler and more comfortable because it does not require special skills such as hand coordination and typing speed on a keyboard [2]. For this reason ASR systems have been developed in many languages, including languages with few database resources, with the aim of making people's interaction with computers easy and thereby facilitating access to information and technology [3] [4] [5].
ASR is the area of artificial intelligence that transforms the audio signals spoken by a person into a sequence of words that can be understood by a computer [6]. For years, researchers have studied how a person can communicate with a computer in the same way a person communicates with another person [7]. The development of ASR covers issues ranging from research on voice recognition to the implementation of dictionaries based on a person's speech, and these issues divide ASR into three types: ASR of isolated words, of continuous speech and of connected words [8].
ASR systems for isolated words take as input individual words, or a list of words with a well-defined pause between them, and each word is processed individually [1]. In this research work an ASR system is developed for isolated words, specifically for the natural numbers from one to ten in Quechua, which is an official language in Peru.
Quechua is essentially an agglutinative language, and this peculiarity makes it different from the dominant languages in South America; as a consequence, the language is under strong social pressure [9]. This goes hand in hand with the fact that technology development for such languages is very low, which isolates Quechua-speaking people from information and technology. An ASR system in Quechua will enable people who speak only this language to use technology to a greater extent, without needing to operate a computer keyboard designed for a foreign language or to understand information published in a foreign language.
This research paper presents the development of an ASR system for isolated words with a limited database. The rest of the paper is organized as follows: Section II reviews the works related to this one. Section III provides the theoretical framework of ASR. Section IV develops the methodology used to implement the proposed ASR system. Section V analyzes the results obtained from the ASR system and, finally, Section VI summarizes the conclusions reached through the development of this work.

II. RELATED WORKS
Atif in [10] developed a system for automatic recognition of isolated words in English. MFCC was used in the feature extraction phase, and DTW and KNN were used in the recognition and classification block: DTW to match the features of different audios, and KNN to classify by taking the features that are most similar. The audios used were acoustically balanced and free of ambient noise. The recognition accuracy achieved by these authors is 98.4%.
Wani in [2] developed an automatic recognition system for isolated words in Hindi. The work takes into account that many people who speak this language cannot speak English, the language for which most ASR systems were developed, and therefore cannot easily access this technology. MFCC was used for feature extraction, and KNN and GMM (Gaussian Mixture Model) were implemented in the recognition phase. To make the system speaker independent, the training audios were obtained from different speakers, both men and women. Wani's work reaches a recognition accuracy of 94.31%.
In Indonesia, an ASR was developed using an HMM-based tool and a limited database. The authors of [11] needed to build an acoustic model, a language model and a dictionary to develop the ASR for the Indonesian language. They developed their own models of the numbers, which were used as input for the CMUSphinx toolkit, the tool they used. They also investigated the use of already implemented acoustic models, evaluating them under different SNR conditions. The best recognition accuracy achieved is 86%, and in the experiments under different noise-level conditions the best accuracy is 80%.
Anand in [5] developed a modern large-vocabulary ASR with an application for people with visual disabilities. MFCC was used in the feature extraction phase, and in the classification and recognition phase an HMM-based acoustic model was developed using thirty hours of audio. To handle pronunciation variation, a hybrid of rule-based and statistical methods was used. The audio recordings were collected from 80 native speakers of the Malay language. The best recognition accuracy achieved is 80%, and the developed system was integrated into OpenOffice Writer as a voice-driven text entry interface.
On the other hand, Ranjan develops an ASR system for isolated words in a language dialect called Maithili [12]. To obtain the acoustic vectors needed for classification, the author implements MFCC. The system developed by Ranjan is based on the HMM model: both the acoustic model and the language model are developed with HMM. The recognition accuracy reached in that work is 95%. However, future work is planned to improve accuracy in noisy environments.
Speech recognition for people with amputated vocal cords differs to some degree from a common ASR. While it is true that the duration and intonation of words and vowels are practically the same, the pre-processing of the signals must be more thorough. This problem is addressed by Malathi in [13], using MFCC for feature extraction of the audios to build the acoustic vectors. The classification or recognition was developed with GMM and Gradient Descent Radial Basis Function (RBF) Networks. The learning rates of the network are made proportional to the probability densities obtained from the GMM. The result of the research was applied to patients who pronounced words using only the esophagus.
Bhardwaj [14] developed three schemes or types of ASR with the same methodology in order to evaluate its behavior in different contexts. The types of ASR evaluated are speaker dependent, multi-speaker and speaker independent. The methodology starts by implementing MFCC for feature extraction of the audio. The acoustic model and the language model are based on HMM. To classify the words in Hindi, the language they worked with, the authors used the K-Means algorithm. The recognition rate was 99% for the speaker-dependent ASR, 98% for the multi-speaker ASR and 97.5% for the speaker-independent ASR.
Ananthi developed an ASR for people with hearing problems [15]. If the words of a speaker are interpreted by the computer and simultaneously transcribed into text, a person with hearing impairment can easily understand any person. An HMM-based ASR of isolated words is developed in Ananthi's work. Because that work is aimed at the use of ASR in fluent conversation, the implementation of DTW is discarded, since it only works properly in isolated-word ASR. The result of that work was successfully deployed in a population of people with hearing problems.

III. ASR
ASR systems are composed of two main blocks, a feature extraction block and a classification block [10]. The feature extraction block obtains values from an audio, and these are passed to the classification block, which is responsible for predicting the word or sequence of words corresponding to the input audio [16].
Many algorithms and methods are available in the feature extraction block to express audio signals as numeric values. Some of these methods are: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), Linear Predictive Coding (LPC), Cepstral Analysis, Mel-Frequency Scale Analysis, Filter-Bank Analysis, Mel-Frequency Cepstral Coefficients (MFCC), Kernel-Based Feature Extraction, Dynamic Feature Extraction, Wavelet-based features, Spectral Subtraction and Cepstral Mean Subtraction (CMS) [17]. According to our review of related works, the most common and appropriate feature extraction methods for this type of ASR are MFCC and LPC, and in this research work we used the first one, MFCC.
The classification block has two main components, the acoustic model and the language model [18]. The acoustic model represents how a word is pronounced, while the language model represents the probability that a word fits a sequence of words. Hidden Markov Models (HMM) and Neural Networks are the most common techniques for building an acoustic model, and the N-Gram model is the most common for building a language model. These techniques are common in continuous and large-vocabulary ASR [18] [19] [11] [20]. However, for ASR of isolated words with a limited data set, other techniques behave better. In this type of ASR only an acoustic model is built to classify the words, and techniques such as Dynamic Time Warping (DTW) and K-Nearest Neighbor (KNN) are the ones that reach better results in finding similarity between the signals of two or more audios [10] [2].
After analyzing the architecture of a conventional ASR and of an isolated-word ASR, the ASR developed in the present work adopts the isolated-word architecture, which consists of two main blocks: the feature extraction block and the classification block. In each block, the algorithms that best fit our problem according to the state-of-the-art review are implemented. The blocks of the architecture and the algorithms to be used are presented in Fig. 1.

IV. METHODOLOGY
The isolated-word ASR developed in this work implements the MFCC technique for feature extraction. To classify the representation of the audio signals that MFCC provides, the DTW and KNN techniques are used. Before starting the described process, the input audios go through a noise reduction and elimination filter. This methodology can be seen in Fig. 2.

A. Database
As a first step in implementing this ASR, a Quechua database was developed. For this purpose, isolated words from different native speakers were recorded. The numbers from 1 to 10 were obtained by recording thirty people, fifteen men and fifteen women. The numbers uttered and recorded can be seen in Table I, and each number was saved in an audio file with the .wav extension. Each number was spoken by all thirty people, so in total we had three hundred audios to be processed and fed into the ASR system we implemented.

B. Feature Extraction.
In this stage MFCC is implemented. It is considered the most important stage, because the parametric representation of the audio signals determines how effective the next stage, classification, will be. MFCC is based on human auditory perception, which does not resolve frequencies above about 1000 Hz on a linear scale [21] [22]; in other words, it is based on the known variation of the human ear's critical bandwidth with frequency. These audio signals are best represented on the Mel scale, which is approximately linear below 1000 Hz and logarithmic above. The entire MFCC process can be seen in Fig. 3, and each of its phases is described next.
1) Audio Pre-Processing: Because we needed acoustic vectors of the same length, we edited the duration of every audio so that each one lasts exactly one second. The numbers were spoken in an acoustically balanced and noise-free environment, so it was not necessary to use any noise reduction technique. Every recording was saved in .wav format with 16-bit PCM encoding and a sampling frequency of 8000 Hz. The signal obtained after pre-processing an audio can be seen as a time series in Fig. 4.
2) Pre-Emphasis: We apply pre-emphasis to the original signal to amplify the high frequencies. According to [23], the pre-emphasis filter is useful in several ways: a) it balances the frequency spectrum, since high frequencies usually have lower magnitudes than low frequencies; b) it avoids numerical problems during the Fourier transform operation; c) it can also improve the Signal-to-Noise Ratio (SNR). This filter is applied to a signal x using (1). After applying the pre-emphasis filter to the original signal, a new signal is obtained, which can be seen in Fig. 5. We can see that the amplitude of the high-frequency bands was increased and the amplitude of the lower bands was decreased, which helps to obtain slightly better results.
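Equation (1) is not reproduced in this text. As a minimal sketch, a commonly used first-order pre-emphasis filter has the form y(t) = x(t) - a·x(t-1); both this form and the coefficient a = 0.97 are assumptions made here for illustration, not values confirmed by the paper. The function name `pre_emphasis` is hypothetical.

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order pre-emphasis: y[t] = x[t] - alpha * x[t-1].

    alpha = 0.97 is an assumed, commonly used value; the first sample
    is passed through unchanged because it has no predecessor.
    """
    if not signal:
        return []
    out = [signal[0]]
    for t in range(1, len(signal)):
        out.append(signal[t] - alpha * signal[t - 1])
    return out


# A slowly varying (low-frequency) signal is attenuated...
slow = [1.0, 1.0, 1.0, 1.0]
# ...while a rapidly alternating (high-frequency) signal is amplified.
fast = [1.0, -1.0, 1.0, -1.0]

print(pre_emphasis(slow))
print(pre_emphasis(fast))
```

The toy signals show the intended effect: after the first sample, the constant signal shrinks to roughly 0.03 in magnitude while the alternating signal grows to roughly 1.97.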
3) Framing and Windowing: The signal obtained from the pre-emphasis filter is divided into small frames, a process called framing. The reason for this process is that the Fourier transform, which is the next step, loses the frequency contours of the signal over time if it is applied to the entire signal.
After dividing the signal into overlapping frames, a window function is applied to each frame to remove discontinuities. In this work the Hamming function is used, to counter the assumption made by the Fast Fourier Transform (FFT) that the data is infinite and to reduce spectral leakage [23]. The equation of the Hamming function applied to each frame is described in (2), where "n" is the total number of samples in a single frame.
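The framing and windowing steps can be sketched as follows. Equation (2) is not reproduced in this text, so the sketch assumes the standard Hamming window w[k] = 0.54 - 0.46·cos(2πk/(N-1)) for a frame of N samples. The 8000 Hz sampling rate and one-second duration come from the pre-processing step; the 25 ms frame length and 10 ms step are assumed values, and all function names are hypothetical.

```python
import math


def hamming(frame_len):
    """Standard Hamming window of frame_len samples (assumed form of (2))."""
    if frame_len == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (frame_len - 1))
            for k in range(frame_len)]


def frame_signal(signal, frame_len, step):
    """Split a signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, step)]


def window_frames(frames):
    """Multiply every frame, sample by sample, with the Hamming window."""
    if not frames:
        return []
    w = hamming(len(frames[0]))
    return [[x * wk for x, wk in zip(frame, w)] for frame in frames]


# One second at 8000 Hz; 25 ms frames (200 samples) every 10 ms (80 samples).
signal = [1.0] * 8000
frames = frame_signal(signal, frame_len=200, step=80)
windowed = window_frames(frames)
print(len(frames), len(frames[0]))
```

With these assumed settings the one-second signal yields 98 frames of 200 samples, and the window tapers each frame towards 0.08 at its edges.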
After applying the Hamming function, the output signal is plotted as shown in Fig. 6.
4) Fast Fourier Transform: FFT is applied to the signal obtained in the previous step to transform each frame of N samples from the time domain to the frequency domain [21]. In other words, a frequency spectrum is calculated, where N is generally 256 or 512, and (3) is used to calculate this result. The output of the FFT is shown in Fig. 7, where the domain of the signal is frequency.
5) Mel Filter Bank: The frequency range of the FFT spectrum is very wide, and a voice signal does not follow a linear scale [21] [24]. A filter bank is therefore applied to transform the signal from the Hertz scale to the Mel scale, as shown in Fig. 8, where the Mel filter bank consists of overlapping triangular filters. The frequency in Hertz (f) can be converted to the Mel scale using (4).
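Equation (4) is not reproduced in this text. A minimal sketch of the Hertz-to-Mel conversion, assuming the widely used formula m = 2595·log10(1 + f/700), is given below together with the edge frequencies of triangular filters spaced evenly on the Mel scale. The upper limit of 4000 Hz follows from the 8000 Hz sampling rate; the number of filters (10) and the function names are assumptions for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Assumed form of (4): m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_edges_hz(n_filters, f_low, f_high):
    """Edge frequencies in Hz of n_filters triangular filters.

    The n_filters + 2 points are evenly spaced on the Mel scale;
    filter i rises from point i, peaks at point i + 1, and falls to point i + 2.
    """
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]


edges = filter_edges_hz(n_filters=10, f_low=0.0, f_high=4000.0)
print([round(e) for e in edges])
```

Because the points are equally spaced in Mel, the filters are narrow at low frequencies and progressively wider at high frequencies, which is the behavior the Mel scale is meant to capture.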
6) Discrete Cosine Transform (DCT): This process converts the Mel-scale spectrum back to a time-based domain. The result of this process is called MFCC, and the set of coefficients obtained is called the acoustic vectors [21]. In other words, up to this phase the inputs, which were audios, have been transformed into a sequence of acoustic vectors, which in turn form the set of inputs for the classification algorithms. The result is shown in Fig. 9.
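As an illustrative sketch of this last step, the code below applies an unnormalized DCT-II to the logarithm of a set of Mel filter-bank energies and keeps the first coefficients. The logarithmic compression is the usual MFCC convention and is assumed here, since the text does not state it; the function names and the number of coefficients kept are also hypothetical.

```python
import math


def dct2(values):
    """Unnormalized DCT-II: c[k] = sum_n x[n] * cos(pi * k * (2n + 1) / (2N))."""
    n_total = len(values)
    return [sum(x * math.cos(math.pi * k * (2 * n + 1) / (2 * n_total))
                for n, x in enumerate(values))
            for k in range(n_total)]


def mfcc_from_filterbank(energies, n_coeffs):
    """Log-compress filter-bank energies, apply the DCT, keep the first n_coeffs.

    A small constant is added before the logarithm to avoid log(0).
    """
    log_energies = [math.log(e + 1e-10) for e in energies]
    return dct2(log_energies)[:n_coeffs]


# A flat spectrum carries no spectral shape: only coefficient 0 is non-zero.
flat = [math.e] * 8
coeffs = mfcc_from_filterbank(flat, n_coeffs=4)
print([round(c, 6) for c in coeffs])
```

Applying this to the filter-bank output of every frame would give one acoustic vector per frame, which is the sequence that the classification stage compares.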

C. Classification and Recognition
The classification and recognition stage plays a very important role in the recognition accuracy. In this work, DTW and KNN are used to find matches between the different acoustic vectors obtained in the feature extraction phase. DTW uses a dynamic programming approach to find similarities between two time series, which basically have the same structure as the acoustic vectors obtained previously. For classification in continuous ASR it is more accurate to use other techniques, such as HMM or Neural Networks applied with different approaches such as Deep Learning [25]. These techniques are used because they try to imitate human language learning, taking into account variations in dialect or pronunciation. Since this work deals with an isolated-word ASR, it is more appropriate to use DTW and KNN.
1) Dynamic Time Warping: DTW is an algorithm that finds the minimum distance between two sequences or time series that depend on certain values such as time scales. It was initially used only for ASR tasks, but its application was later extended to fields such as Data Mining [26] [27]. Consider two time series P and Q with lengths n and m respectively:
$P = p_1, p_2, p_3, \ldots, p_n$
$Q = q_1, q_2, q_3, \ldots, q_m$
An $m \times n$ matrix is built, and for each intersection the distance between the two points $(p_i, q_j)$ is calculated using the Euclidean distance formula described in (5).
Then the minimum accumulated distance is calculated using (6). DTW can have many variations with interesting improvements, but each optimization is developed for a specific domain and is difficult to use in fields like ASR [28].
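Equations (5) and (6) are not reproduced in this text. The sketch below assumes the textbook formulation: the local cost is the Euclidean distance between $p_i$ and $q_j$, and the accumulated distance is D(i, j) = d(p_i, q_j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)). Function names are hypothetical, and each element of a sequence is treated as one acoustic vector.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed form of (5))."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(p, q):
    """Minimum accumulated distance between sequences p and q (assumed form of (6)).

    acc[i][j] holds the best cost of aligning the first i vectors of p
    with the first j vectors of q; row 0 and column 0 are padding.
    """
    n, m = len(p), len(q)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = euclidean(p[i - 1], q[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j],      # p advances alone
                                   acc[i][j - 1],      # q advances alone
                                   acc[i - 1][j - 1])  # both advance
    return acc[n][m]


# The same shape spoken more slowly (one vector repeated) still matches exactly.
p = [[0.0], [1.0], [2.0]]
q = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(p, q))
```

The example illustrates why DTW suits isolated words: two utterances of the same word that differ only in speaking rate can be aligned at zero cost, whereas a frame-by-frame Euclidean comparison would require equal lengths.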
2) K-Nearest Neighbor: Given a point "n", K-Nearest Neighbor is an algorithm that finds the values closest to "n" within the set of values that make up the training database [29]. In ASR, a feature vector takes the role of "n", and KNN finds the vectors closest to it, taking as reference a distance metric such as the Euclidean distance, which is calculated between all the vectors with the DTW algorithm.
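A minimal sketch of how KNN can be combined with a DTW distance is shown below. To keep it short, the sequences are one-dimensional toy series rather than real MFCC vectors, the local cost is the absolute difference, and the value k = 3, the labels and the function names are all assumptions for illustration rather than the configuration used in this work.

```python
from collections import Counter


def dtw_distance(p, q):
    """DTW between two numeric sequences, with absolute difference as local cost."""
    inf = float("inf")
    acc = [[inf] * (len(q) + 1) for _ in range(len(p) + 1)]
    acc[0][0] = 0.0
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            cost = abs(p[i - 1] - q[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[-1][-1]


def knn_classify(query, training, k=3):
    """Label a query by majority vote among its k nearest training sequences.

    training is a list of (sequence, label) pairs; nearness is measured with DTW.
    """
    nearest = sorted(training, key=lambda item: dtw_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: two made-up "words" with three recordings each.
training = [
    ([0, 1, 2, 1, 0], "one"),
    ([0, 1, 2, 2, 1, 0], "one"),
    ([0, 0, 1, 2, 1, 0], "one"),
    ([3, 3, 0, 0, 3], "two"),
    ([3, 3, 3, 0, 0, 3], "two"),
    ([3, 0, 0, 0, 3], "two"),
]

print(knn_classify([0, 1, 1, 2, 1, 0], training))
print(knn_classify([3, 3, 0, 3], training))
```

Each query differs in length from the stored templates, yet it is assigned to the word whose recordings follow the same contour, which is the role DTW plays as the distance metric inside KNN.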

V. ANALYSIS OF RESULTS
The experiment was conducted on a database of three hundred audios of the natural numbers from one to ten in Quechua. Each audio, in .wav format, had a duration of exactly one second, as set in the pre-processing step. Each number was pronounced by thirty different people, both men and women.
The database was divided into two sets: a training set with 70% of the audios and a test set with the remaining 30%. Of the 90 test audios that went through the classification method, 82 were correctly classified. The recognition accuracy of the ASR developed in this research work is calculated using (7), and at the end of the experiment it reached a value of 91.1%.

$$\text{Accuracy} = \frac{\text{words detected correctly}}{\text{number of words in data set}} \quad (7)$$

The results were also analyzed in the form of a normalized confusion matrix, which gives more detail about the performance of the ASR system [30]. The confusion matrix for our system is shown in Fig. 10, where the result of the classification is given for each number. The matrix shows the recognition accuracy for each label, which is useful for analyzing which numbers are correctly recognized and which are only partially recognized. Furthermore, we can identify the numbers with low recognition accuracy in order to analyze their features and improve the system to achieve a higher recognition rate.

Fig. 10. Confusion Matrix
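The accuracy in (7) and a row-normalized confusion matrix can be computed as in the following sketch. The figures 82 and 90 are the ones reported above; the labels and predictions used for the confusion matrix are toy data invented for illustration and are not the results of this work. Function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, labels):
    """Count matrix: rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so the diagonal shows per-label accuracy."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out


def accuracy(matrix):
    """Equation (7): correctly classified items over all items."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Reported figures: 82 of the 90 test audios were classified correctly.
print(round(100 * 82 / 90, 1))

# Toy example (not the paper's data): four test items, two labels.
true_labels = ["one", "one", "two", "two"]
predicted = ["one", "two", "two", "two"]
cm = confusion_matrix(true_labels, predicted, labels=["one", "two"])
print(normalize_rows(cm), accuracy(cm))
```

In the toy example the first label is recognized half of the time and the second always, so the normalized diagonal immediately points to the label that needs attention, which is how the matrix in Fig. 10 is meant to be read.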

VI. CONCLUSION AND FUTURE WORK
This research work develops an ASR system for isolated words using the MFCC, DTW and KNN techniques. The architecture we worked with is divided into two blocks, the feature extraction block and the classification and recognition block. In each block we used the algorithms that best fit our problem. The feature extraction block was developed by implementing MFCC, which consists of a series of algorithms that work sequentially. In the classification block, the acoustic vectors obtained in the feature extraction block were classified using DTW and KNN. The results were evaluated using (7), and the accuracy reached at the end of the work was 91.1%. The results were also analyzed in the form of a confusion matrix, which shows the recognition accuracy for every number and identifies which numbers are best recognized and which are only partially recognized.
As future work, we propose to improve the ASR system developed in this work by including all the words spoken in Quechua. We then propose to develop the system into a continuous speech recognition system capable of understanding and processing fluent speech from any native Quechua speaker.

TABLE I.
NUMBERS IN QUECHUA