Recent Advancement in Speech Recognition for Bangla: A Survey

This paper presents a brief survey of notable work on the development of Automatic Speech Recognition (ASR) systems for the Bangla language. It describes the speech corpora available for this language and reports the major contributions made in this research area in the last decade. Some important design issues in developing a speech recognizer, namely levels of recognition, vocabulary size, speaker dependency and classification approaches, are defined in this paper in increasing order of recognition complexity. The paper also highlights some challenges that are important to resolve in this exciting research field. Studies carried out over the last decade on Bangla speech recognition are briefly reviewed in chronological order. It was found that the choice of classification model and training dataset plays an important role in speech recognition.

Keywords—Bangla ASR; Bangla speech corpora; speaker dependency; vocabulary size; classification approaches; challenges


I. INTRODUCTION
There are several important applications of speech recognition systems. They are used to develop chat-bots in smartphones and gadgets. In call centers, speech recognition systems are used for automated replies to customers. ASR systems are widely used in automated machines to detect voice commands. Speech recognizers can also be used to detect crimes planned over phone calls and to detect hate speech. A study shows that for English more than 10% of searches are made by voice, most of them using smartphones [2], and this number will keep increasing. The first paper on speech recognition was published in 1950. Since then, research on speech technology has achieved remarkable advancement over the decades; major advances began in the 1980s with the introduction of the Hidden Markov Model (HMM) for speech recognition. The main objective of this research is to build an ASR (Automatic Speech Recognition) system that can operate on large-vocabulary continuous speech for different languages.
The Bangla language is spoken by more than 228 million people all over the world [1]. People from West Bengal, Tripura, Assam's Barak Valley, the Andaman and Nicobar Islands, and the diaspora living in various countries speak Bangla. It is the national language of Bangladesh and an official language of the state of West Bengal. As Bangla has a large number of speakers, a successful ASR system for this language will benefit many people. Research on Bangla ASR came into focus in the 1990s, and recognition of Bangla speech has been pursued since around 2000. In 2002, A. Karim et al. presented a method for spoken letter recognition in Bangla [3]. In the same year, K. Roy et al. presented a Bangla speech recognition system using artificial neural networks [4]. In 2003, M.R. Hassan presented a phoneme recognition system using an artificial neural network [5], and K.J. Rahman presented a continuous speech recognition system using an ANN [6]. Recently, Google presented a functional speech recognizer and voice search service (SpeechTexter and Google Assistant) for Bangla and other languages, but these are available only for Android devices. The aim of this paper is to summarise the important work done recently on the development of Bangla ASR to facilitate researchers working in this field. Fig. 1 shows the diagram of a common ASR system. The system takes a voice signal x(n) as input. After preprocessing, feature extraction is done to reduce the dimensionality of the input vector while preserving the discriminating attributes for recognition. The decoder has three main parts: acoustic models, a pronunciation dictionary and language models. The acoustic model calculates the probability of the observed acoustic signal (x1 ... xN) for a given word sequence (w1 ... wN). The language model provides the probability of a proposed word sequence, Pr(w1 ... wN).
The pronunciation dictionary contains a list of words with their phonetic transcriptions, and it proposes valid words for a given context. The decoder combines the inputs from all three parts and applies classification models to deliver the recognized text as output y(n).
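As a sketch of how the decoder combines these scores, the standard Bayes decision rule picks the word sequence that maximizes P(x|w)·P(w). The toy Python example below (all words and log-probabilities are invented for illustration, not taken from any real Bangla ASR system) combines the acoustic and language-model scores in the log domain:

```python
# Toy log-probabilities; the romanized Bangla words and the scores are
# illustrative assumptions only.
acoustic_logp = {              # log P(x | w): how well each hypothesis fits the audio
    ("ami", "bhalo"): -4.1,
    ("ami", "bhola"): -3.9,    # acoustically slightly better
}
language_logp = {              # log P(w): prior plausibility of the word sequence
    ("ami", "bhalo"): -1.2,    # much more likely as Bangla text
    ("ami", "bhola"): -5.0,
}

def decode(hypotheses, lm_weight=1.0):
    """Pick argmax over w of: log P(x|w) + lm_weight * log P(w)."""
    return max(hypotheses,
               key=lambda w: acoustic_logp[w] + lm_weight * language_logp[w])

best = decode(list(acoustic_logp))
print(best)  # the language model overrides the small acoustic preference
```

In a real decoder the hypothesis space is searched incrementally (e.g. beam search over a lattice) rather than enumerated, but the scoring rule is the same.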
The rest of this paper is organized as follows. Related work is discussed in Section II, issues to consider when developing an ASR system are presented in Section III, challenges in developing a successful Bangla ASR are explained in Section IV, available natural speech corpora are reported in Section V, recent advancement in the last decade is discussed in Section VI, and the discussion and conclusion of this study are presented in Sections VII and VIII, respectively.

II. RELATED WORK

[7]. In 2020, Badhon et al. reviewed 15 research papers that worked on Bangla ASR; the study presented the datasets and detailed methodologies involved in those works [8]. A few language-specific surveys on ASR have been conducted for other languages. For example, a group of researchers presented speech recognition techniques for the Chinese language [9]. A few studies have been done on speech recognition for Indian languages [10][11][12]. Lima et al. studied the speech recognition components for the Portuguese language [13]. A literature review on Arabic speech recognition was done by Al-Anzi and AbuZeina [14]. In 2006, Ronzhin et al. studied the methods and models used for Russian speech recognition [15]. The target of such reviews is to present a useful summary of the overall work done on speech recognition for a specific language.

III. ISSUES TO CONSIDER FOR ASR
There are some important research concerns to consider when developing a speech recognition system. The application, development complexity and recognition efficiency of the system depend on factors such as utterance type, vocabulary size, speaker dependency and pattern matching approach [16]. These factors are discussed briefly in the following sections in increasing order of recognition complexity.

A. Levels of Speech Recognition
The early stage of speech recognition started with phoneme recognition from recorded speech [17]. Speech recognition systems can be developed to recognize isolated words, connected words, continuous speech or spontaneous speech. Isolated words are uttered separately with sufficient pauses between them. For connected words, single words are recorded together, but there are still pauses between them. In continuous speech, words are connected and overlap; deliberate pauses are not added after each word while recording. Spontaneous speech recognition systems process natural speech, which is characterized by pauses, silence, disfluencies, etc. This type of recognition is the most difficult, as it requires additional methods to process the speech.

B. Vocabulary Size
The requirements for the vocabulary size of the training dataset depend on the target application of the recognition system. Some applications require a vocabulary as small as a few words, whereas others require millions of words to train the system. A small vocabulary comprises from a few to hundreds of words. It is used only when the system needs to recognize a small, fixed set of digits or other spoken words, for example digit dialing and access control; such datasets usually contain 2 to 10 hours of recorded speech. A medium-size vocabulary contains thousands of words and may comprise 10 to 100 hours of recorded speech; datasets of this kind are often used for under-resourced languages. A large vocabulary contains millions of words. Large-vocabulary recognition systems are used in real-life speech recognition, e.g. class lecture transcription; their corpora contain more than 100 hours of recordings involving a large number of speakers.

C. Speaker Dependency
For speech recognition, features are collected from speakers' voices and the classification model is trained on these features. Systems can be classified by the number of speakers they are able to identify successfully. Speaker-dependent recognition identifies the acoustic features of a single voice. Such systems are easier to develop, but they do not perform well for unknown speakers. Speaker-independent systems are trained on a large collection of speech from several speakers; features are calculated for this large dataset, and recognition is performed by searching for the best match in the existing data. Speaker-adaptive systems collect features from user samples to enrich the training data. The system adapts to the features best suited to the user's speech; in this way the error rate is reduced and the system also performs independently of the speaker.

D. Different Approaches of Speech Recognition
The approaches used to classify speech are categorized as follows [18]. Acoustic-phonetic approach: this approach focuses on the nature of the speech. Speech features of phonetic units are detected with the help of spectral analysis, for example the accent features of vowels and diphthongs, considering the formants and energy of the signals. The target is to discover the acoustic features of the sounds and apply those features to recognize continuous speech. Prior to recognition, this involves a few steps: feature extraction, segmentation of the feature contours, and labelling of the segments.
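As a minimal illustration of the spectral-analysis step, the sketch below frames a signal and finds the dominant frequency in each frame. Real systems track formants or compute cepstral features; the synthetic 200 Hz "vowel" here is purely an invented example:

```python
import numpy as np

sr = 8000                                   # sample rate (Hz)
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 200 * t)        # synthetic vowel: a 200 Hz tone

# Split the one-second signal into overlapping analysis frames.
frame_len, hop = 400, 200
frames = [signal[i:i + frame_len]
          for i in range(0, len(signal) - frame_len + 1, hop)]

def dominant_freq(frame):
    window = frame * np.hamming(len(frame)) # taper to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(window))
    return np.fft.rfftfreq(len(frame), 1 / sr)[np.argmax(spectrum)]

freqs = [dominant_freq(f) for f in frames]
print(float(np.median(freqs)))              # → 200.0
```

The per-frame spectral peaks form a crude frequency contour of the kind that is then segmented and labelled in the acoustic-phonetic pipeline.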
Pattern recognition approach: this involves two steps, pattern training and pattern matching. Appropriate statistical methods are applied to extract patterns from speech units (possibly smaller than a word, or a single word), which are stored in a database. A training algorithm is applied to this stored dataset, and during recognition unknown speech segments are compared directly with the trained patterns.
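The matching step can be sketched with dynamic time warping (DTW), a classic template-comparison method also used by some of the Bangla recognizers surveyed later: stored templates are compared to an unknown utterance and the closest one wins. The feature vectors below are toy 1-D contours, not real speech features, and the word labels are hypothetical:

```python
import numpy as np

def dtw_distance(a, b):
    """Minimum cumulative cost of aligning sequence a to sequence b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allow a step, a stretch, or a shrink in the alignment.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

templates = {                       # hypothetical stored word patterns
    "ek":  [0.1, 0.9, 0.8, 0.1],
    "dui": [0.2, 0.3, 0.9, 0.9, 0.2],
}
unknown = [0.1, 0.85, 0.8, 0.75, 0.1]   # a time-stretched version of "ek"

best = min(templates, key=lambda w: dtw_distance(templates[w], unknown))
print(best)  # → ek
```

DTW's warping tolerance is what lets a template match the same word spoken faster or slower, which plain Euclidean comparison cannot do.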

Artificial intelligence approach:
This approach mimics the human brain, which solves problems based on its previous learning experience. The problem-solving strategy follows the steps of learning, reasoning and perception. Typically this type of speech recognition system is based on neural networks (NN). It combines ideas from both the acoustic-phonetic approach and the pattern recognition approach. Input signals are segmented and the acoustic parameters for these segments are calculated; the system is trained on these parameters, and pattern matching is done for recognition. The pattern recognition task can be supervised or unsupervised. In supervised pattern recognition, example input patterns are provided to the system with predefined class labels. In unsupervised systems there are no example patterns; the system must learn the classes itself. Recent research focuses on speech recognition based on DNNs, RNNs and hybrid HMM-DNN approaches.
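A minimal supervised-learning sketch in this spirit: a single logistic unit trained by gradient descent on invented 2-D "acoustic features" for two hypothetical phoneme classes. Real systems use far larger MLP/DNN/RNN models; everything here is a toy assumption:

```python
import numpy as np

X = np.array([[0.1, 0.2], [0.2, 0.1],    # class 0 examples (toy features)
              [0.9, 0.8], [0.8, 0.9]])   # class 1 examples
y = np.array([0, 0, 1, 1])               # supervised class labels

w = np.zeros(2)
b = 0.0
for _ in range(500):                     # plain gradient descent
    p = 1 / (1 + np.exp(-(X @ w + b)))   # sigmoid activation
    grad = p - y                         # gradient of cross-entropy loss
    w -= 0.5 * (X.T @ grad) / len(y)
    b -= 0.5 * grad.mean()

pred = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(pred.tolist())                     # → [0, 0, 1, 1]
```

The training loop is the "learning" step; applying the trained weights to new feature vectors is the pattern-matching step described above.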

E. Performance Analysis
For word recognition systems, raw accuracy was used in many studies. For continuous speech recognition, word error rate (WER) and word recognition rate (WRR) are the most commonly used performance measures. Word error rate is computed as [19]: WER = (S + D + I) / N, where S = number of substitutions, D = number of deletions, I = number of insertions, and N = number of words in the reference. Word recognition rate is defined as WRR = 1 - WER. The reference word sequence and the recognized word sequence may differ in length and order; to handle this, the recognized words are first aligned with the reference words and then the error rate is calculated.
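The WER computation, including the alignment step, can be sketched as a word-level edit distance, whose minimum-cost path yields the S + D + I count directly. The example sentences below are invented romanized Bangla:

```python
def wer(reference, hypothesis):
    """WER = (S + D + I) / N via edit-distance alignment of word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # D[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # i deletions
    for j in range(m + 1):
        D[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            D[i][j] = min(sub,           # substitution (or match)
                          D[i - 1][j] + 1,   # deletion
                          D[i][j - 1] + 1)   # insertion
    return D[n][m] / n

# One substitution out of three reference words: WER = 1/3.
print(wer("ami bhalo achi", "ami bhola achi"))
```

Because of the alignment, WER can exceed 1.0 when the hypothesis contains many insertions, which is why WRR = 1 - WER can be negative for very poor output.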

IV. CHALLENGES IN DEVELOPING BANGLA ASR
Inherently, the Bangla language has some distinct features, such as its own phonemic system, the presence of long and short vowels, frequent use of consonant clusters, and variation in stress and intonation. Building an efficient and successful speech recognizer for continuous Bangla speech is a challenging task for researchers. There are some well-established APIs available for English, such as SAPI, SIRI and the IBM Watson API. Researchers face a few challenges when developing a speech recognizer for Bangla based on a successful API developed for another language, e.g. English. The reasons are discussed below.

A. Different Phonemes
The Bangla language consists of 14 vowels (7 oral, 7 nasal) and 29 consonants [20]. The number of phonemes and the phonemic features differ from language to language; for example, Bangla and English have distinct phonemic systems [21]. One speciality of Bangla is its 7 nasalized vowels. There also exist two more long vowels, /i:/ and /u:/.

B. Speech Patterns
There are some basic differences between the speech patterns of Bangla and those of English and other languages. Bangla is said to have bound stress: stress is high at the beginning and becomes low at the end of an utterance [22]. English, in contrast, is said to be stress-timed, with varying stress patterns.

C. Difference in Accents
There is a noticeable difference in accent from region to region, especially among the districts of Bangladesh. Sylhet, Dhaka, Comilla and other districts have their own dialects, and the same word may be pronounced differently in different areas. This is a big challenge for building a common ASR system for all. For example, পাতা /pata/ (English: leaf) is pronounced ফাতা /fata/ by many people from Sylhet.

D. Insufficient Dataset
Bangla is still considered a low-resource language [23]: very few resources are publicly available for research purposes. Nowadays almost all research in computational linguistics concentrates on machine learning-based models, and a large training dataset is key to obtaining a highly accurate DNN model. Only a few annotated datasets are available for Bangla speech recognition.

E. Homophones
There are a number of words which sound alike but have different spellings and meanings. For example, 'শব' (English: dead body) and 'সব' (English: all) are both pronounced /Sob/. Another example is 'বিশ' (English: twenty) vs. 'বিষ' (English: poison), both pronounced /biS/. Since the phonetic representations are identical, these problems cannot be resolved at the acoustic or phonetic level; a higher level of language analysis is required. To solve such problems we need a well-defined language model and pronunciation dictionary. Unfortunately, there is still a lack of such well-defined models for Bangla continuous speech.
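How a language model can resolve such homophones may be sketched with toy bigram counts. The romanized tokens, the preceding word and the counts below are all hypothetical, not taken from any real Bangla corpus:

```python
# Assumed bigram counts from a hypothetical corpus: both candidate spellings
# share one pronunciation, but context makes one far more plausible.
bigram_counts = {
    ("dekhechi", "shob"): 40,     # 'সব' (all) often follows this verb
    ("dekhechi", "shab"): 1,      # 'শব' (dead body) rarely does
}

def pick_spelling(prev_word, candidates):
    """Choose the candidate with the highest bigram count after prev_word."""
    return max(candidates,
               key=lambda w: bigram_counts.get((prev_word, w), 0))

print(pick_spelling("dekhechi", ["shob", "shab"]))  # → shob
```

A real language model would smooth these counts and score whole sentences, but the principle is the same: the acoustic model ties, and the language model breaks the tie.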

F. Spoken vs Written Words
Sometimes the spoken language is not the same as the written language. For example, /bol/ (English: ball) and /bolo/ (English: speak) both have the same spelling 'বল'। Another example is /Sabdhan/ 'সাবধান', where 'স' is pronounced as 'শ'. Again, a well-defined language model is required to solve this kind of problem.

G. Consonant Clusters
Frequent use of consonant clusters in Bangla speech makes word boundary detection difficult. One example is the Bangla word 'চক্কর' (/tCok:or/); in such cases the boundary is often detected wrongly before the word ends, e.g. the system treats it as two words, 'চক' and 'কর'. This degrades the overall recognition accuracy of the system.

H. Mismatched Environment
The background sound in many circumstances is an uncontrollable variable. For example, the level of background noise in the streets of Bangladesh is not the same as in other countries. It is a challenge to build an ASR system that works efficiently in noisy environments.

I. Unit Selection
Bangla words are pronounced syllable-wise and the language is said to be rhythmic. There are different units of speech, i.e. syllables, demi-syllables, diphones and phonemes. Bangla and English have different syllable structures [24], so the same pronunciation model cannot be applied to both languages. It is sometimes hard to decide which unit to select to implement an efficient boundary detection method. Syllable segmentation is still a challenging task on which researchers continue to work.

V. AVAILABLE SPEECH CORPORA

[36]. Khan and Sobhan constructed another speech corpus for isolated words in the same year, which has a total of 375 hours of recordings collected from 150 speakers [37]. OpenSLR's 'Large Bengali ASR training dataset' was published by Google in 2018; it contains 229 hours of continuous speech in Bangladeshi Bangla [38]. In total, 505 speakers (323 males and 182 females) participated in the recording of 217,902 utterances. In 2020, Ahmed et al. developed an annotated speech corpus of 960 hours of speech collected from publicly available audio and text data [24]. The authors of this corpus also proposed an algorithm to automatically generate transcriptions from existing audio sources. At Shahjalal University, the NLP research team has developed a speech corpus, subak.ko, which contains 241 hours of recorded speech with 38,470 unique words and is yet to be published [39]. Table I summarizes these corpora.

VI. RECENT ADVANCEMENT IN THE LAST DECADE

One system [40] used its own dataset of 10,000 words recorded from 50 males and 50 females; its correct recognition rate was above 80%. An artificial neural network (ANN) and linear predictive coding (LPC) based ASR was proposed by Anup Kumar et al. [41] in the same year. A multilayer perceptron (MLP) approach was followed to design the ANN model, and an LPC coder was used to extract the coefficients. It was able to discriminate four different words uttered by 2 males and 2 females. In the next year, a Bangla phoneme classifier was built by Kotwal et al. [42].
It used hybrid features based on Mel-frequency cepstral coefficients (MFCCs); phoneme probabilities were derived from the MFCCs and acoustic features using a multi-layer neural network (MLN). It obtained an accuracy rate of 68.90% using an HMM classifier. The dataset contained 4000 sentences uttered by 40 male speakers. In the study [43] carried out by Mahedi Hasan et al., the researchers focused on a triphone HMM-based classifier for word recognition. The system could recognize continuous speech using a speech corpus of 4000 sentences spoken by 40 males, with an accuracy rate above 80%. Mel-frequency cepstral coefficients MFCC38 and MFCC39 were extracted as features for classification. In 2011, Firoze et al. [44] proposed a word recognition system which used spectral features and a fuzzy logic classifier. The system was trained on a small dataset of 50 words spoken by one male and one female; the reported accuracy was 80%. An ASR method based on context-sensitive triphone acoustic models was presented by Hassan et al. for continuous speech recognition in 2011 [45]. It applied a multilayer neural network (MLN) to extract phoneme probabilities and a triphone HMM for classification, obtaining an accuracy of 93.71% on the same dataset as [42]. At about the same time, a study was carried out by Sultana et al. [46] that applied a rule-based approach using the Microsoft Speech API (SAPI); the obtained accuracy was 74.81% on 270 unique Bangla words. Akkas Ali et al. [47] presented a Bangla word recognizer in 2013 which used MFCC and LPC features and a hybrid of a Gaussian mixture model (GMM) and dynamic time warping (DTW) for classification. A group of researchers applied a back-propagation neural network for Bangla digit recognition [48]. Recognition accuracy was 96.33% for the speaker-dependent system and 92% for the speaker-independent system. The sample size of the dataset was limited to 300 words taken from 10 male speakers.
A speaker-dependent neural network-based speech recognizer for this language was built in 2014 using MFCC features [49]. It employed a feed-forward network with the back-propagation algorithm for classification, and the perceived accuracy was 60%. A study carried out by Mahtab Ahmed and his team in 2015 claimed an accuracy of 94%; it employed a Deep Belief Network (DBN) to classify recorded Bangla digits [50]. Seven layers of RBMs were considered in designing the DBN, and speech features were collected from MFCCs. Another study [51] applied a semantic modular time-delay neural network (MTDNN) for Bangla isolated word recognition. A recurrent time-delay structure was used to obtain dynamic long-term memory, and a total of 525 words were used to obtain an accuracy of 82%. In 2016, Nahid et al. [52] developed an automatic Bangla real-number recognizer using the CMU Sphinx 4 API, which is based on HMM. They used their own dataset of 3207 sentences taken from male speakers; feature extraction was done using MFCCs, and the accuracy of the system was 85%. In the same year, Mukherjee et al. [53] developed a Bangla character recognition system, REARC (Record Extract Approximate Reduce Classify). Their database consisted of 3150 Bangla vowel phonemes retrieved from the voices of 18 females and 27 males. They considered MFCCs for feature extraction, and the recognition rate was reported as 98.22%. Another study was published in the same year [54] which utilized a back-propagation neural network (BPNN) to classify Bangla digits.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 3, 2021

The system in [55] added the Bangla language, which was a turning point for Bangla ASR. It employed attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) and an n-gram based model for context detection [56]. Nahid et al. presented a further study [57] for Bangla real-number recognition using the dataset of their previous experiment [52].

VII. DISCUSSION
The study reveals that a good amount of research has been done on developing Bangla ASR in the last decade. A large number of studies concentrated on developing a successful word recognition system for Bangla. Since researchers are nowadays more interested in NLP research using end-to-end systems, there is growing attention toward developing a continuous speech recognition system for Bangla based on this type of model. From Table II, it is seen that the most commonly used features are MFC coefficients, and the recent trend in classifiers is focused on ANN-based models. Considering corpus size, the largest speech corpus for Bangla is the "Bangla Speech Corpus from Publicly Available Audio & Text", though the largest publicly available natural corpus for this language is Google's "Large Bengali ASR training dataset". Considering the training dataset and accuracy level, Google's voice API performs best for Bangla speech recognition to date. That system uses an n-gram language model, which has problems with synonyms and rigidity. It is evident that using a larger dataset with newer ML-based models improves the overall recognition rate. Though a lot of work has been done on Bangla ASR, efficient language models and pronunciation models still need to be developed for this purpose.

VIII. CONCLUSION
In this paper, a study has been presented covering the relevant research done on Bangla ASR. A short summary of 24 research papers has been reported to address major