Acoustic Modeling in Speech Recognition: A Systematic Review

This paper presents a systematic review of acoustic modeling (AM) techniques in speech recognition (SR). Acoustic modeling establishes the relationship between acoustic information and language constructs in SR. Over the past decades, researchers have presented studies addressing specific concerns in AM. However, previous works lack a systematic and comprehensive review of acoustic modeling issues, and this systematic review is introduced to fill that gap. The paper provides an extensive inspection of the research performed since 1984. The investigation and analysis of AM draws on data from 73 research works, published between 1984 and 2020, that were chosen after a screening process. The systematic review process was divided into parts to investigate the main issues in acoustic modeling: feature extraction techniques, acoustic modeling units, speech corpora, classification methods, tools used, language issues, and evaluation parameters. This study helps the reader to understand these issues in detail. The research outcomes depict research trends and shed light on new research topics in AM. The results of this review can be used to build a better speech recognition system by choosing a suitable acoustic modeling construct.
Keywords: Acoustic modeling; speech recognition; systematic review; acoustic unit; MFCC; classification


I. INTRODUCTION
Speech recognition (SR) is intended to convert spoken terms into text. Nowadays, with an increasing number of devices, people use speech recognition systems such as Siri on the iPhone, Alexa from Amazon, and Cortana for Windows. Speech recognition systems are becoming popular for both commercial and personal purposes [1]. Because speech recognition influences every field of life and humans have always wanted to talk to machines, it has long been a concern of researchers. Speech recognition and understanding systems have helped human beings in different ways. In recent years, researchers have even started experimenting with learning human activities from audiovisual inputs using neural networks. Speech recognition systems are applied in speech-enabled devices, medicine, machine translation systems, home automation systems, and education [2].
Acoustic modeling is an initial and essential process in speech recognition. The acoustic model establishes the relation between acoustic information and linguistic units. Most of the computation in recognition is performed in acoustic modeling, owing to feature extraction and statistical representation, so it strongly affects the recognition process. Statistical representations are prepared from the extracted features. The distribution of the extracted features for a particular sound is modeled in AM to establish the link between the features and the structure of the linguistic unit. Various feature extraction techniques, such as those based on human perception and on the working of the voice production mechanism, have been reported [3]-[5]. Features have also been extracted for AM in speaker-independent recognition, as such systems pose particular difficulties [6]-[9].
The selection of a classification method is also an important step in developing acoustic models. Many research works on acoustic modeling based on different classification techniques have been reported [10]. The reported methods include hidden Markov models (HMMs), discriminative training for optimization of the model parameters, artificial neural networks (ANNs), deep neural networks (DNNs), and sequence-to-sequence acoustic modeling.
Further, AM is linked to many other concepts. It requires an understanding of acoustic-phonetic knowledge, microphone and environment variability, gender, and dialectal differences. Rigorous training is required to determine the connection between linguistic units and acoustic observations [11]. AM is also directly linked to pronunciation modeling and to the modeling of variability related to speaker, environment, and context [12]. Acoustic models using subword units have been tried to improve recognition [13]. Subword modeling units such as phones, diphones, syllables, and context-dependent phones were used [5]. It was also reported that phoneme-based models are used to overcome the need for a huge quantity of data when creating trained models. Context-dependent models have also been tried. Triphone-based context-dependent models were used to reduce contextual effects [14], [15]. Researchers also addressed acoustic modeling for multilingual SR systems by using decision-tree clustering that takes advantage of data in languages other than the target language [16]. Further, different language modeling techniques are used in speech recognition. N-gram models, which predict words based on probability, are widely used [17]. AM in SR also faces different challenges. The task of acoustic modeling is complicated as well as exciting [18]. The design of adequate models has been a constant effort since the beginning of automatic speech recognition (ASR) [19]. Data scarcity has always been a concern for researchers. Researchers and different groups have developed speech corpora as per their requirements. However, researchers still face a lack of public-domain speech corpora, especially for low-resource languages, for the realization of recognition frameworks [20].
www.ijacsa.thesai.org
Researchers developed acoustic modeling methods using deep neural networks (DNNs) for unsupervised SR in zero-resource languages [21]. The selection of feature extraction techniques under mismatched and noisy conditions makes acoustic modeling a challenging task. Researchers experimented with different robust feature extraction techniques, with further processing in acoustic modeling, for different environmental conditions. In addition, the acoustic modeling task is complicated by contextual variability, pronunciation variability, and speaker variability. Researchers attempted to improve speech recognition by using different acoustic units, robust feature extraction, and different classification methods [13], [22]-[25].
During the past decades, researchers have presented reviews on different acoustic modeling techniques in SR. However, most of them focused only on specific issues and did not cover all the key issues in acoustic modeling. Very few papers present a complete and systematic review of acoustic modeling in speech recognition. There is a need for a systematic analysis of earlier research works to elaborate the basic and advanced concepts in acoustic modeling. This work presents a systematic literature review (SLR) to fill this gap and to provide a thorough review of AM issues for both novice users and specialists in the field of SR. We present a comprehensive study in this field. Specifically, we emphasize the key issues related to feature extraction techniques, classification techniques, acoustic modeling units, speech corpora, language issues, tools, and evaluation parameters. The research methodology used in this study has been adopted from [26]-[30]. The systematic review process was divided into requirement analysis for the systematic review, setting of research questions, formulation of search criteria for research papers, selection and rejection of papers, setting of assessment measures for the collection of papers, extraction of relevant information as per the research questions, and finally reporting of the results with analysis and discussion. The investigation focused on acoustic modeling issues in speech recognition through a comprehensive study of 73 research papers published between 1984 and 2020. After the initial screening of 250 papers, a total of 127 papers were selected for the complete survey; of these, 73 papers were selected for the systematic review process.
Different research questions were framed to address acoustic modeling issues, and answers were provided by extracting relevant information from the research papers. With this review, we provide the speech research community with the understanding to decide among acoustic modeling methods as per the requirement. To better understand the AM concepts, we have also described the basic concepts in speech recognition and acoustic modeling.
Research findings show different research trends and highlight new research areas. The advantages and disadvantages of various approaches are also provided as a guide for new and experienced researchers. We have attempted to address all possible aspects of acoustic modeling. Throughout this paper, a constant effort has been made to address the issues comprehensively and to fill the research gaps. The paper contributes by exploring the following facts.
• Different feature extraction techniques for AM were explored.
• Various classification techniques for AM were identified.
• The need for speech corpora and their different characteristics were revealed.
• Different software and tools were explored.
• Acoustic modeling units were investigated.
• Various language issues in speech recognition for AM were identified.
• The types of publication (journal, conference, workshop, lecture notes, thesis) were identified.
• The specific journals and conferences that published the papers were identified.
• Different evaluation criteria were defined.
The paper is structured as follows. Section II describes related work. The speech recognition process and acoustic modeling are elaborated in Section III. Section IV explains the methodology for the systematic review. Section V presents the results and analysis. Section VI presents the discussion. The last section concludes the paper and gives future directions.
II. RELATED WORKS
Reviews on various issues, including acoustic modeling, have been presented for the SR framework. Acoustic modeling with the acoustic-phonetic methodology and pattern recognition methods was addressed in [31]. The researchers discussed several factors that enhance SR, including the use of HMM modeling, the use of subword models, and corrective training. The focus of the paper was on AM using subword models with or without context dependency. The researchers experimented with several methods for creating acoustic models to characterize phone-like units. Context-dependent modeling improved the recognition results.
Researchers presented a review of HMM-based speech recognition [32]. The study covered HMM architecture, different techniques, and related issues. The authors covered parameters such as the selection of the optimal number of states, the number of Gaussian mixtures, context-dependent and triphone-based modeling, feature vectors, the selection of speech databases, and the language model. The widely known HMM-based toolkit HTK was explained. It was concluded that HMM-based speech recognition technology had been widely used and accepted by researchers for decades.
Research work on noisy conditions was presented using a taxonomy-based approach [33]. The authors used different key attributes to offer insight into noise-robust methods in SR. The survey addressed techniques that had been successful over the years and held promise for further research. The techniques were evaluated using five criteria. The first criterion was feature-space versus model-domain processing, used to analyze the mismatch between training and testing conditions. The second criterion was compensation using prior information about the acoustic distortion. The third criterion was compensation with implicit versus explicit distortion modeling. The fourth criterion was uncertainty versus deterministic processing. The fifth criterion was joint versus disjoint model training.
An overview of different modeling techniques, such as hidden Markov models (HMMs), deep neural networks (DNNs), and convolutional neural networks (CNNs), was given in [34]. The advanced features of the CNN architecture were also discussed. The advantages of CNNs, such as the normalization of speaker variance by local filters in the convolution layer, were elaborated. It was concluded that, in this decade, researchers are focusing on DNNs and CNNs for acoustic modeling to overcome the challenges in SR systems.
The study in [35] focused on the comparison of feature extraction, classification, and language models used in SR. The paper started with a description of the basic SR framework and its key elements. Popular feature extraction techniques such as Mel Frequency Cepstral Coefficients (MFCCs), Perceptual Linear Prediction (PLP) Cepstral Coefficients, Relative Spectral Perceptual Linear Prediction Coefficients (RASTA-PLP), Linear Prediction Cepstral (LPC), Discrete Wavelet Transform (DWT), and transformation techniques were covered. It was stated that MFCCs are widely used and well-known features. Classification methods such as HMMs, ANNs, and SVMs were described for the ASR system. It was stated that researchers are experimenting with hybrid approaches that combine HMMs with other models. The findings also indicate that SVM-based speech recognition systems are being adopted because they perform better than ANNs. Finally, it was stated that the spoken language also affects the speech recognition process. A comparative study of the different issues was also presented to aid understanding of the topic.
A review paper on machine learning (ML) in ASR was presented in [36]. The ML techniques used in speech recognition were discussed, providing insight into the ML paradigm in the SR process. Different machine learning approaches, namely GMM-HMM, ANNs, support vector machines (SVMs), and deep learning techniques, were described with their characteristics. Fundamental concepts of neural networks were also explained. It was concluded that ML techniques are being widely tried in speech recognition, and that recent advances in deep learning, such as connectionist temporal classification (CTC) based acoustic modeling, are an exciting path towards large-vocabulary continuous speech recognition.
A review paper was presented to address acoustic modeling issues and refinements [37]. First, the construction and functioning of the HMM and its constraints were reviewed. Advancements and improvements to the conventional HMM were then explored. The current challenges and performance issues of speech recognition systems were also investigated.
A survey of speech recognition using deep neural networks (DNNs) was presented [30]. The research findings include data on the databases used, the feature extraction techniques, and the modeling techniques. It was stated that both public and private databases were used as speech corpora. The speech recognition systems were applied to different conditions, such as noisy, neutral, and emotional speech.
Brain-inspired spiking neural networks (SNNs) were applied to explore large vocabulary speech recognition [38]. These networks are inspired by the working of the brain and have a low computation cost. The work is progress towards rapid and energy-efficient SR. The ASR system can be developed using PyTorch, and it can be easily integrated with the PyTorch-Kaldi speech recognition toolkit. The results show that the system provided better accuracy than its ANN counterparts. Time-delay neural network (TDNN) based acoustic modeling was presented for Hindi speech recognition [39]. It was indicated that the TDNN showed improvement over GMM-HMM systems.
The presented work differs from the above-mentioned reviews, as we have given a detailed and thorough examination of the acoustic modeling and its related issues in speech recognition systems. The paper first provided an overview of speech recognition and AM. This study provided the reader with the appropriate background to fully understand the topic presented. The systematic review was carried out by using papers from 1984 to 2020. We have introduced a systematic review by including the research works from the beginning, middle, and recent years to understand the flow of acoustic modeling research in speech recognition.

III. SPEECH RECOGNITION PROCESS AND ACOUSTIC MODELING
A generalized speech recognition system includes preprocessing, feature extraction, acoustic modeling, and language modeling units together with a recognition engine. Fig. 1 illustrates the SR framework with two phases. The complete recognition process is divided into two components: acoustic analysis and the acoustic/linguistic decoder. The preprocessing block consists of pre-emphasis, which increases the magnitude of the higher frequencies to flatten the magnitude spectrum, and windowing of the speech signal [40]. Applying a window extracts a small segment of the speech signal that is considered stationary for speech processing analysis [41]. The output of the feature extraction block is a sequence of feature vectors, which are then used in acoustic modeling of speech utterances. The acoustic model is prepared from the speech database and linguistic constructs. The language model block contains all the programs related to the language modeling issues required for speech recognition. During the recognition phase, the probability of word sequences is estimated by the language model (LM). Language models are also used to decide between acoustically confusable spoken utterances by incorporating syntactic and semantic constraints of the spoken language [42], [43]. The LM also restricts the search space of the recognition engine [44]. The speech recognition process finds the best sequence of words based on the acoustic model, language model, and recognition engine.
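As an illustration of the preprocessing block, the following Python sketch applies pre-emphasis and then cuts the signal into windowed frames. It is a minimal sketch and not the procedure of any reviewed work: the pre-emphasis coefficient 0.97, the choice of a Hamming window, and the function names pre_emphasize, hamming, and frame_signal are assumptions introduced for this example.

```python
import math


def pre_emphasize(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is an assumed, commonly used value; the first sample is
    passed through unchanged because it has no predecessor.
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def hamming(length):
    """Return a Hamming window of the given length (assumed window type)."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def frame_signal(signal, frame_len, hop):
    """Cut the signal into overlapping frames and apply the window to each.

    Each frame is short enough to be treated as stationary. Trailing
    samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(
            [signal[start + i] * window[i] for i in range(frame_len)]
        )
    return frames
```

At a 16 kHz sampling rate, a 25 ms frame with a 10 ms shift would correspond to frame_len=400 and hop=160; these values are likewise assumptions and are not taken from the reviewed papers.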
The design and development of a speech corpus is an essential step towards acoustic modeling [43]. As speech recognition systems are now being developed for various needs, the design and development of speech databases play a crucial role in acoustic modeling. The phonetic information for acoustic modeling is extracted from the speech corpus. Speech corpora are also used to train and test recognition systems. The selection of the acoustic unit is another important decision in acoustic modeling. Researchers have used word-level acoustic modeling; however, word-level modeling always suffers from data scarcity. Subword models are applied to overcome the requirement of a large number of word instances in training for word-based models. Subword models based on phonemes, syllables, and triphones are commonly used [45].
Phoneme-based models are used to overcome the large training data requirement of word-based models, especially in continuous speech recognition with very large vocabularies. However, phoneme-based systems suffer from contextual effects. The contextual effects are reduced by using triphone-based systems, which consider the left and right contexts of each phoneme. Triphone-based systems, in turn, suffer from data scarcity. Syllable-based systems use a larger acoustic unit [46]-[48] to reduce the contextual effects seen in phoneme-based systems. Researchers have also attempted to use universal phone sets for multilingual speech recognition and under-resourced languages [49], [50].
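The relation between phoneme and triphone units can be sketched in a few lines of Python. The sketch below expands a phoneme sequence into triphones written in the left-centre+right form (l-c+r) used by HTK; the boundary symbol "sil", the example phone labels, and the function names are assumptions made for illustration. The second function gives a simple upper bound on the number of distinct triphones, which shows why triphone systems run into data scarcity.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phoneme sequence into context-dependent triphones.

    Each phoneme is written as left-centre+right. The boundary symbol
    (assumed to be "sil") stands in for the missing context at the
    start and end of the sequence.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def max_triphone_inventory(num_phones):
    """Upper bound on distinct (left, centre, right) combinations."""
    return num_phones ** 3


if __name__ == "__main__":
    # Hypothetical phone labels for the word "cat".
    print(to_triphones(["k", "ae", "t"]))
    # An assumed inventory of 40 phones allows up to 64000 triphones,
    # far more than most corpora cover with enough examples.
    print(max_triphone_inventory(40))
```

Many of the possible triphones never occur or occur too rarely in a training corpus, which is one motivation for the state-tied triphones mentioned later in this review.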
During feature extraction, insignificant information is removed from the speech signal. Various methods based on speech perception and production have been applied, such as LPC, MFCCs, and PLP [51], [52]. Researchers have also worked to find features for different environments and speaker-independent systems. Different noise-robust feature extraction techniques have been applied. Acoustic models have also been generated by extracting features from spectrogram images using convolutional neural networks (CNNs) [53].
In acoustic modeling, different classification methods are used. Automatic speech recognition classification methodologies can be categorized as based on acoustic-phonetic knowledge, pattern recognition, or artificial intelligence (AI) [54], [55]. Widely used techniques are based on hidden Markov models (HMMs) and artificial neural networks (ANNs). Discriminative training is also used; it covers both feature extraction and classification in order to minimize classification errors. It ensures that the classifier itself maps the input space to one more suitable for classification [56]-[58].
Recent works have reported deep learning-based acoustic models. Researchers generated acoustic word models using contextual information for long conversational speech with a joint CTC/attention-based approach [59]. Speech recognition was also improved by language modeling based on long short-term memory (LSTM) neural networks [43]. Researchers found that DNN-based models obtained up to 30% relative error reduction over the best discriminatively trained GMMs. The performance of a DNN-based system is also influenced by the feature vectors used [60].

IV. METHODOLOGY FOR THE SYSTEMATIC REVIEW
The systematic review conducted in this paper is based on the studies [27]-[29], [61]. Those studies divide the investigation into a planning phase, an executing phase, and a reporting phase. We have grouped the systematic review process into eight steps. Fig. 2 shows the methodology used to perform the systematic review.
The review process started with a requirement analysis for a systematic study of acoustic modeling. The second phase identified and formulated the research questions according to our defined goals and the gaps in earlier surveys. The strategy for searching papers from different resources was decided in the third phase. The fourth phase set the inclusion and exclusion criteria for selecting research papers. The evaluation criteria for the final selection of papers were prepared in the fifth phase. The sixth phase collected the data from the selected papers. The results were reported in the seventh phase. The last phase presented the evaluation and analysis. The following subsections describe the review protocols used in this study in detail.

A. Formulation of Research Questions
To meet the goal of the study, research questions were framed to conduct the systematic review. The issues addressed relate to the study papers used, the types of SR system, the languages, the language issues, the speech corpora, the software and tools, the acoustic units, the feature extraction techniques, the classification methods, and the performance metrics. Table I lists the research questions. A total of ten research questions were formulated to reveal different aspects of AM in speech recognition.

RQ3: What are the languages found in the research investigation?
RQ4: Which are the various language issues used in speech recognition for acoustic modeling?
RQ5: What are the different databases used in the study?
RQ6: What are the different software and tools found in the inspection of the works?
RQ7: Which are the different acoustic modeling units used in the study?
RQ8: What are the different feature extraction techniques used in acoustic modeling?
RQ9: Which different classification techniques are used in speech recognition?
RQ10: What are different performance measurements in the speech recognition system?

B. Search Strategy
All the key terms related to the research questions were used to search for research papers. Further exploration was done in specific journals related to speech processing. Connectors such as 'OR' and 'AND' were used to combine terms. The resources searched included Google Search, Google Scholar, IEEE Xplore, Springer, Taylor and Francis, ScienceDirect, specific journals such as Speech Communication, university repositories for theses, lecture notes, and books.

C. Study Selection
Initially, we extracted a total of 250 papers. All duplicate papers and papers based on the same principles were eliminated. Inclusion and exclusion criteria were then applied. Papers on speaker recognition and emotion recognition were excluded. Papers that were related to speech processing but did not address acoustic modeling issues were also not selected. Papers related to acoustic modeling issues in speech recognition were selected, and papers on acoustic modeling for different acoustic units were also included. Finally, a total of 127 papers were selected for the study.

D. Quality Assessment Criteria
The research papers for the systematic review were finally chosen by applying quality assessment criteria to the papers that remained after the inclusion and exclusion criteria discussed in the study selection section. The quality assessment criteria were based on 21 questions. Table II lists the quality questions used for the evaluation. The following quality assessment rules were applied for the selection of the papers.

Rule1: If the answer meets the full requirement, it is awarded 1.
Rule2: If the question is not answered, it is awarded 0.
Rule3: If the answer is satisfactory, it is awarded 0.5.
Rule4: If the answer is above average, it is awarded 0.75.
Rule5: If the answer is below average, it is awarded 0.25.

For every paper, the marks for all 21 questions were summed. We included all papers that scored 13 or above; the other papers were excluded from the study. Finally, only 73 research papers were included.
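The scoring rules can be expressed as a short Python sketch. The five mark values, the 21 questions, and the inclusion threshold of 13 are taken from the text above; the function and constant names, and the example mark lists in the usage note, are hypothetical and serve only to illustrate the arithmetic.

```python
# Marks permitted by Rule1 to Rule5.
ALLOWED_MARKS = {0.0, 0.25, 0.5, 0.75, 1.0}
NUM_QUESTIONS = 21          # number of quality questions (Table II)
INCLUSION_THRESHOLD = 13.0  # minimum total score for inclusion


def quality_score(marks):
    """Sum the marks of one paper after checking they follow the rules."""
    if len(marks) != NUM_QUESTIONS:
        raise ValueError(f"expected {NUM_QUESTIONS} marks, got {len(marks)}")
    for mark in marks:
        if mark not in ALLOWED_MARKS:
            raise ValueError(f"mark {mark} is not allowed by Rule1 to Rule5")
    return sum(marks)


def is_included(marks):
    """A paper is included when its total score is 13 or above."""
    return quality_score(marks) >= INCLUSION_THRESHOLD
```

For example, a hypothetical paper with ten full answers, six satisfactory answers, and five unanswered questions scores 10 + 3 + 0 = 13 and is included, whereas a paper with seventeen above-average answers and four unanswered questions scores 12.75 and is excluded.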

V. RESULTS AND ANALYSIS
The systematic review process aimed to investigate AM issues in speech recognition. Research questions were framed, and relevant data were extracted to answer them, from RQ1 to RQ10. The outcome of the study covers all the important areas of concern for acoustic modeling. The following sections describe the research outcomes with analysis.

B. RQ2 Aimed to Find different Types of Speech Recognition Systems used in the Study
Acoustic modeling issues were addressed for different types of SR systems in these study papers. Fig. 4 depicts the different kinds of speech recognition systems built. The speech recognition systems have been developed for isolated words, connected words, continuous speech, spontaneous speech, and multilingual speech.
The major areas of concern were speaker-independent and speaker-dependent acoustic modeling, recognition in different noisy conditions, speech recognition for different devices, multilingual SR, recognition with weighted finite-state transducers (WFSTs), comparative analysis of different feature extraction techniques, recognition using subspace Gaussian mixture modeling, recognition using different subword units, and recognition for limited-resource languages.
The "other" category in Fig. 4 indicates systems that either combined methods or did not explicitly state the type. Significant research work was presented for continuous speech and connected words because they have more applications. The research findings also indicate that very little work has been reported on spontaneous speech, conversational speech, and multilingual speech.
Fewer works have been published for these types of systems because of a lack of resources and challenges such as context information, long conversations, and variability in the environment and other conditions. It was observed that DNN-based systems perform better than conventional methods for these types of systems. Further, multilingual SR systems are also being created by applying global phone sets.

C. RQ3 Aimed to Identify different Languages for Creating an SR Framework
Researchers have developed SR systems in different languages. Fig. 5 shows the languages used in speech recognition. The research findings reveal that researchers all over the world have experimented with speech recognition, and different databases were developed for the purpose. Most of the reported work is on the English language. It was revealed that researchers face problems due to a shortage of linguistic resources in SR. Multilingual speech recognition is also being tried using a common phone set and the GlobalPhone database.

D. RQ4 Intended to Find different Language Issues used in AM Modeling by Researchers
The studies reveal different language issues in acoustic modeling. The language-related issues are the selection of linguistic units, the availability of linguistic resources, dialects, accents, contextual information, and speaker-related variability. It is essential to decide which language construct to use in acoustic modeling. Some languages are tonal, while others have many dialects, so the acoustic models need to be generated as per the requirement. There is also a need for linguistic resources, such as pronunciation dictionaries, suitable for speech recognition. Researchers have used N-gram models and grammar-based rules for language modeling in speech recognition. Work has also been reported on multilingual speech recognition by developing global phone sets and using speaker adaptive training.
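To make the N-gram idea concrete, the following Python sketch estimates bigram (N = 2) probabilities from word counts. It is a minimal sketch under stated assumptions: it uses plain maximum-likelihood estimates without smoothing, the sentence markers "<s>" and "</s>" and the function names are introduced for the example, and the toy sentences in the usage note are hypothetical rather than drawn from any reviewed corpus.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and word pairs over tokenized sentences.

    Each sentence is a list of words; "<s>" and "</s>" are assumed
    start and end markers added here.
    """
    context_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        context_counts.update(tokens[:-1])
        pair_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, pair_counts


def bigram_prob(context_counts, pair_counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev), no smoothing.

    Returns 0.0 for an unseen context or an unseen pair; a practical
    language model would smooth these estimates.
    """
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / context_counts[prev]
```

Trained on the three toy sentences "open the door", "close the door", and "open the window", the estimate of P(door | the) is 2/3, which is how a language model can prefer one of two acoustically confusable continuations.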

E. RQ5 Aimed to Identify different Speech Corpus used in the Study
The studies indicate that different speech corpora were used for the realization of acoustic models. Fig. 6 shows the various databases used in the systematic review study. The research outcome reveals that the TIMIT speech corpus was widely used to explore phoneme-based speech recognition, as it is a well-documented and phonetically balanced speech corpus with broad geographical coverage. The multilingual database GlobalPhone was used for multilingual speech recognition. It was developed with high-quality read speech and recorded in twenty languages, with labeled data and a pronunciation dictionary. The investigation also shows that most speech databases are available for European and American languages. The research findings indicate that speech corpora in different languages were created all over the world to realize SR systems for low-resource languages. The studies also show that there is still a need for resources such as speech corpora and language resources for these languages, and that researchers developed their own databases as per their research needs.

F. RQ6 Supposed to Analyze different Software and Tools used to Experiment in the Study
The tools used in the studied papers for developing SR systems are Sphinx, HTK, Julius, and Kaldi. Most of the research papers in the study used the HMM-based toolkit HTK, because it is well documented and HMM-based. HTK supports different feature extraction techniques such as MFCCs, LPCs, and PLPs with their variants. It also supports context-independent and context-dependent modeling. Sphinx supports MFCC and PLP speech features with delta and delta-delta features. Some expertise is needed to understand and work with the Sphinx tool. Kaldi has been used more recently in the development of speech recognition systems. It also supports DNN-based methods. However, it requires knowledge of shell programming and scripting on Unix/Linux-based systems. Speech processing software such as Praat and WaveSurfer was also used. MATLAB was also widely used.

G. RQ7 Inspected different Subword Modeling Techniques are used in the Study
Research works on different subword modeling techniques were reported during the systematic review. Fig. 7 shows the different subword units used in the systematic literature survey. Research findings reveal that the most common acoustic models are based on words, phonemes, syllables, and triphones. Phoneme-based acoustic models have been widely used in large vocabulary continuous speech recognition (LVCSR) systems. The phoneme set is limited for any language, so a phoneme-based system avoids the need for a large number of training instances per unit. Further, because phonemes are few in number, many manipulations and confusion analyses can be applied. Triphone-based systems were also tried to reduce the contextual effects suffered by phoneme-based systems. Context-dependent state-tied triphones, cross-word triphones, and word-internal triphones were used in the experiments. Syllable-based systems were also used instead of triphones in some studies to reduce the effect of context. Syllables divided into initial and final parts, or into onset, nucleus, and coda, were applied for subword modeling. The category "others" in Fig. 7 covers models based on demisyllables, graphemes, interdigits, and characters.
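The difference between word-internal and cross-word triphones can be illustrated with a short sketch. It is only an illustration and is not taken from any of the reviewed papers: the function names are hypothetical, the phoneme symbols are made up for the example, and the `left-phone+right` label format follows the notation commonly used in HTK-style systems.

```python
def word_internal_triphones(words):
    """Expand each word into triphone labels whose context stops at word boundaries.

    `words` is a list of words, each given as a list of phoneme symbols.
    Labels use the 'left-phone+right' notation; a missing context is omitted.
    """
    labels = []
    for phones in words:
        for i, phone in enumerate(phones):
            left = phones[i - 1] + "-" if i > 0 else ""
            right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
            labels.append(left + phone + right)
    return labels


def cross_word_triphones(words):
    """Expand an utterance into triphones whose context spans word boundaries."""
    # Treat the whole utterance as one long phoneme string.
    flat = [phone for phones in words for phone in phones]
    return word_internal_triphones([flat])


# Hypothetical two-word utterance ("cat sat") with illustrative phoneme symbols.
utterance = [["k", "ae", "t"], ["s", "ae", "t"]]
print(word_internal_triphones(utterance))
print(cross_word_triphones(utterance))
```

In the word-internal case the last phoneme of the first word keeps only its left context, whereas in the cross-word case it also receives the first phoneme of the next word as right context. The number of distinct units therefore grows, which is one reason state tying is applied to triphone models.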

H. RQ8 Planned to Investigate the different Feature Extraction Techniques used in Acoustic Modeling
For generating acoustic models in SR, developers applied different feature extraction techniques. The inspection revealed that feature extraction is a heavily researched area in speech recognition. Researchers experimented with different feature extraction and transformation techniques to improve recognition accuracy. Fig. 8 shows the different feature extraction techniques used in the systematic review. The investigations reveal that the commonly used feature extraction techniques are linear prediction coefficients (LPCs), Mel frequency cepstral coefficients (MFCCs), and perceptual linear predictive coefficients (PLPs) with their variants. Research results reveal that MFCCs are the most widely used coefficients. Experiments were conducted using MFCCs with energy and first and second derivatives. Most of the research experiments were performed with twelve MFCCs. Some tests were also conducted using MFCCs with the vocal tract area function and with power normalized cepstral coefficients (PNCC). Other features such as duration, intensity, mean zero-crossing, pitch, amplitude, formants, and short-time energy were also reported. Researchers also applied feature transformation techniques such as LDA and HLDA. Discriminative features were also implemented. It was observed that PLP coefficients provided better results in the case of speaker-independent speech recognition. Research works also reported vector quantization of the extracted features. The advantages of vector quantization are reduced storage and reduced computation; however, the quantization error is a problem. The research findings also reveal that earlier speech recognition systems were based on time-domain processing methods, formant analysis, and linear predictive coefficients. Researchers also reported the advantages of MFCCs as good discrimination, low correlation between components, and ease of manipulation.
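The use of MFCCs with energy and first and second derivatives can be sketched as follows. This is a minimal illustration, assuming the static features (for example twelve MFCCs plus energy per frame) have already been computed by some front end. The function names `delta` and `add_dynamic_features` are hypothetical, and the regression formula with a window of two frames and edge-frame replication is a common choice rather than the setting of any particular reviewed paper.

```python
def delta(frames, window=2):
    """Regression-based time derivative of a feature sequence.

    `frames` is a list of frames, each a list of coefficients.
    d_t = sum_{n=1..window} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..window} n^2)
    Frames beyond the edges are replaced by the first or last frame.
    """
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for k in range(len(frames[t])):
            num = sum(
                n * (frames[min(t + n, last)][k] - frames[max(t - n, 0)][k])
                for n in range(1, window + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamic_features(frames, window=2):
    """Append first (delta) and second (delta-delta) derivatives to each frame."""
    d1 = delta(frames, window)
    d2 = delta(d1, window)
    return [static + a + b for static, a, b in zip(frames, d1, d2)]


# Five dummy frames of 13 static features (12 MFCCs + energy, values made up).
static = [[float(t)] * 13 for t in range(5)]
full = add_dynamic_features(static)
print(len(full[0]))  # 13 static + 13 delta + 13 delta-delta = 39
```

With thirteen static coefficients per frame, the resulting vector has 39 dimensions, which is a widely used configuration in HMM-based systems.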

I. RQ9 Aimed to Find out different Classification Methods used for Systematic Reviews
Different classification methods were applied in speech recognition to develop acoustic models. Fig. 9 indicates the various classification methods utilized in this study. The commonly used classification methods are based on HMMs, the acoustic-phonetic approach, ANNs, dynamic time warping (DTW), deep neural networks (DNNs), discriminative training, support vector machines (SVMs), fuzzy logic, CTC, and deep belief networks (DBNs). Research findings reveal that HMM-based systems were widely used during the past decades; however, in recent years, ANN and DNN based systems are being used. Further, research works were reported using different numbers of states, Gaussian mixture models, and context-independent and context-dependent models for HMMs. Discriminative training methods with objective functions such as maximum mutual information (MMI), minimum phone error (MPE), and minimum classification error (MCE) were also applied by the researchers to improve speech recognition. Artificial neural network approaches such as Kohonen self-organizing maps, the multilayer perceptron, the time-delay neural network, the hidden control neural network, and the combination of hidden Markov models with connectionist probability estimators have been applied. The developers and researchers stated that the main strength of ANNs is their discriminative property, which is an essential property that can be used together with HMMs. The reported advantages of ANNs are the ability to learn from input data, unsupervised learning, parallel computation, system development through learning rather than programming, adaptability to the environment, handling of complex interactions, and ease of use and understanding. Their limitations are that they require a large number of training speech utterances and long training times.
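Of the methods listed above, dynamic time warping is the simplest to illustrate. The sketch below shows one basic variant, assuming a Euclidean local distance and the three standard step directions; it is not the formulation of any specific reviewed paper. The names `dtw_distance` and `recognize` and the one-dimensional toy templates are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # stay on the same frame of b
                cost[i][j - 1],      # stay on the same frame of a
                cost[i - 1][j - 1],  # advance in both sequences
            )
    return cost[n][m]


def recognize(templates, utterance):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(templates[label], utterance))


# Toy isolated-word example with one-dimensional "features" (values made up).
templates = {"yes": [[0.0], [1.0], [2.0]], "no": [[5.0], [5.0], [4.0]]}
print(recognize(templates, [[0.0], [1.0], [1.0], [2.0]]))
```

The utterance in the example is a time-stretched copy of the "yes" template, so the warping path absorbs the repeated frame and the accumulated cost to that template is zero.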

J. RQ10 was Prepared to Find out different Performance Metrics used in the Study for SR Systems
Different quality assessment criteria used by the researchers are recognition accuracy, word correctness, word accuracy, phone error rate, frame error rate, and word error rate. Most of the researchers used word accuracy and word error rate.
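For reference, word error rate, word correctness, and word accuracy are commonly computed from a minimum edit distance alignment between the reference and the recognized word sequences. The sketch below uses the usual definitions with N reference words, S substitutions, D deletions, and I insertions; individual papers in the review may use variants. The function names are hypothetical and the example sentences are invented.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum edit alignment."""
    # table[i][j] = (edits, subs, dels, ins) for ref[:i] against hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n)          # match, no cost
            else:
                diag = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = table[i - 1][j]
            dele = (e + 1, s, d + 1, n)      # deletion
            e, s, d, n = table[i][j - 1]
            ins = (e + 1, s, d, n + 1)       # insertion
            table[i][j] = min(diag, dele, ins, key=lambda c: c[0])
    return table[-1][-1][1:]


def score(ref, hyp):
    """Compute WER, word correctness, and word accuracy as fractions."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(ref, hyp)
    n = len(ref)
    return {
        "wer": (subs + dels + ins) / n,
        "correctness": (n - subs - dels) / n,
        "accuracy": (n - subs - dels - ins) / n,
    }


reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
print(score(reference, hypothesis))
```

Under these definitions word accuracy equals one minus the word error rate, while word correctness ignores insertions and is therefore never lower than word accuracy.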

VI. DISCUSSION
Answers to the research questions were prepared after extracting information from the papers finally selected for the systematic review process. Several findings emerged from the study. It was observed that most of the research papers were provided by the IEEE, Springer, and ScienceDirect libraries. Conferences such as ICASSP, Eurospeech Conference on Information and Communication Technology, and INTERSPEECH are conducted explicitly for research in speech and audio processing. These conferences supplied a variety of research papers addressing different problems in speech processing. It was also revealed that researchers are developing various types of speech recognition systems, such as isolated word, connected word, continuous speech, spontaneous speech, and multilingual systems, as per the requirement, and are addressing different modeling issues. Continuous speech recognition systems were widely used due to their wide range of practical uses.
Further, it was also observed that various acoustic modeling issues are addressed for speaker-independent and speaker-dependent acoustic modeling, different noisy conditions, speech recognition for different devices, and multilingual speech recognition. It was also found that speech recognition is a very active research field, and the research community all over the world is trying to develop speech recognition systems in different languages. However, researchers face difficulties in this field due to the unavailability of resources such as speech corpora and other linguistic resources for low-resource languages. Most of the research work was reported for English and European languages. Research outcomes also reveal that some languages such as English have systematic and well-defined speech corpora such as TIMIT and phonetic dictionaries such as BEEP; therefore, researchers find it convenient to experiment with these standard speech corpora and dictionaries. Most of the researchers are developing their own resources for conducting the research work. It takes great effort on the part of these researchers to apply different techniques to overcome the various constraints in this area.
Research outcomes also show that different acoustic units such as word, phoneme, syllable, character, and grapheme are being used by researchers to address issues related to context, data scarcity, and language modeling. Phoneme, word, triphone, and syllable based systems were generally used. The studies also reveal that phoneme-based systems are widely used. Researchers are developing pronunciation dictionaries and applying language modeling techniques in speech recognition. N-gram language modeling and weighted finite-state transducers are also being used in speech recognition. Different tools and software are also being developed for acoustic modeling in speech recognition. Some of the widely used tools are HTK, Sphinx, and Kaldi. Praat and WaveSurfer were widely used for speech analysis. MATLAB was also commonly used in the research.
Feature extraction is a further area of research that was explored extensively. A large number of papers reported applying different feature extraction techniques to improve speech recognition. MFCCs and their variants are widely used feature extraction techniques. Further, various language issues are also being incorporated into speech recognition. Researchers also used knowledge resources in creating speech parameters.
Different classification techniques were applied to realize the different acoustic models. The commonly used classification methods are based on HMMs, the acoustic-phonetic approach, ANNs, dynamic time warping (DTW), deep neural networks (DNNs), discriminative training, support vector machines (SVMs), fuzzy logic, CTC, and deep belief networks (DBNs). Research findings reveal that HMM-based systems were widely used during the past decades; however, in recent years, ANN and DNN based systems are being used. Different quality assessment criteria for measuring the performance of speech recognition are recognition accuracy, word correctness, word accuracy, phone error rate, frame error rate, and word error rate. Most of the developers used word accuracy and word error rate.

VII. CONCLUSION
The research questions aimed to investigate the issues regarding acoustic modeling by exploring the research papers used, speech recognition systems developed, languages used, language issues included, speech corpora used, acoustic modeling units applied, feature extraction techniques used, classification methods utilized, and performance metrics applied. Different quality assessment criteria were applied for the final selection of the papers. A total of seventy-three research papers were selected by applying the quality assessment criteria, as mentioned in the research methodology section. The papers included were published between 1984 and 2020, so that both old and new research in the field of speech recognition was covered to understand the flow of speech recognition research in acoustic modeling. The research work started with the importance of acoustic modeling and its challenges. After that, the fundamental concepts of speech recognition were described to aid understanding of the acoustic modeling issues. The work presented here touched on different aspects of acoustic modeling.
Research findings show that the IEEE, Springer, and ScienceDirect libraries provided most of the research papers. Conferences such as ICASSP, Eurospeech Conference on Information and Communication Technology, and INTERSPEECH aimed to address research in speech and audio processing. The investigation indicates that acoustic units such as word, phoneme, syllable, character, and grapheme were used to address context, data scarcity, and language modeling. The outcome also revealed that MFCCs, continuous speech recognition, and N-gram language models were mostly used. Different classification methods have been applied. HMM-based systems were widely used for decades, but nowadays deep learning based systems are being explored. Other findings indicate that developers mostly used word accuracy and word error rate for the performance measurement of SR systems.
The presented research work provided deep insight into different acoustic modeling issues by performing a systematic review. The outcome of the research shed light on the research flow in acoustic modeling issues and also included new research areas. The advantage of the systematic review was that research findings were revealed from the early, middle, and recent years of research in this field.
The research work may be extended by a more detailed analysis of acoustic modeling for recent techniques, such as those based on deep learning methods, and by conducting research to improve acoustic modeling at the level of acoustic units.