Towards an Automatic Speech-to-Text Transcription System: Amazigh Language

Abstract: Research and development of automatic speech recognition (ASR) technologies has not yet been thoroughly investigated for several languages. In particular, the unique acoustic features of the Amazigh language, such as its emphatic consonants, pose many obstacles to the development of automatic speech recognition systems. In this study, we examine speech recognition for the Amazigh language. We address the problem by focusing on the transitions between vowel and consonant sounds and on the formant frequencies of phonemes. We present a hybrid strategy for phoneme separation based on energy differences. It includes an analysis of consonant and vowel features and identification methods based on formant analysis.


I. INTRODUCTION
Automatic Speech Recognition (ASR) transcribes human speech captured by a microphone into text that computers can process, in order to enhance human-machine (HM) communication. ASR has long been a topic of active research. For many years, formant frequencies have been regarded as an important factor in recognizing the phonetic content of speech [1]. To reach the stage of recognizing phonemes, we examine a specific case of this problem and focus on the transitions between vowels and consonants in the Amazigh language and on the formant frequencies of phonemes, which are important for determining the phonetic content of speech [2]. We trust that this effort will highlight the importance of consonant-vowel transitions and vocal parameter analysis in speech recognition.
We propose a strategy that includes the separation of phonemes using differences in energy between consonants and vowels, the processing of the vocal characteristics of phonetic units, and a recognition algorithm based on formants. Formant analysis methods associate physical properties of the signal with phonology: they determine the type of speech sound by analyzing its linguistically distinctive features.
The rest of this article is organized as follows: Section II describes the characteristics of the human voice and of the Amazigh language. Section III describes the mathematical and engineering techniques used for modeling, followed by a discussion of the proposed phoneme recognition methods. Section IV discusses the results and Section V concludes the article.

A. Human Ear and Acoustic Sound
Sound is a wave that propagates through a material medium as small variations of pressure. It is perceptible to the human ear at frequencies ranging from 20 Hz to 20 kHz. Nonetheless, phonetic information is considered to lie below 10 kHz [3].
Because of the way sound signals are filtered, and because the ear is not sensitive to phase distortion, we restrict the analysis to this sufficient frequency range. This allows us to focus exclusively on the magnitudes (moduli) of the frequency components.

B. Human Voice
The human voice results from the interplay of breathing and several phonatory organs. In voiced phonemes, sound is first produced by the vibration of the vocal cords [4]. It is then shaped differently depending on the cavities it passes through, primarily the pharynx and the mouth. These cavities act as resonators, amplifying the frequencies that correspond to the resonant frequencies of specific phonemes. These enhanced frequencies are known as "formants" and are the features that phonologists look for in spectrograms to identify the phonemes being pronounced [5].
Fig. 1 shows the formants (F1, F2, and F3) superimposed on the spectrogram of the speech signal "he took holidays", in which voiced and unvoiced sounds alternate. In the voiced segments, a formant structure is present.

C. Phonology and Phonetics
Phonetics is, by definition, the study of phonetic units, the smallest distinctive unit being commonly defined as the phoneme. The opposition between the French words "bain" (bath) and "pain" (bread), for example, shows that [b] and [p] are phonemes. In general, phonemes can be grouped into classes and subclasses, the most significant of which are "vowel" and "consonant". Vowels are relatively long sounds produced with a free airflow, and their frequency characteristics remain largely stable over time: in the absence of strongly pronounced prosodic features, the formants appear horizontal on the spectrogram [6]. The vowel trapezoid shown in Fig. 2 illustrates their location in the plane defined by the first two formants, which indicates the articulatory position of the tongue.
By contrast, consonants are phonemes whose articulation encounters an obstacle (such as the lips, the teeth, or the closure at the palate in [k]). They are considerably shorter than vowels and vary significantly in duration. They may be voiced or unvoiced. Only in the voiced case are formants present (see Fig. 3).

D. Speech Signal Frequency Parameters
The bandwidth of the voice signal is much larger than the telephone bandwidth (4 kHz) and contains all the information necessary to decode the human voice.
The fundamental frequency refers to the rate at which the vocal cords open and close during phonation. Its value depends on the size of the individual's phonatory system [7]. The fundamental frequency of the voice varies between 80 and 600 Hz depending on age and gender.
The spectrogram is a three-dimensional representation in which the X-axis represents time, the Y-axis represents frequency, and the Z-axis represents the level at each frequency (shown as gray levels). It is obtained by applying a fast Fourier transform (FFT) with a sliding window to the voice signal.
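As an illustration of the sliding-window analysis described above, the following minimal sketch computes a magnitude spectrogram of a synthetic tone. It is not the processing chain used in this study: the sampling frequency, frame length, hop size, and test signal are assumptions chosen for the example, and a naive DFT in plain Python stands in for an FFT routine so that the block has no dependencies.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of a real frame (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len, hop):
    """One windowed magnitude spectrum per position of the sliding window."""
    window = hamming(frame_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        rows.append(magnitude_spectrum(frame))
    return rows  # rows[time_index][frequency_bin]


# Illustrative values only (assumptions, not the recording parameters of the study).
fs = 8000          # sampling frequency in Hz
frame_len = 128    # 16 ms analysis window
hop = 64           # 50 % overlap between successive windows
signal = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(512)]

spec = spectrogram(signal, frame_len, hop)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * fs / frame_len
print(len(spec), "frames,", len(spec[0]), "bins, strongest component at", peak_hz, "Hz")
```

Each row of `spec` corresponds to one column of the spectrogram image: time advances with the row index, frequency with the bin index, and the stored magnitude plays the role of the gray level.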

E. Amazigh Language
The Amazigh language, commonly called Tamazight or Berber, is one of humanity's earliest languages. Today it extends from the Red Sea to the Canary Islands and from Niger in the Sahara to the Mediterranean Sea, covering the northern part of Africa. In Morocco, the Amazigh language is classified into three main regional varieties, based on historical, geographical, and sociolinguistic factors: Tarifite in the north, Tamazight in Central Morocco and the south-east, and Tachelhite in the south-west and the High Atlas. Even though half of the Moroccan population are Amazigh speakers, the Amazigh language has been reserved exclusively for informal and family domains: Boukous (1995) [8]. In the last decade, with royal support, the language was institutionalized and introduced into the Moroccan school system.
Since February 2003, Morocco's official Amazigh alphabet system, known as Tifinaghe-IRCAM, has been used in Moroccan school programs and in Amazigh historical studies. The system uses the alphabet described in Table I. The correct spelling of words in Latin letters closely resembles phonetic transcription and correctly conveys their pronunciation, including geminated consonants and vowel sounds.

A. Fourier Analysis
The Fourier transform permits time-frequency processing at a resolution suitable for speech signals, which are quasi-stationary over intervals of 10-100 ms.

1) Fourier transform:
We work in the pre-Hilbert space of square-integrable functions $L^2(\mathbb{R})$ with the orthogonal family of sinusoidal functions $\{t \mapsto e^{2i\pi f t}\}_{f \in \mathbb{R}}$ (physically, we restrict ourselves to real frequencies). Taking the projection onto each sinusoid given by the scalar product, without forgetting the complex conjugate on the second argument, the Fourier transform $H(f)$ (or $FT(g)(f)$) of a function $g(\cdot)$ projects it onto the vector space spanned by the sinusoids:
$$H(f) = FT(g)(f) = \int_{-\infty}^{+\infty} g(t)\, e^{-2i\pi f t}\, dt \qquad (1)$$
where the variables $t$ and $f$ refer to time and frequency, respectively, and $FT(g)$ is the transform of $g(\cdot)$.
This transform is employed in our study to investigate, qualitatively, the contribution of each frequency range to the speech signal by evaluating the spectrogram (Ohm's acoustic law states that the human ear is insensitive to the phase of the acoustic signal) [12]. To avoid edge effects, we extend the sampled signal $x(n)$ into an $N$-periodic signal. The discrete transform is then expressed as:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-2i\pi k n / N}, \qquad k = 0, \dots, N-1 \qquad (2)$$
The observed frequency is related to the index $k$ for $k \in [0, N/2]$. If $f_s$ denotes the sampling frequency, then $f = k \times f_s / N$. It should be emphasized that, because the signal is real, the useful information is restricted to half of the period: $X(N-k) = \overline{X(k)}$. As a result, the moduli are identical and the phase simply has the opposite sign. If we wish to evaluate the signal up to a frequency of 10 kHz, we need a sampling frequency of at least 20 kHz [13]. In practice, we employ the FFT (Fast Fourier Transform), which has a computational cost of $O(N \log_2 N)$ rather than $O(N^2)$ for direct computation, and which also exploits this redundancy to speed up the computation.
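The relation between the DFT index and the observed frequency, and the redundancy of the spectrum of a real signal, can be checked numerically. The sketch below is only an illustration: the test signal and the number of samples are assumptions, and the naive DFT is used for self-containment where an FFT routine would normally be called.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of the sequence x."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def bin_frequency(k, fs, n):
    """Frequency in Hz associated with DFT index k: f = k * fs / N."""
    return k * fs / n


fs = 20000   # sampling frequency in Hz, enough to observe frequencies up to 10 kHz
n = 40       # number of samples in one period (illustrative choice)
x = [
    math.cos(2 * math.pi * 2500 * m / fs) + 0.5 * math.sin(2 * math.pi * 6000 * m / fs)
    for m in range(n)
]
X = dft(x)

# For a real signal, X[N-k] is the complex conjugate of X[k]:
# same modulus, opposite phase, so bins 0..N/2 carry all the information.
symmetric = all(abs(X[n - k] - X[k].conjugate()) < 1e-9 for k in range(1, n))

highest_observable_hz = bin_frequency(n // 2, fs, n)
peak_bin = max(range(n // 2 + 1), key=lambda k: abs(X[k]))
peak_hz = bin_frequency(peak_bin, fs, n)
print(symmetric, highest_observable_hz, peak_hz)
```

With these assumed values, the last useful bin corresponds to half the sampling frequency, which is why a 20 kHz sampling frequency is needed to analyse the signal up to 10 kHz.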

B. Convolution and Transformation
The continuous-time convolution product of two functions $x$ and $y$ is given by:
$$(x * y)(t) = \int_{-\infty}^{+\infty} x(u)\, y(t-u)\, du \qquad (3)$$
The convolution operator is commutative and associative, so the following two relations hold:
$$x * y = y * x \qquad (4)$$
$$(x * y) * z = x * (y * z) \qquad (5)$$
The Fourier transform of the convolution product of two functions is the ordinary product of their Fourier transforms, and the Fourier transform of the ordinary product of two functions is the convolution product of their Fourier transforms:
$$FT(x * y) = FT(x) \times FT(y) \qquad (6)$$
$$FT(x \times y) = FT(x) * FT(y) \qquad (7)$$
This result also holds for discrete cyclic representations. It will be applied in formant analysis as a consequence of the speech signal model adopted.
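For the discrete cyclic case mentioned above, the property that the transform of a convolution equals the product of the transforms can be verified on a small example. This is a minimal sketch with arbitrary test sequences (assumptions chosen only for the check), using a naive DFT and a direct cyclic convolution.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of the sequence x."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def circular_convolution(x, y):
    """Cyclic convolution of two sequences of equal length."""
    n = len(x)
    return [sum(x[m] * y[(k - m) % n] for m in range(n)) for k in range(n)]


# Arbitrary test sequences of the same length.
x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 2.0]
y = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

# Commutativity: x * y and y * x give the same sequence.
xy = circular_convolution(x, y)
yx = circular_convolution(y, x)
commutativity_error = max(abs(a - b) for a, b in zip(xy, yx))

# Convolution theorem: DFT(x * y) equals DFT(x) times DFT(y), bin by bin.
lhs = dft(xy)
rhs = [a * b for a, b in zip(dft(x), dft(y))]
theorem_error = max(abs(a - b) for a, b in zip(lhs, rhs))
print(commutativity_error, theorem_error)
```

Both errors are expected to be at the level of floating-point rounding, which is consistent with the two properties stated in the text.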

C. Windowing Issue
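In practice, in addition to the signal $x$ being discrete, the observation time $2\tau$ is finite. Consequently, we observe the product of the signal and a window function [14]:
$$x_\tau(t) = x(t) \times \Pi(t), \qquad \Pi(t) = \begin{cases} 1 & \text{if } |t| \le \tau \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$
According to (7), the Fourier transform of $x(t) \times \Pi(t)$ is the convolution product of the transforms of the signal and of the window. For the gate function, this transform is:
$$FT(\Pi)(f) = \int_{-\tau}^{+\tau} e^{-2i\pi f t}\, dt = \frac{\sin(2\pi f \tau)}{\pi f} \qquad (9)$$
This cardinal sine has a central lobe of width $1/\tau$, between its first zeros at $f = \pm 1/(2\tau)$. As a consequence, when the observation time approaches zero, the spectrum spreads out. To limit this phenomenon, we use a window whose energy is concentrated around zero frequency. A Hamming window was used in our study [15], [16].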
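The benefit of replacing the gate function by a Hamming window can be illustrated by comparing the highest side lobe of each window's spectrum. The sketch below is an illustration only: the window length and the frequency grid are assumptions, and the spectrum is evaluated by a direct sum rather than an FFT.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_response(window, points=512):
    """|W(f)| sampled at `points` frequencies from 0 up to 0.5 cycles per sample."""
    mags = []
    for p in range(points):
        f = 0.5 * p / points
        acc = sum(w * cmath.exp(-2j * math.pi * f * i) for i, w in enumerate(window))
        mags.append(abs(acc))
    return mags


def peak_sidelobe_db(window):
    """Level of the highest side lobe relative to the main-lobe peak, in dB."""
    mags = magnitude_response(window)
    i = 0
    # Walk down the main lobe until the first local minimum of the response.
    while i + 1 < len(mags) and mags[i + 1] < mags[i]:
        i += 1
    return 20 * math.log10(max(mags[i:]) / mags[0])


n = 64                                  # window length in samples (illustrative)
rect_db = peak_sidelobe_db([1.0] * n)   # gate (rectangular) window
hamming_db = peak_sidelobe_db(hamming(n))
print(round(rect_db, 1), round(hamming_db, 1))
```

The rectangular window leaves side lobes only about 13 dB below the main lobe, whereas the Hamming window pushes them much lower at the cost of a wider main lobe, which is the trade-off behind the choice of window.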

D. Formants and Pitch Examination
1) Decoding problem from acoustic to phonetic:
Because speech signals are continuous, it is difficult to identify linguistic units such as words, syllables, and phonemes in the recorded signal. This problem is known as acoustic-phonetic decoding [17], [28]. We used a process that allows us to identify the transitions between consonants and vowels. The syllables corresponding to this case can then be characterized [18].

2) Vowel-consonant (VC) and Consonant-vowel (CV) Transitions Detection:
Digital filtering is applied first to remove as much background noise as possible [19].
The flowchart in Fig. 4 illustrates the identification of vowel-consonant and consonant-vowel transitions [20].

3) Formants:
The minima and maxima present in the spectrum of the vocal signal correspond to the resonances and anti-resonances of the vocal tract, also called formants and anti-formants.
The formants (F1, F2, F3, and F4) used in this study were extracted for the vowels and consonants found in each word, in order to differentiate the phonemes by their formants [21].
Formant analysis is used to identify consonants and vowels. Before the formants can be recognized, the spectrum must be preprocessed to make them clearer [22].
The chart in Fig. 5 represents this preprocessing. Once the corrected spectrum is obtained, we search for each formant in a carefully selected frequency band to increase the probability of finding the formant with the highest average amplitude. Tables II and III show

4) Pitch:
Pitch is a crucial component of the human voice and is widely recognized as the perceptual counterpart of the fundamental frequency of a sound: it is strongly tied to frequency and can be related to the fundamental frequency of vocal cord vibration, which permits the recognition of audio frequencies. Along with quality and loudness, it is among the most essential auditory attributes of sound [23], [30].
We used the "Get Pitch" command to extract the pitch, with the pitch floor set to 75 Hz and the pitch ceiling set to 500 Hz.
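To show what the pitch floor and ceiling mean in practice, the following sketch estimates the fundamental frequency of a voiced frame with a plain autocorrelation search restricted to the lags allowed by those two bounds. It is a simplified stand-in written for illustration, not the algorithm implemented in PRAAT; the function name, the synthetic frame, and the sampling frequency are assumptions.

```python
import math


def estimate_pitch(frame, fs, floor_hz=75.0, ceiling_hz=500.0):
    """Estimate F0 in Hz from the autocorrelation peak between floor and ceiling.

    Returns None when no positive autocorrelation peak is found in the range.
    """
    min_lag = int(fs / ceiling_hz)   # shortest period allowed by the ceiling
    max_lag = int(fs / floor_hz)     # longest period allowed by the floor
    mean = sum(frame) / len(frame)
    centered = [s - mean for s in frame]
    best_lag, best_value = None, 0.0
    for lag in range(min_lag, min(max_lag, len(centered) - 1) + 1):
        value = sum(centered[i] * centered[i + lag] for i in range(len(centered) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag if best_lag else None


# Synthetic voiced frame: 200 Hz fundamental plus one harmonic, 40 ms at 8 kHz.
fs = 8000
frame = [
    math.sin(2 * math.pi * 200 * n / fs) + 0.5 * math.sin(2 * math.pi * 400 * n / fs)
    for n in range(320)
]
f0 = estimate_pitch(frame, fs)
print(f0)
```

The floor of 75 Hz bounds the longest period that is searched and the ceiling of 500 Hz bounds the shortest one, so candidates outside the expected range of the human voice are never considered.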

E. Measurement Tools and Corpus
1) Tools:
Phoneticians and academics use the open-source program PRAAT [24] to identify various phonetic properties of speech. It is very efficient software for analysing and reconstructing acoustic speech signals [25].
To collect all of the characteristics presented in this study, wav files were recorded and analyzed with PRAAT (see Fig. 6).

2) Measurements:
The acoustic measurement of voice signals is based on the pitch and the four formants, which are widely used as indicators of perceived speech quality [26]. Table IV shows the Amazigh vowels used in this work.

3) Preparation of the corpus:
Ten people (five women and five men) were chosen from a voice database of Amazigh speakers in Morocco, coming from various regions with no distinctive geographical distribution. Subjects were matched by age. The women ranged in age from 23 to 50 years, with an average of 35 years. The men ranged in age from 25 to 50 years, with an average of 36 years. Each speaker repeated each word 10 times. The total number of evaluated words is 10 speakers × 8 words × 10 repetitions, giving us 800 files to examine.
Our objective is to analyse the consonants, semi-consonants, and vowels that are pronounced by studying the important voice parameters [27]. We manually extracted the vowels A, I, and U from the spectrograms of the words Krad, Tanmirt, and Ayur. The consonants R, T, and Z were obtained from the spectrograms of the words Aghrum, Attas, and Tazalit. More information about the database is shown in Table V.

4) Materials:
In this study, we used a microphone and a computer with 8 GB of RAM and an Intel Core i7 processor running at 2.5 GHz. The operating system was Windows 10 LTSB. In a silent room, the microphone was placed between 4 and 10 centimeters from the speaker's lips. We recorded the wav files with the parameters shown in Table V.

IV. RESULTS AND DISCUSSION
Fig. 7 shows our approach to determining the acoustic power of the syllable [ara]. Fig. 8 gives the temporal derivative of the acoustic power shown in Fig. 7.

After a series of experiments [29], it was estimated empirically that the transition occurs near the peak where the signal has lost or gained 66 percent of the extreme difference in intensity between the two phonemes. Crosses on the chart indicate the transitions.
The algorithm for phoneme separation is very effective provided that:
• the initial and final moments of silence have been deleted;
• the speakers did not blow into the microphone during recording, which saturates the audio signal and produces a very large peak that the program interprets as a transition.
This method achieves a 75% success rate, which depends on parameters such as the geographic location, the recording equipment, and the speakers' speech difficulties.
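The energy-based separation discussed in this section can be sketched as follows. This is one possible reading of the 66 percent criterion, written for illustration and not the program used in this study: a vowel-to-consonant (VC) transition is marked when the short-time power has lost 66 percent of the difference between its extreme values, and a consonant-to-vowel (CV) transition when it has regained 66 percent of it. The function names, the frame length, and the synthetic [ara]-like signal are assumptions, and the temporal derivative of Fig. 8 is not used here.

```python
import math


def short_time_power(signal, frame_len, hop):
    """Mean squared amplitude of each analysis frame."""
    return [
        sum(s * s for s in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, hop)
    ]


def detect_transitions(power, fraction=0.66):
    """Return (frame_index, kind) pairs, where kind is 'VC' or 'CV'.

    Assumed rule: the extreme power levels of the segment stand for the vowel
    and the consonant; a transition is marked when the power has lost or
    gained `fraction` of the difference between these two levels.
    """
    low, high = min(power), max(power)
    span = high - low
    falling_level = high - fraction * span   # vowel power has lost 66 % of the span
    rising_level = low + fraction * span     # consonant power has gained 66 % of the span
    state = "vowel" if power[0] >= rising_level else "consonant"
    transitions = []
    for i in range(1, len(power)):
        if state == "vowel" and power[i] <= falling_level:
            transitions.append((i, "VC"))
            state = "consonant"
        elif state == "consonant" and power[i] >= rising_level:
            transitions.append((i, "CV"))
            state = "vowel"
    return transitions


# Synthetic vowel-consonant-vowel pattern: strong, weak, strong (assumed values).
fs = 8000
frame_len = 80   # 10 ms frames without overlap


def amplitude(n):
    return 0.2 if 800 <= n < 1200 else 1.0


signal = [amplitude(n) * math.sin(2 * math.pi * 200 * n / fs) for n in range(2000)]
power = short_time_power(signal, frame_len, frame_len)
transitions = detect_transitions(power)
for index, kind in transitions:
    print(kind, "transition near", index * frame_len / fs, "s")
```

Using two different levels for falling and rising power avoids marking several transitions when the power fluctuates around a single threshold, which is one way to reduce false detections caused by isolated peaks.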

V. CONCLUSION
This study examined the difficulty of automatic speech recognition for the Amazigh language. Specifically, we attempted to extract the vowels and consonants from the Amazigh voice signal. It is now possible to develop an Amazigh corpus to complement those that already exist. This approach opens the door to the socioeconomic growth of the Amazigh community in Morocco.
This approach to voice recognition enabled us to detect, and to exploit through the solutions implemented, a number of phonetic and spectral characteristics of the Amazigh voice signal. A better identification of the problem parameters (accentuation coefficients, cepstrum cutoff quefrency, etc.), together with a more precise analysis of formant transitions and trajectories and of the main aspects of prosody in speech, are the most feasible strategies for producing improved outcomes. In addition, speakers of a language like Amazigh should be proficient in both the language and the use of Information Technology resources.