Semi-Automatic Segmentation System for Syllables Extraction from Continuous Arabic Audio Signal

This paper describes a speaker-independent segmentation system for breaking uttered Arabic sentences into their constituent syllables. The goal is to construct a database of acoustical Arabic syllables as a step towards a syllable-based Arabic speech verification/recognition system. The proposed technique segments the utterances by extracting maxima from the delta function of the first MFC coefficient. The method locates syllable boundaries by applying template matching against reference utterances. The system was applied to a data set of 276 utterances to segment them into their 2544 constituent syllables. A segmentation success rate of about 91.5% was reached.
Keywords: Arabic speech syllables; automatic segmentation; boundaries detection; delta-MFCC features


INTRODUCTION
Speech and natural language processing (SNLP) is a vital topic in recent research. Computer Aided Language Learning (CALL) systems have received considerable attention in recent years. CALL systems are used to improve learning and to evaluate the pronunciation quality of speakers. Arabic is spoken in 60 countries around the world, making it the second most spoken language in terms of the number of speakers [1]. The Quran is the basic reference of the Arabic language. One of the most important issues in the Arab world is learning Quran recitation [2].
A robust language learning system should have a vocabulary database in order to recognize uttered speech, localize and identify pronunciation mistakes, and provide meaningful feedback that helps users improve their performance. Building an acoustical Arabic database of syllables, phonemes, etc., according to the requirements of the target application is the basis for this research. In this paper, a new method for the automatic segmentation of an Arabic audio signal into its syllables is introduced. Classical Arabic is the old form of the Arabic language used in the Quran. Modern Standard Arabic (MSA) is a formal language commonly used in all Arabic-speaking countries.
Our system seeks to locate syllable boundaries accurately in continuous speech, as a step towards building an Arabic database that contributes to the development of many speech applications.
This paper is organized as follows. Section 2 presents speech segmentation. Section 3 defines Arabic syllable segmentation. Section 4 introduces the implementation of the proposed system. Section 5 discusses the experimental segmentation results. Section 6 presents the conclusion and future work.

A. Selecting a Template
Due to the importance of the subject, intensive studies have been conducted on speech segmentation employing different features. A theoretical framework for MSA speech segmentation using dynamic level building was introduced by El Arif et al. [3]. Gody presented a speech segmentation approach based on the wavelet transform and spectral analysis, with an accuracy of 95% [4]. Tolba used the wavelet transform, achieving 81% accuracy [5]. Yacine Yekache et al. reported a step toward developing a Quranic reader using the sphinx4 framework [6]. Wang et al. [7] and Fu et al. [8] introduced zero crossing rate (ZCR), pitch and energy profile as features for the segmentation of speech. In [9], a survey on Punjabi speech segmentation into syllables is presented using the negative derivative of Fourier transformations. In [10], a syllable-based recognition system based on a pseudo-articulatory method is presented, which contributes a more plausible style of speech recognition and a new modeling of speech behavior. In [11], a group-delay-based approach is proposed in which the short-term energy is processed to determine segment boundaries. An attempt is made by Sarada et al. [12] to automate the syllable transcription task for Indian languages. The method does not require any manual segmentation, and a new feature extraction strategy is explored using multiple frame sizes and rates for both training and testing datasets. A technique based on short-term energy was implemented in [13] for the automatic segmentation of Punjabi speech signals into syllables. In [14], biologically inspired auditory attention cues are proposed for segmenting syllables from continuous speech. The method achieved 92.1% accuracy of syllable boundary detection at frame level when tested on TIMIT. In [15], a time-frequency representation and the fusion of intensity and voicing measures were introduced for the segmentation of speech into syllables. A practical method for blind segmentation of continuous speech is presented by Villing et al. [16], using amplitude onset velocity and coarse spectral makeup to identify syllable boundaries. Mijanur Rahman et al. [17] developed a system that automatically segments words from continuously spoken Bangla sentences. Our prior works [18], [19] presented an algorithm for automatically segmenting a subset of emphatic and non-emphatic sounds from continuous spoken Arabic, achieving a segmentation accuracy of up to 90%.

A. Arabic syllables
Speech units can be phonemes, letters, syllables, words, etc. The segmentation problem may be viewed as an unlabeled splitting problem in which the input sequence needs to be split into subsequences. This study focuses on the isolation of syllables; a syllable consists of a nuclear vowel plus neighboring consonants. The vowel must be preceded by a consonant and followed by zero, one or two consonants. Thus the Arabic language has five standard types of syllables: {CV, CV%, CVC, CV%C, CVCC}, where "C" is a consonant, "V" is a vowel and "V%" is a long vowel.
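The syllable rule above can be illustrated with a minimal sketch. It assumes the utterance has already been reduced to a sequence of the symbols "C", "V" and "V%"; the function name `syllabify` and this token representation are illustrative assumptions, not part of the proposed system. Because every syllable must begin with exactly one consonant, the consonant immediately before each vowel is taken as the start of a new syllable.

```python
# The five standard Arabic syllable types named in the text.
SYLLABLE_TYPES = {"CV", "CV%", "CVC", "CV%C", "CVCC"}

def syllabify(tokens):
    """Split a list of 'C' / 'V' / 'V%' symbols into syllable patterns.

    Hypothetical helper: each vowel must be preceded by one onset consonant,
    so a syllable runs from that consonant up to the next onset consonant.
    """
    vowels = [i for i, t in enumerate(tokens) if t in ("V", "V%")]
    onsets = [i - 1 for i in vowels]
    if not vowels or onsets[0] != 0:
        raise ValueError("utterance must start with a consonant-vowel pair")
    syllables = []
    for k, start in enumerate(onsets):
        end = onsets[k + 1] if k + 1 < len(onsets) else len(tokens)
        pattern = "".join(tokens[start:end])
        if pattern not in SYLLABLE_TYPES:
            raise ValueError("not a standard syllable type: " + pattern)
        syllables.append(pattern)
    return syllables
```

For example, the skeleton C V C C V C V splits into CVC, CV, CV, since the second consonant of the cluster has to serve as the onset of the following syllable.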

B. System block diagram
The block diagram of the proposed system is shown in Fig. 1.

1) Data collection, preprocessing, feature extraction and forming reference template
2) Test data entry and preprocessing.
3) Features extraction.
4) Automatic allocation of syllables boundaries through matching process.
5) Evaluation of the resulting isolated syllables.
Our system is based on the Holy Quran. The recordings of twelve speakers, each of whom recited 23 continuous sentences (verses), constitute a dataset of 276 utterances. The texts of the collected data, with their IPA mapping according to [20], are reported in table 4, which includes the 2544 syllables to be detected and segmented. The recordings of the reader "Mahmoud Khaleel El-Hosary" were selected to form the template dataset, which is used as the reference throughout the matching process; this choice is based on his well-known mastery of the rules of Quran recitation.
The wide variability of speech may affect the accuracy of its analysis, so a good setup of the pre-processing phase improves the performance of speech segmentation. The audio signal is divided into fixed-length frames with overlap to ensure continuity [21]. The MIR© software toolbox [22] is used to apply the pre-processing steps, which are: trim silence at the beginning and end of the audio signal, normalize the recorded data, form fixed-length frames of 30 ms with 60% overlap, and smooth the frame boundaries using a Hamming window.
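As a rough illustration of the normalization, framing and windowing steps, the following sketch uses only the Python standard library rather than the MIR toolbox that the system actually relies on. The function names, the peak-normalization choice and the omission of silence trimming are assumptions made for the example; the 30 ms frame length and 60% overlap are the values stated above.

```python
import math

def normalize(signal):
    """Scale the signal so that its largest absolute sample is 1 (assumed scheme)."""
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal] if peak > 0 else list(signal)

def frame_signal(signal, sample_rate, frame_ms=30.0, overlap=0.6):
    """Cut the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    # With 60% overlap, consecutive frames start 40% of a frame apart.
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames
```

At a sampling rate of 8 kHz, for instance, a 30 ms frame holds 240 samples and the frame start advances by 96 samples.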

C. Extraction of features from template utterance
The tool selected for allocating syllable boundaries is the feature vector of local maxima picked from the delta function of the first Mel Frequency Cepstrum coefficient [23].
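The text does not spell out how the delta function is computed. One common choice for delta cepstral features is the regression formula d_t = sum_{k=1..K} k (c_{t+k} - c_{t-k}) / (2 sum_{k=1..K} k^2), sketched below for a single coefficient trajectory. The window half-width `width=2`, the edge handling by repeating the first and last frames, and the function name are assumptions for illustration only.

```python
def delta(coeffs, width=2):
    """Regression-based delta of a 1-D trajectory, e.g. the first MFCC per frame.

    Assumed formulation: frames beyond either end are replaced by the
    nearest edge frame.
    """
    n = len(coeffs)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        acc = 0.0
        for k in range(1, width + 1):
            ahead = coeffs[min(t + k, n - 1)]
            behind = coeffs[max(t - k, 0)]
            acc += k * (ahead - behind)
        out.append(acc / denom)
    return out
```

A trajectory that rises by one unit per frame yields a delta of 1 away from the edges, and a constant trajectory yields zeros, so peaks in this curve mark frames where the coefficient increases fastest.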

D. Process the new input utterance
This module is responsible for processing the new input utterance that needs to be segmented. Each verse (utterance) has a fixed "ML value" that determines how many local maxima are extracted from (PD-1st-MFC), depending on the syllabic structure of the utterance. This limiting value should be more than twice the number of syllables present in the audio signal. A lower value results in fewer detected boundaries and more occurrences of connected syllables, and vice versa. From the local maxima picked according to the user-specified value, characterization parameters are obtained that constitute the test matrix. The matching process can then be applied between the Template Matrix (output of the second module) and the Test Matrix (output of the third module).
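A minimal sketch of the peak-picking step follows. It assumes that the "ML value" caps the number of retained maxima and that the highest peaks are the ones kept; the text fixes only the number of maxima, so the ranking criterion and the function name `pick_local_maxima` are assumptions.

```python
def pick_local_maxima(curve, ml_value):
    """Return the frame indices of at most `ml_value` local maxima of `curve`.

    Assumed selection rule: keep the highest peaks, then report them in
    time order.
    """
    peaks = [i for i in range(1, len(curve) - 1)
             if curve[i - 1] < curve[i] >= curve[i + 1]]
    strongest = sorted(peaks, key=lambda i: curve[i], reverse=True)[:ml_value]
    return sorted(strongest)
```

Raising `ml_value` keeps weaker peaks as boundary candidates, which is consistent with the observation above that a low value leaves some boundaries undetected.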

E. Automatic identification of syllables boundaries through matching process
Identification of the syllable boundaries of the user utterance is carried out by comparing its characterization parameters with the stored ones of the reference utterance, using the matching technique shown in figure 6. Several methods can be used to formulate the rules of the matching process, based on distance measures such as Euclidean distance [24], Mahalanobis distance and Saito divergence [25]. The Euclidean distance is used for measuring closeness throughout the matching in this module. The final allocation of boundaries is obtained by passing the output of the matching process through two stages of decomposition, as discussed in the next two subsections.

a) Primary Allocation of Syllables:
Table 1 shows the matching result as distance measures between user and template parameters. The closest local maxima are identified according to the minimum distance, as displayed in the last row of table 1. These maxima represent the locations of the target boundaries. The first case in table 1 is the matching result between two identical utterances "Qul huwa Ɂallahu Ɂahad, قل هو الله أحد" from the same speaker (HSARY); the closest distances are therefore zero, which confirms that the algorithm behaves as expected.
Each utterance from the test dataset is processed through the matching module with a template of the reference speaker. As reported in the last row of table 2, the boundaries of the targeted syllables are represented by the local maxima with indexes (1st, 2nd, 5th, 7th, 10th and 12th); among the thirteen maxima picked from (PD-1st-MFC) of the reader (AUOOB), these are the closest to the six candidate maxima from the template utterance of the reader (HSARY).
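The primary allocation can be sketched as a nearest-neighbour search under the Euclidean distance. The sketch assumes that every maximum is described by a numeric vector of characterization parameters; the two-component vectors used in the usage note and the function name `match_boundaries` are hypothetical, since the exact parameters are not listed in the text.

```python
import math

def match_boundaries(template, test):
    """For each template candidate, find the closest test maximum.

    `template` and `test` are lists of equal-length numeric tuples.
    Returns one (index_into_test, distance) pair per template candidate.
    """
    matches = []
    for candidate in template:
        dists = [math.dist(candidate, maximum) for maximum in test]
        best = min(range(len(test)), key=lambda j: dists[j])
        matches.append((best, dists[best]))
    return matches
```

Calling `match_boundaries([(0.1, 1.0), (0.5, 2.0)], [(0.1, 1.0), (0.3, 0.2), (0.52, 2.1)])` selects test maxima 0 and 2, and the first distance is exactly zero because the parameters coincide, mirroring the identical-utterance case of table 1.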

b) Connected Syllables Breakup:
In some cases the output has one or more connected syllables, as shown in Fig. 7. This phenomenon has one of the following causes:
a) The chosen "ML value" is too low to yield the required number of local maxima, so the maximum that represents the missing boundary is not taken into account.
b) The frame duration is too long, so that one or more speech articulations fall inside a single frame; since the frame is the smallest unit of the speech signal, its length should be selected carefully to avoid merged or split syllables.
c) The recorded audio contains composite noise at this point, so the feature selected as the segmentation tool is unable to detect the boundary between syllables in this area.
In this situation, detection of the missed boundary is performed in a semi-automatic manner: a number of local maxima equal to the number of missing boundaries is picked.

IV. SEGMENTATION RESULTS
Since the system is speaker independent, utterances from different speakers were tested with an overall accuracy of 91.5 % as shown in table 3.

V. CONCLUSION
The main purpose of this paper is to implement a precise, semi-automatic, speaker-independent system for building a database of syllable banks from continuous uttered Arabic speech. The developed method employs the vector of local maxima picked from the peaks of the delta function of the first Mel Frequency Cepstrum Coefficient as cutting tools that predict possible locations of syllable boundaries inside continuous speech. The final boundaries are allocated by taking into account the number of segments predicted and the closeness between the predicted and the reference segment boundaries. The results show that the system was able to break up a set of 276 Arabic utterances into their syllables with up to 91.5% accuracy.
The resulting syllable database can contribute to applications such as:
a) Diagnosis and treatment of speech pathology.
b) Teaching the recitation rules of the Holy Quran.
c) Training systems for correct Arabic pronunciation for children and non-native speakers.
d) Facilitating man-machine communication and helping its progress.

Fig. 1. Schematic diagram of the proposed system

Fig. 2 shows the procedure for obtaining the MFCC from the audio signal, and Fig. 3 illustrates how local maxima are extracted from the delta function of the 1st MFCC along the speech signal.

Fig. 2. Obtaining the feature vector from speech stream

Fig. 4 clarifies how the template matrix is created through the selection of candidates. Two factors affect the accuracy of obtaining a candidate that identifies a syllable boundary in the template utterance: the frame length and/or the percentage of frame overlap, and the ML value (the number of local maxima held), as illustrated in Fig. 5.

Fig. 4. Template matrix of candidates' parameters. (a) Input signal. (b) Delta function of 1st MFCC with candidate maxima. (c) Characterization parameters of the 2nd candidate

Fig. 6. Segmentation of speech syllables from input signal through template matching. (a) Input signal. (b) Smoothed delta function of 1st MFCC along whole signal. (c) Candidate maxima selection using hand tuning. (d) Peaks configuration of the test input with ML value = 9. (e) Matching process between (c) and (d). (f) Resultant maxima from matching process. (g) Output segmented syllables with Arabic/IPA labels

Fig. 7. The role of the second stage in the fourth module. (a) One missing boundary at the first stage. (b) Two connected syllables occurrence. (c) Missing boundary allocation by the second stage. (d) Output isolated syllables by the second stage. (e) Total resultant syllables

TABLE I. MATCHING RESULT BETWEEN TWO IDENTICAL UTTERANCES FROM THE SPEAKER (HSARY)

TABLE II. MATCHING RESULT BETWEEN TWO UTTERANCES OF DIFFERENT SPEAKERS (HSARY & AUOOB)

TABLE III. ACCURACY RESULTS OF THE TEST SAMPLES

TABLE IV. SPEAKER INDEPENDENCY TEST RESULTS