Greedy Algorithms to Optimize a Sentence Set Near-Uniformly Distributed on Syllable Units and Punctuation Marks

An optimum sentence set that is near-uniformly distributed on syllable units and punctuation marks is important for developing a syllable-based automatic speech recognition (ASR) system. Such a set is usually extracted from a mother set of millions of unique sentences using the Modified Least-to-Most (LTM) Greedy algorithm. The Modified LTM Greedy is capable of minimizing the number of syllables but ignores the distribution of their frequencies. Hence, two schemes are proposed to minimize the number of syllables as well as to distribute their frequencies near-uniformly. Testing on a mother set of 10 million Indonesian sentences shows that both schemes perform better than the Modified LTM Greedy for two syllable units: monosyllables and bisyllables.

Keywords: read-speech corpus; optimum sentence set; syllable; punctuation marks; Modified Least-to-Most Greedy


I. INTRODUCTION
Since the beginning of the 2000s, several researchers have shown that context-dependent syllable-based ASR systems perform better than context-independent phone-based ones, as described in [1], [2], and [3]. Today, the promising state-of-the-art ASR approach, the sequence-to-sequence attention-based model, has also been designed using a syllable-based model [4]. However, syllable-based ASR needs a much larger read-speech corpus for the training process [5]. Therefore, developing such a speech corpus is a challenging issue.
The speech corpus is commonly recorded from a minimum sentence set, near-uniformly distributed on both syllable units and punctuation marks, read by thousands of speakers varying in gender, age, and dialect [6], [7], [8], [9], [10]. Punctuation marks in a sentence affect how it is interpreted, mostly through differing intonation [11], [12]. Speakers may use different intonation to make their intentions clear. The sentence "It's me." is a monotone statement, while "It's me?" gives a higher tone to the syllable 'me?'. Hence, a syllable-based ASR needs a read-speech corpus developed using a minimum sentence set balanced on syllables and punctuation marks [13], [14].
Methods commonly used to extract a minimum sentence set from a mother set are greedy-based algorithms, such as the Least-to-Most (LTM) Greedy algorithm [15]. This algorithm was later slightly improved into the Modified LTM Greedy, which is capable of extracting a minimum sentence set with a fairly short execution time [16]. However, the Modified LTM Greedy concentrates only on minimizing the number of phonetic units and ignores balancing their frequencies.
In this paper, the Modified LTM Greedy is adapted to extract a minimum sentence set from a mother set of around 10 million sentences based on their syllables. Two additional schemes are proposed to make the Modified LTM Greedy capable of extracting a minimum sentence set, near-uniformly balanced on both syllables and punctuation marks, to be used to develop a state-of-the-art syllable-based ASR. Both additional schemes are carefully designed to minimize the number of syllables as well as to balance their frequencies.

II. GREEDY ALGORITHMS
The Modified LTM Greedy algorithm described in [16] performs well in extracting a phonetically rich sentence set. Unfortunately, it focuses only on minimizing the number of phonetic units and ignores balancing their frequencies. Hence, in this paper two additional schemes are proposed to improve the performance of the algorithm in minimizing the number of syllables as well as balancing their frequencies.

A. Modified LTM Greedy Algorithm
The Modified LTM Greedy algorithm produces a sentence set from a mother set by taking the best sentences based on a scoring formula. The pseudocode adapted from [16], with an adjustment to handle syllables instead of phonemes, is as follows:
1) Let A = mother set, U = all to-be-covered syllables, B = empty set;
2) From U take all syllables with the lowest frequency and put them in $U_{sub}$;
3) From A select all sentences containing at least one syllable in $U_{sub}$ and put them in $A_{sub}$;
4) Compute the score of each sentence in $A_{sub}$ using the formula
$S_i = N_i / T_i$ (1)
where $S_i$ is the score for the ith sentence, $N_i$ is the number of to-be-covered syllables in the ith sentence, and $T_i$ is the number of all syllables in the ith sentence;
5) Choose the sentence with the best score, put it in B, and remove all syllables contained in the sentence from both U and $U_{sub}$;
6) Repeat steps 3 to 5 until $U_{sub}$ is empty;
7) Repeat steps 2 to 6 until U is empty.
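The seven steps above can be sketched in Python as follows. This is a minimal illustration and not the implementation of [16]: it assumes the score is $N_i / T_i$ (which matches the worked example, where 9 of 10 syllables gives 0.9), that $N_i$ counts distinct to-be-covered syllables, and that ties are broken by taking the first sentence. The function name `modified_ltm_greedy` and the list-of-syllable-lists input format are hypothetical.

```python
from collections import Counter


def modified_ltm_greedy(sentences):
    """sentences: list of syllable lists. Returns indices of the chosen sentences (set B)."""
    # Step 1: U maps every to-be-covered syllable to its frequency in the mother set.
    uncovered = Counter(s for sent in sentences for s in sent)
    chosen = []

    def score(i):
        # Step 4 (assumed form): distinct to-be-covered syllables over all syllables.
        n = sum(1 for s in set(sentences[i]) if s in uncovered)
        return n / len(sentences[i])

    while uncovered:  # Step 7: repeat until U is empty.
        # Step 2: U_sub holds the syllables with the lowest frequency.
        lowest = min(uncovered.values())
        u_sub = {s for s, f in uncovered.items() if f == lowest}
        while u_sub:  # Step 6: repeat until U_sub is empty.
            # Step 3: A_sub holds the sentences containing a syllable from U_sub.
            a_sub = [i for i, sent in enumerate(sentences) if u_sub & set(sent)]
            # Step 5: take the best-scoring sentence and mark its syllables as covered.
            best = max(a_sub, key=score)
            chosen.append(best)
            for s in set(sentences[best]):
                uncovered.pop(s, None)
                u_sub.discard(s)
    return chosen


toy = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
print(modified_ltm_greedy(toy))
```

On the toy input, the rarest syllable "d" forces the third sentence in first; the second sentence then wins because both of its syllables are still uncovered. The sketch rescans the whole mother set in step 3, which would be far too slow for 10 million sentences without an inverted index from syllables to sentences.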
The pseudocode can be explained using the illustrations in Fig. 1 to 5. In these illustrations, the mother set (A) contains only five sentences, as listed in Table I, to make every step in the pseudocode clear.

TABLE I (rows 3 to 5)
3 Dia menonton di rumah belajar (He is watching in the learning house)
4 Lagi-lagi dia menonton di rumah (Again he is watching at home)
5 Menonton video di rumah (Watching video at home)

In step 1, the Indonesian syllabification model described in [17] is used to generate all syllables contained in each sentence as well as a list of to-be-covered syllables (U), which contains 14 unique syllables with their frequencies. The minimum set B is empty. Next, in step 2, all syllables with the lowest frequency in U are selected and moved into $U_{sub}$. In step 3, all sentences containing at least one syllable in $U_{sub}$ are selected and moved into $A_{sub}$. Then, in step 4, the score of each sentence in $A_{sub}$ is calculated using the formula in Eq. 1. The second sentence, with 9 out of 10 to-be-covered syllables, has a score of 0.9. Meanwhile, the fifth sentence, with 9 out of 9 to-be-covered syllables, has a higher score of 1.0. Finally, in step 5, the fifth sentence, with the best score of 1.0, is chosen and saved into B, and all syllables contained in this sentence are removed from both U and $U_{sub}$. These steps are repeated until both $U_{sub}$ and U are empty. When both stopping criteria are reached, the algorithm produces a minimum set of two sentences, i.e. the fifth and the second sentences, that consists of all 14 unique to-be-covered syllables.
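Steps 1, 2, and 4 of the walkthrough can be tried on the three Table I sentences that are listed above. The sketch below is illustrative only: the syllabification is done by hand rather than with the model in [17], the row numbers 3 to 5 are assumed, and the frequencies cover these three sentences only, so they need not equal those in Fig. 1 to 5. The names `table_rows` and `score` are hypothetical.

```python
from collections import Counter

# Hand-syllabified sentences (an assumption; the paper uses the model in [17]).
table_rows = {
    3: "di a me non ton di ru mah be la jar".split(),
    4: "la gi la gi di a me non ton di ru mah".split(),
    5: "me non ton vi de o di ru mah".split(),
}

# Step 1: U, the to-be-covered syllables with their frequencies.
u = Counter(s for syllables in table_rows.values() for s in syllables)

# Step 2: U_sub, the syllables with the lowest frequency.
lowest = min(u.values())
u_sub = sorted(s for s, f in u.items() if f == lowest)


def score(syllables, uncovered):
    # Assumed form of Eq. 1: distinct to-be-covered syllables over all syllables.
    n = sum(1 for s in set(syllables) if s in uncovered)
    return n / len(syllables)


# Step 4: score every sentence while all syllables are still to be covered.
scores = {row: score(syllables, u) for row, syllables in table_rows.items()}
print(u_sub)
print(scores)
```

With this syllabification the fifth sentence has nine distinct syllables out of nine, which reproduces the score of 1.0 quoted in the text; repeated syllables such as "di" or "la gi" lower the score of the other two sentences.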

B. Semi LTM Greedy 1
In the first proposed scheme, the Modified LTM Greedy is revised by replacing step 5 with the four new steps below:
1) Let K be a real number in the interval (0, 1);
2) From $A_{sub}$ select the top-score sentences, which have scores ≥ (the best score × (1 − K)), and put them in a new set D;
3) From D choose the sentence with the maximum number of to-be-covered syllables, put it in B, and remove all syllables contained in the sentence from both U and $U_{sub}$;
4) Clear D.
This proposed scheme can be explained using the illustration in Fig. 6. In this illustration, let K = 0.05. From the mother set (A), which is sorted by the score calculated using the formula in Eq. 1, select the top-score sentences and put them into a new set D. Next, from D choose the sentence with the maximum number of to-be-covered syllables, i.e. 24, instead of the one with the highest score. This scheme is designed to reduce the risk that the Modified LTM Greedy algorithm settles on a local optimum when looking for the best sentence. It will produce a larger sentence set B.
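The selection rule of this scheme can be sketched as a single function. This is an illustration under stated assumptions rather than the authors' code: each candidate is given as a hypothetical tuple of (sentence id, number of to-be-covered syllables, total number of syllables), the score is assumed to be the ratio of the last two, and ties are broken by taking the first candidate. The candidate values are made up, apart from the count of 24 borrowed from the Fig. 6 example.

```python
def pick_semi_ltm_1(candidates, k=0.05):
    """candidates: list of (sentence_id, n_to_be_covered, n_total) tuples.

    Returns the id of the sentence that the scheme would move into B.
    """
    scored = [(sid, n, n / t) for sid, n, t in candidates]
    best_score = max(s for _, _, s in scored)
    # Step 2: D keeps every sentence whose score is within a factor (1 - K) of the best.
    threshold = best_score * (1 - k)
    d = [(sid, n) for sid, n, s in scored if s >= threshold]
    # Step 3: within D, prefer the sentence covering the most new syllables.
    return max(d, key=lambda item: item[1])[0]


candidates = [("x", 5, 5), ("y", 24, 25), ("z", 30, 40)]
print(pick_semi_ltm_1(candidates, k=0.05))
print(pick_semi_ltm_1(candidates, k=0.01))
```

With K = 0.05 the longer sentence "y" (score 0.96, 24 new syllables) is preferred over the perfectly scored but short sentence "x"; with K = 0.01 the threshold rises to 0.99 and only "x" remains in D, so the choice matches that of the Modified LTM Greedy.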

C. Semi LTM Greedy 2
In the second proposed scheme, the Modified LTM Greedy is updated by replacing step 5 with the four new steps below:
1) Let K be a real number in the interval (0, 1);
2) Select the top-score sentences, which have scores ≥ (the best score × (1 − K)), and put them in a new set D;
3) From D, choose the sentence with the lowest new score, calculated using a formula in which f denotes the frequencies of all have-been-covered syllables in the minimum set B, and remove all syllables contained in the sentence from both U and $U_{sub}$;
4) Clear D.
This scheme is proposed to overcome the weakness of the Modified LTM Greedy algorithm in balancing the frequencies of the syllables. By taking the sentence whose syllables have the lowest frequencies among those already covered in the minimum set B, the duplication of syllables should be reduced.
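A possible reading of this selection rule is sketched below. The exact formula for the new score is not reproduced here, so the sketch assumes it is the sum, over the syllables of a sentence, of their current frequencies f in B; this assumption follows the later remark that the scheme takes the sentence with the smallest total frequencies. The function name, the input format, and the toy values are all hypothetical.

```python
from collections import Counter


def pick_semi_ltm_2(candidates, covered_freq, k=0.05):
    """candidates: dict mapping sentence_id -> (syllable list, Eq. 1 score).
    covered_freq: Counter with the frequency f of each syllable already in B.

    Returns the id of the sentence that the scheme would move into B.
    """
    best_score = max(score for _, score in candidates.values())
    # Step 2: D keeps every sentence whose score is within a factor (1 - K) of the best.
    threshold = best_score * (1 - k)
    d = {sid: syl for sid, (syl, score) in candidates.items() if score >= threshold}

    def new_score(sid):
        # Assumed new score: total frequency in B of the sentence's syllables.
        return sum(covered_freq[s] for s in d[sid])

    # Step 3: within D, prefer the sentence that adds the least duplication.
    return min(d, key=new_score)


covered = Counter({"di": 5, "ru": 3, "mah": 3})
candidates = {
    "p": (["di", "ru", "mah", "vi"], 1.0),
    "q": (["be", "la", "jar", "di"], 0.98),
    "r": (["o"], 0.5),
}
print(pick_semi_ltm_2(candidates, covered, k=0.05))
```

Sentence "q" is chosen because its syllables occur 5 times in B against 11 times for "p", while "r" never enters D because its score is below the threshold. A sum-based score also grows with sentence length, which is consistent with the tendency toward shorter sentences reported in the results.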

III. EXPERIMENTAL SETUP
In this research, a mother set containing 10,000,034 sentences is collected by crawling several newspaper websites. Two dictionaries (phonemic and syllabic) of 80K unique words are developed using the Indonesian grapheme-to-phoneme conversion system described in [18] and the Indonesian syllabification system described in [17], respectively. Converting the mother set of 10 M sentences using both dictionaries produces 121,860,535 monosyllables (6,804 unique monosyllables) and 132,445,220 bisyllables (308,710 unique bisyllables).
Using the mother set, experiments are performed based on two scenarios:
1) Scenario 1: The Modified LTM Greedy. In this scenario, the mother set is extracted using the Modified LTM Greedy for both monosyllables and bisyllables.
2) Scenario 2: The Semi LTM Greedy. In this scenario, the mother set is extracted using the Semi LTM Greedy 1 and the Semi LTM Greedy 2 with K = 0.05, 0.1, 0.2, and 0.33 for both monosyllables and bisyllables. The extracted minimum sentence sets balanced on syllables and punctuation marks are compared to those produced by the Modified LTM Greedy.

IV. RESULT AND DISCUSSION
The two scenarios described in the experimental setup are tested for both monosyllables and bisyllables to compare their performances. The experiments are conducted using a single i5 processor with 4 GB of RAM. The total run time per experiment is 4 hours for the monosyllables and 9 hours for the bisyllables.

A. Monosyllable
Extraction of the mother set of 10 M sentences using the Modified LTM Greedy produces a sentence set of 4,056 sentences that covers 6,804 unique monosyllables, with a total of 31,575 monosyllables. The average syllable frequency is f = 4.64 with a standard deviation σ = 30.91. Next, extraction of the mother set using the Semi LTM Greedy 1 and the Semi LTM Greedy 2 produces the results shown in Table II and Fig. 7. Table II shows that the Semi LTM Greedy 1 is successful in reducing the total number of sentences, but it increases the total number of syllables as K increases. This is probably because the algorithm does not consider the redundancy of syllables when taking the best sentence, which results in a large number of syllables.
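The average frequency and standard deviation quoted for the Modified LTM Greedy result can be computed from an extracted set as sketched below. The paper does not state whether the population or the sample standard deviation is used; the sketch assumes the population form (`statistics.pstdev`). The function name `frequency_stats` and the toy input are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def frequency_stats(selected_sentences):
    """selected_sentences: list of syllable lists from the extracted set B.

    Returns (average frequency per unique syllable, population standard deviation).
    """
    counts = Counter(s for sentence in selected_sentences for s in sentence)
    freqs = list(counts.values())
    return mean(freqs), pstdev(freqs)


# The reported average follows directly from the totals in the text:
# 31,575 syllable occurrences over 6,804 unique monosyllables.
reported_mean = 31_575 / 6_804
print(round(reported_mean, 2))

print(frequency_stats([["a", "b"], ["a", "c", "a"]]))
```

A standard deviation far above the mean, as in the reported figures, indicates that a few syllables occur very often while most occur only a handful of times, which is the imbalance the Semi LTM Greedy 2 targets.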
On the other hand, the Semi LTM Greedy 2 is capable of reducing the standard deviation of the result set as K increases, but the number of sentences also increases with K. The formula used in the algorithm considers the frequencies of the have-been-covered syllables and then takes the sentence with the smallest total frequency. This favors shorter sentences and therefore makes the result set larger. Fig. 7 shows that the Semi LTM Greedy 2 manages to lower the number of occurrences of the more dominant syllables.

For bisyllables, Table III shows that the Semi LTM Greedy 1 manages to reduce the number of sentences and the standard deviation using K = 0.1. The scenarios of the Semi LTM Greedy 2 show that it manages to reduce the standard deviation quite well, with the [...]

A simple Pareto optimization using Fig. 10 shows that: a) Experiment 3 dominates Experiments 1, 2, 4, and 5; and b) Experiments 3, 6, 7, 8, and 9 do not dominate each other. Thus, it can be concluded that Experiments 3, 6, 7, 8, and 9 form the Pareto-optimal set, any member of which can be used as a training set for a syllable-based ASR system. The sentence set from Experiment 6 should be used if the system needs both a relatively low standard deviation and a relatively low number of sentences. Experiment 3 produces the set best suited for a system requiring as few sentences as possible, while Experiment 9 suits a system requiring the lowest standard deviation.
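The dominance test behind such a Pareto comparison can be sketched as follows, assuming two objectives that are both minimized: the number of sentences and the standard deviation. The experiment names and values in the example are invented for illustration and are not the results in Table III or Fig. 10.

```python
def pareto_front(points):
    """points: dict mapping experiment name -> (n_sentences, std_dev), both minimized.

    Returns the sorted names of the experiments that no other experiment dominates.
    """
    def dominates(a, b):
        # a dominates b if it is no worse on every objective and better on at least one.
        no_worse = all(x <= y for x, y in zip(a, b))
        better = any(x < y for x, y in zip(a, b))
        return no_worse and better

    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


# Hypothetical values, not taken from the paper.
experiments = {
    "e1": (100, 9.0),
    "e2": (90, 8.0),
    "e3": (120, 5.0),
    "e4": (130, 6.0),
}
print(pareto_front(experiments))
```

Here "e2" dominates "e1" and "e3" dominates "e4", while "e2" and "e3" trade fewer sentences against a lower standard deviation and therefore both remain on the front. Choosing among the front members then depends on which objective the target ASR system values more.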

V. CONCLUSION
The Semi LTM Greedy 1 algorithm is capable of reducing the number of sentences in the extracted sentence set, while the Semi LTM Greedy 2 manages to reduce the standard deviation significantly. The Semi LTM Greedy 1 removes more sentences as K increases. The Semi LTM Greedy 2 reduces the standard deviation further as K increases. A simple Pareto optimization can be used to select the best sentence set for the designed syllable-based ASR.

Fig. 6. Semi LTM Greedy 1: select the top-score sentences in the mother set (A) and then choose a sentence with the maximum number of to-be-covered syllables

TABLE II. EXTRACTION OF THE MOTHER SET FOR MONOSYLLABLE

TABLE III. EXTRACTION OF THE MOTHER SET FOR BISYLLABLE