Instant Diacritics Restoration System for Sindhi Accent Prediction using N-Gram and Memory-Based Learning Approaches

Abstract--The Sindhi script is highly complex, in part because it contains an abundance of homographic words. A homograph can carry several meanings, and the intended one cannot be determined unless diacritics fix its pronunciation. Diacritics therefore help readers comprehend the text. In everyday writing, however, people rarely take the time to add them. Their absence not only makes text harder for humans to read but also makes it ambiguous for machines, which face the same semantic and syntactic ambiguities during computational processing of the language. Instant diacritics restoration is an approach that emerged from text prediction systems. This type of diacritics restoration is unprecedented in natural language processing, particularly for Indo-Aryan languages. This work proposes a framework based on N-Grams and Memory-Based Learning. Its main strength is an accuracy of 99.03% on a corpus of the Sindhi language in our experiments. A further advantage of instant diacritics restoration is that it can speed up other natural language and speech processing applications. The approach has clear prospects for further development because Sindhi orthography is highly similar to that of Arabic, Urdu, Persian and other languages written in this type of script.
Keywords--Sindhi Language; Instant Diacritics Restoration; Text Prediction; N-Grams; Memory-Based Learning


I. INTRODUCTION
Sindhi orthography abounds in words that have different meanings but identical written forms. In linguistics, such words are called homographs. The ambiguity they create is resolved by assigning diacritic marks to them. Sindhi orthography has two types of diacritic signs that indicate the correct pronunciation of words [1]: superscript signs placed over the letters and subscript signs placed beneath them. Everyday Sindhi text, such as newspapers, magazines and books, is written without diacritics. This absence poses critical challenges for computational processing of the language [2]. More specifically, when diacritics are absent, one homographic word can be read as another, and words may be interpreted and pronounced erroneously. Without disambiguation, it is difficult to determine the intended meaning and pronunciation of words in linguistic and speech processing applications.
The automatic assignment of diacritics to Sindhi script is essential for natural language and speech applications [3] [4]. Consequently, the literature contains many studies on diacritic restoration, particularly using statistical approaches [5] [2]. This study is motivated by two observations: first, the results of previous work are not at an acceptable level; second, instant diacritics restoration is considered here for the first time for Sindhi. The objective is to develop an automatic system that converts un-diacritized words into diacritized ones by assigning the diacritic signs instantly during typing. To meet this objective, an investigative study combining N-Grams and Letter-Level approaches is carried out.
The rest of the paper is organized as follows. Section II presents research contributions on diacritics restoration for Arabic script-based languages. Section III gives an overview of corpus preparation. Section IV describes and depicts the proposed model for instant diacritics restoration. Section V explains the execution process of the developed software application. Section VI covers the implementation of the proposed model and a detailed evaluation of the results. Finally, Section VII concludes the paper with the core results.

II. RELATED WORK
The literature shows that diacritics restoration is performed at both the letter level and the word level, using techniques such as N-Grams [6] [7], Neural Networks [8], Maximum Entropy [9], Memory-Based Learning [10] [11], and Weighted Finite State [12]. The majority of researchers have obtained encouraging results at the word level using N-Gram language models [6] [7] [2], whereas the Memory-Based Learning approach [13] also yields good results at the letter level for the same task on Arabic script-based languages, including Sindhi [14].
Automatic Sindhi diacritics restoration has mainly been addressed with statistical approaches such as maximum entropy [1], N-grams [5] and memory-based learning [14]. Acceptable results have been achieved with memory-based learning and N-gram language modeling. Hence, the proposed instant diacritics restoration mechanism is also based on N-Grams and Memory-Based Learning. With this mechanism, high accuracy is attained in less time.

III. CORPUS PREPARATION
Two types of data sets are required for experimenting with diacritics restoration systems [1]. Therefore, two corpora were designed and developed: the first contains fully diacritized text and the second undiacritized text. In addition, a lexicon was built. The experiments with the proposed method used both types of data sets, the corpora and the lexicon.
A corpus of 265,257 words was built in Sindhi for training and testing the system. Details of the developed corpus are given in Table I. The corpus is divided into three segments: antique books written entirely with diacritics, such as Shah Jo Risalo [15]; poetry books with partially diacritized text; and recently published text of different genres that is entirely without diacritics, such as newspapers, magazines and textbooks.

A. Developed Lexicon
In addition to the Sindhi corpus, a lexicon of Sindhi text was created, as it is an essential component of the proposed method of instant diacritization. The instant diacritics restoration mechanism is based on memory-based learning, supported by letter-level learning. Accordingly, a table containing each letter in its diacritized and un-diacritized forms was developed. A specimen of this table is given in "Fig. 1". Each letter is assigned a unique number for identification, which is required for processing the letters in the system.

IV. PROPOSED MODEL
Nine components work together as the constituents of the proposed mechanism: calculation of word probabilities, specimens of letters, pattern matching, the comparative function for homographic structures, the K-NN classifier and class labels, calculation of the distance between instances using the overlap metric, calculation of feature weights, the nested hash, and tokenization. "Fig. 2" shows the proposed model and the execution process of the complete system.
The probabilities depend on the corpus; hence, the design of the training corpus is a delicate matter. A more specific training corpus yields more accurate probabilities, which makes the task easier to accomplish. N-grams are probabilistic models that assign probabilities to words. Unigram, bigram, trigram and higher-order models are used for calculating these probabilities. A unigram is an N-gram with N=1, a bigram with N=2, a trigram with N=3, and so on [16]. A text is a sequence of words and can be represented as:

$$W = w_1, w_2, \ldots, w_n$$

For a bigram grammar, each word is conditioned on the one preceding word:

$$P(w_1, w_2, \ldots, w_n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$

The trigram is the same as the bigram except that each word is conditioned on the two previous words:

$$P(w_1, w_2, \ldots, w_n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-2}, w_{k-1})$$
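As an illustration of how such N-gram probabilities can be estimated, the following minimal sketch computes maximum-likelihood estimates from raw counts. The function names (`ngram_counts`, `ngram_probability`) and the toy token sequence are assumptions made for this example; they are not taken from the system described here, and no smoothing is applied.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_probability(tokens, history, word):
    """Maximum-likelihood estimate of P(word | history) from raw counts.

    An empty history gives the unigram estimate count(word) / len(tokens).
    """
    history = tuple(history)
    n = len(history) + 1
    numerator = ngram_counts(tokens, n)[history + (word,)]
    if history:
        denominator = ngram_counts(tokens, n - 1)[history]
    else:
        denominator = len(tokens)
    return numerator / denominator if denominator else 0.0


# Toy corpus of placeholder tokens (hypothetical, not Sindhi data).
toy = "a b a c a b".split()
unigram = ngram_probability(toy, (), "a")          # count(a) / 6
bigram = ngram_probability(toy, ("a",), "b")       # count(a b) / count(a)
trigram = ngram_probability(toy, ("a", "b"), "a")  # count(a b a) / count(a b)
```

A production system would normally add smoothing or back-off so that unseen word sequences do not receive zero probability.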
The final output of the system is to offer the user a choice of suitable or correct words as required. Language modeling is therefore used to compute N-Grams up to quadgrams. The probability of every word in the corpus is calculated individually and stored in a dedicated table in the designed lexicon, where it supports the later stages of the mechanism.
After the word probabilities are calculated, the system computes the available instances of each diacritized letter. Almost all possible instances of every letter in the corpora are collected with every diacritic mark (for example, بُ, بِ, بَ), together with the N surrounding letters on both the left and right sides. The instances are saved in a multidimensional array in ascending order. At least 1224688 instances are taken from the available corpus, with special notations for white spaces (SP), commas (CO) and dots (DO) [11] [13]. A vector-based multidimensional array is used to store these examples. A corpus sample from [1] is used for illustration, and the related sample of feature vectors extracted from it is presented in Table II.
The absence of diacritical marks leads to many ambiguities regarding the vowel sounds in a word [11]. Take the word سکن as an example. The system compares the pattern of the un-diacritized word with the diacritized words available in the corpus and retrieves two candidates, سُکَنِ and سِکَنِ. Pattern matching is carried out using regular expressions. The system then matches the pattern of the un-diacritized input word against the diacritized ones, and the candidate with the highest probability is placed at the same location. A sample regular expression is given in graphical form in the accompanying figure.
The complete group of examples is extracted from the corpus for each complex letter structure. Each letter in the set is taken in turn, together with its surrounding neighbors on both sides, and compared with the instances available in the corpus. A KNN classifier performs this comparison. The value of each feature vector is calculated and stored in the built-in metric. Every feature value is weighted and labeled as a matched or mismatched structure, and the instances are divided according to the assigned labels. The instance-based learning algorithm compares new problem examples with instances already stored in memory. The k-nearest neighbor (K-NN) algorithm is the simplest instance-based learning method; it classifies an object based on the nearest training examples in the feature space. The core model follows [17].
Each input instance is compared individually with all of its closest neighbors using the KNN classifier, and the system accepts the most frequent label. A multidimensional array stores the training examples with their feature vectors, and a label specifies the class of each example. The labeled entity is classified by the majority vote of its neighbors.
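The letter-level instances described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual storage format: the padding symbol `_`, the function names, and the choice to treat every Unicode combining mark as a diacritic are all hypothetical. Each instance pairs a window of N letters on either side of a focus letter with that letter's diacritic as the class label, and white spaces, commas and dots are mapped to SP, CO and DO.

```python
import unicodedata

# Notation for non-letter symbols, following the SP / CO / DO convention.
SYMBOLS = {" ": "SP", ",": "CO", "\u060C": "CO", ".": "DO"}
PAD = "_"  # hypothetical padding symbol for positions beyond the text


def split_letters(text):
    """Separate base characters from the combining marks attached to them."""
    letters, labels = [], []
    for ch in text:
        if unicodedata.combining(ch) and letters:
            labels[-1] += ch          # diacritic belongs to the previous letter
        else:
            letters.append(SYMBOLS.get(ch, ch))
            labels.append("")
    return letters, labels


def window_instances(text, n):
    """Return (features, label) pairs with n neighbours on each side."""
    letters, labels = split_letters(text)
    padded = [PAD] * n + letters + [PAD] * n
    instances = []
    for i, label in enumerate(labels):
        if letters[i] in SYMBOLS.values():
            continue                  # symbols serve as context only
        features = tuple(padded[i:i + 2 * n + 1])
        instances.append((features, label))
    return instances


# The diacritized word qaf+zabar, seen+zabar, meem+pesh, written with escapes.
word = "\u0642\u064E\u0633\u064E\u0645\u064F"
examples = window_instances(word, 1)
```

With N=5, each focus letter would be accompanied by ten neighbouring letters, which corresponds to the largest window evaluated in this work.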
During classification, a test instance is presented to the system, and the distance ∆(X, Y) measures the similarity between the new example and every example in memory. The overlap metric is used for this task; the distance between two instances represented by n features is the sum of the distances per feature [13] [14]:

$$\Delta(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad \delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}$$

For the symbolic letter features used here, the metric simply counts the matching and mismatching feature values in the two patterns; feature weights are then used to add domain-knowledge bias to the distance.
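A minimal sketch of nearest-neighbour classification with the overlap metric is given below. It assumes symbolic feature tuples of equal length and optional per-feature weights; the function names and the toy instances are hypothetical, and ties are resolved simply by the order in which instances are stored.

```python
from collections import Counter


def overlap_distance(x, y, weights=None):
    """Sum the weights of the positions where two feature tuples differ."""
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def knn_classify(memory, query, k=1, weights=None):
    """Return the majority label among the k stored instances nearest to query.

    memory is a list of (features, label) pairs kept in memory.
    """
    ranked = sorted(memory, key=lambda inst: overlap_distance(inst[0], query, weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy instances with placeholder symbols (not corpus data).
memory = [
    (("a", "b", "c"), "X"),
    (("a", "b", "d"), "X"),
    (("e", "b", "f"), "Y"),
]
predicted = knn_classify(memory, ("a", "b", "z"), k=3)
```

Passing a weight list, for example one derived from information gain, makes a mismatch on an informative feature cost more than a mismatch on an uninformative one.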
To weight the features, statistical information is computed to determine which features are better predictors of the class labels. Information Gain (IG) examines each feature in isolation and measures how much information it contributes to knowledge of the correct class label.
Immediately after this process, a hash table stores the data in an associative manner. The table keeps the data in an array, and each data value receives a unique index within it, so a value can be accessed quickly once its index is known. Hashing is a widely known technique for converting a range of key values into a range of array indexes.
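To make the weighting step concrete, the sketch below computes information gain in its standard form: the entropy of the class labels minus the expected entropy after splitting on one feature. The function names and the toy instances are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(instances, feature_index):
    """Entropy of the labels minus the expected entropy after splitting
    the instances on the value of one feature."""
    labels = [label for _, label in instances]
    by_value = defaultdict(list)
    for features, label in instances:
        by_value[features[feature_index]].append(label)
    remainder = sum(
        len(subset) / len(labels) * entropy(subset) for subset in by_value.values()
    )
    return entropy(labels) - remainder


# Toy data: feature 0 determines the label, feature 1 carries no information.
toy_instances = [
    (("a", "x"), "P"),
    (("a", "y"), "P"),
    (("b", "x"), "Q"),
    (("b", "y"), "Q"),
]
weights = [information_gain(toy_instances, i) for i in range(2)]
```

In this toy case the first feature receives the full weight and the second receives none, so only mismatches on the first feature would contribute to a weighted distance.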
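One way to picture the nested hash is as a table keyed first by the un-diacritized word and then by each diacritized candidate, as in the sketch below. The layout, the function names and the probability values are assumptions made for illustration; the probabilities in particular are made-up placeholders, not corpus statistics.

```python
from collections import defaultdict


def build_nested_hash(entries):
    """Map un-diacritized form -> {diacritized form -> probability}."""
    table = defaultdict(dict)
    for bare, diacritized, probability in entries:
        table[bare][diacritized] = probability
    return table


def best_candidate(table, bare):
    """Return the stored candidate with the highest probability, or None."""
    candidates = table.get(bare)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


BARE = "\u0642\u0633\u0645"                    # qaf, seen, meem without diacritics
OATH = "\u0642\u064E\u0633\u064E\u0645\u064F"  # candidate with zabar, zabar, pesh
KIND = "\u0642\u0650\u0633\u0650\u0645\u064F"  # candidate with zair, zair, pesh

# Placeholder probabilities, chosen only to make the example run.
table = build_nested_hash([(BARE, OATH, 0.6), (BARE, KIND, 0.4)])
choice = best_candidate(table, BARE)
```

Both lookups are dictionary accesses, so retrieving the candidates for a typed word does not require scanning the whole lexicon.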
Tokenization of Sindhi script is also a challenging task because of the complexities of the text, particularly those of homographic structures. A compound word needs to be treated as a single token, but the embedded space required within it creates ambiguity for the tokenization process. The embedded space is required because of the cursive nature of Sindhi script and its connecting and non-connecting letters. These complications therefore demand particular attention during tokenization. Mahar's [1] tokenization model is adopted in this research.
Sindhi script abounds in homographic words, so ambiguity often arises when the text is undiacritized. For example, the simple root word قسم can be read in at least two ways: قَسَمُ (an oath, noun) and قِسِمُ (kind, noun). Without diacritics the two words are exactly identical and thus create ambiguity for NLP applications. The Viterbi algorithm is an efficient approach for finding the most likely path of transitions in such cases. It produces the most likely word on the basis of the highest probability value calculated using N-grams [16].
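The following sketch shows how a Viterbi search could select among competing diacritized candidates using bigram probabilities. It is an illustration under stated assumptions rather than the implementation used in this work: the function name, the `floor` probability for unseen pairs, and the placeholder tokens and probabilities are all hypothetical, and every position is assumed to have at least one candidate.

```python
import math


def viterbi(candidates, unigram, bigram, floor=1e-6):
    """Return the most probable sequence of candidate forms.

    candidates: one non-empty list of candidate forms per word position.
    unigram:    {form: probability}, used for the first position.
    bigram:     {(previous_form, form): probability}.
    floor:      assumed probability for events missing from the tables.
    """
    # best[form] = (log probability of the best path ending in form, that path)
    best = {c: (math.log(unigram.get(c, floor)), [c]) for c in candidates[0]}
    for options in candidates[1:]:
        step = {}
        for current in options:
            score, path = max(
                (prev_score + math.log(bigram.get((prev, current), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[current] = (score, path + [current])
        best = step
    return max(best.values())[1]


# Placeholder forms: x1 and x2 stand for two readings of one homograph.
candidates = [["w"], ["x1", "x2"], ["y"]]
unigram = {"w": 1.0}
bigram = {
    ("w", "x1"): 0.2, ("w", "x2"): 0.8,
    ("x1", "y"): 0.9, ("x2", "y"): 0.1,
}
best_path = viterbi(candidates, unigram, bigram)
```

In this toy example a greedy choice after the first word would pick `x2`, but the full path through `x1` has the higher overall probability (0.18 against 0.08), which is the situation the Viterbi search is designed to handle.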

V. EXECUTION PROCESS OF APPLICATION
Text prediction is the basic idea that gave rise to Instant Diacritics Restoration. Text prediction was proposed to save time and effort by suggesting the likely upcoming letters after the user types the beginning of a word. With each additional letter, the user receives suggestions in a popup and can accept one with a single click rather than typing the remaining letters. For example, suppose the user wants to type the word انسان. After typing the first letter, the user is shown a popup with the most probable and frequently used words beginning with ا. After the next letter, ن, the popup again shows the most probable and frequently used words that begin with these two letters. If the intended word appears in the popup, a single click enters it, instead of five keystrokes for the five letters of the word. This function of text prediction gave birth to the idea of instant diacritics restoration.
The predictive approach of instant diacritization lets the user type words with their exact pronunciation, which in turn helps readers read them correctly. The editor works alongside the user and assigns the diacritics automatically and immediately; the user only has to type the words. For example, suppose the user wants to type the word اِنۡسَانُ. When the user types the first letter, ا, the editor initially assigns it the superscript diacritic sign, because the system does this for every first letter. When the user types the next letter, ن, the system immediately calculates the probability of the possible diacritics for this pair of letters and assigns one to ن; at the same time, the diacritic on ا changes accordingly.
Next, the user types س; the system again calculates the probability of the possible diacritics for this combination of letters and assigns diacritics to all three according to the best match found in the corpus. As the user goes on to type ا and then ن, the system keeps updating the letters and their diacritics, calculating the probabilities of the letters and diacritic signs from the corpus. Once the user has finished typing اِنۡسَانُ, the system finalizes its diacritics by the same procedure. The same process takes place for every letter typed in the editor.
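The keystroke-by-keystroke behaviour can be sketched with a simplified stand-in for the full model: instead of the N-gram and K-NN machinery, the example below matches the typed prefix against a small lexicon of diacritized words using a regular expression and returns the diacritized prefix of the most probable match. The function names, the character class used for diacritics, and the lexicon probability are assumptions for illustration.

```python
import re

# Assumed set of diacritics: U+064B..U+0652 plus U+06E1, each optional after a letter.
DIACRITICS = "[\u064B-\u0652\u06E1]*"


def prefix_pattern(typed):
    """Regular expression matching the typed letters with optional diacritics."""
    return re.compile("".join(re.escape(ch) + DIACRITICS for ch in typed))


def instant_diacritize(typed, lexicon):
    """Return the diacritized form of the typed prefix taken from the most
    probable lexicon word that starts with it, or the input if none matches."""
    pattern = prefix_pattern(typed)
    best_form, best_probability = typed, 0.0
    for word, probability in lexicon.items():
        match = pattern.match(word)
        if match and probability > best_probability:
            best_form, best_probability = match.group(0), probability
    return best_form


# The diacritized word for "insan", written with escapes; 0.7 is a placeholder.
INSAN = "\u0627\u0650\u0646\u06E1\u0633\u064E\u0627\u0646\u064F"
lexicon = {INSAN: 0.7}

# Simulate typing the five letters one at a time.
typed_letters = "\u0627\u0646\u0633\u0627\u0646"
steps = [instant_diacritize(typed_letters[:i], lexicon) for i in range(1, 6)]
```

Each entry of `steps` is what the editor would display after the corresponding keystroke; the last entry is the fully diacritized word.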

VI. IMPLEMENTATION AND RESULTS
The design of the training and testing sets is the foundation of the final results, so both received careful attention throughout. Evaluation measures such as Word Error Rate, Diacritic Error Rate, Precision, Recall and F-measure have been used in previous work. We use Precision because it has been observed to perform better with the letter-level approach [1]. Moreover, the complex letters provide the target features for training; hence, the task is performed at the most basic level, that of letters. The three most frequently used diacritics in Sindhi, i.e., Zabar, Zair and Pesho, are considered in the experiments.
The Letter Level Learning method processes every letter taken from the corpus and creates a vector of ten letters. Each vector is stored in an array, and each letter is thus preprocessed with its calculated probability. On receiving the testing data set, the system compares every undiacritized letter in it with the preprocessed data in the arrays and then replaces the letter with its diacritized form.
From the total sets of instances taken from the developed corpus, 159330 instances from each set were tested experimentally. The testing examples are approximately 15% of the whole set of examples. Table III, Table IV and Table V show the results attained with N=1, 3 and 5, respectively. The tables list the ambiguous letters extracted from the developed corpus and the precision obtained by applying instance-based learning at the letter level. Three window sizes were tested to find the best one: two, six and ten letters (i.e., N=1, 3, 5), where N is the number of letters on each side of the letter being processed. The accuracy is 92.52% with N=1, 95.12% with N=3 and 99.03% with N=5. The highest accuracy was thus observed with a window of the ten nearest letters (N=5). The cumulative precision for each window size is shown in "Fig. 3". The figures in the tables differ considerably, which shows that the window size is decisive for the results. N=5 therefore proves to be the most suitable and reliable window.
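For reference, letter-level precision can be computed as in the sketch below. It assumes the common definition (the fraction of proposed diacritics that are correct) and represents "no diacritic proposed" by an empty string; both choices, along with the function name and toy labels, are assumptions for illustration and do not reproduce the reported figures.

```python
def letter_precision(predicted, gold):
    """Fraction of proposed diacritics that equal the reference diacritic.

    predicted and gold are equal-length lists of diacritic labels, one per
    letter; an empty string in predicted means no diacritic was proposed.
    """
    proposed = [(p, g) for p, g in zip(predicted, gold) if p]
    if not proposed:
        return 0.0
    return sum(p == g for p, g in proposed) / len(proposed)


# Toy labels (placeholders): three diacritics proposed, two of them correct.
predicted = ["a", "b", "", "c"]
gold = ["a", "x", "y", "c"]
score = letter_precision(predicted, gold)
```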

VII. CONCLUSION
Automatic instant diacritic restoration is an essential component of many NLP applications. In this study, restoration is attempted by combining two approaches: N-gram-based and Letter Level Learning-based. Each method has its own characteristics and limitations. The proposed mechanism was evaluated on our own corpus of the Sindhi language. After testing different sizes, the window N=5 was found to be the best, achieving a Precision of 99.03%. With slight modifications, the proposed method can also be applied to instant diacritics restoration for Arabic, Urdu and Persian.

TABLE III. AMBIGUOUS SET OF LETTERS, EXAMPLES AND ACHIEVED PRECISION WITH N=1