Evaluation of Gated Recurrent Unit in Arabic Diacritization

Recurrent neural networks (RNN) are powerful tools that give excellent results in various tasks, including Natural Language Processing tasks. In this paper, we use the Gated Recurrent Unit (GRU), a recurrent neural network implementing a simple gating mechanism, to improve the diacritization of Arabic. GRU is evaluated for diacritization in comparison with the state-of-the-art results obtained with Long Short-Term Memory (LSTM), a powerful RNN architecture giving the best-known results in diacritization. The evaluation covers two performance aspects: error rate and training runtime.
Keywords: Gated recurrent unit; long short-term memory; Arabic diacritization


I. INTRODUCTION
Natural languages require different processing steps to perform Natural Language Processing (NLP) tasks such as text-to-speech synthesis (TTS), speech recognition, sentiment analysis and information retrieval. In the case of Arabic, an additional preprocessing step is mandatory: diacritization, or diacritic restoration. Diacritics are signs placed below or above a letter to indicate a different phonetic value.
Arabic is a Semitic language with two varieties: Classical and Modern. Classical Arabic is the pure language spoken by Arabs; Modern Standard Arabic (MSA) is an evolving variety that constantly gains new terms to keep up with modern innovations and changes. Generally, Arabic (Classical or MSA) is written without diacritics, leading to ambiguities at various linguistic levels, as explained in [1].
According to [2], in over 77% of cases, a non-vocalized word has several possible diacritizations and consequently several possible meanings. Table 1 gives an example of this aspect and lists some of the possible diacritization forms of the string "صدق" and the inferred meanings.
Arabic diacritization has received a lot of interest and has gone through different models: rule-based models, statistical models and hybrid models.
Rule-based models rely on existing linguistic rules formulated, in most cases, by human experts. They have shown acceptable efficiency in diacritization, given the lack of linguistic resources. Their major drawback is the laborious, costly and time-consuming task of formulating and maintaining rules that cover all the rich linguistic aspects of Arabic. Moreover, rule-based models require strong linguistic knowledge.
Statistical models attempt to learn a diacritization model from diacritized texts by predicting the probability distribution of a sequence of words or characters. The authors of [3] review these methods, which use hidden Markov models, n-grams or finite state transducers.
The weakness of statistical models is their reliance on a large corpus of fully diacritized text. Their strength is that they need no linguistic knowledge or tools such as POS taggers or morphological analyzers.
Hybrid methods start from the premise that combining the strength of linguistic knowledge with statistical methods yields better results.
Recurrent neural network (RNN) language models have been used to solve the diacritization problem as a statistical or hybrid method. The results are impressive and the error is reported to be asymptotic, with a DER of 5.08% over all characters using long short-term memory (LSTM), a powerful RNN architecture that has proven its performance in various tasks. Moreover, RNNs have been used successfully without relying on any linguistic tools, solely on a diacritized corpus. However, the superior performance of RNNs comes at the cost of expensive model training, which reaches days or weeks in some experimental settings and requires significant computation capability.
As a solution to these issues, new RNN architectures with simpler internal structures have been introduced. GRU, introduced in [4], is one of them; in some tasks, it seems to train faster while maintaining an accuracy comparable to LSTM. To our knowledge, diacritization has never been addressed with GRU.
www.ijacsa.thesai.org
In this paper, we present our evaluation of GRU for diacritization with regard to training runtime and error rate. We use the performance of LSTM in diacritization as a baseline. We show that we maintain state-of-the-art results while improving the training runtime.

A. RNN Performance
The first motivation of this study is to enhance the performance of Arabic diacritization. The evaluation of GRU on Arabic diacritization is conducted in comparison with LSTM. In the literature, many studies have made this comparison on various tasks. For example, in [5], the authors used a probabilistic approach to determine which RNN architecture is optimal. They evaluated a thousand different RNN architectures and identified some that outperform LSTM and GRU in some tasks but not all. In [6], the authors compared LSTM and its variants over large-scale tasks such as speech recognition and handwriting recognition. They conclude that the variants of LSTM do not significantly improve performance.
In [7], an empirical comparison between LSTM and GRU is performed for music and speech modeling. The study did not establish the superiority of one over the other, and therefore considers GRU to be the better choice since it uses fewer parameters.
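The parameter argument can be made concrete with a small count. The sketch below assumes the common textbook formulations (one input matrix, one recurrent matrix and one bias vector per gate or candidate, no peephole connections); the function names and the layer sizes are hypothetical and are not taken from [7].

```python
def gated_rnn_params(input_size, hidden_size, n_blocks):
    """Count the parameters of a gated layer made of `n_blocks` affine blocks.

    Each block holds an input matrix (hidden x input), a recurrent matrix
    (hidden x hidden) and a bias vector (hidden).
    """
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return n_blocks * per_block


def lstm_params(input_size, hidden_size):
    # Input, forget and output gates plus the candidate cell state.
    return gated_rnn_params(input_size, hidden_size, 4)


def gru_params(input_size, hidden_size):
    # Update and reset gates plus the candidate hidden state.
    return gated_rnn_params(input_size, hidden_size, 3)


# Hypothetical sizes: 64-dimensional input, 128 hidden units.
print(lstm_params(64, 128), gru_params(64, 128))  # 98816 74112
```

Under these assumptions, a GRU layer always has three quarters of the parameters of an LSTM layer of the same size.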
We find the studies in the literature inconclusive and insufficient to generalize over all tasks, given the considerable differences in experimental settings and in the characteristics of the addressed problems.

B. Arabic Diacritization
The Arabic diacritization problem is mainly addressed as a classification problem over seven classes corresponding to the possible diacritics.
Diacritization is divided into two sub-types: morphological diacritization, which gives satisfying results with an error of 3% to 4%, and syntactic diacritization, which still needs improvement, with an error rate of 9.9%.
Earlier approaches to diacritization are rule-based models, as in [8], where a morphological analyzer is used for semi-automatic diacritization. The work in [9] presents "Alserag", another rule-based system working through three modules: morphological analysis, syntactic analysis and a morpho-phonological module. The system scores 8.68% in diacritization error rate (DER) and 18.63% in word error rate (WER). The main drawbacks of rule-based models are their high development cost and the fact that creating linguistic resources such as corpora is a laborious task that needs to be repeated for each studied language.
More recent studies benefit from the evolution of machine learning to learn a diacritization model from vocalized text. The study in [3] presents an overview of these models, which use maximum entropy, hidden Markov models (HMM) and weighted finite state machines.
In [10], a maximum entropy model is trained for sequence classification to restore the diacritic of each character. The authors used the Arabic Treebank corpus, containing 600 documents from newspapers with over 340k words. The system achieves a DER of 5.5% and a WER of 18%.
More recently, machine-learning models tend to rely on fewer and fewer external resources, as in the study in [11], where the model relies solely on diacritized text. To our knowledge, among systems that depend on no external resources, this study gives the best results reported so far, with a DER of 8.14% for the case-ending character and 5.08% for all characters.

III. RECURRENT NEURAL NETWORKS
RNNs are a computational approach based on a large number of connected neural units forming a directed graph.
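In a basic RNN, let $x = (x_1, \ldots, x_T)$ be the input sequence, $h = (h_1, \ldots, h_T)$ the hidden sequence and $y = (y_1, \ldots, y_T)$ the output sequence. The basic RNN computes $h$ and $y$ by iterating equations (1) and (2) from $t = 1$ to $T$:

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \qquad (1)$$

$$y_t = W_{hy} h_t + b_y \qquad (2)$$

where $f$ is the activation function of the hidden layer, the $W$ terms are the weight matrices and the $b$ terms the bias vectors ($b_h$ being the hidden bias vector).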
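As an illustration, the recurrence can be sketched in NumPy. This is a minimal sketch that assumes the usual form of equations (1) and (2), namely $h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ and $y_t = W_{hy} h_t + b_y$, with $h_0$ set to zero; the function name, the sizes and the random weights are hypothetical.

```python
import numpy as np


def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y, f=np.tanh):
    """Run a basic RNN over x_seq (shape T x input_size).

    Returns the hidden sequence (T x hidden_size) and the
    output sequence (T x output_size).
    """
    h = np.zeros(W_hh.shape[0])  # h_0, assumed to be the zero vector
    hidden, outputs = [], []
    for x_t in x_seq:
        h = f(W_xh @ x_t + W_hh @ h + b_h)  # equation (1)
        y = W_hy @ h + b_y                  # equation (2)
        hidden.append(h)
        outputs.append(y)
    return np.stack(hidden), np.stack(outputs)


# Hypothetical sizes: 5 time steps, 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
h_seq, y_seq = rnn_forward(
    x_seq,
    W_xh=rng.normal(size=(4, 3)),
    W_hh=rng.normal(size=(4, 4)),
    W_hy=rng.normal(size=(2, 4)),
    b_h=np.zeros(4),
    b_y=np.zeros(2),
)
print(h_seq.shape, y_seq.shape)  # (5, 4) (5, 2)
```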
RNNs are suitable for capturing dependencies in sequential data. Their weakness is long-term dependencies, as studied in [12], whose authors showed how difficult such dependencies are to capture because of the "vanishing" or "exploding" of stochastic gradients.
Gated recurrent neural networks (GNN) have been proposed to resolve this problem; LSTM and GRU belong to the category of GNN.

A. Long Short-Term Memory (LSTM)
Introduced in [13], LSTM is a special case of RNN that resolves the long-term dependency issue encountered in standard RNNs by using a gating mechanism.
LSTM is able to remember patterns selectively, which makes it suitable for a number of sequence learning problems such as language modelling, translation, speech recognition and Arabic diacritization.
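The gating mechanism can be sketched as a single LSTM time step. The formulation below is the common one without peephole connections, which is an assumption and not necessarily the exact variant used in [13] or [11]; the dictionary keys 'f', 'i', 'o' and 'c' (forget gate, input gate, output gate, candidate) and the function names are illustrative.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U and b are dicts keyed by 'f', 'i', 'o', 'c'."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate memory
    c_t = f * c_prev + i * c_tilde  # keep part of the old memory, add new content
    h_t = o * np.tanh(c_t)          # expose a gated view of the memory
    return h_t, c_t
```

The forget gate `f` decides how much of the previous cell memory survives, which is what lets the cell remember patterns selectively over long spans.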
The study in [11] uses LSTM to build a language-independent diacritizer trained solely on vocalized text, without relying on any external tools. The authors ran several RNN architectures and reached a DER of 4.85% with a 3-layer bidirectional LSTM.
In the work in [14], the problem is approached with a bidirectional LSTM, considering diacritization as a sequence transcription problem. The system does not require any prior processing (lexical, morphological or syntactic). The WER scored in this study is 5.82%.

B. Gated Recurrent Unit (GRU)
First proposed in [4], GRU is often incorrectly considered a special case of LSTM because its global architecture is quite similar. In fact, GRU is quite different from LSTM. It defines two gating signals instead of the three in LSTM: an update gate and a reset gate. GRU has no cell state; unlike LSTM, it exposes its memory content at each time step. The transition between the previous memory content and the new memory content is made by leaky integration controlled by the update gate.
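A single GRU time step can be sketched as follows. This is a minimal sketch under a common formulation, not necessarily the exact one of [4]: in particular, whether the update gate weights the new candidate or the previous state varies between references, and here it is assumed to weight the candidate. The dictionary keys 'z', 'r' and 'h' and the function names are illustrative.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, W, U, b, g=np.tanh):
    """One GRU step; W, U and b are dicts keyed by 'z', 'r', 'h'."""
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])  # update gate
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])  # reset gate
    # Candidate memory: the reset gate controls how much past state is used.
    h_tilde = g(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])
    # Leaky integration between the previous and the candidate memory content.
    return (1.0 - z) * h_prev + z * h_tilde
```

There is no separate cell state: the returned vector is both the memory and the output of the unit, which is the sense in which GRU exposes its memory content at each time step.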
GRU has shown its efficiency in many studies, such as [15] and [16]. It achieves promising results in classification tasks and reduces the training runtime, since fewer iterations are needed to update the hidden states and the internal structure of a cell is simpler.
However, GRU still comes second to LSTM in terms of performance. Therefore, GRU is mainly used in situations where fast training is needed with limited computation capability.

A. Experimental Setup
Our goal is to set up a GRU-based environment that maintains state-of-the-art results and enhances computation efficiency. For this purpose, we use as a reference the approach of [11], which to our knowledge gives the best results under the same conditions, relying solely on diacritized text.
We compare the error rate and runtime of GRU and LSTM over the same datasets and experimental setup. For this purpose, we use the setup described in Fig. 1. The GNN in the figure stands for the network used, namely GRU or LSTM.
We use a single recurrent layer for our networks, on the one hand because of limited computation capacity, and on the other hand to avoid potential issues related to multilayer deep learning architectures.
Training the network requires exactly two inputs, stored in two separate documents: the first contains the diacritized text and the second the equivalent undiacritized text.
Throughout the process, we use the Buckwalter Arabic Transliteration, an ASCII scheme that represents Arabic orthography strictly one-to-one. We tried different environments for the experiments, and the results may differ considerably from one environment to another; in this paper, we present the results of experiments with Python, NumPy and Theano [17]. We use Theano to parallelize computation on the GPU and obtain the best possible results with the limited computation capacity.
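The preparation of the two training inputs can be sketched as follows. This is an illustration only: the table is a small subset of the Buckwalter scheme (a few letters plus the eight diacritic marks), and the function names are hypothetical rather than those of the actual pipeline.

```python
# Partial Buckwalter table: an illustrative subset, not the full scheme.
LETTERS = {
    "\u0627": "A",  # alef
    "\u0628": "b",  # beh
    "\u062F": "d",  # dal
    "\u0631": "r",  # reh
    "\u0635": "S",  # sad
    "\u0642": "q",  # qaf
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
}
DIACRITICS = {
    "\u064B": "F",  # fathatan
    "\u064C": "N",  # dammatan
    "\u064D": "K",  # kasratan
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u0651": "~",  # shadda
    "\u0652": "o",  # sukun
}
TABLE = {**LETTERS, **DIACRITICS}


def to_buckwalter(text):
    """Transliterate character by character; unmapped characters pass through."""
    return "".join(TABLE.get(ch, ch) for ch in text)


def strip_diacritics(bw_text):
    """Remove the diacritic symbols from a Buckwalter string."""
    marks = set(DIACRITICS.values())
    return "".join(ch for ch in bw_text if ch not in marks)


# The string "صدق" with a fatha on each letter.
diacritized = to_buckwalter("\u0635\u064E\u062F\u064E\u0642\u064E")
undiacritized = strip_diacritics(diacritized)
print(diacritized, undiacritized)  # Sadaqa Sdq
```

Because the scheme is one-to-one, the undiacritized document can be derived from the diacritized one by dropping the diacritic symbols, as `strip_diacritics` does here.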
We first go through word embedding, i.e. mapping the character sequence into letter vectors. For this purpose, we use word2vec [18], implemented with Theano and the Lasagne framework.
For the GRU cell, $x_t$ is the input vector, $h_t$ the output vector, $z_t$ the update gate vector and $r_t$ the reset gate vector; $W$, $U$ and $b$ are the parameter matrices and vector, and $g$ is the chosen activation function. Here we use tanh, as it has been shown to converge faster in practice.

B. Dataset
The dataset used in this work is a part of the "Tashkeela" corpus introduced in [19]. The corpus contains 75 million fully vocalized words extracted from 97 books, mainly religious, and from online media from different sources, in classical and modern Arabic. We follow the approach described in [10] and split our part of the corpus into three parts, Train/Dev/Test, as described in Table 2.

V. RESULTS AND DISCUSSION

A. Evaluation Metrics
To evaluate the accuracy of GRU, we adopt the metrics used in most diacritization studies, for instance [14], [11] and [3]. The first is the diacritic error rate (DER), which compares the diacritization of the predicted word with that of the original word at the character level. The second is the word error rate (WER), which compares the predicted word with the original at the word level: if an error is detected on a single character, the whole word is considered incorrectly diacritized.
To evaluate the performance of GRU, we consider the average epoch runtime.
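The two accuracy metrics can be sketched on Buckwalter-transliterated text. This is a minimal sketch, not the evaluation script used in the experiments: it assumes that the reference and the prediction contain the same words and letters and differ only in their diacritics, and that every letter counts as one character in the DER denominator. The function names are hypothetical.

```python
BW_DIACRITICS = set("aiuFNK~o")  # Buckwalter symbols of the diacritic marks


def split_diacritics(word):
    """Return [(letter, diacritics), ...] for a Buckwalter word."""
    pairs = []
    for ch in word:
        if ch in BW_DIACRITICS and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)  # attach the mark to its letter
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference, predicted):
    """Compute (DER, WER) between two whitespace-tokenized strings."""
    ref_words, pred_words = reference.split(), predicted.split()
    if not ref_words or len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction must have the same words")
    char_total = char_errors = word_errors = 0
    for ref, pred in zip(ref_words, pred_words):
        ref_pairs, pred_pairs = split_diacritics(ref), split_diacritics(pred)
        if [l for l, _ in ref_pairs] != [l for l, _ in pred_pairs]:
            raise ValueError("letters differ between reference and prediction")
        errors = sum(rm != pm for (_, rm), (_, pm) in zip(ref_pairs, pred_pairs))
        char_total += len(ref_pairs)
        char_errors += errors
        word_errors += 1 if errors else 0  # one wrong character spoils the word
    return char_errors / char_total, word_errors / len(ref_words)


# One wrong case ending out of 8 letters and 2 words.
print(der_wer("Sadaqa Alrajulu", "Sadaqa Alrajuli"))  # (0.125, 0.5)
```

A DER restricted to the last character could be obtained in the same way by comparing only the final pair of each word.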

B. Diacritic Error
Table 3 lists the error rates of the networks on the dev dataset, measured as diacritic error rate (DER) over all diacritics, DER (All), and over the last character only, DER (Last). Experiments show that LSTM achieves better results than GRU in both the DER over all characters and the DER of the last character; however, the results obtained by GRU are still good.

a) Qualitative Analysis
To identify the errors made by the system implemented with GRU, we proceeded directly by classifying a sample set of the incorrectly diacritized words.
For classification, we use the diacritization error scheme proposed in [20]. Table 4 and Table 5 show the distribution of errors among the sub-categories. Both networks behave in the same way, and the distribution over categories is nearly the same. Most of the errors are made in Form/Spelling; this is to be expected, as works like [21] have reported that diacritization tools perform less well on case-ending diacritics.

D. Discussion
We notice that GRU outperforms LSTM in training: the training time is reduced by 18.82%.
The evaluation showed that GRU gives results comparable to LSTM in diacritization accuracy and speeds up the training process. Consequently, we consider that GRU gives satisfactory results in Arabic diacritization. However, this study remains to be completed by running different experimental settings over different datasets.
For the LSTM cell, $x_t$ is the input vector, $h_{t-1}$ the previous cell output, $C_{t-1}$ the previous cell memory, $h_t$ the current cell output and $C_t$ the current cell memory; $W$ denotes the weight vectors of the gates (forget: f; candidate: c; input gate: i; output gate: o).
For training, test and dev, we use the data described in Section IV-B.
Performances
Fig. 2 compares the average epoch runtime of GRU and LSTM. The results do not include the training of the embedding process.

TABLE I. POSSIBLE DIACRITIZATIONS FOR THE STRING "صدق"

TABLE III. DER FOR GRU AND LSTM OVER DEV DATASET

TABLE IV. GRU ANNOTATED ERROR CATEGORIES