On the Training of Deep Neural Networks for Automatic Arabic-Text Diacritization

Automatic Arabic diacritization is one of the most important and challenging problems in Arabic natural language processing (NLP). Recurrent neural networks (RNNs) have recently achieved state-of-the-art results for sequence transcription problems in general, and for Arabic diacritization in particular. In this work, we investigate the effect of varying the size of the training corpus on diacritization accuracy. We produce a cleaned corpus of approximately 550k sequences extracted from the full Tashkeela dataset and use subsets of this corpus in our training experiments. Our base model is a deep bidirectional long short-term memory (BiLSTM) RNN that transcribes undiacritized sequences of Arabic letters into fully diacritized sequences. Our experiments show that error rates improve as the size of the training corpus increases. Our best performing model achieves average diacritic and word error rates of 1.45% and 3.89%, respectively. Compared with state-of-the-art diacritization systems, we reduce the word error rate by 12% over the best published results.

Keywords—Arabic text; automatic diacritization; bidirectional neural network; long short-term memory; natural language processing; recurrent neural networks; sequence transcription


I. INTRODUCTION
The Arabic language is widely spoken and written in many countries around the world. Arabic text mainly exists in two forms: Classical Arabic (CA), found in religious scriptures and old books, and Modern Standard Arabic (MSA), a contemporary form of CA used nowadays to write stories, books, newspapers, and formal speeches. Moreover, people communicate in their everyday lives using dialects that differ from one region to another [1].
Arabic sentences consist of sequences of words, written from right to left, composed of letters and diacritics. Diacritics are generally zero-width characters that appear as marks added above or below the letters. They provide syntactic and semantic distinctions that are essential to pronounce and understand Arabic texts [2]. However, diacritics are optional in most texts, especially MSA texts. This causes problems in understanding the text for non-native speakers and children, since they may not be able to infer diacritics from the context. Moreover, it poses challenges for automatic Arabic language processing applications that require text to be diacritized, such as automatic speech recognition (ASR), text-to-speech (TTS), and machine translation (MT) [1].
The Arabic language consists of 28 letters and eight basic diacritics. A total of 36 variants of the Arabic letters result from adding the six Hamza letters (ء، آ، أ، ؤ، إ، ئ), the Teh Marbuta (ة), and the Alef Maksura (ى) to the basic 28 letters. These variants have the Unicode hexadecimal codes 0621-063A and 0641-064A. The eight basic Arabic diacritics are: three short vowel diacritics (Fatha, Damma, Kasra), three nunation (Tanween) diacritics, the double consonant diacritic (Shadda), and the no-vowel diacritic (Sukun). Arabic diacritics have the Unicode hexadecimal codes 064B-0652. The nunation diacritics are Fathatan, Dammatan, and Kasratan; they can only appear on the last letter of a word. The Shadda diacritic is usually combined with either a short vowel or a nunation diacritic. With these combined forms, we get a total of thirteen possible diacritizations of a letter in the Arabic language. Table I shows the Arabic diacritics along with their transliterated names and lists their shapes and sounds when written on the letter Beh (ب).
Diacritics can be classified into two categories: lexemic diacritics and inflectional diacritics. Lexemic diacritics distinguish between words in Arabic morphology that have the same orthography (spelling) but different pronunciations and meanings [3]. Example 1 in Table II shows how adding diacritics to the word كتب in two different ways results in two different pronunciations and meanings. The diacritized word كَتَبَ, pronounced "kataba", is a verb which means "he wrote". The diacritized word كُتُب, pronounced "kutub", is a plural noun which means "books". Specifying which diacritization form to use for a word depends on the context. Table II shows how diacritizing the word كتب using one of the two mentioned forms depends on the context it appears in. More specifically, it differs based on the third word in the sentence. In the first case, it is diacritized as the verb "kataba" since the third word is the noun "lesson" (الدرس), indicating that this is a verb-subject-object sentence. In the second case, it is diacritized as the noun "kutubu" since the third word is the adjective "useful" (مفيدة), indicating that this is a nominal sentence. In more complex sentences, the diacritization of a word may depend on words even further away in the sentence.
Inflectional diacritics distinguish different inflected forms of the same word. The diacritic of the last letter in the word depends on the position and role of the word in the sentence [3]. Example 1 in Table III shows how placing the noun كُتُب "kutub" in three different positions changes the diacritic of the last letter ب between Fatha, Damma, and Kasra. The last letter diacritic is often referred to as the end-case diacritic. Restoring this diacritic is considered a challenging task even when performed manually since it depends on how the sentence is formed syntactically. Moreover, words (both nouns and verbs) may be inflected by appending suffixes that add features such as voice, number, person, tense, case, and other categorical information [1].
Example 2 in Table III shows how the diacritic of the last letter changes when the verb كَتَبَ is inflected in three different ways: with Fatha for the second person masculine form كَتَبْتَ, with Kasra for the second person feminine form كَتَبْتِ, and with Damma for the first person form كَتَبْتُ. Inflected words make the syntactic position of the word affect the diacritization not only of the last letter, but also of the letters before it. Example 3 in Table III shows the plural noun كُتُبُه, which is inflected by adding the possessive masculine pronoun ـه. The diacritization of the letter ب, which precedes the pronoun ـه, is the one affected by the position of the word in the sentence.
Consequently, recovering the diacritics of undiacritized Arabic text is a challenging yet important task. Many models have been proposed to automate the process of diacritizing Arabic texts. The performance of these models has been measured using two main metrics that represent the accuracy of the model in providing correct diacritics for the input undiacritized text. These metrics are the diacritic error rate (DER) and the word error rate (WER). DER is the percentage of wrong diacritics to the total number of characters in the input sequences. WER is the percentage of words with at least one wrong diacritic to the total number of words in the input sequences. Although the best previous solutions have shown steady improvement in accuracy over time, we believe that the latest accuracies can be improved further using better models and training datasets. In most cases, accuracy is limited by the lack of large, clean training datasets with an acceptable diacritization-to-character rate. In this work, we extend the cleaning process performed in [4] to include the entire Tashkeela dataset. We concentrate on finding the effect of the training dataset size on diacritization accuracy and on reducing the error rates by using larger datasets. Characterizing the effect of the dataset size on model accuracy, and identifying the best training size, would hopefully help interested researchers reach even better accuracies. The cleaning process was performed in steps, producing eight corpora of incremental sizes in terms of the number of sequences. We perform experiments that use these corpora to explore and analyze the effect of increasing the training dataset size on the accuracy of our baseline model.
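To make the two metrics concrete, the following is a minimal Python sketch of DER and WER as defined above (it is illustrative, not the evaluation script used in this work); it assumes the predicted and target diacritics are aligned word by word and letter by letter.

def der_wer(pred_words, target_words):
    # pred_words / target_words: lists of words; each word is a list of
    # diacritic labels, one label per letter, aligned between the two lists.
    total_chars = wrong_chars = 0
    total_words = wrong_words = 0
    for pred, target in zip(pred_words, target_words):
        total_words += 1
        word_wrong = False
        for p, t in zip(pred, target):
            total_chars += 1
            if p != t:
                wrong_chars += 1
                word_wrong = True
        if word_wrong:
            wrong_words += 1
    der = 100.0 * wrong_chars / total_chars   # % of letters with a wrong diacritic
    wer = 100.0 * wrong_words / total_words   # % of words with >= 1 wrong diacritic
    return der, wer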
We build on our previous experience in designing a model that exploits the efficiency of bidirectional long short-term memory (BiLSTM) recurrent neural networks in the automatic diacritization of Arabic texts. These networks are characterized by their ability to utilize long-term past and future contexts to predict diacritics. Our work produces a cleaned dataset of 543,364 sequences with a diacritization-to-character rate of at least 80%. This dataset can be used to train more sophisticated diacritization models. Moreover, our best-performing BiLSTM model achieves a DER of 1.45% and a WER of 3.89%.
The rest of this paper is organized as follows. The next section reviews systems proposed to automate the diacritization of Arabic text. Section III provides background information on sequence transcription and recurrent neural networks. Section IV illustrates our experimental setup. Section V presents and discusses the results of our experiments and compares our best results with those of previous best performing models. Finally, we conclude our work in Section VI.

II. LITERATURE REVIEW
Diacritization is the process of adding diacritics to the letters of undiacritized text. This operation is essential to many applications that involve translation and text-to-speech (TTS) conversion. Many models have been proposed over the years to automate the process of diacritizing Arabic text. These models include rule-based models, statistical models, and hybrid models. Rule-based natural language processing (NLP) systems depend on a set of well-defined, language-dependent rules formed by exploiting solid linguistic knowledge. These systems are based on dictionaries and/or morphological and syntactic analyzers/generators [5], [6]. Although rule-based approaches achieve acceptable results, their main drawback is the difficulty of maintaining and including all aspects of the language in a comprehensive set of rules. This is even more significant with a morphologically and syntactically complex language like Arabic [7].
Statistical approaches use large corpora of diacritized texts to predict the probability distribution of diacritics for a sequence of characters. The main advantage of these approaches is that they do not depend on a set of rules to solve the problem and hence do not require solid linguistic knowledge. Statistical methods that have been applied to Arabic text diacritization include hidden Markov models (HMM) [8][9], n-grams [10], finite state transducers (FST) [11], conditional random fields (CRF) [12], and neural networks. Recently, most proposed systems have combined statistical approaches with linguistic knowledge such that the stochastic process is guided by language-specific rules, introducing hybrid approaches [3], [13]-[19].
More recently, RNNs have been successfully used to restore the diacritics of Arabic texts by treating the problem as a sequence transcription task. Our previous work in [20] proposed, trained, and tested a bidirectional LSTM network that transcribes raw undiacritized Arabic sequences into fully diacritized ones. Error correction techniques were applied as a post-processing step to the output of the network to overcome some transcription errors. We also experimented with preprocessing the RNN input using a morphological and syntactic analyzer in [21]. Mubarak et al. [22] implemented a sequence-to-sequence model using an encoder-decoder LSTM RNN with content-based attention. They used a fixed-length sliding window of character-based n-words in the training process and a voting algorithm with n-gram probabilistic estimation to select the most likely diacritized form of a word. They trained their model using 4.5 million tokens and tested it using the freely available WikiNews corpus of 18,300 words.
In [4], Fadel et al. tested and compared several existing web-based automatic diacritization tools. They produced a cleaned subset of 55K sequences from the Tashkeela dataset, split into training, testing, and validation sets. In [23], they implemented and tested several neural network models that belong to two main approaches: feed-forward neural networks and recurrent neural networks. They explored several models using different types of input layers, using a CRF classifier instead of the softmax layer, and optimizing gradient normalization using block-normalized gradient (BNG).
Darwish et al. [24] proposed an approach to automatic diacritization that consists of two bidirectional LSTM RNNs. The first network is responsible for core-word (i.e., all letters other than the last letter of the word) diacritics and the second for case-ending (i.e., last letter) diacritics. They trained and tested their approach on two sets: one representing MSA texts and the other representing CA texts. Their model included post-correction using a unigram language model.
In our most recent work [25], we trained and tested RNN models using two datasets: the Linguistic Data Consortium's Arabic Treebank part 3 (LDC-ATB3) [26] and the cleaned subset of Tashkeela [4]. We performed extensive experiments to explore and analyze the effect of tuning several network parameters, such as the number of network layers and the use of dropout, on the accuracy and execution time of the tested models. We also experimented with models built using different network architectures, alternative approaches to handling problems in sequence lengths, and multiple encoding methods for the diacritized output sequences.
Madhfar and Qamar [27] implemented and evaluated three character-level deep learning models for automatic diacritization. The first model is a network that consists of six layers: an embedding layer, followed by three bidirectional LSTM layers, a projection layer, and finally a softmax layer. The second model consists of an encoder and a decoder with location-based attention. The third model consists only of the encoder part of the second model. Its core architecture is implemented using a 1-D convolution bank, a multi-layer highway network, and a bidirectional GRU network; the model is named CBHG (1-D Convolution Bank + Highway network + Bidirectional GRU).
In this paper, we experiment with training a deep BiLSTM model using several datasets of incremental sizes extracted from the Tashkeela dataset. Our goal is to test the accuracy of the trained model in each case, thus investigating the effect of the training set size on the accuracy of diacritization. Moreover, our work includes extracting a cleaned corpus from the full Tashkeela dataset which includes only sequences with diacritization-to-letter rates greater than 80%.

III. SEQUENCE TRANSCRIPTION
Many machine learning tasks can be implemented as a sequence transcription problem, in which input sequences are translated into corresponding output sequences. These include speech recognition, machine translation, and text-to-speech [28]. Arabic text diacritization has been expressed successfully as a sequence transcription problem as well [20]-[27]. In our work, an input sequence X consists of characters x_1, x_2, ..., x_T that represent the undiacritized sequence. The output sequence Y is a sequence of diacritics y_1, y_2, ..., y_T such that y_i is the diacritic of the letter x_i.
Recurrent neural networks (RNNs) have proved to perform best on sequence transcription problems. This is because a cell's hidden state is a function of all previous states with respect to time. This provides RNNs with their ability to maintain correlations between data points in the input sequence and the capability of pointing backward in time [28]. Basic recurrent neural networks are a generalization of feedforward neural networks to sequences [29]. Given a sequence of inputs (x_1, x_2, ..., x_T), a standard RNN computes a sequence of outputs (y_1, y_2, ..., y_T). At each time step t, a recurrent neuron receives the output vector from the previous time step, y(t-1), in addition to the input vector x(t). Hence, y(t) is a function of x(t) and y(t-1), which is a function of x(t-1) and y(t-2), which is a function of x(t-2) and y(t-3), and so on. Consequently, y(t) is a function of all input vectors since t = 1 [30].
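In equation form, one common parameterization of this recurrence (not spelled out in the text, but consistent with the description above) is

y(t) = φ(W_x x(t) + W_y y(t-1) + b),

where φ is an activation function such as tanh, and W_x, W_y, and b are learned parameters; unrolling this relation makes y(t) depend on every input since t = 1.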
Sequence transcription problems solved using RNNs can be classified into four categories based on the lengths of the input and output sequences [30]. One-to-one networks take an input sequence and produce an output sequence of the same length. Sequence-to-vector networks transcribe input sequences into one final output by ignoring all previous outputs. Vector-to-sequence networks take one input vector and produce an output sequence. The general sequence-to-sequence network produces an output sequence that is generally not of the same length as the input sequence; this type is often implemented using the encoder-decoder architecture [31]. In this work, we implement automatic Arabic diacritization as a one-to-one sequence transcription problem since, for each input sequence of characters, the output sequence of diacritics has the same length.
Long short-term memory (LSTM) RNNs were first proposed in [32] to deal with basic RNNs' problem of decaying or slowly changing weights, which results in their inability to learn long dependencies in the input sequences. LSTM networks, on the other hand, use purpose-built memory cells, converge faster, and can detect long-term dependencies in the sequences [28]. Each memory cell has two states: the short-term state (also used as the cell output) h(t) and the long-term state c(t). These states are updated using an input gate, a forget gate, an output gate, and a cell activation unit. The operation of these gates collectively enables the LSTM cell to capture long-term patterns by recognizing important inputs, preserving them as long as they are needed, and extracting them whenever they are needed. Fig. 1 shows a basic RNN cell and an LSTM cell.
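For reference, one widely used formulation of the LSTM cell update (the paper does not list the equations explicitly) computes, at time step t,

i(t) = σ(W_xi x(t) + W_hi h(t-1) + b_i)      (input gate)
f(t) = σ(W_xf x(t) + W_hf h(t-1) + b_f)      (forget gate)
o(t) = σ(W_xo x(t) + W_ho h(t-1) + b_o)      (output gate)
g(t) = tanh(W_xg x(t) + W_hg h(t-1) + b_g)   (cell candidate)
c(t) = f(t) ⊙ c(t-1) + i(t) ⊙ g(t)
h(t) = o(t) ⊙ tanh(c(t))

where σ is the logistic sigmoid and ⊙ denotes element-wise multiplication; the gates control what is written to, kept in, and read from the long-term state c(t).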
Conventional unidirectional RNNs can make use only of previous context. However, many sequence transcription problems, including diacritization, require exploiting future context as well. Bidirectional RNN layers achieve this by comprising two unidirectional layers that process the sequence in both time directions, producing two hidden vectors. The output is a function of both vectors and, consequently, exploits past and future contexts [33]. Fig. 2 shows the general structure of a bidirectional neural network unfolded for three time steps. RNNs are made even more powerful by stacking multiple layers on top of each other, forming a deep RNN. Deep networks are necessary to learn complex transcription functions. In such architectures, the output sequence of one layer acts as the input sequence for the next layer.

IV. EXPERIMENTAL SETUP
In this section we provide the details of the experiments conducted in this work. We illustrate the methodology used, how datasets were extracted and preprocessed, the scheme used to encode sequences, and the structure of our baseline model. We performed all experiments on the Cyclone supercomputer of the High-Performance Computing Facility of The Cyprus Institute [34]. The processing and memory specifications of the used resources on the platform are listed in Table IV.

A. Methodology
We performed several experiments that involved training our baseline model using corpora of different sizes. All experiments went through two phases: the first phase is training the model and the second phase is testing its diacritization accuracy. In the training phase, diacritics are removed from the diacritized training sequences to generate undiacritized sequences. The generated undiacritized sequences are the model input sequences, whereas the diacritic sequences are the model target sequences. Both the undiacritized input sequences and the diacritic target sequences are fed to the model after being encoded. Fig. 3 shows the steps performed in the training phase of the experiments. In the testing phase, diacritics are removed from the diacritized testing sequences. The trained model takes the generated undiacritized sequences as input to predict their diacritics. We perform minor corrections to the output sequences according to rules developed in our previous work [20]. Corrected output sequences are stored in a text file named diacritized_output.txt. We test the accuracy of the model by comparing the model-diacritized sequences, in the file diacritized_output.txt, with the correctly diacritized target sequences, stored in a file named target_output.txt, in terms of DER and WER. Fig. 4 shows the steps performed in the testing phase of the experiments.
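As an illustration of the preprocessing step that generates the model inputs and targets, the following sketch (assumed helper code, not the exact scripts used in this work) strips the diacritic marks (Unicode U+064B-U+0652, see Section I) from a diacritized sequence and collects, for each remaining character, the marks attached to it.

import re

DIACRITICS = re.compile('[\u064B-\u0652]')   # the eight Arabic diacritic marks

def strip_diacritics(sequence):
    # Produce the undiacritized input sequence fed to the model.
    return DIACRITICS.sub('', sequence)

def split_letters_and_diacritics(sequence):
    # Produce the aligned target: for each non-diacritic character, the list
    # of diacritic marks that immediately follow it (possibly empty).
    letters, marks = [], []
    for ch in sequence:
        if '\u064B' <= ch <= '\u0652':
            if marks:                 # attach the mark to the preceding letter
                marks[-1].append(ch)
        else:
            letters.append(ch)
            marks.append([])
    return letters, marks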

B. Training Datasets
The Tashkeela dataset [35] consists of 75 million diacritized words. Its main part is collected from 97 books filtered from the 7079 books of the Shamela library, an Islamic electronic library. These books are examples of CA text. Only 1.15% of the Tashkeela dataset consists of MSA texts, which are drawn from modern books and crawled from the Internet. This makes Tashkeela mainly an example of CA. In [4], Fadel et al. extracted a subset of 55,000 sequences from the Tashkeela dataset with a diacritization-to-character rate of at least 80%. The subset was cleaned by removing English letters and extra whitespace, fixing some diacritization issues, and separating numbers from words, among other techniques. The subset was divided into 50,000 sequences for training, 2,500 sequences for validation, and 2,500 sequences for testing. This subset was used in our previous work [25] to train and test the developed model.
In this work, we use the cleaning and filtering scripts developed by Fadel et al. [4] to extract the larger datasets used in our experiments. In addition, we wrap sequences such that they have a maximum length of 400 characters. This step is performed to reduce the training time and memory usage and is based on experiments we conducted in our previous work [25]. One of the main goals of this work is to study the effect of incrementing the training data size on diacritization accuracy. We use the 50k training sequences of Fadel et al. as a base dataset from which smaller training sets are derived and to which more sequences are added to form larger training sets. Three smaller subsets are derived by randomly selecting 6,250, 12,500, and 25,000 sequences from the base dataset. Since the base dataset is already cleaned and filtered to meet the 80% diacritization-to-character rate, no further work was needed for these subsets except wrapping to the 400-character length.
To construct larger datasets, we randomly select sequences from the Tashkeela corpora to be added to our sets, starting with the 50K set. The sequences are selected to have a diacritics-to-characters rate of at least 80%. Then, they are processed using the cleaning scripts. To avoid duplication of sequences in our sets, the selected sequences are checked not to be already included in the set being enlarged. Finally, we wrap sequence lengths to 400 characters. By repeating this process, we extracted datasets that consist of 100,000, 200,000, and 400,000 sequences. We also obtain the largest set used in our experiments, which results from including all available sequences from Tashkeela that satisfy the above criterion: 543,364 sequences. Except for the largest dataset, the sizes of the datasets are incremented by doubling the number of sequences from one set to the next. Moreover, the incremental process by which a new set is formed by adding sequences to the current set maintains the inclusion property, such that each dataset is a subset of the next. Table V shows size statistics of the used datasets in terms of word count, letters per word, words per sequence, and the number of sequences after the dataset is wrapped. All subsets have similar letters-per-word and words-per-sequence rates. For all experiments, we use the same validation set of 2,500 sequences and testing set of 2,500 sequences to measure the DER and WER of the trained model in each experiment.
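The selection loop can be summarized by the following sketch. It is a simplified stand-in for the actual cleaning scripts of Fadel et al. [4], and the rate computation below (diacritic marks divided by Arabic letters) is our assumed reading of the 80% diacritics-to-characters criterion; wrapping to a maximum of 400 characters is applied afterwards, as described above.

import re

ARABIC_LETTERS = re.compile('[\u0621-\u063A\u0641-\u064A]')
DIACRITICS = re.compile('[\u064B-\u0652]')

def diacritics_rate(sequence):
    # Fraction of Arabic letters in the sequence that carry diacritic marks.
    letters = len(ARABIC_LETTERS.findall(sequence))
    marks = len(DIACRITICS.findall(sequence))
    return marks / letters if letters else 0.0

def enlarge_dataset(current_set, candidate_sequences, target_size):
    # Grow the current training set with cleaned, non-duplicate sequences
    # whose diacritics-to-characters rate is at least 80%.
    enlarged = set(current_set)
    for seq in candidate_sequences:
        if len(enlarged) >= target_size:
            break
        if seq in enlarged or diacritics_rate(seq) < 0.8:
            continue
        enlarged.add(seq)
    return enlarged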

C. Data Encoding
Sequences used in our experiments are either undiacritized, consisting of letters only, or diacritized, consisting of both letters and their diacritics. Undiacritized sequences are encoded using the Unicode representations of their letters. For diacritized sequences, we experimented with different encoding schemes in our previous work [25]. A one-to-one encoding scheme, which represents the diacritics of each letter with a single code, produced the best results in all performed experiments. Hence, we use this encoding scheme in this work. This scheme benefits from the fact that letters do not change between the input and the output sequences; only diacritics must be added. Hence, it limits the classes at the output to the number of possible diacritic codes, which is 16. Table VI shows the binary codes used for the eight Arabic diacritics. In Arabic, a letter may have two diacritics if one of them is Shadda. In this case, the diacritic code is formed by ORing the Shadda code (1000) with the other diacritic's code. Fig. 5 shows an example of encoding the diacritized word صَيَّاد (hunter), which includes letters with no, one, and two diacritics.
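A small sketch of this encoding is shown below. The per-diacritic code values other than Shadda (1000) are placeholders for illustration only; the actual assignments are those of Table VI.

# Hypothetical 4-bit codes; only the Shadda value (1000) is stated in the text.
CODES = {
    '\u064E': 0b0001,  # Fatha      (assumed value)
    '\u064F': 0b0010,  # Damma      (assumed value)
    '\u0650': 0b0011,  # Kasra      (assumed value)
    '\u064B': 0b0100,  # Fathatan   (assumed value)
    '\u064C': 0b0101,  # Dammatan   (assumed value)
    '\u064D': 0b0110,  # Kasratan   (assumed value)
    '\u0652': 0b0111,  # Sukun      (assumed value)
    '\u0651': 0b1000,  # Shadda     (as given in the text)
}

def encode_letter(marks):
    # A letter with no diacritic is encoded as 0000; Shadda is ORed with the
    # accompanying short vowel or Tanween code, giving 16 possible classes.
    code = 0
    for mark in marks:
        code |= CODES[mark]
    return code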

D. Base Model
For building our models, we use Keras (a Python deep learning library) with TensorFlow as the backend [36]. Our baseline model is a BiLSTM network that consists of an embedding layer of 32 dimensions, four bidirectional LSTM layers of 256 cells each, followed by a 16-cell fully-connected output layer. The Softmax function is used to activate the diacritic class with the highest probability at the output layer. The Adam optimizer is used in training, and sparse categorical cross entropy is used as the loss function. The batch size is set to 128 sequences for all experiments. In addition, the maximum number of epochs used in training is 100, with early stopping such that training stops if the validation accuracy does not improve for five consecutive epochs. Fig. 6 shows the structure of our baseline model.
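The following Keras sketch reflects our reading of this configuration; the vocabulary size is an assumed placeholder, and whether the 256 cells are per direction or in total is not stated, so the value below is illustrative rather than definitive.

from tensorflow.keras import callbacks, layers, models

VOCAB_SIZE = 50    # assumed: number of distinct input letter codes
NUM_CLASSES = 16   # diacritic classes (Section IV-C)

model = models.Sequential()
model.add(layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32))
for _ in range(4):                       # four bidirectional LSTM layers
    model.add(layers.Bidirectional(layers.LSTM(256, return_sequences=True)))
model.add(layers.Dense(NUM_CLASSES, activation='softmax'))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Early stopping on validation accuracy with a patience of five epochs.
early_stop = callbacks.EarlyStopping(monitor='val_accuracy', patience=5)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=128, epochs=100, callbacks=[early_stop])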

V. RESULTS AND DISCUSSION
The following subsections present the experiments performed and discuss their results. We also compare our best results with previous work.

A. Experiments
We experimented with training our baseline model using the eight corpora we extracted from the Tashkeela dataset. We evaluated the trained model in each experiment in terms of the time required to train the model and the model accuracy. We report the training time both as the total training execution time and as the average training time per epoch for each of the eight experiments; Table VII shows these training times. We report the performance of the models during training using the validation set in terms of validation loss and validation accuracy. Fig. 7 shows the validation accuracy and Fig. 8 shows the validation loss as functions of the training epochs for each experiment. In all reported results, we refer to each experiment by the number of sequences in its training dataset (i.e., 6,250, 12,500, 25,000, ...). Training using a larger number of sequences generally results in slower learning, higher accuracy, and lower loss values. The best validation accuracy and validation loss achieved are 0.988 and 0.016, respectively, using the largest dataset of 543,364 sequences.
We tested the diacritization accuracy of the models trained using the eight extracted corpora. For all testing experiments, we use the test set of 2,500 sequences defined by Fadel et al. [4]. Fig. 9 shows the diacritic error rates and word error rates for the eight models. The results show that both DER and WER improve as the number of sequences used in training increases. The largest improvement, 22%, is observed when the training set increases from 6,250 to 12,500 sequences. The improvement decreases gradually as we move towards larger datasets, and no improvement is observed in the error rates when increasing the training set from 400,000 to 543,364 sequences. The best DER and WER achieved are 1.45% and 3.89%, respectively. We analyze the errors of our system by enumerating them according to the number of errors per word and the presence of end-case diacritization errors. The results of this analysis are shown in Figs. 10 and 11. Fig. 10 shows that, for all dataset sizes, most of the mis-diacritized words have one diacritic error. Words with three or more diacritic errors are infrequent, contributing less than 6% of the errors in all experiments. Moreover, the ratio of multiple errors per word decreases with larger datasets. Fig. 11 shows the contribution of end-case diacritization errors to the DER and WER. As explained earlier, end-case diacritization depends on the context and is subject to complex inflection rules. In our results, end-case diacritization errors contribute about half of the word errors in all experiments. The best DER and WER values when ignoring end-case diacritization errors are 0.91% and 1.95%, respectively.
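This error breakdown can be reproduced with a short analysis pass such as the following sketch (our own illustrative code, using the same aligned word representation as the DER/WER sketch in Section I); it buckets mis-diacritized words by their number of wrong diacritics and flags end-case errors.

from collections import Counter

def analyze_errors(pred_words, target_words):
    # errors_per_word maps a count of wrong diacritics to the number of words
    # with exactly that many errors; end_case_errors counts words whose
    # last-letter (end-case) diacritic is wrong.
    errors_per_word = Counter()
    end_case_errors = 0
    for pred, target in zip(pred_words, target_words):
        wrong = sum(p != t for p, t in zip(pred, target))
        if wrong == 0:
            continue
        errors_per_word[wrong] += 1
        if pred[-1] != target[-1]:
            end_case_errors += 1
    return errors_per_word, end_case_errors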

B. Comparison with Previous Work
The best results reported here are for the model trained using 400,000 sequences. Most previous work used either LDC's Arabic Treebank Part 3 (ATB3) [26], which represents MSA, or Tashkeela, which represents CA, or both. To the best of our knowledge, our previous work in [25] achieves the best published results for ATB3. The size of the ATB3 dataset is limited to 22,170 training sequences, which makes it unsuitable for the experiments we perform in this work. We do not include the results of Darwish et al. [14] since they use different training and testing datasets in both their MSA and CA experiments, and hence a comparison would not be fair. They used the training dataset of the RDI diacritizer in [18] and a WikiNews test set for their MSA experiments. For their CA experiments, they used data from an unspecified publisher.
The best DER and WER achieved in this work are 1.45% and 3.89%, respectively. This improves over our previous work, which used a subset of the Tashkeela dataset and reported a DER of 1.97% and a WER of 5.13%. We compare our results with the best results of Fadel et al. [23] and Madhfar and Qamar [27] since both works use the Tashkeela dataset for training and testing. We outperform the model developed by Fadel et al. in DER and WER both with and without case ending. However, they achieve a better last-letter diacritization error rate.
Among the models they experimented with, Madhfar and Qamar report the best DER and WER values for their CBHG model. In our comparison, the CBHG model of Madhfar and Qamar achieves the best DER in all cases. However, the word error rates of our best-performing model outperform those of the CBHG model, indicating that our model yields a lower percentage of wrongly diacritized words. Note that they perform their own cleaning and filtering process, with different rules, to extract the datasets used to train their models. It is worth mentioning that our best-performing model outperforms the baseline model of Madhfar and Qamar, which is a deep BiLSTM RNN; the DER and WER values reported for their baseline model are 2.24% and 8.74%, respectively.
It can be observed that our base model achieves results that are comparable to those of more complex models such as the CBHG model proposed by Madhfar and Qamar. This shows that training using a large, clean dataset with a high diacritization-to-letter rate provides competitive diacritization accuracy. Training more sophisticated models using such a dataset would likely provide even better results. Although this work involves experimentation with a basic BiLSTM RNN, it generates cleaned corpora of incremental sizes that can be used to experiment with several other models. Moreover, it shows that state-of-the-art error rates can be achieved when training using large, clean corpora.

VI. CONCLUSION
Automating the diacritization of Arabic texts is a crucial operation for many Arabic NLP applications. In this paper, we have conducted several experiments to study the effect of changing the training data size on performance. Our work included generating several cleaned subsets of the Tashkeela corpora of incremental sizes in terms of the number of sequences. Our largest subset, which consists of 543,364 sequences, can be used for training and comparing other models, such as the models of Madhfar and Qamar [27]. Our baseline model is a deep bidirectional LSTM RNN. We evaluated the performance of our baseline model during training using each of the generated corpora by monitoring the validation loss and accuracy on the validation set. We tested the diacritization accuracy of the model after being trained on each corpus by finding its DER and WER values when diacritizing the 2,500-sequence testing set.
Our experiments indicate that the performance of the trained model improves as the training set size increases. However, the improvement in DER and WER diminishes as the number of sequences grows. The best achieved DER and WER values are 1.45% and 3.89%, respectively, using a training dataset of 400,000 sequences (about 17 million words). Our WER value is the best when compared with other state-of-the-art results. To further improve performance, we aim to experiment with other proposed models and to develop a loss function that tolerates harmless differences between the output and target sequences during training.
ACKNOWLEDGMENT
This work was supported by computing time granted on the Cyclone supercomputer of the High Performance Computing Facility of The Cyprus Institute.