A Deep Learning Approach for Handwritten Arabic Names Recognition

Optical character recognition (OCR) has enabled many applications, as it has attained high accuracy for printed documents and for handwriting in many languages. However, the state-of-the-art accuracy of Arabic handwritten word recognition lags far behind. Arabic script is cursive in both its printed and handwritten forms. Therefore, Arabic recognition systems traditionally segment a word into characters before recognizing them. Arabic word segmentation is very difficult because many Arabic letters carry dots. Moreover, Arabic letters are context sensitive, and some letters overlap vertically. A holistic recognizer that recognizes common words directly (without segmentation) seems a plausible model for recognizing common Arabic words. This paper presents the results of training a Convolutional Neural Network (CNN), holistically, to recognize Arabic names. Experimental results show that the proposed CNN is significantly superior to other recognizers that were used with the same dataset.


I. INTRODUCTION
Arabic is the official language of more than twenty countries and the mother tongue of more than 300 million people [6]. Arabic is one of the six official languages of the United Nations. The Arabic script is also used to write other languages, such as Persian. Moreover, a form of the Arabic script (the Ottoman script) was formerly used to write the Turkish language [23]. Unfortunately, research in Arabic language recognition is far behind that for languages of comparable size and importance. For instance, the number of popular, widely used Arabic datasets is very small [14]. Moreover, most of these datasets are small. For instance, the IFN/ENIT dataset, which is widely used by researchers and in the ICDAR Arabic recognition competition, contains only 26559 word images (names of Tunisian cities) [17], [22].
Arabic script is cursive in both handwritten and machine-printed text. Fig. 1 presents the letters of the Arabic alphabet and their different shapes. The cursive nature of the script makes handwriting recognition very challenging, as word segmentation is needed before a character recognition system can be applied. The segmentation methods reported in the literature are far from robust or accurate [1], [5]. Segmenting Arabic words into characters is very difficult [12]. There are two reasons for this difficulty. First, the shape of an Arabic letter is context sensitive. Some Arabic letters have four shapes, depending on their position in the word (see Fig. 1). Secondly, dots are very important in Arabic writing, and they are common: fifteen of the twenty-eight letters have dots above or below them. Arabic writers do not always place these dots carefully in their proper positions, and this leads to much confusion. In addition, although Arabic writing runs horizontally from right to left, some letters overlap vertically. For illustration, Fig. 2 contains an image of the Arabic word meaning "arguments" in English. The image shows the word in three different handwriting styles. For comparison, the figure also shows some printed forms of the same word. This is an example of a word that an expert Arabic reader recognizes as one unit, without trying to figure out each letter and its dot(s). An analytical automatic recognition system must first find the proper segmentation and then find the proper coupling of the dots with their corresponding letters.
In the literature, there are good results for isolated Arabic character recognition (in some papers the recognition accuracy is more than 97%) [21], [4], [2], [3]. However, publications on word recognition are few and report low recognition accuracy [15], [16]. This low word recognition accuracy is mainly due to segmentation errors [1]. Arabic readers generally tend to recognize common Arabic words and names holistically (i.e., without segmenting them). For new or uncommon words, Arabic readers identify the letters of the word and then recognize the word. This idea is appealing for automatic recognition and has found good support in the last couple of decades [10]. For Arabic words, the holistic paradigm is faster and, in some cases, more accurate than the analytic paradigm [19], [11]. The disadvantage of the holistic paradigm is its need for a huge amount of training data. Moreover, the holistic paradigm does not work alone, as the recognition system needs to revert to the analytic paradigm when the word under consideration is not in the lexicon of common words. The main advantage of holistic classifiers for OCR is that they bypass the problems caused by over-segmentation and under-segmentation.
Arabic names in current use are highly repetitive. For instance, in the SUST-ARG dataset [13], the twenty most frequent male names represent 52% of the total male names. Application forms normally contain the applicant's name. Therefore, for the automatic processing of these forms, it seems appropriate to use the holistic paradigm to quickly recognize common names and to resort to the analytical paradigm to recognize uncommon ones.
In recent years, deep learning has gained great popularity in the pattern recognition field [9]. It has become the focus of many researchers, since it offers a convenient way to deal with huge amounts of data and it automates the feature extraction task. From our literature survey, we notice that although deep learning is the state of the art for OCR, very few papers use deep learning for Arabic character recognition [21], [4], [2], [8]. Moreover, all of these works address the recognition of isolated Arabic letters and numerals. Therefore, there is a lack of research that uses deep learning to recognize Arabic words holistically (i.e., using the holistic paradigm).
The rest of the paper is organized as follows. The next section describes the materials and methods. It starts by presenting the SUST-ARG names dataset, then describes the preprocessing of the data, and finally describes the proposed model. Section III presents and discusses the results. The conclusions and future work are given in Section IV.

II. MATERIALS AND METHODS
A. Dataset: SUST-ARG names
The dataset used in this paper, SUST-ARG names, was collected by the Arabic Recognition Research Group in the College of Computer Science and Information Technology, Sudan University of Science and Technology. This dataset was collected in two stages. In the first stage, 8028 names written by 2007 students were extracted from the certificate application form that is used by the registrar's office in the college (Fig. 3). From this data, the most frequently repeated names were identified. In the second stage, a new form was designed to collect data for forty common names (Fig. 4). The right column in Fig. 4 contains male names and the left column contains female names. One thousand forms were filled in by university students. The forms were preprocessed, and a dataset containing forty thousand names was constructed. This dataset has two parts: twenty thousand male names and twenty thousand female names. Only the male names are used in the experiments of this paper. Table I contains samples of the male names.
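The first-stage step of identifying the most repeated names can be sketched as a simple frequency count. This is only an illustration under stated assumptions: the function name `top_k_coverage` and the placeholder names in `sample` are hypothetical and do not come from the SUST-ARG data.

```python
from collections import Counter


def top_k_coverage(names, k):
    """Return the k most frequent names and the fraction of all entries they cover."""
    counts = Counter(names)
    top = counts.most_common(k)
    covered = sum(count for _, count in top)
    return [name for name, _ in top], covered / len(names)


# Hypothetical placeholder data, not taken from the dataset.
sample = ["name_a"] * 5 + ["name_b"] * 3 + ["name_c"] * 2
top_names, coverage = top_k_coverage(sample, k=2)
print(top_names, coverage)
```

The same coverage figure is what the paper reports when it states that the twenty most frequent male names represent 52% of the total male names.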

B. Data Preprocessing
The data collection forms were digitized using a scanner, and the names were then extracted from the scanned pages. Most of the names are not centered in their boxes, as can be seen in Table I. Initial experiments showed that the raw data is not suitable as input to the proposed CNN. The extracted images were therefore preprocessed by removing the surrounding white space and downscaling each image to 28 x 56 pixels. After that, all the images were converted to a black background with a white foreground.
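The three preprocessing steps above (cropping the surrounding white space, downscaling, and inverting) can be sketched with NumPy as follows. This is a minimal sketch, not the authors' code. It assumes that 28 x 56 means 28 rows by 56 columns, that a fixed threshold of 128 separates ink from paper, and that nearest-neighbour resampling is acceptable; the paper specifies none of these, and the function name `preprocess` is hypothetical.

```python
import numpy as np


def preprocess(gray, out_h=28, out_w=56, ink_threshold=128):
    """Crop, downscale and invert a scanned name image.

    gray: 2-D uint8 array with dark ink on a light background.
    The output size, threshold and resampling method are assumptions.
    """
    ink = gray < ink_threshold                      # True where there is writing
    rows = np.flatnonzero(ink.any(axis=1))
    cols = np.flatnonzero(ink.any(axis=0))
    if rows.size == 0:                              # blank box: return an all-black image
        return np.zeros((out_h, out_w), dtype=np.uint8)
    # Remove the surrounding white space by cropping to the ink bounding box.
    cropped = gray[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Nearest-neighbour resampling to the target size.
    r_idx = np.arange(out_h) * cropped.shape[0] // out_h
    c_idx = np.arange(out_w) * cropped.shape[1] // out_w
    resized = cropped[np.ix_(r_idx, c_idx)]
    # Invert so that the background is black and the writing is white.
    return 255 - resized


# Synthetic example: a white page with a black block standing in for a name.
page = np.full((100, 200), 255, dtype=np.uint8)
page[30:50, 50:130] = 0
image = preprocess(page)
print(image.shape)
```

A production pipeline would more likely use an image library with anti-aliased resizing; nearest-neighbour indexing is used here only to keep the sketch self-contained.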

C. The Proposed CNN Architecture
Convolutional Neural Networks (CNNs) are one of the important classes of deep learning models and are mostly applied in computer vision to analyze and identify visual imagery [20]. They have been found to be very efficient for computer vision problems, and they achieve state-of-the-art performance when provided with a large amount of training data [18]. Normally, a CNN architecture consists of a sequence of stacked layers, where the output of each layer is used as the input to the layer that follows it. The basic layers found in any CNN architecture are the convolutional layer, the pooling layer, and the fully connected (dense) layer. The most important layer in the CNN architecture is the convolutional layer, and normally it is the first layer (the input layer). It uses convolutional filtering to detect the presence of features that distinguish the input image, and it records them in a feature map. A feature map records the exact position of the features in the image, so a small shift in position results in a different feature map. Down-sampling is one approach to addressing this sensitivity. It is implemented in the CNN architecture by means of a pooling layer, which reduces the number of features generated by the convolutional layer and keeps the most important ones. Two functions are commonly used in the pooling layer: max pooling and average pooling. The former takes the largest value in each patch of the feature map, and the latter calculates the average of each patch. Because the distribution of each layer's inputs changes as the parameters of the previous layer change during training, batch normalization is used to normalize layer inputs and hence speed up the learning process [7].
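The difference between the two pooling functions can be illustrated on a small feature map. The sketch below assumes non-overlapping 2 x 2 patches, which is a common choice but is not stated in the paper; the helper name `pool2x2` and the example values are hypothetical.

```python
import numpy as np


def pool2x2(fmap, mode="max"):
    """Down-sample a 2-D feature map over non-overlapping 2 x 2 patches."""
    h, w = fmap.shape
    # Drop a trailing odd row or column, then expose each patch on axes 1 and 3.
    patches = fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return patches.max(axis=(1, 3))     # largest value in each patch
    return patches.mean(axis=(1, 3))        # average of each patch


fmap = np.array([[1, 3, 0, 0],
                 [2, 4, 0, 8],
                 [0, 0, 5, 5],
                 [0, 0, 5, 5]], dtype=float)

max_pooled = pool2x2(fmap, "max")   # [[4, 8], [0, 5]]
avg_pooled = pool2x2(fmap, "avg")   # [[2.5, 2], [0, 5]]
print(max_pooled)
print(avg_pooled)
```

In the top-right patch a single strong response (8) survives max pooling unchanged, whereas average pooling dilutes it to 2.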
The layer that classifies the input images is the fully connected layer, and normally it is the last layer in the CNN architecture. The number of its outputs equals the number of classes in the classification problem. These outputs are the probabilities of the class labels. Activation functions are an essential part of the CNN architecture. They accomplish the complex functional mapping between the inputs and the outputs of a layer, and those outputs can be used as inputs to the next layer. The proposed CNN architecture for Arabic names recognition uses four convolutional layers, each with the Rectified Linear Unit (ReLU) activation function, which is a very popular activation function. According to the literature, it gives very good experimental results. It is defined as shown in equation 1.
$$f(x) = \max(0, x) \tag{1}$$
The function returns zero if the value is negative and the value itself if it is positive, which means that it grows linearly for positive values. Max pooling selects the brightest pixels from its inputs, so it suits the case where the background is dark, as in the preprocessed images of our dataset. We therefore used it after the second and the fourth convolutional layers. Our proposed model uses two dense layers, with a softmax activation function after the second dense layer. The softmax function forces the output of each unit to be between 0 and 1. It normalizes the outputs so that their total sum is equal to 1, and it can be expressed mathematically as shown in equation 2.
$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \tag{2}$$
where z is the vector of inputs and j is the index of the output units, which runs from 1 to K. The complete architecture of our proposed model is depicted in Fig. 5.
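The two activation functions described above can be written in a few lines of NumPy. This is an illustrative sketch rather than the code used in the experiments. Subtracting the maximum inside `softmax` is a standard numerical-stability step that does not change the result and is not taken from the paper, and the example `scores` are hypothetical.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity for positive ones."""
    return np.maximum(0.0, x)


def softmax(z):
    """Map a vector of K scores to K values in (0, 1) that sum to 1."""
    shifted = z - np.max(z)      # stability trick; leaves the output unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()


scores = np.array([2.0, -1.0, 0.5])    # hypothetical pre-activation values
activated = relu(scores)               # the negative entry becomes 0
probs = softmax(scores)                # largest score gets the largest probability
print(activated, probs, probs.sum())
```

With twenty name classes, the second dense layer would produce K = 20 such scores, and the predicted name is the index of the largest softmax output.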

III. RESULTS AND DISCUSSION
We used the Keras library in the TensorFlow environment to construct the proposed CNN model. Ten-fold cross-validation was used to train and test the proposed model, with the entire dataset divided into 10 folds. One of these 10 folds is used as the testing set and the remaining folds are used for training the proposed model. This process is repeated 10 times, using a different fold for testing each time. The final result is calculated as the average of the results obtained from the 10 repetitions. Prediction accuracy is used as the performance measure. The accuracy obtained for the testing names is 99.14%. Table II shows some of the misclassified names.
The literature survey reveals that two papers describe recognition models that use the same dataset [19], [11]. Both papers used only the male names part of the dataset (the same as the experiments of this paper). The classifier of the first paper is a hidden Markov model (HMM), and its best classification accuracy is 63%. The classifier of the second paper is a probabilistic neural network, with a classification accuracy of 89% (see Table III).
Although the classification accuracy of the proposed model is very high, one may consider the small number of classes (only 20 names) to be a major drawback. As mentioned earlier, a holistic classifier does not work alone. The classifier recognizes common names only, and the system should use the analytical approach to recognize new or uncommon names. Therefore, there is a critical trade-off here: increasing the number of names handled holistically speeds up processing, because fewer names fall back to the analytical approach, but it raises the cost of training. Moreover, increasing the number of classes may decrease the accuracy. This is a question for further research.
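The ten-fold procedure described above can be sketched independently of the model. The code below is a minimal illustration, assuming that the samples are shuffled before being split (the paper does not say). The names `k_fold_indices`, `cross_validate` and `evaluate_fold` are hypothetical, and `evaluate_fold` stands in for training the CNN on the training indices and returning its accuracy on the test indices.

```python
import numpy as np


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs, using each fold as the test set once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx


def cross_validate(evaluate_fold, n_samples, k=10):
    """Average the per-fold scores returned by evaluate_fold(train_idx, test_idx)."""
    scores = [evaluate_fold(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return float(np.mean(scores))


# Sanity check of the split itself on 20000 samples (the male names part).
splits = list(k_fold_indices(20000, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each sample appears in exactly one test fold, so the averaged accuracy is computed over predictions for the whole dataset.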

IV. CONCLUSION
This paper tests the use of the holistic paradigm to recognize Arabic names without segmentation. As deep learning is currently the state of the art for such systems, a CNN model was designed, trained, and tested using the SUST-ARG names dataset. The goal of the paper is to study the idea of building a system that recognizes common Arabic names very quickly. Although the accuracy is very high, the number of names selected for the experiments of this paper is relatively small (20 names). This small number was selected mainly to allow comparison of the proposed model with previous similar work on the same dataset. According to the statistics of the given dataset, these 20 names represent more than 50% of SUST students' names. Increasing the number of names so that they represent, for instance, more than 90% of the names in use is the future work needed to support and prove the practicality of this model.

TABLE III. ACCURACY OF THE PROPOSED CLASSIFIER TOGETHER WITH THE ACCURACY OF TWO CLASSIFIERS TRAINED ON THE SAME DATASET.
Classifier                          Accuracy
The proposed model                  99.14%
Hidden Markov Model [19]            63%
Probabilistic neural network [11]   89%