Printed Arabic Text Recognition using Linear and Nonlinear Regression

Arabic is one of the most widely spoken languages in the world; hundreds of millions of people in many countries speak it as their native language. However, due to the complexity of the Arabic language, recognition of printed and handwritten Arabic text remained largely unexplored for a long time compared with English and Chinese. Although a significant amount of research on recognizing printed and handwritten Arabic text has been carried out in the last few years, it remains an open research field due to the cursive nature of Arabic script. This paper proposes an automatic printed Arabic text recognition technique based on linear and ellipse regression. After collecting all possible forms of each character, a unique code is generated to represent each character form. Each code contains a sequence of lines and ellipses. To recognize fonts, a unique list of codes is identified and used as the fingerprint of each font. The proposed technique has been evaluated using over 14,000 different Arabic words in different fonts, and experimental results show that its average recognition rate is 86%.


I. INTRODUCTION
Optical Character Recognition (OCR) is the process of analyzing images of printed or handwritten text to translate character images into a machine-editable format [1]. Printed character recognition has been used extensively in many areas, especially after the rise of electronic media, which increased the need to convert printed text into electronic form.
Several recognition techniques have been proposed recently to recognize printed Arabic characters [2,3,4,5,7]. Nevertheless, recognizing printed Arabic characters is still an active research area with many challenges. Arabic text has several characteristics that make recognizing printed Arabic text difficult. These characteristics include [2,3]:
-Arabic script is cursive, and its characters are connected even in machine-printed documents (see Fig. 1).
-Neighboring characters in Arabic script usually overlap (see Fig. 2), which increases the difficulty of isolating characters.
-Several characters in Arabic script share the same base form and differ only in one or more dots in different locations (see Fig. 3).
-A character's form may vary between fonts (see Fig. 4). Therefore, the large number of existing Arabic fonts increases the difficulty of the recognition task.
-Arabic has 28 characters and 10 numerals. As shown in Table 1, each character has up to four forms depending on its position in the word (isolated, beginning, middle, and end). Therefore, 120 different character forms are expected in each font after adding the extra character ( ‫ال‬ ), which is created by writing ALIFON ( ‫أ‬ ) after LAMON ( ‫ل‬ ). Although this number of forms has been mentioned, described, and used in many studies [8][20][22][23], the problem is even larger than this. Unfortunately, a character may have different forms in the same location in the same font. As shown in Fig. 5, BAAON ( ‫ب‬ ) has up to five forms at the beginning of a word in the same font (Traditional Arabic). Its form depends not only on its neighbor but also on the neighbor of its neighbor, as in BEMA ( ‫بما‬ ) and BEM ( ‫بم‬ ).
This paper proposes a printed Arabic text recognition technique based on linear and ellipse regression. Characters are recognized using a codebook, which contains a code for each character form as well as fingerprints to recognize fonts. Character codes and font fingerprints are represented as sequences of points, lines, and ellipses. Linear regression is employed to avoid the difficulties of representing line segments using ellipses, such as an infinite major axis and a very small minor axis.
The main contribution of the proposed technique is generating a codebook that contains the fingerprint of each font and a code for every possible character form without using corpora. This feature allows the proposed technique to keep pace with the rapid creation of new fonts. Moreover, the proposed technique uses linear and ellipse regression to represent character forms.
The rest of this paper is organized as follows. Section 2 overviews related work. Section 3 discusses the proposed technique for recognizing printed Arabic characters. Section 4 explains the evaluation methodology and experimental results. Finally, the paper is concluded in Section 5.

II. RELATED WORK
Although a large number of printed Arabic character recognition approaches have been proposed in the last few years, the recognition rates of Arabic OCR systems still need improvement. This section overviews some of these approaches.
Rashad et al. [2] have compared K-Nearest Neighbor (KNN) and Random Forest Tree (RFT) classifiers for recognizing printed Arabic characters. First, global binarization is used to binarize the images. Then, 14 statistical features are extracted from each character using horizontal and vertical transition techniques. Finally, KNN and RFT are applied to recognize the characters. Their experiments show that, although KNN is faster than RFT in training and testing, RFT performs better: the recognition rate of RFT is 98%, while that of KNN is 87%.
Chergui et al. [10] have proposed a multiple classifier system (MCS) to recognize Arabic optical characters. The proposed classification engine is based on a serial combination of a Radial Basis Function (RBF) network and a set of Adaptive Resonance Theory networks (ART1). The RBF-based classifier scores the most likely classes based on the first 49 Tchebichef moments, which are extracted after normalization, alignment, and thinning. By using Tchebichef moments, the image is represented with minimal information redundancy. Finally, an adaptive resonance theory network is applied to each group obtained from the RBF-based classifier. Experimental results show that the proposed classification engine outperforms both RBF-based and ART1-based classifiers.
Amara et al. [11] have reviewed Arabic OCR using Support Vector Machines (SVM). Although SVM has proven its efficiency in many domains, it has not been effectively applied to recognizing Arabic characters. The authors conclude that algorithms applying SVM to Arabic OCR still face many challenges, such as precision, consistency, and efficiency. They found that the best recognition rate is reached by applying the one-against-all technique with a Gaussian RBF kernel, with the best kernel parameters determined using ten-fold cross-validation.
Jiang et al. [12] have proposed a small-size printed Arabic text recognition approach based on hidden Markov model (HMM) estimation. Although applying HMMs has some advantages (such as requiring no pre-segmentation), the poor image quality of small-size printed Arabic text makes it difficult to find accurate model boundaries. In the proposed approach, the number of HMM states is optimized, and the bootstrap approach is modified to improve the accuracy of finding model boundaries for small-size printed Arabic text with poor image quality. The bootstrap approach is modified by training several HMMs with different numbers of states and selecting the best-performing HMM before Viterbi alignment. Their experimental results show that the word recognition error rate decreases by 13.3% and the character recognition error rate by 14%.
Ahmed et al. [13] have employed a special type of recurrent neural network, bidirectional long short-term memory (BLSTM) networks, to propose a segmentation-free optical character recognition system. BLSTM has proven its efficiency in many research areas due to its ability to remember events across long time lags. However, BLSTM requires pre-segmented training data and post-processing to transform outputs into label sequences. Therefore, a layer called connectionist temporal classification (CTC) is used with BLSTM to label unsegmented sequences directly. The proposed approach has been evaluated on cursive Urdu and non-cursive English scripts. Although their experiments show an accuracy of 99.18% with the non-cursive script, accuracy with the cursive script is only 11.99%. Therefore, the proposed approach needs further investigation to enhance its accuracy on cursive scripts.
The accuracy of an optical character recognition system is influenced by graphical entities (e.g., horizontal or vertical edges, symbols, logos) present in the printed document image. To overcome this problem, Rani et al. [14] have proposed an algorithm to detect such graphical entities. The proposed algorithm detects graphical entities using Zernike moments and histogram-of-gradient features, and detects horizontal and vertical lines by masking the image with a rectangular structuring element. Their experimental results show that the accuracy of the proposed algorithm is 97% in detecting graphical entities and 92% in detecting horizontal and vertical lines.
Sarfraz et al. [3] have proposed an offline Arabic character recognition system with four stages. The first, text-preprocessing, stage removes isolated pixels and corrects drift. A pixel is considered isolated if it has no neighboring pixels. Drift is corrected by rotating the image according to the most frequent angle among all angles of line segments between any pair of black pixels in the image. In the second stage, line and word segmentation are performed using horizontal and vertical projection. Words are segmented into individual characters by comparing the vertical projection profile with a fixed threshold. The feature space is built using the moment invariant technique. Finally, characters are recognized using two different approaches: a syntactic approach and a neural network approach. Their experiments show that the recognition accuracy of the syntactic approach is 89%-94%, while that of the neural network approach is 98%.
Abdi et al. [15] have proposed a text-independent Arabic writer identification and verification approach. The approach adapts the beta-elliptic model to construct its own grapheme codebook instead of extracting natural graphemes from a training corpus using segmentation and clustering. The size of the generated codebook is reduced using feature selection, and feature vectors are extracted using template matching to perform writer identification and verification.
Zagoris et al. [6] have proposed an approach to differentiate between handwritten and machine-printed text. The text image is segmented into blocks, and each block is represented as a word vector containing local features identified using the Scale-Invariant Feature Transform. Based on Support Vector Machines, the proposed approach decides whether a text block is handwritten, machine-printed, or noise by comparing its word vector with a codebook.

III. THE PROPOSED RECOGNITION TECHNIQUE
The architecture of the proposed recognition technique is shown in Fig. 6. In the preprocessing stage, the text image is thinned using the Zhang-Suen thinning algorithm [16] and segmented into disconnected sub-words (see Fig. 7).
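For reference, the thinning step can be sketched in pure Python/NumPy as follows. This is a compact, unoptimized sketch of the Zhang-Suen algorithm cited as [16] (a production pipeline would typically use an OpenCV binding instead); the input is assumed to be a 0/1 image with a one-pixel background border.

```python
import numpy as np

def zhang_suen_thin(img):
    """Zhang-Suen thinning on a binary image (1 = foreground)."""
    img = img.copy().astype(np.uint8)
    changed = True
    while changed:
        changed = False
        for step in (0, 1):           # the two sub-iterations of the algorithm
            to_delete = []
            for y in range(1, img.shape[0] - 1):
                for x in range(1, img.shape[1] - 1):
                    if img[y, x] == 0:
                        continue
                    # neighbours P2..P9: N, NE, E, SE, S, SW, W, NW
                    p = [img[y-1, x], img[y-1, x+1], img[y, x+1],
                         img[y+1, x+1], img[y+1, x], img[y+1, x-1],
                         img[y, x-1], img[y-1, x-1]]
                    b = sum(p)                          # non-zero neighbours
                    a = sum(p[i] == 0 and p[(i+1) % 8] == 1
                            for i in range(8))          # 0 -> 1 transitions
                    if step == 0:
                        cond = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                    else:
                        cond = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((y, x))
            for y, x in to_delete:
                img[y, x] = 0
            changed = changed or bool(to_delete)
    return img
```

The transition count `a == 1` is what preserves connectivity, which matters here because sub-word segmentation follows the thinning step.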
Relations between segments are represented using Freeman code as follows. For each segment with a set of pixels {(x_i, y_i)}, a center point (c_x, c_y) is defined, where c_x and c_y are the means of the x- and y-coordinates of the segment's pixels. All center points are sorted from left to right and from top to bottom. Directions from each point to the following three points (if they exist) are represented by the Freeman code shown in Fig. 8.
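A minimal sketch of this step, assuming the common Freeman convention with code 0 pointing east and codes increasing anticlockwise in 45-degree steps (the paper's Fig. 8 may order the codes differently), and treating center points as (x, y) tuples:

```python
import math

def freeman_direction(p_from, p_to):
    """Freeman 8-direction code from one point to another (0 = east)."""
    dx = p_to[0] - p_from[0]
    dy = p_to[1] - p_from[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    return int(((angle + 22.5) % 360.0) // 45.0)   # nearest 45-degree sector

def segment_relations(centers, fanout=3):
    """For each centre point, Freeman directions to the next `fanout` points."""
    centers = sorted(centers)       # left-to-right, then top-to-bottom
    codes = []
    for i, c in enumerate(centers):
        codes.append([freeman_direction(c, n)
                      for n in centers[i + 1:i + 1 + fanout]])
    return codes
```

Note that in image coordinates the y-axis usually points downward, which would mirror the codes vertically; the sketch above uses mathematical (y-up) orientation.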
In the second stage, a code is generated using the proposed encoding technique, described in subsection A. The generated code is compared with character codes from the codebook (described in subsection C) using the proposed matching technique (described in subsection B) to recognize the characters. Finally, the recognized words are output.

A. Proposed Encoding Technique
Arabic script is cursive. Therefore, Arabic words can be represented by a sequence of lines and curves. The proposed recognition technique generates a sequence of points, lines, and ellipses to represent each sequence of connected characters or sub-characters. Points are used to represent the dots that exist in several characters. To collect sequences of connected pixels that can be regressed to lines, the technique computes the line that passes through the first two pixels. A new pixel is added to the list if its distance from the line is less than or equal to a pre-specified value (the accuracy factor). After adding a new pixel to the list, the line is recalculated to find the best line fitting all pixels in the new list. If the distance between the current pixel and the previous line is greater than the accuracy factor, the algorithm terminates. If the length of the line segment corresponding to the collected pixels is greater than a pre-specified length, a code for this sequence of connected pixels is generated as (ρ, θ, ℓ), where ρ and θ are the parameters of the line (ρ is the distance of the line from the origin, and θ is the angle that the line from the origin to the closest point on the line makes with the polar axis), and ℓ is the length of the line segment.
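The greedy pixel-collection loop can be sketched as follows. For brevity, this sketch tests each candidate pixel against the line through the run's endpoints rather than refitting the regression line after every addition, and the accuracy factor and minimum segment length are illustrative values, not the paper's:

```python
import math

def collect_line_pixels(pixels, accuracy=1.0, min_len=4.0):
    """Greedily collect a pixel run that regresses to a line (sketch).

    `pixels` is an ordered list of (x, y) along a stroke; `accuracy` is the
    maximum point-to-line distance and `min_len` the minimum segment length.
    Returns the collected run, or None if it is too short to encode.
    """
    if len(pixels) < 2:
        return None
    run = [pixels[0], pixels[1]]
    for p in pixels[2:]:
        a, b = run[0], run[-1]
        dx, dy = b[0] - a[0], b[1] - a[1]
        norm = math.hypot(dx, dy)
        # perpendicular distance of p from the line through a and b
        dist = abs(dy * (p[0] - a[0]) - dx * (p[1] - a[1])) / norm if norm else 0.0
        if dist > accuracy:
            break                      # accuracy factor exceeded: stop collecting
        run.append(p)
    length = math.hypot(run[-1][0] - run[0][0], run[-1][1] - run[0][1])
    return run if length >= min_len else None
```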
To find the best line fitting a set of pixels, the proposed recognition technique applies the linear regression methodology described in [8]. The best line fitting a set of pixels {(x_i, y_i)} is the (ρ, θ) that minimizes the sum of squared perpendicular distances, Σ_i (x_i cos θ + y_i sin θ − ρ)². Table 2 shows the codes of the line segments detected in the sub-words in Fig. 7.
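A standard total-least-squares line fit (assumed here to stand in for the method of [8]) can compute the Hesse normal form (ρ, θ) directly, where the line is x·cos θ + y·sin θ = ρ:

```python
import numpy as np

def fit_line_polar(pts):
    """Total-least-squares line fit returning (rho, theta)."""
    pts = np.asarray(pts, dtype=float)
    centroid = pts.mean(axis=0)
    # The line's normal is the singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(pts - centroid)
    nx, ny = vt[-1]
    theta = np.arctan2(ny, nx)
    rho = centroid @ (nx, ny)
    if rho < 0:                        # keep rho non-negative
        rho, theta = -rho, theta + np.pi
    return rho, theta % (2 * np.pi)
```

Minimizing perpendicular (rather than vertical) distance is what makes the fit work for near-vertical strokes, which are common in Arabic script.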
In the previous step, pixels that can be regressed to lines are extracted. In this step, all remaining pixels are clustered into sets such that the pixels in each set are connected and can be regressed to an ellipse. Each ellipse is described by the six coefficients (a, b, c, d, e, f) of the conic a x² + b xy + c y² + d x + e y + f = 0, fitted subject to an ellipse constraint. Following [9], the optimal solution of equation (6) can be found as the eigenvector, with the minimal positive eigenvalue, of the matrix formed from the scatter and constraint matrices. After finding the best ellipse fitting a set of pixels, a code is generated to represent its ellipse arc as (x_c, y_c, r_1, r_2, φ, θ_s, θ_e), where (x_c, y_c) is the center of the ellipse, r_1 and r_2 are the major and minor axes, φ is the anticlockwise angle of rotation from the x-axis to the major axis (φ ranges from 0 to 180), θ_s is the start angle of the arc, and θ_e is the end angle. As shown in Fig. 9, a set of pixels is represented as an anticlockwise arc from the start point to the end point. To calculate θ_s and θ_e, the center of the ellipse is moved to the origin. Angles α_s and α_e are calculated, where α_s is the anticlockwise angle of rotation from the x-axis to the line from the origin to the start point, and α_e is the corresponding angle for the end point. Finally, θ_s and θ_e are calculated as θ_s = α_s − φ and θ_e = α_e − φ. Table 3 shows examples of ellipse arcs detected in the sub-words in Fig. 7. For simplicity, the numbers in Table 3 are rounded to the nearest integers.
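The eigenvector-based ellipse fit can be sketched with a Fitzgibbon-style direct least-squares fit (assumed here to correspond to [9]): build the design matrix over the conic monomials, impose the ellipse constraint 4ac − b² = 1 via a constraint matrix, and take the eigenvector associated with the single positive eigenvalue:

```python
import numpy as np

def fit_ellipse_conic(x, y):
    """Direct least-squares ellipse fit; returns conic coefficients.

    Fits a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0 subject to 4ac - b^2 = 1.
    """
    D = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    S = D.T @ D                        # scatter matrix
    C = np.zeros((6, 6))
    C[0, 2] = C[2, 0] = 2.0            # encodes the constraint 4ac - b^2 = 1
    C[1, 1] = -1.0
    eigval, eigvec = np.linalg.eig(np.linalg.solve(S, C))
    k = np.argmax(eigval.real)         # the single positive eigenvalue
    coeffs = eigvec[:, k].real
    return coeffs / np.linalg.norm(coeffs)
```

The center, axes, and rotation angle for the arc code (x_c, y_c, r_1, r_2, φ, θ_s, θ_e) can then be recovered from the conic coefficients; for instance, the center solves the 2x2 linear system given by the conic's gradient being zero.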
A code for a sequence of connected characters or sub-characters is generated as ((e_1, d_1), (e_2, d_2), …), where each e_i is a point, line, or ellipse code and d_i is the Freeman direction from e_i to e_{i+1}. Fig. 10 shows an example of a sub-word code. A null direction is represented using (9). Finally, the complete code of a word or list of words is generated as shown in Fig. 11.

B. Matching Method
To extract character codes and to recognize characters in a text image, the following matching method is applied. Two line segments (ρ_1, θ_1, ℓ_1) and (ρ_2, θ_2, ℓ_2) are compared by comparing their vectors in polar form: they match if |ℓ_1 − ℓ_2| ≤ ε_ℓ and the angle between their directions is at most ε_θ, where ε_ℓ and ε_θ are accuracy factors for the length and direction of the vectors. Consequently, all parallel line segments with the same length are equivalent.
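A sketch of the segment comparison, with hypothetical accuracy-factor defaults (the paper does not state its values):

```python
import math

def lines_match(l1, l2, eps_len=2.0, eps_dir=math.radians(10)):
    """Compare two line-segment codes (rho, theta, length) as polar vectors.

    Segments match when their directions agree within `eps_dir` and their
    lengths within `eps_len`. Position (rho) is ignored, so parallel
    segments of equal length are equivalent, as in the paper.
    """
    _, th1, len1 = l1
    _, th2, len2 = l2
    # undirected lines: theta and theta + pi describe the same direction
    d = abs(th1 - th2) % math.pi
    d = min(d, math.pi - d)
    return d <= eps_dir and abs(len1 - len2) <= eps_len
```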

C. Codebook Generation
For each font, the codebook contains a unique code for each character form as well as a unique list of codes that represents the fingerprint of the font.
To identify a unique code for each character form, the character forms have to be collected first. However, each character may have different forms in the same location in the same font. Therefore, for each location, a list of all possible combinations of connected characters with maximum length three is generated. This length has been chosen because some characters at the beginning of a word change their forms depending on the neighbors of their neighbors. The sub-words in each list are classified, and a code is generated for each class. Table 4 shows the connectivity of the Arabic characters used to generate sub-words. A new character, KASHEEDA ( ‫ـ‬ ), is added, which can be used in many places in Arabic words. As shown in Table 4, there are 36 characters with right connectivity and 25 with left connectivity. Table 5 shows the number of sub-words that can be generated for each location.
A code is identified for each sub-word by generating a set of images using different font sizes. Each image is converted to a code, and the common code is extracted to represent the sub-word code. The reason for using different sizes is that, although character form is not affected by font size in most existing fonts, the thinning process is affected by it. Fig. 12 shows the word AKALA ( ‫أكل‬ ) using different sizes of the Times New Roman font. As shown in Fig. 12, although the LAMON ( ‫ل‬ ) character has the same form in the original word at different font sizes, it has different thinned forms. In Fig. 12, the Zhang-Suen thinning algorithm is applied during the thinning process; the same result was reached even with other thinning algorithms (such as the Guo-Hall algorithm). Thinning Arabic script has many problems, as mentioned in [18]. These problems increase the difficulty of generating a unique code for each character form using one font size.

IV. EVALUATION AND EXPERIMENTAL RESULTS
The proposed technique has been implemented using OpenCV.
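Returning to the sub-word generation step of the codebook, it can be sketched as follows, using toy connectivity sets in place of Table 4 (the real sets contain 36 and 25 characters, and the real generator would also respect position-specific forms):

```python
from itertools import product

# Hypothetical stand-ins for Table 4:
RIGHT_CONNECT = {"b", "t", "l"}   # characters that accept a connection on the right
LEFT_CONNECT = {"b", "t"}         # characters that connect onward to the left

def subwords(chars, max_len=3):
    """Enumerate connected sub-words up to `max_len` characters.

    Within a sub-word, each character must connect leftwards to its
    successor, and the successor must accept a right connection.
    """
    out = []
    for n in range(1, max_len + 1):
        for combo in product(chars, repeat=n):
            ok = all(combo[i] in LEFT_CONNECT and combo[i + 1] in RIGHT_CONNECT
                     for i in range(n - 1))
            if ok:
                out.append("".join(combo))
    return out
```

Capping the length at three reflects the observation that a character's form can depend on the neighbor of its neighbor, but no further.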
To implement the linear and ellipse regressions, the Geometric Regression Library (GeoRegression), an open-source Java geometry library for scientific computing, is used [19].
The codebook is generated for two fonts (Times New Roman and Tahoma). Fig. 13 and Fig. 14 show examples of generated character codes in the Times New Roman and Tahoma fonts. Table 6 shows examples of the sub-words generated to identify the code of the character ( ‫ط‬ ) in the Tahoma font in the middle of a word. In Table 6, the linear and ellipse parts of each sub-word, as well as its code, are shown. Common parts, underlined in Table 6, are detected using the proposed matching technique and exploited to generate the character code. Fig. 15 shows the generated code of the character ( ‫ط‬ ). To illustrate the validity of the generated character code, the Arabic word ( ‫أمطار‬ ) is printed in an image using the Tahoma font at size 50. As shown in Fig. 16, the code of the word ( ‫أمطار‬ ) contains a code sequence that matches the code of the character ( ‫ط‬ ).
The performance of the proposed recognition technique has been evaluated using 14,822 different words collected from the Holy Quran. Sequences of words were selected, converted to images, and used as inputs to the proposed technique. The technique recognized 12,750 words in the Tahoma font and 12,350 words in the Times New Roman font, which corresponds to recognition rates of 86% in Tahoma and 83% in Times New Roman.
V. CONCLUSION
This paper has proposed a printed Arabic text recognition technique based on linear and ellipse regression, achieving recognition rates of 86% and 83% on the Tahoma and Times New Roman fonts, respectively. As future work, the proposed technique will be tested with most existing fonts to evaluate its performance across fonts as well as with multi-font texts. Moreover, optimization techniques will be exploited to tune the accuracy factors used in the proposed technique.