Arabic Word Recognition System for Historical Documents using Multiscale Representation Method

In recent decades, considerable effort has gone into developing automated handwriting recognition systems. Recognition usually involves several complex processes, including image pre-processing, segmentation, feature extraction and matching. The task becomes harder for historical documents, which exhibit skew, document degradation and structural noise. Despite the success achieved for English, the recognition of handwritten Arabic remains a major challenge for many reasons. Arabic, as a Semitic language, differs from other languages (e.g., European languages) in several aspects, such as complex structure, implicit characters, concatenation, writing styles and writing direction. This work proposes a complete system for word recognition in Arabic historical documents. The proposed system introduces a novel feature extraction method that defines robust features for Arabic words. Prior to feature extraction, each input image is pre-processed and segmented into words. The features of each word/sub-word are then defined based on Multiscale Convexity Concavity (MCC) analysis of the word's contour shape. For feature matching, a circular shift method is proposed to reduce the computational cost, replacing the traditional dynamic time warping (DTW), which is computationally expensive. Finally, the proposed algorithm was evaluated on a well-known dataset, Ibn Sina, and showed high performance on historical documents at low computational cost.

Keywords—Word recognition; multiscale convexity concavity analysis; historical documents; dynamic time warping


I. INTRODUCTION
Many researchers have investigated handwriting recognition for English and other Latin-script languages, while comparatively few studies have targeted Arabic handwriting recognition, owing to the complexity of the Arabic language. Arabic is an important language: more than 700 million people around the world use Arabic characters for writing and reading in Arabic or in other languages such as Farsi or Urdu. Arabic handwriting recognition, whether online or offline, plays a very important role in automatic document recognition and archiving, as well as in many other fields, such as office automation, cheque verification, mail sorting, and a large variety of banking business [1], [2], [3]. The Arabic language, as a Semitic language, has many unique features and characteristics, which pose additional challenges for those interested in the automatic recognition of Arabic documents.
This work proposes an efficient word recognition system for Arabic historical documents based on the Multiscale Convexity Concavity (MCC) representation. The proposed system is divided into four main phases, namely morphological image processing, segmentation, feature extraction, and feature matching and dissimilarity. The first phase pre-processes the input image by removing noise and connecting the disconnected parts of the writing lines. The next phase (the segmentation phase) involves line and word segmentation. The line segmentation method is based on defining the separating lines between text lines. Unlike other methods, which rely on defining the baseline, our method works by detecting high and low activity along each line using the image histogram. Based on these activities, the separating line between text lines is defined. After line segmentation, each shape (a word or part of a word) is defined based on 8-neighbour pixel connectivity, followed by merging one or more shapes to produce a single word.
In the feature extraction phase, the contour shape is extracted and analyzed using Multiscale Convexity Concavity (MCC) analysis followed by the Discrete Cosine Transform (DCT), resulting in what we call MCC-DCT features. Once the MCC-DCT features are obtained, the similarity between two words is measured by comparing (matching) their MCC-DCT features. For matching, we propose a circular shift method instead of dynamic time warping, which significantly reduces the computational cost. However, this could decrease the recognition rate.
The remainder of this paper is organized as follows: Section II presents a summary of related work. Section III presents the proposed holistic word recognition system in detail. The experimental evaluation of the proposed system is presented in Section IV, and Section V concludes the paper.

II. RELATED WORK
Feature extraction is considered one of the most important phases of word recognition and plays a crucial role in achieving robust and accurate recognition. Many statistical and structural features [4], [1], [5], [6], [3] have been widely used in word recognition along with different classifiers, including hidden Markov models (HMM) [7], [8], neural networks [9], [10], [11], support vector machines (SVM) [12], [13] and others. Statistical features can be represented as numerical measures defined over the whole image or over some regions of the image. Several measures, including pixel densities, histograms of chain code directions and Fourier descriptors, can be used for this type of feature. In contrast, structural features can be represented as a composition of structural units such as loops, ascenders, descenders, branch-points and dots. These units are usually (though not always) defined from the skeleton or the contour of the image. Fig. 1 [1], [14] shows an example of both statistical and structural features.
Tomasz and Noel [15] proposed a contour-based shape representation for computing the similarity between two shapes represented by 2D closed contours. This representation was shown to be invariant under translation, scaling and rotation. It was also found that the representation could be used for database retrieval or for detecting regions with a specific shape in a sequence of video images. El-Hajj et al. [4] proposed an analytical approach that defines features based on the baseline position. A set of measures, including densities, concavities and transitions, was extracted from a patch related to the baseline. Baseline-dependent features were then formulated from these measures and passed to an HMM classifier. Chen et al. [5] applied a set of Gabor filters to extract features (Gabor features) from Arabic handwritten words. An SVM classifier was then used for classification. In addition, they combined Gabor and gradient-structural-concavity (GSC) features to achieve a better recognition rate. In [16], a learned feature model was introduced for offline handwriting recognition by applying a statistical bag-of-features model. This model was then integrated with a hidden Markov model (HMM) for word recognition. Almazán et al. [17] proposed an approach for word spotting and recognition based on a joint embedding space defined from word images and text strings. The model was shown to achieve high accuracy on both document images and natural images with minimal training data. Nemouchi et al. [18] introduced an Arabic handwriting recognition system that combines methods through a decision fusion approach, using the HMM Toolkit (HTK). Prior to feature extraction, each image was pre-processed and segmented into lines. A sliding window technique was then used for feature extraction, while HMMs were applied for classification. Chherawala and Cheriet [19] proposed an Arabic word descriptor (AWD) for the task of lexicon reduction.
Their algorithm consists of two stages. The first stage computes the structural descriptor (SD) for each connected component (CC) of the word image. The second stage normalizes the structural descriptors (SDs) to form the Arabic word descriptor (AWD). The reduced lexicon is obtained by first ranking the original lexicon based on the distances between the AWD of the input word and the AWDs of the words in the original lexicon. The words within the top n ranks of the original lexicon then form the reduced lexicon. For the recognition task, the proposed lexicon reduction method was combined with different types of recognition techniques, including analytic word recognition and holistic word recognition.
In recent years, several organizations have sought to digitize large amounts of historical handwritten documents, as they could be destroyed because of their age [20]. This has opened a new trend in text recognition that focuses on understanding and analyzing historical documents. Unlike standard handwritten documents, historical documents are characterized by low quality and large variations in writing style, making them hard to understand even for a human [21]. Different approaches have been proposed in this context, including Gradient, Structural and Concavity (GSC) features [22], [23], [24], Gabor features [25], Dynamic Time Warping (DTW) [26], HOG features [27], Multiscale Convexity Concavity features [28] and others.

III. HOLISTIC WORD RECOGNITION
The key idea of the proposed work is to analyze the closed contour of each word/sub-word through a multiscale representation. This requires each document to be pre-processed and segmented (line and word segmentation) prior to extracting the contour points. Fig. 2 shows the full flowchart of the proposed system, starting from the input image and ending with the classified word. The proposed system is divided into four main phases, namely morphological image processing, segmentation, Multiscale Convexity Concavity (MCC) representation, and matching and dissimilarity.

A. Morphological Image Processing
In this phase, the color image is first converted to a binary image, as shown in Fig. 3.
The binary image is then normalized using morphological erosion and dilation. Morphological dilation adds pixels to the boundaries of a shape in the image, while erosion removes pixels from the shape's boundaries. Fig. 4 shows the basic concept behind the dilation operation.
The main idea of applying morphological erosion and dilation here is to connect the disconnected parts of the writing lines. As the ink intensity is not constant over all strokes of a word, automatic binarization could result in breaks in otherwise continuous strokes. Dilation has the effect of merging disconnected parts of the same word, while erosion thins the word so that it does not become overly thick.
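The effect of dilation followed by erosion can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows (1 = ink) and a square structuring element of half-width `k`; the function names, the structuring element and the border handling are illustrative assumptions, not details taken from the proposed system.

```python
def dilate(img, k=1):
    """Binary dilation with a (2k+1) x (2k+1) square structuring element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # A pixel becomes ink if any pixel in its neighbourhood is ink.
            out[r][c] = int(any(
                img[rr][cc]
                for rr in range(max(0, r - k), min(h, r + k + 1))
                for cc in range(max(0, c - k), min(w, c + k + 1))
            ))
    return out


def erode(img, k=1):
    """Binary erosion with the same structuring element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # A pixel stays ink only if every in-image neighbour is ink
            # (neighbours outside the image are ignored: an assumption).
            out[r][c] = int(all(
                img[rr][cc]
                for rr in range(max(0, r - k), min(h, r + k + 1))
                for cc in range(max(0, c - k), min(w, c + k + 1))
            ))
    return out


def close_gaps(img, k=1):
    """Dilation merges broken strokes; erosion restores the stroke thickness."""
    return erode(dilate(img, k), k)


# A horizontal stroke with a one-pixel break at column 4.
stroke = [
    [0] * 9,
    [0] * 9,
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0] * 9,
    [0] * 9,
]
closed = close_gaps(stroke)
```

In this example the break in the stroke is filled while the stroke keeps its original one-pixel thickness.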

B. Segmentation Phase
Segmentation phase is applied to segment each word in the text image. This phase consists of two parts: line and word segmentation.

1) Line Segmentation:
The line segmentation method is based on defining the separating lines between text lines. Unlike other methods, which rely on defining the baseline, this method works by detecting high and low activity along each image row using the image histogram. Based on these activities, the separating line between text lines is defined. This is achieved by defining sets of low and high activity based on a threshold β. If the activity of the current row is less than β, the row belongs to a low-activity set; otherwise, it belongs to a high-activity set. Each low-activity set that falls between two high-activity sets then contains a single separating line. This line is chosen to be the row with the lowest activity in the low-activity set. Fig. 5 illustrates the process of line segmentation.
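The thresholding procedure described above can be sketched as follows. The sketch assumes that the activity of a row is the number of ink pixels in that row (a horizontal projection profile); the function names and the example values are hypothetical.

```python
def row_activity(img):
    """Horizontal projection profile: number of ink pixels in each image row."""
    return [sum(row) for row in img]


def separator_rows(activity, beta):
    """Return one separator row for every low-activity run that lies
    between two high-activity runs (rows with activity >= beta)."""
    separators = []
    run = []            # row indices of the current low-activity run
    seen_high = False   # becomes True once a high-activity row was seen
    for i, a in enumerate(activity):
        if a < beta:
            run.append(i)
        else:
            if run and seen_high:
                # Choose the row with the lowest activity inside the run.
                separators.append(min(run, key=lambda j: activity[j]))
            run = []
            seen_high = True
    return separators


# Two text lines (rows 1-2 and rows 6-7) separated by a quiet band (rows 3-5).
activity = [0, 5, 7, 2, 0, 1, 6, 8, 1, 0]
cuts = separator_rows(activity, beta=3)
```

Low-activity runs at the top and bottom of the page are skipped because they do not fall between two text lines.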
2) Word Segmentation: Prior to segmenting each word, the shapes on each line are segmented based on 8-neighbour pixel connectivity. Fig. 6 shows the shape segmentation, where each shape (word/sub-word) is shown in a different color (colors are used for illustration only). This segmentation is followed by merging one or more shapes to produce segmented words. The merging is done according to the distances between shapes: if the distance between two or more shapes is less than the distance threshold α, these shapes are merged to produce a single word.
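A minimal sketch of these two steps is given below. It assumes that the distance between shapes is the horizontal gap between their bounding boxes on one text line, which is only one possible reading of the distance used here; the function names are hypothetical.

```python
from collections import deque


def component_spans(img):
    """Find 8-connected ink components and return their (min_col, max_col) spans."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    spans = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            lo = hi = c
            while queue:
                y, x = queue.popleft()
                lo, hi = min(lo, x), max(hi, x)
                # Visit all 8 neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            spans.append((lo, hi))
    return sorted(spans)


def merge_spans(spans, alpha):
    """Merge neighbouring shapes whose horizontal gap is smaller than alpha."""
    words = []
    for lo, hi in sorted(spans):
        if words and lo - words[-1][1] - 1 < alpha:
            words[-1] = (words[-1][0], max(words[-1][1], hi))
        else:
            words.append((lo, hi))
    return words


# One text line with three shapes; the first two are one background column apart.
line = [
    [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
]
shapes = component_spans(line)
words = merge_spans(shapes, alpha=3)
```

With `alpha=3` the first two shapes are merged into one word, while the third shape, four columns away, remains a separate word.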
By applying the complete segmentation phase to a full text image, each word is segmented as shown in Fig. 7.
It is clear from Fig. 7 that some words are incorrectly segmented. This is because the writing style here is inconsistent: the distance between words is sometimes less than the distance between shapes (sub-words) within the same word. This problem could be solved by constraining writers to a consistent writing style. Note that the value of the distance threshold was defined experimentally through a set of empirical tests. Although this approach is simple and does not require any extra processing, it can be inefficient in some cases, notably when the difference between writing styles is large. This problem can be reduced by investigating an automated way to define the distance threshold, for example by analysing the minimum and maximum distances between shapes (sub-words) in a document with a single writing style. However, this requires a deeper investigation of the trade-off between simplicity and accuracy, which is left for future work.

C. MCC Representation
After the segmentation phase, the closed contour shape of each word is extracted, as illustrated in Fig. 8. Note that some words have multiple closed contours, depending on the number of sub-words. In such cases, multiscale analysis cannot be applied directly, since it is applicable only to a single contour (one contour per word).
To avoid this issue, we can either connect the sub-words of a single word together prior to extracting the contour shape, or extract the MCC representation from each sub-word of the word separately. The first approach was used in [28], [29] for English word recognition. However, unlike in English, the separated sub-words that make up a single Arabic word cannot be connected together, as this could completely change the word. In contrast, the second approach does not change the shape of Arabic words, making it suitable for representing word features. In both approaches, very small contour shapes such as dots and diacritics (e.g., "˜" and ".") are ignored because they are not stable shapes and produce unstable results. Furthermore, most old historical documents do not use diacritical marks (harakat) or dots in their writing. To this end, the size of each closed contour within a single word is evaluated against a threshold value ρ. Any contour shape with a size less than ρ is filtered out, as illustrated in Fig. 8. Each remaining contour then needs to be normalized so that different shapes have the same number of contour points. It is worth noting that the number of contour points varies with the size of the shape (sub-word) and is not otherwise limited. An example of such a case is illustrated in Fig. 9.
Different strategies can be applied here to unify the length of all contour shapes. Ideally, a selection criterion would be defined to extract a fixed number of dominant points from each contour. However, such a criterion requires considerable investigation before it can be applied. Instead, each contour shape is sampled to a fixed length of N points.
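One simple way to realize this length normalization is sketched below. It assumes that the N points are spaced equally in arc length along the closed contour and obtained by linear interpolation; the sampling rule is not specified here, so both this choice and the function name are assumptions.

```python
import math


def resample_closed_contour(points, n):
    """Resample a closed polyline to n points equally spaced in arc length."""
    m = len(points)
    # Segment lengths, including the closing segment back to the first point.
    seg = [math.dist(points[i], points[(i + 1) % m]) for i in range(m)]
    step = sum(seg) / n
    out = []
    i, start = 0, 0.0   # current segment index and arc length at its start
    for k in range(n):
        target = k * step
        # Advance to the segment that contains the target arc length.
        while i < m - 1 and start + seg[i] <= target:
            start += seg[i]
            i += 1
        t = (target - start) / seg[i] if seg[i] > 0 else 0.0
        (x0, y0), (x1, y1) = points[i], points[(i + 1) % m]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


# A square given by its 4 corners, resampled to N = 8 points.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
sampled = resample_closed_contour(square, 8)
```

After resampling, contours of different sizes all have the same number of points and can be stored as columns of equally sized matrices.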
After length normalization, each contour point is analysed using a Gaussian kernel. Since each contour point has two coordinates (x and y), the Gaussian kernel is applied to each dimension separately. Suppose we have a contour $C$ with $N$ points represented by the coordinates $(x(u), y(u))$, where $u \in \{1, 2, \ldots, N\}$. The coordinates $x(u)$ and $y(u)$ are convolved with a Gaussian kernel $\psi_\sigma$, resulting in a smoothed contour $C_\sigma$ at scale $\sigma$ with coordinates $x_\sigma(u)$ and $y_\sigma(u)$ as follows:

$$x_\sigma(u) = x(u) * \psi_\sigma(u), \qquad y_\sigma(u) = y(u) * \psi_\sigma(u) \tag{1}$$

where $*$ denotes convolution and the kernel size is fixed for all scales. Note that the value of the scale $\sigma$ determines the degree of smoothing of the Gaussian kernel $\psi_\sigma$. For simplicity, we assume that $\sigma$ takes integer values from 1 to 10.
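The per-coordinate smoothing can be sketched as a circular convolution, since the contour is closed. The kernel half-width `radius`, the normalization of the sampled kernel and the function names are assumptions made for this illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Sampled Gaussian of fixed half-width `radius`, normalized to sum to 1."""
    weights = [math.exp(-(t * t) / (2.0 * sigma * sigma))
               for t in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_closed(values, sigma, radius):
    """Circularly convolve one coordinate sequence with the Gaussian kernel."""
    n = len(values)
    kernel = gaussian_kernel(sigma, radius)
    return [
        sum(kernel[j + radius] * values[(u + j) % n]
            for j in range(-radius, radius + 1))
        for u in range(n)
    ]


def smooth_contour(points, sigma, radius=10):
    """Smooth the x and y coordinates of a closed contour separately."""
    xs = smooth_closed([p[0] for p in points], sigma, radius)
    ys = smooth_closed([p[1] for p in points], sigma, radius)
    return list(zip(xs, ys))


# An 8-point square contour smoothed at scale sigma = 1.
square = [(0, 0), (2, 0), (4, 0), (4, 2), (4, 4), (2, 4), (0, 4), (0, 2)]
smoothed = smooth_contour(square, sigma=1, radius=2)
```

Because the kernel sums to one and wraps around the contour, the centroid of the contour is preserved while its corners are pulled inwards.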
To define the convex and concave peaks, each contour point is evaluated based on the displacement that occurs between consecutive scales. More specifically, the displacements of each contour point across consecutive scales are used to build a rich multiscale representation of convexity and concavity. Assume that contour point $u$ at scale $\sigma$ is denoted as $(x_\sigma(u), y_\sigma(u))$, as given by Eq. 1. The MCC representation is then defined as

$$M(u, \sigma) = \nu \sqrt{\left(x_\sigma(u) - x_{\sigma-1}(u)\right)^2 + \left(y_\sigma(u) - y_{\sigma-1}(u)\right)^2} \tag{2}$$

Here, $M(u, \sigma)$ represents the displacement of contour point $u$ between scale $\sigma - 1$ and scale $\sigma$. Contour points with sharp convexity or concavity produce large displacements over different scales. The variable $\nu$ in Eq. 2 is used to distinguish between convex and concave curves: a positive value is assigned to $M(u, \sigma)$ if it represents a convex curve, and a negative value if it represents a concave curve. Fig. 10 shows the displacements in both convex and concave curves of contour $C$ between scales $\sigma - 1$ and $\sigma$. Point $p_1$ of the inner contour ($C_\sigma$) falls inside contour $C_{\sigma-1}$, leading to a convex peak (positive value). In contrast, point $p_2$ of contour $C_\sigma$ falls outside contour $C_{\sigma-1}$, leading to a concave peak (negative value). Here, $L$ represents the number of scales, while $s$ represents a scaling function.
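The construction of the MCC matrix from contours at consecutive scales can be sketched as follows. The sketch assumes a Euclidean displacement and decides the sign with a ray-casting inside test (positive when the point at the coarser scale falls inside the contour of the previous scale); the function names and the test contours are hypothetical.

```python
import math


def point_in_polygon(p, poly):
    """Ray-casting test: True if point p lies inside the closed polygon poly."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        (x0, y0), (x1, y1) = poly[i], poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def mcc_matrix(contours):
    """contours[s][u] is point u of the contour at scale index s (0 = finest).
    Row s-1 of the result holds the signed displacement of every point
    between scale s-1 and scale s."""
    rows = []
    for prev, cur in zip(contours, contours[1:]):
        row = []
        for p_prev, p_cur in zip(prev, cur):
            displacement = math.dist(p_prev, p_cur)
            # Moving inside the previous contour: convex (+); outside: concave (-).
            sign = 1.0 if point_in_polygon(p_cur, prev) else -1.0
            row.append(sign * displacement)
        rows.append(row)
    return rows


# A square with a notch at (2, 2); at the next scale the corner (0, 0) moves
# inwards and the notch vertex moves outwards, as smoothing would do.
fine = [(0, 0), (4, 0), (4, 4), (2, 2), (0, 4)]
coarse = [(0.5, 0.5), (4, 0), (4, 4), (2, 3), (0, 4)]
M = mcc_matrix([fine, coarse])
```

In this example the convex corner receives a positive entry and the concave notch a negative one; in practice the list of contours would come from smoothing the same contour at increasing scales.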

D. Matching and Dissimilarity
In this section, a contour matching algorithm is first proposed to measure the dissimilarity between MCC matrices (an example of the MCC representation is shown in Fig. 11). Since each word may have more than one contour, a word-to-word matching strategy is then introduced to define the dissimilarity between two words.

1) MCC matching:
The distance between two contours (MCC matrices) is computed by matching contour points across scales. Let $d(A_a, B_b)$ denote the distance between contour points $a \in A$ and $b \in B$ over the scales $\sigma \in \{1, 2, \ldots, I\}$. This distance is defined in Eq. 4, where $v^A_\sigma$ and $v^B_\sigma$ are normalizing factors for each scale $\sigma$. The first factor is defined as $v^A_\sigma = \max_{a = 1, \ldots, N}(A_{\sigma,a}) + \min_{a = 1, \ldots, N}(A_{\sigma,a})$, and the second is defined in the same manner.
By matching each point in A against each point in B, we obtain a distance matrix D in which the entry (i, j) represents the distance between points $a_i$ and $b_j$, as shown in Fig. 12. The final dissimilarity measure between A and B is calculated from the cumulative distance as $D = D_{\min}/H$, where $D_{\min}$ is the cost of the optimal path (minimal diagonal distance), obtained using Dynamic Time Warping (DTW) [29]. In Fig. 12, the examined indices are marked in blue, while the selected path is marked by black lines.
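The search for the minimal-cost path can be illustrated with a textbook dynamic time warping recursion over a precomputed point-distance matrix. This is a generic sketch, not necessarily the exact DTW variant of [29]; the function names are hypothetical, and the point distances are assumed to be given.

```python
def dtw_cost(dist):
    """Minimal cumulative cost of a monotone path from dist[0][0] to dist[-1][-1]."""
    n, m = len(dist), len(dist[0])
    inf = float("inf")
    # acc[i][j] is the best cost of aligning the first i rows with the first j columns.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = dist[i - 1][j - 1] + best_prev
    return acc[n][m]


def dissimilarity(dist):
    """Cumulative path cost divided by the number of contour points H."""
    return dtw_cost(dist) / len(dist)


# Identical contours: the diagonal path costs nothing.
same = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
different = [[1, 2],
             [3, 1]]
d_same = dissimilarity(same)
d_diff = dissimilarity(different)
```

The recursion fills an H × H table, which is the quadratic cost that motivates the optimization discussed next.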
Matching Optimization: The complexity of computing D is H × H, since all contour points are matched against each other. Instead, only corresponding points of A and B can be matched, which costs only H. To achieve this, the columns of A and B need to be reordered so that they follow the same sequential order relative to a reference index. This index is chosen as the point with the highest convexity displacement between the coarsest and the finest scales of the Gaussian representation; in other words, the highest convexity value obtained by summing the MCC representation over the different scales (the rows of the MCC matrix). Assume that g and q are the reference indices for A and B, respectively. The reordered representations $\tilde{A}$ and $\tilde{B}$ are defined as follows:

$$\tilde{A} = \mathrm{CircularShift}(A)_g, \qquad \tilde{B} = \mathrm{CircularShift}(B)_q$$

where the function $\mathrm{CircularShift}(x)_i$ circularly shifts the elements (columns) of x with respect to the reference index i. By matching only corresponding points between $\tilde{A}$ and $\tilde{B}$ using Eq. 4, the distance matrix reduces to H point-to-point distances.

2) Word-to-word matching:
In the previous section, we calculated the dissimilarity between contour shapes. However, Arabic words usually consist of multiple shapes (sub-words). This requires matching the corresponding sub-words $A_i$ and $B_j$ within the words $W_A$ and $W_B$, respectively. The final dissimilarity measure between $W_A$ and $W_B$ is then computed by combining the distances obtained from matching the corresponding sub-words.
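The circular-shift matching described under Matching Optimization can be sketched as follows. MCC matrices are stored with one row per scale and one column per contour point. The point distance used here, a mean absolute difference over scales without normalizing factors, is only a stand-in for Eq. 4, and all function names are hypothetical.

```python
def reference_index(mcc):
    """Column (contour point) whose displacement summed over all scales is largest."""
    n_points = len(mcc[0])
    return max(range(n_points), key=lambda u: sum(row[u] for row in mcc))


def circular_shift(mcc, ref):
    """Rotate the columns so that column `ref` becomes column 0."""
    return [row[ref:] + row[:ref] for row in mcc]


def circular_shift_distance(a, b):
    """Match only corresponding columns after aligning both reference points."""
    a = circular_shift(a, reference_index(a))
    b = circular_shift(b, reference_index(b))
    scales, points = len(a), len(a[0])
    total = 0.0
    for u in range(points):
        # Stand-in point distance: mean absolute difference over scales.
        total += sum(abs(a[s][u] - b[s][u]) for s in range(scales)) / scales
    return total / points


# B is the same contour as A, traced from a different starting point.
A = [[0.1, 0.9, -0.2, 0.0],
     [0.2, 1.5, -0.4, 0.1]]
B = [row[2:] + row[:2] for row in A]
C = [[0.1, 0.9, 0.5, 0.0],
     [0.2, 1.5, 0.4, 0.1]]
d_ab = circular_shift_distance(A, B)
d_ac = circular_shift_distance(A, C)
```

Only H point pairs are compared instead of H × H, at the price of relying on a single reference point for the alignment.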

IV. EXPERIMENTAL RESULTS
In this section, the performance of the proposed system is evaluated for the task of word recognition in historical documents. To this end, our system was evaluated on a well-known historical Arabic dataset, Ibn Sina [30]. The dataset is named after the famous Persian scholar Ibn Sina, since it is derived from his philosophical work. It consists of 60 pages with approximately 25,000 subwords, distributed over around 1,200 different classes. Fig. 13 shows a sample page from the dataset.
We randomly selected images from the dataset to build a reference set and a test set, with a ratio of 10 to 1, respectively. The reference set forms our lexicon, while the test set is used to evaluate the performance of the proposed system. The standard MCC-DCT achieved a recognition rate of 85% at rank 1, while the circular shift variant (MCC-DCT circular-shift) achieved a recognition rate of 84.5% at the same rank. The recognition rate of MCC-DCT circular-shift is thus 0.5 percentage points lower than that of the standard MCC-DCT. However, this is a trade-off between accuracy and complexity, as MCC-DCT circular-shift exhibited much lower complexity than MCC-DCT. Table I shows the recognition rate of both methods along with the average processing time per word. MCC-DCT circular-shift is 59 times faster than the standard MCC-DCT.
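For reference, a recognition rate at rank k can be computed from a matrix of dissimilarities between test words and reference words as sketched below. The function name, the labels and the example values are hypothetical and are not taken from the experiments.

```python
def rank_k_rate(distances, labels_ref, labels_test, k=1):
    """distances[t][r] is the dissimilarity between test word t and reference word r.
    A test word counts as recognized if its label occurs among the labels of
    its k nearest reference words."""
    hits = 0
    for t, row in enumerate(distances):
        order = sorted(range(len(row)), key=lambda r: row[r])
        top_labels = {labels_ref[r] for r in order[:k]}
        hits += labels_test[t] in top_labels
    return hits / len(distances)


# Three test words matched against a lexicon of three reference words.
distances = [[0.1, 0.5, 0.9],
             [0.7, 0.6, 0.2],
             [0.4, 0.3, 0.8]]
labels_ref = ["a", "b", "c"]
labels_test = ["a", "c", "a"]
rate_1 = rank_k_rate(distances, labels_ref, labels_test, k=1)
rate_2 = rank_k_rate(distances, labels_ref, labels_test, k=2)
```

In this example two of the three test words are recognized at rank 1, and all three at rank 2.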

V. CONCLUSION
In this work, we introduced a word recognition system capable of processing and recognizing holistic words in Arabic historical documents. The main contribution of the proposed system is the definition of robust features, extracted from the holistic word, based on the Multiscale Convexity Concavity (MCC) representation. As an Arabic word may include multiple shapes, the features of each shape within a single word are defined separately. For matching, the corresponding shapes of the examined words are first matched, and the resulting distances are then combined to compute the overall distance between these words. To avoid the high computational cost of dynamic time warping, we proposed a circular shift matching method that significantly reduces the computational cost. Our experiments showed that the system achieves a good compromise between a high recognition rate and a low computational cost. In future work, we plan to optimize the circular shift matching by investigating more advanced criteria for defining the reference index, such as an energy function.