A Proposed Hybrid Technique for Recognizing Arabic Characters

Optical character recognition systems improve human-machine interaction and are urgently required by many governmental and commercial departments. Considerable progress has been achieved in recognition techniques for Latin and Chinese characters. By contrast, Arabic Optical Character Recognition (AOCR) is still lagging, although interest and research in this area are becoming more intensive than before. This is because Arabic is a cursive script written from right to left, each character has two to four different forms according to its position in the word, and most characters are associated with complementary parts above, below, or inside the character. The process of Arabic character recognition passes through several stages, the most serious and error-prone of which are segmentation and feature extraction and classification. This research focuses on the feature extraction and classification stage, which is as important as the segmentation stage. Features can be classified into two categories: local features, which are usually geometric, and global features, which are either topological or statistical. Four approaches related to the statistical category are investigated, namely Moment Invariants, Gray Level Co-occurrence Matrix, Run Length Matrix, and Statistical Properties of Intensity Histogram. The paper aims at fusing the features of these methods to obtain the most representative feature vector that maximizes the recognition rate.


I. INTRODUCTION
OCR is the process of converting a raster image representation of a document into a format that a computer can process. Thus, it may involve many sub-disciplines of computer science, including image processing, pattern recognition, artificial intelligence, and database systems. Despite intensive investigation, the ultimate goal of developing an optical character recognition (OCR) system with the same reading capabilities as humans remains unachieved, more so in the case of the Arabic language. Most commercially available OCR products are for typed English text, because English characters do not have the extra complexities associated with Arabic letters.
Arabic is a popular script. It is estimated that there are more than one billion Arabic script users in the world. If OCR systems were available for Arabic characters, they would have great commercial value. However, due to the cursive nature of Arabic script, the development of Arabic OCR systems involves many technical problems, especially in the segmentation and the feature extraction and classification stages. Most characters have dot(s), zigzag(s), madda, etc., associated with the character, and these can be above, below, or inside the character. Many characters have a similar shape; the position or number of secondary strokes and dots makes the only difference. Although many researchers are investigating solutions to these problems, little progress has been made.
Feature extraction is one of the important basic steps of pattern recognition. Features should contain the information required to distinguish between classes, be insensitive to irrelevant variability in the input, and be limited in number to permit efficient computation of discriminant functions and to limit the amount of training data required. In fact, this step involves measuring those features of the input character that are relevant to classification. After feature extraction, the character is represented by the set of extracted features.
Features can be classified into two categories: local features, which are usually geometric (e.g., concave/convex parts, number of endpoints, branches, joints, etc.), and global features, which are either topological (connectivity, projection profiles, number of holes, etc.) or statistical.
The objective of this paper is to examine the performance of four of these global statistical features, namely Moment Invariants (MIs), Gray Level Co-occurrence Matrix (GLCM), Run Length Matrix (RLM), and Statistical Properties of Intensity Histogram (SFIH), and to study the effect of fusing two or more of these features on the recognition rate.
The rest of the paper is organized as follows. Section II summarizes the related work. Section III introduces the proposed approach. Results and discussion are presented in Section IV. Section V closes the paper with concluding remarks and proposals for future work.

II. RELATED WORK
The feature extraction stage plays the main role in the recognition process: it controls the accuracy of recognition through the information passed from this stage to the classifier (recognizer). This information can be structural, such as loops, branch-points, endpoints, and dots, or statistical, which includes, but is not limited to, pixel densities, histograms of chain code directions, moments, and Fourier descriptors. Because of the importance of this stage, many approaches and techniques have been proposed.
In [1], two methods for script identification based on texture analysis were implemented: Gabor filters and GLCMs. In tests conducted on exactly the same sets of data, the Gabor filters proved to be far more accurate than the GLCMs, producing results that are over 95% accurate.
In [2], a new technique for feature extraction based on hybrid spectral-statistical measures (SSMs) of texture was presented. The authors studied its effectiveness compared with multiple-channel (Gabor) filters and GLCM, which are well-known techniques yielding high performance in writer identification in Roman handwriting. Texture features were extracted for a wide range of frequencies and orientations because of the nature of the spread of Arabic handwriting compared with Roman handwriting. The most discriminant features were selected with a model for feature selection using hybrid support vector machine-genetic algorithm techniques. Experiments were performed using Arabic handwriting samples from 20 different people, and very promising results of 90.0% correct identification were achieved.
In [3], a novel feature extraction approach for handwritten Arabic letters is proposed. Pre-segmented letters were first partitioned into main body and secondary components. Moment features were then extracted from the whole letter as well as from the main body and the secondary components. Using a multi-objective genetic algorithm, efficient feature subsets were selected. Finally, various feature subsets were evaluated according to their classification error using an SVM classifier. The proposed approach improved the classification error in all cases studied. For example, the improvements for 20-feature subsets of normalized central moments and Zernike moments were 15% and 10%, respectively. This approach can be combined with other feature extraction techniques to achieve high recognition accuracy.
In [4], a new set of run-length texture features that significantly improve image classification accuracy over traditional run-length features was extracted. By directly using part or all of the run-length matrix as a feature vector, much of the texture information is preserved. This approach is made possible by the multilevel dominant eigenvector estimation method, which reduces the computational complexity of the KLT by several orders of magnitude. Combined with the Bhattacharyya measure [5], it forms an efficient feature selection algorithm. The advantage of this approach is demonstrated experimentally by the classification of two independent texture data sets. The authors observed that most texture information is stored in the first few columns of the RLM, especially in the first column. This observation justifies the development of a new, fast, parallel RLM computation scheme. Comparisons of this approach with co-occurrence and wavelet features demonstrate that RLMs possess as much discriminatory information as these successful conventional texture features, and that a good method of extracting such information is key to the success of the classification.
In [6], Zernike moments (ZM) and Legendre moments (LM) for Arabic letter recognition were investigated. Experiments demonstrated the effectiveness of both methods in extracting and preserving Arabic letter characteristics. ZM is used because of its ability to compute the complex orthogonal moments precisely. The system achieved satisfactory performance when compared with other OCR systems. On the other hand, LM, although translation and scale invariant, struggled with rotation-invariant forms in the experiments. The objective of maximising correct matching and retrieval from the Arabic database while minimising the false positive rate was achieved.
In [7], a design-based method to fuse Gabor filter features and co-occurrence probability features for improved texture recognition is explored. The fused feature set utilizes both the Gabor filter's capability of accurately capturing lower-frequency texture information and the co-occurrence probability's capability of capturing texture information relevant to higher-frequency components. Fisher linear discriminant analysis indicates that the fused features have much higher feature space separation than the pure features. Overall, the fused features are a definite improvement over non-fused features and are advocated for texture analysis applications.

III. PROPOSED APPROACH
Substantial research effort has been devoted to AOCR in recent years, and many approaches have been developed (structural, geometric, statistical, stochastic, etc.). However, certain problems remain open and deserve more attention in order to achieve results equivalent to those obtained for other scripts such as Latin. Besides, other methods must be explored and various sources of information must also be used [8].
The process of isolated Arabic optical character recognition comprises three main stages: preprocessing, feature extraction, and classification. The structure of the proposed approach is shown in Figure 1.
The training dataset includes the 28 JPEG images (100 x 100 pixels) of the isolated Arabic characters. The test datasets include:
1) Three datasets composed of the clean set corrupted by salt-and-pepper noise of intensity 1%, 3%, and 5%, respectively.
The procedure proceeds as follows:
1) In the preprocessing phase, noise removal is carried out using a median filter, and binarization is done with histogram thresholding.
2) In the feature extraction phase, four feature sets are calculated using MIs, GLCM, RLM, and SFIH, respectively. These initial feature vectors are used for evaluating the maximum possible recognition rate for the corrupted datasets, using each set of features. The relations used for calculating the features of the different techniques are discussed below.
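The preprocessing step described above (median filtering followed by thresholding) can be illustrated with a minimal pure-Python sketch. The 3x3 window, the fixed threshold of 128, and the toy 6x6 image are assumptions made for illustration only; the paper does not state the filter size or how the histogram threshold is chosen.

```python
from statistics import median_low

def median_filter_3x3(img):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(rows, r + 2))
                      for cc in range(max(0, c - 1), min(cols, c + 2))]
            out[r][c] = median_low(window)
    return out

def binarize(img, threshold=128):
    """Fixed global threshold (an assumption): dark ink -> 1, background -> 0."""
    return [[1 if v < threshold else 0 for v in row] for row in img]

# Toy 6x6 image: dark stroke in columns 0-2, white background in columns 3-5.
clean = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
noisy = [row[:] for row in clean]
noisy[2][1] = 255  # isolated "salt" pixel inside the stroke
noisy[3][4] = 0    # isolated "pepper" pixel in the background

restored = binarize(median_filter_3x3(noisy))
```

In this toy case the two isolated noise pixels are removed before binarization. Note that a 3x3 median filter would also erase strokes that are only one pixel wide, so the window size has to be chosen with the stroke thickness in mind.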

A. MIs Features:
The regular moment of a shape in an M by N binary image is defined as:

$$ m_{pq} = \sum_{i}\sum_{j} i^{p}\, j^{q}\, f(i,j), $$

where $f(i,j)$ is the intensity of the pixel (either 0 or 1) at the coordinate $(i,j)$, the sums run over all pixels of the image, and $p+q$ is said to be the order of the moment. The coordinates of the centroid are determined using the relations:

$$ \bar{i} = \frac{m_{10}}{m_{00}}, \qquad \bar{j} = \frac{m_{01}}{m_{00}}. $$

Relative moments are then calculated using the equation for central moments, defined as:

$$ \mu_{pq} = \sum_{i}\sum_{j} (i-\bar{i})^{p}\, (j-\bar{j})^{q}\, f(i,j). $$

A set of seven rotation-invariant moment functions, which form a suitable shape representation, was derived by Hu [9,10,11]. These equations, used throughout this work, are shown in Appendix A(i).
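As an illustration of the moment definitions above, the following sketch computes regular and central moments of a small binary image and, from them, the first two of Hu's invariants. The normalization $\eta_{pq} = \mu_{pq} / \mu_{00}^{1+(p+q)/2}$ and the formulas for $\phi_1$ and $\phi_2$ are the standard ones; the function names and the toy L-shaped image are hypothetical, and the full set of seven invariants in Appendix A(i) is not reproduced here.

```python
import math

def raw_moment(img, p, q):
    """Regular moment m_pq of a binary image given as a list of rows."""
    return sum((i ** p) * (j ** q) * v
               for i, row in enumerate(img) for j, v in enumerate(row))

def central_moment(img, p, q):
    """Central moment mu_pq, taken about the centroid of the shape."""
    m00 = raw_moment(img, 0, 0)
    ic = raw_moment(img, 1, 0) / m00
    jc = raw_moment(img, 0, 1) / m00
    return sum(((i - ic) ** p) * ((j - jc) ** q) * v
               for i, row in enumerate(img) for j, v in enumerate(row))

def hu_first_two(img):
    """First two Hu invariants from scale-normalized central moments."""
    mu00 = central_moment(img, 0, 0)

    def eta(p, q):
        return central_moment(img, p, q) / mu00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2

# Toy L-shaped character and the same shape rotated by 90 degrees.
shape = [[1, 1, 0],
         [1, 0, 0],
         [1, 0, 0]]
rotated = [list(r) for r in zip(*shape[::-1])]

phi_a = hu_first_two(shape)
phi_b = hu_first_two(rotated)
same = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(phi_a, phi_b))
```

Here `same` checks that the two invariants agree for the shape and its rotated copy, which is the property that makes these features attractive for character recognition.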

B. GLCM Features
The GLCM is a tabulation of how often different combinations of pixel brightness values (grey levels) occur in an image [13]. The GLCM is used for a series of "second order" texture calculations. GLCM texture considers the relationship between two (usually neighbouring) pixels at a time, called the reference pixel and the neighbour pixel. Here, the neighbour pixel is chosen to be the one to the east (right) of each reference pixel. This can also be expressed as a (1,0) relation: 1 pixel in the x direction, 0 pixels in the y direction. Each pixel within the window becomes the reference pixel in turn, starting in the upper left corner and proceeding to the lower right. Pixels along the right edge have no right-hand neighbour, so they are not used for this count.
To create a GLCM, the MATLAB graycomatrix function can be used. This function creates a gray-level co-occurrence matrix (GLCM) by calculating how often a pixel with the intensity (gray-level) value i occurs in a specific spatial relationship to a pixel with the value j. By default, the spatial relationship is defined as the pixel of interest and the pixel to its immediate right (horizontally adjacent), but other spatial relationships between the two pixels can be specified. Each element (i,j) in the resultant GLCM is simply the number of times that the pixel with value i occurred in the specified spatial relationship to a pixel with value j in the input image.
Because the processing required to calculate a GLCM for the full dynamic range of an image is prohibitive, graycomatrix scales the input image. By default, graycomatrix uses scaling to reduce the number of intensity values in a grayscale image from 256 to eight. The number of gray levels determines the size of the GLCM. The number of gray levels in the GLCM and the scaling of intensity values can be controlled using the NumLevels and GrayLimits parameters of the graycomatrix function. The GLCM can reveal certain properties of the spatial distribution of the gray levels in the texture image. For example, if most of the entries in the GLCM are concentrated along the diagonal, the texture is coarse with respect to the specified offset. Several statistical measures can also be derived from the GLCM. The set of features extracted from the GLCM [14] is shown in Appendix A(ii).
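The counting rule described above, with the neighbour taken to the east of the reference pixel, can be sketched in a few lines of pure Python. This is not the graycomatrix implementation: the `quantize` helper uses a simple integer scaling that only approximates the reduction to eight levels, and the 4x4 four-level image is a hypothetical example.

```python
def quantize(img, num_levels=8, max_value=255):
    """Map intensities 0..max_value onto 0..num_levels-1 (simplified scaling)."""
    return [[min(v * num_levels // (max_value + 1), num_levels - 1) for v in row]
            for row in img]

def glcm_east(img, num_levels):
    """Count (reference, neighbour) pairs for the (1, 0) offset."""
    glcm = [[0] * num_levels for _ in range(num_levels)]
    for row in img:
        # zip stops one short of the row end: the right edge has no east neighbour
        for ref, nb in zip(row, row[1:]):
            glcm[ref][nb] += 1
    return glcm

# Hypothetical 4x4 image that already uses four gray levels (0..3).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]

glcm = glcm_east(image, num_levels=4)
total_pairs = sum(sum(row) for row in glcm)
diagonal_share = sum(glcm[k][k] for k in range(4)) / total_pairs
```

For this image 8 of the 12 counted pairs lie on the diagonal, which corresponds to the remark that diagonal-dominated matrices indicate a coarse texture for the chosen offset.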

C. RLM Features
Run-length statistics capture the coarseness of a texture in specified directions. A run is defined as a string of consecutive pixels that have the same gray level intensity along a specific linear orientation. Fine textures tend to contain more short runs with similar gray level intensities, while coarse textures have more long runs with significantly different gray level intensities [15].
A run-length matrix P is defined as follows: each element P(i, j) represents the number of runs with pixels of gray level intensity equal to i and length of run equal to j along a specific orientation. The size of the matrix P is n by k, where n is the maximum gray level in the image and k is the maximum possible run length in the corresponding image. An orientation is defined using a displacement vector d(x, y), where x and y are the displacements along the x-axis and y-axis, respectively. The typical orientations are 0°, 45°, 90°, and 135°, and calculating the run-length encoding for each direction produces four run-length matrices.
Once the run-length matrices are calculated along each direction, several texture descriptors are calculated to capture the texture properties and to differentiate among different textures [15]. The set of RLM features is shown in Appendix A(iii).
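A minimal sketch of the run-length matrix for the 0° orientation follows, under the assumption that gray levels are numbered from 0 (so the matrix has one row per level) and that column j - 1 holds runs of length j. The short-run emphasis shown at the end is one widely used run-length descriptor, included only as an example; it is not claimed to be one of the features listed in Appendix A(iii).

```python
def run_length_matrix_0deg(img, num_levels):
    """Rows: gray level i; columns: run length j (stored at column index j - 1)."""
    max_run = len(img[0])
    rlm = [[0] * max_run for _ in range(num_levels)]
    for row in img:
        level, length = row[0], 1
        for v in row[1:]:
            if v == level:
                length += 1
            else:
                rlm[level][length - 1] += 1
                level, length = v, 1
        rlm[level][length - 1] += 1  # close the last run of the row
    return rlm

def short_run_emphasis(rlm):
    """Sum of P(i, j) / j^2 over all entries, divided by the number of runs."""
    total_runs = sum(sum(r) for r in rlm)
    weighted = sum(count / (j + 1) ** 2
                   for r in rlm for j, count in enumerate(r))
    return weighted / total_runs

# Hypothetical 4x4 image with four gray levels (0..3).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]

rlm = run_length_matrix_0deg(image, num_levels=4)
sre = short_run_emphasis(rlm)
```

The other three orientations can be handled by the same routine after transposing the image (90°) or reading it along its diagonals (45° and 135°).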

D. SFIH Features
A frequently used approach for texture analysis is based on the statistical properties of the intensity histogram. One such measure is based on statistical moments. The expression for the $n$th-order moment about the mean is given by:

$$ \mu_n = \sum_{i=0}^{L-1} (z_i - m)^n \, p(z_i), $$

where $z_i$ is a random variable indicating intensity, $p(z_i)$ is the histogram of the intensity levels in the image, $L$ is the number of possible intensity levels, and $m = \sum_{i=0}^{L-1} z_i \, p(z_i)$ is the mean (average) intensity. The set of features following this approach is shown in Appendix A(iv).
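The moment about the mean can be evaluated directly from a normalized histogram, as in the sketch below. The function names and the tiny three-level image are hypothetical; the second moment computed here is the intensity variance, and the third moment measures the skewness of the histogram.

```python
def normalized_histogram(img, num_levels):
    """p(z_i): fraction of pixels at each intensity level 0..num_levels-1."""
    counts = [0] * num_levels
    total = 0
    for row in img:
        for v in row:
            counts[v] += 1
            total += 1
    return [c / total for c in counts]

def histogram_moment(p, n):
    """n-th moment about the mean of the intensity histogram p."""
    mean = sum(z * pz for z, pz in enumerate(p))
    return sum((z - mean) ** n * pz for z, pz in enumerate(p))

# Hypothetical 2x2 image with three possible levels (0, 1, 2).
image = [[0, 0],
         [2, 2]]

p = normalized_histogram(image, num_levels=3)
variance = histogram_moment(p, 2)
third_moment = histogram_moment(p, 3)
```

For this symmetric histogram the mean intensity is 1, the variance is 1, and the third moment vanishes.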
Feature selection helps to reduce the feature space, which improves the prediction accuracy and minimizes the computation time. This is achieved by removing irrelevant, redundant, and noisy features; that is, feature selection chooses the subset of features that achieves the best performance in terms of accuracy and computation time, and thereby performs dimensionality reduction. Principal Component Analysis (PCA) is a very popular technique for dimensionality reduction. Given a set of data in n dimensions, PCA aims to find a linear subspace of dimension d lower than n such that the data points lie mainly on this linear subspace. Such a reduced subspace attempts to maintain most of the variability of the data. PCA is applied here for dimensionality reduction in order to obtain, for each method, the minimum number of features giving the maximum possible recognition rate (obtained earlier using the full feature vector). The effect of feature fusion is then analyzed by fusing the features of each pair of the four methods and evaluating the resultant recognition rate.
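To make the idea of projecting onto a lower-dimensional subspace concrete, the sketch below finds only the first principal component of a small data set using power iteration on the sample covariance matrix. This is a deliberately reduced illustration, not the PCA routine used in the experiments: it assumes the covariance matrix is non-zero and that the start vector is not orthogonal to the leading eigenvector, and the four 2-D sample points are hypothetical.

```python
import math

def first_principal_component(data, iterations=200):
    """Return (unit direction, projected scores) for the leading component."""
    n, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in data]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # start vector (assumed not orthogonal to the solution)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[k] * v[k] for k in range(d)) for r in centered]
    return v, scores

# Hypothetical feature vectors lying exactly on the line y = 2x.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(samples)
```

Because the sample points are collinear, one component captures all of their variability, so each 2-D vector can be replaced by a single score without loss. Further components could be obtained by deflating the covariance matrix and repeating the iteration.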
Classification is the main decision stage of the OCR system in general. In this stage, the features extracted from the primitive are compared to those of the model set. As the classification is generally implemented according to the criterion of minimizing the Euclidean distance between feature vectors, it is necessary to normalize the fused features. The normalization should comply with the rule that each feature component is treated equally in its contribution to the distance. The rationale usually given for this rule is that it prevents certain features from dominating distance calculations merely because they have large numerical values. A linear stretch method can be used to normalize each feature component over the entire data set to lie between zero and one. A feature selection procedure can be used after the feature vectors are fused. A weighting method called feature contrast is employed to perform unsupervised feature selection.
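The linear stretch mentioned above can be sketched as a per-component min-max normalization over the whole data set. The function name and the sample values are hypothetical, and mapping a constant component to zero is an assumption made to avoid division by zero.

```python
def min_max_normalize(vectors):
    """Stretch every feature component linearly onto the interval [0, 1]."""
    d = len(vectors[0])
    lows = [min(v[k] for v in vectors) for k in range(d)]
    highs = [max(v[k] for v in vectors) for k in range(d)]
    return [[(v[k] - lows[k]) / (highs[k] - lows[k]) if highs[k] > lows[k] else 0.0
             for k in range(d)]
            for v in vectors]

# Two fused components with very different numeric ranges.
fused = [[1.0, 100.0],
         [2.0, 300.0],
         [3.0, 200.0]]

normalized = min_max_normalize(fused)
```

After normalization both components span the same range, so the second component no longer dominates a Euclidean distance merely because of its scale.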
Denote the $i$th $n$-dimensional fused feature vector as $\mathbf{f}_i = (f_{i1}, f_{i2}, \ldots, f_{in})$, and let $C_j$ denote the feature contrast of the $j$th component of the feature vector. Each feature component is then weighted by its feature contrast divided by the maximum feature contrast of all feature components, that is,

$$ f'_{ij} = \frac{C_j}{\max_{k} C_k}\, f_{ij}. $$

A common strategy of feature fusion is first to combine various features and then to perform feature selection to choose an optimal feature subset according to the feature data set itself, such as by principal component analysis (PCA).
As we are interested mainly in feature extraction, no great emphasis is placed on the classifier. We implement only a basic classifier, namely the nearest-neighbor classifier, based on the Euclidean distance between a test sample and the specified training samples. Let $x_i$ be an input sample with $p$ features $(x_{i1}, x_{i2}, \ldots, x_{ip})$, where $n$ is the total number of input samples ($i = 1, 2, \ldots, n$) and $p$ is the total number of features ($j = 1, 2, \ldots, p$). The Euclidean distance between samples $x_i$ and $x_l$ ($l = 1, 2, \ldots, n$) is defined as:

$$ d(x_i, x_l) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{lj})^2}. $$
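The nearest-neighbor rule with Euclidean distance described above can be sketched as follows. The labels and feature values are hypothetical placeholders rather than data from the experiments, and a single stored prototype per class is assumed.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(sample, training):
    """Return the label of the training vector closest to the sample."""
    label, _ = min(training, key=lambda item: euclidean(sample, item[1]))
    return label

# Hypothetical normalized feature vectors for two character classes.
training = [("alif", [0.1, 0.9]),
            ("ba", [0.8, 0.2])]

predicted = nearest_neighbor([0.7, 0.3], training)
```

The test vector is assigned to "ba" because its distance to that prototype (about 0.14) is smaller than its distance to "alif" (about 0.85).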

IV. RESULTS & DISCUSSION
In the training phase, four sets of features are calculated for the clean dataset using the four methods under consideration (MIs, GLCM, RLM, and SFIH). The PCA algorithm is also applied in each case, and the corresponding feature vectors are stored for further processing.
In the testing phase, the same approach is followed for the data in the nine corrupted datasets. Using the full feature vectors, the recognition rate is determined for each method and is labeled as the maximum possible recognition rate that can be achieved in this situation. As the features are sorted in descending order as a result of applying PCA, we searched for the minimum number of features giving the maximum possible recognition rate for each method. According to Table 1, the maximum recognition rate was achieved using MIs, followed by GLCM and RLM. SFIH gave the lowest recognition rate. Figure 3 clarifies these results. The minimum number of features achieving the maximum recognition rate was found to be 2, 3, 4, and 1 for MIs, GLCM, RLM, and SFIH, respectively.
The effect of the number of MI features on the recognition rate is shown in Table 2 for the different types of noise, and on average in Figure 4. The effect of the number of GLCM features is shown in Table 3 and, on average, in Figure 5. The effect of the number of RLM features is shown in Table 4 and, on average, in Figure 6. The effect of the number of SFIH features is shown in Table 5 and, on average, in Figure 7.
Fusing the features of GLCM with those of SFIH gives only a very small enhancement in the recognition rate, as shown in Figure 10. Three types of noise with different intensity levels were used for estimating the gained enhancement, namely salt-and-pepper, impulse, and Gaussian noise, with intensity levels of 1%, 3%, and 5% for each type of noise. It was found that the fusion of the moment features with those of GLCM leads to a recognition rate of about 100% for all noise intensity levels used. Further investigation is needed for fusing more than two types of features and using higher intensity levels to generalize the obtained results.

Figure 2. Sample images of different datasets

Figure 4. Average recognition rates of MIs as a function of the number of features

Figure 10. Average recognition rate of fusing GLCM features and SFIH features

V. CONCLUSIONS & FUTURE WORK
This paper investigates the performance of four statistical feature extraction techniques, namely Moment Invariants, Gray Level Co-occurrence Matrix, Run Length Matrix, and Statistical Features of Intensity Histogram, and proposes a hybrid technique that fuses features from these methods to enhance the Arabic character recognition rate. Three types of noise with different intensity levels were used for estimating the gained enhancement, namely salt-and-pepper, impulse, and Gaussian noise, with intensity levels of 1%, 3%, and 5% for each type of noise. It was found that the fusion of the moment features with those of GLCM leads to a recognition rate of about 100% for all noise intensity levels used. Further investigation is needed for fusing more than two types of features and using higher intensity levels to generalize the obtained results.

TABLE 1. Maximum possible recognition rate for the corrupted datasets

Figure 3. Average recognition rate of different approaches

TABLE 2. The relation between the number of features and the obtained recognition rate for MIs

TABLE 3. The relation between the number of features and the obtained recognition rate for GLCM

TABLE 4. The relation between the number of features and the obtained recognition rate for RLM

TABLE 5. The relation between the number of features and the obtained recognition rate for SFIH

As the main objective is to emphasize the effect of hybridization (feature fusion) on the enhancement of the recognition rate, Tables 6, 7, and 8 illustrate the recognition rates resulting from fusing the features of GLCM with RLM, MIs with GLCM, and GLCM with SFIH, respectively.

TABLE 6. The GLCM features with the RLM features

TABLE 7. The moment features with the GLCM features

Figure 9. Average recognition rate of fusing MIs and GLCM features

TABLE 8. GLCM features with the statistical features