Triangle Shape Feature based on Selected Centroid for Arabic Subword Handwriting

Abstract: Features are normally modelled on color, texture and shape. However, some features are constrained by the type, style and pattern of an image. Arabic subword handwriting, for example, cannot be recognized by color and is not suitable to be characterized by texture. Shape-based features are therefore suitable for recognizing Arabic subword handwriting, since each character has various characteristics such as diacritics, thinning and strokes. These characteristics contribute to a particular shape that is unique and can represent Arabic subword handwriting. Currently, geometric shapes such as the triangle have been adopted to extract useful features based on triangle properties without involving the triangle form itself. To increase classification accuracy, these properties have been categorized into several zones, where the number of features produced is directly proportional to the number of zones. Shape representation, in contrast, does not involve triangle properties such as ratio of sides, angle and gradient, and so helps to reduce the number of features. Thus, this paper presents a feature based on triangle shape that can represent the identity of Arabic subword handwriting. The method identifies the three main coordinates of a triangle formed from selected centroids. The AHDB dataset is used as the test data. Support Vector Machine (SVM) and Random Forest (RF) classifiers were used to measure the accuracy of the proposed method with triangle shape as a feature. The results show accuracies of 77.65% (SVM) and 76.43% (RF), which indicate that the feature based on triangle shape is applicable to Arabic subword handwriting recognition.


I. INTRODUCTION
Subword handwriting is one of the popular handwriting studies that have been actively explored for many years due to the challenges in identifying the styles, patterns, and signatures of subword handwriting. The recognition of handwritten words is based on the recognition of characters segmented from subwords. Images of handwritten documents are processed from pages to lines, and from words to subwords.
Because subword handwriting is a challenging task, numerous researchers have been encouraged to develop new recognition methods and systems or to improve existing ones. According to [1], work in Arabic character recognition is limited. However, Arabic handwritten character recognition systems have improved considerably over the years. Since many languages such as Farsi, Kurdish, and Urdu use Arabic characters in their writing, the task becomes more challenging due to the different words used and the strength and sequential order of the writing.
Over the past decades, many handwritten subword databases [1]-[3] containing images have been developed. Image processing is required to process these images, for example, to convert an image into binary pixels. Image processing is one of the vital elements widely used in research within the engineering and computer science disciplines. Arabic handwritten documents currently exist in large numbers in both physical and web form, which poses a challenge for the word recognition process. The feature extraction process plays an important role before the classification process. The problem with segmentation into single characters is that characters may overlap and some characters share the same shape, for example, the characters "ب", "ت" and "ث".
Features for recognizing Arabic subwords have been introduced, which has led to in-depth studies due to the variation in style, pattern, characteristic and type of Arabic subword handwriting. Thus, the generated features must have hallmarks that differentiate one subword from another. Two groups of recognition methods are used for handwritten text, namely analytical and holistic. In the analytical group, a word is segmented into components such as characters or subwords, a feature is extracted from each component, and a general vector is obtained for each word. Character segmentation is therefore needed, and errors may occur in the recognition step. In the holistic group, a feature vector is extracted from the whole word image without any need for image segmentation.
In this study, the holistic approach is applied in producing novel features for Arabic subword handwriting. A feature based on triangle shape is proposed, using the three main coordinates of a triangle formed from selected centroids. This paper is organized as follows. In Section II, the related work is discussed. The proposed method is discussed in Section III. Next, the experiment and evaluation of the study are presented in Section IV. Finally, Section V concludes this paper.

II. RELATED WORKS
A huge number of pre-modern texts have been scanned as subword images to preserve them against aging. Nevertheless, the ability of researchers to work with the images has been limited by the difficulty of certain tasks such as query search [4]. According to [4], it is important to provide researchers with algorithms for automatic transliteration and transcription of scanned images, which would extract the textual content of an image and reproduce it as an editable text file.
With advances in technology, many approaches and methods have been introduced for Arabic handwritten text recognition. A feature learning framework was proposed by [5] using a Bag-of-Feature (BoF) paradigm for Arabic handwritten text recognition. In addition, the scale-invariant feature transform (SIFT) descriptor was used by [6] to represent the object in detail while reducing the computation cost. However, [6] states that the complexity of the test image cannot be too high when performing object recognition and image retrieval on big data. This is because a vector with 128 dimensions represents one feature point, and an image has several feature points. Thus, more time is needed to compare the feature points individually. It is therefore important to select suitable features to represent the image. In this work, the structural approach is applied in generating the triangle shape feature based on selected centroids. Then, the holistic approach is applied, where the whole subword image is used without segmenting each character from the subword image.
The holistic approach was applied by [7] in producing features for AHDB subword images using the Discrete Cosine Transform (DCT) and histogram of oriented gradients (HOG). The features are produced from the whole subword image without any word segmentation. An array of the best 50 DCT coefficients and 324 HOG features is produced as the feature parameters for subword images. The study in [5] also used the holistic approach in producing features based on the Bag-of-Feature (BoF) paradigm. The BoF framework is exploited by [5] to learn robust feature representations for Arabic handwriting recognition. Several approaches are implemented, since the framework has several stages that use different approaches. The Harris detector and dense sampling are applied for selecting representative image regions. Then, Principal Component Analysis (PCA) is applied to reduce the Scale-Invariant Feature Transform (SIFT) descriptors to 64-D vectors. In a study by [8], the holistic approach was also applied in generating novel features for the recognition of Persian/Arabic handwritten words. The generated feature is based on a geometric attribute of the components forming the word. The number, angle, location and size of lines are the parameters that represent the features in [8].
Geometry features have been adopted in object recognition, especially for identifying the style and pattern of writing, font, authors, number of authors, place of writing and originality of documents. Apart from that, these features have also been used extensively for recognizing the type of writing and calligraphy in existing documents, especially ancient manuscripts [9]. Geometry features can be produced from geometric shapes such as polygons, including triangles, squares and pentagons. These polygons have respective properties that can be used in object recognition.
Most of these properties, for example triangle properties, have been used by researchers to produce features for image classification [9], [10]. The properties are extracted after the polygon is formed. The geometry method has also been used broadly in various domains such as face recognition [11]-[13], fingerprint recognition [14]-[16], vehicle detection [17], intrusion [18] and digit recognition [9], [10]. Each domain has a special form that uses an indicator to determine the corner points of the formed geometric shape.
In face recognition, the eyes and nose are the face elements used as indicators to determine the points on the face. Minutiae, ridges and valleys are used as indicators in fingerprint recognition. In vehicle detection, the flat road assumption is used as an indicator to search for vehicles located on the ground. The geometry method has also been used in digit recognition and calligraphy [9], [10], [19]. A local foreground image was applied to construct triangle points based on the size of the image. The author of [9] proposes new features based on triangle properties. The triangle is formed from three corner points A, B and C. The determination of these three corner points plays a big role in triangle formation. Any fault in determining the exact coordinates of the triangle points can affect the triangle formation. The midpoint of the triangle is important in determining the position of triangle points A and B.
However, the current algorithms that extract features for face and fingerprint recognition cannot be implemented for recognizing subword images. Elements such as the eyes, mouth, nasal tip, ear hole and corner of the mouth in face recognition, as well as minutiae in fingerprint recognition, cannot be applied because these elements do not exist in subword images. Thus, elements from face and fingerprint recognition cannot be used as new feature parameters based on triangle geometry for subword images. Nevertheless, the current algorithm using triangle geometry in digit recognition can be applied to subword images. It has a constraint, however: every feature must be produced for all 33 zones, which increases the number of features to 297. The algorithm therefore increases the training time of the feature extraction process as the amount of image data grows. Thus, research on features based on triangle shape needs to be extended in order to accommodate subword images.

III. PROPOSED METHOD
A. Pre-processing
Before the proposed features are produced, a binarization process is performed in the pre-processing stage to select an adequate grey-level threshold for extracting objects from the image background. The Otsu thresholding method [20] is applied to convert the subword image into binary form, where '0' represents the foreground of the image and '1' represents the background, as shown in Fig. 1.
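As an illustration of the binarization step, the following is a minimal sketch of Otsu thresholding in plain Python. It is not the authors' implementation: the function names `otsu_threshold` and `binarize` are hypothetical, the input is assumed to be a 2-D list of 8-bit grey levels, and the ink is assumed to be darker than the paper so that dark pixels map to '0' (foreground).

```python
def otsu_threshold(gray):
    """Return the grey level t that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with grey level <= t (dark class)
    sum0 = 0  # sum of grey levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty, nothing left to split
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray):
    """Map pixels to 0 (foreground, dark ink) or 1 (background, paper)."""
    t = otsu_threshold(gray)
    return [[0 if v <= t else 1 for v in row] for row in gray]
```

Under these assumptions, `binarize` produces the '0'/'1' convention described above, so its output can be passed directly to a zoning step.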

B. Feature Extraction
1) Zoning method: In this stage, the zoning method is applied to divide the image into several zones, each containing useful information that can be extracted as features. Zoning is a well-known handwriting recognition method in which the handwriting image is divided into several zones that provide regional information according to feature needs. Four types of zoning method are applied, namely Cartesian plane zone, horizontal zone, vertical zone and 45-degree zone. These zoning methods have also been used in digit recognition [10], [19]. TABLE I summarizes the zoning methods, while Fig. 2 illustrates the image output from the Cartesian plane zone method. As shown in Fig. 2, the binary image is divided into five zones, including the main image, using the Cartesian plane zone method. The binary image is measured by its height and width, which are obtained from the number of binary pixels, both '0' and '1'.
2) Geometry Method: After the zoning method is applied, features can be extracted from each zone using the geometry method. In this study, the triangle geometry method is applied to generate features, where the triangle shape is formed inside each divided zone. There are 33 triangle shapes, one per zone across all types of zoning method (refer to TABLE I). The features are generated from the triangle shape, whose three main coordinates are formed from selected centroids. There are six types of possible centroids that form the triangle shape, as shown in TABLE III. The algorithm to obtain the three main triangle coordinates is shown in Fig. 3. After the three main triangle coordinates are identified from the selected centroids, the coordinates are used to extract the features based on triangle shape. This yields 99 features (3 features × 33 zones). The description of the triangle shape features based on the three main triangle coordinates from selected centroids is shown in TABLE II.
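The zoning and triangle steps can be sketched as follows for the Cartesian plane zone only. This is an illustrative sketch under several assumptions that the text does not confirm: the four quadrants are split at the geometric midpoint of the image, point C is the centroid of the foreground pixels of a zone, points A and B are the centroids of the foreground pixels to the left and right of C, the three features per zone are the side lengths, and an empty zone contributes zeros. All function names are hypothetical. The side lengths are computed as Euclidean distances between the corner points, which gives the same values as the law of cosines.

```python
import math


def foreground(zone):
    """(x, y) coordinates of all foreground pixels ('0') in a binary zone."""
    return [(x, y) for y, row in enumerate(zone)
            for x, v in enumerate(row) if v == 0]


def mean_point(points):
    """Centroid of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def cartesian_zones(img):
    """Whole image plus four quadrants (assumed split at the midpoint)."""
    h, w = len(img), len(img[0])
    my, mx = h // 2, w // 2

    def crop(r0, r1, c0, c1):
        return [row[c0:c1] for row in img[r0:r1]]

    return [img, crop(0, my, 0, mx), crop(0, my, mx, w),
            crop(my, h, 0, mx), crop(my, h, mx, w)]


def triangle_points(zone):
    """Return (A, B, C), or None if the zone has no foreground pixels."""
    pts = foreground(zone)
    if not pts:
        return None
    c = mean_point(pts)
    left = [p for p in pts if p[0] < c[0]]    # assumed source of point A
    right = [p for p in pts if p[0] >= c[0]]  # assumed source of point B
    a = mean_point(left) if left else c
    b = mean_point(right) if right else c
    return a, b, c


def side_lengths(a, b, c):
    """Sides a, b, c, each opposite the corner of the same name."""
    return (math.dist(b, c), math.dist(a, c), math.dist(a, b))


def zone_features(img):
    """Three side lengths per zone, concatenated over the five zones."""
    feats = []
    for zone in cartesian_zones(img):
        pts = triangle_points(zone)
        feats.extend(side_lengths(*pts) if pts else (0.0, 0.0, 0.0))
    return feats
```

With five Cartesian zones this sketch yields 15 values per image; extending it to all 33 zones would give the 99 features reported in the text, but the layouts of the horizontal, vertical and 45-degree zones are defined in TABLE I and are not reproduced here.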

IV. EXPERIMENT AND EVALUATION
In this study, Arabic subword handwriting from the AHDB database is used. This dataset contains more than 2000 images of Arabic words and texts written by a hundred different writers, of which 70% is used as training data and 30% as testing data. To evaluate the data, Support Vector Machine (SVM) and Random Forest (RF) classifiers are applied and compared on accuracy. SVM is one of the most popular approaches for measuring classification accuracy in handwriting recognition. LIBSVM is used to find the SVM parameters with the highest cross-validation (CV) accuracy. The Gaussian kernel is applied, and a grid search is performed for the cost and gamma values with the highest cross-validation accuracy. The best values of cost and gamma are then used to train the SVM on the dataset. With the best values of cost and gamma, good accuracy is achieved according to the nature and characteristics of the dataset. The cost and gamma values for the method in [9] and for our proposed method are shown in TABLE IV. The accuracy results of the SVM classifier are compared between the prior method [9] and the proposed method. Based on TABLE V, the proposed method shows a better outcome, obtaining 77.653% compared to the method of [9], which obtained only 76.122%. The SVM results show that the proposed method has achieved its target of using a minimum number of features by using the triangle shape feature. The number of features can be reduced from 297 to 99 by using a different approach to the triangle shape feature types. Furthermore, the triangle shape feature can differentiate one triangle shape type from another.
The accuracy results are also compared using another classifier, based on different features used on the AHDB dataset. Based on TABLE VI, the accuracy obtained with random forest shows a good result for our proposed method, with an increase of about 7% over the prior method [21]. This shows that the proposed features using triangle shape are efficient and applicable to Arabic handwritten text recognition. However, handwriting styles, patterns and types may influence the features produced, which makes recognizing Arabic handwritten text more challenging.
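The 70/30 split and the cost and gamma grid search described above can be sketched as follows. This is a generic sketch rather than the authors' procedure. The scoring callback `cv_accuracy(cost, gamma)` is hypothetical and stands in for a cross-validated SVM run (for example, through LIBSVM). The grid is the exponential grid commonly used by default with LIBSVM, assumed here because the text does not state the grid that was searched.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle and split samples into 70% training and 30% testing."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = (7 * len(idx)) // 10
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test


def grid_search(cv_accuracy, costs, gammas):
    """Return (best_accuracy, best_cost, best_gamma) over the grid.

    cv_accuracy is a caller-supplied function (hypothetical here) that
    trains a Gaussian-kernel SVM with the given cost and gamma and
    returns its cross-validation accuracy.
    """
    best = None
    for cost in costs:
        for gamma in gammas:
            acc = cv_accuracy(cost, gamma)
            if best is None or acc > best[0]:
                best = (acc, cost, gamma)
    return best


# Assumed grid: cost = 2^-5 .. 2^15 and gamma = 2^3 .. 2^-15, exponent step 2.
COSTS = [2.0 ** e for e in range(-5, 16, 2)]
GAMMAS = [2.0 ** e for e in range(3, -16, -2)]
```

The pair returned by `grid_search` would then be used to train the final SVM on the training portion, with accuracy reported on the held-out 30%.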

V. CONCLUSION
This paper presents a feature based on triangle shape, formed from three main coordinates using selected centroids. The proposed feature has been shown to be applicable as a feature for recognizing Arabic subword handwriting. The results based on SVM and RF are good for the proposed method compared to prior methods. Further research can extend this work by applying other geometric shapes as features.

2. Length of side b: $b = \sqrt{a^2 + c^2 - 2ac\cos B}$
3. Length of side c: $c = \sqrt{a^2 + b^2 - 2ab\cos C}$

Input: binary image of zone
Output: triangle shape points (A, B, C)
Begin
• Read image I from dataset
• N ← total number of pixels at x-axis
• Get point C (centroid)
• h ← centroid height of zone, w ← centroid width of zone
• Get point A. Find [...] = [...] <= [...] − 1
• Get point B. Find [...] = 0 [...] <= [...]
End

TABLE III. DESCRIPTION OF TRIANGLE SHAPE BASED ON SELECTED CENTROID

TABLE I. SUMMARY OF ZONING METHOD

Fig. 2. Output Image after using Cartesian Plane Zone Method.

TABLE IV. COST AND GAMMA RESULTS USING LIBSVM FUNCTION

TABLE VI. CLASSIFICATION ACCURACY RESULTS BASED ON RF