Recognition of Odia Character in an Image by Dividing the Image into Four Quadrants

This paper deals with optical character recognition of Odia characters written in a particular font family, 'AkrutiOriAshok-99', with the font sizes 18, 20, 22, 24, 26, 28, 36, 48 and 72 in bold style. 'AkrutiOriAshok-99' is a font from the typing software 'Akruti'. The basic idea behind the approach followed in this paper is to decompose each character into four quadrants and then extract features from each quadrant. Image processing techniques such as converting the image to gray scale, resizing the image and converting the gray image to binary are used in this approach. The system explained in this paper has two major parts: DictionaryBuilding and FindingMatch. For DictionaryBuilding, a dictionary of images is created, either by scanning a document or by converting a document to an image, both written in the same font family in different sizes. Features are extracted from each image of every font size in the 'Dictionary' using the Preprocessing, FindPath, GettingFeaturesLeft or GettingFeaturesRight, VisitSubQuad, RemainingSubQuad, WriteToExcel and CommonFeature modules. The FindingMatch part is responsible for finding a correct match in the dictionary for the input image; for this, the FeatureExtraction and Recognition modules have been used. Longest Common Subsequence (LCS) has been used both for finding the common feature in DictionaryBuilding and for finding the correct match. A total of 1800 characters, 200 of each font size, have been tested and 98.1% correctness has been achieved.


I. INTRODUCTION
Nowadays, textual data are often scanned, or converted to an image using software, so that the data are stored in image form. The characters present in such a scanned or converted document must then be recognised by an algorithm. For accurate recognition of characters in an image, the lines, words and characters must first be segmented efficiently.
In the Odia language, the alphabets are grouped into three categories: Swara Barna, Byanjana Barna and Atirikta Barna [1] (Fig. 1). Only Chandra Bindu ( ), Anusara ( ) and Bisarga ( ), which are part of Byanjana Barna, can be used with all the alphabets of Swara Barna, the other alphabets of Byanjana Barna and all the alphabets of Atirikta Barna to form words. When a Swara Barna is used with the alphabets of Byanjana Barna and Atirikta Barna, the former is used as a symbol with the latter to form words. These symbols are known as Matras [2]. When a Byanjana Barna alphabet is used as a symbol with other alphabets of Byanjana Barna, these are called Juktakhyara [2].
There are different types of software available for typing the Odia language on a computer. Akruti and the Microsoft Indic Language Input Tool for Odia are two popularly used typing tools.
This paper concentrates on the recognition of Swara Barna, Byanjana Barna and Atirikta Barna. The alphabet is written in a document in a particular font family, 'AkrutiOriAshok-99', in a particular font size in bold style. The font sizes considered in this paper are 18, 20, 22, 24, 26, 28, 36, 48 and 72. This document is either scanned or converted to an image by software. The approach described in this paper first creates a dictionary of images written in the 'AkrutiOriAshok-99' font family, with font sizes 18, 20, 22, 24, 26, 28, 36, 48 and 72, in bold style. The features of these images are extracted using the Preprocessing, FindPath, GettingFeaturesLeft or GettingFeaturesRight, VisitSubQuad and RemainingSubQuad modules, and the extracted features are written to an excel file using the WriteToExcel module. A common feature is extracted from the features of the dictionary images using the CommonFeature module. For finding a correct match for the input image in the dictionary of features, 'FindingMatch' has been used. For finding a correct match, the CheckCommonFeature module of Recognition has been used; if it does not find a correct match, the MatchCommonFeature module is used. If these two modules of Recognition are still unable to find a correct match, the TraceAnotherDirection module of Recognition is used. The Preprocessing module in 'DictionaryBuilding' and 'FindingMatch' converts the image into a gray image, and the white spaces surrounding the Odia alphabet in the gray image are removed using Phase-1 of the RemoveNoise module of [3] (RemoveBoundarySpaces). For converting the image to gray, the OpenCV package of Python has been used. After the elimination of white spaces from the gray image, it is resized to 64 x 64 and the resized image is converted to binary using Otsu's method of thresholding [4,5,6].
A gray image is a type of image where intensity is stored as an 8-bit integer, so each pixel can have an intensity value ranging from 0 to 255 [7]. A binary image is a type of image where the image data are represented in terms of 0 and 1 [7,8,9]. The basic idea for extracting features followed in this paper, for both DictionaryBuilding and FindingMatch, is dividing the image into four quadrants and then tracing a continuous path of black pixels in a particular direction in each quadrant. Experimentally, a specific direction of tracing has been agreed upon for each quadrant. The inputs to DictionaryBuilding and FindingMatch are a directory named 'Dictionary' (consisting of all alphabets of Swara Barna, Byanjana Barna and Atirikta Barna of the Odia language) and a directory named 'Input' (consisting of an image of an Odia alphabet) respectively. The files present in 'Dictionary' are accessed using the os package of Python [10]. The extracted features for DictionaryBuilding and FindingMatch are written to the excel files 'DictionaryFeatures.xlsx' and 'InputFile.xlsx' respectively, using the openpyxl package of Python [11]. The common feature is extracted from the features present in 'DictionaryFeatures.xlsx' using Longest Common Subsequence (LCS) [12,13,14,15,16], and the common feature is written to the excel file 'CommonFeature.xlsx' using the openpyxl package. In both DictionaryBuilding and FindingMatch, the NumPy package of Python [17,18] has been used for rounding off values and the Matplotlib package of Python [19] for sub-plotting the four quadrants of a given image in one single figure (Fig. 3). Data structures like lists and dictionaries of Python are used for holding multiple values. A list behaves as a dynamic array in Python and multiple values can be appended to it [20,21]. A dictionary consists of keys, each of which maps to a specific value [20,21].
Both the keys and the values of a dictionary can be numbers or strings.
In other words, the proposed system concentrates on dictionary building by extracting features from the images present in the 'Dictionary' directory and storing the extracted features in an excel file, 'DictionaryFeatures.xlsx'. Experiments show that the same character in different font sizes yields a number of different features; therefore, a common feature among all the font sizes must be found. To achieve this, Longest Common Subsequence (LCS) has been used, so that there is one common feature for a particular character. This common feature has been stored in an excel file, 'CommonFeature.xlsx'. The above process is done by the phases of 'DictionaryBuilding'. Then a feature is extracted from an input image and this feature is searched in 'CommonFeature.xlsx' by following the phases of 'FindingMatch' to get a correct match. The proposed approach helps to recognise Odia characters from a scanned image or a document converted to an image; the recognised characters can then be written into a document for further editing.

II. RELATED WORK
The system introduced in [22] segments handwritten Odia text into lines, the lines into words and the words into characters, using the water reservoir principle introduced in [23]. The input to the system was a handwritten Odia document. To segment lines, the document was divided into vertical stripes. Based on the vertical projection profile and structural features of Odia characters, text lines were segmented into words. For character segmentation, the connected characters were first detected; the touching characters of a word were then segmented using the water reservoir concept. The word segmentation module was tested on 3700 words and achieved an accuracy of 98.2%. The proposed technique for isolated and connected character identification had an average accuracy of 96.7%. In the experiments, isolated characters fell into the isolated group in 98.6% of cases, 96.7% accuracy was obtained on two-character touching components, and the accuracy on three-character touching components was 95.1%.
The system introduced in [23] uses a technique for automatic segmentation of handwritten connected numerals. This system worked on images of French bank checks from the French company Itesoft. Initially, the images were in gray scale (256 levels), and a histogram-based thresholding approach was used to convert the gray images into binary images. Features were extracted using the water reservoir technique. A reservoir is obtained by the accumulation of water poured from the top or from the bottom of the numerals: top reservoirs are formed when water is poured from the top, and bottom reservoirs are formed when water is poured from the top after rotating the component by 180°. Water reservoirs are the white regions of the component. The features considered in the scheme were: the number of reservoirs; the position of the reservoirs with respect to the bounding box of the touching pattern; the shape and size of the reservoirs; the centre of gravity of the reservoirs; and the relative positions of the reservoirs. The segmentation result was verified manually and 94.8% of the connected numerals were observed to be accurately segmented.
The system described in [24] recognises Odia compound characters by analysing strokes. The approach identified 12 strokes that are enough to describe any Odia character. The input character was resized to a 60 x 60 image and then divided into nine equal parts called zones. Each zone consists of some strokes. Since there are nine zones and 12 strokes, each character's feature vector was represented as a 1 x 108 vector. The similarity values between strokes and zones were arranged in vector format. The Structural Similarity Index was used, as it is based on the concept that the structure of an image is independent of illumination. The training set was prepared from 211 classes of the Kalinga font. The system was implemented on a Windows machine on the MATLAB platform. An independent character recognition accuracy of 92% was achieved. The system also covered many test samples of degraded Kalinga characters, and a complete OCR was designed to work on scanned text documents.
The approach described in [25] deals with handwritten Odia character recognition, using two levels of classification. The input to the first level of classification was a cropped image, which was binarized and then thinned. The mid value of the image was found. The image was divided into three equal parts row-wise and two parts column-wise, making six zones. For each zone, the distance between each pixel and the centroid was calculated and then averaged over the zone; likewise, the angle between the image centroid and each pixel was calculated and averaged. In the second-level classification, the cropped image was taken as input and divided into nine zones, and the same procedure as in the first level was followed. The first-level classification output six average distances and six average angles; the second-level classification output nine standard deviations, nine average distances and nine average angles. An artificial neural network was then used for classification.
The system introduced in [26] considered each character as a composition of a sequence of high-level and low-level strokes. The low-level strokes had been identified in the system explained in [27]. In [26], forty-eight visually non-redundant high-level strokes were identified, from which most Gujarati characters can be formed. Each high-level stroke is a combination of points, curves and lines. The proposed method starts scanning from the centre region of the character in left-to-right order and extracts all junction points. The 3 x 3 neighbourhood of each junction point is then scanned in clockwise order to obtain the starting point of each high-level stroke. Each high-level stroke ends at an endpoint or at the next junction point, traced using a contour tracing method. The system used a finite state machine to identify high-level strokes. For classification, the system used a Naive Bayes Classifier and a Hidden Markov Model, achieving overall accuracies of 93.26% and 96.87% respectively.
III. SYSTEM ARCHITECTURE
This approach consists of the 'DictionaryBuilding' and 'FindingMatch' parts. The outputs of these two parts are given as input to the Recognition module to find a correct match. The overall system architecture is shown in Fig. 2.

A. DictionaryBuilding
This part deals with building a dictionary of features extracted from a dictionary of images of Odia alphabets, created by scanning a document or converting a document to an image using software, both written in the font family 'AkrutiOriAshok-99' in a particular font size. The font sizes used are 18, 20, 22, 24, 26, 28, 36, 48 and 72. For a particular font size, the images of Odia alphabets of that size are stored in one directory; hence, nine directories are created for the nine font sizes. These nine directories are stored in a directory named 'Dictionary'.
The input to 'DictionaryBuilding' is the 'Dictionary' directory. The directories in 'Dictionary' and the image files in each directory are accessed using the os package of Python. Each image file goes through 'Feature Extraction in DictionaryBuilding'.
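As a concrete illustration, the directory traversal described above might be sketched as follows with the os package. This is only a sketch: the function name and the accepted image extensions are assumptions, not taken from the paper.

```python
import os

def list_dictionary_images(dictionary_path):
    """Collect image paths grouped by font-size directory.

    Hypothetical helper: the paper only states that 'Dictionary' and
    its per-font-size sub-directories are accessed with the os package.
    """
    grouped = {}
    for dic_item in sorted(os.listdir(dictionary_path)):   # one dir per font size
        sub_dir = os.path.join(dictionary_path, dic_item)
        if os.path.isdir(sub_dir):
            grouped[dic_item] = [
                os.path.join(sub_dir, f)
                for f in sorted(os.listdir(sub_dir))
                if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp'))
            ]
    return grouped
```

Each returned group would then be processed one image at a time, exactly as the feature-extraction pipeline below describes.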

1) Feature Extraction in DictionaryBuilding
The 'Dictionary' directory consists of nine directories, each dedicated to a particular font size. For example, the directory dedicated to font size 18 consists of images of each Odia alphabet written in font size 18, the directory dedicated to font size 20 consists of images of each Odia alphabet written in font size 20, and so on. All the images in these directories undergo the Preprocessing, FindPath, GettingFeaturesRight or GettingFeaturesLeft, RemainingSubQuad and VisitSubQuad modules to extract their features, and the extracted features are written into an excel file using the WriteToExcel module. For each directory in 'Dictionary', a sheet is created in the excel file 'DictionaryFeatures.xlsx' and the features are written in that sheet. The overall process of feature extraction for the 'Dictionary' images is shown in Fig. 5.
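The per-sheet layout described above (one sheet per font-size directory, quadrant features in columns 1 to 4 and the image file name in column 5) can be sketched with the openpyxl package. The function name and signature here are assumptions; only the column layout comes from the paper.

```python
from openpyxl import Workbook

def write_features(data_path, sheet_name, row, features, short_name):
    """Write one image's four quadrant features plus its file name.

    Sketch only: columns 1-4 hold the quadrant features and column 5
    the file name, as described in the text; everything else is an
    assumption about how the WriteToExcel module might be shaped.
    """
    wb = Workbook()
    wb.remove(wb.active)                    # drop the default sheet
    ws = wb.create_sheet(title=sheet_name)  # one sheet per font-size directory
    for col, feature in enumerate(features, start=1):
        ws.cell(row=row, column=col, value=feature)
    ws.cell(row=row, column=5, value=short_name)
    wb.save(data_path)
```

A real implementation would reopen and extend an existing workbook rather than create a fresh one per call; this sketch keeps only the cell-addressing idea.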

a) Preprocessing Module
The input to the Preprocessing module is the directory 'Dictionary'.

Algorithm:
Input: Directory 'Dictionary'
For each image in the directories of 'Dictionary', the following steps are followed:
1. The image is converted to a gray image.
2. The white spaces that surround the text in the gray image are removed using Phase-1 of the RemoveNoise module of [3], that is, RemoveBoundarySpaces. This leaves an image that consists of the Odia alphabet only.
3. After the white spaces have been removed, the image is resized to 64 x 64 using inter-cubic interpolation.
4. The resized image, consisting of the Odia alphabet only, is then converted into a binary image named 'BinaryImage' using Otsu's method.
In 'BinaryImage', the pixels that form the Odia alphabet are called black pixels and they are represented as 0 whereas the pixels that form the other areas of the 'BinaryImage' are called white pixels and they are represented as 1.
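For illustration, Otsu's thresholding and the subsequent quadrant division can be sketched in pure NumPy. The paper itself uses OpenCV for the conversion and thresholding; this stand-alone version only mirrors the described behaviour (alphabet pixels become 0, background becomes 1, and the 64 x 64 image is cut at MidRow and MidCol).

```python
import numpy as np

def otsu_threshold(gray):
    """Return Otsu's optimal threshold for an 8-bit gray image by
    maximising the between-class variance over all candidate thresholds."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_w = np.cumsum(hist)                       # cumulative pixel counts
    cum_mu = np.cumsum(hist * np.arange(256))     # cumulative intensity sums
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = cum_w[t - 1], total - cum_w[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mu[t - 1] / w0
        mu1 = (cum_mu[255] - cum_mu[t - 1]) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2          # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize_and_split(gray):
    """Binarize (alphabet pixels -> 0, background -> 1) and cut the
    64 x 64 image into four equal quadrants at MidRow = m/2, MidCol = n/2."""
    t = otsu_threshold(gray)
    binary = (gray >= t).astype(np.uint8)         # dark strokes become 0
    m, n = binary.shape
    mid_row, mid_col = m // 2, n // 2
    return (binary[:mid_row, :mid_col], binary[:mid_row, mid_col:],
            binary[mid_row:, :mid_col], binary[mid_row:, mid_col:])
```

With m = n = 64 this yields four 32 x 32 quadrants, matching the division used by both 'DictionaryBuilding' and 'FindingMatch'.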

5. The 'BinaryImage' is divided into two equal parts, both horizontally and vertically; in this way, the image is divided into four equal quadrants. The dimension of this image is m x n (m = 64 and n = 64), where 'm' is the number of rows and 'n' is the number of columns. The row that equally divides the 'BinaryImage' horizontally is named 'MidRow' and the column that equally divides it vertically is named 'MidCol':

MidRow = m / 2, MidCol = n / 2

In the WriteToExcel module, 'shortName' is the name of the image file present in the DicItem-th directory of 'Dictionary'. Suppose quadNo = 1, quadrant = B, DicItem = 2 and DicInnerItem = 12; then a sheet named '2' is created in the excel file 'DictionaryFeatures.xlsx', whose path is provided in the 'DataPath' parameter, and the extracted feature is written in the 12th row (as DicInnerItem = 12) and 1st column (as quadNo = 1) of that sheet. The value of the 'shortName' parameter is written in the fifth column.
d) GettingFeaturesRight Module
The aim of this module is to get a continuous trace of black pixels, scanning from left to right. When the first black pixel is found by the FindPath module while scanning the quadrant from the specific corner, the coordinates of that pixel (the I and J values) are passed to this module to get a continuous trace of black pixels in the specified quadrant. This module is used in quadrants B, C and E.
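A minimal sketch of such a left-to-right trace is given below. The paper does not spell out its exact neighbour-selection rule in the surviving text, so the preference order used here (same row first, then one row up, then one row down) is an assumption.

```python
import numpy as np

def trace_right(quad, start_i, start_j):
    """Follow a continuous run of black pixels (value 0) from left to right.

    Hedged sketch: greedily step one column to the right, preferring the
    same row, then one row up, then one row down. Returns the list of
    (row, col) coordinates visited, starting from the given pixel.
    """
    rows, cols = quad.shape
    path = [(start_i, start_j)]
    i, j = start_i, start_j
    while j + 1 < cols:
        for di in (0, -1, 1):                 # same row, then up, then down
            ni = i + di
            if 0 <= ni < rows and quad[ni, j + 1] == 0:
                i, j = ni, j + 1
                path.append((i, j))
                break
        else:
            break                             # no black pixel in the next column
    return path
```

GettingFeaturesLeft would be the mirror image of this, stepping one column to the left instead.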

f) RemainingSubQuad Module
A continuous trace of black pixels is found by either GettingFeaturesLeft or GettingFeaturesRight, starting from the first black pixel found by the FindPath module, and the name of each sub-quadrant visited is appended to 'LSubQuad'. However, the continuous trace of black pixels may not have reached some part of the quadrant ('B', 'C', 'D' or 'E'). To ensure that all parts of the quadrant have been accessed, the remaining parts are accessed using the RemainingSubQuad module. First, this module checks whether all the sub-quadrants have been accessed for black pixels, by checking the contents of the 'LSubQuad' list. A sub-quadrant that has no black pixels is not allowed to appear in 'LSubQuad'. If the names of all sub-quadrants that have black pixels have appeared at least once in 'LSubQuad', RemainingSubQuad exits; otherwise, it is called recursively to scan the sub-quadrants until all sub-quadrants that have black pixels have been scanned and recorded in 'LSubQuad'.
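The coverage check described above can be sketched as follows. The spatial layout assigned to the sub-quadrant labels (a = top-left, b = top-right, c = bottom-left, d = bottom-right) is an assumption, as is the function name.

```python
import numpy as np

def uncovered_subquads(quad, l_sub_quad):
    """Return the sub-quadrant labels that contain black pixels (0s)
    but have not yet appeared in 'LSubQuad'.

    An empty result means RemainingSubQuad can exit; a non-empty result
    means those sub-quadrants still need to be scanned.
    """
    m, n = quad.shape
    mr, mc = m // 2, n // 2
    subs = {'a': quad[:mr, :mc], 'b': quad[:mr, mc:],
            'c': quad[mr:, :mc], 'd': quad[mr:, mc:]}
    return sorted(label for label, region in subs.items()
                  if (region == 0).any() and label not in l_sub_quad)
```

Note that sub-quadrants with no black pixels are ignored entirely, mirroring the rule that such labels never enter 'LSubQuad'.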

'quadrant' is one of 'B', 'C', 'D' and 'E'.
Algorithm: RemainingSubQuad(quadrant)
1. IF 'a', 'b', 'c' and 'd' have not all been appended to 'LSubQuad', SCAN the black pixels of the quadrant from the bottom-most, left-most corner to find the first black pixel, and from there scan towards the right, following a procedure similar to GettingFeaturesRight, to find the continuous trace of black pixels.
…
18. For each black pixel in the continuous trace, APPEND 'd' to 'LSubQuad'.
h) CommonFeature Module
This module extracts one common feature per character from the features stored in 'DictionaryFeatures.xlsx' by applying LCS across the sheets (one sheet per font size). Key steps of its algorithm include:
REPEAT STEPS 3 TO 6 WHILE sh < sheet:
3. APPEND the feature in the first column of the 'sr-th' row of the 'sh-th' sheet of 'DictionaryFeatures.xlsx' to 'fiQuList'.
…
6. APPEND the feature in the fourth column of the 'sr-th' row of the 'sh-th' sheet to …
…
WRITE the value of 'Text1' in the first column of the 'sr-th' row of the excel file 'CommonFeature.xlsx'.
24. WRITE the value of 'Text2' in the second column of the 'sr-th' row of the excel file 'CommonFeature.xlsx'.
25. WRITE the value of 'Text3' in the third column of the 'sr-th' row of the excel file 'CommonFeature.xlsx'.
26. WRITE the value of 'Text4' in the fourth column of the 'sr-th' row of the excel file 'CommonFeature.xlsx'.
…
Find the maximum value in the array 'LcsForm' and store it in 'MaxValue'.
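The LCS computation at the heart of the CommonFeature module can be sketched with the standard dynamic-programming algorithm. Folding it over the per-font-size features of one character yields the single common feature; the helper name `common_feature` is an assumption.

```python
from functools import reduce

def lcs(s1, s2):
    """Longest Common Subsequence of two feature strings,
    computed with the classic dynamic-programming table."""
    m, n = len(s1), len(s2)
    dp = [[''] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + s1[i - 1]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], key=len)
    return dp[m][n]

def common_feature(features):
    """Fold LCS over the features of the same character in every
    font size, leaving one common feature (sketch of CommonFeature)."""
    return reduce(lcs, features)
```

For example, the features 'abcd', 'abd' and 'ad' collapse to the common feature 'ad'.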

B. FindingMatch
This part deals with finding a correct match, from the dictionary of common features stored in the excel file 'CommonFeature.xlsx', for an image of an Odia alphabet provided as input. The input image is stored in a directory named 'Input'. The 'FindingMatch' part goes through two phases: 'Feature Extraction' and 'Recognition'.
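The two-stage matching idea (an exact comparison first, then a best-LCS fallback) can be sketched as below. The dictionary structure and function names are assumptions; the paper's actual CheckCommonFeature and MatchCommonFeature modules may differ in detail.

```python
def lcs_len(s1, s2):
    """Length of the Longest Common Subsequence, used as a similarity score."""
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i, c1 in enumerate(s1, 1):
        for j, c2 in enumerate(s2, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if c1 == c2 else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def find_match(input_feats, dictionary):
    """Stage 1 (CheckCommonFeature-like): exact comparison of the four
    quadrant features. Stage 2 (MatchCommonFeature-like): best total
    LCS length across the four quadrants."""
    for name, feats in dictionary.items():        # stage 1: exact match
        if tuple(feats) == tuple(input_feats):
            return name
    best_name, best_score = None, -1              # stage 2: LCS similarity
    for name, feats in dictionary.items():
        score = sum(lcs_len(a, b) for a, b in zip(feats, input_feats))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

The paper's third fallback, TraceAnotherDirection, re-extracts the features by scanning the quadrants in a different direction before matching again; it is omitted from this sketch.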

1) Feature Extraction
This phase goes through several modules to extract features from the input image present in the directory 'Input', and the features are written to an excel file named 'InputFile.xlsx'. The modules are Preprocessing, FindPath, GettingFeaturesRight or GettingFeaturesLeft, VisitSubQuad and RemainingSubQuad for extracting the features, and the WriteToExcel module for writing them to the excel file. The overall process of feature extraction for the input image is shown in Fig. 5.

a) Preprocessing Module
The steps in this module are the same as those described in the Preprocessing module of 'DictionaryBuilding', except for the values passed to the parameters of the FindPath module. The input to this module is the directory 'Input', consisting of an image of an Odia alphabet. The input image is converted to a gray image. The white spaces surrounding the Odia alphabet in the gray image are removed using Phase-1 of the RemoveNoise module of [3] (RemoveBoundarySpaces). The resultant image is resized to the dimension p x q (p = 64 and q = 64), where 'p' is the number of rows and 'q' is the number of columns, and is then converted to a binary image named 'BinaryImageIn'. The row that equally divides 'BinaryImageIn' horizontally is named 'MidRow' and the column that equally divides it vertically is named 'MidCol':

MidRow = p / 2, MidCol = q / 2

The four quadrants, U, V, W and X, are then found from 'BinaryImageIn' in the same way as quadrants B, C, D and E in 'DictionaryBuilding'.

c) GettingFeaturesLeft Module
The steps in this module are the same as those described in the GettingFeaturesLeft module of 'DictionaryBuilding', except that here they are applied to the W quadrant.

d) GettingFeaturesRight Module
The steps in this module are the same as those described in the GettingFeaturesRight module of 'DictionaryBuilding', except that here they are applied to the U, V and X quadrants.

e) VisitSubQuad Module
The steps in this module are the same as those described in the VisitSubQuad module of 'DictionaryBuilding', except that they are applied to the U, V, W and X quadrants. As in the VisitSubQuad module of 'DictionaryBuilding', each quadrant is divided into four sub-quadrants, a, b, c and d. For each black pixel in the continuous trace, the sub-quadrant ('a', 'b', 'c' or 'd') is found and appended to 'subQuad'. The value of 'subQuad' is returned and assigned to 'LSubQuad' in 'GettingFeaturesLeft' or 'GettingFeaturesRight', whichever has been called.
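A sketch of the sub-quadrant labelling is given below. The positions assigned to a, b, c and d (top-left, top-right, bottom-left, bottom-right) and the collapsing of consecutive repeated labels are assumptions, since the surviving text does not fix either detail.

```python
def label_subquad(i, j, mid_row, mid_col):
    """Map a pixel coordinate inside a quadrant to its sub-quadrant label.
    Layout assumed: a = top-left, b = top-right, c = bottom-left,
    d = bottom-right."""
    if i < mid_row:
        return 'a' if j < mid_col else 'b'
    return 'c' if j < mid_col else 'd'

def visit_subquad(trace, mid_row, mid_col):
    """Append the sub-quadrant of every pixel in the continuous trace,
    collapsing consecutive repeats so the result reads as a path such
    as ['a', 'b', 'd'] (sketch of the VisitSubQuad behaviour)."""
    sub_quad = []
    for i, j in trace:
        label = label_subquad(i, j, mid_row, mid_col)
        if not sub_quad or sub_quad[-1] != label:
            sub_quad.append(label)
    return sub_quad
```

Joining the returned labels into a string would give the feature fragment contributed by one quadrant.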

f) RemainingSubQuad Module
The steps in this module are the same as those described in the RemainingSubQuad module of 'DictionaryBuilding', except that they are applied to the U, V, W and X quadrants. If any portions of the quadrants U, V, W and X are not covered by the continuous trace of black pixels, those remaining portions are covered by this module and the names of the sub-quadrants ('a', 'b', 'c' or 'd') are appended to 'LSubQuad'.

g) WriteToExcel Module
The steps in this module are the same as those described in the WriteToExcel module of 'DictionaryBuilding', except that the features extracted from the quadrants U, V, W and X are written in an excel file named 'InputFile.xlsx'. The absolute path of 'InputFile.xlsx' is stored in the 'DataPath' parameter and the file name of the input image is stored in 'shortName'. Hence, the features extracted from the quadrants U, V, W and X and the value of the 'shortName' parameter are written in the first, second, third, fourth and fifth columns of the first row of 'InputFile.xlsx' respectively; only one sheet is present in the excel file, as the 'Input' directory has no sub-directories. For example, the final features for the input image are: …

IV. RESULTS AND DISCUSSION
To achieve recognition of an Odia alphabet, the system explained in this paper is divided into two parts: 'DictionaryBuilding' and 'FindingMatch'. For testing, an image of an Odia alphabet is given as input to 'FindingMatch' to find a correct match. Nine font sizes have been considered for this research, and 200 images of Odia alphabets of each font size, making a total of 1800 images, are provided as input to 'FindingMatch' one at a time. The percentage of correctness is shown in Fig. 7. For feature extraction, [22] and [23] had used the water reservoir principle to get the shapes of the characters and numerals respectively; [24] had divided the characters into nine zones and traced the shapes in each zone; [25] had found the centroid of the character and then the angle between the centroid and each pixel to trace the shapes of the characters; and [26] had first found some low-level strokes to detect the high-level strokes and, using these strokes, traced the shapes of the characters.
The proposed approach also traces the shapes of the characters, by first dividing the character into four quadrants and then scanning each quadrant in different directions to obtain the features in string format. The proposed system has been compared with the systems in [22], [23], [24], [25] and [26], and the results are tabulated in Table I. It has also been found that the alphabet Chota U ( ) is recognised as Bada U ( ) in some font sizes because there is very little difference in their structure. The system faces the same challenge for the alphabets Ra ( ) and Ru ( ).

V. CONCLUSION
The approach described in this paper goes through two parts: the first builds a dictionary and the second finds a match for the image given as input. In the first part ('DictionaryBuilding'), a dictionary of images consisting of alphabets in the font family 'AkrutiOriAshok-99' in different font sizes is prepared. Features are extracted from the images and written in an excel file, 'DictionaryFeatures.xlsx'. LCS has been used to find the common feature from the extracted features, and the common feature has been written in an excel file, 'CommonFeature.xlsx'. In the second part, features are extracted from the input image and matched against the features present in 'CommonFeature.xlsx'. In some cases, if more than one match or no match is found, the four quadrants of the input image are scanned in another direction. The overall correctness of the system has been achieved as 98.1%.
As the proposed approach recognises Chota U ( ) as Bada U ( ) and Ra ( ) as Ru ( ) in some font sizes, further research can be done to eliminate this disadvantage, which may increase the accuracy percentage. Moreover, research can be done to reduce the number of phases of the proposed system, which may increase its efficiency.
128 | P a g e www.ijacsa.thesai.org