Localisation of Numerical Date Field in an Indian Handwritten Document

This paper describes a method to localise all areas which may constitute the date field in an Indian handwritten document. Spatial patterns of the date field are studied from various handwritten documents, and an algorithm is developed through statistical analysis to identify those sets of connected components which may constitute the date. Common date patterns followed in India are considered to classify the date formats into different classes. Reported results demonstrate the promising performance of the proposed approach.


I. INTRODUCTION
Many institutions, business organisations etc. face the problem of processing handwritten documents. No successful work on the decipherment of unconstrained cursive handwriting has been reported to date [1]. Nevertheless, when the focus is narrowed to certain restricted applications of handwritten text, such as revealing the location of particular numerical data (phone numbers, pin codes, etc.), the work becomes quite interesting. Locating the 'date' field in a handwritten document is one such problem, and it is the one illustrated in this paper. It may have considerable industrial importance, as many handwritten documents need to be sorted or categorised according to the dates mentioned on them. Our proposed algorithm is a step towards automating such industrial and organisational work. It would also give an additional advantage to fax, photocopy and scanning machines, where sorting handwritten documents by the dates mentioned in them could be appreciably automated.
Work on the recognition of a given date field has been reported by many [2][3][4], each establishing a unique technique of its own. These algorithms, however, assume that the given input is a date field (i.e. the pixel locations of the 'date' field are already known). The remaining challenge is the detection or identification of those pixels in a handwritten document which may constitute the date field. Our paper focuses only on this challenging issue, so that the extracted pixels can be fed into the above-mentioned algorithms for recognition, making our work a pioneering one in the field of Document Image Analysis.
In India, the most commonly followed date patterns are DD-MM-YY, DD/MM/YY and DD.MM.YY. There are further patterns such as DD-MM-YYYY, DD/MM/YYYY and DD.MM.YYYY, but our paper focuses only on the first three. It can be convincingly said that the proposed algorithm for locating the former patterns could also be used to locate the latter ones with slight alterations.
In this paper we posit that the spatial orientation of the connected components in a numerical date field follows a specific structure which can be exploited for the localisation task. We therefore aim to find all classified date fields in each text line of the handwritten document.

II. OVERVIEW OF THE PROPOSED ALGORITHM
The proposed algorithm comprises a series of processes (depicted by the flowchart shown in Figure I): Pre-processing, Scrutinization of Eight Consecutive Connected Components (ECCC), and Further Classification of DD-MM-YY and DD.MM.YY. Each of these processes is discussed in detail in the subsequent sections of the paper.
Since our study demanded a well-maintained database, one was created (for both training and testing) by scanning numerous handwritten documents from various individuals. Each of these documents, written on white paper, was scanned at 600 dpi and stored in JPEG format.
A section is also devoted to demonstrating the outcome of our experimentation; all results obtained are presented to corroborate our study.
www.ijacsa.thesai.org

III. PREPROCESSING
Since our algorithm focuses on the scrutinization of the spatial arrangement of connected components and not on other aspects such as colour or texture, all handwritten documents considered for statistical analysis or testing are converted to binary images in which the background is assigned a pixel value of 'zero' and all handwritten components a pixel value of 'one'. The resulting binary image is shown in the accompanying figure. Once the document is converted into a binary image in this way, all the text lines are extracted from it. Extracting text lines implies grouping the connected components that belong to the same line; for scrutinization of spatial features, precise knowledge of these alignments is necessary. A histogram-projection-based text segmentation technique (inspired by [5]) is used.
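The two pre-processing stages above, binarisation followed by horizontal-projection line segmentation, can be sketched as follows. This is a minimal illustration of the idea rather than the exact implementation of [5]; the fixed grey-level threshold and the synthetic page are assumptions for the example.

```python
import numpy as np

def binarize(gray, threshold=128):
    """Background -> 'zero', handwritten (dark) strokes -> 'one'."""
    return (gray < threshold).astype(np.uint8)

def segment_lines(binary):
    """Split a binary page into text lines via horizontal projection.

    Rows whose projection (count of 'one' pixels) is zero are treated as
    inter-line gaps; maximal runs of non-zero rows become text lines.
    """
    projection = binary.sum(axis=1)
    lines, start = [], None
    for row, count in enumerate(projection):
        if count > 0 and start is None:
            start = row                      # a text line begins
        elif count == 0 and start is not None:
            lines.append((start, row - 1))   # the text line just ended
            start = None
    if start is not None:                    # line running to the last row
        lines.append((start, len(projection) - 1))
    return lines

# Synthetic 10x8 page: two dark "text lines" separated by blank rows
gray = np.full((10, 8), 255, dtype=np.uint8)
gray[1:3, :] = 0
gray[6:8, :] = 0
print(segment_lines(binarize(gray)))  # [(1, 2), (6, 7)]
```

Connected components within each line band can then be labelled with any standard labelling routine before the ECCC scrutinization of the next section.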

IV. SCRUTINIZATION OF EIGHT CONSECUTIVE CONNECTED COMPONENTS (ECCC)
The text lines extracted are then used for further examination. Since all the specified classes (DD-MM-YY, DD/MM/YY and DD.MM.YY) deal with eight connected components, a group of eight consecutive connected components (ECCC) is extracted one at a time (say C1, C2, C3, ..., C8, where all Ci belong to the same text line and C1 is the first connected component of the ECCC). The widths of the minimum bounding rectangles enclosing these eight connected components are calculated, and the maximum of these is found and stored (say as Wmax). The condition Xmin(Ci+1) > Xmin(Ci) is used to eliminate instances such as the dot of an 'i', noise and disoriented connected components (shown in Figure V).

The outline of the process is as follows:

1) The horizontal interspatial distances between the above processed eight connected components are calculated (say S1, S2, ..., S7, where Si is the horizontal interspatial distance between Ci and Ci+1). It is then checked that no Si exceeds 1.5 times Wmax. This relation was found experimentally and avoids cases such as those shown in Figure VII; it is a common observation that when dates are written, all the components representing them lie within a certain horizontal interspatial distance of their neighbours.

2) If a set of eight consecutive connected components (say Ci, Ci+1, ..., Ci+7) satisfies the conditions of the above step (Step 1), it is sent for further examination (Step 3); otherwise the next set (i.e. Ci+1, Ci+2, ..., Ci+8) is considered and processed (Step 1). This goes on iteratively until every set of eight consecutive components in the text line has been considered. When a text line has been checked thoroughly (i.e. all sets of eight consecutive components have been scrutinized), the next text line is processed.

3) Characterisation of Numerical Fields:-
In any of the classified formats (as discussed above), the first, second, fourth, fifth, seventh and eighth components constitute a numerical field. This process is inspired from [6], where features are defined to characterise the regularity of numerical fields. A feature vector is defined comprising the components f1, f2, f3, f4, f5 and f6, each computed for the ECCC set (say Ci to Ci+7) from H and Y, where H represents the height and Y the Y co-ordinate of the centre of gravity of the minimum bounding rectangle enclosing a connected component. A training set of 250 documents is studied to learn the ranges in which these features lie. These relations of a connected component with its immediate neighbours reveal features which may characterise the set as a numerical field [6].
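Steps 1 and 2 above amount to a geometric screen applied to a sliding window of eight components. A minimal sketch is given below, with each component represented by its bounding box as an (xmin, ymin, xmax, ymax) tuple; the numerical-field features f1 to f6 of Step 3 are learned from data and are not reproduced here.

```python
def passes_eccc_filter(window):
    """Geometric screen for one window of eight consecutive components.

    Each component is a bounding box (xmin, ymin, xmax, ymax).
    """
    assert len(window) == 8
    # Xmin(Ci+1) > Xmin(Ci): rejects dots of 'i', noise, disoriented parts
    for a, b in zip(window, window[1:]):
        if b[0] <= a[0]:
            return False
    # Wmax: the widest minimum bounding rectangle in the window
    w_max = max(x1 - x0 for x0, _, x1, _ in window)
    # Si: horizontal interspatial gap; no Si may exceed 1.5 * Wmax
    for a, b in zip(window, window[1:]):
        if b[0] - a[2] > 1.5 * w_max:
            return False
    return True

def candidate_windows(components):
    """Slide over a text line's components, yielding windows that pass."""
    for i in range(len(components) - 7):
        window = components[i:i + 8]
        if passes_eccc_filter(window):
            yield i, window

# Nine evenly spaced components: width 6, gaps of 4 (well under 1.5 * 6)
line = [(10 * i, 0, 10 * i + 6, 10) for i in range(9)]
print([i for i, _ in candidate_windows(line)])  # [0, 1]
```

Windows that survive this screen proceed to the numerical-field characterisation of Step 3 and the separator checks of Step 4.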

4) Spatial Orientation of Numerical Fields with respect to their Separators:-
The above classified categories of date formats accommodate three types of separators: slash (/), dash (-) and dot (.). Learning the spatial orientation of the numerical field with respect to its separators is the crux of our algorithm. A pattern is studied from the database, and a feature vector is defined comprising the elements Ymin(C2), Ymin(C3), Ymin(C4), Ymin(C5), Ymin(C6), Ymin(C7), Ymax(C2), Ymax(C3), Ymax(C4), Ymax(C5), Ymax(C6) and Ymax(C7), where Ymin and Ymax denote the minimum and maximum values of the Y co-ordinate of the minimum bounding rectangle.
Relationships among these feature elements are obtained by training on around 250 documents; these relationships are expressed (for the above-defined classifications: DD/MM/YY, DD-MM-YY or DD.MM.YY, and NON-DATE SET) in the form of mathematical inequalities (shown below).

For Class DD/MM/YY:

For Class DD-MM-YY or DD.MM.YY:

The above eight cases of inequalities (defined for each of the two categories, i.e. DD/MM/YY, and DD-MM-YY or DD.MM.YY) are used to categorise a set of ECCC into the date formats defined above. A set of ECCC falls into a category if and only if it satisfies all eight conditions defining that class. Those sets of ECCC which fall into neither category are rejected and labelled as 'NON-DATE' sets.
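The trained inequalities themselves are learned from data and are not reproduced here; the sketch below only illustrates how such per-class checks over the Ymin/Ymax feature elements could be applied to a window. The concrete conditions used (separators C3 and C6 lying vertically inside the digit band for '-' and '.', separators spanning most of the band for '/', and the 0.7 factor) are simplified stand-ins, not the paper's trained thresholds.

```python
def classify_window(boxes):
    """Illustrative separator-orientation classifier for one ECCC window.

    boxes: eight (xmin, ymin, xmax, ymax) tuples C1..C8, y growing downward.
    """
    digits = [boxes[i] for i in (0, 1, 3, 4, 6, 7)]   # DD, MM, YY digits
    seps = [boxes[2], boxes[5]]                        # 3rd and 6th components
    digit_top = min(b[1] for b in digits)
    digit_bottom = max(b[3] for b in digits)
    digit_height = digit_bottom - digit_top

    def inside(sep):   # separator strictly within the digit band: '-' or '.'
        return sep[1] > digit_top and sep[3] < digit_bottom

    def tall(sep):     # separator spans most of the digit band, like '/'
        return (sep[3] - sep[1]) > 0.7 * digit_height

    if all(tall(s) for s in seps):
        return "DD/MM/YY"
    if all(inside(s) for s in seps):
        return "DD-MM-YY or DD.MM.YY"
    return "NON-DATE"
```

For example, a window whose 3rd and 6th boxes are short marks centred between the digit tops and bottoms is returned as "DD-MM-YY or DD.MM.YY", while full-height separators yield "DD/MM/YY"; everything else is rejected as "NON-DATE".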

5) Registering pixel locations:-
Once the set of ECCC is labelled as 'date', the pixel location range is extracted: a rectangle with corners (Xmin(Ci), Ymin(Ci)), (Xmin(Ci), Ymax(Ci)), (Xmax(Ci+7), Ymin(Ci+7)) and (Xmax(Ci+7), Ymax(Ci+7)). This region is now registered as 'date'. The output for a sample document is shown in the accompanying figure.

V. FURTHER CLASSIFICATION OF DD-MM-YY AND DD.MM.YY
Both these classes of dates share common spatial attributes, hence categorising them based on the above features (or conditions) is not possible. The only distinguishing factor between them is the 3rd and 6th elements of the set of ECCC.
A feature vector comprising the elements Wcc3 and Wcc6 is defined, where Wcc3 and Wcc6 denote the widths of the 3rd and 6th connected components respectively. A database comprising 246 handwritten dates is used to train a classifier on this feature vector, and a KNN classifier (with K = 3) is then used to classify the testing data (results shown in Table I).
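This width-based KNN step can be illustrated as follows. The intuition is that a dot separator is much narrower than a dash, so (Wcc3, Wcc6) separates the two classes; the training pairs and width values below are invented for the example and are not the paper's trained data.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """K-nearest-neighbour vote over the 2-D feature (Wcc3, Wcc6).

    train: list of ((wcc3, wcc6), label) pairs; query: (wcc3, wcc6).
    """
    neighbours = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Illustrative training set: narrow separators -> dots, wide -> dashes
train = [((3, 3), "DD.MM.YY"), ((2, 4), "DD.MM.YY"), ((4, 2), "DD.MM.YY"),
         ((12, 11), "DD-MM-YY"), ((10, 13), "DD-MM-YY"), ((11, 12), "DD-MM-YY")]
print(knn_classify(train, (11, 11)))  # DD-MM-YY
```

With K = 3, a query window whose 3rd and 6th components are both wide lands among the dash examples and is labelled DD-MM-YY, and conversely for narrow dot-like separators.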

VII. CONCLUSION AND FUTURE WORKS
The proposed algorithm shows quite interesting results. It can be clearly seen from Table I that the FRR (False Rejection Ratio) is far lower than the FAR (False Acceptance Ratio); moreover, the efficiency increases as the number of documents considered for testing is increased. The high FAR is due to cases such as those depicted in Figure XI. FRR is mainly due to illegible handwriting, deviations from the normal patterns (or syntax) and the occurrence of double digits (Figure XII).
Since the localisation technique does not involve any recognition process, the overall algorithm can be rated as quite simple and fast. As mentioned earlier, the prescribed algorithm could be modified to localise more classes of dates.
Future work includes studying similar patterns among alpha-numeric date formats and addressing the failure to localise (numerical) dates containing 'double digits'.

Figure XI: cases which increase FAR; the script shown bears the same pattern as a date.
Figure XII: the case of double digits; the digits '2' and '0' are interconnected.