An Interactive Tool for Writer Identification Based on an Offline Text-Dependent Approach

Writer identification is the process of identifying the writer of a document based on their handwriting. The growth of computational engineering, artificial intelligence and pattern recognition owes much to the highly challenging problem of handwriting identification. This paper proposes a computational intelligence technique for developing a discriminative model for writer identification based on handwritten documents. Scanned images of handwritten documents are segmented into words, and these words are further segmented into characters, for word level and character level writer identification. A set of features is extracted from the segmented words and characters. The feature vectors are used to train a support vector machine, which achieves 94.27% accuracy at word level and 90.10% at character level. An interactive tool has been developed based on the word level writer identification model.


Introduction
The significance and scope of writer identification are becoming more prominent these days. Identification of a writer is essential in areas like forensic expert decision-making systems, biometric authentication in information and network security, digital rights administration, document analysis systems and also as a strong tool for physiological identification purposes.
In forensic science, writer identification is used to authenticate documents such as records, diaries, wills and signatures, and also in criminal justice. The digital rights administration system is used to protect the copyrights of electronic media. There are two broad categories of biometric modalities: physiological biometrics, which perform person identification by measuring a physical property of the human body (e.g. fingerprint, face, iris, retina, hand geometry), and behavioral biometrics, which use individual traits of a person's behavior for identification (e.g. voice, gait, signature, handwriting). Writer identification therefore falls under the category of behavioral biometrics. Handwritten document analysis is applied in fields of information retrieval either textually or graphically [1].
Writer identification can be broadly classified into two modes, online and offline. In the online mode, the writing behavior is captured directly from the writer and converted to a sequence of signals using a transducer device, whereas in the offline mode the handwritten text is used for identification in the form of scanned images. Offline writer identification is widely considered more challenging than online identification, because the online mode carries more information about the writing style of a person, such as pressure, speed and angle, which is not available in the offline mode.
Writer identification approaches can be categorized into two types: text-dependent and text-independent methods. In text-dependent methods, a writer has to write the identical text to perform identification, but in text-independent methods any text may be used to establish the identity of the writer [2].
Various approaches and techniques have been proposed so far for writer identification. Writer identification using a codebook of connected-component contours and its probability density function (PDF) was proposed in [3], which reports better identification rates when connected-component contours are combined with an independent edge-based orientation and curvature PDF. In [4], eleven macro-features and micro-features were used for writer identification. The authors of [8] used a set of features extracted from lines of text that correspond to visible characteristics of the writing, such as width, slant and height of the three main writing zones, together with features based on the fractal behavior of the writing. A system for writer identification using textural features derived from the gray-level co-occurrence matrix and Gabor filters is described in [12]. In [14], morphological features obtained by transforming the projection of the thinned writing were computed and used for writer identification. An HMM based approach for writer identification and verification, which builds an individual recognizer for each writer and trains it with text lines of that writer, was proposed in [15]. The system developed in [16] for writer identification and verification takes two pages of handwritten text as input and determines whether the same writer has written both pages; it uses features like character height, stroke width, writing slant and skew, and frequency of loops and blobs.
This research proposes text-dependent writer identification based on scanned images of English handwriting. The scanned images are segmented into words and these words are further segmented into characters, on which pre-processing and feature extraction tasks are performed.
Features used in existing research work, such as edge based features, word measurements and moment invariants, are taken into account. Edge based features are computed from the edge-detected image; the edge direction distribution and the edge hinge distribution are the two edge based features. The word measurement features are the length of the word, the height of the word, the height from baseline to upper edge, the height from baseline to lower edge, and the ascender and descender baselines. Moment invariants provide a set of seven moments for a given image.
This paper includes additional features that were not taken into consideration in [1]. They are character level features like aspect ratio, loops, junctions and end points. The proposed work is implemented using a Support Vector Machine, a supervised learning technique, and an interactive tool has been developed.

Proposed Writer Identification Model
The basic property of handwriting that makes writer identification possible is the existence of writer invariants. The writer's invariants, which reflect the writing style or individuality of the handwriting, can be defined as the set of similar patterns. At the same time, no two samples from a writer are exactly the same. Hence accurate prediction of the writer is an important and challenging task. It is therefore proposed to design and develop a tool for recognizing a writer based on his / her handwriting using pattern recognition techniques.
The essential tasks of writer identification are data acquisition, scanning, segmentation, feature extraction, training and writer recognition. The architecture of the proposed system is shown in Fig. 1.

Data Acquisition
Data acquisition is an important task in writer identification. In order to acquire acceptable data, identical words written with the same pen by different writers have been used. The collected words are not case sensitive and are scanned using a scanner at a resolution of 300 dpi. A total of 1000 JPEG text images are obtained from 10 writers of different age groups, with 100 words per writer.

B. Pre-Processing
In pre-processing, segmentation, noise removal, binarization, edge detection and thinning operations are performed.
Segmentation: The scanned handwritten document containing 100 words of a writer is segmented into words using the edge pixels. These words are further segmented into characters until 100 characters per writer are obtained.
Noise Removal: Noise in an image is removed using median filtering.
Binarization: The gray scale image is converted into a binary image using Otsu's method.
Edge Detection: Edges in the binary image are detected using the Sobel method.
Thinning: Morphological operations are used for thinning the binary image.
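As an illustration of the binarization step, the following is a minimal sketch of Otsu's method in plain Python, not the MATLAB code used in this work. It assumes an 8-bit gray scale image stored as a list of rows and assumes that ink is darker than the background; the function names otsu_threshold and binarize are hypothetical.

```python
def otsu_threshold(gray):
    """Return the gray level (0-255) that maximises the between-class variance."""
    # Histogram of the 256 possible gray levels.
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]           # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg      # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(gray):
    """Assumption: dark pixels are ink, so values at or below the threshold become 1."""
    threshold = otsu_threshold(gray)
    return [[1 if value <= threshold else 0 for value in row] for row in gray]
```

The threshold is chosen from the image histogram alone, so no per-writer tuning is needed in this sketch.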

Feature Extraction
Feature extraction plays a vital role in improving the classification effectiveness and computational efficiency. A set of distinctive features describing the writing style and writer's invariance is extracted to form a feature vector. The features are described below.

Edge Direction Distribution
In edge direction distribution, the edges of the binary image are first detected using the Sobel detection method. The edge-detected image is labelled using the 8-connected pixel neighbourhood. The number of rows and columns in the binary image is then found using the size function. Next, the first black pixel in the image is found and taken as the center pixel of a square neighbourhood. The presence of a black edge is then checked with the logical AND operator in all directions, starting from the center pixel and ending at any one of the edges of the square. To avoid redundancy, only the upper two quadrants of the neighbourhood are checked, because without online information it is difficult to identify the way the writer travelled along the edge fragment. This gives "n" possible angles. Subsequently, the verified angles of each pixel are counted into an n-bin histogram, which is then normalized to a probability distribution giving the probability of finding in the image an edge fragment oriented at a given angle measured from the horizontal. Here "n" is taken as 4, 8, 12 and 16.
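The procedure above can be approximated by the simplified sketch below. It is an assumption-laden illustration rather than the implementation used in this work: the fragment length (length), the uniform spacing of the n angles over the upper half-plane, and the function name edge_direction_distribution are all choices made for the example.

```python
import math


def edge_direction_distribution(edge, n=8, length=3):
    """Normalised n-bin histogram of edge-fragment orientations in [0, 180) degrees.

    edge   -- 2-D list of 0/1 values, 1 marking an edge pixel
    n      -- number of direction bins
    length -- number of pixels a fragment must extend from the centre pixel
    """
    rows, cols = len(edge), len(edge[0])
    hist = [0] * n
    for r in range(rows):
        for c in range(cols):
            if not edge[r][c]:
                continue
            for k in range(n):
                theta = math.pi * k / n                 # upper two quadrants only
                dx, dy = math.cos(theta), -math.sin(theta)  # row index grows downwards
                scale = max(abs(dx), abs(dy))
                dx, dy = dx / scale, dy / scale         # one pixel per step on the major axis
                on_edge = True
                for t in range(1, length + 1):
                    rr, cc = r + round(t * dy), c + round(t * dx)
                    # Logical AND over every pixel of the candidate fragment.
                    on_edge = (on_edge and 0 <= rr < rows and 0 <= cc < cols
                               and bool(edge[rr][cc]))
                if on_edge:
                    hist[k] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * n
```

For a purely horizontal stroke all of the probability mass falls in the first bin, and for a purely vertical stroke it falls in the bin for 90 degrees.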

Edge Hinge Distribution
To capture the curvature of the ink trace, which is very distinctive for different writers, the edge hinge distribution is used; it is calculated from local angles along the edges. The edge hinge feature considers two edge fragments emerging from the center pixel and computes the joint probability distribution of the orientations of the two fragments of a "hinge". The normalized histogram gives the joint probability distribution for "hinged" edge fragments oriented at the angles φ1 and φ2. The orientation is counted in 16 directions for a single angle. From the total number of combinations of two angles, only non-redundant values are considered and the common ending pixels are eliminated.

Run Length Distribution
Run lengths are determined on the binarized image, taking into consideration either the black pixels corresponding to the ink trace or the white pixels corresponding to the background. There are two scanning procedures: horizontal, along the rows of the image, and vertical, along the columns of the image. The probability distribution is then estimated by the normalized histogram of run lengths. The run lengths provide information orthogonal to the directional features.
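A minimal sketch of the run length histogram is given below. It assumes a binary image stored as a list of rows with 1 for ink; the function name run_length_distribution and the max_run cap are hypothetical details introduced for the example.

```python
def run_length_distribution(binary, value=1, horizontal=True, max_run=None):
    """Normalised histogram of run lengths of `value` along rows or columns."""
    # Vertical scanning is horizontal scanning of the transposed image.
    lines = binary if horizontal else [list(col) for col in zip(*binary)]
    if max_run is None:
        max_run = max(len(line) for line in lines)
    hist = [0] * (max_run + 1)              # index = run length; index 0 stays empty
    for line in lines:
        run = 0
        for pixel in list(line) + [None]:   # the sentinel closes a run at the line end
            if pixel == value:
                run += 1
            elif run:
                hist[min(run, max_run)] += 1
                run = 0
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * len(hist)
```

Calling the function with value=0 gives the white (background) run lengths, and horizontal=False gives the vertical scan.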

Auto-Correlation
Auto-correlation function identifies the presence of predictability in writing. By giving the offset value, every row of the image is shifted onto itself. Then the normalized dot product is found between the original row and the shifted row. Autocorrelation function is computed for all rows and the sum is normalized to obtain a zero-lag correlation of 1.
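The computation can be sketched as follows. The text does not say how pixels shifted past the end of a row are handled, so a circular shift is assumed here; the function name autocorrelation is also an assumption.

```python
def autocorrelation(image, max_offset):
    """Row-wise autocorrelation summed over all rows, scaled so that lag 0 equals 1."""
    sums = [0.0] * (max_offset + 1)
    for row in image:
        n = len(row)
        for offset in range(max_offset + 1):
            # Dot product of the row with a copy of itself shifted by `offset`
            # (assumption: the shift wraps around the row).
            sums[offset] += sum(row[i] * row[(i + offset) % n] for i in range(n))
    zero_lag = sums[0]
    return [s / zero_lag for s in sums] if zero_lag else sums
```

A stroke pattern that repeats every two pixels, for example, gives a value of 1 at offsets 0 and 2 and a value of 0 at offset 1.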

Entropy
Entropy provides the average information of an image such as luminance, contrast and pixel value. It is calculated using the formula:
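A minimal sketch, assuming the standard Shannon entropy of the pixel-value histogram, H = -Σ p_i log2(p_i), where p_i is the fraction of pixels with value i. The base-2 logarithm and the function name image_entropy are assumptions made for this example.

```python
import math


def image_entropy(image):
    """Shannon entropy (in bits) of the distribution of pixel values in the image."""
    counts = {}
    total = 0
    for row in image:
        for value in row:
            counts[value] = counts.get(value, 0) + 1
            total += 1
    # Values that never occur contribute nothing, so only observed values are summed.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

An image with a single pixel value has zero entropy, while an image whose pixels are spread evenly over four values has an entropy of 2 bits.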

Moment Invariants
Geometric moment invariants are commonly used in pattern recognition. A distinctive set of features calculated for an object must be able to identify the same object at a different size and orientation. Moment invariants can be used to recognize an object when it undergoes such transformations. Here seven moment invariants are computed.

Length
The length of the word and character is found by successively scanning each column in the binary image to find the first and last pixels in the image and storing their column numbers. The length of the image is calculated by subtracting the column number of the first pixel from the column number of the last pixel.

Fig. 2 Length of the character

Height
The height of the word and character is found by consecutively scanning each row in the binary image. The first and last pixels of the image are found and the corresponding row numbers are stored. The height of the image is computed by subtracting the row number of the first pixel from the row number of the last pixel.

Fig. 3 Height of the character

Area
Area of the word and character is calculated as the product of height and length.

Height from baseline to upper edge
The height of the text from the baseline to the upper edge is calculated by first determining the baseline position of the image. This is done by creating an array indexed by the row number in the image. The number of black pixels in each row is then counted and stored in the array. After the entire image has been processed, the maximum value in the array is identified and the corresponding row number is stored as the baseline. The height of the image from the baseline to the upper edge is computed by subtracting the row number of the first pixel in the image from the row number of the baseline.

Fig. 4 Length from baseline to upper edge

Height from baseline to lower edge
The height of the binary image from the baseline to the lower edge is determined by calculating the baseline row number, as above. Then, the row number of the last pixel of the image is considered. The height of the image from the baseline to the lower edge is calculated by subtracting the baseline row number from the row number of the last pixel.
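The length, height, area and baseline measurements described so far can be sketched together as follows. The sketch assumes a binary image stored as a list of rows with 1 for ink, and that the first row with the maximum ink count is taken as the baseline when several rows tie; the function name word_measurements and the dictionary keys are hypothetical.

```python
def word_measurements(binary):
    """Length, height, area and baseline heights of a 0/1 image (1 = ink)."""
    ink_rows = [r for r, row in enumerate(binary) if any(row)]
    ink_cols = [c for c, col in enumerate(zip(*binary)) if any(col)]
    if not ink_rows:
        raise ValueError("image contains no ink pixels")
    first_row, last_row = ink_rows[0], ink_rows[-1]
    first_col, last_col = ink_cols[0], ink_cols[-1]
    length = last_col - first_col           # last column minus first column
    height = last_row - first_row           # last row minus first row
    # Baseline: the row holding the largest number of ink pixels.
    row_counts = [sum(row) for row in binary]
    baseline = row_counts.index(max(row_counts))
    return {
        "length": length,
        "height": height,
        "area": length * height,
        "baseline_row": baseline,
        "baseline_to_upper": baseline - first_row,
        "baseline_to_lower": last_row - baseline,
    }
```

Because the values are differences of row and column indices, a stroke that is one pixel wide has a length of 0 under these definitions.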

Fig. 5 Length from baseline to lower edge

Ascender and descender baseline
The ascender baseline is the first non-zero entry, and the descender baseline the last non-zero entry, of the vertical histogram of the line.

Fig. 6 Ascender and Descender Baseline

Aspect Ratio
Aspect ratio is considered as one of the global features in writer identification. It is calculated as ratio of width to height.

End Points
An end point has only one pixel in its 8-pixel neighborhood. End points are computed using an end point function, which gives the number of end points in the thinned image.

Junctions
Junctions occur where two strokes meet or cross and are found in the skeleton as points with more than two neighbors. The junction function produces the number of junctions, the position of each junction, and the angle and distance between the junctions of the thinned image.
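The neighbor-counting rules for end points and junctions can be sketched as below. This is a simplified illustration on a thinned 0/1 image, not the functions used in this work; the name skeleton_points is an assumption, and only positions are returned (angles and distances between junctions are omitted).

```python
def skeleton_points(skeleton):
    """Return (end_points, junctions) of a thinned 0/1 image as lists of (row, col)."""
    rows, cols = len(skeleton), len(skeleton[0])
    end_points, junctions = [], []
    for r in range(rows):
        for c in range(cols):
            if not skeleton[r][c]:
                continue
            # Count ink pixels among the (up to) 8 neighbours of (r, c).
            neighbours = sum(
                skeleton[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            )
            if neighbours == 1:
                end_points.append((r, c))
            elif neighbours > 2:
                junctions.append((r, c))
    return end_points, junctions
```

With plain neighbor counting, pixels directly next to a crossing can also have more than two neighbors, so a practical implementation may need to merge adjacent junction pixels into one junction.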

Loops
The loops of a character are the major distinguishing feature for many writers. The loop function gives loop length, angle of loop, position of the loop, area and average radius of the loop of the edge image.

Slope Angle
The slope of angle A is the ratio of height to length. In geometry, it is also referred to as the tangent of the angle A and denoted by tan(A), which gives us the slope angle.

Slant Angle
It is the angle of the word forms against the baseline. It is estimated from structural features: the maxima and minima of the word are detected and used to obtain a uniform slant angle estimate.
Thus a total of 26 features are extracted from a single word image. The same set of 26 features is extracted from each character image. Finally, two independent training datasets, each consisting of 1000 instances, have been developed using MATLAB.

Support Vector Machine
Support vector machine is a training algorithm for learning classification and regression rules from data. SVM is very suitable for working accurately and efficiently with high dimensionality feature spaces. The machine is presented with a set of training examples, (xi,yi) where the xi are the real world data instances and the yi are the labels indicating which class the instance belongs to. For the two class pattern recognition problem, yi = +1 or yi = -1.
SVMs construct a hyperplane that separates two classes and tries to achieve maximum separation between the classes. Separating the classes with a large margin minimizes a bound on the expected generalization error. The simplest model of SVM, called the Maximal Margin classifier, constructs a linear separator (an optimal hyperplane) given by w^T x - γ = 0 between two classes of examples. The free parameters are a vector of weights w, which is orthogonal to the hyperplane, and a threshold value γ. These parameters are obtained by solving the following optimization problem using Lagrangian duality.
Here D_ii corresponds to the class labels +1 and -1. The instances with non-null weights are called support vectors. In the presence of outliers and wrongly classified training examples it may be useful to allow some training errors in order to avoid overfitting. A vector of slack variables ξ_i that measure the amount of violation of the constraints is introduced and the optimization problem referred to as soft margin is given below.

Minimize
In this formulation the contribution to the objective function of margin maximization and training errors can be balanced through the use of regularization parameter c. The following decision rule is used to correctly predict the class of new instance with a minimum error.
The advantage of the dual formulation is that it permits efficient learning of non-linear SVM separators by introducing kernel functions. Technically, a kernel function calculates a dot product between two vectors that have been (non-linearly) mapped into a high dimensional feature space. Since there is no need to perform this mapping explicitly, the training is still feasible although the dimension of the real feature space can be very high or even infinite. The parameters are obtained by solving the non-linear SVM formulation in matrix form, where Q = DKD and K is the kernel matrix. The kernel function K(A, A^T) (polynomial or Gaussian) is used to construct a hyperplane in the feature space, which separates the two classes linearly, by performing computations in the input space. The decision function is expressed in terms of u, the Lagrangian multipliers.
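The kernel idea can be illustrated with a small sketch. The parameter names d and gamma follow the text; the additive constant coef0 in the polynomial kernel and all function names are assumptions made for the example, and the exact kernel parameterisation of the software used in this work may differ.

```python
import math


def linear_kernel(a, b):
    """Plain dot product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, d=2, coef0=1.0):
    """(a.b + coef0)^d; coef0 is an assumed additive constant."""
    return (linear_kernel(a, b) + coef0) ** d


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def explicit_quadratic_map(v):
    """Feature map whose dot product equals (a.b + 1)^2 for 2-D inputs."""
    x1, x2 = v
    s = math.sqrt(2.0)
    return [1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2]
```

For two-dimensional inputs, polynomial_kernel(a, b, d=2) gives the same value as the dot product of explicit_quadratic_map(a) and explicit_quadratic_map(b), which shows that the kernel works in the six-dimensional feature space without ever constructing it.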
When the number of class labels is more than two, the binary SVM can be extended to multi class SVM. One of the indirect methods for multiclass SVM is one versus rest method. For each class a binary SVM classifier is constructed, discriminating the data points of that class against the rest. Thus in case of N classes, N binary SVM classifiers are built. During testing, each classifier yields a decision value for the test data point and the classifier with the highest positive decision value assigns its label to the data point. The comparison between the decision values produced by different SVMs is still valid because the training parameters and the dataset remain the same.
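The one versus rest decision rule can be sketched as follows, assuming for simplicity that each binary classifier is linear and stored as a pair (w, γ) with decision value w^T x - γ. The function name one_vs_rest_predict and the writer labels in the usage note are hypothetical.

```python
def one_vs_rest_predict(x, classifiers):
    """Return the label whose binary classifier gives the highest decision value.

    classifiers -- dict mapping a class label to (weights, threshold) of a
                   linear separator with decision value w.x - threshold
    """
    def decision(weights, threshold):
        return sum(w * xi for w, xi in zip(weights, x)) - threshold

    return max(classifiers, key=lambda label: decision(*classifiers[label]))
```

With three classifiers labelled "writer_1" to "writer_3", a test point is assigned to whichever writer's classifier places it furthest on the positive side of its hyperplane.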

Experiment And Results
Two independent experiments have been carried out, one for word level writer identification and another for character level writer identification. The two training datasets that have been developed for word level and character level are used for implementation. The datasets are normalized using min-max normalization and the normalized datasets are used for training the SVM.
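Min-max normalization rescales every feature to the range 0 to 1. A minimal sketch is given below; the handling of constant features (mapped to 0) and the function name min_max_normalize are assumptions made for the example.

```python
def min_max_normalize(dataset):
    """Rescale each feature column of a list of feature vectors to [0, 1]."""
    columns = list(zip(*dataset))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # Assumption: a feature with no spread is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in dataset
    ]
```

This keeps features with large numeric ranges, such as area, from dominating features with small ranges, such as aspect ratio.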
The normalized datasets are trained independently using SVM light with linear, polynomial and RBF kernels and with different values of the regularization parameter C. The parameters d and gamma are associated with the polynomial kernel and the RBF kernel respectively. The performance of the trained models is evaluated using 10-fold cross validation for predictive accuracy and learning time. The prediction accuracy is the ratio of the number of correctly classified instances in the test dataset to the total number of test cases.
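The evaluation protocol can be sketched as follows. How the instances were assigned to folds is not stated, so a simple interleaved assignment without shuffling or stratification is assumed; the function names k_fold_indices and accuracy are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; sample i is placed in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


def accuracy(predicted, actual):
    """Ratio of correctly classified instances to the total number of test cases."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With 1000 instances and k = 10, each fold trains on 900 instances and tests on the remaining 100, and the reported accuracy is the average over the ten folds.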

Word Level Writer Identification
The regularization parameter C is assigned values between 0.5 and 50 for the linear kernel. For the polynomial and RBF kernels, C is assigned the values 0.5, 1 and 5; d is varied from 1 to 4 for the polynomial kernel and gamma (g) from 0.5 to 5 for the RBF kernel. It is found that the model performs better for the value C = 5.
The results of the word level writer identification model based on SVM with linear kernel are shown in Table I.

Character Level Writer Identification
The training dataset that has been developed using character level images is used here for SVM learning. The parameter settings for character level training are the same as for word level training.
The results of the character level writer identification model based on SVM with linear kernel are shown in Table IV. The results of the character level writer identification model based on SVM with RBF kernel are shown in Table VI.

Comparative Analysis
The average and comparative performance of SVMs with various kernels for word level and character level writer identification is given in Table VII and shown in Fig. 7 and Fig. 8. From this comparative analysis, the predictive accuracy of the SVM with polynomial kernel is higher than that of the linear and RBF kernels. The time taken to build the model using the SVM with polynomial kernel is longer than with the linear and RBF kernels. For writer identification, accuracy plays a more important role than learning time.
It is also found that the SVM based prediction model achieves about 94.27% predictive accuracy for word level writer identification and 90.10% for character level writer identification. Hence word level writer identification performs better than character level writer identification. The SVM model with the polynomial kernel, which produces the higher accuracy, is used for developing the writer identification tool.

Writer Identification Tool
An interactive writer identification tool is developed using MATLAB by incorporating the SVM model into a GUI. This tool is used to predict the writer when his/her offline handwriting of a new word is given as input. Screenshots of the writer identification tool are shown in Fig. 9 to 12.

Fig. 12 Predicting the writer

Conclusion
This paper describes the modeling of the writer identification problem as a classification task. Two independent datasets have been prepared in order to facilitate training and implementation. The outcome of the experiments indicates that the SVM with polynomial kernel for word level writer identification predicts the writer of the handwritten document more accurately than the other models. Based on this model, a writer identification tool has been developed to predict the writer. An online text-independent approach can be implemented as future work.