Scale-based Local Feature Selection for Scene Text Recognition

Abstract: Scene text recognition has drawn increasing attention from the OCR community in recent years. Among the numerous methods that have been proposed, local-feature-based methods, represented by bag-of-features (BoFs), show notable robustness and efficiency. However, because existing detectors rely on assumptions about local saliency, a vast number of non-informative local features are detected in the feature detection stage. In this paper, we propose to remove non-informative local features by integrating feature scales with stroke width information. Experiments on both synthetic data and real scene data show that the proposed feature selection method can effectively filter non-informative features and improve recognition accuracy.


I. INTRODUCTION
In recent years, scene text recognition (STR) [1] technologies have attracted increasing attention from the OCR community and other related fields. Compared with surrounding text, scene text is in most cases more closely connected to the image content. The rich semantic information contained in scene text therefore often plays a vital role in a host of computer vision applications, including assistance for impaired people, visual landmark-based robot navigation and intelligent traffic systems.
Although numerous potential applications exist, STR is still challenging for the following reasons: (1) the scale of scene text varies a lot, even within the same sentence; (2) shapes and styles may differ, since scene text is specially designed to fit different requirements; (3) scene images often contain illumination changes, viewpoint variations and other difficulties such as non-flat surfaces; (4) in most cases, no context information is provided.
During the past decades, a number of methods have been proposed in response to these difficulties. The existing methods in the STR area can be divided into two categories according to their basic ideas. The first is to achieve accurate STR by developing traditional OCR methods. Most approaches following this idea contain three procedures: text detection, segmentation and character recognition. For instance, Chen and Yuille [2] train a strong classifier that combines multiple features by integrating weak classifiers with AdaBoost to extract text regions; the text is then recognized using commercial OCR software. Coates et al. [3] apply a scalable learning algorithm to feature extraction, text detection and classification to build a highly accurate STR system. Kai et al. [4] designed an end-to-end system for scene text recognition, in which Random Ferns [5] are used both as the raw character detector and as the classifier. Moreover, they proposed to improve the accuracy of STR by introducing a pre-defined vocabulary.
The other idea is to treat scene text as objects. Researchers can then transplant into the STR area object recognition methods, most of which are designed to cope with image degradation and uncontrolled environments. For example, de Campos et al. [6] build an STR framework that follows classic BoFs methods, in which sample images are described by frequency histograms of local features. They also compare the effectiveness of different local descriptors through experiments on a representative benchmark. Zheng et al. [7] recognize scene characters by matching detected SIFT [8] features between input samples and pre-built template images. Unlike the BoFs method, which completely omits position information, they consider the relative position of local features by using MPLSH [9]. Diem and Sablatnig [10] build a historical document analysis system based on local descriptors and achieve state-of-the-art accuracy for ancient character recognition.
Among these methods, the ones based on local features [6], [7], [10] show notable robustness and effectiveness, especially in small-sample-size situations and in situations involving image degradation [11]. They are more robust because they represent sample images using sets of local features and omit other highly variable factors. Their accuracy therefore depends largely on the effectiveness of the detected local features. However, although most local feature detectors assume that salient image patches are informative, the meaning of "effective" differs between applications. Specific to our problem, not all detected salient image patches reflect local structures of characters. Thus, to improve accuracy, criteria are needed to filter out features that are not related to the text.
In this paper, we focus on local-feature-based STR and propose a novel criterion that integrates stroke width information with local feature scales to remove non-informative local features and achieve higher accuracy. Our idea is based on the fact that text is constituted by strokes of a specific width. Thus there should be an appropriate proportion between the local feature scale and the corresponding stroke width if a feature reflects a local text structure such as a corner or a cross. Experiments on both natural and synthetic text images show that the proposed approach can effectively improve the accuracy of local-feature-based STR.

II. RELATED WORKS
Many techniques have been developed for filtering redundancy and noise from an original feature set. In this paper, we specifically consider methods based on the codebook model. A classical codebook method includes local feature detection, codebook generation, quantization and, finally, classification. Most efforts on feature selection are made in the codebook generation and code-word selection stages. In this section, we briefly introduce typical existing methods by category and, at the end, discuss the differences between these methods and the proposed method.

A. Compact Codebook Generation
In the codebook generation stage, the algorithm seeks a group of code words (also referred to as a 'codebook') that can describe the feature space effectively. A vast number of methods have been proposed to generate an effective codebook. For instance, Tuytelaars and Schmid [12] extract high-dimensional descriptors for sample images by partitioning the feature space using lattices of regular size, and then combine similar dimensions to make the descriptors more compact. The most widely applied idea is to obtain the codebook with unsupervised clustering algorithms such as k-means [13], which finds the k most descriptive centers by minimizing the variance between the k centers and the training data. Unlike k-means, which is sensitive to sample density, Jurie and Triggs [14] proposed a radius-based clustering that assigns all features within a fixed similarity radius to one cluster.
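As an illustration of the k-means idea mentioned above, the following is a minimal sketch in plain Python. It is not the implementation used in the cited works; the function names, the fixed iteration count, the seeded random initialization and the toy two-dimensional "descriptors" are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_codebook(descriptors, k, iterations=20, seed=0):
    """Return k code words (cluster centers) found by Lloyd's k-means.

    Initialization by sampling k training descriptors is an assumption
    of this sketch, not something specified in the paper.
    """
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        # Assignment step: attach every descriptor to its nearest center.
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            nearest = min(range(k), key=lambda c: squared_distance(d, centers[c]))
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centers[c] = [sum(m[j] for m in members) / len(members)
                              for j in range(dim)]
    return centers

# Toy data: two well-separated groups of 2-D "descriptors".
toy = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
       (5.0, 5.1), (5.1, 5.0), (4.9, 4.9)]
codebook = kmeans_codebook(toy, k=2)
```

On real data the descriptors would be, for example, 128-dimensional SIFT vectors, and a library implementation would normally be preferred over this loop.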

B. Code-word Selection
Besides generating a compact codebook, a host of algorithms have been proposed for picking the most effective subset of the original codebook. Code-word selection is equivalent to a feature selection problem, since sample images are represented by frequency histograms of code words and each bin corresponds to a feature dimension. Depending on whether class labels are given, existing methods can be divided into supervised and unsupervised ones.
Supervised methods analyze the relationship between class labels and code words and then pick a more discriminative subset based on pre-defined criteria. Reference [14] gives a performance evaluation of three typical methods, namely MI [15], OR [16] and linear SVM weights [17], on representative datasets. Moosmann et al. [18] proposed to build supervised indexing trees using an ERC-Forest that considers semantic labels as stopping tests. The work in [19] aims to find the Descriptive Visual Words (DVWs) and Descriptive Visual Phrases (DVPs) for each image category.
For unsupervised situations, Zhang et al. [20] proposed to pick out the most discriminative code words, namely those that lead to minimal fitting errors between the data matrix and the indicator matrix. Maximum variance selects the features with the largest variances, and unsupervised feature selection for PCA selects a subset of features that can best reconstruct the other features. The Laplacian score [21] selects the features that best preserve the local geometrical structure. Q-α [22] measures cluster coherence by analyzing the spectral properties of the affinity matrix.

C. Proposed Method
Different from the above methods, the method proposed in this paper filters non-informative features by performing a pre-selection based on both feature scale and stroke width information. Its advantage is that the algorithm operates before the codebook generation stage and thus avoids errors that would otherwise propagate through the subsequent stages. This means the proposed method can be more effective in small-sample-size problems, which are common in STR and historical document analysis.

III. SCALE-BASED LOCAL FEATURE SELECTION
The fundamental assumption behind most local feature detectors is that salient image patches are informative. In fact, the meaning of 'informative' differs between situations. Specifically, in the STR process, there is no guarantee that every salient patch actually reflects character structure. Thus criteria are needed to remove features that are not effective.
According to whether they help distinguish different characters, we divide detected local features into informative and non-informative ones. Features in the first category are always located inside character bounding boxes, and they are salient because they contain character structures such as corners and stroke crossings. In contrast, most features in the second category are generated by cluttered background and noise, and thus provide no information for STR. It is worth emphasizing that large local features covering the majority of a character should also be placed in the second category, since such features are not robust enough when numerous variations are present.
However, it is difficult to remove non-informative local features automatically, because it is difficult to give a formal definition of a non-informative feature. The goal could be achieved by training a binary classifier that distinguishes non-informative features from informative ones; however, a large number of training samples are needed to train such a classifier, and the existence of various fonts makes sample collection rather difficult. Moreover, labeling all features manually is labor-intensive and hardly objective. Another idea is to optimize the learned codebook according to class labels, as discussed in Section II, which relies on sophisticated mathematical models. These methods, which select features by analyzing the relationship between code words and class labels, also need a large training dataset.
In this paper, we propose a novel local feature selection criterion that selects effective local features based on the ratio between character stroke width and local feature scale.

A. Feature Scale and Stroke Width
Our idea is based on the observation that it is impossible to write small characters with wide strokes or large characters with thin strokes. Thus the ratio between the character size $c_s$ and the stroke width $w$ in a text area should stay within a reasonable range for the character to be recognizable. At the same time, for each detected local feature that reflects a local structure of a character, its scale $f_s$ should also be in direct proportion to the character size. This means that, for a reasonable character, the scale of a representative local feature should have a stable ratio $r$ with the stroke width $w$. Based on this idea, we can filter non-effective features by checking whether the ratio $r$ lies in an interval $[r_{\min}, r_{\max}]$.
The reason we do not directly use character size for feature selection is that local structures are directly constituted by strokes, so the ratio between stroke width and feature scale is more stable than the ratio between character size and feature scale. Moreover, stroke width is more accurate than character size for two reasons. Firstly, segmentation in scene images is difficult, which leads to inaccurate character sizes. Secondly, characters of the same size have different stroke widths because multiple fonts exist.
To verify this, we compute frequency histograms of the detected local features according to their feature scales and their ratio parameters, respectively. The definition of stroke width and the calculation of the ratio parameters are described in detail in Section IV. Fig. 1(a) shows the frequency of the local feature scale, and Fig. 1(b) gives the frequency of the ratio between feature scale and the corresponding stroke width. We find that the ratio parameter follows a consistent long-tail distribution, which confirms that a relationship exists between local feature scale and stroke width.
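The interval test on the ratio can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_by_ratio`, the representation of a feature as a `(scale, stroke_width)` pair and the example bounds are hypothetical, and the ratio is taken as feature scale divided by stroke width, as in the caption of Fig. 1.

```python
import math

def filter_by_ratio(features, r_min, r_max):
    """Keep features whose ratio r = scale / stroke_width lies in [r_min, r_max].

    `features` is a list of (scale, stroke_width) pairs. Features with a
    non-positive or infinite stroke width (no stroke found) are discarded.
    """
    kept = []
    for scale, stroke_width in features:
        if stroke_width <= 0 or not math.isfinite(stroke_width):
            continue
        r = scale / stroke_width
        if r_min <= r <= r_max:
            kept.append((scale, stroke_width, r))
    return kept

# Hypothetical example: bounds chosen only for illustration.
example = [(4.0, 2.0), (40.0, 2.0), (0.5, 5.0), (6.0, math.inf)]
selected = filter_by_ratio(example, r_min=0.5, r_max=5.0)
```

Here only the first feature survives: the second is far larger than its stroke, the third far smaller, and the fourth has no measurable stroke width.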

B. Scale-based Local Feature Selection
Typical local feature detectors such as SIFT and Multi-Scale Harris contain three stages. In the first stage, for each pixel $I_{(i,j)}$ in an image $I$, its local saliency $H$ at scale $s$ is evaluated using a measurement function $F$. Denoting the neighborhood of point $I_{(i,j)}$ at scale $s$ as $r(i, j, s)$, we have:

$$H(i, j, s) = F(r(i, j, s)) \qquad (1)$$

Then the algorithm searches for local extrema across both the spatial and scale spaces to find local maxima as candidate feature points, which we denote as $C$. Finally, a global thresholding process is applied to $C$ according to the following equation:

$$L_{i,j} = \begin{cases} 1, & (i, j, s) \in C \text{ and } H(i, j, s) > s_{th} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $L_{i,j}$ indicates whether pixel $(i, j)$ is the center of an acceptable local feature and $s_{th}$ is the threshold of feature saliency. Unlike the above process, which considers local saliency only, our work also considers the relationship between the feature scale $s$ and the stroke width $w$.
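The candidate search and global thresholding described above can be sketched as follows. This is a simplified illustration, not the SIFT or Multi-Scale Harris implementation: the saliency volume `H[s][i][j]` is assumed to be precomputed, the 3x3x3 neighborhood is clipped at the borders (real detectors usually skip border samples), and the function name `detect_features` is hypothetical.

```python
def detect_features(H, s_th):
    """Return (i, j, s) triples that are strict local maxima of the saliency
    volume H[s][i][j] within their 3x3x3 neighborhood and exceed s_th."""
    n_scales, rows, cols = len(H), len(H[0]), len(H[0][0])
    accepted = []
    for s in range(n_scales):
        for i in range(rows):
            for j in range(cols):
                value = H[s][i][j]
                if value <= s_th:
                    continue  # global saliency threshold
                # Collect all neighbors in scale and space, clipped at borders.
                neighbours = [
                    H[ss][ii][jj]
                    for ss in range(max(0, s - 1), min(n_scales, s + 2))
                    for ii in range(max(0, i - 1), min(rows, i + 2))
                    for jj in range(max(0, j - 1), min(cols, j + 2))
                    if (ss, ii, jj) != (s, i, j)
                ]
                if all(value > n for n in neighbours):
                    accepted.append((i, j, s))
    return accepted

# Toy saliency volume: 3 scales of 5x5 with a strong and a weak peak.
H = [[[0.0] * 5 for _ in range(5)] for _ in range(3)]
H[1][1][1] = 5.0
H[1][3][3] = 2.0
strong_only = detect_features(H, s_th=3.0)
both_peaks = detect_features(H, s_th=1.0)
```

Raising `s_th` discards the weaker peak, which is the only mechanism a saliency-only detector has for rejecting features.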
Thus the probability that a local region is effective can be described as $P(H, s, w)$. According to Bayes' formula, we have

$$P(H, s, w) = P(H \mid s, w)\, P(s, w) \qquad (3)$$

Noting that the calculation of the local saliency $H$ is independent of the stroke width $w$, the probability $P(H \mid s, w)$ can be simplified to $P(H \mid s)$. Furthermore, in this paper, we describe the relationship $P(s, w)$ between $s$ and $w$ by a sign function of the ratio $r$, written $P(r)$, and use another sign function to describe $P(H \mid s)$, which gives

$$L_{i,j} = P(H \mid s)\, P(r) \qquad (4)$$

Thus we can give the feature selection algorithm based on the above analysis. According to Algorithm 1, we can improve accuracy and efficiency by removing non-informative local features. Section IV demonstrates the effect of the proposed algorithm.

IV. EXPERIMENTS

A. Experiment Setup
1) Experimental Data: To show that local features with proper scales are more effective, we conduct experiments on a representative benchmark referred to as 'char74k' [6]. The 'char74k' dataset contains both synthetic and natural samples. Synthetic samples include 52 classes of English characters (capital and lowercase letters) and 10 classes of digits (0 to 9). For each class, 1016 character samples are generated from 256 different system fonts in 4 different styles. For natural samples, characters are cropped manually from scene images. Fig. 2 shows some typical samples of the 'Fnt' data and 'NS' data in this benchmark. This dataset is selected for two reasons. Firstly, it contains typical scene character samples that are segmented manually and labeled in detail. Secondly, the synthetic data can be used as a baseline in our experiment, since these samples provide accurate stroke width information and all detected local features are useful for character recognition. Moreover, in addition to the above benchmark, we collect our own Chinese word dataset (referred to as 'CH' in the rest of this paper) using an Internet search engine with 12 different keywords. For each text image obtained, accurate text regions are cropped and labeled manually. Examples of CH data are shown in Fig. 2(c).
2) Local Feature Detection: We employ two typical detectors, namely Hessian-affine and difference of Gaussians (DoG). According to [6], the combination of the DoG detector and the SIFT descriptor performs much better than the others.
3) Stroke Width Extraction: In this paper, stroke width information is extracted using the stroke width transform (SWT) [23]. For each pixel in a text image, if it lies between two edge pixels with opposite gradient directions, its stroke width value is defined as the distance between these two edge pixels. If more than one pair of edge pixels is found, the stroke width value is set to the minimum distance. Otherwise, when the algorithm cannot find such a pair, the stroke width value is set to infinity. For more details about stroke width extraction, readers can refer to the original paper by Epstein et al. [23]. Two factors should be considered to extract precise stroke widths. The first is that the thresholds of the edge detector (Canny here) should be selected very carefully, since the precision of SWT depends heavily on the results of edge detection. The second is that the algorithm needs to know whether the character pixels are darker than the background or the opposite. In practice, it is easy to assign the edge detector parameters for synthetic data, as these images have high contrast (they are, in fact, binary images). Moreover, in all synthetic samples the text pixels are darker than the background. For natural images, the thresholds of the Canny operator are set much lower in view of the image contrast, and the text-background polarity is assigned manually.
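To make the notion of a per-pixel stroke width concrete, the sketch below computes a crude approximation for binary images. It is explicitly not the SWT of [23]: instead of casting rays along gradient directions between Canny edges, it measures axis-aligned run lengths through each text pixel and keeps the smaller one, so diagonal strokes are overestimated. The function names are hypothetical, and text pixels are assumed to be marked with 1.

```python
import math

def run_lengths(line):
    """For a 0/1 sequence, return at each position the length of the
    foreground run covering it (0 for background positions)."""
    lengths = [0] * len(line)
    start = None
    # A trailing 0 acts as a sentinel that closes a run touching the border.
    for idx, value in enumerate(list(line) + [0]):
        if value and start is None:
            start = idx
        elif not value and start is not None:
            for k in range(start, idx):
                lengths[k] = idx - start
            start = None
    return lengths

def approximate_stroke_width(binary):
    """Per-pixel stroke width estimate for a binary image (list of rows).

    Text pixels get min(horizontal run, vertical run); background pixels
    get infinity, mirroring the 'no stroke found' convention in the text.
    """
    rows, cols = len(binary), len(binary[0])
    horizontal = [run_lengths(row) for row in binary]
    vertical = [run_lengths([binary[i][j] for i in range(rows)])
                for j in range(cols)]
    return [[min(horizontal[i][j], vertical[j][i]) if binary[i][j] else math.inf
             for j in range(cols)]
            for i in range(rows)]

# A vertical bar, 2 pixels wide and 5 pixels tall.
bar = [[0, 1, 1, 0] for _ in range(5)]
widths = approximate_stroke_width(bar)
```

For the vertical bar, every text pixel receives a width of 2, while background pixels receive infinity.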
Based on detected local features and extracted stroke width value, we can calculate the ratio r for each local feature.
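The paper does not spell out how a single stroke width is associated with a feature, so the following sketch makes an assumption: it takes the median of the finite stroke width values inside a circular window whose radius equals the feature scale, and divides the scale by that median. The function name `feature_ratio` and the median rule are hypothetical choices for illustration only.

```python
import math
import statistics

def feature_ratio(stroke_map, x, y, scale):
    """Ratio r = scale / stroke width for a feature centered at column x, row y.

    Assumption of this sketch: the stroke width of a feature is the median of
    the finite values of `stroke_map` within a disc of radius `scale`.
    Returns None when no finite stroke width is found in the disc.
    """
    rows, cols = len(stroke_map), len(stroke_map[0])
    radius = max(1, int(round(scale)))
    values = []
    for i in range(max(0, y - radius), min(rows, y + radius + 1)):
        for j in range(max(0, x - radius), min(cols, x + radius + 1)):
            inside = (i - y) ** 2 + (j - x) ** 2 <= radius ** 2
            if inside and math.isfinite(stroke_map[i][j]):
                values.append(stroke_map[i][j])
    if not values:
        return None
    return scale / statistics.median(values)

# Toy stroke width maps: uniform strokes of width 2, and pure background.
uniform_map = [[2.0] * 5 for _ in range(5)]
empty_map = [[math.inf] * 5 for _ in range(5)]
r_text = feature_ratio(uniform_map, x=2, y=2, scale=4.0)
r_background = feature_ratio(empty_map, x=2, y=2, scale=4.0)
```

A feature that falls entirely on background has no ratio and can be treated as non-informative.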

B. Character and Word Recognition
Text recognition is based on the classic bag-of-features framework, similar to [6]. In our experiments, 30 training samples and 15 testing samples are selected randomly for each class. Local features are then detected and described as mentioned above. The visual word vocabulary is generated using the k-means clustering algorithm, and the number $P$ of visual words is the same for every class (it varies from 2 to 10 in the following experiments).
Finally, each sample is quantized into a feature vector according to the vocabulary, so that each sample image is described by a $P \times C$-dimensional vector, where $C$ is the number of classes. A support vector machine (SVM) with an RBF kernel is chosen as the classifier because of its effectiveness and representativeness, and the one-vs-all strategy is employed to solve the multi-class problem.
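The quantization step can be illustrated with a short sketch. It assumes hard assignment of each descriptor to its nearest visual word and an L1-normalized histogram; neither detail is stated in the paper, and the function name `quantize` and the toy vocabulary are hypothetical.

```python
def quantize(descriptors, vocabulary):
    """Describe one image as a normalized histogram over the vocabulary.

    Each descriptor votes for its nearest visual word (hard assignment,
    an assumption of this sketch). The result has len(vocabulary) bins,
    i.e. P * C bins when P words are learned for each of C classes.
    """
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(
            range(len(vocabulary)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(d, vocabulary[c])),
        )
        histogram[nearest] += 1
    total = sum(histogram)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [count / total for count in histogram]

# Toy vocabulary with P = 2 words for each of C = 3 classes.
P, C = 2, 3
vocabulary = [(0, 0), (1, 1), (5, 5), (6, 6), (10, 10), (11, 11)]
image_descriptors = [(0.1, 0.0), (0.9, 1.0), (5.2, 5.0), (5.1, 4.9)]
feature_vector = quantize(image_descriptors, vocabulary)
```

The resulting vector is what would be passed to the SVM; removing non-informative features before this step changes both the vocabulary and the histogram.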
Besides, we perform recognition separately for digits, lowercase letters and capital letters to avoid the influence of similar symbols such as 'o' and '0' or 'p' and 'P'. The accuracy for NS and Fnt data is therefore calculated as a weighted average over these three groups.

In the feature selection stage, a group of samples for each class is selected to find the best threshold for filtering especially large or small features. For each training process, we remove a given percentage of the features with extremely large or small ratio parameters. The best filter threshold is found by grid search. For Fnt data, the algorithm searches for the best threshold from 1% to 10% on both the large and the small sides. The search range is limited because very few non-informative features are detected for Fnt data. The experimental results also show that the best thresholds lie in the neighborhood of 1% in most cases for Fnt data. We find that the selection slightly improves the accuracy on Fnt data. In addition, the results of feature selection using linear SVM weights (LSVMW) are shown in Fig. 3. The results of MI and IG are not included because LSVMW outperforms them. Both the scale-based feature selection and the LSVMW-based method improve the accuracy on Fnt data. However, the improvement of the scale-based method is small and weaker than that of the LSVMW-based one. The reason is that most detected local features are informative, since Fnt data contain no cluttered background or noise.

For the NS data, which include more noise, the scale-based feature selection strategy outperforms both the original data and the LSVMW-based feature selection. The results show that scale-based feature selection brings more benefit when a small number of visual words is used, and that the effect of LSVMW approaches that of our method as the number of words increases. The reason is that, with a small number of words, the influence of noise is more pronounced because erroneous code words reduce the accuracy, and the proposed method is more effective at filtering non-informative features and avoiding the generation of erroneous code words.
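The percentage-based filtering and the grid search over the two thresholds can be sketched as below. The scoring function is left as a caller-supplied callback, since in the experiments it would be the recognition accuracy obtained with the retained features; the names `trim_by_ratio`, `grid_search` and `evaluate`, as well as the toy scoring rule, are hypothetical.

```python
def trim_by_ratio(ratios, low_pct, high_pct):
    """Return indices of features kept after dropping the `low_pct` percent
    with the smallest ratios and the `high_pct` percent with the largest."""
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    n = len(order)
    drop_low = int(n * low_pct / 100)
    drop_high = int(n * high_pct / 100)
    return sorted(order[drop_low:n - drop_high])

def grid_search(ratios, evaluate, percentages=range(1, 11)):
    """Try every (low, high) pair in 1%..10% and return the best
    (score, low_pct, high_pct). `evaluate` maps kept indices to a score."""
    best = None
    for low in percentages:
        for high in percentages:
            score = evaluate(trim_by_ratio(ratios, low, high))
            if best is None or score > best[0]:
                best = (score, low, high)
    return best

# Toy example: 20 features with ratios 0..19, trimmed by 10% on each side.
toy_ratios = [float(v) for v in range(20)]
kept = trim_by_ratio(toy_ratios, low_pct=10, high_pct=10)

# Stand-in score (an assumption): prefer keeping exactly 90 of 100 features.
best = grid_search([float(v) for v in range(100)],
                   evaluate=lambda idx: -abs(len(idx) - 90))
```

Ties are resolved in favor of the first pair encountered, so the toy search returns the smallest low-side percentage among equally good settings.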
The accuracy for CH data is calculated by the same method. The proposed scale-based feature selection method clearly outperforms both the original and the LSVMW-based methods. This encouraging result further supports the claim that non-informative local features can be filtered by considering both feature scales and stroke width. To examine the improvement of the proposed method in greater detail, the recognition accuracy on the original data and on the filtered data is calculated. Moreover, the improvement brought by stroke width information is evaluated against the recognition accuracy on the original feature set. The results are shown in Table I.
From Table I, we can see that supervised feature selection algorithms such as LSVMW are more effective for clean data, while the proposed method is more effective when samples contain more noise and degradation, as in the NS and CH data.

V. CONCLUSION
In this paper, we proposed a new approach for filtering text-independent local features by considering both stroke width information and feature scales. The proposed approach is tested on representative benchmarks, and the encouraging experimental results (a maximum improvement of 25.56% for CH data and 19.34% for natural data) support the existence of a relationship between stroke width and feature scale. Unlike traditional methods, which need a set of training data, the proposed approach can effectively filter non-informative local features when only a few samples are available. Moreover, the proposed approach is notably effective for degraded images and small-sample-size situations. These two advantages mean the proposed method could be widely applied in fields such as historical document analysis and text-associated image retrieval. At the same time, there is still much room for improvement in the recognition rate of local-feature-based algorithms. Therefore, our future work includes developing a probability model aimed at increasing the accuracy of local-feature-based STR and building an end-to-end scene text analysis system.

Fig. 1. Diagram (a) shows the frequency histogram of the feature scales of local features extracted from the 'char74k' dataset. Diagram (b) is the frequency histogram of the ratio parameters, calculated as feature scale divided by the corresponding stroke width.

Fig. 3. Diagram (a) shows the recognition accuracy of LSVMW, the proposed method and the original BoFs-based method on Fnt data. Diagrams (b) and (c) show the corresponding results on NS and CH data, respectively.


TABLE I. IMPROVEMENTS BROUGHT BY SCALE-BASED LOCAL FEATURE SELECTION