Deep Neural Networks Combined with STN for Multi-Oriented Text Detection and Recognition

Developing systems that interpret visual content such as images and videos is a challenging but important task, and such systems need to be evaluated on benchmark datasets. This study addresses that challenge with the STN-OCR model, which combines deep neural networks (DNNs) and Spatial Transformer Networks (STNs). The network architecture consists of two stages: a localization network and a recognition network. The localization network finds and localizes text regions and generates a sampling grid. The recognition network takes the localized text regions as input and learns to recognize the text, including low-resolution, curved and multi-oriented text. Deep learning-based approaches require large amounts of data to train effectively; therefore, this study uses two benchmark datasets, Street View House Numbers (SVHN) and International Conference on Document Analysis and Recognition (ICDAR) 2015, to evaluate the system. The STN-OCR model achieves better results on these datasets than those reported in the literature.
Keywords: Spatial Transformer Networks (STNs); Deep Neural Networks (DNN); ICDAR dataset; multi-oriented text; STN-OCR


I. INTRODUCTION
Text detection in scene images is becoming a focused area of research. It has attracted many researchers [1]- [3] from the computer vision community because of its various applications, such as tagging people in security camera footage, understanding street signs for navigation, sign recognition in driver assistance systems, vehicle identification, navigation aids for people with low vision, and processing bank cheques. In recent years, digital devices such as smartphones and cameras have been used to produce large amounts of multimedia content (such as images and videos) across the world. Capturing scenes as digital images with mobile devices has become easy as prices decrease and performance increases. This content is not only generated at very large scale and easily uploaded but also accessed by billions of people on the internet. Many systems have been developed to extract information from it for various purposes. However, the solutions are still under discussion among researchers.
Text detection and recognition in real-world scenarios, such as sign recognition in driver assistance systems and vehicle identification by reading license plates, is a major area of computer vision applications. Recent work [4], [5] presents deep neural networks for text detection with good results on horizontal text. However, support for multi-oriented text is still lacking [6]. These studies further note that text detection is a challenging task because of variations in text: orientation, alignment, visibility, low resolution, and diversity of languages. Moreover, a recent study [7] addresses arbitrary text detection but still has shortcomings in detecting multi-oriented text, for example under object occlusion or large character spacing. These challenges persist even in state-of-the-art methods.
This study provides a solution to the problem of detecting and recognizing text in arbitrary directions in scene images. Generally, there is no restriction on the type of text in images, so if a human can read text of any type, such as sign boards, calligraphy or newspapers, then a system should also detect and recognize it. The purpose of this work is to provide a generalized approach for multi-oriented text detection and recognition. Fig. 1 presents an abstract overview of the STN-OCR model. Extracting information from scene images is not a simple task; it involves deep feature extraction. Many approaches [6]- [11] to this type of computer vision problem have been proposed. The research [12]- [15] in this area is mostly about end-to-end text recognition systems consisting of two stages: text detection and text recognition. Text detection refers to finding text instances and highlighting the textual parts of an image, and text recognition means identifying the text in those localized parts and producing output in text form. Most existing work focuses on only one of these two stages, either text detection or text recognition. The remainder of this paper is organized as follows. Section II discusses related studies to ground the need for and value of this study. Section III describes the methodology. Section IV presents the results obtained from the experiments and discusses them. Section V concludes the study with remarks and recommendations.
www.ijacsa.thesai.org

II. RELATED WORK
Text detection approaches focus on the first stage of the two-stage pipeline of scene text detection and recognition. This stage performs segmentation and produces word bounding boxes in scene images.
Finding and localizing textual regions in complex backgrounds is a challenging task, and two approaches are commonly used to address it.
The first approach is based on character regions and is used by Chen et al. [17], Epshtein et al. [18], Tian et al. [19], Yi et al. [20], and Neuman et al. [12]- [14], [21]. Character-region-based methods localize character regions using connected component analysis and then combine neighboring characters to form words. The second approach is based on sliding windows and is used by Quack et al. [22], Anthimopoulos et al. [23], and Posner et al. [24]. Sliding-window-based methods slide a window over the image to find textual regions and then use machine learning techniques to recognize the text.
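The window enumeration behind sliding-window methods can be sketched as follows. This is a minimal illustration only: the function name, window size and stride are hypothetical, and the classifier that would score each window is omitted.

```python
def sliding_windows(width, height, win_w, win_h, stride):
    """Yield (x1, y1, x2, y2) boxes of a fixed-size window moved over an image."""
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)

# Hypothetical 6x4 image scanned with a 4x4 window and a stride of 2 pixels.
windows = list(sliding_windows(width=6, height=4, win_w=4, win_h=4, stride=2))
```

Each yielded box would then be passed to a text/non-text classifier; smaller strides give denser coverage at higher cost.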

A. Traditional Approaches
Traditional methods are mostly based on manual feature extraction, in which humans design the features used for scene text detection. These systems use features such as the stroke width transform (SWT) [18], MSERs [12] and HOG features [7] to find textual regions and pass them to the next stage, text recognition. For text recognition, several approaches can be used, such as sliding window classifiers [25], ensembles of support vector machines (SVMs) [26], and KNN classifiers using HOG features [27]. The limitation of these methods is that they need expert knowledge to achieve the best results.
The method proposed by Yuanwang et al. [16] uses Exhaustive Segmentation (ES) for text detection. In their study, ES extracts character regions from the image, and a two-layer filter removes non-character regions. Both steps run in parallel, and a support vector machine (SVM) classifier is finally used to cut out the text regions. This method covers low-resolution, blurred and small-sized text. The ICDAR 2013 and Street View Text (SVT) datasets are used to evaluate the performance of the ES approach [16]. The ES method still shows shortcomings on broken strokes, low resolution and dot-matrix fonts, as shown in Fig. 2. Aneeshan et al. [8] propose a novel approach for multi-oriented text detection in images. In their study, Fourier-Laplacian filtering is used to identify textual portions, and a maximum-difference map is then applied to separate the image into text and non-text regions. Finally, a Hidden Markov Model (HMM) is used to verify the selected text portions, and non-textual regions are discarded.

B. Deep Learning Approaches
In the last decade, most systems were built on manually hand-crafted features, but those approaches have now been replaced by deep neural network approaches. For instance, the study by Karatzas et al. [9] uses a selective search approach together with deep neural networks to detect textual regions in scene images. Gupta et al. [28] used the YOLO architecture [10] to develop a text detection model that follows a fully convolutional DNN to localize text candidates. The textual regions output by these systems are given as input to DNNs for text recognition.
Goodfellow et al. [29] proposed a text recognition model for house numbers. It was further improved by Jaderberg et al. [30] for general text recognition. In this system, a single convolutional neural network takes textual regions as input and performs text recognition (the text string present in the image). The complete end-to-end system proposed by Bissacco et al. [31] performs both text detection and recognition, but text detection uses the traditional approaches discussed above with manually hand-crafted features; the text candidates are then binarized and provided as input to a deep FCNN that classifies each character region.
Minghui et al. [4] provide a fast and accurate end-to-end framework for word spotting and recognition with a single deep neural network named TextBoxes, based on a fully convolutional network (FCN) (LeCun et al. 1998). It determines text presence and outputs the coordinates of text bounding boxes. Finally, all boxes are aggregated using a non-maximum suppression process to produce the output. TextBoxes is trained on SynthText for 50k iterations and fine-tuned on the ICDAR 2013 dataset for 2k iterations, and the ICDAR 2011 dataset is used as the test set. It performs strongly on the test set but fails on images with multi-oriented text. Several systems for text detection and recognition using DNNs have been proposed by Jaderberg [30], [32].
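The non-maximum suppression step mentioned above can be illustrated with a generic greedy variant for axis-aligned boxes. This is a sketch under stated assumptions, not the TextBoxes implementation: the function names, the (x1, y1, x2, y2) box format and the 0.5 overlap threshold are all chosen here for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Hypothetical detections: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
```

In this example the second box is suppressed because its overlap with the higher-scoring first box exceeds the threshold.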
The work in [32] developed a bounding box regression CNN model for text detection and a CNN model that performs classification with textual regions as input, but it is limited to a single language because it classifies across a pre-defined dictionary. In [30], a sliding window approach is used for text detection, and a CNN is then slid over the textual regions in the image. This CNN shares weights with the CNN for text detection. The work by He et al. [33] uses both a CNN and a Recurrent Neural Network (RNN). First, it creates slices for text candidates using a sliding window approach. These slices are then given to the text recognition CNN, which produces features that are forwarded to the RNN to predict characters. Hui et al. [6] present a method for detecting horizontal text whose major components include connected component (CC) extraction and character linking. An adaptive color reduction scheme is designed in that paper for extracting CCs. An adjacent character model, trained with an extreme learning machine (ELM), is also developed for connecting characters. Finally, a CNN with ELM is used to verify text and non-text regions. Moreover, the method proposed by Yang et al. [5] detects and locates text in images using two classifiers; non-text regions are cut out using a local recursive search algorithm, and a CNN is used to verify text candidates. The method is evaluated on the ICDAR datasets (ICDAR 2011, ICDAR 2013 and ICDAR 2015), which are benchmarks used especially for text detection in scene images. This method performed better on the ICDAR datasets, but multi-oriented text is not detected.
The previous studies, together with the literature reviews [34], [35], clearly show that further contributions are still needed: most work on text detection and recognition handles a single orientation, namely horizontal text, while text detection in multiple orientations and multiple languages remains a very challenging task. Conferences such as the International Conference on Document Analysis and Recognition (ICDAR) and the International Conference on Computer Vision (ICCV) continue to showcase the latest research in this field. Multi-oriented and multi-language text identification are areas that need to be explored.
Keeping the related studies in view, the system in this study is based on the sliding window approach, with some changes. For instance, the choice of sliding windows is not manually engineered but automatically learned by the model. Lastly, the Spatial Transformer Network (STN) [36] is used as the main building block for text detection.
III. METHODOLOGY
The STN-OCR model behaves like a human reader: it reads line by line in a sequential manner and reads each character step by step. Most recent systems for scene text detection and recognition do not follow this human approach to reading text. These systems operate on the complete image and extract all information at once. In this study, the human-like approach is followed to find and localize textual regions sequentially in images and then recognize those localized regions. To this end, a Deep Neural Network (DNN) model is developed that comprises two stages: 1) text detection and 2) text recognition. This section focuses on the attention mechanism used in the text detection stage and the complete structure of the methodology for STN-OCR [3].

A. Text Detection with Spatial Transformers
This study uses the Spatial Transformer proposed by Jaderberg et al. [36], a learnable module for Deep Neural Networks that receives an input feature map $I \in \mathbb{R}^{H \times W \times C}$, applies a spatial transformation to I, and produces an output feature map O. The spatial transformation has three main parts. The first part is the localization network, which computes a function $f_{loc}$ that predicts the parameters θ of the spatial transformation. The second part takes the predicted parameters θ as input and generates a sampling grid that relates positions of the output feature map to positions in the input feature map. The third part takes this grid as input to a learnable interpolation method and outputs the transformed feature map O. The rest of this section describes each part in detail.

1) Localization network:
In this part, a feature map $I \in \mathbb{R}^{H \times W \times C}$ with height H, width W and C channels is given as input to the localization network, which outputs the predicted parameters θ of the spatial transformation to be performed. The system predicts N two-dimensional transformation matrices $M_\theta^n$, where $n \in \{0, 1, 2, \ldots, N-1\}$.
The localization network finds and localizes N characters, words or text lines. To achieve oriented text detection, the network predicts affine transformation matrices, which can apply rotation, translation, skew and zoom to the input feature map I; in this way the system learns to adapt to, and produce features for, rotated, translated and zoomed text.
In STN-OCR, a feed-forward CNN together with an RNN is used to produce the N affine transformation matrices $M_\theta^n$. The CNN model ResNet-50 [33] is used in this localization network. With this network structure, the system performs better than with other structures such as VGGNet [37]. It addresses the vanishing gradient problem and preserves accuracy better because, unlike with other network structures, the system's accuracy does not saturate. Furthermore, this study also uses Batch Normalization in the experiments, followed by an RNN in this part. The RNN used here is a bidirectional LSTM (BLSTM). The affine transformation matrices are predicted from the hidden states $h_n$ generated by the BLSTM.
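To make the role of the affine matrices concrete, the following minimal sketch builds a 2x3 affine matrix from a rotation angle, an isotropic zoom factor and a translation, and applies it to a point. The function names and the parameterization are assumptions made for illustration (skew is omitted for brevity); in the model the six matrix entries are predicted by the network rather than computed from named parameters.

```python
import math

def affine_matrix(angle_deg=0.0, zoom=1.0, tx=0.0, ty=0.0):
    """Build a 2x3 affine matrix combining rotation, isotropic zoom and translation."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[zoom * c, -zoom * s, tx],
            [zoom * s,  zoom * c, ty]]

def apply_affine(m, x, y):
    """Map a point (x, y) through the 2x3 matrix m."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

# The identity transform leaves points unchanged.
unchanged = apply_affine(affine_matrix(), 0.5, -0.25)
# A 90-degree rotation maps (1, 0) to approximately (0, 1).
rotated = apply_affine(affine_matrix(angle_deg=90.0), 1.0, 0.0)
# Zooming by 0.5 and shifting by 0.25 along x selects a smaller, offset region.
zoomed = apply_affine(affine_matrix(zoom=0.5, tx=0.25), 1.0, 1.0)
```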

 Localization Network Configuration
The localization network uses a residual neural network, also known as the ResNet architecture [38]. Because this study is based on two stages, in this localization stage the images are fed to the network, which localizes the textual parts. The first layer of the network performs a 3x3 convolution with 32 filters, the second layer performs the same convolution with 48 filters, and the third layer also uses 48 filters. After each convolution layer, Batch Normalization [39] is performed, followed by 2x2 average pooling with stride 2. ReLU is used as the activation function in each layer. After each layer, two residual layers with 3x3 convolutions are used, each followed by Batch Normalization. After the last residual layer, a 5x5 average pooling layer is applied, followed by a BLSTM with 256 neurons. After this model, the sampling grid is generated, from which bounding boxes (BBoxes) are extracted. BBoxes are extracted only for textual parts, as depicted in Fig. 4.
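As a rough illustration of how the three convolution stages above shrink the feature map, the sketch below traces spatial sizes through the network. It rests on assumptions not stated in the text: convolutions and residual layers are taken to preserve height and width ("same" padding), the 200x200 input size is hypothetical, and the final 5x5 pooling and BLSTM are not modelled.

```python
def trace_localization_shapes(height, width, filters=(32, 48, 48)):
    """Return (height, width, channels) after each conv + 2x2 average-pool stage."""
    shapes = []
    for f in filters:
        # Assumption: the 3x3 convolution and residual layers keep H and W;
        # 2x2 average pooling with stride 2 halves them (floor division).
        height, width = height // 2, width // 2
        shapes.append((height, width, f))
    return shapes

# Hypothetical 200x200 input image.
shapes = trace_localization_shapes(200, 200)
```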

2) Generation of GRID:
In this part, the system uses a grid $G_0$ with coordinates $(x_{w_0}, y_{h_0})$ together with the affine transformation matrices $M_\theta^n$ to produce N sampling grids over the input feature map I. During this step, N output grids are generated, containing the bounding boxes (BBoxes) of the textual regions localized by the network.
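Grid generation can be sketched as applying one affine matrix to every point of a regular grid. The sketch below is an illustration under assumptions: coordinates are normalized to [-1, 1] (a common convention for spatial transformers, not confirmed by the text), the function names are hypothetical, and the example matrix is chosen by hand rather than predicted by a network. The corner vertices of the resulting grid give a bounding box for the region.

```python
def regular_grid(height, width):
    """Regular grid of (x, y) points in normalized coordinates [-1, 1], row by row."""
    def axis(n):
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [[(x, y) for x in axis(width)] for y in axis(height)]

def transform_grid(theta, grid):
    """Apply a 2x3 affine matrix theta to every (x, y) point of the grid."""
    return [[(theta[0][0] * x + theta[0][1] * y + theta[0][2],
              theta[1][0] * x + theta[1][1] * y + theta[1][2])
             for (x, y) in row] for row in grid]

def grid_corners(grid):
    """Corner vertices of a grid in the order top-left, top-right, bottom-right, bottom-left."""
    return [grid[0][0], grid[0][-1], grid[-1][-1], grid[-1][0]]

# Hand-picked matrix: zoom to half size and shift right by 0.25.
theta = [[0.5, 0.0, 0.25],
         [0.0, 0.5, 0.0]]
sampling_grid = transform_grid(theta, regular_grid(3, 5))
corners = grid_corners(sampling_grid)
```

With N predicted matrices, the same call would be repeated once per matrix to obtain N sampling grids.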
3) Image sampling:
The grid generator in the second part produces N sampling grids, which are now used to sample values of the feature map I at their respective coordinates for each $n \in \{0, 1, \ldots, N-1\}$. In general, these points do not coincide exactly with the discrete grid positions of the feature map I. Therefore, this study uses bilinear sampling, which interpolates each value from the nearest neighboring points.
Fig. 3 shows the working of the grid generator and the image sampler. After the grid generator produces the N output grids, they are fed to the image sampler, which uses them to select the image pixels at the corresponding locations. The system automatically generates BBoxes from the vertices of the generated sampling grids. Combining these three parts (localization network, grid generation and image sampling) forms the Spatial Transformer, which can generally be used in any part of a Deep Neural Network. The Spatial Transformer is the first step in this system.
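The bilinear sampling step can be sketched for a single-channel feature map as follows. This is a minimal illustration, assuming normalized coordinates in [-1, 1] and clamping at the borders; both conventions and the function name are assumptions, not details given in the text.

```python
import math

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (list of rows) at normalized (x, y) in [-1, 1]."""
    h, w = len(feature), len(feature[0])
    # Convert normalized coordinates to pixel coordinates.
    px = (x + 1.0) * (w - 1) / 2.0
    py = (y + 1.0) * (h - 1) / 2.0
    # Clamp to the valid range so that border points remain defined.
    px = min(max(px, 0.0), w - 1.0)
    py = min(max(py, 0.0), h - 1.0)
    x0, y0 = int(math.floor(px)), int(math.floor(py))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = px - x0, py - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * top + wy * bottom

feature = [[0.0, 10.0],
           [20.0, 30.0]]
center = bilinear_sample(feature, 0.0, 0.0)    # midpoint of the four pixels
corner = bilinear_sample(feature, -1.0, -1.0)  # exactly the top-left pixel
```

Because the output is a weighted sum of neighboring values, it is differentiable with respect to the sampling coordinates, which is what lets gradients reach the localization network.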

B. Text Recognition
The text detection stage returns N textual regions extracted from the input image. In the text recognition stage, each of the N regions is handled independently of the others. The N regions are processed by a CNN.
A variant of ResNet is used for this CNN as well, because ResNet was observed to produce better results in the text recognition system. The text detection stage needs to obtain strong gradients from the text recognition stage. In this stage, a probability distribution over the label space is predicted using softmax classifiers.
After applying convolution feature extractor, we obtain the result ( ).
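The softmax prediction over the label space described above can be sketched as follows. The three-symbol label space and the score values are hypothetical; in the model, the scores would come from the convolutional feature extractor.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["0", "1", "2"]   # hypothetical label space
logits = [1.0, 3.0, 0.5]   # hypothetical scores for one character position
probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
```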

Recognition Network Configuration
The configuration of the recognition network is the same as that of the localization network except for the convolution filters. This network contains three convolutional layers with 32, 64 and 128 filters.

C. Training Network
ICDAR 2015 [40] is used to train the network. The training input set X comprises images and a separate text file for each image. Each file contains the entries x1, y1, x2, y2, x3, y3, x4, y4, label for the words in the image, where x1, y1 are the top-left coordinates, x2, y2 the top-right coordinates, x3, y3 the bottom-right coordinates, and x4, y4 the bottom-left coordinates. The label is not used in the first stage, text detection, because at this stage the model only learns to find and localize text candidates; in the next stage, text recognition, the model uses the label for recognition.
Text detection learns to find and localize text regions from the error gradients obtained by calculating the loss of the text label predictions. Text detection is performed with some pre-training steps because it was initially observed that the model does not combine text when an image contains multi-line text. The optimization algorithm has a great impact on the network during training. It was observed that Stochastic Gradient Descent (SGD) works better on simpler tasks during pre-training; after pre-training the network with SGD, the Adam optimizer [41] is applied to improve the already trained network. In the text detection stage, the learning rate is kept constant for a longer period, which results in finding and better localizing textual regions; SGD is therefore used for this stage. The text recognition stage then learns to recognize the text regions predicted by the previous stage.
Street View House Numbers (SVHN) is a benchmark dataset that contains low-resolution images and requires little data preprocessing and formatting. It is similar to the MNIST dataset [42], but it contains a much greater variety of images, including low-resolution and blurred images, because it was built from house numbers in Google Street View images. It comes in two formats: one, like MNIST, consists of cropped digit images, and the other consists of complete house-number images along with bounding boxes for the digits. The dataset is large, with 73257 digits for training and 26032 for testing.
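The annotation format described above can be read with a few lines of code. The sketch below is an illustration with hypothetical function names and a made-up sample line; it assumes integer pixel coordinates, allows commas inside the label, and strips a leading byte-order mark in case a file starts with one.

```python
def parse_annotation_line(line):
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,label' into (corners, label)."""
    # Split at most 8 times so that commas inside the label are preserved.
    parts = line.strip().lstrip("\ufeff").split(",", 8)
    if len(parts) != 9:
        raise ValueError("expected 8 coordinates and a label: %r" % line)
    coords = [int(v) for v in parts[:8]]
    # Corners in the order top-left, top-right, bottom-right, bottom-left.
    corners = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    return corners, parts[8]

def axis_aligned_box(corners):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)

sample = "10,20,110,25,108,60,8,55,EXIT"  # made-up example line
corners, label = parse_annotation_line(sample)
box = axis_aligned_box(corners)
```

In the detection stage only `corners` would be needed, while the recognition stage would also use `label`.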

Experiments on Datasets
The first dataset used for experimentation is ICDAR 2015. It is the most challenging dataset because its images include varied backgrounds, noise, clutter, dot-matrix fonts, blur and low resolution. Results are shown in Fig. 6.
SVHN is the second dataset on which the network architecture is evaluated, to show that the model works on real data. The house numbers in SVHN also contain noise. The experiments on SVHN showed that the network architecture of this study works on SVHN house numbers by finding, localizing and recognizing them on the sampling grid. To achieve the best results, the model was trained from scratch with randomly initialized weights, except for the first stage, the localization network, which was initialized with weights from an already trained network. As a result, the localization network stage tends to produce better results. Table II shows the accuracies on the SVHN dataset for text recognition on real data, namely house numbers. After ICDAR 2015, the system achieved better results, with 97.5% accuracy, on SVHN. Some cases that existing work [16] does not handle were already discussed in the literature review, and the model in this study works well on those images. The experiments were run on Google's K80 GPU with 12GB RAM for testing the model, which produces results within 2-3 seconds per image with a color background.

V. CONCLUSION
In this study, an end-to-end text detection and recognition model (STN-OCR) using a single DNN is applied to recent benchmark datasets such as ICDAR 2015. The system consists of two stages: text detection and text recognition. The text detection model finds and localizes text regions in the image, and its output is the input of the text recognition network, which recognizes the text in those regions. The main purpose was to detect multi-oriented text, and this was achieved with better results. The study shows that the model achieves 97.5% accuracy on SVHN and performs better on ICDAR 2015 than state-of-the-art methods. The model is still limited in combining words into complete lines or sentences. Future work includes applying the model to other local and widely used languages (e.g., Urdu and Hindi) and adjusting the geometric design to find curved text directly.