Multilingual Artificial Text Extraction and Script Identification from Video Images

This work presents a system for extraction and script identification of multilingual artificial text appearing in video images. As opposed to most of the existing text extraction systems which target textual occurrences in a particular script or language, we have proposed a generic multilingual text extraction system that relies on a combination of unsupervised and supervised techniques. The unsupervised approach is based on application of image analysis techniques which exploit the contrast, alignment and geometrical properties of text and identify candidate text regions in an image. Potential text regions are then validated by an Artificial Neural Network (ANN) using a set of features computed from Gray Level Co-occurrence Matrices (GLCM). The script of the extracted text is finally identified using texture features based on Local Binary Patterns (LBP). The proposed system was evaluated on video images containing textual occurrences in five different languages including English, Urdu, Hindi, Chinese and Arabic. The promising results of the experimental evaluations validate the effectiveness of the proposed system for text extraction and script identification.
Keywords: Multilingual Text Detection; Video Images; Script Recognition; Artificial Neural Networks; Local Binary Patterns.


I. INTRODUCTION
Over recent years, there has been a remarkable growth in the amount of multimedia data in the form of images, videos and audio. With the advancements in image/video capture hardware and the increase in the number of online image and video databases, digital multimedia content is likely to increase manifold in the days to come. This growth has increased the need for efficient indexing and retrieval mechanisms that allow users rapid access to the content they are interested in. Among different types of multimedia data, the focus of our research is on videos.
In addition to the visual content, videos comprise audio, text and other objects. The audio and visual information in the video could be effectively employed for development of semantic indexing and retrieval systems [1], which has been an attractive research area for over two decades now [2], [3]. In some cases, especially on video sharing portals, users manually assign tags to videos allowing their retrieval. This retrieval, however, does not take into account the actual content of the video and is based on matching of tags only. In addition to the content of the video, a very powerful component, which could serve as an effective index, is the textual information in the video.
Text embedded in videos provides important, short and relevant information about the visual content. Examples of text occurrences include names of persons, sports scores, important dates, scene locations, movie credits and stock rates. These embedded instances of text can be extracted and used as an effective index for retrieval from large video archival systems. As a result, development of automatic systems which could extract text from videos or images has been an attractive area of research in image analysis and pattern classification. Despite significant research on this problem, detection of textual information remains challenging due to complex backgrounds, different font sizes and orientations, and low contrast and resolution.
It is interesting to note that most of the research on this subject has focused on detecting text in a particular script. Properties of text in a particular script are exploited to detect its occurrences. Recently, there has been a trend of having multilingual text in videos, especially on news channels where news tickers are flashed in multiple (generally two) languages. It would be interesting to develop a generic system that could extract textual occurrences in videos or images irrespective of language or script and this, in fact, is the subject of our study. The text detection module is generally integrated with a text recognition (OCR) module to convert the occurrences of text in the image into text. For a detection system that works on a single script, the output of the detector can directly be fed to the OCR module. In case of a multilingual detection system, however, the script of the detected text also needs to be identified so that it can be fed to the respective OCR system. This script identification has also been addressed in our work.
This work extends our previous contributions on text detection and extraction from video images [4], [5], [6]. The main contribution of this research is the development of a generic text detection system for a multi-script environment which is not tuned to detect text in a particular language. The proposed approach is a combination of unsupervised and supervised techniques. In the first step, an unsupervised approach exploits the visual properties of text to segment candidate text regions using image analysis techniques. These candidate textual regions are validated by an Artificial Neural Network which is trained to differentiate between text and non-text blocks on the basis of a set of features extracted using the Gray Level Co-occurrence Matrices (GLCM). The developed system also identifies the script of the detected text using texture based features computed from the Local Binary Patterns (LBP). The system, evaluated on images with textual occurrences in five different languages (Urdu, English, Arabic, Chinese and Hindi), reports promising results on text detection as well as script recognition.
We first discuss the recent advancements in video text detection and extraction in Section II, followed by the proposed methodology in Section III. Section IV describes the experimental evaluations conducted to validate the proposed methodology along with an analysis of the results realized. Finally, we conclude the paper with some closing remarks in Section V.

II. BACKGROUND
Considering the applications it offers, detection of textual content from images and videos has been a highly researched area over the last decade. Text appearing in videos/images is generally classified into two categories: artificial text and scene text. Artificial text, also known as caption or superimposed text, is the text embedded and laid over the videos during the editing process to provide additional information related to their content, such as news captions, sports scores and stock rates. Scene text, on the contrary, is the text which appears naturally in the scene and is captured by the camera as a part of the scene. Examples of scene text include text appearing on sign boards, billboards, shirts and vehicles [7]. Detection and recognition of each category of text offers different types of applications. Scene text generally finds applications in robot navigation, license plate recognition and navigation of intelligent vehicles. Artificial text, which, in general, is correlated with the content, is preferred for semantic indexing and retrieval of videos. Sample images containing occurrences of scene and artificial text are illustrated in Figure 1.
The unsupervised approaches for text detection mostly exploit the statistical and temporal features of text and, in general, work well on relatively less complex images. However, these methods may produce more false positives in complex scenes. The techniques used in this class of methods are further classified into gradient, connected component, texture and color clustering based methods.
Gradient-based methods [4], [8], [9], [10], [11], [12], [13] use edge information to segment the video images. They assume that there is high contrast between text and its background. Generally, an edge filter (e.g. Sobel or Canny operator) is applied for text detection, which is usually followed by some morphological processing to merge the desired edges to determine text lines [10], [14].
Texture based methods [15], [16], [17], [18] assume that text appearing in video frames has a unique texture that differentiates it from other objects in the image. Since the textural properties vary with font style and size, a generic texture filter for varying scenarios is hard to devise [1]. In addition, the computational complexity of these methods is high as they require an exhaustive scan of the whole image for text detection and localization.
Connected component based methods [19], [20], [21] use either a region growing or a splitting approach in order to group text pixels into clusters until all regions in the input image are identified. These methods are widely used for text localization due to their simple implementation. However, since these methods mainly rely on the contrast between text and background, they produce false alarms in case of low resolution images.
Color based methods [22], [23], [24], [25] use color information to cluster image content into text and non-text regions. These methods perform well for images with high resolution and simple backgrounds. However, these assumptions may not hold in many real world scenarios where text may appear in various colors and can be superimposed on complex backgrounds. In addition, due to compression, images may suffer from color bleeding, affecting the performance of color based methods.
In supervised approaches for text detection, a learning machine is first trained on a set of features extracted from both text and non-text samples. Generally, these features are extracted by scanning the image with a small window and are then fed to the classifier. Classifiers like support vector machines (SVM) and artificial neural networks (ANN) have been extensively applied for this purpose [26], [27], [28], [29], [30], [31]. In some cases, coarse-to-fine algorithms have also been evaluated where the candidate text pixels are first identified and then validated by a classifier [32], [33].
With few exceptions, most of the text detection methods reported in the literature target text in a particular script. The literature is very rich when it comes to detecting text in any of the languages based on the Latin alphabet (English, French, German, etc.). Detection of caption text in Chinese has also received significant research attention. For most of the other scripts, the research is either in its early days or is nonexistent. In our proposed system, we aim to develop a generic text detection system that is not tuned to detect text in any particular script and works on multilingual text, as detailed in the following section.

III. PROPOSED METHODOLOGY
This section presents in detail the proposed methodology for text detection and script identification. As discussed earlier, the target application of such text detection systems is indexing and retrieval of videos. The general architecture of such a system is illustrated in Figure 2. Textual information extracted from videos is fed to an Optical Character Recognition (OCR) system to convert it into text. The focus of our research, however, is on the first part, i.e. detection and extraction of text and identification of the script of the detected text.
The proposed system can be divided into three main modules. An unsupervised approach is first used to detect potential text regions. These text regions are validated through a supervised approach that employs an artificial neural network as classifier. Finally, the script of the extracted text is recognized using texture based features. Each of these modules is discussed in the following sub-sections.

A. Text Detection
For detection of potential text regions, the image is first converted to grayscale [5]. A sequence of image analysis techniques is then applied to the image as discussed in the following.
1) Gradient Computation: Edges are a common feature of text in all scripts. Different scripts have different proportions of horizontal, vertical and diagonal edges corresponding to text strokes in each of these directions. In our study, we consider text in Urdu, English, Chinese, Arabic and Hindi. In our implementation, vertical edges are computed using the first derivative (gradient) by convolution of the image with the respective Sobel mask.
Figure 4 illustrates two images and their respective (vertical) gradient images. It should be noted that objects other than text may also respond to the gradient operator. Hence, the gradient image, in addition to text strokes, may also contain many unwanted edges which are removed in the subsequent steps.
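The vertical-edge computation described above can be sketched as follows. This is a minimal illustration in plain Python, assuming the standard 3 × 3 Sobel mask, a grayscale image stored as a list of rows, and border pixels left at zero. The function name `vertical_gradient` and the border handling are assumptions, not details given in the paper.

```python
def vertical_gradient(gray):
    """Return the absolute Sobel response to vertical edges.

    `gray` is a 2-D list of gray values. Border pixels are left at 0
    (an assumption; the paper does not describe border handling).
    """
    # Sobel mask that responds to horizontal intensity changes,
    # i.e. to vertical edges. The mask is applied without flipping;
    # flipping only changes the sign, which abs() removes.
    mask = [[-1, 0, 1],
            [-2, 0, 2],
            [-1, 0, 1]]
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0
            for dy in range(3):
                for dx in range(3):
                    acc += mask[dy][dx] * gray[y + dy - 1][x + dx - 1]
            out[y][x] = abs(acc)
    return out


# A dark-to-bright vertical step edge between columns 2 and 3.
step = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
grad = vertical_gradient(step)
```

On this step image the response is high only in the two columns adjacent to the edge and zero in the flat regions, which is the behavior the later steps rely on.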
2) Mean gradient: The textual content in images occurs in clusters; hence a number of studies enhance the magnitude of image gradients in the text regions while suppressing it in the non-text areas. Generally this is achieved by scanning the gradient image with a small window and performing some operation on the values in the window [8], [10]. The authors in [10] exploit this idea using accumulated gradients, where the gradient values in a predefined sliding window are accumulated. Shivakumara [8] employed the difference of the maximum and minimum values of pixels in a fixed neighborhood to calculate the value of the central pixel in each window. In our study, we slide a horizontal window of size 1 × s on the gradient image and replace each pixel with the average of the gradient magnitude in the window [4]. The motivation behind this operation is that edges in text regions appear in clusters. Hence, the average gradient in windows over text regions is likely to maintain high values. On the other hand, isolated gradients in the non-text regions, when replaced by the mean of neighboring pixels, are suppressed [4]. Equation 1 summarizes the average gradient operation, s being the size of the averaging window, which is empirically fixed to 31 in our study.
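One plausible reading of the averaging step is sketched below. It assumes a window centered on each pixel and zero padding beyond the image borders; neither detail is specified in the text, and the name `mean_gradient` is hypothetical. The demo uses s = 5 only to keep the example short.

```python
def mean_gradient(grad, s=31):
    """Replace each pixel by the mean of a 1 x s horizontal window.

    Pixels outside the image are treated as zeros (an assumption),
    so the sum is always divided by s.
    """
    half = s // 2
    h, w = len(grad), len(grad[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        row = grad[y]
        for x in range(w):
            lo = max(0, x - half)
            hi = min(w, x + half + 1)
            out[y][x] = sum(row[lo:hi]) / s
    return out


# One row with a cluster of strong gradients (columns 2-6)
# and a single isolated gradient (column 13).
row = [0, 0, 9, 9, 9, 9, 9, 0, 0, 0, 0, 0, 0, 9, 0, 0]
smoothed = mean_gradient([row], s=5)[0]
```

The clustered gradients keep their full magnitude at the cluster center, while the isolated gradient is reduced to one fifth of its value, which illustrates why averaging favors text-like regions.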
The averaged gradient image is binarized to have text or text-like regions as white pixels on a black background. The binarization threshold is computed using Otsu's global thresholding algorithm. As a result of binarization, gradients with weak magnitude are removed (become a part of the background) and text-like regions are retained; these are then merged together by applying morphological operations on the binarized image.
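For readers unfamiliar with Otsu's method, the following sketch shows the standard between-class-variance formulation. It assumes the averaged gradients have been quantized to integers in [0, 255]; the function names and the tiny test image are illustrative and not taken from the paper.

```python
def otsu_threshold(values, levels=256):
    """Return the threshold t that maximizes between-class variance.

    `values` is a flat list of integers in [0, levels). Pixels with a
    value greater than t are treated as foreground.
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * hist[i] for i in range(levels))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, t):
    """White (1) for values above the threshold, black (0) otherwise."""
    return [[1 if v > t else 0 for v in row] for row in image]


image = [[10, 12, 200, 210],
         [11, 13, 205, 220]]
t = otsu_threshold([v for row in image for v in row])
binary = binarize(image, t)
```

In this toy image the weak responses on the left fall below the threshold and become background, while the strong responses on the right survive as foreground.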
3) Morphological Processing: In order to combine the binarized gradients into larger components, we apply the horizontal run-length smoothing algorithm (RLSA). As a result, components in the proximity of one another are merged together while the isolated components remain separated. It can be seen from Figure 5 that most of the textual content is merged into large components which correspond to words or groups of words.
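A minimal sketch of horizontal run-length smoothing is given below. The gap limit `max_gap` is a hypothetical parameter, since the value used in the paper is not stated; the sketch fills any run of background pixels between two foreground pixels on the same row when the run is no longer than that limit.

```python
def horizontal_rlsa(binary, max_gap):
    """Fill short horizontal background runs between foreground pixels.

    `binary` is a 2-D list of 0/1 values. `max_gap` (hypothetical value,
    not given in the paper) is the longest run of zeros that is filled.
    """
    out = [row[:] for row in binary]
    for row in out:
        last_fg = None  # column of the most recent foreground pixel
        for x, v in enumerate(row):
            if v == 1:
                if last_fg is not None and 0 < x - last_fg - 1 <= max_gap:
                    for k in range(last_fg + 1, x):
                        row[k] = 1
                last_fg = x
    return out


# The gap of 2 is bridged, the gap of 5 is left open.
row = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
merged = horizontal_rlsa([row], max_gap=3)[0]
```

Nearby strokes are joined into one component while the distant pixel remains separate, mirroring the merging behavior described for words and groups of words.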
4) Foreground Density Filter: Applying the horizontal RLSA to the binarized averaged gradients joins most of the textual elements into larger components. The image, however, still contains non-text components which need to be addressed. Exploiting the same idea that text components appear in clusters, we next employ a density filter on the image using a rectangular sliding window. The window is moved in a top-to-bottom, left-to-right fashion and, for each position of the window, the density of foreground (likely text) pixels is computed as the fraction of window pixels that belong to the foreground. The foreground pixel density is compared to a pre-defined density threshold t. If the pixel density at a given window position is greater than the threshold, the central pixel is assigned a value of 1; otherwise it is considered a non-text pixel and is assigned a 0.
The density threshold t is set to 0.8 while the window size is fixed to 10 × 10 pixels. As evident from Figure 6, the density filter, although effective, does not suppress all the unwanted non-text regions. We, therefore, apply some geometrical constraints on the detected components to further reduce the false alarms.
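The density filter can be sketched as below, using the stated 10 × 10 window and threshold of 0.8. How the even-sized window is positioned around its "central" pixel and how borders are handled are assumptions made for this illustration, as is the function name.

```python
def density_filter(binary, win=10, t=0.8):
    """Keep a pixel only if its window is dense in foreground pixels.

    The window covers rows y - win//2 .. y - win//2 + win - 1 (same for
    columns); this placement and the zero padding outside the image are
    assumptions, not details from the paper.
    """
    h, w = len(binary), len(binary[0])
    half = win // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            count = 0
            for yy in range(y - half, y - half + win):
                for xx in range(x - half, x - half + win):
                    if 0 <= yy < h and 0 <= xx < w:
                        count += binary[yy][xx]
            if count / (win * win) > t:
                out[y][x] = 1
    return out


# A solid 20 x 20 block (text-like cluster) plus one isolated pixel.
image = [[1 if x < 20 else 0 for x in range(30)] for _ in range(20)]
image[10][27] = 1
filtered = density_filter(image)
```

Pixels well inside the dense block are retained, whereas the isolated pixel and pixels near the block boundary, where the window density drops to 0.8 or below, are suppressed.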

5) Geometrical Constraints:
With the realistic assumption that the size of the text in the image is large enough to be read by the audience, traditional geometrical constraints are applied to the localized bounding boxes. Another important property, as discussed earlier, is that text components are likely to occur in groups and not in isolation. Similarly, since we target horizontally aligned text, constraints can be applied to the aspect ratio of such text. Components satisfying the empirically determined thresholds on aspect ratio, minimum height and minimum width are kept as potential text regions while the remaining components are discarded. Figure 7 illustrates the components retained as text after application of geometrical constraints on the two example images used as reference in our description.
After having discussed the detection of potential text regions using an unsupervised approach, we present the validation mechanism of these detected text rectangles in the next section.
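The filtering by geometrical constraints amounts to a simple test on each bounding box. In the sketch below, the threshold values (`min_w`, `min_h`, `min_aspect`) are purely hypothetical placeholders; the paper states only that the thresholds were determined empirically.

```python
def keep_text_like(boxes, min_w=20, min_h=8, min_aspect=1.5):
    """Filter bounding boxes given as (x, y, width, height).

    All three thresholds are hypothetical illustration values; the
    paper does not report the thresholds it uses.
    """
    kept = []
    for (x, y, w, h) in boxes:
        # Horizontally aligned text is wider than it is tall.
        if w >= min_w and h >= min_h and w / h >= min_aspect:
            kept.append((x, y, w, h))
    return kept


boxes = [
    (10, 10, 120, 20),  # wide and large enough: plausible text line
    (5, 5, 6, 6),       # too small to be readable text
    (50, 50, 30, 60),   # taller than wide: unlikely horizontal text
]
candidates = keep_text_like(boxes)
```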

B. Text Validation
The output of the text detector mostly comprises valid text regions. However, some other objects which exhibit text-like properties are also falsely detected as text regions. The objective of the validation step is to take as input each text block localized by the detector and validate it using a supervised approach. This module comprises two phases, training and validation, each of which is discussed in the following.

1) Training:
A unique property of text in any script is its texture, which can be exploited to distinguish it from other objects or complex backgrounds. Texture information can be captured using a variety of measures. In our implementation, we compute a set of features from the Gray Level Co-occurrence Matrices (GLCM) of text and non-text blocks to represent the texture. These features are then used to train a classifier, an artificial neural network in our case, to discriminate between text and non-text regions.
Training of the classifier requires samples of text and non-text blocks. We have used a training data set which comprises video images containing textual occurrences: 30 images for each script, making a total of 150 images. The text rectangles in each image are manually extracted while the rest of the image is considered as non-text region. Each text and non-text rectangle is divided into small blocks of 30 × 50. This gives a large number of text and non-text blocks which constitute our training data. Some examples of text and non-text blocks can be seen in Figure 8. Each block (text or non-text) is converted to grayscale and a GLCM is computed for each block. The GLCM considers the relationship between two neighboring pixels and determines how frequently different combinations of gray levels co-occur for a given direction and distance. The GLCM is a square matrix whose dimension equals the number of gray levels in the image. It is therefore a common practice to quantize the gray levels to obtain a smaller GLCM. In our implementation, we quantize each block to 64 gray levels and compute the GLCMs using four displacement vectors (offsets). These offsets include (0,1), (1,-1), (0,-1) and (-1,-1) and correspond to four directions: 0°, 45°, 90° and 135°.
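The construction of a normalized GLCM for one offset can be sketched as follows. The sketch assumes offsets expressed as (row, column) displacements and uses the conventional set (0,1), (-1,1), (-1,0), (-1,-1) for 0°, 45°, 90° and 135°; this convention, the linear quantization rule and all function names are assumptions of the illustration rather than details confirmed by the paper.

```python
def quantize(block, levels=64, max_value=256):
    """Map gray values in [0, max_value) to [0, levels) linearly."""
    return [[v * levels // max_value for v in row] for row in block]


def glcm(block, offset, levels):
    """Normalized co-occurrence matrix for one (row, col) offset."""
    dr, dc = offset
    h, w = len(block), len(block[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < h and 0 <= c2 < w:
                counts[block[r][c]][block[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


# Conventional (row, col) offsets for 0, 45, 90 and 135 degrees (assumed).
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]

# A small 4-level block, already quantized, to keep the matrix readable.
block = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
glcms = [glcm(block, off, levels=4) for off in OFFSETS]
p0 = glcms[0]  # horizontal neighbors
```

For the horizontal offset, the block contains 12 neighbor pairs, of which two are (0, 0) and three are (2, 2), so the corresponding entries of `p0` are 2/12 and 3/12.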
Once the GLCMs are computed, several statistics can be computed from each GLCM and could serve as features to characterize the underlying texture of the input image (block). In our study, we compute the contrast, correlation, homogeneity, entropy and energy of each GLCM and use them as features to characterize each block. These statistics are summarized in Table I. These five statistics are computed for each of the four GLCMs (0°, 45°, 90°, 135°) for each training block. Finally, the average of each feature over the four directions is computed, giving a 5 dimensional feature vector [34]. These features are fed to a feed forward artificial neural network. In our implementation, we use a neural network with 5 neurons in the input layer (corresponding to five features), 20 neurons in the hidden layer (chosen experimentally) and two neurons in the output layer, each neuron with a sigmoid activation function. The network is trained on 396 text blocks and 938 non-text blocks using the back propagation algorithm.
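The five statistics and their averaging over directions can be sketched as below. The formulas are the commonly used textbook definitions (homogeneity with a 1 + (i - j)² denominator, entropy with the natural logarithm); the exact variants used in the paper are not recoverable from the text, so these choices and the function names are assumptions.

```python
import math


def glcm_features(p):
    """Contrast, correlation, homogeneity, entropy, energy of a GLCM.

    `p` is a normalized square matrix (entries sum to 1). Standard
    definitions are assumed; the paper's exact variants may differ.
    """
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu_i = sum(i * v for i, j, v in cells)
    mu_j = sum(j * v for i, j, v in cells)
    var_i = sum((i - mu_i) ** 2 * v for i, j, v in cells)
    var_j = sum((j - mu_j) ** 2 * v for i, j, v in cells)

    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    homogeneity = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    energy = sum(v * v for i, j, v in cells)
    entropy = -sum(v * math.log(v) for i, j, v in cells if v > 0)
    if var_i > 0 and var_j > 0:
        cov = sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells)
        correlation = cov / math.sqrt(var_i * var_j)
    else:
        correlation = 0.0  # undefined for a constant block
    return [contrast, correlation, homogeneity, entropy, energy]


def average_features(glcms):
    """Average each statistic over the directional GLCMs."""
    per_direction = [glcm_features(p) for p in glcms]
    return [sum(col) / len(per_direction) for col in zip(*per_direction)]


# A perfectly diagonal GLCM: neighbors always share the same gray level.
p = [[0.5, 0.0],
     [0.0, 0.5]]
features = average_features([p, p, p, p])
```

For this diagonal matrix the contrast is 0, the correlation and homogeneity are 1, the energy is 0.5 and the entropy is ln 2, giving the 5 dimensional vector that would be passed to the classifier.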

TABLE I: Summary of GLCM based features (contrast, correlation, homogeneity, entropy and energy, computed from the GLCM entries P_{i,j})
2) Validation of Text Regions: The trained neural network is employed to validate the candidate text regions produced by the detection module. Each detected rectangle is divided into blocks which are fed to the network for classification. If more than 60% of the blocks in a detected rectangle are classified as text, the rectangle is retained as a valid text region. Otherwise, it is considered a false positive and is discarded. This validation step is intended to remove the false alarms and improve the overall precision of the system. A relaxed threshold of 60% is used so that valid text regions are not eliminated during this step and the recall of the system is not compromised. The final text rectangles are then separated from the background using the text extraction module discussed in the following.
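The block-voting rule can be illustrated with a short sketch. The per-block decisions are assumed to come from the trained network; here they are simply given as a list of booleans, and the function name is hypothetical.

```python
def validate_rectangle(block_is_text, min_fraction=0.6):
    """Retain a rectangle if more than `min_fraction` of its blocks
    were classified as text.

    `block_is_text` holds one boolean per block, assumed to be the
    output of the trained classifier.
    """
    if not block_is_text:
        return False
    text_fraction = sum(1 for b in block_is_text if b) / len(block_is_text)
    return text_fraction > min_fraction


mostly_text = [True] * 7 + [False] * 3    # 70% text blocks
borderline = [True] * 6 + [False] * 4     # exactly 60% text blocks
mostly_clutter = [True] * 3 + [False] * 7 # 30% text blocks

decisions = [validate_rectangle(v)
             for v in (mostly_text, borderline, mostly_clutter)]
```

Because the rule requires strictly more than 60%, the borderline rectangle is rejected in this sketch; a rectangle with even one more text block would be kept.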

C. Text Extraction
Text extraction is the step where the text components are segmented from the background. This step is straightforward if the background is homogeneous but can pose difficulties on complex backgrounds. A number of global and local thresholding algorithms have been proposed to segment text from the background, both in scanned document images and video frames [35], [36], [37], [38], [14]. In our implementation, we employ Wolf's algorithm [14], which has been specifically developed for segmentation of video text from the background and is known to work better than many other binarization algorithms. Examples of text extracted using Wolf's binarization [14] can be seen in Figure 9.
This concludes our discussion on text detection, which comprised detection of potential text regions, validation of these regions and segmentation of text from the background. We present script identification in the next section.

D. Script Identification
Script identification is aimed at identifying the script of the text detected by the detection module. Literature on script identification of video text is relatively limited as most of the text detection systems have been designed to operate on text in a known language. The existing literature on this subject mostly addresses document images, and script identification from text in videos has been a less investigated area. In case of printed and handwritten document images, features at page, paragraph, line and word level have been explored for identification of script [39], [40], [41]. Among recent video text script identification methods, supervised [42] as well as unsupervised [43] techniques have been employed.
For detection of multi-script text, the objective is to find the common properties of text in different scripts and exploit these properties to allow its detection. In script recognition, the objective is to exploit the variations between different scripts. In our study, we consider text in each script as a different texture and employ Local Binary Patterns (LBP) to capture the texture information. The histograms of LBPs computed from texts in different scripts are used to train a neural network which then classifies a given text as belonging to one of the script classes.

1) Local Binary Patterns:
Local Binary Patterns, introduced by Ojala [44], [45] for texture classification, have been effectively applied to a wide variety of texture classification problems [46], [47], [48], [49]. The original LBP feature [44], [45] considers for each pixel V0 a set of neighboring pixels. The pixel values of all the neighbors are compared with the value at the central pixel. If the value of a neighboring pixel is less than that of the central pixel, the neighbor is assigned a value of 0; otherwise, it is assigned a 1. The resulting string of 0s and 1s is considered a binary number. The computation of LBP for a reference pixel is illustrated in Figure 10.
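The thresholding rule of the original LBP operator can be illustrated on a single 3 × 3 patch. The order in which neighbors are read (clockwise from the top-left corner, first neighbor as the least significant bit) is an assumption of this sketch; different orderings yield different but equivalent codes.

```python
def lbp_3x3(patch):
    """Basic LBP code of the central pixel of a 3 x 3 patch.

    A neighbor contributes a 1 when its value is not less than the
    center. Neighbor ordering is an illustrative assumption.
    """
    center = patch[1][1]
    # Clockwise from the top-left corner.
    coords = [(0, 0), (0, 1), (0, 2), (1, 2),
              (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(coords):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_3x3(patch)
```

With a center value of 6, the neighbors 6, 7, 8, 9 and 7 are thresholded to 1 and the neighbors 5, 2 and 1 to 0, which under the assumed ordering gives the code 241.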
In a later study [50], the authors proposed extensions to the original LBP operator to take into account neighborhoods of different sizes. The generalized LBP is represented using the notation (P, R), where P represents the number of neighboring pixels while R is the distance of the neighboring pixels from the central pixel. In addition, based on the number of transitions between 0s and 1s, uniform and non-uniform binary patterns were introduced. LBP codes for which the number of transitions is less than or equal to 2 are considered uniform while those with more than 2 transitions are considered non-uniform [50].
To generate an LBP based descriptor of texture, the LBP is computed for each pixel in the image and the histogram of LBP is used as a feature to characterize texture. In our implementation, we compute the (16, 2) LBP from the grayscale images of text blocks with dark text on a bright background. For 16 neighboring points, this gives a 243 dimensional feature vector characterizing the texture of each script.
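The dimension of 243 is consistent with a uniform-pattern histogram, assuming one bin per uniform pattern plus a single shared bin for all non-uniform patterns. The sketch below checks this by brute force; the function names are illustrative.

```python
def transitions(code, p):
    """Number of 0/1 changes when a p-bit pattern is read circularly."""
    count = 0
    for k in range(p):
        a = (code >> k) & 1
        b = (code >> ((k + 1) % p)) & 1
        if a != b:
            count += 1
    return count


def uniform_bin_count(p):
    """Histogram size: one bin per uniform pattern (at most 2
    transitions) plus one shared bin for all non-uniform patterns."""
    uniform = sum(1 for code in range(2 ** p) if transitions(code, p) <= 2)
    return uniform + 1


bins_16 = uniform_bin_count(16)
bins_8 = uniform_bin_count(8)
```

For P neighbors there are P(P - 1) + 2 uniform patterns, so P = 16 gives 242 uniform patterns and a histogram of 243 bins, while P = 8 gives 59 bins.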
2) Training and Classification: An artificial neural network is used as classifier to recognize the script. The neural network is trained using the same training set that was used to train the network for text validation. Text rectangles from a total of 150 images, with 30 images per script, are used as training data. The LBP histogram is computed from each image and the extracted histograms are fed to the network for training. The network comprises 243 neurons in the input layer (same as the dimension of the feature vector/histogram), 200 neurons in the hidden layer and 5 neurons in the output layer (corresponding to 5 scripts). For recognition, the LBP histogram is determined from the detected text rectangle and is fed to the network, which classifies it as being English, Arabic, Urdu, Hindi or Chinese text.

IV. EXPERIMENTS AND RESULTS
All experiments are carried out on the multi-lingual artificial text database developed at the Image Processing Center (IPC), a research facility at the National University of Sciences and Technology (NUST), Pakistan. The database comprises a total of 500 video frames extracted from different news channels, sports videos, talk shows, etc. These images contain occurrences of artificial text in five different languages, namely English, Arabic, Urdu, Chinese and Hindi, with 100 images of each category as the major text of the image. A subset of this data set (images with Urdu text) has been published as [51]. The resolution of the images varies from a minimum of 320x240 to a maximum of 720x576 pixels. Out of the 100 images of each category, 30 images are used as training data (for training the ANN for text region validation and script identification) while 70 are used for testing. The ground truth data for the images was generated by labeling the text occurrences and storing the coordinates of each text rectangle.
Several evaluation metrics have been proposed to evaluate the performance of text localization systems [52], [53]. In our system, we have employed the area based precision and recall measures. Let A_E be the estimated text area given by the system and A_T be the ground truth text area; then the precision P and recall R are defined as:
$$P = \frac{A_E \cap A_T}{A_E}, \qquad R = \frac{A_E \cap A_T}{A_T}$$
The same idea can be extended to N images to compute the overall precision and recall values. For script recognition experiments, we report the confusion matrix and the overall correct classification rate of the system.
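A small sketch of area based precision and recall is given below, assuming the usual definitions in which the overlap between estimated and ground truth areas is divided by the estimated area (precision) or by the ground truth area (recall). Rectangles are given as (x0, y0, x1, y1) with exclusive upper bounds, areas are counted in pixels, and the F-measure is taken as the harmonic mean; the function names and example coordinates are illustrative.

```python
def rect_pixels(rects):
    """Set of pixel coordinates covered by a list of rectangles."""
    covered = set()
    for x0, y0, x1, y1 in rects:
        for y in range(y0, y1):
            for x in range(x0, x1):
                covered.add((x, y))
    return covered


def area_precision_recall(estimated, truth):
    """Area based precision and recall (assumed standard definitions)."""
    a_e = rect_pixels(estimated)
    a_t = rect_pixels(truth)
    overlap = len(a_e & a_t)
    precision = overlap / len(a_e) if a_e else 0.0
    recall = overlap / len(a_t) if a_t else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Three ground truth text lines detected as one merged rectangle.
truth = [(0, 0, 40, 10), (0, 20, 40, 30), (0, 40, 40, 50)]
estimated = [(0, 0, 40, 50)]
p, r = area_precision_recall(estimated, truth)
f = f_measure(p, r)
```

All 1200 ground truth pixels are covered, so the recall is 1.0, but the merged rectangle contains 2000 pixels, so the precision is only 0.6. This shows how merging several text lines into one box lowers an area based precision even when no text is missed.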

A. Text Detection Results
The text detection module first identifies potential text regions using an unsupervised approach. These candidate text rectangles are then validated by a supervised approach to find the final set of text regions. Detection results, in terms of precision and recall, for both of these are summarized in Table II and Table III, respectively.
It can be seen from Table II that precision values are lower than recall values. There are mainly two reasons for this. The first reason is that the system parameters are tuned to achieve high recall, and low values of precision at the detection step are acceptable. The next step of text validation is aimed at rejecting the false alarms and improving the precision of the system. Since validation cannot recover the text regions which are missed by detection, the recall cannot be improved by the validation step and hence high values of recall are desired at the detection step. The second reason is that we are using an area based metric to compute precision and recall, where area represents the number of pixels. Figure 11 illustrates an example of the ground truth text region and the text region detected by the system. Although the system has detected the text, all three text regions are merged into one big rectangle (with background pixels in the detected region), which results in a low precision.
It should be noted that the idea of having a validation step after detection is to enhance the precision of the system by rejecting the regions falsely detected as text. Although precision values in Table III are better than those in Table II, there is a slight decrease in the recall values. This is because, while false alarms are reduced by the validation step, some text regions are also eliminated. Overall, however, increased values of F-measure reflect the usefulness of this validation step.

B. Script Identification Results
Script identification is aimed at identifying the script of the text extracted from the images. From the viewpoint of application, the script identification module should be fed the output of the text detector. However, since the text detection does not extract all the text rectangles, script recognition experiments are carried out on manually extracted text blocks. This allows evaluation of script recognition on all the text blocks in our dataset. Out of a total of 1,448 text blocks, the script of 1,291 blocks was correctly recognized, giving a classification rate of 89%. The detailed confusion matrix is presented in Table IV, where it can be observed that the performance of script identification is more or less consistent across text in different scripts.
For script identification, we have used the histogram of local binary patterns using the (16, 2) neighborhood (LBP(16,2)). By varying the neighborhood size, we study the variation in the classification rate as illustrated in Figure 12. Neighborhoods of (8,1), (8,2), (8,3), (16,1), (16,2) and (16,3) have been considered in our experiments. It can be observed from Figure 12 that the script recognition rates are not very sensitive to the neighborhood size, with neighborhoods of 16 pixels naturally performing better than those of 8 pixels.
We also performed a comparative analysis of the proposed system with well-known existing systems in the literature. The comparison can be carried out for text detection as well as script recognition. Text detection, however, has been evaluated by different metrics in different studies; hence a meaningful comparison may not be possible. We, therefore, present a comparison of the performance of different script recognition systems in Table V. It can be seen from the table that the database employed, the number of scripts and the number of images in each study are different, making it difficult to perform a direct comparison of recognition rates. A maximum of 10 different scripts have been considered in [54], realizing a recognition rate of 91%. The system, however, has been evaluated on 100 test images only. The recognition system in [43] reports a correct classification rate of around 96% on 770 test images, which indeed is very promising. Our proposed LBP based technique realizes a recognition rate of 89% on 500 test images in 5 different scripts. These results are comparable with most of the studies and we look to improve them further by introducing other texture based features to complement the LBP features.

V. CONCLUSION
This work presented a system for detection of multilingual artificial textual content from video images, an important component for text based indexing and retrieval of videos. Script recognition was also considered in our study. Most of the state-of-the-art approaches for text detection target a single script/language. We have presented a generic text detection system that is not tuned to one particular type of text. The detection is implemented using a combination of unsupervised and supervised techniques. The unsupervised approach relies on image analysis techniques including edge information, morphological processing and geometrical heuristics to detect potential text regions in an image. These candidate text regions are then validated by an artificial neural network that is trained on text and non-text blocks using a set of texture features computed from Gray Level Co-occurrence Matrices (GLCMs). The proposed methodology, evaluated on images containing textual occurrences in five different languages (Urdu, Arabic, Hindi, English and Chinese), realized promising results.
We also presented a script recognition module that takes text blocks as input and recognizes the script of the text. Each script is viewed as a different texture and the texture information is captured by computing the histogram of Local Binary Patterns. Recognition is carried out by an artificial neural network trained on text blocks from the five scripts considered in our study. The main idea of this module is to identify the script of the text rectangles detected in the images so that these rectangles can be further processed by their respective recognition engines.
The proposed system, which presently targets extraction of text from images and recognition of the script of detected text, can be extended to a complete video indexing and retrieval system. This will require either integration of recognition engines (for each of the scripts) or a word spotting based technique allowing indexing of videos on the extracted textual content. The video OCR itself is a challenging problem due to low resolution and complex backgrounds as opposed to document OCRs. Another interesting aspect which could be exploited is the temporal redundancy of text in videos.

Fig. 2: General framework of a video indexing and retrieval system

Fig. 8: Blocks used to train the neural network (a) Text blocks (b) Non-text blocks

Fig. 9: Text detection and extraction examples in five different languages

Fig. 12: Script identification rates as a function of different neighborhoods of LBP

Using the unsupervised detection scheme, an overall precision of 59% and a recall of 89% is achieved. It is interesting to note that the results are consistent across text in different languages, demonstrating the generality of the system.

TABLE II: Precision and recall of text detection (unsupervised)

TABLE III: Precision and recall after text validation

TABLE IV: Script Recognition - Confusion matrix

TABLE V: A comparison of script recognition systems

The present system works on static images and does not take into account the redundancy that exists across multiple frames in a video. Integrating the detection results of multiple frames could serve to enhance the overall accuracy of the system. It is expected that the ideas put forward in this research will be helpful to researchers working on video retrieval systems in general and text extraction in particular.

(IJACSA) International Journal of Advanced Computer Science and Applications, www.ijacsa.thesai.org