Impact of Deep Learning on Localizing and Recognizing Handwritten Text in Lecture Videos

Video recording technologies have become increasingly capable and easy to use. As a result, numerous universities record their lectures and publish them online to make them accessible to students. These lecture videos contain handwritten text written on paper, on a blackboard, or on a tablet using a stylus. At the same time, recording lectures in this way generates large quantities of multimedia data at a rapid pace. Handwritten text recognition on lecture video portals has therefore become a significant and demanding task. This paper develops a novel handwritten text detection and recognition approach for lecture video data that follows four major phases: (a) text localization, (b) segmentation, (c) pre-processing and (d) recognition. Text localization is the initial phase, in which arbitrarily oriented text in the video frames is localized using the Modified Region Growing (MRG) algorithm. The localized text is then segmented via K-means clustering, which separates the words from the detected text regions. Subsequently, the segmented words are pre-processed to reduce blurriness artifacts. Finally, the pre-processed words are recognized using a Deep Convolutional Neural Network (DCNN). The performance of the proposed model is analyzed in terms of accuracy, precision, sensitivity and specificity. Experimental results reveal that, at a learning percentage of 70, the presented work achieves its highest accuracy of 89.3% for 500 frames.

Keywords: Lecture video; text localization; segmentation; word recognition; Deep Convolutional Neural Network (DCNN)


I. INTRODUCTION
Professional lecture videos are now abundant on the web, and their number is constantly growing. These lecture videos draw students towards tele-teaching and e-learning [1] [2] [3] [4]. Students often grasp a subject more quickly by watching a video than by reading text. Advanced analysis techniques help to collect the relevant metadata from these videos automatically, and video lectures are therefore becoming one of the easiest ways of learning online courses [5] [6] [7]. In Massive Open Online Courses (MOOCs), which have become synonymous with distance learning, understanding lecture videos is vital for educational research. A better understanding of a lecture video relies on vital cues such as figures, images and text [8] [9] [10] [11]. Among these cues, text is available in almost all lectures and can be utilized for a variety of tasks such as extracting class notes, generating keywords, enabling search and indexing videos.
In a lecture video, the text can be handwritten on a blackboard or on paper, written with a stylus on a tablet and displayed on a screen, or font-rendered in presentation slides (digital text). These lectures are usually recorded with conventionally positioned cameras. In general, identifying text in presentation slides is somewhat easier than identifying handwritten blackboard text, since slide text is more legible. Handwritten text recognition, on the other hand, is considerably harder because of the amount of variation and the overlap between characters. Characters can be small or large, stretched out, swooped, stylized, slanted, crunched, linked and so on. Handwritten text recognition is still far from solved, but deep learning is helping to improve its accuracy. Recognizing handwritten blackboard text is additionally challenging because the text is often hard to read owing to low contrast, bad illumination, small letter size and similar factors. Moreover, detecting text on a blackboard or on paper can be difficult when the content is cluttered, for example when the lecturer over-writes or writes over figures and equations [12] [13] [14]. Handwritten Text Recognition (HWR) focuses on handwritten text in documents, and it is inherently complex because of the differences between writing styles.
Over the decades, extensive research has been carried out on text recognition in lecture videos, and a variety of methods and algorithms have been developed. Word spotting is a key challenge, and the majority of recent works use DCNNs for learning the features. DCNNs learn the features of a word from dissimilar attribute spaces and are invariant to diverse styles and degradations. With the aim of identifying handwritten text in lecture videos with the utmost accuracy, this work formulates a novel technique that specifically addresses the problems of existing works.
The major contributions of the current research work are highlighted below:
• A novel deep learning based handwritten text recognition approach for video lectures is developed.
• Initially, the text in the collected video frames is localized with the Modified Region Growing approach, and the localized text is segmented with K-means clustering.
• The segmented words are pre-processed and recognized using DCNN.
• The performance of the proposed model is analyzed in terms of accuracy, precision, sensitivity and specificity.
*Corresponding Author 336 | Page www.ijacsa.thesai.org
The rest of the paper is organized as follows. Section II reviews the literature on this subject. Section III describes the proposed handwritten text recognition approach for lecture videos. The results obtained with the presented work are discussed in Section IV. Finally, Section V concludes the paper.

II. LITERATURE REVIEW
In 2014, Yang et al. [15] developed a novel framework for video text detection and recognition. A Fast Localization-Verification Scheme (FLVS) with Edge Based Multi-Scale Text Detection (EMS-TD) was constructed in the text detection stage. This algorithm consists of three main steps: text gradient direction analysis, seed pixel selection and seed-region growing. A novel video text binarization algorithm was employed for better text recognition. The potential text candidates were detected with a high recall rate by the edge based multi-scale text detector. Then, the detected candidate text lines were refined using an image entropy-based filter. Subsequently, the authors discarded the false alarms with the help of the Stroke Width Transform (SWT) and a Support Vector Machine (SVM). In addition, a novel skeleton-based binarization method was constructed to separate text from complex backgrounds in the text recognition phase. The proposed text recognition model was evaluated in terms of accuracy using publicly available test data sets.
In 2018, Poornima and B. Saleena [16] developed a new technique for the successful retrieval of lecture videos from a database using the Correlated Naive Bayes (CNB) classifier. Here, the textual features as well as the image texture were extracted from the key frames with the help of the Tesseract Classifier (TC) and the Gabor Ordinal Measure (GOM). The extracted feature dataset comprises three major types of features: keywords, semantic words and image texture. On the basis of feature similarity, the authors grouped the videos with K-means clustering. Finally, the texts were recognized from the lecture video on the basis of the correlation as well as the posterior probability. The proposed model was compared with existing models in terms of precision and recall.
In 2018, Kota et al. [17] constructed a deep learning based method for recognizing handwritten text, math expressions and sketches in online lecture videos. In the proposed model, the whiteboard lecture video recorded with a still camera was the input to the video processing pipeline. Then, a summary of the handwritten elements on the whiteboard was generated as keyframes over time. The method suffers from occluded content owing to the motion of the lecturer. The authors applied a conflict minimization approach after spatio-temporal content association in order to generate the keyframe summary. In addition, Coarse-Grained Temporal Refinement (CTR) was applied to the Content Bounding Boxes (CBB) to handle variations in the detector output caused by factors such as occlusions and illumination.
In 2015, Husain et al. [18] proposed a distributed system in which the lecture video frames were stored in a Hadoop Distributed File System (HDFS) repository. With the help of HDFS, the images were processed in a highly concurrent manner. Further, the MapReduce framework was applied for reading the text information as well as for counting how frequently each word appears. The proposed text recognition and word count algorithms were tested with cluster sizes of 1 and 5 in the Hadoop framework. The results of the proposed model confirmed its applicability to video processing and high-speed image processing.
In 2019, Dutta et al. [19] investigated the efficiency of traditional handwritten text recognition and word spotting methods on lecture videos. The dataset was collected from LectureVideoDB, which covers 24 different courses across science, management and engineering. Once the frames were stored, they were processed using the pre-trained TextSpotter. The authors localized the words in the video lecture using a deep Fully Convolutional Neural Network (FCNN) and applied Non-Maximal Suppression (NMS) to the FCNN output to detect the arbitrarily oriented text on the blackboards. Once the location of a word was identified, the word was recognized using the Convolutional Recurrent Neural Network (CRNN) architecture and the CRNN Based Spatial Transformer Network (CRNN-STN). Then, as a novelty, they spotted the keywords in the video by extracting the features with two parallel network streams, and the label information was concatenated using Pyramidal Histogram of Characters (PHOC) features.
In 2019, Miller [20] designed a lecture summarization service by leveraging the Bidirectional Encoder Representations from Transformers (BERT) model. The first contribution of this research work was the management of lecture transcripts and summarizations, which lets users edit, retrieve and delete the stored items. The second contribution was inference with the BERT model combined with K-means in order to produce the embeddings for clustering. Further, on the basis of a specified configuration, the summaries for users were generated by the BERT model. Finally, the proposed BERT model was compared with TextRank. Although no golden truth summaries were available, the results showed an improvement in handling context words, and the model was applicable to more lecture videos.
In 2015, Miller et al. [21] constructed a new approach for Automated Video Content Retrieving (AVCR) within large lecture video archives. Initially, the audio lecture was separated from the video, and the video was converted into image key-frames using an Optical Character Recognition (OCR) algorithm. Then, the keywords were extracted from the images using the OCR algorithm. Subsequently, visual guidance for video content navigation was provided by key-frame detection as well as by the automatic video segmentation model. Then, video OCR was applied to the key-frames, together with Automatic Speech Recognition (ASR), in order to extract the textual metadata available in the lecture videos. Further, on the basis of the multimedia search diversification method, appropriate answers were collected for the query words. The proposed model provided more relevant information to the users more effectively.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 4, 2021
In 2019, Husain and Meena [22] introduced a novel method for efficient Automatic Segmentation and Indexing of Lecture Videos (AS-ILV). The proposed model helps in faster reorganization of the specific and relevant content in online lecture videos. The authors proposed automatic indexing of lecture videos from both slide text and audio transcripts, using the topic hierarchies extracted from the video. On the basis of the slide text information, they indexed the videos accurately with higher character recognition rates. As a novelty, they used the semi-supervised Latent Dirichlet Allocation (LDA) algorithm to overcome the high Word Error Rate (WER) of the transcripts caused by unconstrained audio recording. They tested the proposed model with Coursera, NPTEL and KLETU classroom videos, and the evaluation showed an average percentage improvement in F-Score compared with the existing approach. Table I summarizes the abovementioned works along with their methodology, features and challenges.
Despite the large body of work on lecture videos and MOOCs, only a small number of studies specifically address this problem. Among them, the SWT and SVM approach in [15] offers high robustness and a high recall rate. However, its text recognition rate needs to be improved with context- and dictionary-based post-processing, and its text detection results need to be improved with text tracking algorithms. The CNB, Tesseract classifier and GOM approach in [16] has a lower computational time and improved precision and recall, but it struggles to retrieve text from large datasets, so an optimization technique needs to be applied. In [17], the conflict minimization approach gives a higher text detection rate, but it still needs to handle occlusions and temporal refinement for end-to-end detection of content in video frames. HDFS and MapReduce in [18] form a fault-tolerant and cost-effective distributed system; however, large datasets create memory shortage problems, so memory optimization needs to be applied. CRNN and CRNN-STN in [19] are applicable to low-resolution and complex images. The BERT model in [20] achieves a better trade-off between speed and inference performance, but its automatic extractive summarization is not perfect. The OCR algorithm in [21] provides relevant information effectively and can collect appropriate answers for the query words, but it suffers from a low recognition rate and high cost. The LDA algorithm in [22] enhances the F-Score; however, the WER is not removed completely. Research in this area should focus on one or more of these problems.

III. HANDWRITTEN TEXTUAL INFORMATION RECOGNITION
The adopted approach for recognizing handwritten text in video is illustrated in Fig. 1. In the presented scheme, a new handwritten text image recognition approach is developed using an intelligent technique. The scheme comprises four main stages: text localization, segmentation, pre-processing and recognition. At first, the video frames are acquired and the text within them is localized using the Modified Region Growing algorithm. The localized text is then segmented via K-means clustering, which separates the words from the detected text regions. Subsequently, the segmented words are pre-processed to reduce blurriness artifacts. Finally, the pre-processed words are recognized using the DCNN. The output of the DCNN is the recognized textual information from the acquired lecture video.

A. Text Localization
The collected lecture video frames contain textual and audio content [23]. The textual content of each image $I(i,j)$ is converted into a binary image, and the white regions are extracted from it. These extracted white regions $I_w(i,j)$ are then subjected to the region growing approach to localize the text. The segmentation of $I_w(i,j)$ proceeds from seed points, which have to be regularized. A seed point is the starting point for region growing, and its selection is significant for the segmentation result. The stages of the region growing approach are described in the following steps.
Step 1: The input image $I_w(i,j)$ is split into a large number of blocks $P$, each of which comprises one centre pixel and several neighbouring pixels.
Step 2: Fix the intensity threshold $R_i$.
Step 3: For every block $P$, carry out the following steps until all the blocks of the image have been processed.
Step 3(a): Compute the histogram $G$ of all pixels in $P$.
Step 3(b): Identify the most frequent histogram value of the $P$-th block, denoted by $U_h$.
Step 3(c): Select any pixel according to $U_h$ and assign it as the seed point, with intensity $Int_u$.
Step 3(d): Consider the adjacent pixels, with intensity $Int_n$.
Step 3(e): Compute the intensity difference between $u$ and $n$, i.e. $Dif_{Int} = \lVert Int_u - Int_n \rVert$.
Step 3(f): If $Dif_{Int} < R_i$, add the corresponding pixel to the region, so that the region grows; otherwise go to Step 3(h).
Step 3(g): Check whether all the pixels have been added to the region. If yes, go to Step 2 and then carry out Step 3(h).
Step 3(h): Re-estimate the region, find the new seed points and repeat the procedure from Step 3(a).
Step 4: Finish the whole process.
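The core of Steps 3(a) to 3(f) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes an 8-bit grayscale NumPy array, 4-connected neighbours, and a comparison of each neighbour against the seed intensity. The function names `pick_seed`, `grow_region` and `region_from_block` are hypothetical, and the re-seeding loop of Steps 3(g) and 3(h) is left out.

```python
from collections import deque

import numpy as np


def pick_seed(block):
    """Steps 3(a)-3(c): pick a pixel holding the block's most frequent intensity."""
    histogram = np.bincount(block.ravel(), minlength=256)  # histogram G
    most_frequent = int(np.argmax(histogram))               # U_h
    rows, cols = np.nonzero(block == most_frequent)
    return int(rows[0]), int(cols[0])


def grow_region(image, seed, threshold):
    """Steps 3(d)-3(f): add 4-connected neighbours whose intensity is close to the seed."""
    height, width = image.shape
    seed_intensity = int(image[seed])                       # Int_u
    region = np.zeros(image.shape, dtype=bool)
    region[seed] = True
    queue = deque([seed])
    while queue:
        row, col = queue.popleft()
        for n_row, n_col in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
            inside = 0 <= n_row < height and 0 <= n_col < width
            if inside and not region[n_row, n_col]:
                # Dif_Int = |Int_u - Int_n| (assumption: compared against the seed)
                difference = abs(seed_intensity - int(image[n_row, n_col]))
                if difference < threshold:                  # Dif_Int < R_i
                    region[n_row, n_col] = True
                    queue.append((n_row, n_col))
    return region


def region_from_block(image, top, left, block_size, threshold):
    """Seed inside one block P, then grow the region over the whole image."""
    block = image[top:top + block_size, left:left + block_size]
    row, col = pick_seed(block)
    return grow_region(image, (top + row, left + col), threshold)
```

For example, calling `region_from_block(image, top=1, left=1, block_size=4, threshold=50)` on a dark frame with one bright patch returns a boolean mask that covers only the patch, provided the patch dominates the chosen block.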
The textual information acquired from the region-based approach is $I_{text}(i,j)$, which is fed as input to K-means clustering for cropping the words from the text.

B. Segmentation
The K-means algorithm works in two separate stages. In the initial stage, $k$ centers are selected randomly, where the value of $k$ is fixed beforehand. In the subsequent stage, every data object is assigned to the closest center. The distance between each data object and the cluster centers is determined using the Euclidean distance. This process is iterated until the termination criterion reaches its minimum. At the end of segmentation, the words $I_{word}(i,j)$ are extracted and subjected to pre-processing.
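The two-stage procedure can be sketched with NumPy as follows. This is a plain Lloyd-style K-means under stated assumptions: the data objects are taken to be (row, column) coordinates of text pixels, and the loop stops once the centers move by less than a small tolerance. The source specifies neither choice, and the name `kmeans` and its parameters are illustrative only.

```python
import numpy as np


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster `points` (n x d) into k groups; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Stage 1: select k initial centers at random from the data.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Stage 2: assign every object to its closest center (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Recompute each center as the mean of its members (keep empty clusters as is).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])

        # Assumed termination rule: stop when the centers no longer move.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return labels, centers
```

With pixel coordinates as input, each resulting cluster can be cropped by the bounding box of its members to obtain one word image.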

C. Pre-processing
The gathered words $I_{word}(i,j)$ are pre-processed to enhance the recognition accuracy. The steps involved in pre-processing are listed below:
Step 1: Initially, the collected words $I_{word}(i,j)$ are subjected to histogram equalization, which transforms the intensity values so as to stretch out the intensity range of the image.
Step 2: The equalized words are then converted into binary images $I_{binary}(i,j)$.
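A minimal sketch of these two pre-processing steps is given below for an 8-bit grayscale word image. The histogram equalization follows the standard cumulative-histogram mapping. The binarization rule is an assumption, since the source does not state one: here the mean intensity serves as a hypothetical threshold. The names `equalize_histogram` and `binarize` are illustrative.

```python
import numpy as np


def equalize_histogram(gray):
    """Stretch the intensity range of an 8-bit image via its cumulative histogram."""
    histogram = np.bincount(gray.ravel(), minlength=256)
    cdf = histogram.cumsum()
    cdf_min = cdf[cdf > 0][0]          # cumulative count at the darkest present level
    total = gray.size
    if total == cdf_min:               # constant image: nothing to stretch
        return gray.copy()
    lookup = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lookup = np.clip(lookup, 0, 255).astype(np.uint8)
    return lookup[gray]                # map every pixel through the lookup table


def binarize(gray, threshold=None):
    """Return a 0/1 image; the mean-intensity default is an assumed threshold."""
    if threshold is None:
        threshold = gray.mean()
    return (gray > threshold).astype(np.uint8)
```

In practice the threshold could equally come from a method such as Otsu's; the mean is used here only to keep the sketch self-contained.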

D. Recognition
The binarized images $I_{binary}(i,j)$ are subjected to recognition via the DCNN [25]. DCNNs are CNNs that comprise numerous layers and follow a hierarchical principle. Usually, deep CNNs involve several fully connected layers, i.e., layers with a dense weight matrix $W$. For recognition, the outputs can be used as inputs to an SVM or a Random Forest (RF), or the output stage can be a softmax function as specified in Eq. (1), in which $\mathbf{1}$ denotes a column vector of ones.
In brief, the DCNN performs the computations specified in Eq. (4)-Eq. (7), in which the output activation function $f_x$ can be the softmax, the identity or another function.
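As an illustration of the fully connected stage and the softmax output, the sketch below assumes the usual softmax form $\exp(z) / (\mathbf{1}^{T}\exp(z))$, with $\mathbf{1}$ a vector of ones. The ReLU hidden activation, the layer sizes and the names `softmax` and `dense_forward` are assumptions made for the example. The convolutional layers of the DCNN are not shown.

```python
import numpy as np


def softmax(z):
    """Softmax written as exp(z) / (1^T exp(z)), with 1 a vector of ones."""
    exponentials = np.exp(z - z.max())   # subtracting the max keeps exp() stable
    ones = np.ones_like(exponentials)
    return exponentials / (ones @ exponentials)


def dense_forward(x, layers):
    """Forward pass through fully connected layers given as (W, b) pairs.

    Hidden layers use ReLU (an assumption); the last layer uses softmax.
    """
    activation = x
    for weights, bias in layers[:-1]:
        activation = np.maximum(0.0, weights @ activation + bias)
    weights, bias = layers[-1]
    return softmax(weights @ activation + bias)
```

The returned vector can be read as class probabilities, with the recognized label given by its largest entry.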

IV. RESULTS AND DISCUSSION

A. Simulation Procedure
The proposed video lecture recognition approach is implemented in Python, and the results obtained are recorded. The dataset for the evaluation is downloaded from LectureVideoDB. In addition, two public datasets are utilized for pre-training the word recognition models.
• IAM Handwriting Database: It includes contributions from over 600 writers and comprises 115,320 words in English.
• MJSynth: This is a synthetically generated dataset for scene text recognition. It contains 8 million training images and their corresponding ground truth words.
The sample image collected and its segmented images are depicted in Fig. 3.

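Precision: It is expressed mathematically in Eq. (9).

$$PPV = \frac{TrP}{TrP + FrP} \tag{9}$$

Sensitivity: It is the ratio of the correctly classified positive samples to the total number of positive samples. Mathematically, it is expressed in Eq. (10).

$$Sensitivity = \frac{TrP}{TrP + FrN} \tag{10}$$

Specificity: It is the "ratio of the correctly classified negative samples to the total number of negative samples". This can be mathematically defined in Eq. (11).

$$Specificity = \frac{TrN}{TrN + FrP} \tag{11}$$

Here, $TrP$ denotes the true positives, $TrN$ the true negatives, $FrP$ the false positives and $FrN$ the false negatives.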
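The measures can be computed directly from the four confusion counts, as in the short sketch below. Precision, sensitivity and specificity follow Eq. (9) to Eq. (11). The accuracy formula is the standard definition and is an assumption here. The function name `classification_metrics` is illustrative.

```python
def classification_metrics(trp, trn, frp, frn):
    """Return accuracy, precision, sensitivity and specificity from confusion counts."""
    return {
        # Standard definition, assumed: correct decisions over all decisions.
        "accuracy": (trp + trn) / (trp + trn + frp + frn),
        "precision": trp / (trp + frp),      # Eq. (9)
        "sensitivity": trp / (trp + frn),    # Eq. (10)
        "specificity": trn / (trn + frp),    # Eq. (11)
    }
```

For instance, with hypothetical counts of 8 true positives, 9 true negatives, 1 false positive and 2 false negatives, the sensitivity is 8 / 10 = 0.8 and the specificity is 9 / 10 = 0.9.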

B. Evaluation
The evaluation is carried out by varying the learning percentage (LP). The results obtained for the performance measures at different frame counts are shown graphically. Table II and Fig. 4 show the accuracy of the presented work. The presented work attains its highest accuracy of 89.3% for 500 frames at both LP=60 and LP=70. The precision values are tabulated in Table III and shown graphically in Fig. 5; the highest precision of 95 is obtained at LP=70 for 500 frames. The sensitivity of the presented work is highest at 500 frames for both LP=60 and LP=70, and the results are presented in Table IV and Fig. 6; the highest sensitivity obtained at 500 frames is 91.2. The specificity of the presented work for LP=60 and LP=70 is given in Table V and Fig. 7. The specificity at LP=70 has its highest value of 91.2 for 500 frames, and it is higher at LP=70 than at LP=60 for every frame count. The experimental results show that the DCNN model achieves promising recognition accuracy. On the whole, the detection rate is higher for LP=70.

V. CONCLUSION
Although OCR has been considered a solved problem, Handwritten Text Recognition, a crucial component of OCR, is still challenging. The wide variation in handwriting styles across people and the poor quality of handwritten text compared with typed or printed text pose substantial hurdles in converting handwritten text into machine-readable text. Nevertheless, working on this problem is important because of its relevance to multiple industries such as healthcare, insurance and banking. This paper presented a novel text detection and recognition approach for lecture video data that follows four major phases: (a) text localization, (b) segmentation, (c) pre-processing and (d) recognition. In the initial phase, text localization in the lecture video frames was accomplished using the MRG algorithm. Then, the localized text was segmented via K-means clustering, which separated the words from the detected text regions. Subsequently, the segmented words were pre-processed to reduce blurriness artifacts. Finally, the pre-processed words were recognized using the DCNN. The performance of the proposed model was analysed in terms of accuracy, precision, sensitivity and specificity. Experimental results reveal that, at LP=70, the presented work achieves its highest accuracy of 89.3% for 500 frames. In future, fusion-based DCNN models will be explored to achieve more accurate handwritten text recognition. A more robust training procedure could also be applied together with additional pre-processing techniques. We will focus on developing a more comprehensive model with reduced training time.