Bangla Optical Character Recognition and Text-to-Speech Conversion using Raspberry Pi

Optical Character Recognition (OCR) technology is very helpful for visually impaired or illiterate persons who cannot read text documents but need access to their content. In this paper, a camera-based assistive device is presented that allows visually impaired or illiterate people to understand Bangla text documents by listening to the contents of Bangla text images. The work mainly involves extracting the Bangla text from a Bangla text image and converting the extracted text to speech. It has been implemented on a Raspberry Pi with a camera module by applying the Tesseract OCR engine, the Open Source Computer Vision library (OpenCV), and the Google Speech Application Program Interface. This work can help Bangla speakers who are unable to read or have a significant loss of vision.
Keywords—Optical character recognition; Bangla text; speech conversion; Raspberry Pi; camera module


I. INTRODUCTION
Text and speech are the primary media of communication among human beings. Accessing text information requires eyesight, but an individual can also receive information by listening. According to the World Health Organization, there are 285 million visually impaired people and 39 million blind people in the world [1]. More than 90 percent of visually impaired people live in developing countries [2], and, according to UNESCO [3], 27.11% of adults are illiterate. These facts underline the importance of research into systems that can help visually impaired persons overcome their limitations.
Raspberry Pi [4] is a single-board computer. It helps people explore computing and learn programming languages such as Python and Scratch. It can perform calculations, play music, run games, and carry out the other functions of a conventional computer. The main advantages of the device are its portability and low cost. The board is designed for experimentation and innovation. Its two model types differ from each other in their USB ports. Because of these features, Raspberry Pi has become an ideal tool for IoT and automation research.
In recent years, OCR [5] has been used for converting images to text. It helps millions of people obtain information from documents such as airline tickets, medical documents, and mail in their respective files. With recent advances in OCR technology and algorithms such as the Tesseract OCR engine [6], a huge number of characters in various languages can be recognized. OCR is applied across technological organizations around the world. Its applications also include the recognition of handwritten characters in various languages.
Bengali, or Bangla, is an Indo-Aryan language. Around 210 million people speak Bengali as a primary or secondary language, of whom around 100 million are from Bangladesh and around 85 million are from India [7]. Bangla OCR differs from OCR for other languages because of the basic structure of the Bengali script. Bengali letters, as shown in Fig. 1, have varied shapes and edges. In addition, the Bengali script contains a large number of characters: there are 57 characters, of which 21 are vowels and 36 are consonants. Because of the curves, slants, and strokes in the characters, researchers face various challenges.
In this paper, we propose a Bengali OCR-based system. The user captures an image with the camera included in the proposed device, and the image is analyzed in the various phases of the Tesseract OCR engine. The text in the input image is extracted using methods from Open Source Computer Vision (OpenCV). To convert the text into speech, the gTTS (Google Text-to-Speech) library is used, and it works offline. The Raspberry Pi board has a slot for connecting a headphone, so the user can hear the speech of the text through the headset.
The rest of this paper is organized as follows. Section II outlines the existing relevant work. Section III explains the working methodology and the system architecture, including the hardware and software implementation. Section IV presents the system evaluation. Finally, Section V concludes the paper and outlines future work.

II. RELATED WORKS
Very little research has been done to help blind people follow or recognize objects so that they can go about their daily life smoothly. Problems of this kind are notably complex, and the work is exceptionally difficult. Nevertheless, researchers have proposed several possible solutions for text-to-speech conversion using diverse methodologies.
R. Naveen et al. [8] introduced a camera-based assistive framework, built on the Raspberry Pi, that helps blind people recognize faces, signs, obstacles, text, and the emotions of other persons in their daily life and delivers the result as audio output through an earphone. This work is mainly devised for blind navigation, but it does not present result-oriented data or report an output accuracy rate.
L. Nagaraja et al. [9] proposed a camera-based assistive method for blind people that reads text from images of printed notes and papers and converts it to speech. This work is a vision-based recognition system built on the Raspberry Pi, in which the text is extracted from the images by OCR engines. Due to the resolution of the images, this work has not achieved 100% accuracy.
V. A. Devi and S. S. Baboo [10] developed an embedded, OCR-based framework that reads Tamil text from an image using the Raspberry Pi. This work handles only Tamil text, and only images captured from a printed page.
G. K. Sagar and T. Shreekanth [11] implemented an OCR system on the Raspberry Pi that converts text to speech to help blind people and people with poor vision. In this TTS system, images are captured through a webcam and processed by the TTS stage, and the audio is amplified and sent to a speaker. Classification is done by template matching and neural networks.
S. A. James et al. [12] used the Raspberry Pi to implement an OCR-based automatic book reader. Here, OCR is used to recognize the text for the reader. Text classification and adjacent character grouping are performed using the AdaBoost learning model.
C. A. Todd et al. [13] introduced an audio-haptic tool intended mainly for visually impaired users. This work is based on web technologies with audio and haptic interaction.
D. Velmurugan et al. [14] produced an OCR-based smart book reader with a Raspberry Pi controller. The pixels of the image are transformed using various methods, and the extracted text is then converted to speech, which is played through a headset or another device.
C. S. T. Thu and T. Zin [15] implemented OCR to recognize capital English letters and numbers using MATLAB. This work consists of two major steps, the OCR system and the TTS conversion. A neural network is used for classification.
H. M. Htun, T. Zin, and H. M. Tun [16] developed a TTS system that converts text, including numbers, words, and sentences, to speech for visually impaired and handicapped people. Here, tokens are obtained by segmenting the input sentence, and each word is passed to POS tagging. A bigram model is used for the POS tagger.
A. Goel et al. [17] implemented an English book reader on the Raspberry Pi. Here, Python is used for text extraction from captured images and for conversion to audio speech.
H. Rithika and B. N. Santhoshi [18] proposed a model that lets the user listen to the content of a text image in a desired language. It extracts the text from the input picture and converts it into speech, and the language of the speech is chosen by the user. This work is accomplished with a Raspberry Pi and a camera module, together with the Tesseract OCR engine and GSAPI.

III. PROPOSED METHODOLOGY AND SYSTEM ARCHITECTURE
The proposed system is divided into two parts: system hardware implementation and system software implementation. Bangla scripts are common in daily life. A user who cannot read a Bangla document captures an image of it using the Raspberry Pi camera. The captured image is then processed and analyzed on the Raspberry Pi using the Tesseract OCR engine and OpenCV [19] methods, and the text of the analyzed image is extracted. Using the gTTS library, the extracted text is converted to speech offline. Finally, the user hears the text written in the Bangla script as speech. Fig. 2 shows the proposed system architecture.

A. System Hardware Implementation
The hardware of the device consists of the following parts: a Raspberry Pi Camera Module for image capturing, a breadboard carrying a push-button, a Raspberry Pi board to execute the image recognition programs, and a headphone to produce the desired output speech. The system hardware organization is shown in Fig. 3.
a) Raspberry Pi: Raspberry Pi [4] is a series of single-board computers developed by the Raspberry Pi Foundation for teaching basic computer science at schools in developing countries. There are various versions of Raspberry Pi, which differ in memory and USB support. The first-generation Raspberry Pi uses the Broadcom BCM2835 SoC. In this system, the Raspberry Pi 3 Model B is used.
b) Pi camera: The Pi camera module takes pictures and records videos on the Raspberry Pi, which has a dedicated port for connecting it. There are various versions of the Pi camera, and most of them deliver a clear image. In this device, the Raspberry Pi Camera V2 is used for capturing images. It has 5MP resolution and supports 1080p30 video recording.
c) Memory (Storage): The device has no hard disk or solid-state drive. Instead, a micro SD card, inserted into a slot on the Raspberry Pi, is used for booting the Linux kernel-based operating system (OS). The card is also used for data storage: each captured image is stored on it, and so is the analyzed image. The specifications of the hardware components used are provided in Table I.
TABLE I. Hardware components
Memory: Samsung 16GB micro SD
d) Camera enabling: There are three ways to enable the Pi camera on the Raspberry Pi: manually through the settings, through the command line, and in Python code. Enabling it through the command line is preferable. The camera does not work until it is enabled. After the camera is enabled, its configuration is adjusted automatically by the Raspberry Pi.
e) Push-button setup in breadboard: A breadboard is a rectangular plastic board with many tiny holes. In this project, a push-button mounted on the breadboard is used to trigger image capture: when the push-button is pressed, the image is captured. The push-button is wired on the breadboard according to the circuit diagram in Fig. 4. Fig. 5 shows the system setup after all of the hardware components are connected.
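The push-button capture described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' program: it assumes the `gpiozero` and `picamera` libraries, a button wired to GPIO pin 17 (the paper does not state the pin), and a hypothetical output directory and file naming scheme. The hardware libraries are imported only inside `main()`, so the capture logic can also be exercised away from the device with stand-in objects.

```python
import os
import time


def image_path(directory, timestamp):
    """Build a JPEG file name from a Unix timestamp (assumed naming scheme)."""
    stamp = time.strftime("%Y%m%d_%H%M%S", time.gmtime(timestamp))
    return os.path.join(directory, "capture_" + stamp + ".jpg")


def capture_on_press(button, camera, directory, clock=time.time):
    """Block until the push-button is pressed, then save one JPEG image.

    `button` needs a wait_for_press() method and `camera` a capture(path)
    method; gpiozero.Button and picamera.PiCamera both provide these.
    """
    button.wait_for_press()
    path = image_path(directory, clock())
    camera.capture(path)  # picamera infers the JPEG format from the extension
    return path


def main():
    # Imported here because these libraries are only usable on the Pi.
    from gpiozero import Button
    from picamera import PiCamera

    button = Button(17)  # GPIO pin 17 is an assumed wiring choice
    camera = PiCamera()
    try:
        print(capture_on_press(button, camera, "/home/pi/captures"))
    finally:
        camera.close()
```

On the device, calling `main()` waits for one button press and prints the path of the stored image.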

B. System Software Implementation
For the system software implementation, we installed the Raspbian operating system [20] on a memory card and inserted the card into the Raspberry Pi. After installing Raspbian OS, we also installed various tools, packages, and libraries. Table II shows the details of some of the software components.
The image captured by the Raspberry Pi camera is stored in a file in JPEG format. The stored image is analyzed in a Python script through the steps sketched in Fig. 6.
The details of these steps are described below:
a) Acquisition of image: In this step, a Bangla text image is captured using the Pi camera. The captured image is sent to the preprocessing step, where unwanted noise is reduced.
b) Preprocessing of image: In this step, unwanted noise is removed from the image by applying the discrete cosine transform, morphological transformations such as the black-hat transformation and dilation, thresholding, contour generation, and bounding-box formation. First, the captured image is rescaled to a suitable size and converted into a grayscale image for further processing. The grayscale image is then compressed using the discrete cosine transform (DCT), which simplifies the subsequent processing. The compressed image contains various unwanted high-frequency components, which are discarded by setting the vertical and horizontal ratios. The image is then decompressed by applying the inverse DCT. Next, two morphological operations are applied. The black top-hat (black-hat) transformation extracts the objects or elements that are smaller than the defined structuring element. Dilation then adds pixels to the edges of the objects in the image; the number of pixels added depends on the shape and size of the structuring element. A thresholding algorithm is then applied to the image; among the available algorithms, adaptive thresholding is chosen. The contours of the image are then generated using OpenCV functions. The generated contours are used to draw the bounding boxes of the objects or elements in the image, and the bounding boxes are used to extract every character. Finally, the OCR engine is applied to detect the full text of the image.
c) Extraction of text: From the input image, the recognized text is extracted in this step. This extraction is performed using the Tesseract OCR engine.
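The extraction step can be sketched as follows, assuming the `pytesseract` wrapper around Tesseract and an installed Bengali language pack (`ben`). The helper names `clean_text` and `extract_bangla_text`, and the whitespace clean-up itself, are illustrative assumptions and are not taken from the paper. The OCR call is passed in as a parameter so that the function can be tried without Tesseract installed.

```python
def clean_text(raw):
    """Strip each line and drop empty ones (an assumed post-processing step)."""
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_bangla_text(image, ocr=None):
    """Run OCR on an image given as a file path, PIL image, or NumPy array."""
    if ocr is None:
        # Imported lazily: needs Tesseract and its Bengali data ('ben').
        import pytesseract

        def ocr(img):
            return pytesseract.image_to_string(img, lang="ben")

    return clean_text(ocr(image))
```

For example, `extract_bangla_text("capture.jpg")` would return the recognized Bangla text of a stored image, where the file name is hypothetical.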
d) Text to speech converter: In this step, the extracted text is converted to speech by the gTTS engine, using some of its predefined libraries. The gTTS engine offers online and offline systems. In our project, we have used an offline system for user portability.
e) Desired output speech: When the speech is generated, the user can hear it through the headphones. Because the input is a Bangla text image, the speech is in the Bangla language.
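The text-to-speech and playback steps above can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the `gTTS` Python package with the language code `bn` and the `mpg123` command-line player, neither of which is confirmed by the paper as the exact setup. The `gTTS` package as published on PyPI normally sends the text to Google's service, so the synthesizer is a parameter here and an offline engine could be passed in its place.

```python
import subprocess


def synthesize_with_gtts(text, path, lang="bn"):
    """Save `text` as an MP3 file using the gTTS package ('bn' is Bengali)."""
    from gtts import gTTS  # imported lazily; third-party package

    gTTS(text=text, lang=lang).save(path)


def play_with_mpg123(path):
    """Play an MP3 on the default audio output (assumes mpg123 is installed)."""
    subprocess.run(["mpg123", "-q", path], check=True)


def speak(text, path="speech.mp3",
          synthesize=synthesize_with_gtts, play=play_with_mpg123):
    """Convert text to speech and play it; return False if the text is empty."""
    text = text.strip()
    if not text:
        return False
    synthesize(text, path)
    play(path)
    return True
```

Skipping empty text avoids calling the synthesizer when the OCR step finds nothing in the image.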

IV. SYSTEM EVALUATION
Evaluating the system requires some printed Bangla documents. A monitor is included in the system to check the recognition of text from the captured image. Fig. 7 shows the final view of our system.
To obtain a result, an image is captured using the Raspberry Pi camera and taken as the input image. Image preprocessing and text extraction are handled by the methods of the Tesseract OCR engine. The accuracy of text extraction from captured images is not 100% because of the mid-resolution Pi camera. After the text is extracted, the user can hear the speech through a headphone connected to the earphone slot of the Raspberry Pi. For clear images, the accuracy of text extraction is satisfactory, and for some of them it reaches 100%. Fig. 8, 9, 10, and 11 show the input image, the Bangla text extracted from the input image, the text displayed on the screen, and a user listening to the speech, respectively.
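The paper does not state how extraction accuracy is measured. One common choice, shown here purely as an assumption, is character-level accuracy based on the Levenshtein edit distance between a reference transcription and the recognized text. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Return accuracy in percent, counted over Unicode code points."""
    if not reference:
        return 100.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return 100.0 * max(0, len(reference) - errors) / len(reference)
```

For instance, `character_accuracy("বাংলা", "বাংলো")` gives 80.0, because one of the five code points in the reference is substituted. Counting code points treats a Bangla vowel sign as its own character; counting grapheme clusters instead would be an equally defensible convention.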

V. CONCLUSION AND FUTURE WORKS
This research uses a Raspberry Pi, a Pi camera, the Tesseract OCR engine, and related tools to help visually impaired or illiterate people listen to the content of Bangla text images. It can also be used by anyone who prefers listening to the content of an image over reading it. We have achieved 97.4% accuracy for clear Bangla text images. Because the Pi camera is a mid-range camera, the quality of the captured image is poor in low light, and images captured at night are obscure. As a result, the accuracy of text extraction is sometimes not up to the mark.
In the future, we would like to enhance this system by adding a higher-resolution Pi camera to increase the accuracy of text extraction and by eliminating noise from the speech using advanced algorithms. Furthermore, we would like to improve the portability of the system by making the hardware more compact through design improvements and hardware upgrades. We would also like to extend our research to text extraction from handwritten Bangla script.