Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment

Rifiana Arief; Achmad Benny Mutiara; Tubagus Maulana Kusuma; Hustinawaty

doi:10.14569/IJACSA.2018.091117

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 9, Issue 11, 2018.

Abstract: Documents are now being digitized in all fields to reduce paper usage. The availability of modern technology in the form of scanners and cameras supports the growth of multimedia data, especially documents stored as image files. Searching for a particular text in a large-scale collection of scanned document images is difficult when the text has not been extracted from the images. In this research, a method for extracting text from large-scale scanned document images using Google Vision OCR on the Hadoop architecture is proposed. The object of research is student thesis documents, comprising the cover page, the approval page, and the abstract. All documents are stored in the university's digital library. The extraction process begins by preparing an input folder containing the image documents (in JPEG format) in the Hadoop Distributed File System (HDFS) of Apache Hadoop, followed by reading each image document. The image document is then processed with Google Vision OCR to obtain a text document (in TXT format), and the result is saved to an output folder in HDFS. The same process is repeated for every document in the folder. Test results show that the proposed method was able to extract all test documents successfully. The recognition process achieved 100% accuracy, and extraction was twice as fast as manual extraction. Google Vision OCR also showed better extraction performance than other OCR tools. The proposed automated extraction system can recognize text in large-scale image documents accurately and can be operated in a real-time environment.

Keywords: Automation; extraction; Google Vision OCR; Hadoop; scanned document images
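The extraction loop described in the abstract (read each JPEG from an input folder, run OCR, write a TXT file to an output folder) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `extract_folder` and `vision_ocr` are hypothetical, a local directory stands in for the HDFS input and output folders, and the choice of the Vision client's `document_text_detection` call is an assumption, since the abstract does not say which Vision endpoint was used.

```python
from pathlib import Path
from typing import Callable, List


def vision_ocr(image_bytes: bytes) -> str:
    """OCR one image with the Google Cloud Vision client library.

    Assumption: the google-cloud-vision package (2.x or later) is installed
    and application credentials are configured. The paper's actual client
    code is not given in the abstract.
    """
    from google.cloud import vision  # lazy import: only needed for real OCR

    # A client per call keeps the sketch short; a real job would reuse one.
    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(
        image=vision.Image(content=image_bytes)
    )
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text


def extract_folder(
    input_dir: Path, output_dir: Path, ocr: Callable[[bytes], str]
) -> List[Path]:
    """Run `ocr` on every JPEG in input_dir and write one TXT file per image.

    Local paths are used here in place of HDFS folders (an assumption made
    so the sketch runs without a Hadoop cluster).
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_file in sorted(input_dir.iterdir()):
        # Only JPEG image documents are processed; other files are skipped.
        if image_file.suffix.lower() not in {".jpg", ".jpeg"}:
            continue
        text = ocr(image_file.read_bytes())
        target = output_dir / (image_file.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Passing the OCR step in as a callable keeps the folder loop independent of the OCR engine, so a call such as `extract_folder("input", "output", vision_ocr)` could be swapped to another OCR tool for comparison. In a Hadoop deployment the directory listing, reads, and writes would go through an HDFS client instead of `pathlib`.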

Rifiana Arief, Achmad Benny Mutiara, Tubagus Maulana Kusuma and Hustinawaty, “Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment,” International Journal of Advanced Computer Science and Applications (IJACSA), 9(11), 2018. http://dx.doi.org/10.14569/IJACSA.2018.091117

@article{Arief2018,
title = {Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2018.091117},
url = {http://dx.doi.org/10.14569/IJACSA.2018.091117},
year = {2018},
publisher = {The Science and Information Organization},
volume = {9},
number = {11},
author = {Rifiana Arief and Achmad Benny Mutiara and Tubagus Maulana Kusuma and Hustinawaty}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.
