The Science and Information (SAI) Organization

Article Details

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.

Optical Character Recognition Engines Performance Comparison in Information Extraction

Author 1: Tosan Wiar Ramdhani
Author 2: Indra Budi
Author 3: Betty Purwandari

Digital Object Identifier (DOI): 10.14569/IJACSA.2021.0120814

Article published in International Journal of Advanced Computer Science and Applications (IJACSA), Volume 12 Issue 8, 2021.

Abstract: Named Entity Recognition (NER) is often used to acquire important information from text documents as part of the Information Extraction (IE) process. However, the quality of the text documents affects the accuracy of the data obtained, especially for documents acquired through Optical Character Recognition (OCR), which never reaches 100% accuracy. This research examines which OCR engine performs best for IE using NER by comparing three OCR engines (Foxit, PDF2GO, Tesseract) over 8,562 government human resources documents across six document categories, two document structures, and four measurements. Several essential entities, such as name, employee ID, document number, document publishing date, employee rank, and family member's name, were extracted automatically from the documents. The NER processes were implemented in the Python programming language, and the preprocessing tasks were done separately for Foxit, PDF2GO, and Tesseract. In summary, each OCR engine has its benefits and drawbacks; for example, Tesseract has better NER extraction and conversion time with better accuracy, but it acquires fewer entities.

Keywords: Named entity recognition; information extraction; optical character recognition; government human resources documents
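
The abstract describes extracting entities from the text produced by each OCR engine and comparing the engines on measures such as accuracy and the number of entities acquired. The following is a minimal Python sketch of that kind of comparison, not the authors' implementation: the regular-expression patterns, the function names `extract_entities` and `score_engine`, the similarity measure, and the sample OCR strings are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns for illustration only; the paper does not publish its rules.
PATTERNS = {
    "employee_id": re.compile(r"\b\d{18}\b"),
    "document_date": re.compile(r"\b\d{2}-\d{2}-\d{4}\b"),
}


def extract_entities(text):
    """Return the first match per entity type found in the OCR output text."""
    found = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[label] = match.group(0)
    return found


def score_engine(ocr_text, expected):
    """Count extracted entities and average their similarity to the expected values."""
    entities = extract_entities(ocr_text)
    ratios = [
        SequenceMatcher(None, value, expected.get(label, "")).ratio()
        for label, value in entities.items()
    ]
    accuracy = sum(ratios) / len(ratios) if ratios else 0.0
    return {"entities_found": len(entities), "accuracy": accuracy}


# Made-up OCR outputs for a single document; these are not data from the paper.
expected = {"employee_id": "198001012005011001", "document_date": "01-02-2020"}
outputs = {
    "engine_a": "ID 198001012005011001 issued 01-02-2020",
    "engine_b": "ID 198001012005011001 issued O1-02-2020",  # letter O misread for zero
}
results = {name: score_engine(text, expected) for name, text in outputs.items()}
```

In this toy input, a single misread character in the date makes the second engine miss one entity while the entity it does find is still correct, which illustrates why the number of entities acquired and the accuracy of those entities are worth measuring separately.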

Tosan Wiar Ramdhani, Indra Budi and Betty Purwandari, “Optical Character Recognition Engines Performance Comparison in Information Extraction,” International Journal of Advanced Computer Science and Applications (IJACSA), 12(8), 2021. http://dx.doi.org/10.14569/IJACSA.2021.0120814

@article{Ramdhani2021,
title = {Optical Character Recognition Engines Performance Comparison in Information Extraction},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2021.0120814},
url = {http://dx.doi.org/10.14569/IJACSA.2021.0120814},
year = {2021},
publisher = {The Science and Information Organization},
volume = {12},
number = {8},
author = {Tosan Wiar Ramdhani and Indra Budi and Betty Purwandari}
}


© The Science and Information (SAI) Organization Limited. Registered in England and Wales. Company Number 8933205. All rights reserved. thesai.org