Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.
Digital Object Identifier (DOI): 10.14569/IJACSA.2018.090351
Article Published in International Journal of Advanced Computer Science and Applications (IJACSA), Volume 9, Issue 3, 2018.
Abstract: Improving the accuracy of Arabic text recognition in imagery requires a large, modern dataset, since data is the fuel for many modern machine learning models. This paper proposes a new dataset, called QTID (Quran Text Image Dataset), the first Arabic dataset that includes Arabic marks (diacritics). It consists of 309,720 different annotated Arabic word images of size 192x64, containing 2,494,428 characters in total, all taken from the Holy Quran. These finely annotated images were randomly divided into training, validation, and test sets of 90%, 5%, and 5%, respectively. To analyze QTID, several dataset statistics are presented. Experimental evaluation shows that the current best Arabic text recognition engines, such as Tesseract and ABBYY FineReader, perform poorly on word images from the proposed dataset.
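As a rough illustration of the random 90%/5%/5% division described in the abstract, the following minimal Python sketch shuffles image indices and cuts them into three subsets. It is not the authors' procedure: the function name `split_indices`, the fixed seed, and the use of integer percentages are assumptions made for this example, and the resulting subset sizes are simply what the stated percentages give for 309,720 images, not counts reported in the paper.

```python
import random


def split_indices(n_items, percents=(90, 5, 5), seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/val/test lists.

    Hypothetical helper: integer percentages avoid floating-point rounding,
    and any remainder left after the first two cuts goes to the test set.
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumption)
    n_train = n_items * percents[0] // 100
    n_val = n_items * percents[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# 309,720 is the number of word images stated in the abstract.
train, val, test = split_indices(309_720)
print(len(train), len(val), len(test))  # prints: 278748 15486 15486
```

In practice the indices would be mapped to image file names and their annotations; the file layout of the dataset is not described here, so that step is left out.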
Mahmoud Badry, Hesham Hassan, Hanaa Bayomi and Hussien Oakasha, “QTID: Quran Text Image Dataset,” International Journal of Advanced Computer Science and Applications (IJACSA), 9(3), 2018. http://dx.doi.org/10.14569/IJACSA.2018.090351