Skew Detection and Correction of Mushaf Al-Quran Script using Hough Transform

Document skew detection and correction is one of the basic preprocessing steps in document analysis. Correcting skewed scanned images is critical because skew has a direct impact on image quality. In this paper, the authors propose a method for skew detection and correction of Mushaf Al-Quran image pages based on the Hough transform. The technique uses Hough transform line detection to calculate the skew angle. It works for different versions of Mushaf Al-Quran image pages that have skewed text zones. Moreover, it can detect and correct skew angles within a range of 20 degrees. Experiments conducted on different Mushaf Al-Quran image pages show the accuracy of the method.

Keywords: Skew detection; skew correction; Hough transform; preprocessing; binarization; image analysis


I. INTRODUCTION
Document image processing is a rapidly growing field. It aims to convert paper-based documents into forms that are suitable for storage. It can be defined as the set of methods used to perform operations on a given image, such as digitization, storage, compression and re-printing [1]. Besides that, image processing can serve as a basis for other fields, such as electronic engineering and computer science. One of the problems in this field is that the text in a document may be rotated during scanning, which produces skewed text in the resulting document, as in Fig. 1. As a result, the quality of the document decreases, which causes multiple problems in analyzing the image and reduces the performance of optical character recognition (OCR) [18]. This paper focuses on skew detection and correction for Mushaf Al-Quran image pages. Compared to other scripts, skew detection and correction for the Mushaf Al-Quran script is quite different, because it has diacritical marks and its handwritten style differs from normal Arabic scripts. The Hough transform is a simple feature extraction technique that is widely used in computer vision, image analysis and image processing. It can be used to detect straight lines in an image [9].

II. RELATED WORK
Many studies on skew correction have been published, but for different languages such as English, Urdu and Chinese. In a document, text can be written on several text lines. Various methods are used for skew detection and correction, based on different algorithms such as projection profile, nearest neighbor clustering, Fourier transform, cross correlation and others. Skew can be defined as the angle by which the text deviates from the x-axis. Furthermore, accurate skew detection and correction helps the other OCR processes to be more successful.

In [2], a novel method was proposed to recognize Arabic/Jawi and Roman digits by OCR. In [3], skew in documents is classified into three classes, namely global skew, multiple skew and non-uniform text line skew. In [4], document analysis is said to depend on the preprocessing stage: the better the image is preprocessed, the better the result of analyzing the image. Furthermore, preprocessing increases the quality and accuracy of OCR systems. In [5], skew detection and correction is described as the first step in document analysis and understanding, as it has a direct effect on the reliability and efficiency of the segmentation and feature extraction stages. Currently, there is a lot of research on Arabic documents, but less work has been done on Mushaf Al-Quran.

A basic method to estimate the skew angle of a page, as in Fig. 2, is to draw a line through the text characters; the angle between the drawn line and the horizontal edges of the original page is the skew angle. Generally, all ordinary pages have a skew angle of zero. However, skew occurs for different reasons. The main purpose of skew detection and correction is to improve the quality of the scanned documents.

www.ijacsa.thesai.org

In [6], O'Gorman categorizes these techniques into three groups: projection profile, Hough transform and nearest neighbor clustering. In [3], an evaluation is given of the most frequently cited skew detection techniques: (i) Projection Profile Analysis (PP), (ii) Hough Transform (HT) and (iii) Nearest Neighbor (NN). The comparison states the weaknesses and strengths of each method and compares their performance in terms of speed and accuracy. The evaluation showed that the nearest neighbor technique is the fastest of the three, but its accuracy is poor compared to the other techniques. The projection profile technique gave the most accurate angle estimate, but it takes the longest time to execute. In [7], the efficiency of two techniques, Principal Component Analysis (PCA) and Hough transform, is discussed for overcoming problems that spoil scanned documents. In [8], a projection profile method is proposed for skew detection in handwritten signatures; it uses horizontal projection to detect the skew angle and corrects it using a rotation transformation.

In [9], a method is proposed for detecting and correcting skew in handwritten Devanagari script using the Hough transform. The method detects and corrects skew at the word level, because Devanagari script, as in Fig. 3, is more difficult than other scripts owing to its writing style, which also differs from one person to another [9]. The method consists of a preprocessing stage, followed by a word extraction stage that extracts the words from the image; lastly, the Hough transform is applied to detect the skew of each word. In [4], a novel skew detection and correction approach for scanned documents is proposed that consists of two stages: first, find the angles of the lines in the image with respect to the x-axis, and second, find the exact skew angle from the angles extracted in the first stage. In [11], a simple and fast algorithm is proposed that determines the skew angle of the image as well as the slant angle of the text characters using the gradient orientation histogram. The angle is obtained by searching for a peak in the histogram, and the image is corrected by a rotation through that angle. In [1], a new technique is proposed that detects and corrects skew in printed Arabic scripts based on connected component analysis and pixel projection. The technique takes advantage of the sharp writing line property of Arabic, which is obtained from the histogram projection of the image, for skew detection.

In [12], image moments are used for skew detection and correction. An image moment is a weighted average of the pixel intensities of the image. Moments are employed to find the primary axis of every object in the document instead of applying the Hough transform. Finally, by using a feature that depends on the size of the object, the weighted average angle is estimated. In [13], a skew detection method that uses run-length and Hough transform algorithms is presented. The method reduces the amount of data in the image by using black horizontal and vertical run-length histograms, which also reduces the computation of the Hough transform and increases the speed of skew detection.

III. MUSHAF AL-QURAN SCRIPTS CHALLENGES
Mushaf Al-Quran is the holy book of millions of Muslims around the world. It exists in two forms, digital and printed. Although it is in Arabic, the way it is written differs from any other Arabic/Jawi-based document because it has diacritics. In [14], a method was proposed for identifying types of Arabic calligraphy in Malay accent script written in Jawi. Fig. 4 illustrates most of the Arabic script challenges, which are presented in [15], [16] as follows:

Fig. 4. Arabic Script Challenges.

1) The connectivity challenge: Arabic text can only be written cursively, which means that all graphemes are connected together, whether the text is handwritten or typeset, as in Fig. 5.

2) The dotting challenge: Dots in Arabic script are used to distinguish between characters that share similar graphemes. Accordingly, if a dot is lost during skew detection, the meaning of the text is affected. Fig. 6 illustrates the dotting challenge.

3) The multiple grapheme cases challenge: Because of the connectivity of Arabic orthography, the same letter can be written differently depending on its position in the word. Fig. 7 illustrates the letter ع with different writing styles.

5) The diacritics challenge: The use of diacritical marks helps to resolve linguistic ambiguity in the text [14]. However, in some cases they are stacked vertically while the main text runs horizontally on the line from right to left. This causes confusion for the skew detection step in OCR. Fig. 9 illustrates a segment of Mushaf Al-Quran text with diacritical marks.

IV. PROPOSED METHOD
The proposed methodology for Mushaf Al-Quran skew detection and correction is described here. The proposed method consists of six stages: conversion to a grayscale image, binarization, foreground image detection, line detection using the Hough transform, skew angle calculation and, finally, image rotation, as in Fig. 10.

A. Grayscale Image
As in Fig. 11, some Mushaf Al-Quran pages come in different colors, so they need to be converted to grayscale images to get high skew detection performance. The reasons for converting color Mushaf Al-Quran images to grayscale images are as follows:
1) Reduce color: in color images, the color information sometimes does not help to identify the important areas of the image or other features such as lines.
2) Grayscale (8-bit) images make it easy to apply binarization algorithms, because each pixel holds a single intensity value between black and white, whereas each pixel of a color image has blue, green and red components.
3) Algorithms applied to grayscale images are much faster than the ones applied to RGB images.
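The grayscale conversion can be sketched in a few lines. This is only an illustration, not the authors' Java/OpenCV code: the function name `to_grayscale`, the nested-list image layout and the ITU-R BT.601 luma weights (0.299, 0.587, 0.114) are assumptions, since the paper does not state which weighting was used.

```python
def to_grayscale(rgb_image):
    """Convert a nested list of (R, G, B) tuples to 8-bit gray levels."""
    gray = []
    for row in rgb_image:
        gray.append([
            # Assumed BT.601 luma weights; the paper does not specify them.
            int(round(0.299 * r + 0.587 * g + 0.114 * b))
            for (r, g, b) in row
        ])
    return gray


# Tiny made-up 2x2 "page": white, black, pure red and pure blue pixels.
page = [[(255, 255, 255), (0, 0, 0)],
        [(255, 0, 0), (0, 0, 255)]]
gray = to_grayscale(page)
```

Each pixel shrinks from three channel values to one, which is why the later stages have less data to process.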

B. Binarization
In [17], Otsu's method was applied to reduce noise and prepare images for a newly proposed feature extraction method. In [10], Otsu's method was also applied to Arabic characters in order to choose the discriminant threshold of the image dynamically. Therefore, Otsu's method is used in this paper too. Once an image is in grayscale form, the next preprocessing step applied to it is binarization. Binary images are images that have only two possible values for each pixel: black and white. In this step, a binary image is created from the original image to help detect only the important areas and parts of the Mushaf Al-Quran images. Fig. 12 illustrates the difference between a normal image and a binary image.

[Fig. 12: RGB Mushaf Al-Quran image; binary Mushaf Al-Quran image]
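A minimal sketch of Otsu's thresholding is given here for illustration. It picks the gray level that maximizes the between-class variance of the histogram. The function names and the convention that dark (ink) pixels become 1 are assumptions made for this sketch, not details taken from the paper.

```python
def otsu_threshold(gray):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]            # number of pixels with level <= t
        sum_bg += t * hist[t]      # sum of their gray levels
        w_fg = total - w_bg
        if w_bg == 0 or w_fg == 0:
            continue               # one class is empty, variance undefined
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold):
    """Assumed convention: dark (ink) pixels -> 1, light (paper) pixels -> 0."""
    return [[1 if v <= threshold else 0 for v in row] for row in gray]


# Made-up gray levels: three dark pixels and three light pixels.
gray = [[20, 30, 220], [25, 210, 230]]
t = otsu_threshold(gray)
mask = binarize(gray, t)
```

Because the threshold is derived from the histogram of each page, pages with different paper tones or ink colors do not need a hand-tuned value.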

C. Foreground Mushaf Al-Quran Image Detection
Once a binary image is created, a morphological operation is applied to detect the areas of the Mushaf Al-Quran image that contain text and to convert the detected text into lines. A closing operation in the x direction is used here. It produces an image containing lines, which are used in the next stage for the angle calculation. Fig. 13 illustrates the difference between the color image and the foreground image.
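The horizontal closing can be sketched as a dilation followed by an erosion with a one-row structuring element. This is an illustrative sketch only: the function name `close_rows` and the half-width parameter `k` are hypothetical, and the paper does not report the kernel size it used.

```python
def close_rows(binary, k):
    """Morphological closing with a 1 x (2k+1) horizontal structuring element."""
    closed = []
    for row in binary:
        n = len(row)
        # Pad with background so the image border does not distort the result.
        padded = [0] * k + list(row) + [0] * k
        m = len(padded)
        # Dilation: a pixel turns on if any pixel within k columns is on.
        dilated = [1 if any(padded[max(0, i - k):i + k + 1]) else 0
                   for i in range(m)]
        # Erosion: a pixel stays on only if all pixels within k columns are on.
        closed.append([1 if all(dilated[i:i + 2 * k + 1]) else 0
                       for i in range(n)])
    return closed


# Two "words" separated by a two-pixel gap merge into one run when k = 1.
merged = close_rows([[0, 1, 1, 0, 0, 1, 1, 0]], 1)
# A four-pixel gap is too wide for k = 1 and is left open.
apart = close_rows([[1, 0, 0, 0, 0, 1]], 1)
```

A wider kernel bridges larger gaps between words, at the risk of merging unrelated marks on the same row.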

D. Angle Calculation
This is the most important stage, where the skew angle is calculated. The text was converted to connected lines in the previous stage, so line detection comes next. The connected words from the previous stage can be considered straight lines, which makes it possible to apply the Hough transform for line detection. This stage is carried out in the following steps:
1) Line detection: Equation (1) is used to detect straight lines in the image. Based on Fig. 14, for each point $(x_0, y_0)$ there is a set of other points, such as $(x_1, y_1)$, with which it can form lines; the lines through a point are defined by equation (1):

$$r = x_0 \cos\theta + y_0 \sin\theta \qquad (1)$$

The two values $(r, \theta)$ in the above formula represent the lines (the connected words in the Mushaf Al-Quran images obtained from the previous stage) that pass through $(x_0, y_0)$. In other words, a line in the image space is represented as a point in the parameter space, as in Fig. 15. Likewise, for the linear Hough transform, a two-dimensional accumulator array is used for detecting lines, in which each line is represented by the two values $(\theta, r)$. Straight lines appear as strong peak points in the accumulator array, as in Fig. 15. Once all peak points are detected, it is easy to find the line segments from the end points of the peak values. The peak with the most intersections corresponds to the longest line, and that is the line required to calculate the skew angle.
2) Angle calculation: the skew angle is calculated using the longest straight line, as the deviation of that line from the horizontal axis.
3) Rotate image: Once the skew angle is detected, the image can be rotated in order to correct the skew. Several methods are used for skew correction (direct method, indirect method, contour oriented projection based and others); here, rotation of the image is done through the affine transformation (2):

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \qquad (2)$$

where $(x, y)$ are the coordinates of the skew-detected line, $(x', y')$ are the coordinates after rotation, and $\theta$ is the angle detected by the Hough transform. The above equation is for counter-clockwise rotation.
Equation (2) rotates the line from the calculated skew angle to the horizontal. The line is rotated by the angle $\theta$: if the detected angle is positive, it is corrected by the negative angle of the same value, and vice versa. Fig. 16 illustrates the last stage of the proposed method, "Rotate image".
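The three steps above can be sketched end to end in plain Python. This is a simplified illustration under stated assumptions, not the authors' Java/OpenCV implementation: it votes with the normal form r = x cos θ + y sin θ at an assumed resolution of one degree and one pixel, takes the single strongest accumulator peak as the longest line, and rotates about the image center with nearest-neighbor sampling. The function names and the synthetic test page are hypothetical.

```python
import math


def estimate_skew_degrees(binary, theta_step=1.0):
    """Vote in (r, theta) space and return the tilt of the strongest line."""
    thetas = [i * theta_step for i in range(int(180 / theta_step))]
    trig = [(math.cos(math.radians(t)), math.sin(math.radians(t)))
            for t in thetas]
    votes = {}
    for y, row in enumerate(binary):
        for x, v in enumerate(row):
            if not v:
                continue
            for i, (c, s) in enumerate(trig):
                r = int(round(x * c + y * s))   # r = x cos(theta) + y sin(theta)
                votes[(i, r)] = votes.get((i, r), 0) + 1
    if not votes:
        return 0.0
    (best_i, _), _ = max(votes.items(), key=lambda kv: kv[1])
    # A horizontal line has its normal at theta = 90 degrees, so the
    # deviation from 90 is the skew (image coordinates, y pointing down).
    return thetas[best_i] - 90.0


def rotate_binary(binary, angle_degrees):
    """Rotate about the image center using nearest-neighbor sampling."""
    h, w = len(binary), len(binary[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    c = math.cos(math.radians(angle_degrees))
    s = math.sin(math.radians(angle_degrees))
    out = [[0] * w for _ in range(h)]
    for yd in range(h):
        for xd in range(w):
            dx, dy = xd - cx, yd - cy
            # Inverse rotation: locate the source pixel of each output pixel.
            xs = int(round(cx + dx * c + dy * s))
            ys = int(round(cy - dx * s + dy * c))
            if 0 <= xs < w and 0 <= ys < h:
                out[yd][xd] = binary[ys][xs]
    return out


# Synthetic "text line" tilted by 10 degrees (made-up test input).
width, height = 200, 80
page = [[0] * width for _ in range(height)]
for x in range(width):
    page[20 + int(round(x * math.tan(math.radians(10))))][x] = 1

skew = estimate_skew_degrees(page)
deskewed = rotate_binary(page, -skew)      # correct by the opposite angle
residual = estimate_skew_degrees(deskewed)
```

Rotating by the negative of the detected angle mirrors the sign rule stated in the text. A production version would typically restrict θ to the expected skew range and interpolate during rotation to avoid gaps in thin strokes.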

V. EXPERIMENTAL RESULTS AND DISCUSSION
In this paper, the proposed method was tested on Mushaf Al-Quran images that have skewed text. In total, 50 Mushaf Al-Quran images were tested with the proposed method. The method was implemented in the Java programming language. The test environment was a PC with an Intel i5-2430M CPU @ 2.40 GHz and 12 GB of memory. OpenCV functions were called from the Java code to detect skewed lines in the image and to measure the skew angle. The Mushaf Al-Quran document images were obtained from the source (https://www.nourelquran.com/quranforall/fahd/index.php) [19], and they were rotated to different skew angles using the ImageJ software. The accuracy of skew correction was about 90% for the tested images. In most cases, Mushaf Al-Quran images that are colorful or have high resolution have lower accuracy in skew correction, in contrast to images with lower resolution, for which the proposed method works perfectly. Therefore, Mushaf Al-Quran images have to be pre-processed before the Hough transform is applied; as in Fig. 11, converting the input image to grayscale helps to improve skew detection. Binarization is also a good way to improve skew detection, as explained in the proposed method section. The proposed method detects and corrects skew angles through six stages, namely conversion to grayscale, binarization, foreground image detection, Hough transform, skew angle calculation and image rotation. Table I shows sample Mushaf Al-Quran images before and after skew correction at different angles.

Fig. 7. Multiple Grapheme.

4) The ligatures challenge: Characters in Arabic script can be combined at certain positions in an Arabic word. Ligatures can be found in almost all Arabic fonts. Fig. 8 illustrates the ligatures challenge.

The Hough transform is one of the most widely used feature extraction techniques in computer vision, image analysis and digital image processing. It was introduced by Paul Hough in 1962.

TABLE I. SAMPLE RESULTS FOR SKEW DETECTION AND CORRECTION USING THE PROPOSED METHOD