The Trend of Segmentation for Arabic Handwritten Touching Characters

This paper surveys current research trends in Arabic handwriting recognition, with a focus on state-of-the-art methods, in order to document the present state of the field and to ease the adaptation and extension of earlier work into new systems and applications. The Arabic alphabet has 28 letters. Depending on its position in the word, most Arabic letters have more than one shape; a single character may have from one to four shapes. Touching and overlapping characters are common in handwritten text. Historical documents hold a large body of knowledge and culture, and many old books still need to be converted into a machine-readable format, which would take a long time if done by hand. The main problem, however, is the lack of research on Arabic handwriting, especially on the segmentation of touching characters. This paper therefore investigates current segmentation techniques, including the state of the art for touching characters in other domains, as a basis for constructing enhanced techniques for Arabic touching characters. It reviews approaches to the recognition and segmentation of Arabic handwritten touching characters and highlights the method used, the strengths, and the drawbacks of each technique. The outcome provides a good foundation for constructing a better technique for segmenting Arabic touching characters, especially in degraded documents.
Keywords: Component; character segmentation; Arabic handwritten; character touching; recognition


I. INTRODUCTION
Arabic is an official language of about 26 nations, with a combined population of roughly 280 million people. It is one of the six official languages of the United Nations (UN): Arabic, Chinese, English, French, Russian, and Spanish. Furthermore, much of its vocabulary and many of its letter forms are used in Persian (Farsi), Jawi, Kurdish, Urdu, and Pashto.
Many people still use pen and paper, for instance to write notes. That practice has a number of drawbacks. Handwritten text is difficult to store and retrieve efficiently, and searching through it or sharing it with others is time-consuming. A great deal of critical knowledge may be lost or underused if the content is not available in electronic form.
Segmentation faces various difficulties. Characters should not be too thin and should be neatly segmented to support the recognition process [1]. An Arabic word is often written as a single connected line, which is the source of much of this intricacy [2].
Because computers are used in almost every aspect of life, the modern era is also known as the information technology era. The computer is a necessary component of human life, although computers have nowhere near as much intelligence as humans. Humans can recognize any sort of text image from old and deteriorated texts in libraries, but computers cannot comprehend these text images directly [3]. Segmentation of offline handwritten touching Arabic characters is a popular research topic; however, it is fraught with difficulties due to differences in writing style and to overlapping and touching letters. Segmentation becomes hard when two characters are connected to each other [4]. Nearly all libraries and national archives throughout the world hold large volumes of historical and deteriorating documents in book form. Special care must be taken to convert these important resources into machine-readable files [5].
The Arabic language comprises 28 letters, each of which has a distinct form. Because letters are joined to create words, these connections affect the appearance of the letters; thus the shape of an isolated character differs from its shape in the middle or at the end of a word [6]. Segmentation is closely connected to recognition, since it is a key phase that splits an image into sub-units such as lines, words, and letters [7]. OCR (Optical Character Recognition) is a technique that converts scanned or other kinds of images into an editable format [8]. Although image segmentation is a separate step from image recognition, the two are inextricably linked: segmentation is a critical foundation for recognition [4]. Image segmentation, a critical process, splits the image into small pieces.
Because handwriting varies from person to person, segmentation, which breaks handwritten text into lines, words, and characters, is still a difficult task. As a result, many researchers have investigated solutions to the problem, and some of them have made notable achievements; however, more research is needed to improve the performance of the systems already developed. Although it is impossible to cover all established approaches in this work, the study concentrates on the difficulties of touching Arabic handwritten letters [9]. This paper aims to present the results and characteristics of each segmentation method to help researchers choose the best technique for their work.
475 | P a g e www.ijacsa.thesai.org
The rest of the paper is arranged as follows. Section II explains the fundamental characteristics of the Arabic language. Section III describes related work. Results and discussion are given in Section IV. Section V presents the conclusion and directions for further study.

II. RELATED WORK
A review of the published literature on the segmentation of touching characters shows that far less research effort has gone into handwritten and printed Arabic characters than into other languages such as Chinese and English.
For printed Arabic text, [13] proposes a segmentation approach based on omni-font, open-vocabulary OCR. The APTID-MF dataset was chosen as the basis for the suggested approach. The method does not need an explicit font type identification stage. It does require careful handling, since image samples produced by conventional image augmentation algorithms might lose important features and may become connected.
In [14], a region-based approach is used to extract diacritics when segmenting Arabic handwritten text. The image is converted to grayscale and binarized, the region-based method is applied, and the diacritics are then extracted from the image. The researchers used the Quran as a dataset and added 10 handwritten Arabic images. The study also addresses diacritics, which are crucial to the syntax and semantics of a word. Although they are part of the letters, the dots and the hamza "ء" are treated as diacritics.
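The grayscale, binarize, and region steps can be illustrated with a minimal sketch. It is not the procedure of [14]: the fixed threshold, the 8-connected labelling, and the `max_ratio` size rule for diacritic candidates are all assumptions chosen for illustration.

```python
from collections import deque

def binarize(gray, threshold=128):
    """Map a grayscale grid (0-255) to 1 for ink (dark) and 0 for background.
    The fixed threshold is an assumption; real scans usually need an adaptive one."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def connected_components(binary):
    """Return the 8-connected ink regions, each as a list of (row, col) pixels."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(region)
    return components

def split_diacritics(components, max_ratio=0.25):
    """Hypothetical rule: regions much smaller than the largest one are
    treated as diacritic candidates, the rest as letter bodies."""
    if not components:
        return [], []
    largest = max(len(region) for region in components)
    bodies = [reg for reg in components if len(reg) > max_ratio * largest]
    marks = [reg for reg in components if len(reg) <= max_ratio * largest]
    return bodies, marks

# Toy image: one dot above a horizontal stroke.
gray = [
    [255, 255, 0, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
    [0, 0, 0, 0, 0, 0],
    [255, 255, 255, 255, 255, 255],
]
bodies, marks = split_diacritics(connected_components(binarize(gray)))
```

A pure size rule like this one would also misclassify small letters, which is one reason region-based methods need additional positional cues.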
Meanwhile, the researchers in [15] identify fork points on the skeletons of handwritten Chinese characters. The primary goal of the study is to increase the segmentation and recognition rates. The method looks for feature points in the binary image: it thins and smooths the character image to locate fork points and endpoints, and then applies corrections to eliminate erroneous branches. The DHCCCRL database is used. Highlights of the work are the rectification of shape distortion and the selection of 6,000 handwritten Chinese character images.
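One common way to locate endpoints and fork points on a one-pixel-wide skeleton is the crossing number, sketched below. This is a standard technique offered as an illustration, not necessarily the procedure used in [15]; the function names and the toy T-shaped skeleton are assumptions.

```python
# 8-neighbour ring in clockwise order, starting from the pixel above.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def crossing_number(skeleton, r, c):
    """Count 0 -> 1 transitions while walking once around the 8-neighbour ring.
    Pixels outside the image are treated as background."""
    rows, cols = len(skeleton), len(skeleton[0])
    ring = []
    for dr, dc in RING:
        y, x = r + dr, c + dc
        ring.append(skeleton[y][x] if 0 <= y < rows and 0 <= x < cols else 0)
    return sum(1 for i in range(8) if ring[i] == 0 and ring[(i + 1) % 8] == 1)

def feature_points(skeleton):
    """Return (endpoints, forks): crossing number 1 marks an endpoint,
    crossing number 3 or more marks a fork point."""
    endpoints, forks = [], []
    for r, row in enumerate(skeleton):
        for c, ink in enumerate(row):
            if not ink:
                continue
            cn = crossing_number(skeleton, r, c)
            if cn == 1:
                endpoints.append((r, c))
            elif cn >= 3:
                forks.append((r, c))
    return endpoints, forks

# Toy skeleton shaped like the letter T.
t_shape = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
endpoints, forks = feature_points(t_shape)
```

Counting transitions rather than raw neighbours avoids flagging the pixels next to a junction as extra forks, which a plain neighbour count would do under 8-connectivity.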
In addition, [16] develops a junction detection algorithm; no database is reported in that study. The study covers only printed uppercase letters and only the touching of two characters, and just one segmentation instance was investigated. The case presented is therefore neither reliable nor practical, since touching in real writing is more difficult. Fig. 1 illustrates the test case: two characters created by the researcher with a straight line placed between them. Characters do not normally touch in this manner in natural handwriting, which differs significantly from the case shown in the figure below.
Moreover, in [17] a junction-based technique was applied to handwritten Devanagari characters, using a combination of features to extract the characters. The handwritten characters are first transformed into bit-mapped binary images, the binary image is scaled, and feature extraction is then performed. The data come from the CVPR Unit, ISI, Kolkata. One benefit of this research is its collection of 4900 handwritten Devanagari characters. The method offers five options. Having several options can be an advantage: if, for example, two characters cannot be segmented effectively with one option, another can be used.
Furthermore, in [18] Inam Ullah used the junction method on Arabic handwritten text. The image is converted to binary and thinned to a thickness of one pixel, making it easier to detect endpoints. Intersection set theory is then applied to determine the junction point and broken characters. The major goal of this research is to use the algorithm to convert an unreadable old handwritten Arabic book into a readable one. One advantage of this study is that identifying the endpoints aids in the discovery of broken characters. The researcher selects the touching point by hand, using one of four datasets: CEDAR, the IFN/ENIT Arabic dataset, AHDB, and Arabic Handwritten 1.0.
In [19], Arabic handwritten text is segmented using contour analysis. The page is first divided into lines, each line is then divided into sub-words, and finally the sub-words are divided into characters. The method uses the IFN/ENIT database. Instead of identifying the baseline or intersection points, the study imitates the way a human writes Arabic text.
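The page, line, sub-word hierarchy can be illustrated with projection profiles, a common baseline technique. It is swapped in here for illustration only; [19] uses contour analysis, not projections. The function names and the toy page are assumptions.

```python
def ink_runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries; end is exclusive."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_lines(page):
    """Split a binary page (1 = ink) into text lines using row sums."""
    return ink_runs([sum(row) for row in page])

def segment_columns(page, top, bottom):
    """Split one text line into candidate sub-words using column sums."""
    cols = len(page[0])
    profile = [sum(page[r][c] for r in range(top, bottom)) for c in range(cols)]
    return ink_runs(profile)

page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
lines = segment_lines(page)                   # rows 1-2 and row 4
first_line = segment_columns(page, *lines[0])  # two ink runs separated by a gap
```

Column runs come out in left-to-right index order, so for Arabic they would be reversed to follow reading order. The sketch also shows why touching and overlapping strokes are hard: wherever two components share a column, no empty gap exists and the profile cannot separate them.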
Likewise, in [9] Inam Ullah segmented touching Arabic handwritten characters using contour tracing. Unnecessary noise is first removed from the binary image. The end points, touching points, and neighbouring points are then identified, and the tracing direction is recorded. Finally, the text is divided into characters. Many databases were considered, including AHDB, IFN/ENIT, Arabic Handwritten 1.0, IBN SINA, IAM, and NIST. With proper segmentation, this study achieved an accuracy of 97.27 percent.
According to [20], corner detection in images is a fundamental computer vision problem.
Berriche (2020) [24] used a seam carving-based technique on the IESK-ArDB and IFN/ENIT datasets and reported a result of 95.67%. A notable remark on this research is that small characters could be considered secondary components.
Finally, in [5] the researcher took one set of 100 words without overlapping and another set of 100 words with overlapping from the benchmark database, applied the method to the handwritten words, and reported results only for the second set. As stated there, the method is simple, straightforward to use, and quick. However, slant correction approaches do not give good results for severely slanted and horizontally overlapping characters, and a few letters, such as u, v, w, m, and n, are over-segmented or missed. In core-zone detection, the researcher advised counting the white pixels until the first major change occurs. But how can one determine whether a change is major? The term is vague and open to argument, whereas a scientific criterion should be stated numerically. The method is straightforward, but the researcher does not report segmentation results before and after core-zone detection, which would show whether that step is essential. The researcher simply stated that the results for the first set of words are excellent, without giving a percentage that would allow the overlapping and non-overlapping sets to be compared and would make the study more dependable.
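The point about the "first major change" can be made concrete with a small sketch in which the criterion is an explicit number. This is a hypothetical formulation, not the rule used in [5]: the per-row ink count, the `min_jump` fraction, and the function names are all assumptions.

```python
def first_jump(counts, limit):
    """Index of the first entry that exceeds its predecessor by at least `limit`."""
    previous = 0
    for i, value in enumerate(counts):
        if value - previous >= limit:
            return i
        previous = value
    return None

def core_zone(binary, min_jump=0.5):
    """Return (top_row, bottom_row) of the core zone, or None if no jump qualifies.

    A change counts as "major" when the per-row ink count rises by at least
    min_jump times the peak row count (an assumed, tunable criterion)."""
    counts = [sum(row) for row in binary]
    peak = max(counts, default=0)
    if peak == 0:
        return None
    limit = min_jump * peak
    top = first_jump(counts, limit)
    from_bottom = first_jump(counts[::-1], limit)
    if top is None or from_bottom is None:
        return None
    return top, len(counts) - 1 - from_bottom

word = [
    [1, 0, 0, 0, 0, 0],  # ascender
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # core zone starts
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],  # core zone ends
    [0, 0, 1, 0, 0, 0],  # descender
    [0, 0, 0, 0, 0, 0],
]
zone = core_zone(word)                  # rows 2 to 4 with the default threshold
strict = core_zone(word, min_jump=0.9)  # no jump is large enough
```

The two calls return different answers for the same image, which is exactly why the threshold has to be reported for results to be reproducible.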

A. Location/Direction in Writing
In both handwritten and machine-printed documents, Arabic text is written from right to left, whereas numerals are written from left to right, as in other languages [10], [11].
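The same distinction is encoded in the Unicode character database, which a short sketch can query through Python's standard `unicodedata` module. The helper name `bidi_classes` is an assumption made for illustration.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in `text`."""
    return [unicodedata.bidirectional(ch) for ch in text]

# U+0628 ARABIC LETTER BEH, U+0661 ARABIC-INDIC DIGIT ONE, ASCII digit, Latin letter.
classes = bidi_classes("\u0628\u0661" + "1a")
```

Arabic letters carry the class `AL` (right-to-left Arabic letter), while Arabic-Indic and European digits carry the number classes `AN` and `EN`, which rendering engines lay out left to right inside right-to-left text.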

B. Shape of Arabic Characters
Because Arabic letters are joined to one another, nearly every character changes its shape according to its position in the word. Fig. 4 depicts Arabic letters that change shape depending on their position in a word, together with examples of how these characters are linked to form words. The figure uses four Arabic letters as an example; the remaining letters of the alphabet behave in much the same way with respect to their shape forms [12].
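The positional shapes can also be inspected programmatically. The sketch below, which assumes only Python's standard `unicodedata` module, scans the Unicode Arabic Presentation Forms-B block and groups the encoded shapes of one base letter by position; the helper name `contextual_forms` is hypothetical.

```python
import unicodedata

def contextual_forms(letter):
    """Map form name (isolated, final, initial, medial) to the presentation-form
    character encoded for one base Arabic letter."""
    forms = {}
    # Arabic Presentation Forms-B occupies U+FE70..U+FEFF.
    for code in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(code)).split()
        # Single-letter forms decompose as, for example, "<initial> 0628";
        # ligatures and unassigned code points have a different length.
        if len(parts) == 2 and int(parts[1], 16) == ord(letter):
            forms[parts[0].strip("<>")] = chr(code)
    return forms

beh = contextual_forms("\u0628")   # ARABIC LETTER BEH: four shapes
alef = contextual_forms("\u0627")  # ARABIC LETTER ALEF: two shapes
```

Letters such as beh join on both sides and therefore have four shapes, whereas letters such as alef never join to the following letter and have only an isolated and a final shape.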

IV. CHALLENGES AND LIMITATIONS
Researchers in this field face several obstacles, and there is a demand for new methods as computer technology improves and resource constraints diminish [26].
Based on Ouwayed and Belaid's study [23], Kang [22], Aouadi [21], and Saber et al. developed methods for segmenting touching Arabic letters in the same word, in different words on the same line, or on different lines. These existing approaches are template-based segmentation techniques in which a glossary file is created for all potential touching shapes. This is time-consuming, owing to the variation in Arabic writing and the similarities among Arabic characters, and it still fails to solve the problem of touching Arabic handwritten characters. Moreover, because these approaches used self-defined criteria to govern segmentation accuracy, the segmentation of touching character images suffered as a result.
Over-segmentation or under-segmentation arises from the datasets used, the language (Arabic poses more problems than other languages), the type of data (printed or handwritten), and the segmentation technique proposed.
According to [25], the challenges include: datasets of Arabic handwritten characters, preprocessing noise, cutting-edge techniques, documents of low resolution and quality, segmentation, and systems that operate in real time.
Segmentation is considered the main factor. Some earlier efforts relied on manual segmentation of the dataset, while others relied on pre-segmented databases; some of the available datasets are not segmented at all. It is crucial to find a scalable approach that automatically divides documents into lines and subsequently into words (or characters), particularly for large and ancient datasets. Another difficulty in segmentation is dealing with ligatures and the large number of Arabic characters.
Multiple sub-words can also affect the segmentation process. Some words consist of a single sub-word, such as "ﷴ", whereas others contain up to five sub-words, for example "أوروﺑﺎ", which makes it harder to recognize them as one word during segmentation.
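The sub-word count of a word follows from which letters do not join to the letter after them. The sketch below counts sub-words in a Unicode string under that rule. The letter sets are an assumed, non-exhaustive list written for illustration, and the function name is hypothetical.

```python
import unicodedata

# Letters that never join to the following letter (assumed, non-exhaustive):
# alef and its hamza/madda variants, dal, thal, reh, zain, waw, waw-hamza, teh marbuta.
RIGHT_JOINING = set("\u0627\u0623\u0625\u0622\u062f\u0630\u0631\u0632\u0648\u0624\u0629")
# The stand-alone hamza joins on neither side.
NON_JOINING = {"\u0621"}

def count_subwords(word):
    """Count connected sub-words in one Arabic word (no spaces expected)."""
    # NFKC folds presentation forms and ligatures back to base letters.
    word = unicodedata.normalize("NFKC", word)
    count, open_subword = 0, False
    for ch in word:
        if unicodedata.category(ch) == "Mn":  # skip vowel marks (harakat)
            continue
        if ch in NON_JOINING:
            count += 1            # stands alone and closes any open sub-word
            open_subword = False
            continue
        if not open_subword:
            count += 1
            open_subword = True
        if ch in RIGHT_JOINING:
            open_subword = False  # the next letter starts a new sub-word
    return count

europe = count_subwords("\u0623\u0648\u0631\u0648\u0628\u0627")  # alef-hamza, waw, reh, waw, beh, alef
muhammad = count_subwords("\u0645\u062d\u0645\u062f")            # meem, hah, meem, dal
```

In handwriting the gaps between sub-words of one word can be as wide as the gaps between words, so a count like this is only the linguistic expectation against which an image-based segmenter can be checked.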

V. RESULT AND DISCUSSION
The aim of this review was to find the most successful approach for Arabic handwritten touching character segmentation. However, many factors need to be considered, such as paper quality, the number of touching characters tested, the database selected, the methods used, the algorithm applied, and the time taken to segment a character. It is therefore difficult to give definite results, especially when some of these factors are not reported in a study. Nevertheless, the author has reviewed ten approaches. Table I compares the methods, giving the database selected and the result for each. The author found a serious need for a dedicated database, which would make future research more reliable and its results comparable across studies. The author also found that the junction algorithm developed by Inam Ullah yields the highest segmentation accuracy while being a simple process of three main steps: binarization; thinning, which allows the boundary of the character to be traced, where a point with more than two neighbouring ink points indicates a junction point to be segmented; and segmentation. However, the method has limitations to be addressed in future work: during thinning, some elements may be lost or counted as secondary objects, and a letter may be wrongly flagged because of its tail. Furthermore, the method cannot segment more than one junction point at a time.
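The final segmentation step can be sketched as follows, assuming a thinned binary skeleton and one junction point that has already been located. This is an illustrative reading, not the implementation in the reviewed work: clearing the 3x3 neighbourhood around the junction is an assumption made so that the strokes separate under 8-connectivity, and the function name is hypothetical.

```python
def split_at_junction(skeleton, junction):
    """Clear the 3x3 neighbourhood of one junction point and return the
    remaining 8-connected strokes, each as a sorted list of (row, col)."""
    jr, jc = junction
    ink = {
        (r, c)
        for r, row in enumerate(skeleton)
        for c, value in enumerate(row)
        if value and not (abs(r - jr) <= 1 and abs(c - jc) <= 1)
    }
    strokes = []
    while ink:
        stack = [ink.pop()]
        stroke = list(stack)
        while stack:
            r, c = stack.pop()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbour = (r + dr, c + dc)
                    if neighbour in ink:
                        ink.remove(neighbour)
                        stroke.append(neighbour)
                        stack.append(neighbour)
        strokes.append(sorted(stroke))
    return sorted(strokes)

# Toy skeleton: a horizontal stroke touched from below by a vertical stroke.
touching = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
]
strokes = split_at_junction(touching, (0, 3))
```

Like the method discussed above, this sketch handles a single junction per call; several touching points would require repeated calls and a rule for deciding which resulting strokes belong to the same character.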

VI. CONCLUSION
This study reviewed current research trends in Arabic handwriting research and emphasized the present state of several research elements in that field. This can encourage the adaptation and extension of existing systems to new applications and make it easier. The field is vast and largely unexplored, yet little research has been done in it so far.
We presented prior work that is close to contemporary state-of-the-art methods, with fewer errors and a high degree of abstraction. Together with the challenges section, this review is meant to give recommendations for future advances in the field.
Because of the quality of scanning, touching handwritten characters are present in old manuscripts. By exploring the literature for this review, the author found that touching characters occur widely in English, Chinese, Devanagari, numeral, and Arabic handwritten historical materials. This paper surveys several approaches to help researchers in this field identify their advantages and disadvantages. For future research, the author encourages the development of a database of touching characters in Arabic, with more attention given to multiple overlapping.

ACKNOWLEDGMENT
The author would like to thank his brothers Abdulrahman Algaradi and Abdullah Algaradi for sponsoring him during this research.