A Novel Rule-Based Root Extraction Algorithm for Arabic Language

Non-vocalized Arabic words are ambiguous because the same written form can carry different meanings; consequently, such a word may have more than one root. Many Arabic root extraction algorithms have been proposed to extract the roots of non-vocalized Arabic words. However, most of them return only one root and achieve lower accuracy than originally reported when they are tested on different datasets. Arabic root extraction is urgently needed by applications such as information retrieval systems, indexing, text mining, text classification, data compression, spell checking, text summarization, question answering systems, and machine translation. In this work, a new rule-based Arabic root extraction algorithm is developed to overcome the limitations of previous work. The proposed algorithm is compared with the algorithm of Khoja, a well-known Arabic root extraction algorithm with high reported accuracy. Testing was conducted on the corpus of Thalji, which was built mainly to test and compare Arabic root extraction algorithms. It contains 720,000 word-root pairs derived from 12,000 roots, 430 prefixes, 320 suffixes, and 4,320 patterns. The experimental results show that the algorithm of Khoja achieved 63% accuracy, whereas the proposed algorithm achieved 94%.
Keywords: Root; stem; rules; affix; pattern; corpus

Prefixes are attached at the beginning of words, suffixes are attached at the end, and infixes are found in the middle [4]. For example, in the word "كبيوتكم", which means "like your houses" in English, "ك" is the prefix (an attached preposition), "كم" is the suffix (an attached pronoun), and "و" is the infix, so the root is "بيت". In English, by contrast, the preposition and the pronoun are written as separate words: the word "houses" has no prefix and no infix, "s" is the suffix, and the root is "house".
In Arabic, words are made from roots and patterns. Patterns are non-consonant letters groupings which can be interceded on as templates [5]. Patterns can be added to the root of the word or can be found within the roots of the word following well-defined models [6]. Many words have the same pattern. The root of any words can be easily extracted if the word and the pattern are known. For example, if the words ‫ٚاعزجذٌزٗ,‬ ‫ٚاعزغٙشرٗ‬ ‫ٚاعزغجشرٗ,‬ have the pattern of ‫,ٚاعزفعٍزٗ‬ therefore the roots will be ‫عٙش‬ ‫عجش,‬ ‫ثذي,‬ respectively.
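The idea that a root can be read off once the word and its pattern are known can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `root_from_pattern` is hypothetical, the Arabic strings are our reading of the example in the text, and the sketch assumes that the first "ف" in the pattern is the first root slot and that every "ل" after "ع" is a root slot, which will not hold for every pattern.

```python
def root_from_pattern(word: str, pattern: str) -> str:
    """Read the root letters of `word` from the slots of an equal-length `pattern`.

    Assumption (not from the paper): the first "ف" is the first root slot,
    the first "ع" after it is the second, and every later "ل" is a root slot.
    """
    if len(word) != len(pattern):
        raise ValueError("word and pattern must have the same length")
    fa = pattern.index("ف")            # position of the first root slot
    ain = pattern.index("ع", fa + 1)   # position of the second root slot
    lams = [i for i in range(ain + 1, len(pattern)) if pattern[i] == "ل"]
    return "".join(word[i] for i in [fa, ain] + lams)


# Word and pattern as read from the example in the text.
print(root_from_pattern("واستبدلته", "واستفعلته"))
```

The positional alignment is the whole technique here; a fuller treatment would need to distinguish a slot "ل" from a literal "ل" that belongs to an affix in the pattern.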
Following a thorough investigation of existing algorithms, this work proposes a new rule-based Arabic Root Extraction Algorithm (AREA). The algorithm extensively enhances previous work to overcome its limitations, so that it can be used effectively in both IR and NLP applications.
www.ijacsa.thesai.org
This paper is organized as follows. Section II discusses previous studies and their drawbacks. Section III describes the proposed methodology, including the details of each process. Section IV explains the experimental implementation of the algorithm and the evaluation process. Section V summarizes the main points of the paper and gives some future directions.

II. PREVIOUS STUDIES
Arabic root extraction algorithms (AREA) can be categorized into three approaches: database search, statistical (weight-based), and rule-based [7].

A. Database Approach
The database search approach is the simplest strategy: it looks up the root of the word in a lookup table. The database also includes a list of patterns that match different Arabic words and can help identify different roots.
The best-known works using this approach are the algorithm of Al-Fedaghi and Al-Anzi [8] and the algorithm of Al-Shalabi [9]. These works propose algorithms that generate the root and pattern of a given Arabic word. The main problem with this approach arises when no pattern or root in the database matches the word. A further limitation is the need to update the database constantly. In addition, the algorithm may detect more than one pattern for certain words.

B. A Weight-Based Approach
With this approach, the algorithm assigns different weights to letters in the word, and then, using mathematical calculations to find the root. Al-Serhan, Al Shalabi and Kannan algorithm [10] is an example of this approach. The main problem of this algorithm is it gives the same priority for the extra letters as the original letters. For example, it gives the same priority to ‫د(‬ ‫غ,‬ ‫ة,‬ ‫ن,‬ ‫)ف,‬ with ( ‫ػ,ص‬ , ‫ص‬ ‫س,‬ ‫,ر,‬ ‫,ؿ‬ ‫ط,ػ‬ ‫س,‬ ‫,),ض‬ although these letters sometimes are not the original root letters. For example, if a word contains the letters ‫ة(‬ ‫ف,‬ ) as a prefix or the letter ‫ن(‬ ) as a suffix, the algorithm fails to identify the root. This happens when it gives the letters' root less priority than other letters in the word. For example, if the letters' roots in ( ‫,أ‬ َ, ْ ‫)ئ,‬ and the extra letters are in ( ‫ط,ي‬ ,ٖ).

C. A Rule-Based Approach
Most of the AREA in the literature today are rule-based. In rule-based algorithms, a set of rules is built to derive the Arabic root from the original word. In most cases, this approach also uses a database of patterns and affixes. These algorithms are affected by the order in which the rules are arranged as well as by the number of rules. Such algorithms also involve a pre-processing step to find a possible root.
The Khoja and Garside algorithm [11] is the most popular rule-based Arabic root extraction algorithm. Khoja and Garside reported 96% accuracy for their algorithm on newspaper text.
Al-Shalabi [12] presents a rule-based algorithm that extracts the trilateral roots of Arabic words. The algorithm was tested on a corpus of 72 abstracts (10,582 words) from the Saudi Arabian National Computer Conference and achieved 92% accuracy.
Another work, the algorithm of Al-Kabi and Al-Mustafa [13], is based on affix removal. They tested their algorithm on a small data set of 1,827 words. The system was unable to analyse 55 words because their patterns are unknown; this failure is mostly due to foreign (Arabized) words. The system was able to analyse the remaining 1,772 words, with a stated accuracy of 91% in extracting the correct roots.
The algorithm of Sonbol, Ghneim and Desouki [14] is another rule-based root extraction algorithm, whose principal idea is to encode Arabic letters with a new code that preserves morphologically useful information and simplifies capturing it for retrieving the root. They conducted their experiments on two different corpora. The first consists of a list of 167,162 word-root pairs. The second is a collection of 585 Arabic articles from different categories (policy, economy, culture, science and technology, and sport), totalling 377,793 words. Overall, the algorithm achieves about 96% to 98% accuracy.
Ghwanmeh, Al-Shalabi, Kanaan, Khanfar and Rabab'ah [15] proposed a rule-based algorithm to find trilateral Arabic roots. According to Ghwanmeh et al., their algorithm is unable to analyse only words that are foreign, irregular, or lack trilateral roots. A corpus of 242 abstracts from the Proceedings of the Saudi Arabian National Computer Conferences, in machine-readable form, was used for testing. The set of abstracts was chosen randomly from the corpus for analysis. The results showed that the algorithm extracts the correct roots with an accuracy of up to 95%.

III. METHOD
This section describes the methodology of the new Arabic root extraction algorithm. The presented algorithm finds all possible roots for each word. The root is the base form of the word, which carries its main meaning.

A. Normalization
Normalization is the process that removes unwanted letters, punctuation, and non-letters. The normalization steps include the following:
- Duplicating any letter that carries the Shaddah symbol "ﱠ".
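A minimal sketch of such a normalization step is shown below. Only the Shaddah duplication and the removal of punctuation and non-letters come from the text; dropping the remaining diacritics and the tatweel character is an assumption made here, and the name `normalize` is hypothetical.

```python
import re

SHADDAH = "\u0651"
# Assumption: the other short-vowel, tanween and sukun marks are simply dropped.
OTHER_MARKS = re.compile("[\u064B-\u0650\u0652]")
# Anything outside the basic Arabic letter ranges (this also removes tatweel).
NON_LETTER = re.compile("[^\u0621-\u063A\u0641-\u064A]")


def normalize(word: str) -> str:
    # Drop the other marks first so that Shaddah directly follows its letter.
    word = OTHER_MARKS.sub("", word)
    # Duplicate the letter that carries the Shaddah.
    word = re.sub("(.)" + SHADDAH, r"\1\1", word)
    # Remove punctuation, digits, and any other non-letter characters.
    return NON_LETTER.sub("", word)


# Dal carrying Shaddah and Fatha becomes a doubled Dal.
print(normalize("\u0645\u062F\u0651\u064E"))
```

The input is written with Unicode escapes so that the combining marks are unambiguous; a real pipeline would read them directly from the text.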

B. Extracting the Constant Letters from the Word
The proposed algorithm finds all possible roots of a word without removing prefixes and suffixes. It starts by extracting the constant letters of the word by applying the rules in Table I. This starting point differs from most previous algorithms, which begin by stripping prefixes and suffixes from the derived word. Stripping affixes can remove letters that belong to the root, which leads to wrong results. Most previous algorithms strip affixes on the basis of expectation: they cannot be sure whether a candidate prefix or suffix is really an affix. For instance, consider the word "استماع". Most previous algorithms remove the prefix "است" from the word because "است" is found in their prefix lists, so they remove it directly, even though part of it may belong to the root.
Next, we categorize the Arabic letters into groups as the work of Sonbol's Arabic root extraction algorithm. In Arabic, letters are categorized into two main groups; Constant and Nonconstant letters. Constant letters are: If these letters appear in the derivation word, it also should appear in its root. For instance, the word ‫"اٌغبؽذْٚ"‬ has ‫د"‬ ‫ؽـ,‬ ‫"عـ,‬ constant letters. These constant letters must be part of the root. Therefore, constant letters are not being considered as affix letters. } and an extra letter ‫."ح"‬ We face many urgent issues that need more understanding than constant letters' work because constant letters may appear in the derivation words, but not appear in their root. Find out the constant letters in the word. If the number of constant letters is more than one letter, then they will be considered as one of the expected roots.
The input word ‫,"اٌزمبس٠ش"‬ the constant letters are ‫,ق{‬ ‫,س‬ ‫,}س‬ then ‫}لشس{‬ is one of the possible roots for the word ‫."اٌزمبس٠ش"‬ 2 Check Ebdal rules to minimize the constant letters.
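The constant-letter step can be sketched as below. The list of constant letters did not survive in the text above, so the set `NON_CONSTANT` is an assumption assembled from the letters that the later rules treat as non-constant (prefix letters, the suffix letter, prefix-suffix letters, vowels, hamza forms, and the extra letter), plus "ت", which the worked example leaves out of the constants. The function names are hypothetical and the Arabic word is our reading of the example.

```python
# Assumed inventory of non-constant letters; every other letter counts as constant.
NON_CONSTANT = set("لسفبهكنماويتأإآءؤئىة")


def constant_letters(word):
    """Return (position, letter) pairs for the constant letters of `word`."""
    return [(i, ch) for i, ch in enumerate(word) if ch not in NON_CONSTANT]


def candidate_root(word):
    """Return the constant letters as a candidate root if there is more than one."""
    letters = [ch for _, ch in constant_letters(word)]
    return "".join(letters) if len(letters) > 1 else None


print(constant_letters("التقارير"))
print(candidate_root("التقارير"))
```

Keeping the positions alongside the letters is a design choice for this sketch: the positional rules of the next subsection need to know where each constant letter sits in the word.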

C. Converting the Non-Constant Letters to the Constant Letters
The non-constant letters in a derived word are original root letters in some cases and additions to the root in others, depending on their position. In this section, a set of rules is applied to each letter in the non-constant group in order to convert it into a constant letter where the rules allow.
1) The prefix letters ‫,ي{‬ ‫,ط‬ ‫,ف‬ ‫:}ة‬ The Prefix letters ‫,ي{‬ ‫,ط‬ ‫,ف‬ ‫}ة‬ are one of the non-constants' letters. They are attached at beginning of the words. A certain set of rules has been implemented on each letter on the prefix letters' list to convert these letters from non-constant letters to a constant letter.

a) Prefix letter ‫ي‬
Initially, the letter ‫}ي{‬ is a non-constant letter. It can be converted to a constant letter by applying the following rules: Rule1: If the letter ‫ي"‬ " exists after the first constant letter, then the letter ‫"ي"‬ is treated as a constant letter. For example, with the word ‫,"اعزمً"‬ the letters ‫ق{‬ ‫}ع,‬ have been identified constant letters. And the letter ‫"ي"‬ exists after the first constant letter. So, the letter ‫"ي"‬ is treated as a constant letter. Then the constants' letters list becomes ‫ع{‬ ‫ق,‬ ‫.}ي,‬ Rule2: Check the position of the letter ‫"ي"‬ in the word. If the letter ‫"ي"‬ exists in the second half of the word, then it is treated as a constant letter. For example, consider the word ‫."اعزٍُ"‬ The letter ‫"ي"‬ is positioned in the second half of the word. Thus, in this case, it is considered a constant letter.

Rule3:
If the letter ‫"ي"‬ is preceded by the letters ‫,"اي"‬ it is treated as a constant letter. As it is in the word" ‫"اٌٍ١ً‬ Rule4: The letter ‫"ي"‬ is treated as a constant letter if it has been preceded by one of these letters ‫ن"‬ ‫ط,‬ ,ٞ ,َ ,ٖ ‫د,‬ ,ْ".
As it is in the following words" ‫رٍّ‬ ‫ظ‬ ‫وٍّظ,‬ ‫٠ٍّظ,‬ ‫ٍّٔظ,‬ , ‫عٍٛن‬ ‫ِالن,‬ ‫."٘الن,‬ b) Prefix letter ‫"ط"‬ Initially, the letter ‫"ط"‬ is a non-constant letter. It can be converted to a constant letter by applying the following rules: Rule1: If the letter ‫"ط"‬ exists after the first constant letter, the letter ‫"ط"‬ is treated as a constant letter. For example, with the word ‫,"أعٕبط"‬ the letter ‫"ط"‬ has been identified a constant letter. Letter ‫"ط"‬ exists after the first constant letter. So, ‫"ط"‬ is treated as a constant letter. The constants letters list becomes ‫ط"‬ ‫."ط,‬ Rule2: If the letter ‫"ط"‬ is preceded by the letters ‫,"اي"‬ it is treated as a constant letter. As it is in the word" ‫."اٌغجبع‬ Rule3: The letter ‫"ط"‬ is treated as a constant letter if it has been preceded by one of the letters ‫٘ـ"‬ ‫ن,‬ ‫ط,‬ ‫ة,‬ ‫."ي,‬ As it is in the words" ‫ثغّبع‬ ‫",ٌغّبع,‬ Rule4: The letter ‫"ط"‬ is treated as a constant letter if it hasn't been followed by one of the letters ‫د"‬ ,ٞ ,ْ ‫ا,‬ ‫."أ,‬ As it is in the words ‫عالٌُ"‬ ‫."عىبْ,‬ Rule5: Check the position of the letter ‫"ط"‬ in the word. If the letter ‫"ط"‬ exists in the second half of the word, it is treated as a constant letter. For example, with the word ‫,"ِ١إٚط"‬ the letter ‫"ط"‬ is positioned in the second half of the word. Thus, in this case, it is considered a constant letter.
Rule6: When the letter ‫"ط"‬ exists in the prefix part of the word, it is not possible to decide if the letter ‫"ط"‬ is a constant letter or not. For instance, the word" ‫."اعزّبع‬ c) Prefix letter ‫ف‬ www.ijacsa.thesai.org Initially, the letter ‫}ف{‬ is a non-constant letter. It can be converted to a constant letter by applying the following rules: Rule1: If the letter" ‫"ف‬ exists after the first constant letter, the letter ‫ف"‬ " is treated as a constant letter. For example, with the words ‫,"أعؾف"‬ the letters ‫ػ"‬ ‫"ط,‬ have been identified constant letters. The letter ‫"ف"‬ exists after the first constant letter. Hence, the letter ‫"ف"‬ is treated as a constant letter. The constants' letters list becomes ‫ػ"‬ ‫ػ,‬ ‫."ط,‬ Rule2: Check the position of the letter ‫"ف"‬ in the word. If the letter ‫"ف"‬ exists in the second half of the word, it is treated as a constant letter. For example, consider the word" ‫."اعزٍف‬ The letter ‫"ف"‬ position in the second half of the word. Thus, in this case, it is considered a constant letter.
Rule3: If the letter ‫"ف"‬ is preceded by the letters ‫,"اي"‬ it is treated as a constant letter. As it is in the word" ‫."اٌفْٕٛ‬ Rule4: The letter ‫"ف"‬ is treated as a constant letter if it has been preceded by one of these letters "ٞ ‫٘ـ,‬ ,َ ‫ط,‬ ,ْ ‫."د,‬ As it is in the following words"‫٠فٍظ‬ ‫ِفٍظ,‬ ‫عف١ٗ,‬ ‫٘فٛف,‬ ‫ٔفٍظ,‬ ‫",رفٍظ,‬

d) Prefix letter ‫ة‬
Initially, the letter ‫}ة{‬ is a non-constant letter. It can be converted to a constant letter by applying the following rules: Rule1: -If the letter ‫"ة"‬ exists after the first constant letter, the letter ‫"ة"‬ is treated a constant letter. For example, with the word ‫,"صجبػ"‬ the letters ‫ػ{‬ ‫}ص,‬ have been identified constant letters. The letter ‫"ة"‬ exists after the first constant letter. So, the letter ‫"ة"‬ is treated as a constant letter. The constants letters' list becomes ‫ػ"‬ ‫ة,‬ ‫."ص,‬ Rule2: Check the position of the letter ‫"ة"‬ in the word. If the letter ‫"ة"‬ exists in the second half of the word, it is treated as a constant letter. For example, in the word ‫,"عبٌت"‬ the letter ‫"ة"‬ positioned in the second half of the word. Thus, in this case, it is considered a constant letter.

Rule3:
If the letter ‫"ة"‬ is preceded by the letters"‫,"اي‬ it is treated as a constant letter. As it is in the following word" ‫."اٌجبعً‬ Rule4: If the letter ‫"ة"‬ location is more than two in the word, it is treated as a constant letter. As it is in the word ‫."ا٢ثذ٠ٓ"‬ Rule5: The letter ‫"ة"‬ is treated as a constant letter if it has been preceded by one of these letters "ٞ ‫٘ـ,‬ ‫ط,‬ ,َ ,ْ ‫د,‬ ‫ا,‬ ‫أ,‬ ‫."ة,‬ As it is in the following words" ‫٠جزٍع,‬ ‫٘جٛة,‬ ‫عجبق,‬ ‫ثجعط,‬ ‫اثبْ,‬ ‫أثبسوزُ,‬ ‫."ٔجذأ‬ Rule6: When the letter ‫"ة"‬ exists in the prefix part of the word, It is not possible to decide if the letter ‫"ة"‬ is a constant letter or not. Such as the word" ‫."ثبعً‬
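Three of the rules above recur for all four prefix letters: the letter appears after the first constant letter, it lies in the second half of the word, or it is preceded by the definite article. The sketch below covers only these shared rules, under stated assumptions: the prefix letters are written as we read them (lam, seen, fa, ba), "second half" is taken to mean an index of at least half the word length, and the name `is_promoted` is hypothetical. The letter-specific "preceded by" and "followed by" lists are not modelled.

```python
PREFIX_LETTERS = set("لسفب")  # lam, seen, fa, ba, as read from the text


def is_promoted(word, i, constant_positions):
    """Decide whether the prefix letter at index `i` is treated as constant.

    `constant_positions` holds the indices of letters already known to be
    constant. Only the three rules shared by all four letters are applied.
    """
    if word[i] not in PREFIX_LETTERS:
        raise ValueError("letter at index i is not a prefix letter")
    # Shared rule: the letter appears after the first constant letter.
    after_first_constant = bool(constant_positions) and i > min(constant_positions)
    # Shared rule: the letter lies in the second half of the word (assumed cut-off).
    in_second_half = i >= len(word) / 2
    # Shared rule: the letter is directly preceded by the definite article.
    after_article = i >= 2 and word[i - 2:i] == "ال"
    return after_first_constant or in_second_half or after_article


print(is_promoted("استقل", 4, [3]))
print(is_promoted("الليل", 1, []))
```

A letter that none of the rules promotes stays undecided, which matches the ambiguous case described for a letter in the prefix part of the word.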

2) Suffix letter" ‫"٘ـ‬
: Suffix letter is one of the non-constant letters and attached at the end of the words. A certain set of rules has been implemented to convert this letter from nonconstant letter to a constant letter. In this algorithm ‫"٘ـ"‬ is the only suffix letter.
The letter ‫"٘ـ"‬ is treated as a non-constant letter if the letter ‫"٘ـ"‬ exists in the suffix part of the word. The letter ‫"٘ـ"‬ is treated as an original root letter if it exists in places rather than the suffix part of the word. Initially, the letter ‫"٘ـ"‬ is a nonconstant letter. It can be converted into a constant letter by applying the following rules: Rule 1: If the letter ‫"٘ـ"‬ exists before the last constant letter, the letter ‫"٘ـ"‬ is treated as a constant letter. For example, in the word ‫,"اعزٙذ"‬ the letters ‫د"‬ ‫"ط,‬ have been identified as a constant letter. The letter ‫"٘ـ"‬ exists before the last constant letter. So, ‫"٘ـ"‬ is treated as a constant letter. The constants letters list becomes ‫د"‬ ‫٘ـ,‬ ‫."ط,‬ Rule 2: Check the position of the letter ‫"٘ـ"‬ in the word. If the letter ‫"٘ـ"‬ exists in the first half of the word, it is treated as a constant letter. For example, consider the word" ‫."رٙبِخ‬ The letter ‫"٘ـ"‬ position is in the first half of the word. So, in this case, it is considered a constant letter.

Rule 3:
The letter ‫"٘ـ"‬ is considered as a constant letter if the letters ‫"ٚا"‬ exist at the end of the word and the letter ‫"٘ـ"‬ appears just before the letters ‫"ٚا"‬ , such as " ‫أِب٘ٛا,‬ ‫أزجٙٛا,‬ ‫."رال٘ٛا‬ Rule 4: The letter ‫"٘ـ"‬ is treated as a constant letter if it has been preceded by one of the letters ‫ح"‬ ‫٘ـ,‬ ‫ي,‬ ‫ن,‬ ‫ف,‬ ‫ط,‬ ‫,"ة,‬ such as" ‫رأً٘,‬ ‫اعٍٙٗ,‬ ‫فمب٘خ,‬ ‫أشجٙه,‬ ‫اٌزٍٙف,‬ ‫اٌذ٘ظ,‬ ‫أعٙت‬ ," 3) The prefix-suffix letters ‫,ك"‬ ‫ن‬ ‫:"م,‬ The Prefix-Suffix letters ‫,ن{‬ ْ ,َ} are non-constant letters; a certain set of rules has been implemented on each letter on the Prefix-Suffix letters' list in order to convert these letters from non-constant letters to constant letters. The Prefix-Suffix letters are treated as constant letters to the root if these letters exist in the Prefix part or the suffix part or of the word. In contrast, they are treated as original root letters if they exist in the places rather than the prefix part or the suffix part of the word.
a) The Prefix-Suffix letter " ‫"ن‬ Initially, the letter ‫"ن"‬ is a non-constant letter; it can be converted to a constant letter by applying the following rules:

Rule1:
The letter " ‫"ن‬ is treated as an original root letter if it exists between constant letters. In the word ‫,"اٌشىش"‬ the letters ‫س"‬ ‫"ػ,‬ are identified as a root letter. Then the letter ‫"ن"‬ exists between the two constants letters, ‫"ن"‬ letter is treated as a constant letter also. Thus, the constant letters list is ‫س"‬ ‫ن,‬ ‫."ػ,‬ Rule2: The letter ‫"ن"‬ is considered as a constant letter if it appears in the first half of the word and not following the " ‫ف,‬ ٚ" letters, such as " ‫ِىبعش‬ ‫اٌىٍّبد,‬ ‫."أوالئٙب,‬ Rule3: The letter ‫"ن"‬ is considered as a constant letter if it appears in the second half of the word and before the last constant letter, such as ‫إٌىبػ"‬ ‫."إٌّىش,‬ Rule4: The letter ‫"ن"‬ is considered as a constant letter if it appears in the second half of the word and has been followed by ", ‫اد‬ ‫د,‬ ,ٓ٠" letters, such as ‫رجبوذ"‬ ‫اٌّغبو١ٓ,‬ ‫."اٌّإرفىبد,‬ Rule5: When the letter ‫"ن"‬ exists in the prefix or suffix part of the word, it is not possible to decide if ‫"ن"‬ is a constant letter or not. The word is ambiguous, such as ‫وض١ت"‬ ‫"ششان,‬ b) The Prefix-Suffix letter " َ " Rule5: The letter "ٚ" is considered as a constant letter if it appears in the word and has been preceded by ‫"اي"‬ letter, such as ‫اٌٌٛ١ّٗ"‬ ‫."اٌٛال٠بد,‬ 5) The extra letter ‫:"ة"‬ The extra letter ‫"ح"‬ is not from the root's letter. Therefore, we remove this letter from the word.

D. Extracting All Possible Patterns for the Word
In the previous step, finding all constant letters narrows down the possible root letters. A problem arises when the algorithm does not find three or more constant letters: in that case, it tries each letter of the word as a candidate to complete the missing letters of the root.

1) Extracting all possible patterns when constant letters are three or more
Most Arabic words are derived from trilateral roots; quadriliteral roots are very few by comparison. Most studies of Arabic root extraction either rely on a dictionary of Arabic roots or use a set of rules to identify the verb patterns of Arabic words, with the rules selected according to the number of letters in the word. In this section, we explain how to extract all possible patterns when there are three or more constant letters. One possible verbal pattern exists if the word contains three or more constant letters. The steps are summarized in Table II: if there are more than two constant letters, replace the first constant letter with "ف", replace the second with "ع", and replace each remaining constant letter with "ل".
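The replacement step just described can be sketched as follows. The function name `derive_pattern` is hypothetical, the positions of the constant letters are assumed to be already known, and the two words are illustrative trilateral and quadriliteral cases of the shape discussed in this subsection.

```python
def derive_pattern(word, constant_positions):
    """Replace the constant letters of `word` with the pattern letters.

    The first constant letter becomes "ف", the second "ع", and every
    remaining constant letter "ل"; all other letters are kept as they are.
    """
    letters = list(word)
    for k, pos in enumerate(sorted(constant_positions)):
        letters[pos] = "ف" if k == 0 else "ع" if k == 1 else "ل"
    return "".join(letters)


# Three constant letters at positions 2, 4, 5.
print(derive_pattern("الحاشدون", [2, 4, 5]))
# Four constant letters at positions 4, 5, 6, 7.
print(derive_pattern("المتدحرجات", [4, 5, 6, 7]))
```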
‫اٌفبعٍْٛ‬ www.ijacsa.thesai.org According to Table II, with the word ‫",اٌؾبشذْٚ"‬ three constant letters are found in the word, which is forming one possible root for the word; it is ‫".ؽشذ"‬ To find the pattern of the word ‫,"اٌؾبشذْٚ"‬ replace ‫"ػ"‬ letter with ‫"ف"‬ letter; then replace ‫"ػ"‬ letter with ‫"ع"‬ letter, after that ‫"د"‬ letter with ‫"ي"‬ letter, in order to achieve the pattern" ‫اٌفبعٍٛ‬ ْ ". Another example that contains four constant letters with the word ‫",اٌّزذؽشعبد"‬ . There are four constant letters forming the possible root ‫".دؽشط"‬ Therefore, the pattern is ‫".اٌّزفعٍالد"‬

2) Extracting all possible patterns when constant letters are less than three
If the number of constant letters is less than three, there is more than one possible pattern. For instance, the word "التقى" has only one constant letter, "ق". In this case, the pattern cannot be built directly, because at least three constant letters are needed to build a complete pattern. The algorithm therefore tries to find two more constant letters in the word in order to form the possible patterns. Note that every letter in the word is a candidate constant letter. The suggested letter pairs are therefore "ال", "ات", "اى", "لت", "لى", and "تى". From these suggested letters, the possible patterns are "فعتلى", "فلعلى", "فلتعل", "افعلى", "افتعل", and "الفعل". The process is shown in Table III.
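The enumeration of candidate patterns for a word with fewer than three constant letters can be sketched as below. This is an illustration under assumptions, not the authors' code: the name `candidate_patterns` is hypothetical, the position of the constant letter is passed in, and every other letter is treated as an equally valid candidate, as the text states.

```python
from itertools import combinations


def candidate_patterns(word, constant_positions):
    """Enumerate patterns by promoting other letters until three slots exist."""
    missing = max(0, 3 - len(constant_positions))
    others = [i for i in range(len(word)) if i not in constant_positions]
    patterns = []
    for extra in combinations(others, missing):
        slots = sorted(list(constant_positions) + list(extra))
        letters = list(word)
        for k, pos in enumerate(slots):
            # First slot -> "ف", second -> "ع", remaining -> "ل".
            letters[pos] = "ف" if k == 0 else "ع" if k == 1 else "ل"
        patterns.append("".join(letters))
    return patterns


# One constant letter at position 3; two more are chosen from the other four.
for p in candidate_patterns("التقى", [3]):
    print(p)
```

With one constant letter and four other letters there are six ways to choose the two missing slots, which is why six candidate patterns appear for this word.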

3) Exclude the wrong patterns by applying the rules
In the previous section, some of the extracted possible patterns ("فعتلى", "فلعلى", "فلتعل", "افعلى", "افتعل", "الفعل") are wrong because the non-constant letters in them are in the wrong places. The algorithm therefore applies the rules of Section C to reject the wrong patterns. For the word "التقى", the suggested patterns "فلعلى" and "فلتعل" are removed from the list after applying the rules. As indicated earlier in the rules for the letter "ل", if "ل" exists between two constant letters, it is considered a constant letter and cannot be an extra letter. Therefore, the remaining possible patterns are "فعتلى", "افتعل", "افعلى", and "الفعل"; see Table IV.

4) Minimizing the possible patterns by comparing them with patterns' list
A list of patterns is automatically extracted from the corpus of Thalji [24]; this list contains 4,320 patterns. To ensure that the possible patterns are correct, they are compared with this list, and any pattern not found in it is rejected. For example, among the possible patterns for the word "التقى", the pattern "فعتلى" is not found in the list, so it is rejected; the remaining possible patterns are "افتعل", "افعلى", and "الفعل".
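The comparison against the pattern list amounts to a membership filter. In the sketch below, `KNOWN_PATTERNS` is a three-entry stand-in invented for illustration; it is not the 4,320-pattern list from the corpus, and the function name `filter_known` is hypothetical.

```python
# Hypothetical stand-in for the pattern list extracted from the corpus.
KNOWN_PATTERNS = frozenset({"افتعل", "افعلى", "الفعل"})


def filter_known(candidates, known_patterns=KNOWN_PATTERNS):
    """Keep only the candidate patterns that occur in the known list."""
    kept = []
    for pattern in candidates:
        # Preserve the input order and skip duplicates.
        if pattern in known_patterns and pattern not in kept:
            kept.append(pattern)
    return kept


print(filter_known(["فعتلى", "افتعل", "افعلى", "الفعل"]))
```

A set gives constant-time lookups, which matters little for three entries but would for a list of several thousand patterns.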

E. Extract All Possible Roots for the Word

1) Finding all possible roots by matching the patterns
After all possible patterns have been found, all the possible roots that match the patterns are extracted. For example, for the word "التقى", the possible patterns are "افعلى", "افتعل", and "الفعل". Therefore, the possible roots are "لتق", "لقى", and "تقى".

2) Finding all possible roots by applying Ebdal rules
After a careful review of the content of Arabic dictionaries such as "Lessan AL-Arab" [25], it was found that this dictionary has roots like" ‫صطفً‬ ‫صطجً,‬ ‫صد‬ ‫ظطش,‬ , ‫س‬ , ‫."صدف‬ However, this dictionary does not apply the Ebdal rule [26]; that is, the Ebdal rules are not always applied. To be on the safe side, our proposed algorithm returns all possible roots for each word both with and without applying the Ebdal rules.

F. Solve the Problem with Shaddah
This work targets non-vocalized text, in which writers often omit the Shaddah above a letter. The algorithm therefore checks for a missing Shaddah. Starting from the second letter of the word, it considers every letter except the vowel letters as a possible carrier of a missing Shaddah. For example, for the word "البر", the algorithm generates the candidates "اللبر", "الببر", and "البرر".
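The generation of missing-Shaddah candidates can be sketched as follows. The set `VOWELS` is an assumption about which letters count as vowel letters, the function name `shaddah_candidates` is hypothetical, and the word is our reading of the example in the text.

```python
VOWELS = set("اويى")  # assumed vowel letters, skipped when doubling


def shaddah_candidates(word):
    """Double each non-vowel letter in turn, starting from the second letter."""
    return [
        word[: i + 1] + word[i] + word[i + 1:]
        for i in range(1, len(word))
        if word[i] not in VOWELS
    ]


print(shaddah_candidates("البر"))
```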

G. Solve the Problem with a Missing Vowel in Ealal Rules
In Arabic, if the root has one or more long vowels, these letters may be deleted in derived words. For example, for the root "قول", one possible derived word is "قل": the long vowel "و" is deleted during derivation. In such cases, the algorithm generates all possible positions of the missing long vowel; here it produces "وقل", "قول", and "قلو".
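Restoring a possibly deleted long vowel can be sketched as inserting the vowel at every position of the word. The function name `missing_vowel_candidates` and the `vowels` parameter are hypothetical; the default covers only the single vowel used in the example, and which vowels the algorithm actually tries at which positions is not specified in the text.

```python
def missing_vowel_candidates(word, vowels=("و",)):
    """Insert each vowel at every position of `word`, without duplicates."""
    candidates = []
    for vowel in vowels:
        for i in range(len(word) + 1):
            candidate = word[:i] + vowel + word[i:]
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


print(missing_vowel_candidates("قل"))
```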

IV. EXPERIMENT AND EVALUATION
In this section, the presented algorithm is compared with the Arabic root extraction algorithm of Khoja and Garside, which is the most popular Arabic root extraction algorithm and the only one that is publicly available for download. Khoja and Garside tested their algorithm on newspaper text and reported 95% accuracy. Specifically, we make a direct comparison between the algorithm of Khoja and Garside and the presented algorithm on the corpus of Thalji. Thalji's corpus is an automatically built corpus constructed from ten old Arabic dictionaries, designed mainly to test and fairly compare Arabic root extraction algorithms. It contains 720,000 word-root pairs, which removes the need for a human expert to verify the correct root of each word used in testing or comparison. Moreover, the corpus has more than 4,320 types of words derived from 12,000 roots, which supports broad coverage of word forms.
The experimental results show that the accuracy of the algorithm of Khoja and Garside is 63%, whereas the presented algorithm achieves 94%, as shown in Figure I.
[Figure I: accuracy of the algorithm of Khoja and Garside and of the presented algorithm; vertical axis from 0% to 100%; extracted bar labels: 63% and 92%]
We observed that the following limitations reduced the accuracy of the algorithm of Khoja and Garside:
1) The algorithm of Khoja and Garside is missing a large number of roots, prefixes, suffixes, and patterns. Its dictionary restricts the result to just 4,748 roots: 3,822 trilateral roots and 926 quadrilateral roots. The algorithm therefore ignores the remaining 7,252 of the 12,000 roots in the corpus, and any word derived from an ignored root yields a wrong result.
2) The algorithm of Khoja and Garside suffers from affix ambiguity. For example, it returns the root "ميع" for the word "استماع", but it should also return the root "سمع". This is because it starts by removing the longest suffix or prefix, even though the removed letters are sometimes neither a prefix nor a suffix but root letters.
3) The algorithm of Khoja and Garside, again, returns just one solution for non-vocalized words, ignoring other possible solutions. For example, the word ‫,"لً"‬ the possible roots are"ٍٟ‫ل‬ ‫ل١ً‬ ‫ٚلً,‬ ‫لٍٛ,‬ ‫لًٍ,‬ ‫,"لٛي,‬ where the result is just ‫."لًٍ"‬ 4) The algorithm of Khoja and Garside replaces a weak letter with the letter " ٚ", which occasionally produces a root that is not related to the original word. For example, it returns ‫"صٛي"‬ root for the word ‫"أص١ً"‬ which is the wrong root, the right root is ‫."أصً"‬ 5) The algorithm of Khoja and Garside may generate invalid roots or fail to find roots for words that contain Ebdal rule ‫"اثذاي"‬ like ‫اصطؾت,‬ ‫‪"and‬اصطٍؼ,‬ ‫ا‬ ‫صد٘ش‬ " .
6) The algorithm of Khoja and Garside, also, doesn't deal with Shaddah. For example, with the word " ‫ٚأة"‬ , it returns the root"ٟ‫,"أث‬ where the possible root also is ‫."أثت"‬ To be fair, the followings are the limitation points of the proposed algorithm:

7) The presented algorithm is unable to find the root of some words. For example, for the word " ‫ر٠عٛعخ"‬ , the algorithm returns only the root ‫"رعع"‬. In this well-known case, once the algorithm finds three constant letters, it returns them as a trilateral root, which becomes the result. The presented algorithm also does not handle the exchange of a constant letter with a vowel letter, because this case rarely occurs.
8) The presented algorithm returns all possible roots of a word. However, this makes it harder for the researcher to determine which root is the exact one for the word. This limitation arises because the presented algorithm deals with isolated words rather than complete, meaningful sentences in a paragraph.

V. CONCLUSION AND FUTURE WORK
In this study, we investigated the rules on which existing Arabic root extraction is based, analysed most previous Arabic root extraction algorithms, drew on their strong ideas, and overcame their weaknesses. This study continues what others have started by making extensive enhancements and improvements.