A Corpus for Testing, Comparing and Enhancing Arabic Root Extraction Algorithms

Many studies have recently focused on building, evaluating and comparing Arabic root extraction algorithms. The main challenges facing these algorithms are the absence of a standard data set for testing, comparing and enhancing them, and the absence of complete lists of roots, prefixes, suffixes and patterns. In this paper, we describe the development of a new corpus derived from traditional Arabic dictionaries ("mu'jams"). The goal is to use the corpus as a new gold-standard data set for testing, comparing and enhancing different Arabic root extraction algorithms. The data set covers all types of words and all roots. It stores each word and its root as a pair, so that no human expert is needed to verify the correct roots of the words used in the testing or comparison process. We describe the individual phases of the corpus construction, i.e. normalisation, reading each derived word and its root as a pair, and reading each root with its definition part. We have automatically extracted 12,000 roots, 430 prefixes, 320 suffixes, 4,320 patterns, and 720,000 word-root pairs. The Khoja and Garside Arabic root extraction algorithm was tested on this corpus; its accuracy was 63%. After supplying it with our lists of roots, prefixes, suffixes and patterns, its accuracy rose to 84%.

Keywords—Arabic root extraction algorithm; corpus; pattern; prefix; suffix; root


I. INTRODUCTION
Most researchers working in the field of Arabic root extraction algorithms opt to construct their own manually collected data sets to run their experiments. Most of the time, these data sets are either small or not comprehensive. Therefore, their experimental findings may be neither convincing nor clear as to how the results scale up [1].
The literature abounds with discussions about the design of Arabic stemming algorithms; yet little effort has gone into the investigation of the nature of the data set at the core of all these systems.
Al-Kabi and Al-Mustafa in [2], Ghwanmeh et al. in [3], Al-Kabi et al. in [4], Taghva et al. in [5], Al-Shalabi in [6], Al-Shalabi and Evens in [7], Yaseen and Hmeidi in [8], Hmeidi et al. in [9], and most recent Arabic root extraction algorithms in the literature have each tested their proposed algorithm on a different data set and compared their findings with existing work. However, the data sets they used did not cover all types of words. In addition, an Arabic language expert had to be consulted to verify the accuracy of each finding manually.
Most of these algorithms used manually constructed lists of prefixes, suffixes, and patterns, as no standard lists were available. Thus, there was huge variation from one algorithm to another: the larger the lists, the more accurate the result.
Many research projects have studied Arabic root extraction algorithms and their effectiveness. Most of these studies claim an accuracy exceeding 75%. However, the accuracy of these algorithms has been found to drop when they are tested on a data set other than the one the researchers originally used.
For example, in [3] Ghwanmeh et al. claimed 95% accuracy for their algorithm. Testing the same algorithm on a different data set, the authors of [4] reported an accuracy of 67.40% for the Ghwanmeh et al. algorithm. Moreover, in [10] the authors conducted another test of the Ghwanmeh et al. algorithm using yet another data set and reported an accuracy of 39%. This variation is due to differences in the size and type of the data sets used to test the Ghwanmeh et al. stemmer [4].
As mentioned earlier, the lack of a standard data set is the main problem facing these algorithms. Each algorithm uses its own data set. These data sets differ in size and in the types of words they contain, and they are not available for other authors to use.
Arabic root extraction algorithms need a standard data set for testing their accuracy in comparison with other algorithms. This data set should be large enough to cover all types of words and all roots, and it should contain each word and its root as a pair. In addition, Arabic root extraction algorithms need complete lists of roots, prefixes, suffixes, and patterns to enhance their accuracy. The quality and coverage of the data set determine the quality and coverage of each Arabic root extraction algorithm, and any limitations found in the data set will make their way through to the algorithm. Arabic root extraction is an important step toward conducting effective research on most Arabic natural language processing (ANLP) applications.
Arabic root extraction algorithms are used in information retrieval systems, indexers, text mining, text classification, data compression, spell checking, text summarisation and machine translation. These algorithms extract the stems or roots of different words, so that words derived from the same stem or root are grouped together.
In Latin-based languages, the stem and the root are the same; however, this is not the case for Arabic. Stemming is the first step toward finding the root. A stem is simply defined as a word without prefixes and/or suffixes [11]. Further processing of a stem, through the removal of some infixes, might be required to obtain an Arabic root.
The lack of a gold-standard data set for benchmarking different Arabic root extraction algorithms led us to develop and build an automated corpus (a gold-standard data set). The purpose of this corpus is to test, compare and enhance different Arabic root extraction algorithms.

The gold-standard data set:
 Should be large enough to contain all types of words and roots. There are about 12,000 roots.
 Should contain each word and its root, to avoid the intervention of a human expert normally needed to verify the correct root of each word used in the testing or comparison process.
Our aim in this paper is to build a corpus that pairs each word with its root and contains standard lists of roots, prefixes, suffixes, and patterns. The suggested corpus will help researchers enhance, test and compare present and future root extraction algorithms.
The structure of this paper is as follows. Section 2 discusses previous approaches and their drawbacks. Section 3 describes the proposed methodology, including the details of each process. Section 4 explains the experimental implementation of our approach and the evaluation process. Section 5 concludes the main points of the paper and gives some future directions.
II. PREVIOUS WORK

The corpus is freely and publicly available for researchers to download. The main issue is that Khoja's corpus is limited in its contents, manually tagged, and missing root derivatives.
Al-Shawakfa et al. in [10] built a corpus for the purpose of evaluating and comparing Arabic root extraction algorithms. This corpus was built upon the set of triliteral Arabic roots introduced by Buckwalter in [13].
The developed corpus was mainly built from 3,823 triliteral roots. Using these roots as a base, a corpus of approximately 27.6 million unique words (1.63 GB in size) was obtained. All combinations of 73 triliteral patterns, 10 suffixes, and eight prefixes were applied to the roots to create different forms of Arabic words. All generated words were syntactically correct, but not necessarily semantically correct.
The Al-Shawakfa corpus did not require manual root verification upon completing the testing process.
The disadvantages of the Al-Shawakfa corpus are:
 In many cases, words are not semantically correct.
 Although the corpus contains a large data set, it covers only 3,823 roots out of 12,000.
 Two types of words are missing: 1) words in which a vowel letter is changed into a different vowel letter (" ‫االلالة‬ "). For example, in the root ‫,"لٛي"‬ the letter "ٚ" changes to ‫"ا"‬ in the word ‫."لبي"‬ 2) words in which the position of a letter changes (" ‫االثعاي‬ "). For example, in the root ‫,"ٚخٗ"‬ the letter "ٚ" changes to ‫"ا"‬ in the word "ٖ‫,"خب‬ and the position of ‫"ا"‬ also changes in the new word.
Sawalha and Atwell [14] constructed a broad-coverage lexical resource to improve the accuracy of morphological analysers and part-of-speech taggers for Arabic text. Twenty-three lexicons were collected from freely available web resources.
The lexicons' texts contain 14,369,570 words, 2,184,315 vowelised word types and 569,412 non-vowelised word types. In Sawalha and Atwell's approach, a tokenising module first identifies the root entries and their definition parts. A bag of words is then extracted from the definition text. The bag stores word-root pairs, where each word appearing in the definition part is associated with the root of that part.
Many words appearing in the definition part are not relevant to the root associated with that definition, yet such words end up inside the bag of word-root pairs. An analysis that verifies the word-root pairs is therefore performed by applying linguistic knowledge that governs the derivation of words from their roots. These conditions are described as follows: Condition 1 (check consonants): if all consonant letters constructing the root appear in the analysed word, then check condition 2.
Condition 2 (consonants order): if all root letters appear in the same order as in the word's letters, then the word-root combination might be correct [14]. Since Arabic is a sophisticated language, these two conditions are not enough to be sure that a word is derived from a given root. The Sawalha and Atwell algorithm was implemented; it succeeds in some cases but fails in many others.
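The two conditions amount to a standard ordered-subsequence check. A minimal sketch in Python (with Latin transliteration standing in for the Arabic letters, purely for readability):

```python
def root_may_derive(root: str, word: str) -> bool:
    """Conditions 1 and 2 combined: every root letter must appear in the
    word (check consonants), and in the same relative order (consonants
    order). This is an ordered-subsequence test."""
    letters = iter(word)                       # `in` on an iterator consumes it,
    return all(ch in letters for ch in root)   # enforcing left-to-right order

# 'ktb' occurs in order inside 'maktub', so the pair might be correct;
# in 'batk' the order is violated, so the pair is rejected outright.
print(root_may_derive("ktb", "maktub"))  # True
print(root_may_derive("ktb", "batk"))    # False
```

As the paper notes, passing this filter is necessary but not sufficient: it cannot distinguish a genuine derivative from an unrelated word that merely happens to contain the root letters in order.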
The Sawalha and Atwell research is a step forward towards creating a new corpus derived from Arabic lexicons to be used as a standard data set containing all the roots and a large number of derivatives, pairing each root with its derivatives. Our findings show that many words are related to unexpected roots. Table 1 shows examples of words that are wrongly related to roots. In addition, the algorithm does not report how many word-root pairs were found. Clearly, this work needs more rules to improve its results.

III. METHOD
All Arabic roots and their derivations can be found in traditional dictionaries ("mu'jams", ُ‫.)"اٌّعبخ‬ Most of the Arabic dictionaries were studied carefully for this paper.
Traditional Arabic lexicons are not available as computerised lexicographic databases. Moreover, they follow different arrangement methodologies from modern English dictionaries [14]. Existing Arabic dictionaries suffer from many issues, the main one being that they were built for manual use. Arabic dictionaries present each root as a title followed by its definition part, which may consist of one or more paragraphs per root; these paragraphs describe the meaning of the root and contain possible word derivations from it. The definition part may extend over many pages. Each dictionary has its own distinct definition part, so new and different information can be read for each root in different dictionaries. Figure 1 shows a sample of text taken from the Al-Mesbah Al-Monir dictionary ‫إٌّ١ؽ"(‬ ‫اٌّصجبذ‬ ‫,)"ِعدُ‬ with the roots ‫أثؽ"(‬ ‫اثع,‬ ‫)"أثع,‬ and their definition parts. Figure 2 shows a sample of text taken from the Asas Al-Balaghah dictionary (" ‫ِعدُ‬ ‫اٌجالؼخ‬ ‫,)"أقبـ‬ with the roots ‫أثؽ"(‬ ‫اثع,‬ ‫)"أثع,‬ and their definition parts. Note the different information given in each. The definition part is written as an article that defines most of the words derived from a certain root, but it also contains many other words that are neither the root nor its derivatives; they exist mainly to explain the meaning of the root. In Figure 1 the roots are written between brackets and the derived words between parentheses in red. This is a modified version of the original dictionary; the original version did not distinguish between the roots and their derivations.
The problem with the modified version is that many parentheses contain words other than root derivation words. In addition, not all root derivation words are written between parentheses.

A. Manual Annotations
Traditionally, lexicons are constructed in many ways. Roots and lexical entries are presented without any computerised lexicographic representation, and in many lexicons the roots are not distinguishable from other entries.
In this study, each root has been distinguished manually from the other entries by placing it between two star symbols ("*"). Figure 3 shows a sample text of the Asas Al-Balaghah dictionary after placing each root between two stars. The process covered all the studied traditional dictionaries, enabling each root and its definition part to be read automatically.

Fig. 3. Sample text of the Asas Al-Balaghah dictionary after distinguishing the roots
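With every root wrapped in stars, each root and its trailing definition part can be read off mechanically. A possible sketch of that reader, assuming only the `*root* definition …` layout described above (transliterated sample text stands in for the Arabic):

```python
import re

def read_entries(dictionary_text: str):
    """Return (root, definition) pairs from a dictionary file in which
    each root has been manually wrapped as *root*. The definition part
    is everything up to the next starred root."""
    entry = re.compile(r"\*([^*]+)\*([^*]*)")
    return [(root.strip(), definition.strip())
            for root, definition in entry.findall(dictionary_text)]

text = "*qwl* said, a saying, sayings ... *ktb* wrote, book, writer ..."
for root, definition in read_entries(text):
    print(root, "->", definition)
```

Each dictionary processed this way yields one (root, definition part) table, which is the database described in the next sections.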

B. Normalisation
Text normalisation is a process consisting of a series of steps followed to wrangle, clean and standardise textual data into a form that can be consumed as input by other NLP and analytics systems and applications [13].
The steps of the proposed text normalisation are as follows:
1) Remove the kasheeda (tatweel) elongation symbol ("ـ").
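The kasheeda (tatweel, Unicode U+0640) carries no lexical information and can be stripped outright. A minimal sketch of this step; the paper's full normalisation pipeline may include further steps not reproduced here:

```python
KASHEEDA = "\u0640"  # Arabic tatweel, used only to elongate letters visually

def normalise(text: str) -> str:
    """Remove the kasheeda elongation symbol from the dictionary text.
    Stripping it collapses elongated spellings onto their plain forms."""
    return text.replace(KASHEEDA, "")

print(normalise("كـــتـــب"))  # the elongated form collapses to كتب
```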

C. Extract All Information
In this section, we describe how all the information is read from the dictionaries.

1) Extract Roots and Their Definition Parts
A separate database was created and saved for each studied dictionary. The created database consists of the distinguished roots and their definition parts. Table 2 shows a sample of the created database for some roots and their definition parts taken from the Asas Al-Balaghah dictionary. Our work proceeds in the opposite direction to root extraction algorithms: they start from the derived words to find the root, whereas in our work the root is known initially and the derived words have to be found. When the root is known, finding the derived words is much easier than finding the root. We have used the rules discovered by the root extraction algorithm in [25]. These rules are given below:

2) Extract Derivation Words

Prefix letters:
These letters can be added only in the prefix part. They are: { ‫ة‬ ‫ؾ،‬ ‫،ـ،‬ ‫.}ي‬ The prefix part is the part of the word consisting of one or more letters before the first letter of the root. If one of these letters is found in a place other than the prefix part, and it is not one of the root's letters, the word is rejected. For root extraction algorithms, finding the prefix part is a challenge, but in our work the prefix part can be determined because the root is known. For example, the word ‫"اٌكجبثخ"‬ is found in the definition part of the root ‫,"قجت"‬ so the letters ‫ة"‬ , ‫ة‬ , ‫"ـ‬ are consonants in the word, and the part before the letter ‫"ـ"‬ is the prefix, which is ‫."اي"‬ This word is accepted because "‫"ي‬ can appear in the prefix part, so the pair ‫قجت"(‬ ‫)"اٌكجبثخ,‬ is accepted. The word ‫"اٌكجبثخ"‬ is also found in the definition part of the root ‫,"أٌت"‬ so the letters ‫ة"‬ , ‫ة‬ , ‫"ا‬ are consonants in the word, but ‫"ـ"‬ is not in a prefix position (it is after ‫,)"ا"‬ so the pair ‫أٌت"(‬ ‫)"اٌكجبثخ,‬ is rejected.

Suffix letters:
These letters can be added only in the suffix part. The suffix part is the part of the word consisting of one or more letters after the last letter of the original root. If one of these letters is found in a place other than the suffix part, and it is not one of the root's letters, the word is rejected. In this paper, the suffixes are limited to the single-letter suffix {ٖ}.
For example, the word ‫"أٌٚٗ"‬ is found in the definition part of the root ‫,"أٚي"‬ so the letters ‫ي"‬ , ٚ , ‫"أ‬ are consonants, and the part after the letter ‫"ي"‬ is the suffix part, which is "ٖ". This word is accepted because "ٖ" was found in the suffix part, so the pair ‫أٚي"(‬ ‫)"أٌٚٗ,‬ is accepted.
Another example is the word ‫,"أ٘ٛٞ"‬ found in the definition part of the root ‫,"أٚٞ"‬ whose consonant letters are "ٞ , ٚ , ‫."أ‬ Here "ٖ" is not a root letter and was not found in the suffix part (it is before "ٞ", not after), so the pair ‫أٚٞ"(‬ ‫)"أ٘ٛٞ,‬ is rejected.

Prefix-Suffix letters:
These letters can appear only on either side of the word, i.e. in the prefix part or in the suffix part. They are: {َ, ‫ن‬ , ْ}.
If one of these letters is found in a place other than the prefix or suffix part of the word, and it is not one of the root's letters, the word is rejected. For example, consider the word ‫"أٔجط"‬ and the root ‫."أثط"‬ The word is rejected because " ْ" is in neither a prefix nor a suffix position (it is neither before ‫"أ"‬ nor after ‫,)"ض"‬ so the pair (" ‫أثط‬ ‫)"أٔجط,‬ is rejected.
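Taken together, the three letter classes give a simple accept/reject test for a candidate word found in a root's definition part. A sketch of that test, under two stated assumptions: the root letters are aligned to their first ordered occurrence in the word (a simplification when letters repeat), and the transliterated letter sets below are illustrative placeholders, not the paper's actual lists:

```python
def split_parts(word: str, root: str):
    """Align the root letters to their first ordered occurrence in the
    word; return (prefix, infix_letters, suffix), or None if the root
    letters do not all appear in order."""
    positions, i = [], 0
    for pos, ch in enumerate(word):
        if i < len(root) and ch == root[i]:
            positions.append(pos)
            i += 1
    if i < len(root):
        return None
    prefix = word[:positions[0]]
    suffix = word[positions[-1] + 1:]
    infix = [word[p] for p in range(positions[0], positions[-1] + 1)
             if p not in positions]
    return prefix, infix, suffix

def accept(word, root, prefix_only, suffix_only, either_side):
    """Apply the three rules: prefix-only letters may occur only before
    the root, suffix-only letters only after it, and either-side letters
    never between the root letters."""
    parts = split_parts(word, root)
    if parts is None:
        return False
    prefix, infix, suffix = parts
    if any(ch in prefix_only for ch in infix + list(suffix)):
        return False            # prefix-only letter inside or after the root
    if any(ch in suffix_only for ch in list(prefix) + infix):
        return False            # suffix-only letter before or inside the root
    if any(ch in either_side for ch in infix):
        return False            # either-side letter between root letters
    return True

# Illustrative (transliterated) letter sets, not the paper's real lists:
P, S, E = {"l"}, {"h"}, {"n"}
print(accept("alktb", "ktb", P, S, E))  # 'l' sits in the prefix -> accepted
print(accept("kltb", "ktb", P, S, E))   # 'l' sits inside the root -> rejected
```

Every (root, word) pair that survives this test is stored as a word-root pair in the database.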
All roots and their derivation words are stored in a database. Table 3 shows a sample from the database for the Asas Al-Balaghah dictionary after picking out the derivation words.
The database contains the roots, their derivation words, and the definition part of each root; the derivation words are distinguished by placing each one between brackets.

3) Extract Prefixes, Suffixes and Patterns
Since the root and its derivation words are known, the prefix, suffix and pattern can be extracted from each word. The letter "ف" replaces the first root letter in the word, "ع" replaces the second, and "ل" replaces the third; if the root is more than three letters long, "ل" replaces each of the remaining root letters.
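Because the word and its root are both known, the pattern falls out of the same alignment: walk the word, substituting the template letters for the aligned root letters and keeping everything else. A sketch, where the greedy first-occurrence alignment is again a simplifying assumption:

```python
def extract_pattern(word: str, root: str):
    """Replace the 1st root letter with ف, the 2nd with ع, and the 3rd
    and any further root letters with ل; non-root letters (prefixes,
    suffixes, infixes) are kept as they are."""
    template = "فعل"
    out, i = [], 0
    for ch in word:
        if i < len(root) and ch == root[i]:
            out.append(template[min(i, 2)])  # 4th+ root letters also get ل
            i += 1
        else:
            out.append(ch)
    return "".join(out) if i == len(root) else None

# الكتاب over the root كتب keeps its prefix ال and infix ا in the pattern:
print(extract_pattern("الكتاب", "كتب"))  # الفعال
```

Collecting the pattern plus the prefix and suffix segments from every accepted word-root pair is what produces the lists reported in the paper.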

IV. EXPERIMENT AND EVALUATION
In this section, a comparison between our corpus, the Khoja and Garside corpus, the Buckwalter corpus, and the Al-Shawakfa et al. corpus is conducted. The result of the comparison is shown in Table 5. Table 5 shows that the Khoja and Buckwalter corpora do not pair each word with its root. As mentioned earlier, Khoja's corpus has a limited number of suffixes, prefixes and patterns. The Al-Shawakfa corpus has more suffixes, prefixes and patterns than Khoja's corpus. Our corpus has the longest lists of roots, prefixes, suffixes and patterns. The Al-Shawakfa et al. corpus has the longest list of word-root pairs, but as mentioned in the previous-work section, many of its words are semantically incorrect.
Khoja and Garside reported 96% accuracy for their stemmer using newspaper text, on the assumption that it was evaluated on their developed corpus. However, details of the evaluation methodology, the text used in the evaluation, and the accuracy metrics are not available [26].
The Khoja and Garside algorithm has been tested in many studies: the test in [10] reveals an accuracy of 34%, while the test in [3] reveals an accuracy of 74%. This is due to differences in the size and type of the data sets used [4]. The main challenges facing authors who want to test or compare these algorithms are the manual verification of the results and the absence of a corpus pairing each word with its root. Khoja's algorithm was also tested using the Al-Shawakfa corpus: an accuracy of 34% was obtained initially, which increased to 55% after providing Khoja's algorithm with the Al-Shawakfa corpus lists; see Figure 4.
The Khoja and Garside algorithm was tested on the newly developed corpus to compute its accuracy. It achieved about 63% average accuracy. This is due to several factors. The algorithm restricts its results to just 4,748 roots (3,822 triliteral and 926 quadriliteral roots), ignoring 7,252 roots; for example, the word ‫"اثبٔٗ"‬ is stemmed to the wrong root ‫,"ث١ٓ"‬ because the root ‫‪"‬أثت"‬ is missing.
The algorithm is also missing a very large number of prefixes, suffixes, and patterns; for example, the word ‫"زٛقت"‬ is not stemmed because the pattern ‫"ـٛعً"‬ is missing. Another test was conducted on the Khoja and Garside algorithm after supplying it with our lists of roots, prefixes, suffixes, and patterns; it then achieved 84% average accuracy. Figure 5 shows the Khoja and Garside algorithm's average accuracy before and after supplying the newly developed corpus's lists.

V. CONCLUSION AND FUTURE WORK
In this work, a new corpus has been developed based on traditional manual Arabic dictionaries ("mu'jams"). The corpus was built mainly for testing, comparing and enhancing Arabic root extraction algorithms; we automatically extracted from these dictionaries 12,000 roots, 430 prefixes, 320 suffixes, 4,320 patterns, and 720,000 word-root pairs.
The developed corpus covers all types of words and all roots, and it contains each word paired with its root. The developed corpus will save a lot of time and effort compared with the manual corpora previously used for testing purposes.
There is no need for the manual verification usually done by consulting Arabic language experts. Researchers can test and compare Arabic root extraction algorithms using the newly automated corpus.
The Khoja and Garside Arabic root extraction algorithm was tested using the developed corpus; the test gave 63% accuracy. After supplying the algorithm with our lists of roots, prefixes, suffixes, and patterns, its accuracy rose to 84%.

Fig. 4. Khoja and Garside algorithm's accuracy before and after supplying the Al-Shawakfa et al. corpus's lists

TABLE I. EXAMPLE OF WORDS THAT ARE WRONGLY RELATED TO THE ROOTS BY THE SAWALHA AND ATWELL CORPUS

TABLE II. SAMPLE OF THE DATABASE OF ROOTS AND THEIR DEFINITIONS FOR THE ASAS AL-BALAGHAH DICTIONARY

TABLE III. SAMPLE OF THE DATABASE FOR THE ASAS AL-BALAGHAH DICTIONARY AFTER PICKING THE DERIVATION WORDS

TABLE IV. SAMPLE OF THE DATABASE OF PREFIXES, SUFFIXES AND PATTERNS

TABLE V. COMPARISON BETWEEN OUR CORPUS, THE KHOJA AND GARSIDE CORPUS, THE BUCKWALTER CORPUS, AND THE AL-SHAWAKFA ET AL. CORPUS