For an Independent Spell-checking System from the Arabic Language Vocabulary

Abstract: In this paper, we propose a new approach for correcting spelling errors in Arabic. The approach is almost independent of the dictionary used, because we introduce morphological analysis into the spell-checking process. Hence, our system uses a stem dictionary of reduced size rather than a large dictionary that still does not cover all Arabic words. The results obtained are positive and satisfactory; they support the validity of our concept and show the importance of the approach.


I. INTRODUCTION
Automatic correction of spelling errors is one of the most important areas in Natural Language Processing (NLP) and has been the subject of much research since the 1960s [1]. Spell-checking consists in suggesting the closest corrections for a misspelled word. This implies developing error models and methods for ranking plausible corrections, and having a representative lexicon of the language to compare against.
Early research sought to establish a model of errors. Damerau [2] defined a spelling error as a simple combination of the elementary editing operations of insertion, deletion, transposition and substitution. Based on Damerau's definition, Levenshtein [3] defined his distance (the Levenshtein distance), which is characterized by three elementary editing operations: insertion, deletion and permutation (substitution of one character for another). Another model, proposed by Pollock and Zamora [4], associates with each word in the dictionary its alpha-code (the consonants of the word). This requires two dictionaries, one for the words and the other for their alpha-codes, and correction is done by comparing the alpha-codes with the misspelled word. This method is efficient for permutation errors.
Among these studies we also find the decomposition method based on the n-gram language model: a misspelled word is decomposed into bigrams and trigrams, which are compared with the bigrams and trigrams of the dictionary in order to produce a similarity index that designates the words nearest to the misspelled word [5]. In 1996, Oflazer [6] defined a new approach, called tolerant recognition of spelling errors, that uses a finite-state automaton and a distance called the cut-off edit distance. With this approach, a misspelled word is corrected by traversing the dictionary automaton and calculating the cut-off edit distance at each transition, without exceeding a threshold defined beforehand in the algorithm. Gueddah, Yousfi and Belkasmi [7] proposed an efficient variant of the edit distance that integrates matrices of editing-error frequencies [8] into the Levenshtein algorithm in order to improve the ranking of the solutions for an erroneous word in Arabic documents.
In natural languages generally, and in Arabic especially, existing automatic spelling correction systems do not cover all misspelled words. Spelling correction has always depended on the availability of a lexicon covering the totality of misspelled words.
Several studies have addressed the development of dictionaries adapted to spell-checking systems. Among them we cite in particular the Ayaspell project, which aims to generate dictionaries such as the Arabic lexicon Hunspell-ar Version 3.2. This lexicon contains more than 300,000 Arabic words and is designed for the free office suite OpenOffice (Writer) and for Mozilla Firefox 3, Thunderbird and Google Chrome, which incorporate the Hunspell spell-checker (originally designed for the Hungarian language). Despite the linguistic resources available for Arabic, we do not yet have a robust spell-checker capable of covering all the spelling errors committed. This raises a major challenge for spell-checking [9].
www.ijacsa.thesai.org
To overcome this limitation of spell-checking for Arabic, we propose in this paper a new approach that introduces morphological analysis into the spell-checking process.
In reality, few works deal with morphological analysis in the spell-checking process. Among them we cite in particular:
• Emirkanian [10] developed an expert system for spell-checking, morphological analysis and syntactic analysis of French, built around a morpho-syntactic analyzer capable of detecting and correcting the spelling, morphological and syntactic errors frequently committed in French documents. The system integrates knowledge-based rules of French at several levels: orthographic (dictionary of radicals), morphological (analyzer, dictionary of suffixes) and syntactic (syntactic tree, substitution rules, completion rules). It uses a metric distance [11] to limit the search space and define the substitution rules.
• Bowden and Kiraz [12] presented a morpho-graphemic model for correcting spelling and morphological errors, based on the McCarthy morphological analyzer [13]. The advantage of this model is the way it combines lexical analysis with morphological analysis to determine the possible corrections.
• A more recent study is presented by Shaalan and his team [14]. This project presents a spell-checking system for Arabic that starts from a huge dictionary of 13 million words generated by the AraComLex finite-state transducer, of which only 9 million are valid lexical forms after filtering by Buckwalter's AraMorph morphological analyzer [15]. The authors used this dictionary to propose a generic spell-checking model for Arabic based on finite-state automaton technology [6] and a specific metric distance [3], combined with a noisy channel model and with knowledge-based rules that assign weights to the suggested corrections in order to refine the best solutions.
The major drawback of all these spell-checking systems lies in the limitation of the dictionaries used, which do not cover all the words of a given language. The idea presented in this paper is to develop a spell-checking system for Arabic that is independent of the vocabulary, by introducing morphological analysis into the spell-checking process.

II. THE MORPHOLOGICAL ANALYZER: ARAMORPH
The Buckwalter morphological analyzer [15], developed by the LDC (Linguistic Data Consortium) and named AraMorph, segments each word into a "prefix-stem-suffix" triplet. AraMorph relies mainly on three lexicons: prefixes (548 entries), suffixes (906 entries) and stems (78,839 entries). The lexicons are complemented by three compatibility tables that cover all the possible combinations of prefix-stem (2435 entries), stem-suffix (1612 entries) and prefix-suffix (1138 entries). Thus, the parser outputs the stems, prefixes and suffixes associated with the word to be analyzed, then checks the validity of these solutions against the lexicons of the system and the prefix-stem, stem-suffix and prefix-suffix correspondence tables. The stems used in AraMorph are constructed as follows: the stems of the root "فعل" are "فاعل", "فعول", "فعيل", "فوعل" and "فعال".
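The segmentation and validation steps described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the tiny lexicons, the pairwise tables and the function name `analyze` are made up for illustration, the strings are written in Buckwalter transliteration (for example `f` for ف and `dxl` for دخل), and compatibility is checked on literal strings, whereas the real analyzer works with much larger lexicons and tables.

```python
from typing import List, Tuple

# Toy lexicons (hypothetical); "" is the empty prefix or suffix.
PREFIXES = {"", "f", "Al"}
STEMS = {"dxl", "ktAb"}
SUFFIXES = {"", "wA", "An"}

# Toy compatibility tables (hypothetical), one per pair of segment types.
PREFIX_STEM = {("", "dxl"), ("f", "dxl"), ("", "ktAb"), ("Al", "ktAb")}
STEM_SUFFIX = {("dxl", ""), ("dxl", "wA"), ("ktAb", ""), ("ktAb", "An")}
PREFIX_SUFFIX = {("", ""), ("", "wA"), ("", "An"),
                 ("f", ""), ("f", "wA"), ("Al", ""), ("Al", "An")}


def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return every valid (prefix, stem, suffix) segmentation of `word`."""
    analyses = []
    n = len(word)
    for i in range(n + 1):          # end of the prefix
        for j in range(i, n + 1):   # end of the stem
            p, t, s = word[:i], word[i:j], word[j:]
            in_lexicons = p in PREFIXES and t in STEMS and s in SUFFIXES
            compatible = ((p, t) in PREFIX_STEM
                          and (t, s) in STEM_SUFFIX
                          and (p, s) in PREFIX_SUFFIX)
            if in_lexicons and compatible:
                analyses.append((p, t, s))
    return analyses


print(analyze("fdxlwA"))  # a word the toy tables accept
print(analyze("Aldxl"))   # rejected: ("Al", "dxl") is not a compatible pair
```

A word is accepted only when all three segments are in the lexicons and all three pairwise checks succeed, which is why a small stem list can stand in for a much larger list of full word forms.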

III. THE LEVENSHTEIN DISTANCE
Among the best-known metric methods in the field of spell-checking is the Levenshtein distance [3], also known as the edit distance. The edit distance is the minimal number of elementary editing operations required to transform a misspelled word into a dictionary word. The editing operations considered by Levenshtein are insertion, deletion and permutation (substitution of one character for another). Calculating the edit distance between two strings $P$ and $Q$, of lengths $m$ and $n$ respectively, consists in calculating recursively, step by step, in a matrix $D$ of size $(m+1) \times (n+1)$, the edit distance between the different substrings of $P$ and $Q$.
The cell $D(i,j)$, corresponding to the edit distance between the two substrings $P_i$ (the first $i$ characters of $P$) and $Q_j$ (the first $j$ characters of $Q$), is given by the following recurrence relation:

$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 \\ D(i,j-1) + 1 \\ D(i-1,j-1) + c(p_i, q_j) \end{cases}$$

with $D(i,0) = i$, $D(0,j) = j$, and $c(p_i, q_j) = 0$ if the characters $p_i$ and $q_j$ are equal and $1$ otherwise.
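The matrix computation described in this section can be sketched in a few lines of Python. This is the standard unit-cost formulation; the function and variable names are illustrative, and the example strings are the Buckwalter transliterations of "قغخل" (`qgxl`) and "فدخل" (`fdxl`).

```python
def levenshtein(p: str, q: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    m, n = len(p), len(q)
    # d[i][j] is the distance between the first i characters of p
    # and the first j characters of q.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete i characters
    for j in range(n + 1):
        d[0][j] = j  # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]


print(levenshtein("qgxl", "fdxl"))  # 2: two substitutions
```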

IV. INTRODUCING MORPHOLOGICAL ANALYSIS INTO LEVENSHTEIN DISTANCE
Our new idea in this work is to use a small dictionary of Arabic stems to correct spelling errors instead of a large dictionary. In other words, our vision is to invest in a relevant metric method instead of building a dictionary that covers all the words of a given language, which is usually difficult to build. According to the morphological analysis approach, for a word $W$ there exists $(P_v, T_v, S_v)$ in $\mathcal{P} \times \mathcal{T} \times \mathcal{S}$ (the sets of prefixes, stems and suffixes) such that $W = P_v T_v S_v$, and likewise for the misspelled word, $W_{err} = P_{err} T_{err} S_{err}$, where $P_{err}$ means an erroneous prefix, $T_{err}$ an erroneous stem and $S_{err}$ an erroneous suffix.
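To introduce the morphological analysis concept (used by Buckwalter) into the Levenshtein algorithm, we have defined a new measure, noted $M$. The measurement between $W_{err}$ and a vector $(P_v, T_v, S_v)$ is given by the following formula:

$$M\big(W_{err}, (P_v, T_v, S_v)\big) = \min_{W_{err} = P_{err} T_{err} S_{err}} \big[ D(P_{err}, P_v) + D(T_{err}, T_v) + D(S_{err}, S_v) \big] \qquad (4)$$

where $D$ is the Levenshtein distance and the minimum runs over the segmentations of $W_{err}$ into an erroneous prefix $P_{err}$, an erroneous stem $T_{err}$ and an erroneous suffix $S_{err}$. The corrections of the erroneous word are given by:

$$\arg\min_{(P_v, T_v, S_v)} M\big(W_{err}, (P_v, T_v, S_v)\big)$$

Over all prefixes, stems and suffixes belonging respectively to the sets of prefixes $\mathcal{P}$, stems $\mathcal{T}$ and suffixes $\mathcal{S}$, we calculate the minimum only on the prefixes, stems and suffixes that are compatible with each other, by introducing the three correspondence tables between prefix-stem, stem-suffix and prefix-suffix already used by Buckwalter.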
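A minimal sketch of this measure follows. It rests on an assumption: the measure is read here as the sum of the Levenshtein distances of the three parts, minimized over every way of cutting the misspelled word into prefix, stem and suffix. The function names, the tiny lexicons and the compatibility tables are hypothetical, and strings are in Buckwalter transliteration (`qgxl` for "قغخل").

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def morpho_distance(word: str, prefix: str, stem: str, suffix: str) -> int:
    """Assumed reading of the measure: best sum over all cuts of `word`."""
    n = len(word)
    return min(levenshtein(word[:i], prefix)
               + levenshtein(word[i:j], stem)
               + levenshtein(word[j:], suffix)
               for i in range(n + 1) for j in range(i, n + 1))


def suggest(word, prefixes, stems, suffixes,
            prefix_stem, stem_suffix, prefix_suffix):
    """Score only mutually compatible triples; return sorted (distance, word)."""
    scored = []
    for p, t, s in product(prefixes, stems, suffixes):
        if ((p, t) in prefix_stem and (t, s) in stem_suffix
                and (p, s) in prefix_suffix):
            scored.append((morpho_distance(word, p, t, s), p + t + s))
    return sorted(scored)


# Toy data (hypothetical): every prefix-stem and prefix-suffix pair is allowed.
PREFIXES = ["", "f"]
STEMS = ["dxl", "gwl", "ktb"]
SUFFIXES = ["", "wA"]
PREFIX_STEM = set(product(PREFIXES, STEMS))
STEM_SUFFIX = {(t, "") for t in STEMS} | {("dxl", "wA"), ("ktb", "wA")}
PREFIX_SUFFIX = set(product(PREFIXES, SUFFIXES))

RESULT = suggest("qgxl", PREFIXES, STEMS, SUFFIXES,
                 PREFIX_STEM, STEM_SUFFIX, PREFIX_SUFFIX)
print(RESULT[:4])
```

With these toy tables the candidates `fdxl` ("فدخل") and `fgwl` ("فغول") both score 2. The search visits stems, prefixes and suffixes rather than full word forms, which is what keeps the lexicon small.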

Example:
Let "قغخل" be a misspelled word to correct. By applying formula (4), we get the following first solutions:

• For the prefix "ف", the stem "دخل" and the empty suffix, formula (4) gives 2 (which is the minimal distance over all stems, prefixes and suffixes). The system therefore suggests "فدخل" as a correction of the misspelled word "قغخل", with distance 2.
• Our method also returns the solution formed by the prefix "ف", the stem "غول" and the empty suffix, for which formula (4) gives 2. It represents the word "فغول", with distance 2.
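The arithmetic behind the first suggestion can be written out as follows. This is a sketch under two assumptions made here for illustration: that formula (4) adds the Levenshtein distances $D$ of the three parts, and that the misspelled word is cut into the prefix "ق", the stem "غخل" and the empty suffix $\varepsilon$.

$$D(\text{ق}, \text{ف}) + D(\text{غخل}, \text{دخل}) + D(\varepsilon, \varepsilon) = 1 + 1 + 0 = 2$$

Each of the first two terms is a single substitution (ف for ق, then د for غ), and the suffix term is zero. The same cut gives 2 for "فغول", where the stem term is the single substitution of و for خ.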

V. TESTS AND RESULTS
To evaluate our approach, we have developed a spell-checking program that compares our method with the classical Levenshtein approach.
The list of words used in this study as the reference lexicon for the Levenshtein approach contains more than 170,000 words extracted from the MySpell program of OpenOffice Writer.
For our approach, we relied on a list of prefixes, a list of suffixes and a list of stems built on the Buckwalter analyzer.

The corrections suggested by the Levenshtein distance are the words at minimal distance from the misspelled word. For our approach, we used formula (4), explained in the previous section.

For our tests, we used a corpus of 2784 misspelled words. There were three types of errors: addition, deletion and permutation. The table below shows the rate of correction by editing operation:

To compare our new approach with Levenshtein's, we used the following three indicators:

• The average correction time.
• The rate of rectified words.
• The size of each system's lexicon.
We have taken 170,000 words as the lexicon size for the Levenshtein method. For our system, the theoretical the empty string.

TABLE I. COMPARATIVE TABLE BETWEEN THE TWO METHODS