An Algerian dialect: Study and Resources

Abstract: Arabic is the official language of all Arab countries; it is used for official speech, newspapers, public administration and school. In parallel, for everyday communication, informal talk, songs and movies, Arab people use their dialects, which are derived from Standard Arabic and differ from one Arab country to another. This linguistic phenomenon is called diglossia, a situation in which two distinct varieties of a language are spoken within the same speech community. It is observed throughout all Arab countries: Standard Arabic is widely written but not used in everyday conversation, whereas dialect is widely spoken in everyday life but almost never written. Thus, in the NLP area, a lot of work has been dedicated to written Arabic. In contrast, Arabic dialects were until recently not studied enough. Interest in them is recent: the first work on these dialects began in the last decade for Middle-Eastern ones, and the dialects of the Maghreb are just beginning to be studied. Compared to written Arabic, dialects are under-resourced languages that suffer from a lack of NLP resources despite their wide use. In this paper we deal with the Arabic Algerian dialect, a non-resourced language for which no known resource is available to date. We present a first linguistic study introducing its most important features, and we describe the resources that we created from scratch for this dialect.


I. INTRODUCTION
Under-resourced languages are languages that lack resources dedicated to natural language processing. These languages suffer from the unavailability of basic tools such as corpora, monolingual or multilingual dictionaries, and morphological and syntactic analyzers. This lack of resources makes working with these languages a great challenge, especially when we deal with unwritten languages like Arabic dialects. Compared to other under-resourced languages, Arabic dialects present the following additional difficulties:
• Since they are spoken languages, they are not written and there are no established rules for writing them. The same word may have many orthographic forms, all of them acceptable, since there are no writing rules to serve as a reference.
• They are flexible at the grammatical and lexical levels, despite belonging to the Arabic language.
• Besides being different from Arabic, these dialects are also different from each other. For instance, the dialects of the Maghreb differ from those of the Middle East. They may also differ inside the same country.
• These dialects are also widely influenced by other languages such as French, English, Spanish, Turkish and Berber.
In Algeria, as in all Arab countries, these dialects are used in everyday conversations. With the advent of the internet, they are also increasingly used in social networks and forums. They have emerged on the web as a real language of communication because communicating in dialect is easy, especially for people with a low level of education. Unfortunately, basic NLP tools for these dialects are not available. This work is a first part of the TORJMAN project (footnote 1), a speech-to-speech translator between Algerian Arabic dialects and MSA. Unlike Middle-Eastern Arabic dialects, Algerian Arabic dialects are non-resourced languages: they lack all kinds of NLP resources. Consequently, TORJMAN begins from scratch.
In this paper, we describe and extend the resource creation tasks for the Arabic dialect of Algeria that appeared in [1] and [2]. We focus on Algiers dialect, the spoken Arabic of Algiers (the capital city of Algeria) and its periphery. This choice is justified by the fact that this is the dialect we know best and practice, since we are native speakers of it. For convenience, we denote Algiers dialect by ALG. This paper is organized as follows: before dealing with Algerian dialect, we give in Section II a brief overview of the Arabic language, and in Section III we present different aspects of ALG. The following sections are dedicated to the resources that we created. We detail how we built the first corpus of Algiers dialect (Section IV). Then we present the ALG grapheme-to-phoneme converter (Section V), which allowed us to obtain a phonetized corpus of Algiers dialect. In Section VI we describe how we created a morphological analyzer for ALG by adapting BAMA [3], the well-known analyzer for MSA. Finally, we conclude by summarizing the main ideas of this work and outlining our future directions.

II. ARABIC LANGUAGE
Arabic is a Semitic language used by around 420 million people. It is the official language of about 22 countries. Arabic is a generic term covering three separate groups:
• Classical Arabic: principally defined as the Arabic used in the Qur'an and in the earliest literature from the Arabian peninsula, it also forms the core of much literature up to the present day.
• Modern Standard Arabic: generally referred to as MSA (Alfus'ha in Arabic), it is the variety of Arabic that was retained as the official language in all Arab countries, and as a common language. It is essentially a modern variant of Classical Arabic. Standard Arabic is not acquired as a mother tongue; rather, it is learned as a second language at school and through exposure to formal broadcast programs (such as the daily news), religious practice and newspapers [4].
• Arabic dialects: also called colloquial Arabic or vernaculars, they are spoken varieties of the Arabic language. In contrast to Classical Arabic and MSA, they are not written. These dialects have a mixed form with many variations. They are influenced both by the ancient local tongues and by European languages such as French, Spanish, English and Italian.

Differences between these variants of spoken Arabic throughout the Arab world can be large enough to make them mutually incomprehensible. Hence, given the large differences between dialects, we can consider them as disparate languages depending on the geographical area in which they are spoken. Most of the literature describes Arabic dialects from the viewpoint of an east-west dichotomy [5].

In the next section, we focus on a Maghreb dialect from Algeria, more specifically the dialect spoken in Algiers, the capital city of Algeria, and we highlight its main features in contrast to MSA.

Footnote 1: TORJMAN is a national research project totally financed by the Algerian research ministry; this appellation means translator or interpreter in English.
www.ijacsa.thesai.org

III. SPECIFICITIES OF ALGIERS DIALECT
Algiers dialect (ALG) is the dialectal Arabic spoken in Algiers and its periphery. This dialect is different from the dialects spoken in other parts of Algeria. It is not used in schools, television or newspapers, which usually use Standard Arabic or French, but is more likely heard in songs, in Algerian homes and on the street. Algerian Arabic is spoken daily by the vast majority of Algerians [7]. Like the other Arabic dialects, ALG simplifies the morphological and syntactic rules of written Arabic. In [8], the author shows how much spoken Arabic differs from written Arabic at various language levels: phonological differences between Classical Arabic and spoken Arabic are moderate (compared to other language-dialect pairs), whereas grammatical differences are the most striking ones. At the lexical level, differences are marked by variations in form and by differences of use and meaning.

Indeed, at the phonological level, ALG (naturally) shares most of its features with Arabic. In addition to the 28 consonant phonemes of Arabic (given in Table I), the ALG consonantal system includes non-Arabic phonemes such as /g/, as in the word meaning "all", and the phonemes /p/ and /v/, used mainly in words borrowed from French, such as the words adapted from the French "pompe" (a pump) and "valise" (a bag). It should also be noted that two Arabic phonemes are very rarely used: most of the time they are pronounced /d'/ and /d/, respectively. The same is observed for /T/, which is pronounced /t/. Note that the last two substitutions are also observed in the Jordanian dialect [9]. The phonological features of ALG are detailed further in Section V.

A. Vocabulary
Algerian dialect has a vocabulary derived from Arabic, but the original words have been altered phonologically, with significant Berber substrates and many new words and loanwords borrowed from French, Turkish and Spanish. Even though most of this vocabulary comes from MSA, there is significant variation in vocalization in most cases, and omission or modification of some letters (mainly the Hamza) in other cases. The vocabulary of Algiers dialect includes verbs, nouns, pronouns and particles. In the following we give a brief description of each category.

Nouns include a significant portion of French words. Most of them result from a substantial phonological alteration of the original words, such as the words adapted from "moteur" (motor), "la tension" (blood pressure) and "policier" (policeman). Nouns also include numbers, which represent units, tens, hundreds, etc. From 1 to 10 the numbers are close to MSA (with different vocalization), except for the numbers 0 and 2: the first is pronounced as in French, /zero/, and the second differs from its MSA form. From 11 to 19 the pronunciation in ALG differs from MSA: some letters and diacritics change, but the number can easily be recognized by an Arabic speaker. Numbers greater than 20 are also close to MSA numbers; only the diacritic marks differ.

• Pronouns
The list of pronouns is a closed list; it contains demonstrative and personal pronouns. There is only one relative pronoun in Algiers dialect (meaning "that"); it is used for feminine, masculine, singular and plural. We give in Tables IV and V all pronouns used in ALG. It is important to note that the

B. Inflection
Algiers dialect is an inflected language, like Arabic. Words in this language are modified to express different grammatical categories such as tense, voice, person, number and gender. Depending on the word category, inflection is called conjugation when it applies to a verb, and declension when it applies to nouns, adjectives or pronouns. We present these linguistic aspects for Algiers dialect in the following.
1) Verbs conjugation: Verb conjugation in ALG is affected (as in MSA) by person (first, second or third), number (singular or plural), gender (feminine or masculine), tense (past, present or future) and voice (active or passive). Like MSA, Algiers dialect uses the following forms:
• The past: its forms are obtained by adding suffixes related to number and gender to the verb root and by changing its diacritic marks.
• The present and future: the present form of an ALG verb is obtained by affixation of prefixes and suffixes (Table VIII). The verb can be preceded by a particle (in its inflected form, see Section III-B2) to express the present continuous tense. The future is obtained in the same way as the present (same prefixes and suffixes), but it must be marked by a preceding particle or expression that indicates the future, such as "later", "tomorrow" or "next month".

2) Declension: Singular word declension in written Arabic corresponds to three cases: the nominative, the genitive and the accusative, each marked by a short vowel attached to the end of the word. These three cases indicate the grammatical functions of words. It should also be noted that the doubled case endings (tanween) correspond to the same three cases and express nominal indefiniteness. Like all Arabic dialects, ALG has dropped these case endings. The disappearance of final short vowels and the dropping of /h/ under certain conditions in many dialects of Arabic are very significant changes [10]. The same author states in [8]: "Classical Arabic has three cases in the noun marked by endings; colloquial dialects have none." Thus, a major feature of ALG is that, unlike written Arabic, it does not apply the three-case declension to singular nouns and adjectives. For the plural of nouns, ALG has the same plural classes as MSA:
• Masculine regular plural: formed without modifying the word structure, by adding a single suffix to the singular word. This differs from written Arabic, where the masculine regular plural of a noun is obtained by adding one suffix for the nominative and another for both the accusative and genitive, depending on the grammatical function of the word. For example, the masculine regular plural of the MSA word meaning "teacher" has one form in the nominative case and another in the accusative or genitive. In contrast, the ALG word meaning "going", for instance, always takes the same regular plural suffix whatever its grammatical function.
• Feminine regular plural: obtained by adding a suffix to the word without changing its structure, as in MSA, with a single difference in case endings. In MSA, the feminine regular plural takes different case marks for the nominative and for the accusative and genitive, whereas ALG has only one mark, the Sukun (absence of vowel). For example, the plural of the MSA word meaning "beautiful" takes one of these case-marked forms, whereas the plural of the corresponding ALG word always has the same form.
• Broken plural: an irregular plural that modifies the structure of the singular word. As in MSA, it follows different rules depending on the word pattern. Like singular words, the MSA broken plural takes the three case endings, whereas in ALG it does not.
In Table IX we give an example of each ALG plural category.
Another major difference between Algiers dialect and written Arabic is the absence of the dual (a kind of plural that designates two items). In MSA, for example, the dual of the word meaning "a boy" is formed by adding one of two suffixes to the word, depending on the case. In ALG, the dual is generally obtained by the word meaning "two" followed by the plural (feminine or masculine) of the noun or adjective (footnote 10), as in the ALG expression for "two boys".

C. Syntactic level

1) Declarative form: Word order in an ALG declarative sentence is relatively flexible. In common usage, ALG sentences can begin with the verb, the subject or even the object. This order is based on the importance given by the speaker to each of these entities; usually the sentence begins with the item that the speaker wishes to highlight. In Table X we give an example of different word orders for the same sentence. It should be noted that the first two forms (SVO, VSO) are the most used in everyday conversations.

TABLE X: Example of word order in an ALG declarative sentence (orders SVO, VSO, OVS and OSV; English: "The boy went to school").
2) Interrogative form: In Algiers dialect, any sentence can be turned into a question in one of the following ways:
1) It may be uttered with an interrogative intonation, as in "Will you revise?".
2) An interrogative pronoun or particle may be introduced, as in "Where will you revise?".
We list in Table XI the most common interrogative particles and pronouns used in the dialect of Algiers. We mention in particular the particle used in questions that accept a yes or no answer.

3) Negative form: Two particles are generally used to express negation. The first is used in both Algiers dialect and MSA, although the form of negation differs between the two languages, whereas the second is specific to ALG. Using these particles, the negative form is obtained in different ways in ALG (we give in Table XII some examples labeled with each enumerated case):
• Negation with the first particle:
1) A prefix and a suffix are added to the conjugated verb.
2) A particular case involves the particle equivalent to the verb "to be" in the present tense. The negation is obtained by adding the affixes to this particle, possibly combined with a personal pronoun.
• Negation with the second particle:
3) The particle can be added at the beginning of a verbal declarative sentence without modifying the sentence.
4) The particle can be added at the beginning of a verbal declarative sentence by introducing the relative pronoun.
5) In a nominal sentence, the particle can be added at the beginning of the sentence by reversing the order of its constituents.
6) The particle can also be added in the middle of a nominal sentence with no modification.

Footnote 10: An exception is made for words such as those meaning "two eyes" and "two ears".

As mentioned above, this work began from scratch. No resources of any kind were available for Algiers dialect. The foundation of the work was a corpus that we created by transcribing conversations recorded from everyday life and from some TV shows and movies. This transcription step required conventional writing rules to make the transcribed text homogeneous. Considering that ALG is an Arabic dialect, we adopted the following writing policy: when writing a word in Algiers dialect, we check whether there is an Arabic word close to it; if so, we adopt the Arabic spelling for the dialect word; otherwise the word is written as it is pronounced. The transcription step produced a corpus of 6400 sentences that we afterwards translated into MSA. Thus, we obtained a parallel corpus of 6400 aligned sentences. In Table XIII, we give information about the size of this corpus. It should be noted that all tasks described above were done by hand. This was time-consuming, but the result is a clean parallel corpus. Furthermore, the ALG side of this corpus was vocalized with our diacritizer described in [11] and used to develop the first NLP resources dedicated to an Algerian dialect (to our knowledge). The next sections of this paper describe these resources.

V. GRAPHEME-TO-PHONEME CONVERSION
As pointed out above, the general purpose of the TORJMAN project is a speech translation system between Modern Standard Arabic and Algiers dialect. Such a system must include a Text-to-Speech module, which requires a Grapheme-to-Phoneme converter. We therefore dedicated our efforts to developing this converter using the ALG vocalized corpus described earlier. Grapheme-to-Phoneme (G2P) conversion, or phonetic transcription, is the process that converts the written form of a word to its pronunciation. G2P conversion is not a simple task, especially for non-transparent languages like English, where a phoneme may be represented by a letter or a group of letters and vice versa. Unlike English, Arabic is considered a transparent language: the relationship between grapheme and phoneme is one to one, but this property is conditioned by the presence of diacritics. Lack of vocalization generates ambiguity at all levels (lexical, syntactic and semantic) and consequently at the phonetic level. For example, the phonetic transcription of the word /ktb/ could be /kataba/, /kutiba/, /kutubun/, /kutubi/, /katbin/, etc.
Algiers dialect obeys the same rule: without diacritics, grapheme-phoneme conversion is a difficult issue to resolve. Most work on G2P conversion follows one of two approaches. The first is the dictionary-based approach, where a phonetized dictionary contains, for each word of the language, its correct pronunciation; G2P conversion is then reduced to a lookup in this dictionary. The second approach is rule-based [12], [13], [14]: the conversion is done by applying phonetic rules, which are either deduced from phonological and phonetic studies of the language considered or learned from a phonetized corpus using a statistical approach based on significant quantities of data [15], [16]. For Algiers dialect, which is a non-resourced language, a dictionary-based G2P converter is not feasible, since a phonetized dictionary with a large amount of data is not available. Given the lack of resources, the first intuitive approach is a rule-based one, but the specificity of Algiers dialect (detailed in the next section) led us to a statistical approach in order to consider all features of this language.
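The ambiguity of unvocalized text noted for the /ktb/ example can be illustrated with a minimal sketch. It assumes a Buckwalter-style ASCII transliteration in which the characters a, i, u, o, F, N, K and ~ stand for diacritics; the function names and the word list are hypothetical and are not taken from the corpus described in this paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumption: Buckwalter-style transliteration, where these characters
# encode short vowels, sukun, tanween and shadda (not consonants).
DIACRITICS = set("aiuoFNK~")


def strip_diacritics(word: str) -> str:
    """Return the unvocalized skeleton of a transliterated word."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def group_by_skeleton(vocalized: Iterable[str]) -> Dict[str, List[str]]:
    """Group vocalized forms that collapse to the same unvocalized skeleton."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in vocalized:
        groups[strip_diacritics(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    forms = ["kataba", "kutiba", "kutubN", "kutubi"]
    # All four vocalized forms share the skeleton "ktb", so a converter
    # that only sees "ktb" cannot choose a pronunciation by itself.
    print(group_by_skeleton(forms))
```

A skeleton with more than one vocalized form is exactly the case in which G2P conversion needs diacritics as input.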

A. Issues of G2P conversion for Algiers dialect
G2P conversion for Algiers dialect obeys the same rules as for MSA. ALG can be considered a transparent language, since the alignment between grapheme and phoneme is one to one when the input text is vocalized. Unfortunately, it is not that simple, since ALG contains many words borrowed from foreign languages, most of which have been altered phonologically and adapted to it. In particular, the vocabulary of this dialect contains many French words used in everyday conversation. French borrowed words can be divided into two categories: the first includes French words that have been phonologically altered, such as the word adapted from "famille" (family), and the second includes words that are uttered as in French, such as the word "sûr" (sure), whose utterance is /syö/ (/y/ is not an Arabic phoneme but a French one). This last category constitutes a serious problem for G2P conversion, since these words do not obey Arabic pronunciation rules.

On the other hand, the last two words are phonetized as French words, since they are pronounced as in French by Algiers dialect speakers. In order to take this word category into account, French phonemes such as /E/, /Õ/ and /@/ must be included among the phonemes of Algiers dialect.

B. Rule based approach
As stated previously, the rule-based approach to G2P conversion applied to ALG requires a diacritized text, which is why we used our ALG vocalized corpus. The diacritized text is converted into its phonetic form by applying the following rules. It should be noted that most of these rules are those also adopted for Arabic [12], [13], and that they are applicable only to Arabic words and to phonologically altered foreign words in our corpus. Let BS be a mark of the beginning of a sentence, ES a mark of the end of a sentence, BL a blank character, C a consonant, V a vowel, LC a lunar consonant, SC a solar consonant, and LV a long vowel. A conversion rule is read as follows: a grapheme GR with left context LFT and right context RGT is converted to the phoneme PH. Left and right contexts can be a grapheme, a word separator, the beginning or the end of a sentence, or empty. The representation of the rules that we used for Algiers dialect G2P in this form is given in the Appendix (Table XXIII).

When the Shadda appears on a consonant, this consonant is doubled (geminated). Example: the word meaning "sugar" is converted to /sukkur/.

It should be noted that most of these rules could be applied to other Algerian dialects and to Arabic dialects close to them, such as Tunisian and Moroccan.
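The rule format described above (a grapheme with a left and a right context rewritten to a phoneme) can be sketched as follows. This is an illustrative toy, not the rule set of Table XXIII: the Buckwalter-style transliteration, the consonant and vowel inventories, the three rules and the names `Rule`, `in_class` and `convert` are all assumptions made for the example, and the boundary marks BS and ES are applied here to the edges of the input string.

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy inventories in a Buckwalter-style transliteration (assumption).
CONSONANTS = set("btjdrzsklmnfqhwy")
VOWELS = set("aiu")
COPY_LEFT = "<LEFT>"  # special output: repeat the grapheme on the left


@dataclass(frozen=True)
class Rule:
    left: str      # context class: "ANY", "BS", "C", "V" or a literal grapheme
    grapheme: str  # the grapheme GR being converted
    right: str     # context class: "ANY", "ES", "C", "V" or a literal grapheme
    phoneme: str   # the output PH (may be empty, or COPY_LEFT)


def in_class(ch: Optional[str], cls: str) -> bool:
    """Check whether a neighbouring grapheme (None at a boundary) fits a class."""
    if cls == "ANY":
        return True
    if cls in ("BS", "ES"):
        return ch is None
    if ch is None:
        return False
    if cls == "C":
        return ch in CONSONANTS
    if cls == "V":
        return ch in VOWELS
    return ch == cls  # literal grapheme


RULES: List[Rule] = [
    Rule("C", "~", "ANY", COPY_LEFT),  # Shadda: geminate the consonant
    Rule("ANY", "o", "ANY", ""),       # Sukun: no vowel is pronounced
    Rule("a", "A", "ANY", ":"),        # fatha + alif: long vowel /a:/
]


def convert(text: str, rules: List[Rule] = RULES) -> str:
    """Apply the first matching rule at each position; copy other graphemes."""
    out = []
    for i, gr in enumerate(text):
        left = text[i - 1] if i > 0 else None
        right = text[i + 1] if i + 1 < len(text) else None
        for rule in rules:
            if (rule.grapheme == gr and in_class(left, rule.left)
                    and in_class(right, rule.right)):
                out.append(left if rule.phoneme == COPY_LEFT else rule.phoneme)
                break
        else:
            out.append(gr)
    return "".join(out)


if __name__ == "__main__":
    print(convert("suk~ur"))  # the Shadda rule doubles the /k/
```

Keeping rules as data rather than code makes it easy to extend the list, which is relevant if the same rules are to be reused for neighbouring dialects.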
Experiment: As indicated above, for the experiment we used our ALG vocalized corpus, which includes three categories of words:
1) Arabic words.
2) Phonologically altered French words whose pronunciation is realized with Arabic phonemes.
3) French words whose pronunciation is realized with French phonemes.
We applied the phonetization rules presented above to the ALG corpus.
In addition to Arabic words, French words of the second category are correctly phonetized, because their phonetic realization is close to Algiers dialect. For example, the word for "kitchen" (original French word "cuisine"), a phonologically altered French loanword, is correctly converted to /ku:zina/, while a word of the third category such as the word for "connection" (original French word "connexion") is incorrectly converted to /ku:niksju:n/, since it is realized /kOnnEksjÕ/ with French phonemes. Considering these words, system accuracy is 92%. The issue with these words is that we cannot introduce rules for French words written in Arabic script, since the relation between Arabic graphemes and French phonemes is not one to one. For example, the same graphemes in a French word written in Arabic script could correspond to the French phonemes /y/, /u/, /O/ or /O/ (see some examples in Table XV).

C. Statistical Approach
The rule-based approach adopted above does not take into account French words used in ALG that are pronounced as in French. This issue led us to choose a statistical approach in order to handle this feature. We use a statistical machine translation system where the source language is a text (a sequence of graphemes) and the target language is its phonetic representation (a sequence of phonemes). This system uses the Moses package [17], Giza++ [18] for alignment and SRILM [19] for language model training. The main motivation for using a statistical approach is that we can include French phonemes in the training data. The first component needed to build this system is a parallel corpus including a text and its phonetic representation. Since this resource was not available, we created it using the rule-based converter described above. We proceeded as follows: we used the rule-based system to convert Arabic words and phonologically altered French words (categories 1 and 2) to Arabic phonemes. For French words realized with French phonemes (category 3), we began by identifying them, transliterated them to their original form in Latin script, and then converted them to French phonemes (using a free French G2P converter); all these operations were done by hand. For example, the ALG word for "connection" is transliterated to connexion and then converted to /kOnnEksjÕ/.
This system operates at the grapheme and phoneme level: we split the parallel corpus into individual graphemes and phonemes, including a special character as word separator in order to restore the words after the conversion process (see Table XVI).

Experiment: To evaluate the statistical approach, we split the parallel corpus into three datasets: training data (80%), tuning data (10%) and testing data (10%). First we tested the statistical approach on a corpus containing only Arabic words and phonologically altered French words (categories 1 and 2). We obtained an accuracy of 93%. Then we ran a test on a corpus including the three word categories; system accuracy decreased to 85%. This result is due to the increase in the number of hypotheses for each grapheme caused by introducing French phonemes into the training data. For example, some graphemes in Arabic words (category 1) are phonetized as the French phonemes /y/ or /Õ/ instead of the Arabic long vowel /u:/, or as the phoneme /Õ/ instead of /u:n/. Conversely, some words in category 3 are phonetized with Arabic phonemes, substituting for example /u:/ for the phonemes /y/, /u/, /O/ or /O/, and /a:/ for /E/.
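The data preparation described above (splitting both sides into individual symbols with a word separator, then an 80/10/10 split) might look like the following sketch. The separator symbol `#`, the shuffling before the split, and the function names are assumptions for illustration; the paper does not specify them, and the sketch does not call Moses, Giza++ or SRILM.

```python
import random
from typing import List, Sequence, Tuple

WORD_SEP = "#"  # hypothetical separator symbol; the actual one is not stated


def tokenize(words: Sequence[Sequence[str]]) -> str:
    """Turn a sentence into space-separated symbols with word separators.

    Each word is a sequence of symbols: a plain string for graphemes, or a
    list of phoneme symbols such as ["k", "u:"] for the phonetic side.
    """
    tokens: List[str] = []
    for i, word in enumerate(words):
        if i > 0:
            tokens.append(WORD_SEP)
        tokens.extend(word)
    return " ".join(tokens)


def detokenize(line: str) -> List[List[str]]:
    """Restore words (as lists of symbols) from a symbol-level line."""
    words: List[List[str]] = [[]]
    for token in line.split():
        if token == WORD_SEP:
            words.append([])
        else:
            words[-1].append(token)
    return words


def split_corpus(pairs: Sequence[Tuple[str, str]], seed: int = 0,
                 train: float = 0.8, tune: float = 0.1):
    """Shuffle sentence pairs and split them into train, tune and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_tune = int(len(shuffled) * tune)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_tune],
            shuffled[n_train + n_tune:])


if __name__ == "__main__":
    source = tokenize(["ktb", "dAr"])
    target = tokenize([["k", "t", "b"], ["d", "a:", "r"]])
    print(source)
    print(target)
```

Treating each grapheme or phoneme as a "word" is what lets a phrase-based translation toolkit learn symbol-level correspondences, while the separator allows the output to be regrouped into words.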

D. Discussion
At first glance, judging by accuracy rates, one could conclude that the rule-based approach is more efficient than the statistical approach (92% vs. 85%). However, the rule-based approach does not take into account French words of category 3; it achieves good results only for Arabic words and phonologically altered French words (categories 1 and 2). The results of the statistical approach must be analysed in light of the small amount of training data. Alternatively, a hybrid approach could be adopted: instead of using one corpus including all categories of words to train the statistical G2P converter, we can use two corpora. The first one, including words of categories 1 and 2, could be processed by the rule-based approach. The second is a parallel corpus including words of category 3 with their French phonetization, used for training the statistical G2P converter. Unfortunately, we do not have sufficient data to test such a converter, since our corpus includes only about 1k words of category 3. In terms of resources, this work allowed us to build a phonetized dictionary for Algiers dialect; to our knowledge no such resource is available at this time.
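One possible reading of the hybrid approach at conversion time is a simple router: words known to be realized with French phonemes go to the statistical converter, and all others go to the rule-based one. The sketch below assumes that category 3 words are available as a hand-built list, as in the manual identification step described earlier; `make_hybrid` and the converters passed to it are hypothetical placeholders.

```python
from typing import Callable, Set

G2P = Callable[[str], str]


def make_hybrid(french_realized: Set[str], rule_based: G2P,
                statistical: G2P) -> G2P:
    """Build a converter that routes each word to one of two G2P systems.

    french_realized: hand-built list of category 3 words (assumption).
    rule_based: converter for categories 1 and 2.
    statistical: converter trained only on category 3 words.
    """
    def convert(word: str) -> str:
        if word in french_realized:
            return statistical(word)
        return rule_based(word)

    return convert
```

The weak point of this design is the membership test: a word of category 3 that is missing from the list is silently sent to the rule-based converter and phonetized with Arabic phonemes.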

VI. MORPHOLOGICAL ANALYZER FOR ALGERIAN DIALECT
A. Related works

Compared to MSA, there are few morphological analysers (MA) dedicated to Arabic dialects. Work in this area can be divided into two categories. The first includes MAs built from scratch, such as those in [20] and [21]; the second includes work that attempts to adapt existing MSA morphological analysers to Arabic dialects. This trend has been adopted for several dialects since it is not time-consuming. In [22], the authors used BAMA (Buckwalter Arabic Morphological Analyser) [3], extending its affixes table with Levantine and Egyptian dialectal affixes. The same approach is adopted in [23], where a list of dialectal affixes (belonging to four Arabic dialects) was added to the Al-Khalil [24] affix list. The authors of [25] converted the ECAL (Egyptian Colloquial Arabic Lexicon) to the SAMA (Standard Modern Arabic Analyser) representation [26]. For the Tunisian dialect, the authors of [27] adapted the Al-Khalil MA: they created a lexicon by converting MSA patterns to Tunisian dialect patterns and then extracting specific roots and patterns from a training corpus that they created.

B. Adopted Approach
To build an MA for Algiers dialect, we decided to adapt BAMA, since this approach is not time-consuming and benefits from the fact that BAMA is widely used. BAMA is based on a dictionary of three tables containing Arabic stems, suffixes and prefixes, and on three compatibility tables defining the relations between stems, prefixes and suffixes. BAMA is adapted by populating these tables with dialect data.
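How three lexicon tables and three compatibility tables can drive an analysis is sketched below: a word is segmented into every possible prefix, stem and suffix, and a segmentation is accepted only if its parts are found in the tables and their categories are pairwise compatible. The entries, the category names and the transliterated test words are toy assumptions; they are not taken from BAMA or from the dialect dictionary built in this work.

```python
from typing import Dict, List, Set, Tuple

# Toy lexicon tables: surface form -> list of morphological categories.
PREFIXES: Dict[str, List[str]] = {"": ["Pref-0"], "y": ["IVPref-y"]}
STEMS: Dict[str, List[str]] = {"ktb": ["IV"], "ktAb": ["N"]}
SUFFIXES: Dict[str, List[str]] = {"": ["Suff-0"], "w": ["IVSuff-w"]}

# Toy compatibility tables: allowed category pairs.
PREFIX_STEM: Set[Tuple[str, str]] = {("Pref-0", "N"), ("IVPref-y", "IV")}
PREFIX_SUFFIX: Set[Tuple[str, str]] = {
    ("Pref-0", "Suff-0"), ("IVPref-y", "Suff-0"), ("IVPref-y", "IVSuff-w")}
STEM_SUFFIX: Set[Tuple[str, str]] = {
    ("N", "Suff-0"), ("IV", "Suff-0"), ("IV", "IVSuff-w")}


def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return every (prefix, stem, suffix) split licensed by the tables."""
    analyses = []
    n = len(word)
    for p in range(n):                 # end of prefix (may be empty)
        for s in range(p + 1, n + 1):  # end of stem (stem is non-empty)
            prefix, stem, suffix = word[:p], word[p:s], word[s:]
            for pc in PREFIXES.get(prefix, []):
                for sc in STEMS.get(stem, []):
                    for xc in SUFFIXES.get(suffix, []):
                        if ((pc, sc) in PREFIX_STEM
                                and (pc, xc) in PREFIX_SUFFIX
                                and (sc, xc) in STEM_SUFFIX):
                            analyses.append((prefix, stem, suffix))
    return analyses


if __name__ == "__main__":
    print(analyze("yktbw"))  # imperfect prefix + verb stem + plural suffix
    print(analyze("yktAb"))  # rejected: verbal prefix on a noun stem
```

Because the analysis logic only consults the tables, adapting the analyser to a dialect reduces to editing table entries, which is the property exploited in the following sections.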

C. Building the dialect dictionary
We built the dialect dictionary by adopting the following principle: in order to exploit the BAMA dictionary, we kept from it all entries that also belong to ALG, with some modifications (for example, some MSA prefixes are used in ALG, so we kept them as ALG prefixes). Besides that, we deleted all entries that are not suitable for Algiers dialect. Moreover, we created entries that are purely dialectal and did not exist in the MSA dictionary.

1) Affixes tables: Affixes common to MSA and ALG are kept (in the prefixes and suffixes tables), whereas all other MSA affixes that do not belong to the dialect were deleted. In addition, some dialect affixes that do not exist in MSA were added to the affixes tables. Note that when an affix is deleted, all complex affixes in which it occurs are also deleted.

a) Prefixes table: We kept some prefixes unchanged, such as the two prefixes that precede imperfect verbs (for the third person singular masculine and feminine, respectively). We eliminated purely MSA prefixes and all complex prefixes in which they appear, such as the prefix expressing the future when it precedes imperfect verbs and the conjunction prefix; some examples are given in Table XVII.

b) Suffixes table: We also eliminated all MSA suffixes not used in Algiers dialect, mainly:
• suffixes related to the dual, both feminine and masculine,
• feminine plural suffixes,
• all word case ending suffixes.
All complex suffixes in which they appear were also deleted. Likewise, we added dialectal suffixes such as the negation suffix and all complex suffixes that must be included with it. We also integrated a set of suffixes to account for the various spellings of dialect words, which are not normalized. An example is a suffix that can express the plural (feminine and masculine) at the end of a verb, or a possessive pronoun at the end of a noun, exactly like the corresponding MSA suffix. We give in Table XVIII a set of examples of each case.

2) Stems table: The dialect stems table was populated from the lexicon of the Algiers dialect corpus and from the MSA stems included in BAMA. We used a part (85%, 9170 distinct words) of our ALG corpus for creating dialect stems; the remaining 15% (1618 distinct words) is used for testing.

Stems from ALG corpus lexicon
First, we extracted a list of nouns easily identifiable by an affix and by the definite article (used only with nouns). We deleted these two affixes from all extracted words, and from the resulting word list we created stem entries according to BAMA. Next, the rest of the corpus was analysed and classified into three sets: function words, verbs and nouns (those not bearing the affixes above), and converted to stems according to BAMA stem categories. We added some stem categories to account for all dialectal features. For example, in MSA one perfect verb stem category covers the three persons, the two genders, and the singular, dual and plural; only the relevant suffixes are added to it to obtain its inflected forms. In ALG, we split this stem category into two distinct stems to cover all inflected forms of perfect verbs; in Table XIX we give an example for the stem meaning "to hear".

We constructed imperfect verb stems and command verb stems from the ALG perfect verb stems that we created as described above.

2) Nouns
We kept all proper nouns from the MSA stems table because it contains a large number of entries related to countries, currencies, personal names, etc. We analysed all other types of words and kept those that exist in ALG, modifying diacritics and adding or deleting one or more letters where needed.

3) Function words
We deleted all function words that do not exist in ALG, such as relative pronouns and personal pronouns related to the dual and the feminine plural, then we translated the remaining ones to ALG.
Note that we introduced dialect stems with the non-Arabic letters G, V and P in the stems table, and we modified the BAMA code to accept words containing these letters. Also, since every stem entry in BAMA contains an English gloss, when creating a dialect entry we added the Arabic word to the English gloss, so that each dialect entry is associated with both an English and an Arabic gloss. After creating the affixes and stems tables for ALG, the compatibility tables of BAMA were updated according to the data included in these tables.
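To make the role of the affix, stem and compatibility tables concrete, the following sketch shows a BAMA-style lookup: every split of a word into prefix+stem+suffix is tried, each part is looked up in its table, and a split is accepted only if the three pairwise compatibility checks (prefix-stem, prefix-suffix, stem-suffix) succeed. The category names, the toy entries and the length limits (`max_prefix=4`, `max_suffix=6`) are assumptions made for illustration, not the contents of the ALG tables.

```python
from typing import Dict, Iterator, List, Set, Tuple

Pair = Tuple[str, str]


def segmentations(word: str, max_prefix: int = 4, max_suffix: int = 6) -> Iterator[Tuple[str, str, str]]:
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    n = len(word)
    for p in range(min(max_prefix, n - 1) + 1):
        for s in range(min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


def analyze(word: str,
            prefixes: Dict[str, List[str]],
            stems: Dict[str, List[str]],
            suffixes: Dict[str, List[str]],
            ab: Set[Pair], ac: Set[Pair], bc: Set[Pair]) -> List[Tuple[str, str, str]]:
    """Return the splits whose categories pass all three compatibility tables."""
    results = []
    for pre, stem, suf in segmentations(word):
        for a in prefixes.get(pre, []):
            for b in stems.get(stem, []):
                if (a, b) not in ab:        # prefix-stem compatibility
                    continue
                for c in suffixes.get(suf, []):
                    if (a, c) in ac and (b, c) in bc:
                        results.append((pre, stem, suf))
    return results


# Toy tables with placeholder strings and hypothetical categories.
prefixes = {"": ["P0"], "a": ["P1"]}
stems = {"bcd": ["S1"], "abcd": ["S2"]}
suffixes = {"": ["X0"], "e": ["X1"]}
ab = {("P0", "S1"), ("P1", "S1"), ("P0", "S2")}
ac = {("P0", "X0"), ("P0", "X1"), ("P1", "X0"), ("P1", "X1")}
bc = {("S1", "X0"), ("S1", "X1"), ("S2", "X0")}

print(analyze("abcde", prefixes, stems, suffixes, ab, ac, bc))
```

In this layout, adding a dialect affix or stem category has no effect until the corresponding pairs are also added to the compatibility sets, which is why the compatibility tables had to be updated after the affix and stem tables.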

D. Experiment
As mentioned above, we tested our MA on the Algiers dialect corpus; the test set contains 1618 distinct words extracted from 600 sentences chosen randomly. We consider that a word is correctly analysed if it is correctly decomposed into prefix+stem+suffix and if all the features related to them are correct (POS, gender, number, person). We first tested the MA with stems extracted only from the ALG corpus lexicon, then we introduced stems created from the MSA stems table. We list the obtained results in Table XXI. We examined the words for which no analysis was returned by the morphological analyzer (see Table XXII); most of the cases are:
• French words which do not exist in the stems table, like (électricité, electricity), or words like (ingénieur, engineer) and (numéro, number) that are included in the stems table but with another orthography (respectively and ). The same case is observed for nouns written with long vowel at the end instead of , such as (place).
• Words written with missing letters, like the word , which appears in the stems table as . The same case is noticed for (he said to me) instead of or , or (I said to him) instead of or .
• Proper nouns.
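The correctness criterion used in this experiment (right prefix+stem+suffix decomposition and right POS, gender, number and person) can be expressed as an exact-match comparison between a reference analysis and the analyzer output. The sketch below assumes one reference and at most one predicted analysis per word; the `Analysis` record, the `score` function and the toy entries are hypothetical, and the paper does not state how multiple candidate analyses were handled.

```python
from typing import Dict, NamedTuple, Optional


class Analysis(NamedTuple):
    prefix: str
    stem: str
    suffix: str
    pos: str
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None


def score(gold: Dict[str, Analysis],
          predicted: Dict[str, Optional[Analysis]]) -> Dict[str, float]:
    """Count a word as correct only if segmentation and all features match exactly."""
    correct = unanalysed = 0
    for word, reference in gold.items():
        hypothesis = predicted.get(word)
        if hypothesis is None:          # the analyzer returned no answer
            unanalysed += 1
        elif hypothesis == reference:   # tuple equality covers every field
            correct += 1
    total = len(gold)
    return {
        "accuracy": correct / total if total else 0.0,
        "unanalysed": unanalysed,
        "wrong": total - correct - unanalysed,
    }


# Toy data with placeholder words and tags.
gold = {
    "w1": Analysis("p", "s1", "x", "VERB", "m", "sg", "3"),
    "w2": Analysis("", "s2", "", "NOUN", "f", "sg"),
    "w3": Analysis("", "s3", "", "NOUN", "m", "pl"),
    "w4": Analysis("", "s4", "x", "VERB", "f", "sg", "3"),
}
predicted = {
    "w1": Analysis("p", "s1", "x", "VERB", "m", "sg", "3"),
    "w2": Analysis("", "s2", "", "NOUN", "m", "sg"),   # wrong gender
    "w3": None,                                        # unanalysed
    "w4": Analysis("", "s4", "x", "VERB", "f", "sg", "3"),
}
print(score(gold, predicted))
```

Counting unanalysed words separately from wrongly analysed ones mirrors the error inspection above, where the words with no answer are examined on their own.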

VII. CONCLUSION
This paper summarizes a first attempt to work on Algerian Arabic dialects, which are non-resourced languages. These dialects lag behind the dialects of the Middle East, to which several works have been dedicated and for which many NLP tools have been produced. The presented work is the first part of a larger project on speech translation between MSA and Algerian dialects. In this first part we focus on the dialect spoken in Algiers and its periphery. We began with a study showing all the features related to it, then we introduced the resources that we created from scratch. This process was expensive in terms of time and human effort, but the results were worth it. We obtained a cleaned corpus of Algiers dialect aligned to MSA; this corpus is the first parallel corpus to include Algerian dialect to date. We also presented the Grapheme-to-Phoneme (G2P) converter that we created for Algiers dialect, which combines a rule-based approach with a statistical approach. The level of correctness for the G2P converter is about 85%. In terms of corpus resources, this task enabled us to transcribe the ALG corpus into a phonetic form. We also proposed a morphological analyser for ALG that we adapted from the well-known BAMA, which is dedicated to MSA. We reached an accuracy rate of 69% when evaluating it on a dataset extracted from the ALG corpus. Our future work, before developing a statistical machine translation system, is to extend the corpus we created to other Algerian Arabic dialects and to adapt all tools dedicated to ALG to these dialects.

TABLE I :
Arabic phonemes using SAMPA

TABLE II :
Examples of verbs scheme differences between ALG and MSA.

TABLE III :
Example of ALG nouns derived from a verb.

TABLE IV :
Personal pronouns of Algiers dialect.

TABLE V :
Demonstrative pronouns of Algiers dialect.

TABLE VI :
The verb conjugation in the past tense.

TABLE VII :
The verb conjugation in the present tense.
• The imperative: It expresses commands or requests, and is used only for the second person. It is generally realised by adding the prefix and the suffixes and to the verb.

TABLE VIII :
The verb conjugation in the present tense.

TABLE IX :
Examples of ALG plural forms.

TABLE XI :
Interrogative particles and pronouns in ALG and their equivalents in MSA.

Table XII illustrates some examples of declarative sentences with their negations.

TABLE XII :
Declarative sentences with their Negation.

TABLE XIII :
Parallel corpus description.

TABLE XIV :
Example of French words used in ALG.

In Table XIV, although the first two words are French, they are phonetized as Arabic words. The French phoneme /t/ is replaced by the Arabic phoneme /t'/ in the word (table).

TABLE XV :
Examples of mappings between Arabic grapheme and French phonemes.

TABLE XVI :
Examples of aligned graphemes and phonemes.

TABLE XVII :
Examples of kept, deleted and added prefixes in ALG prefixes table.

TABLE XIX :
Example of splitting an MSA stem into two dialectal stems.

At this stage, we constructed a set of Arabic verb stems having a dialect pattern; we analysed them and eliminated all stems that are not used in ALG. We give some examples in Table XX.

TABLE XX :
Examples of converted stems from MSA to ALG.

TABLE XXI :
Results of ALG morphological Analyser.

TABLE XXII :
Examples of unanalyzed words.