Dynamic Phrase Generation for Detection of Idioms of Gujarati Language using Diacritics and Suffix-based Rules

Gujarati is the language used for everyday communication in the state of Gujarat, India. It is officially recognized by the constitution and the government of India. The Gujarati script is closely related to the Devanagari script. An idiom is an expression, phrase, or word whose meaning differs from the literal meaning of the words in it. Idioms represent the cultural heritage of the Gujarati language and are used for effective communication and for conveying an accurate message. Existing machine translation systems do not accurately translate Gujarati idioms to English or any other language. Different idiom phrases can be generated by adding diacritic(s) as well as a suffix to the root or base form of the idiom. The many forms of a single idiom make automatic idiom identification, as well as machine translation, more challenging. This paper focuses on the design and implementation of diacritics and suffix-based rules for dynamic phrase generation and detection of idioms of the Gujarati language. This implementation helps in identifying a Gujarati idiom present in any possible form in Gujarati text. Executing the implementation on 7050 different Gujarati idiom phrases yields an accuracy of 99.73%. The results are encouraging enough to make the proposed implementation useful for natural language processing tasks related to Gujarati idioms.

Keywords: Diacritic; Gujarati; idiom; machine translation system (MTS); natural language processing (NLP); suffix; unicode transformation format (UTF)


I. INTRODUCTION
Machine translation is a sub-field of Natural Language Processing (NLP), which is itself a sub-field of Artificial Intelligence (AI). Natural language processing is the study of a language by analyzing its structure and morphology. It is challenging because different languages have different grammatical structures. Vocabulary is important for the enrichment of a language, and idioms contribute to that enrichment and give impetus to the language. An idiom is an incomplete phrase that forms part of a sentence.
The people of Gujarat state are known as Gujaratis. Idioms are an invaluable heritage of the Gujarati language and of the Gujarati people. An idiom is a word or phrase that takes on a specific meaning rather than its literal meaning. Gujarati idioms represent the customs, manners and beliefs of the people of Gujarat who speak the Gujarati language. They are the adornment of the language, are spoken in day-to-day communication and are understood by every Gujarati speaker.
When the speaker uses an idiom, a listener who does not already know the metaphor is likely to mistake it for the literal meaning of the words. Usually, idioms cannot be translated properly by machine translation systems. In most cases, the meaning changes or becomes misleading when the idiom is translated into another language. In the Gujarati language, a particular idiom may have one or more forms or phrases. For the correct translation of Gujarati idioms into any other language, detection of all idiom forms or phrases is a crucial task.

A. Gujarati Script
There are more than 46.1 million speakers of the Gujarati language in the world. Gujarati is the 26th most spoken native language in the world [1]. The Gujarati script is closely related to the Devanagari script. It is a syllabic alphabet (abugida), in which every consonant carries an inherent vowel, and its principles are similar to those of the Devanagari script [2]. Unlike Devanagari, its letter forms have no horizontal top bar [3]. The Gujarati script is used to write the Gujarati language of Gujarat state in India. The Gujarati language consists of three different types of characters: 34 independent consonants, 13 independent vowels and dependent vowel signs [3][4][5].

B. Gujarati Idioms
An idiom is a common phrase whose meaning differs from the literal meaning of its words. It is widely used and has a popularly understood meaning. For the correct translation of Gujarati idioms, identifying the different forms of an idiom in the input text is important. In the Gujarati language, different valid forms of an idiom can be produced by adding one or more specific diacritic marks to the base or root form of the idiom. For example, હાથ આપ "hath aap" is the base or root form of a Gujarati idiom. One of its meanings in English is "to help". From the root form હાથ આપ, other valid idiom forms such as હાથ આપવો "hath aapvo", હાથ આપી "hath aapi", હાથ આપીને "hath aapine", હાથ આપ્યો "hath aapyo" and હાથ આપેલો "hath aapelo" can be generated. This work concentrates on the identification of all Gujarati idiom phrases. Surrounding words are important for idioms that have more than one literal meaning in the Gujarati language [6][7][8][9]; however, only the dynamic generation and identification of all Gujarati idiom forms from the base form of the idiom are addressed here.

C. Diacritics
A diacritic, or accent, is a mark added to a letter or word that changes its function, sense or pronunciation. A diacritical mark is attached to a letter or word to show the appropriate stress or sound [10]. Diacritic marks are common in the Gujarati language. Every Gujarati consonant carries an inherent vowel, and diacritic vowel signs are added after, before, above, or below a consonant to change it. Table I shows the diacritics for the Gujarati language [11].
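As a rough illustration of how diacritics can be told apart from base letters, the following Python sketch separates the two using Unicode general categories. It is not the extraction technique of [11]; the function name `split_marks` is hypothetical, and the sketch assumes that the categories Mn and Mc are an adequate proxy for Gujarati dependent vowel signs, anusvara and virama.

```python
import unicodedata

def split_marks(word):
    """Separate base letters from combining marks (diacritics) in a word."""
    base, marks = [], []
    for ch in word:
        # Assumption: Unicode categories Mn (nonspacing mark) and
        # Mc (spacing combining mark) identify the diacritic signs.
        if unicodedata.category(ch) in ("Mn", "Mc"):
            marks.append(ch)
        else:
            base.append(ch)
    return base, marks

# "aapine", the last word of the idiom form "hath aapine"
word = "આપીને"
base, marks = split_marks(word)
print("base letters:", base)
print("diacritics:  ", marks)
```

For this word the sketch reports three base letters and two dependent vowel signs, which matches the view of an idiom form as a root plus inserted diacritics and suffix letters.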

E. Unicode Transformation Format (UTF)
The Unicode encoding of Indic scripts is based on ISCII-1988 (Indian Script Code for Information Interchange). The Unicode Standard encodes the Devanagari characters, and the same layout is followed for the Gujarati language [3]. The Unicode Transformation Format (UTF), in its 8-bit form UTF-8, is a variable-width character encoding capable of encoding all valid code points in Unicode. It is a superset of the American Standard Code for Information Interchange (ASCII). UTF goes beyond 8 bits per character and supports the scripts covered by Unicode.
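The variable-width property can be seen directly in a short Python sketch. The helper name `in_gujarati_block` is hypothetical; the range U+0A80 to U+0AFF is the Gujarati block of the Unicode Standard.

```python
# Compare the UTF-8 encodings of an ASCII letter and a Gujarati letter.
ascii_char = "k"
gujarati_char = "\u0a95"  # GUJARATI LETTER KA

ascii_bytes = ascii_char.encode("utf-8")        # 1 byte, identical to ASCII
gujarati_bytes = gujarati_char.encode("utf-8")  # 3 bytes

def in_gujarati_block(ch):
    """Return True if ch lies in the Unicode Gujarati block (U+0A80..U+0AFF)."""
    return 0x0A80 <= ord(ch) <= 0x0AFF

print(len(ascii_bytes), len(gujarati_bytes))
print(in_gujarati_block(gujarati_char), in_gujarati_block(ascii_char))
```

Because each Gujarati letter or vowel sign occupies several bytes, string handling for idiom matching should operate on decoded text (code points) rather than on raw bytes.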
The rest of the paper is organized as follows. Section II presents the literature review related to the study of Gujarati morphology, diacritic identification methods, stemmers, and Gujarati idioms. Section III covers the methodology, in which diacritics and suffix-based rules are generated for dynamic idiom generation; it also describes the steps of the proposed algorithm for identifying the different Gujarati idiom phrases within the input text. Sections IV and V discuss the experiments with results, analysis, and observations. Finally, Section VI concludes the paper.

II. RELATED LITERATURE REVIEW
Various projects have studied different languages and their machine translation, but the scope of this paper is limited to the Gujarati language and its idioms.
Rakholia et al. [11] implemented a diacritic identification and extraction technique for the Gujarati language using an 8-bit Unicode Transformation Format. They designed an independent tokenization algorithm for diacritic extraction from Gujarati documents. They claimed 99.58% accuracy in diacritic extraction.
Sheth et al. [12] proposed a stemmer called Dhiya for morphological-level analysis of the Gujarati language. They used inflections of Gujarati text to create rule sets and tested the stemmer's performance using the EMILLE corpus. They claimed 92.41% accuracy for the stemmer.
Patel et al. [13] proposed a hybrid morphological analyzer paradigm model for the Gujarati language. They applied a supervised approach and used a partial stemmer to generate word forms based on language-dependent rules. They built a manual dictionary of root words, classified words and covered 5000 nouns.
Baxi et al. [14] developed a morphological analyzer for the Gujarati language using a hybrid combination of knowledge-based, statistical and paradigm-based approaches. They claimed 92.34% accuracy with a knowledge-based hybrid model and 82.84% accuracy with a statistical hybrid model.
Fashwan et al. [15] proposed a rule-based method for detecting case-ending diacritics, but for Modern Standard Arabic texts. They applied morphological analysis, part-of-speech tagging, syntactic analysis, and word relation as well as position. They experimented with both morphological and syntactic processing levels for handling the diacritization problem.
Dan et al. [16] described the languages that use diacritical characters and the difficulty of recovering missing diacritics. They described and evaluated a system for automatically recovering the missing diacritics in Romanian-language documents, which offers suggestions for possible diacritic changes.
Rakholia et al. [17] implemented a rule-based technique to identify stop words in Gujarati text. They presented 11 rules to identify a complete list of Gujarati stop words. They applied an automatic and dynamic approach to identify stop words in Gujarati documents and claimed 94.08% average accuracy.
www.ijacsa.thesai.org
Research works involving Natural Language Processing (NLP) of the Gujarati language have been presented for MTS for the Sanskrit-Gujarati pair [18], comparison of morphologically analyzed words [19], bilingual dictionary implementation [20], constituency mapper [21], classification [22] and information retrieval [23], to name a few.
Based on this literature review and the analysis of Gujarati diacritics, no researchers have identified the various idiom forms in input text using a rule-based diacritic insertion technique, and none have applied diacritics and suffix-based rules to the idiom base form to generate all possible idiom forms. Some researchers have experimented with diacritization, but for languages other than Gujarati. Others have applied various techniques for creating rule-based stemmers and diacritic identification methods.
The proposed model deals with Gujarati idioms and their possible forms. Because Gujarati idioms have many phrases or forms, detecting them within input text is a challenging task. The proposed model detects the Gujarati idioms present in the text by generating and searching for all possible forms of a particular idiom within the text. It applies dynamic phrase generation for the detection of Gujarati idioms using diacritics and suffix-based rules. Available machine translation systems encounter problems in translating Gujarati idioms, so idiom detection helps the research community in translating Gujarati idioms into any language.

III. METHODOLOGY
A total of 3240 distinct n-gram Gujarati idioms were collected. In the Gujarati language, however, one idiom can be used in many ways, i.e., one specific idiom may have many forms or phrases. Therefore, only the base form of each idiom is stored in the database, and all possible forms are generated dynamically by inserting diacritics and suffixes into the base form according to defined rules. This implementation is used to identify any form of a Gujarati idiom within input Gujarati text.

A. Rules Generation
Rules are generated and applied for n-gram idioms where n>=2. For bigram or 2-gram idioms, rules are applied to the 2nd word only, and various idiom forms can be generated. For trigram or 3-gram idioms, rules are applied to the 3rd word only, and many idiom forms can be generated. For example, હાથ આપ "hath aap" is a bigram idiom root form, so diacritic rules are applied to the 2nd word આપ "aap" only; whereas અક્કલ મારી જવ "akkal mari jav" is a trigram idiom root form, so diacritic rules are applied to the 3rd word જવ "jav" only. In general, for an n-gram idiom where n>=2, many idiom forms can be generated by applying rules to the last word of the idiom root form. For 1-gram idioms, as well as some n-gram idiom(s), different forms are not applicable.
The following rules were identified to generate possible idiom forms from the given base form of an n-gram idiom.

1) Rule 0: Root or base form only
For instance, the idiom અધ્ધર રાખ "adhdhar rakh" is used in sentences in this same form. So no diacritics need to be added to the root verb રાખ.

9) Rule 8: using diacritic ા, suffix and diacritics
Root form + ંા + લ
Table III shows all the rules generated for inserting diacritics and suffixes into the root idiom form. Rule 0 is the original root form of the idiom stored in the database. Rules 1 to 9 are applied to all root forms of idioms. Rules 10 to 14 are exception rules for idioms whose root form ends with જવ "jav", થવ "thav", ખાવ "khaav", લેવ "lev" or બેસ "bes".
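The idea of generating forms from the last word of a root form can be sketched in Python as follows. This is only an illustration: the suffix strings are read off the "hath aap" example forms (aapvo, aapi, aapine, aapyo, aapelo) and are not the rule set of Table III, and the names `ILLUSTRATIVE_SUFFIXES` and `generate_forms` are hypothetical.

```python
# Assumption: these suffix strings come from the "hath aap" example forms,
# NOT from the paper's Table III. The empty string plays the role of Rule 0.
ILLUSTRATIVE_SUFFIXES = ["", "વો", "ી", "ીને", "્યો", "ેલો"]

def generate_forms(root_idiom, suffixes=ILLUSTRATIVE_SUFFIXES):
    """Generate candidate idiom forms by extending only the last word."""
    words = root_idiom.split()
    if len(words) < 2:
        # 1-gram idioms have no alternative forms in this scheme.
        return [root_idiom]
    head, last = words[:-1], words[-1]
    return [" ".join(head + [last + suffix]) for suffix in suffixes]

root = "હાથ આપ"  # "hath aap"
forms = generate_forms(root)
for form in forms:
    print(form)
```

A fuller version would attach a condition to each rule, for example the exception rules for roots ending in particular verbs, instead of applying one flat suffix list to every root.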

B. Proposed Model
MySQL was used as the database software to store the idiom database. PHP was used as the scripting language for development. Visual Studio Code was used as the editor for writing the PHP code. XAMPP was used as the cross-platform local web server for the implementation of the model.


Step 1: Data collection: A total of 3240 distinct Gujarati n-gram idioms were collected from different books and websites.
Step 2: Pre-processing: Except for 1-gram idioms, and wherever applicable, the root form of each Gujarati idiom is derived, so that diacritics and suffixes can be added dynamically to the root form to generate the various possible idiom forms.
Step 3: Database creation: An idiom database is generated in which the root or base form of each idiom is stored once in the idiom column, together with a column holding the corresponding literal Gujarati meaning.
Step 4: Input: The algorithm accepts text containing Gujarati idiom(s). The input text may contain any number of n-gram Gujarati idioms.
Step 5: Search: The algorithm searches for all n-gram idioms in the input text by comparing the input text with the idiom column of the idiom database.
Step 6: Form generation: The algorithm generates all possible forms of each specific n-gram idiom using the rules shown in Table III.
Step 7: Matching: The generated idiom forms are compared with the idiom form that appears in the input text. If a matching idiom form is found in the input text, the algorithm displays its possible literal meaning(s) in the Gujarati language; otherwise, it displays the text or idiom form as it is.
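Steps 3 to 7 can be sketched end to end in Python. The authors' implementation uses PHP and MySQL, so everything here is an assumption made for illustration: the in-memory dictionary `IDIOM_DB` stands in for the database, the suffix list is taken from the "hath aap" example rather than Table III, and the function names are hypothetical.

```python
# Hypothetical stand-in for the idiom database:
# root form -> literal Gujarati meaning ("madad karvi", i.e. "to help").
IDIOM_DB = {"હાથ આપ": "મદદ કરવી"}

# Assumption: illustrative suffixes only, not the Table III rule set.
SUFFIXES = ["", "વો", "ી", "ીને", "્યો", "ેલો"]

def generate_forms(root_idiom):
    """Step 6: extend the last word of the root form with each suffix."""
    words = root_idiom.split()
    if len(words) < 2:
        return [root_idiom]
    return [" ".join(words[:-1] + [words[-1] + s]) for s in SUFFIXES]

def detect_idioms(text):
    """Steps 5 and 7: return (matched form, literal meaning) pairs."""
    found = []
    for root, meaning in IDIOM_DB.items():
        # Try longer forms first, because the root form is a prefix of
        # every generated form and would otherwise always match first.
        for form in sorted(generate_forms(root), key=len, reverse=True):
            if form in text:
                found.append((form, meaning))
                break
    return found

text = "તેણે મને હાથ આપ્યો"
matches = detect_idioms(text)
# Step 7: show the meaning if an idiom is found, else the text as it is.
print(matches if matches else text)
```

Plain substring matching keeps the sketch short, but it ignores word boundaries and separator variants; a practical version would tokenize the input first.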

IV. RESULTS
No automated tool is available to measure accuracy for the Gujarati language. Therefore, 2 or 3 linguists helped with manual verification of the results, and the obtained results were recorded manually side by side to calculate accuracy. Each idiom was analyzed together with its possible valid forms, and each form was tested for accuracy. For example, different forms of the root forms તરસ ખાવ "taras khav" and હાથ આપ "hath aap" were tested. The literal meaning of the Gujarati idiom તરસ ખાવ is દયા બતાવવી "daya batavavi" in Gujarati and "showing kindness" in English.

FINAL OUTPUT=મદદ કરવી મદદ કરવી મદદ કરવી મદદ કરવી મદદ કરવી મદદ કરવી
For the experiments, a total of 7050 different valid idiom phrases or forms were entered as input text and tested. The proposed system correctly identified 7031 idiom forms and failed to identify only 19. The accuracy obtained for the correct identification of Gujarati idiom(s) in Gujarati text was 99.73%. Some idiom phrases or forms were not correctly identified because of similarity in their root forms; for example, જામી જવું "jami javu" and જામી જવી "jami javi" are distinct idioms with distinct literal Gujarati meanings, but they share the same root form જામી જવ "jami jav". The other errors were due to the inclusion of a comma, hyphen, space, or non-breaking space between the words of n-gram idioms where n>=2.
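The reported accuracy follows directly from the counts above, as this short Python check shows (the variable names are illustrative):

```python
# Counts reported in the experiments.
total_forms = 7050
misidentified = 19

correct = total_forms - misidentified
accuracy = 100 * correct / total_forms

print(f"{correct} / {total_forms} = {accuracy:.2f}%")
```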

V. ANALYSIS AND DISCUSSION
Some observations, results and language ambiguities emerged during the experimentation.
2) Although all collected idiom root forms are inserted in the database, if an idiom does not exist in the database, the algorithm will not be able to identify its various forms. In that case the algorithm returns the particular idiom form as it appears in the text, without its literal Gujarati meaning.
6) For n-gram idioms (n>=2) where root forms are irrelevant or only a single idiom form is possible, the idiom phrase itself is entered in the idiom database. For example, અટકળ પંચા દોઢસો "atkal pancha dodhso" is a 3-gram idiom whose different forms are irrelevant in the Gujarati language, so it is stored as it is in the idiom database instead of as a root form. Similarly, અડિયલ ટટ્ટુ "adiyal tattu" is a 2-gram idiom whose different forms are irrelevant in the Gujarati language, so it is stored in the same form in the idiom database.
An idiom database containing 3240 distinct idioms was created and tested with 7050 different idiom phrases or forms. The implemented algorithm can find all possible idiom phrases within Gujarati text by applying the 15 rules of Table III to each idiom root form present in the idiom database. Spelling errors in Gujarati idioms can also be rectified by the proposed model.

VI. CONCLUSION
The proposed model, which identifies all valid Gujarati idiom forms within Gujarati text and returns their literal Gujarati meaning, was successfully implemented. The focus is on the dynamic generation and identification of all Gujarati idiom phrases. From an in-depth study of 3240 Gujarati idioms and their 7050 different idiom forms, 15 rules were generated. These rules insert diacritic(s) and suffixes into the base or root form of a Gujarati idiom, and the dynamically generated idiom forms are used to identify any idiom phrase inside the text. If a particular Gujarati idiom root form is not present in the database, the model returns the idiom as it is, so a complete collection of idioms is required for the success of the model. It is noteworthy that the proposed approach is not just diacritic based but also uses suffixes such as વ, ય and ન.
Based on the results obtained from generating the various possible idiom forms through rule implementation, the proposed rule-based system is promising and worth implementing in the real world for the translation of Gujarati idioms into any other language. Since Gujarati idioms are used in many forms in real life, identifying all forms of an idiom is a challenging task for any machine translation system. The implemented rule-based system identifies most of the various forms of idioms. The proposed model opens the path for translating Gujarati idioms into any other language by finding all possible idiom forms within the input text. Context identification for idioms with multiple meanings, as it concerns the translation of Gujarati idioms, is left for future work.