Key Issues in Vowel Based Splitting of Telugu Bigrams

Splitting of compound Telugu words into their components or root words is one of the important, tedious and yet error-prone tasks of Natural Language Processing (NLP). Except in a few special cases, at least one vowel is necessarily involved in Telugu conjunctions. In the resulting compound, vowels are often repeated as they are, or are converted into other vowels or consonants. This paper describes the issues involved in vowel based splitting of a Telugu bigram into proper root words using Telugu grammar conjunction ('sandhi') rules for Machine Translation (MT).


INTRODUCTION
Sanskrit is considered the mother language of almost all Indian languages, since a majority of the Indian languages are based on grammar rules similar to those of Sanskrit grammar [6]. Sanskrit is grammatically very well structured and very rich in its inflections [7]. It is the oldest language on the earth to have a powerful structured grammar. Panini (300 BCE), the greatest grammarian, developed Sanskrit grammar with more than 4000 rules [8], [10]. Unlike western languages, Sanskrit is the best example of a language that unites words to form a compound word (or simply compound). According to Bloomfield and Chomsky (1957), the sentence is the largest grammatical unit [16].
It is possible, and customary, to write a complete sentence as a single compound in Sanskrit. For instance, "jalObhaumamantarikshamitidvidhAbhavati" can, for convenience, be tokenized as "jalaH bhaumaM antarikshaM iti dvidhA bhavati", which means 'water is of two types, one is on the earth, and another is in space' ('jalaH': water, 'bhaumaM': on the earth, 'antarikshaM': in space, 'iti': like this, 'dvidhA': two categories, 'bhavati': is).
Sanskrit scholars have to be very careful about tokenization. Insufficient knowledge of the grammar, or insufficient attention to each and every letter, gives immature tokenization that yields a distorted or, in some special cases, quite opposite meaning [8]. For example, take the word 'viSvAmitraH', whose meaning is 'friend of the universe'. It can be tokenized as 'viSva' + 'amitraH' according to 'savarNadIrgha sandhi', which is not to be applied here because it gives the opposite meaning, i.e., 'enemy of the universe'. For this kind of special case, Sanskrit strictly provides exemptions. So it should be 'viSva' + 'mitraH', where the regular conjunction rule is violated and a special rule is applied. Only a person who is aware of this kind of special case can tokenize properly.
Likewise, a majority of Indian languages follow the features of Sanskrit and inevitably undergo conjunction, which generates compounds that are essentially bigrams, trigrams or n-grams. A bigram is a compound formed by two words, a trigram by three words, and so on. As Telugu is one of these languages, one can see the same nature of uniting words to form n-grams in Telugu. Though Telugu is highly influenced by other languages, most of all by Sanskrit [7], Telugu did not originate from Sanskrit [4]. Even though Telugu was originally totally free from Sanskrit, Sanskrit has had a tremendous impact on Telugu and has penetrated it deeply. In 1816, Francis White Ellis raised this issue. Later, Bishop Robert Caldwell proved that a family of twelve Dravidian languages (Telugu, Tamil, Kannada/Canarese, Malayalam, Tulu, Kodagu/Coorg, Tuda, Kota, Gond, Khond/Ku, Rajmahal and Oraon) did not originate from Sanskrit in his book titled "A Comparative Grammar of Dravidian Languages" in 1856 [5]. As a proof of that, pure Telugu literature is available in the form of 'yayAti caritramu' by 'ponnagaMTi telaganna', written in the 16th century [1] [13]. Later, Telugu was heavily mingled with Sanskrit by the 'samskrutAndhra kavulu' (Sanskrit-Andhra poets, 'Andhra' being a synonym of Telugu) when they translated epics of Sanskrit literature such as Ramayana, Mahabharata and bhAgavata. Learning or speaking Sanskrit was a great honor in those days, and literary work in Sanskrit was highly honored. That can be one of the reasons for Sanskritizing Telugu to enhance its value.
Additionally, there are numerous dialects in Indian languages. Many of them do not even have a script, are based on their culture and territory, and show a tremendous impact of non-Indian languages like Urdu, Persian, Arabic, English, etc. For instance, most of the Telugu language is affected by Urdu in the 'telaMgANa' territory. 'tarfIdu, aafIsu, pennu, pEparu, kaburlu, bassu', etc. are words from those languages adapted in Telugu [4]. Such words, their conjunctions and their corrupted / colloquial forms are almost always understandable by locals but not easily by non-locals. For example, 'nI jimmaDa' is a phrase very frequently used by the natives of eastern Andhra. It means 'let your tongue fall' (literally, 'jimma' is the colloquial form of 'jihva', the Sanskrit word for tongue, and 'aDa' is the corrupted form of 'paDu', a Telugu word).

www.ijacsa.thesai.org

II. VOWELS
According to 'pANini' Sanskrit grammar, vowels and their forms are given as in
Note: 'pluta' is applied in calling somebody who is at a distance. For example, 'hE rAmA3'. Here '3' indicates the 'pluta' of the vowel 'A'. If 'pluta' is not applied here, the person cannot be called.
Again, each type is classified into two different forms, namely 'anunAsika' and 'ananunAsika'. 'anunAsika' is a nasal sound while 'ananunAsika' is not. A total of six types for each of the vowels 'a, A, A3' yields 18 different forms of the vowel 'a(अ)'.
Telugu adds two more short vowels, 'e' and 'o', and one more long vowel, 'Z', to the above listed Sanskrit vowels to comprise a total of eighteen vowels [2]. All proper Telugu words end with vowels only. That is why Telugu is called an 'ajanta' (= 'ach' + 'anta', literally 'ach' meaning vowel and 'anta' meaning ending) language. Consonants are called 'hal' in Telugu. They are 37 in number. Unlike Telugu, the words of almost all Indian languages end in consonants, and hence these are called 'halanta' languages. All western languages are also categorized as 'halanta' languages, as their words commonly end in consonants, except Italian, whose words end in vowels. This is the reason why Telugu is called the 'Italian of the East', and it is one of the secrets behind the sweetness of Telugu vocabulary.
There are eighteen vowels in Telugu language as shown in

III. PROCESS OF SPLITTING WORDS
Due to the many practical issues involved in maintaining a database with all combinations of compounds, it is better to maintain only standard or root words. Compounds of the source language are split to obtain the original words using reverse engineering in accordance with the conjunction ('sandhi') rules. This makes the morphological analysis easier.
Proper stemming and correction of corrupted forms are necessary when splitting n-grams into individual tokens, so that the context is better understood. This also plays an important role in translation, since understanding is itself a kind of translation. Splitting of compounds into root words is an important phase in NLP for applications like MT [9]. Building a computational model to analyze natural language is the goal of NLP [15]. For MT from Telugu to any other language, including Indian languages, one of the issues in dealing with source language words is that each word needs to be stored in the database together with its different suffixes/prefixes (also known as inflections), thus tremendously increasing the storage space. This is especially so for a language like Telugu, which has about 800 different dialects within the state of Andhra Pradesh. But most of the conjunctions are common and are computable. The best way to translate compounds is to split them back into the root words from which they were formed and then translate the individual root words. Compounds are formed from two or more root words. While the root words can be retrieved from the database, the inflections thus obtained need serious attention. Inability to handle the inflections sufficiently may result in false word formations and distorted meaning. But merely splitting the compound may not give the complete meaning all the time. To understand the meaning of a compound, first identify the meaning of its components and then the relationship between them [11]. For instance, the compound 'rAmunitOkapirAju' is formed by two words, 'rAmunitO + kapirAju'. The first word is inflected, and the second word is a root word. 'rAmunitO' literally means 'with rAma' and 'kapirAju' means Hanuman. If the inflection is not observed in the first word, the compound may be split as 'rAmunitOka' (literally, 'the tail of rAma') + 'pirAju' (an absurd word), which gives a distorted meaning.
The scope of this paper is limited to dealing with bigrams only, for obtaining better MT, and it aims to propose solutions to the issues of vowel based splitting. Issues related to the handling of different types of dialects and their corrupted forms have not been considered. More specifically, the handling of compounds formed according to the grammar rules, and their splitting based on vowels, together with certain special cases in Telugu, have been discussed.

IV. CONJUNCTION RULES
Splitting is the reverse of conjunction. Conjunction is called 'sandhi' and splitting is called 'sandhi vicchEda' in Sanskrit. Telugu also uses the word 'sandhi' to represent conjunction. At least two words are required for conjunction. The first word is called 'pUrva pada' and the second word is called 'para pada' or 'uttara pada' [12]. While most of each word remains unchanged, technically the actual 'sandhi' occurs between two letters, i.e., 'pUrva svaram' (the last letter of the 'pUrva pada') and 'para svaram' (the first letter of the 'para pada') [1]. Telugu adopted many of its 'sandhi' rules from Sanskrit, as it uses much of the grammar of Sanskrit in addition to its own grammar rules. Sanskrit grammar rules were adopted into Telugu since a majority of the words of Telugu were taken from Sanskrit. Sanskrit grammar describes three ways to form a 'sandhi'.

In Sanskrit, there are five important classifications of 'sandhis'. They are 1) 'ach sandhi', 2) 'hal sandhi', 3) 'visarga sandhi', 4) 'prakruti bhAva sandhi' and 5) 'svAdi sandhi' [1]. But only the first three 'sandhis' are used very frequently. 'ach' and 'visarga sandhis' work with vowels, and 'hal sandhis' work with consonants. The 'sandhi' classifications are given in TABLE III. Though these three kinds of 'sandhis' are used by Telugu as they are, they are treated as Sanskrit 'sandhis'. Telugu defines around thirty 'sandhis' (TABLE IV) according to its own grammar. These Telugu 'sandhis' fall under 'ach sandhis' or 'hal sandhis', or work with both vowels and consonants [1]. Though there are many Sanskrit and Telugu 'sandhis', only those that result in a vowel in the compound (TABLE V) have been considered for vowel based splitting, irrespective of whether they are classified as 'ach sandhi', 'hal sandhi' or 'visarga sandhi'. Some special cases are also discussed in this paper even though they involve a consonant.

Note: For this rule any vowel can precede the 'visarga', and that vowel appears in the compound with the preceding character of the 'visarga'.

V. VOWEL BASED SPLITTING RULES
Technically, whatever rules are used for conjunction, they are applied in reverse to obtain the root words back. This approach can be considered a reverse engineering process.

Algorithm:
1) A compound in Telugu, which is to be translated, is taken and is transliterated into Roman Telugu.
2) Each character is checked to determine whether it is a vowel.
3) If it is a vowel, then all possible combinations to split the word are tried according to the 'sandhi' rules listed in TABLES VI through XIII.
4) If the compound is formed from two words according to the 'sandhi' rules, then it is split into those two words.
5) The process is repeated recursively until all the words thus separated are found in the dictionary/database.
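The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the names `VOWELS`, `RULES`, `candidate_splits` and `split_bigram` are hypothetical, every Roman Telugu vowel is assumed to be a single character, the rule table holds only the 'a + a = A' pattern of 'savarNadIrgha sandhi', and a small Python set stands in for the dictionary/database. In line with the scope of the paper, it handles bigrams only and does not recurse.

```python
# Roman Telugu vowels as listed in the text (assumed to be single characters).
VOWELS = set("aiuReo") | set("AIUEYO")

# Each rule: (letters seen in the compound, end of first word,
#             start of second word, name of the 'sandhi').
# Only one pattern is included here; a full table would follow the paper's tables.
RULES = [
    ("A", "a", "a", "savarNadIrgha sandhi"),
]


def candidate_splits(compound):
    """Yield every (first, second, rule) obtained by undoing a rule at a vowel."""
    for index, letter in enumerate(compound):
        if letter not in VOWELS:
            continue  # step 2: only vowels are split points
        for surface, first_end, second_start, name in RULES:
            if compound.startswith(surface, index):
                first = compound[:index] + first_end
                second = second_start + compound[index + len(surface):]
                yield first, second, name


def split_bigram(compound, lexicon):
    """Keep only the candidates whose two words are both in the lexicon."""
    return [
        (first, second, name)
        for first, second, name in candidate_splits(compound)
        if first in lexicon and second in lexicon
    ]


lexicon = {"Siva", "arcana"}  # hypothetical stand-in for the database
print(split_bigram("SivArcana", lexicon))
```

Filtering the candidates against the lexicon is what discards the improper splits; with a larger rule table the same loop simply produces more candidates to filter.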
Example: 'SivArcana' is formed by the root words 'Siva' + 'arcana'. In vowel based splitting, the vowels of 'SivArcana', i.e., 'i, A, a', are to be checked (TABLE XIV).

Note: If V (vowel) is the last character of the compound, then there is a chance of a split when the letter is a long vowel like 'A, I, U, E, Y, O', e.g., 'vaccADA' = 'vaccADu' + 'A' (meaning 'did he come?'). This occurs mostly in interrogative cases. But there is no chance for a short vowel in that position to be the result of conjunction. Hence there is no need to check the last character of the compound as a possible result of 'sandhi' if it is 'a, i, u, R, e or o'.

From all the patterns listed in TABLE XIX, the 'a + a' pattern of 'savarNadIrgha sandhi' is applicable to split 'SivArcana' into 'Siva + arcana'. Amongst these 13 patterns, only one pattern is suitable to split the compound properly.
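The note on the last character can be expressed as a small filter that decides which vowel positions are worth checking. The following Python sketch is illustrative only; the function name `vowel_positions_to_check` is hypothetical, and each vowel is assumed to be one character of the Roman Telugu string.

```python
SHORT_VOWELS = set("aiuReo")   # short vowels named in the note
LONG_VOWELS = set("AIUEYO")    # long vowels named in the note


def vowel_positions_to_check(compound):
    """Return the indices of the vowels that may be the result of 'sandhi'.

    A short vowel in the final position is skipped; a long vowel is kept
    wherever it occurs, including at the end of the compound.
    """
    last = len(compound) - 1
    positions = []
    for index, letter in enumerate(compound):
        if letter in LONG_VOWELS:
            positions.append(index)
        elif letter in SHORT_VOWELS and index != last:
            positions.append(index)
    return positions


for word in ("SivArcana", "vaccADA"):
    print(word, [word[i] for i in vowel_positions_to_check(word)])
```

For 'SivArcana' this leaves exactly the three vowels 'i, A, a' named in the example, while the final long 'A' of 'vaccADA' remains a candidate.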
When one pattern splits the compound successfully, there is no need to go for further splitting unless the compound is formed from three or more words. Unnecessary splitting may yield improper or unacceptable root words. As a rule of thumb, the best results are obtained by splitting in such a way that the first word extracted from the compound is as long as possible. Even if a proper word is obtained from the compound well before the end, the splitting process is not to be stopped until all vowels of the compound are checked. Ex. 'adhikAramaDugu' is a bigram formed by two proper words, 'adhikAramu' (authority) and 'aDugu' (to ask), by the rule of 'ukAra sandhi'. But it can also be taken as a trigram formed by three proper words, 'adhi' (to overcome), 'kAramu' (chilli powder) and 'aDugu' (to ask). If the splitting process is stopped at the earlier stage, when it first finds a proper word (for instance, 'adhi'), it yields a useless or distorted meaning when translated.
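The rule of thumb can be illustrated with a short Python sketch that enumerates every analysis of a compound and then prefers the one whose first word is longest. It is a simplified illustration, not the paper's procedure: the function names are hypothetical, the only 'sandhi' rule included is the loss of a final 'u' before 'a' as read off the 'adhikAramu' + 'aDugu' example, and plain juxtaposition of two words without any sound change is allowed purely as an assumption so that the trigram reading can be produced.

```python
# (letters seen in the compound, end of first word, start of second word)
# Single rule taken from the example: 'adhikAramu' + 'aDugu' -> 'adhikAramaDugu'.
RULES = [("a", "u", "a")]


def analyses(word, lexicon):
    """Return every way of breaking `word` into lexicon entries."""
    results = [[word]] if word in lexicon else []
    for i in range(1, len(word)):
        # Assumption for illustration: words may also be joined unchanged.
        options = [(word[:i], word[i:])]
        for surface, first_end, second_start in RULES:
            if word.startswith(surface, i):
                options.append(
                    (word[:i] + first_end, second_start + word[i + len(surface):])
                )
        for first, rest in options:
            if first in lexicon:
                for tail in analyses(rest, lexicon):
                    results.append([first] + tail)
    return results


def prefer_longest_first_word(candidates):
    """Pick the analysis whose first word is the longest."""
    return max(candidates, key=lambda parts: len(parts[0]))


lexicon = {"adhikAramu", "aDugu", "adhi", "kAramu"}  # hypothetical database
found = analyses("adhikAramaDugu", lexicon)
print(found)
print(prefer_longest_first_word(found))
```

Both readings are found because every position is examined before a choice is made; stopping at the first dictionary hit would have committed to 'adhi'.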
Sometimes, some words are not to be treated as compounds and should be translated as a whole. For instance, 'adhikAri', literally meaning "officer", is to be treated as a single word and should not be split. If it is split, it becomes 'adhika + ari' by the pattern 'a + a = A' from 'savarNadIrgha sandhi', where 'adhika' means 'more' and 'ari' means 'enemy'. Both are root words and the 'sandhi' seems to be proper, but the resulting meaning is 'more enemy', an incorrect translation. The primary requirement in translation is that the meaning of the context should not be disturbed.
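One simple way to protect such words, sketched below in Python, is to look the whole word up before any splitting is attempted. This is an assumption about how the requirement could be met rather than a method stated in the paper; the function names and the tiny lexicons are hypothetical.

```python
def reverse_savarnadirgha(compound):
    """Yield (first, second) for every 'A' read back as 'a' + 'a'."""
    for index, letter in enumerate(compound):
        if letter == "A":
            yield compound[:index] + "a", "a" + compound[index + 1:]


def tokens_for_translation(word, lexicon):
    """Return the word whole if it is known, otherwise try to split it."""
    if word in lexicon:
        return [word]  # known single word: do not split
    for first, second in reverse_savarnadirgha(word):
        if first in lexicon and second in lexicon:
            return [first, second]
    return [word]  # nothing matched: leave it untouched


with_whole_word = {"adhikAri", "adhika", "ari"}
without_whole_word = {"adhika", "ari"}
print(tokens_for_translation("adhikAri", with_whole_word))
print(tokens_for_translation("adhikAri", without_whole_word))
```

The second call shows the failure described above: without an entry for the whole word, the grammatically valid but wrong split is returned.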

VI. SPECIAL CASES OF 'ACH SANDHI'
All the 'sandhis' and the cases discussed above relate to a single independent vowel. There are special cases in which either the next or the previous letter of the vowel is also to be checked in splitting. This ensures that the compound is formed by a particular 'sandhi'.
For some 'sandhi' rules, both the previous and next letters of the vowel are to be checked. From TABLES XV and XVI it is observed that when a conjunction results in two or more letters, the total number of letters is to be observed for splitting.
While checking the vowels of the compound in vowel based splitting, if the vowel is preceded by the letter 'y/v/r', then consider both letters, 'y/v/r' + vowel, and apply the 'yaNAdESa sandhi' rules in reverse to find the root words effectively.

3) 'jastva sandhi':
This 'sandhi' also results in specific consonants along with vowels in the compound. These specific 'consonant + vowel' patterns are helpful (except in some cases; refer to the Note of 'ukAra sandhi') in tracing the root words exactly by applying the 'jastva sandhi' rules.
Rule: when the 'pUrva svara' is 'k/c/T/t/p' and the 'para svara' is a vowel or 'g/j/D/d/b/h/y/v/r', then 'g/j/D/d/b' come as 'AdESam' (TABLE XVIII).

While checking the vowels of the compound in vowel based splitting, if the previous two letters of the vowel are 'TT', then the 'dvirukta TakAra sandhi' rules are followed in reverse to find the root words easily.

4) 'visarga sandhi':
This 'sandhi' also results in specific consonants along with vowels in some special cases. These 'consonant + vowel' patterns are helpful in tracing root words by applying the 'visarga sandhi' rules.
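The idea of checking neighbouring letters together with the vowel can be sketched as a pattern table that is tried from the longest pattern to the shortest. The Python below is an illustration under assumptions, not the paper's tables: the names `PATTERNS` and `split_longest_pattern` are hypothetical, and the three entries are read off examples in the text ('brahma' + 'Rshi' = 'brahmarshi', 'bhUmi' + 'AkASamu' = 'bhUmyAkASamu', and 'a + a = A').

```python
# (letters seen in the compound, end of first word, start of second word, name)
PATTERNS = [
    ("A", "a", "a", "savarNadIrgha sandhi"),
    ("ar", "a", "R", "guNa sandhi"),
    ("yA", "i", "A", "yaNAdESa sandhi"),
]


def split_longest_pattern(compound, lexicon):
    """Try multi-letter patterns before single vowels; return the first valid split."""
    ordered = sorted(PATTERNS, key=lambda pattern: len(pattern[0]), reverse=True)
    for surface, first_end, second_start, name in ordered:
        start = compound.find(surface)
        while start != -1:
            first = compound[:start] + first_end
            second = second_start + compound[start + len(surface):]
            if first in lexicon and second in lexicon:
                return first, second, name
            start = compound.find(surface, start + 1)
    return None


lexicon = {"brahma", "Rshi", "bhUmi", "AkASamu"}  # hypothetical database
print(split_longest_pattern("brahmarshi", lexicon))
print(split_longest_pattern("bhUmyAkASamu", lexicon))
```

Trying the two-letter patterns first narrows the search, because fewer rules can produce 'ar' or 'yA' than can produce a lone 'a' or 'A'.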

VII. DISCUSSIONS

a) Inflections:
Inflections are called 'vibhaktis', and they play an important role in Telugu grammar. In Telugu, inflections occur at the rear part of a word, which alters the original form of the root word. If a word is inflected, then it is not possible to carry out splitting straightaway. All inflections must first be separated, and then splitting is applied to obtain the root words. Conjunction is possible not only with root words but also with inflected words.
For example, 'bhUmyAkASamutO' is a compound which is inflected ('tO') at the rear end. The inflection is separated first, and then the split rule is applied to obtain 'bhUmi' + 'AkASamu' (TABLE XVII).
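The two-stage treatment of this example can be sketched in Python as below. It is a minimal illustration, assuming a hypothetical suffix list that contains only 'tO' and a single reverse rule ('yA' read back as 'i' + 'A') taken from this example; it handles an inflection at the rear end of the compound only.

```python
REAR_INFLECTIONS = ("tO",)  # only the suffix from the example; a real list would be longer


def strip_rear_inflection(compound):
    """Separate a known inflection from the end of the compound, if present."""
    for suffix in REAR_INFLECTIONS:
        if compound.endswith(suffix) and len(compound) > len(suffix):
            return compound[:-len(suffix)], suffix
    return compound, ""


def split_inflected(compound, lexicon):
    """Strip the inflection, then undo 'i' + 'A' = 'yA' on what remains."""
    stem, suffix = strip_rear_inflection(compound)
    start = stem.find("yA")
    while start != -1:
        first = stem[:start] + "i"
        second = "A" + stem[start + 2:]
        if first in lexicon and second in lexicon:
            return first, second, suffix
        start = stem.find("yA", start + 1)
    return None


lexicon = {"bhUmi", "AkASamu"}  # hypothetical database
print(split_inflected("bhUmyAkASamutO", lexicon))
```

Returning the stripped suffix alongside the two root words keeps the inflection available for the later translation step.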
If the 'pUrva pada' is inflected and takes part in the conjunction, then it is difficult to find the root word. For example, in the sentence 'rAmuNNeMduku cUSAvu' (why did you see Rama?), the compound to be split is 'rAmuNNeMduku'. It is known that this is formed by two words, 'rAmuNNi' + 'eMduku'. The first word is inflected ('rAmuDu' + 'ni'), and such words are not available in the database as they are. In such cases, splitting should be applied using morphology rules.

b) Plural forms:
Plural forms are very common in any language. If they are involved in conjunction, splitting becomes difficult. For example, 'kukkalarupulu' (barking of dogs) is a compound formed by 'kukkala' + 'arupulu'. The 'pUrva pada' is a plural term and is inflected. The formation of some plural words is not proper. For example, 'baLLu' can be the plural form of either 'baDi' (school) or 'baMDi' (a cart). But 'baDulu' and 'baMDlu' are the right plural forms of 'baDi' and 'baMDi' respectively.
In normal conversation, however, the corrupted form 'baLLu' is intermittently used as the plural of both. Likewise, 'paLLu' can act as the plural form of both 'paMDu' (fruit) and 'pannu' (tooth). 'paMDlu' and 'paLLu' are the proper plural forms of these respectively.
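The ambiguity of such plurals can be made explicit by storing every possible root for a plural form instead of a single one. The Python sketch below is only an illustration of that data layout; the table `PLURAL_ROOTS` and both function names are hypothetical, and the entries are limited to the words mentioned above.

```python
# Plural form -> every singular root it may stand for in practice.
PLURAL_ROOTS = {
    "baDulu": ["baDi"],            # proper plural of 'baDi' (school)
    "baMDlu": ["baMDi"],           # proper plural of 'baMDi' (cart)
    "baLLu": ["baDi", "baMDi"],    # colloquial plural used for both
    "paMDlu": ["paMDu"],           # proper plural of 'paMDu' (fruit)
    "paLLu": ["pannu", "paMDu"],   # proper for 'pannu', colloquial for 'paMDu'
}


def candidate_roots(plural):
    """Return all roots recorded for a plural form (empty list if unknown)."""
    return PLURAL_ROOTS.get(plural, [])


def is_ambiguous(plural):
    """A plural is ambiguous when more than one root is recorded for it."""
    return len(candidate_roots(plural)) > 1


for form in ("baDulu", "baLLu", "paLLu"):
    print(form, candidate_roots(form), is_ambiguous(form))
```

An ambiguous result signals that the surrounding context, not the splitter, has to decide which root is meant.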

c) Colloquial forms:
Colloquial or corrupted forms of a language are inevitable. These corrupted forms are impossible to split unless they are maintained in the database. For example, 'rAmoDoccADu' (Rama came) is the compound formed by 'rAmuDu' + 'occADu', where 'occADu' is the corrupted form of 'vaccADu'. For successful splitting, either 'occADu' must also be available in the database, or a rule must be made to morph it into, or treat it as, 'vaccADu'.
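The second option, a rule that maps a colloquial form to its standard form, can be sketched as a lookup applied to each word obtained after splitting. This Python fragment is illustrative only; the mapping table and the function name `normalise` are hypothetical and contain just the pair given in the example.

```python
COLLOQUIAL_TO_STANDARD = {"occADu": "vaccADu"}  # pair taken from the example


def normalise(token, lexicon):
    """Return a database form for `token`, or None if neither form is known."""
    if token in lexicon:
        return token
    standard = COLLOQUIAL_TO_STANDARD.get(token)
    if standard in lexicon:
        return standard
    return None


lexicon = {"rAmuDu", "vaccADu"}  # hypothetical database without 'occADu'
print([normalise(token, lexicon) for token in ("rAmuDu", "occADu")])
```

Keeping the mapping separate from the database avoids storing every corrupted form as a full entry.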

d) Problems caused by conjunctions:
Some conjunctions create difficulties in identifying the root verb. For example, 'mEDipaMDujUDu' (see the fig fruit) is the compound of the root words 'mEDipaMDu' + 'cUDu'. But according to the conjunction rule ('saraLAdESa sandhi'), they must be 'mEDipaMDunu' + 'cUDu' before conjunction. Here the 'nu' of the 'pUrva pada' is removed and the 'c' of the 'para pada' is converted to 'j', and 'jUDu' is not available in the database.
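A possible way to recover the root in this example is to try the second word as it appears and, only if that fails, to revert its first consonant. The Python sketch below assumes this strategy for illustration; it is not stated in the paper, the names `REVERT` and `split_with_reverted_consonant` are hypothetical, and the table contains only the 'j' to 'c' change from the example above.

```python
REVERT = {"j": "c"}  # 'c' became 'j' in the example; other pairs are not listed here


def split_with_reverted_consonant(compound, lexicon):
    """Scan every boundary; revert the second word's first letter only if needed."""
    for i in range(1, len(compound)):
        first, second = compound[:i], compound[i:]
        if first not in lexicon:
            continue
        if second in lexicon:
            return first, second  # the surface form is itself a known word
        if second[0] in REVERT:
            reverted = REVERT[second[0]] + second[1:]
            if reverted in lexicon:
                return first, reverted
    return None


lexicon = {"mEDipaMDu", "cUDu"}  # hypothetical database
print(split_with_reverted_consonant("mEDipaMDujUDu", lexicon))
```

Checking the unchanged form first is a deliberate choice: it avoids rewriting a second word that legitimately begins with the voiced consonant.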
'gasaDadavAdESa sandhi' also creates similar difficulties in splitting. For example, 'tallidaMDrulu' (parents) is the compound formed by 'talli' + 'taMDrulu'. According to the conjunction rule, the first letter of the 'para pada' is converted from 't' to 'd', and 'daMDrulu' is not available in the database. If 'da' is morphed to 'ta' for splitting, it leads to another difficulty. For example, 'Akalidappikalu' (hunger and thirst) is the compound formed by 'Akali' + 'dappikalu' by 'gasaDadavAdESa sandhi'. If the 'da' of this compound is changed, then it becomes 'tappulu' (mistakes), thus giving a wrong translation.

VIII. CONCLUSION
Though there are many issues involved in splitting, splitting plays a key role in MT. It paves the way to translating as much of the source language as possible. The issues involved in splitting can be solved by applying appropriate, properly evolved morphological processes.
All possible patterns to observe in a compound for vowel based splitting are given in TABLE XXII. Applying the longest possible pattern gives good results. The rule of the appropriate 'sandhi' is then applied for splitting. When there are no multi-letter patterns available in the compound, it becomes mandatory to observe only a single vowel for splitting. This may lead to ambiguity in some cases. However, vowel based splitting can separate at least one proper word from the compound, from left to right, if there is one. More patterns can be found for some special cases and included to split the compound very precisely.

Pronunciations:
The letters should be pronounced normally as in English, except when they are italicized. If so, follow TABLE XXIII.

TABLE II.
All the vowels are called 'ach' or 'svara' according to Telugu grammar; their Roman equivalents are as in TABLE II.

TABLE VI.
ALL PATTERNS OF 'SAVARNADIRGHA SANDHI'

TABLE VII.
ALL PATTERNS OF 'GUNA SANDHI'

TABLE X
rule is applied to split 'vAgISuDu', it becomes 'vAgu' + 'ISuDu', which is a wrong splitting. It should actually be split as 'vAk' + 'ISuDu'. Such conflicts should be handled carefully and may require manual checks.

TABLE XIII.
ALL PATTERNS OF 'IKARA SANDHI'

TABLE XIV.
POSSIBLE PATTERNS OF THIS SPLITTING OF 'SIVARCANA'

(TABLE XVIII). Following are the examples.
1) 'guNa sandhi':
In specific cases, this 'sandhi' results in two letters instead of one in the compound (TABLE VII). Checking more than one letter also reduces the time complexity of splitting: six patterns can result in the vowel 'a', but only three patterns can result in 'ar'. Ex. 'brahmarshi' is a compound formed by two root words, 'brahma' and 'Rshi'. All patterns are given in TABLES XV and XVI.

TABLE XV.
POSSIBLE SPLITTING BY OBSERVING ONLY VOWEL 'A'

TABLE XXI.
VOWEL PATTERNS OF 'VISARGA SANDHI-2'
While checking the vowels of the compound in this splitting, if the next two letters of the vowel are 'SS/shsh/ss/sc/shT/st', then applying the 'visarga sandhi' rules in reverse gives better results in finding the root words easily.

TABLE XXIII.
ROMAN TELUGU PRONUNCIATION
RT stands for Roman Telugu, and the capitalized letters should be pronounced with greater emphasis on them.