Automation and Validation of Annotation for Hindi Anaphora Resolution

—The process of labelling any language genre by which one can extract useful information is called annotation. This provides syntactic information about a word or a word phrase. In this paper, an effort has been made to provide the algorithm for semiautomatic annotation for Hindi text to cater anaphora resolution only. The study was conducted on twelve files of Ranchi Express available in EMILLE corpus. The corpus is originally tagged for demonstrative pronouns. The detection of the pronouns is supported by the incorporation of seven tags. However the semantic interpretation of the demonstrative pronoun is not supported in the original corpus. In this paper an effort has been made to automate the process of tagging as well as the handling of semantic information through addition tags. It was conducted on 1485 demonstrative pronouns. The average accuracy of precision, recall and F measure is 74, 71 and 72 respectively. I. INTRODUCTION Natural language processing has attracted the researchers " volition to enhance the natural language resources during the last few decades. A number of applications of natural language processing need the syntactic meanings of words or word phrases. These meanings are used for different applications like, part of speech tagging, information retrieval, text summarizations, question answering, anaphora resolution, etc. So, the annotation will play a pivotal role in these NLP applications. In this study, the systematic discussion has been held of the labelling process of demonstrative pronoun in the context of anaphora resolution.


INTRODUCTION
Natural language processing has attracted the researchers" volition to enhance the natural language resources during the last few decades.A number of applications of natural language processing need the syntactic meanings of words or word phrases.These meanings are used for different applications like, part of speech tagging, information retrieval, text summarizations, question answering, anaphora resolution, etc.So, the annotation will play a pivotal role in these NLP applications.In this study, the systematic discussion has been held of the labelling process of demonstrative pronoun in the context of anaphora resolution.
Anaphora is a process of finding the referring expression in the discourse.Wrong correlation of referring expression in the genre affects all applications of NLP.
Example 1: "They don"t understand why it seems like bad behavior on Wall Street is rewarded, but hard work on Main Street isn"t, or why Washington has been unable or unwilling to solve any of our problems", (Obama 2010, http://www.divaportal.org/smash/get/diva2:531167/fulltext01.pdf,accessed on 18 th Aug, 2015).
In example 1, Obama referred to the people in a negative way by implying that they might be inferior, since Obama assumed that they do not understand how the economic crisis was solved.Though the Obama wants to refer the Congress rather people.
Example 2: "Now, our friends down in Tampa at the Republican Convention were more than happy to talk about everything they think is wrong with America.But they didn't have much to say about how they'd make it right.They want your vote, but they don't want you to know their plan.And that's because all they have to offer is the same prescriptions they've had for the last 30 years" (Obama, Sept 6, 2012, http://www.presidency.ucsb.edu/ws/index.php?pid=101968, accessed on 18 th Aug, 2015.) In example 2, "they' refer to a specific group, which does not belong to Democrats and this demonstrative pronoun creates serious problems in anaphoric context.Moreover the interpretation of machine has been always error prone.So, anaphora resolution itself is a significant problem.To address this problem annotation of any corpus is crucial, while formulation, evaluation and optimization of any algorithms in NLP, particularly automation of anaphora resolution.Annotation becomes the prerequisite condition for anaphora and other applications for better accuracy.

II. BACKGROUND OF ANNOTATION
A number of attempts have been made to retrieve the information from the text by a number of means; one of them is annotation.There is no standard annotation scheme which can fulfill all the requirements.The different labeling schemes have been adopted to address the different problems.In this regard the most commonly used practices are phrase structure, dependency, HPSG (Head-driven Phrase Structure Grammar) and Hybrid (Phrase structure and Dependency, both).Penn Treebank [1] is the most used and adapted annotation scheme; firstly for English and then in other languages.A number of languages parsed according to Pen Treebank are like, Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Estonian, Finnish, etc.

III. MOTIVATION
Annotated corpora promise to be valuable for researchers as diverse as the automatic construction of statistical models.Written or spoken language provides the raw data to investigate, evaluation and comparison of different linguistic tools/ models.Combining raw language data with linguistic information, offers a promising basis for the development of www.ijacsa.thesai.orgnew efficient and robust NLP methods.Real world texts annotated with different strata of linguistic information can be used for grammar induction.Annotating the corpora manually takes rigorous effort and competency too.It is better to draw some conclusion/ rules which lead to fully or partially automate this process.
Skill and competency levels of human beings always impose the restriction on the accuracy of annotation.Human interpretation of discourse and its constituents may vary.Understanding may become subjective in the context and may lead to incorrect annotation.Though, manually tagged data are the most preferred practice, may be due to the ultimate benchmark for accuracy of any NLP task, i.e., human interpretation, the need for having automation is necessitated by the availability of voluminous digital data these days and the limitation imposed by the human capacity.

IV. LITERATURE SURVEY
A number of attempts have been made to label the text or dialogue.Broadly, these are classified under four categories.First dependency structure, second phrase structure, third HPSG (Head-driven Phrase Structure Grammar) and fourth is hybrid labeling.Technically, hybrid type of annotation is to serve two purposes; phrase structure as well as dependency.These annotation schemes are being used to enrich language resources.Multiple corpuses for same language is an added advantage for researchers to test and validate their tool/ technique for the same.It helps to create a more generic solution for those languages.

A. Manually Annotated Corpora
Fully tagged corpus is a necessity of any language to groom its automation tools.A number of corpora are available in English and European languages.The Prague Dependency Tree Bank (PDT) annotated up to three levels; morphological, syntactic annotation and third level were linguistic level [1].Susanne Corpus as treebanks exist for English [2], the Lancaster Parsed Corpus [3] and another corpus for English, the International Corpus of English [4], the Prague Dependency Treebank for Czech [5].Treebank projects for other languages made in the recent times, e.g., for French which is tagged for morphosyntax, lemmas, compounds, lexical clusters and phrase boundaries [6], Italian corpora annotated with grammatical relations and syntactic representation of sentence [7], syntactically parsed for Spanish [8], lexically annotated speech corpora of Turkish was an attempt by marking derivation boundaries by [9], and a dependency structured Russian corpora which was lemmatized, morphologically and syntactically tagged [10].The annotation of the German TIGER Treebank [11] is done in a manner so that it can easily be exported to XML.They consider the verb-sub categorization, coordination, appositions and parentheses as well as proper nouns.TIMEBANK is richly annotated to indicate events, times, and temporal relations [12].

B. Annotated corpora for Coreference and Anaphora
There are a few corpora which were annotated, especially for anaphora resolution or co-reference resolution.Lancaster Anaphoric Treebank [13] of Associated Press with 100000 words and annotated according to UCREL annotation scheme.This was the joint venture of UCREL and IBM.MUC-6 and MUC-7 annotate co-referential link of 65000 words.Similar to MUC scheme a tool ClinKa was developed to annotate English genre at University of Wolverhampton [14].Another corpus developed by the members of University of Stendahls Grenoble and Xerox Research Center Europe [15] by creating an anaphoric and cataphoric link.It addresses the zero noun anaphora, adverbial anaphora, indefinite pronoun, demonstrative pronoun, third person personal pronoun, and personal pronoun.French corpus [16], annotate anaphoric links in MUC.Few multilingual corpora are available like English-Romanian corpus [17], technical manual of English-French for co-referential link at the University of Wolverhampton.

C. Automatic or Semiautomatic Annotated Corpora
An attempt for corpus annotation for labelling semantic and syntactic meaning of word and word phrases for coverage of deep parser to generate syntactic structure, semantic representation and discourse information for dialogue by means of semiautomatic technique [18].The EPAC was a speech corpus, it consists of a set of 100 hours of conversational speech manually transcribed [19].This spoken corpus automatically annotated by automatic segmentation, transcription, POS tagging and other tools.The Diachronic German Corpus [20] was automatically annotated by a suite of NLP tools.These tools are integrated into WebLicht and CLARIN-D.WebLicht Service Oriented Architecture is used as an integrated environment.A corpus is trained with automatic system for semantic [21].It advocates coreference, quantification, and defined a set of semantic rules for many other higher-order phenomena, which was left out by Penn Treebank.

D. Annotation for Hindi
Botley & Mc Energy [22] proposed an annotation scheme for English to resolve anaphora.Later this scheme modified by Botley [23] again for the same purpose and same language with emphasis on indirect anaphora.They considered the recoverability of antecedent, direction of reference, phoric type, syntactic function, antecedent type to annotate three genres.These corpora are the American Printing House for the Blind (APHB) Corpus, the Associated Press (AP) Corpus, and the Hansard Corpus [23].Recoverability refers to the relation between referring expression and its corresponding antecedent in context of demonstrative pronoun and this feature based on Halliday and Hassan [24].Feature "Phoric type" is distinction between substitution and reference [24].Their tag set adapted to annotate Hindi by Sinha [25], and later [26] added a few more tags.Reference [27] used some semantic information for indirect anaphora categorization.The author has identified a few semantic categories to classify indirect anaphora in Hindi.

A. Tag set used
The demonstrative pronouns are understood in terms of an unordered paradigmatic set of five distinctive features [22] .Syntactic Function and Antecedent Type, two other features, which are proposed by [23].Last three features in table 1 facilitate to identify indirect anaphora [27], [26], [25].Recoverability of antecedent helps to identify the same.www.ijacsa.thesai.org

B. Corpus selection
This study has been conducted on EMILEE corpus.The Hindi written corpus contains a total of approximately 12,390,000 words in Unicode.It is pre annotated corpora for demonstrative pronouns.One component of this corpus is based upon Ranchi Express news items.In pre annotated corpus there are seven features.We have modified this corpus with an annotation scheme according to [27].Being in Unicode is an added advantage of using EMILEE corpus.
Example 1 is manually annotated Ranchi Express news from EMILEE corpus according to Botley"s annotation.In this annotation only seven tags are there and their respective value of उन्होंने (unhone) pronoun is DPDARHN.We have considered the data from EMILLE corpus.We picked one segment of corpus which is based on the news items from Ranchi express.In this study, we analysed twelve files of plain text.In example 4 additional three tags have been attached उन्होंने ,यह, ,_,_,_".First is pronoun pattern, second case marker and third semantic category.In this example null is denoted by "_".

C. Algorithm
Though, some researchers advocated syntactic features of languages for annotation.Reference [27] suggested some specific pattern of pronoun and other words which is categorized in pattern.Secondly, case marker elaborates the pronoun significance and binding with its antecedent.This algorithm annotate only last three tags discussed in example 4.
Step1: Input the set of case marker.Step5: Stop EMILEE corpus is annotated with seven tags.Last three tags (Pronoun pattern, Case marker/ Connective, Semantic category) have to be annotated either manually or automatically.Reference [27] draw some rules for classification of indirect anaphora in demonstrative pronoun.
As a first step, the eight case markers of Hindi given as input then pronoun pattern.Third step is to read all files to be annotated.It extracts all tags (feature list) which is already annotated in EMILLE corpus.Then it generates the case marker and a pronoun pattern by defining window size.We took window size five.In this window, algorithm will search specific pronoun pattern and case marker according to [27].Pronoun was matched with the given list of pronoun and proceeding word; which indicates recoverability of antecedent on the basis of semantics.We considered the first hit for pronoun pattern and case marker.Then extract the case marker and pronoun pattern.In fourth step syntactic category, pronoun and pattern following pronoun are stored as the element of a string tag (i.e.tag [8], tag [9] and tag [10] respectively).Then apply the rules given by [27].In "b" part of "fourth' step of above algorithm (i.e.Sequential search these in the rule based classifier) has been adopted from [27].The output is stored in one class.This is the required output with file for all ten tags.

D. Case marker used
Hindi consists of nine case markers.First eight are in use and the last one "hey" (हे ) is obsolete, or less in use practically.These case markers specify the binding with anaphor and antecedent.In ergative case marker there are few bindings and exceptions, e.g. the postposition "ne" (ने ) must come right after the subject; the subject changes in oblique the perfect form of the verb now agrees with the direct object in number and gender.In case of above condition, number and gender agreement puts the bindings between verb and direct object.Exceptions are:  If the object is not stated, or if the object is followed by िो (ko) then the perfect form of the verb should be in masculine singular form.
 The auxiliary verb (if any) also agrees with the object, not the subject.
The eight case markers are ergative, nominative, ablative, accusative, instrumental, genitive, dative and locative.We have considered these eight cases of Hindi for the study.Few cases are given below, though in linguistic perspective, these cases come as a suffix.
 neit marked as ergative marks the subject or topic (but only in the past perfective tense for transitive verbs)  ka/ke/ki -marks the genitive  ko -marks the accusative or dative, -typically means "from" or "by", also marks the passive agent  mein, par -locative "in", "at"

E. Accuracy measurement of automated tags
In natural language processing mainly two metrics are used.The First precision and the second recall.Another metrics is derived from precision and recall, which is called F measure  Precision (P) is the fraction of retrieved documents that are relevant (1)  Recall (R) is the fraction of relevant documents that are retrieved (2) These notions can be made clearer by examining the following contingency table: There is another alternative to calculate accuracy of data set.It is the ration of true selection of text and the sum of all selections (all true + all false).(5) This seems plausible, since there are two actual classes, true and false, and an information retrieval system can be considered as a two-class classifier which attempts to label them as such.We are using only equation first, second and seventh.

 F measure : It is the harmonic mean of Precision and Recall
Given n points, x1, x2,………….xn , the harmonic mean is: ∑ (6) So, the harmonic means of precision and recall is: With the help of above equation ( 7), we will calculate the F measure.The authenticity of results checked against only three metrics, i.e. precision, recall and F measure.

VI. RESULT AND DISCUSSION
The study was conducted on twelve files of Ranchi Express from EMILEE corpus.It consists 206 news items, and 1485 demonstrative pronouns.An effort has been made to automate tagging of last three tags given in table 1.A set of rules are applied to accomplish the task.Three accuracy metrics have been considered; precision, recall and F measure.The Table 3 shows all the twelve files and number of pronouns found in the respective file.Number of news and length of news per file is directly proportional to the number of pronouns.In each file for each pronoun performance metrics are calculated.In the first file, first pronoun was "iss" (इस) and the value of precision, recall & F measure is 66.66, 66.66 and 66.66 respectively.These values for second pronoun are 100, 100 and 100.The average precision, recall and F measure of file 1 at serial number one in the table.Hence, the average of each metric is calculated and given in the table 4 below along with its respective file.Then again average of all files is calculated.Precision varies from 65.52 to 79.08 and recall 65 to 75.This depicts there is no major variation in all metrics.
In figure 1, the annotated genre; on the axis X number of files are given.On axis Y, percentage of accuracy of three tags in the terms of precision, recall and F measure is given.Though the pattern of result of all these three parameters for all the twelve files is almost the same.The average accuracy of precision, recall and F measure is 74, 71 and 72 (approximately) respectively.

A. Error Analysis
We classify the entire results into three types; worst, average and best results across the data set (all files) and their pronoun.

1) Worst case (file 8, pronoun 81 st ):
It is observed that in the 7 th file (file name is file 8) having 81 st to 83 rd pronoun produced the result zero for precision, recall and F measure.Automated file has been given below.Special cases, which annotate genre with zero accuracy compared with manual tagged file for the same text with same pronoun.In the proposed algorithm, first, we stored all case marker/ connector and pronoun patterns.The algorithm search for the required case marker and pronoun pattern.Wherein window size is more than one in which we are seeking the case marker and the pronoun patterns.The precision and recall was calculated of the entire string after the 7 th tag.In manual tagging there are four entries, including pronoun after the 7 th tag.On the other hand automatically tagged 81 st pronoun has ten entries after the 7 th tag.Last three tags are same in automatic annotation and manually tagged annotation for this example. It reflects that the seeking window should be decreased to the optimal size in order to avoid additional case marker and pronoun pattern.This particular example (Worst case, file 8, pronoun 81 st ) depicts that the scenario of case marker string and pronoun pattern string may have multiple entries due to the number of pronoun and the number of case marker come consecutively in discourse.e.g.उन (un), इन(in), उन्होंने (unhone).
 The last three features was automated.There are two methods to match ten features.First, start the matching of features from 7 th tag to 10 th tag.And the second is to match the last three features of automatic and manual files.Then discard the additional entries between 7 th and last but 3 rd .It will solve the problem of additional entries in case marker and pronoun string.It also will improve the accuracy in the terms of precision, recall and F measure.
 Refining the number of rules will increase the accuracy of automatic annotation.These rules define the pattern of case marker/ connectives and pronoun.These two features have more contribution for error.

A. Depiction of results
Few results were concluded from ongoing work.Before arriving at the conclusion, twelve files are studied of monologue, and 206 news items which consist 1485 pronouns.Accuracy varies from 65 to 79 for precision.Recall varies 62 to 75 and F measure has been maximum and the minimum values are 77 and 63 percent respectively.Average of precision, recall and F measure remained 74, 71 and 72 percent.File number twelve (12) has the lowest accuracy for all the three metrics and file ten (10) has the highest.

B. Conclusion
This dataset depicts the generalized picture of the genre.Fine tuning of rules besides considering few other semantic categories increase the accuracy of results.In the proposed algorithm the window size is five or till the end of sentence.It will count all the pronoun patterns and add to the tag set.It will make additional entries in values of feature set string.The remedial action for window size is to decrease it to one.Now, there are few advantages and few disadvantages of window size one.
Decreasing window size may increase the accuracy in terms of precision, recall and F measure.While the disadvantage is that it will skip further potential pronoun pattern which may yield higher accuracy.That means we have to go for the best hit.To find the best hit one has to develop new logic.Potential pronoun pattern and case marker may be tested for other genres.It also optimizes the results.In this work we are replacing the potential pronoun pattern with the www.ijacsa.thesai.orgfirst hit.In future it may be replaced with the best hit by other logic.Fine tuning the rules and taking gold standard dataset increase the percentage of accuracy.

C. Future work
Though the corpus is small.A larger and gold standard dataset may produce more authentic results.Selection and fine tuning of rules increase the accuracy of results.Developing the logic for the best hit in seeking window for potential pronoun pattern and case markers also helps to improve the results.These few issues can be addressed in future work which optimize the results.

Step2:
Input pattern for pronounsStep3: WHILE(file in the lists of files) REPEAT I)Find "<w tag, and corresponding ">" a. Extract the feature list call it tag II) Split the tag list into list of individual features III) Generate_Case_Marker_and_Pronoun_ Pattern() a. Define window size for pattern www.ijacsa.thesai.orgbsemantic category, which is tag[10], Pronoun, which is tag[8], pattern following pronoun which is stored in string tag [].b.Sequential search these in the rule based classifier.c.Output the CLASS; return CLASS.END WHILE Step4: Print all the text along with modified tag to files.

TABLE III .
PRONOUN COUNT PER FILE

TABLE IV .
PRECESION, RECALL AND F MEASURE PER FILE

case (file 1, pronoun 1 st ):
,P,D,A,R,H,N,उन्होंने ,उनिा,वे ,उसी,उन,_,_,_">उन्होंने </w> योजनाबनािर <wThe pronoun 82 nd and 83 rd of the same file depict the same result due to over length entries in string of case marker and pronoun pattern.It can be fixed with the help of window size.File 3, file 8, file 12 falls in the category of the worst case.Though the average of these files is 61 to 75 percentage.It was considered as average case if the accuracy of precision, recall and F measure is fifty percent or more.File 1, file 4, file 6, file 8, file 9 and file 11 lies in the average case.