Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List

This paper is an effort to complement the contributions made by researchers working toward the inclusion of non-English languages in natural language processing studies. Two novel Hindi language resources have been created and released for public consumption. The first resource is a corpus consisting of nearly thousand pre-processed fictional and nonfictional texts spanning over hundred years. The second resource is an exhaustive list of stop lemmas created from 12 corpora across multiple domains, consisting of over 13 million words, from which more than 200,000 lemmas were generated, and 11 publicly available stop word lists comprising over 1000 words, from which nearly 400 unique lemmas were generated. This research lays emphasis on the use of stop lemmas instead of stop words owing to the presence of various, but not all morphological forms of a word in stop word lists, as opposed to the presence of only the root form of the word, from which variations could be derived if required. It was also observed that stop lemmas were more consistent across multiple sources as compared to stop words. In order to generate a stop lemma list, the parts of speech of the lemmas were investigated but rejected as it was found that there was no significant correlation between the rank of a word in the frequency list and its part of speech. The stop lemma list was assessed using a comparative method. A formal evaluation method is suggested as future work arising from this study.


I. INTRODUCTION
One of the basic requirements to devise a tool to perform any task in Natural Language Processing (NLP) is a corpus that represents the target language or the target domain. Adhering to the 'Bender Rule', according to which researchers are required to name the language that was targeted by the study, we would like to inform the readers that the study focuses on Hindi, the official language of India. The language ranks third on the list of the languages with the largest number of first language speakers in the world [1]. The study aims at building a corpus in the aesthetics domain and utilizing the corpus, along with other corpora to publish an exhaustive list of stop lemmas based on their raw frequencies in the corpora. The corpus has been released under the GNU General Public License for public use, in order to ease the process of corpus acquisition. It is also necessary to mention that the text and the corpora that were acquired from various sources have been used solely for academic and research purpose. The corpus was created because of the difficulty that we faced while searching for and acquiring novels, stories and non-fictional content written by contemporary authors as well as content written by authors in India's pre-independence era, i.e., prior to 1947. The broad objective of the study is to utilize and release unbiased time-independent text in the form of a corpus. Since the study emphasizes the lexical aspect of text, features related to context and discourse have not been discussed in this paper. The stop lemma list created as part of this work has been built using text from multiple sources, thus making it suitable for generic consumption. The reported outcome is best as on date subject to the data used for this research. We believe that this corpus and the list of stop lemmas would be a useful resource for NLP tasks such as creating language models, text classification and information retrieval in Hindi.
The remaining paper is organized into three sections. Section II of the paper consists of the existing work in this area and the research questions. Section III contains the description of the corpus and the methodology along with the final list of stop lemmas. Section IV consists of the concluding remarks.

II. EXISTING WORK AND OBJECTIVES
Stop words are words that are present in a sentence solely for grammatical reasons and do not contribute to the information obtained from the text [2]. Hence if these stop words are identified and removed before using the text for a task, the performance of the task could improve. Many studies have focused on the importance of removing stop words as a pre-processing step for text processing tasks [3], [4] and [5]. Author in [6] manually extracted stop words based on parts of speech such as pronouns, prepositions, conjunctions etc. from two news-based corpora to create a stop word list in Hindi consisting of 275 words. Author in [7] created a stop word list by converting words to lemmas in a corpus of news articles consisting of 441,153 words. Author in [8] proposed a method for automatic stop word generation that created stop word lists that matched the top twenty stop word lists from four publicly available lists. Their corpus was based on news articles. Although we found studies that focused on automatic stop word generation [9], [10], [11], [12], [13] and others that published the stop word lists for public use [14], we could not find an exhaustive publicly available list of stop words in Hindi based on multiple corpora. Another problem with multiple lists based on one or two corpora is the inconsistency of the words in the lists [15]. Author in [16] manually created a stop word list consisting of 256 words from Punjabi poetry and articles. This list was brought down to 184 unique words by lemmatizing the words. We would like to emphasize here that the researchers lemmatized the words after identifying a word as a stop word as opposed to lemmatizing all the words and then identifying the stop words. [17] collated a list of stop words in numerous languages spanning multiple countries. www.ijacsa.thesai.org The researchers observed the presence of multiple word forms in the stop word lists depending on the morphological complexity of the language. We catered to this issue while framing the objectives of our study. Author in [18] manually created a list of stop words based on news corpora in Gujarati. The researchers also aimed to find a pattern in the stop words that were assigned the same part of speech tag but they could not find any similarity. Another study used a threshold based approach to identify stop words in Sanskrit text by calculating the frequencies of the words in the text [19]. The researchers removed nouns from the list under the assumption that nouns cannot be treated as stop words. We aimed to test the validity of this assumption through our study.
The source of data in majority of the studies discussed in this section was news articles. All these studies, except for one, focused on the creation of stop word lists, instead of stop lemma lists. This would lead to the creation of a list that consists of a morphological variation of a word instead of the root word itself, which could negatively affect the quality of the NLP task owing to the non-robust nature of the list, hence defeating the purpose of its creation. One of the studies focused on assigning a part of speech tag depending on the stop word. We formed one of our objectives along similar lines, but our emphasis was on the relationship between parts of speech and stop words in general, rather than on categorizing similar stop words under a particular lexical category.
The objectives of the study were framed as follows, based on the analysis of the existing work: a) To determine whether the existing publicly available stop word lists are inter-changeable b) To determine whether the part of speech of a word can be used to determine whether the word is a stop word or not c) To create an exhaustive stop lemma list irrespective of the target domain In order to meet the above objectives, our study was focused on finding answers to the following research questions: RQ1: Are the top ten stop words across all the available stop words lists the same, irrespective of the rank of the words in the lists? RQ2: Is there any significant relationship between: a) The part of speech of a word and its rank in the stop word list?
b) The part of speech of a lemma and its rank in the stop lemma list? RQ3: How can the available resources be synthesized to create an exhaustive stop lemma list?

A. Corpus Creation and Corpora Metadata
This section discusses the creation of our corpus and throws light on the metadata of the corpora we collected. We scraped novels and stories from http://hindisamay.com, an e-library maintained by Mahatma Gandhi Antarrashtriya Hindi Vishwa Vidyalaya (translated to Mahatma Gandhi International Hindi University), Wardha, http://premchand.co.in, a website dedicated to the popular novelist Premchand's stories, and Bhandarkar Oriental Research Institute's Digital Library (http://borilib.com). Scrapy, an open source tool, was used to extract content from websites. We also extracted content from PDF of novels that were not available on the specified websites, but the text thus extracted could not be used because of encoding errors. Owing to issues such as this and lack of availability of public content, the size of the corpus is not comparable with that of articles present in Wiki dumps. However, the corpus, in combination with other corpora can become an effective resource.
As a preprocessing step, we split the text into sentences, tokenized the sentences and deleted special characters, English tokens and Latin numbers. Joined words were not segmented. The details with respect to the vocabulary of our corpus and the corpora that we acquired from various sources to create a stop word list, can be seen in Table I. LR indicates the language resource specified in the Language Resources section. LR#12 refers to the corpus that we created.
The numbers presented in the table against LR#13 to LR#14 were generated by calculating the raw frequency of the words, and by fetching the lemmas of the words present in the corpora and calculating the raw frequency of the lemmas, as this information was not readily available. The corpora included in the table span over ten domains and a period of over hundred years.
The state-wise distribution of the authors whose works have been included in the corpus, can be seen in Fig. 1.
As depicted in the figure, majority of the work is associated with authors whose native state is Uttar Pradesh, a state in northern India. Another observation made in the metadata was that only 4.84% of the publications are authored by females. This number was 0 in the pre-independence era for this corpus. However, the number increased as the years passed, although it is incomparable with that of the male authors. 978 articles including novels, short stories and non-fictional texts were collated from the sources mentioned earlier. Out of these 978 articles, the metadata of 164 articles could not be found. We were unable to find details of authors who are not highly accomplished, but we believe that the authors are not amateur writers since the university chose to publish their work on their website.

B. Stop Word List Creation
In this section, we attempt to answer the research questions framed in Section 2.
RQ1: Are the top ten stop words across all the available stop words lists the same, irrespective of the rank of the words in the lists?
The null hypothesis was framed as follows: H0: The top ten stop words across all the available stop words lists are the same, irrespective of the rank of the words in the lists.
HA: The top ten stop words across all the available stop word lists are not the same, irrespective of the rank of the words in the lists.
We created a list of the top ten stop words obtained from publicly available lists and generated the list from corpora released by Technology Development for Indian Languages (TDIL), Ministry of Electronics & Information Technology (MeitY), Government of India., Centre for Indian Language Technology (CFILT), IIT Bombay, Wikipedia dump, http://opus.nlpl.eu that consists of subtitles obtained from opensubtitles.org, and the aesthetics corpus that we created using articles collected from the sources mentioned above. We did not have access to the corpora that were used to generate the publically available lists as these lists have neither been released as formal resources nor has the process of creation of these lists been published as research papers. Table II contains the list of the top ten stop words in each source.
A word cloud generated from the list of these stop words and their count in the lists across all the sources, can be seen in Fig. 2. The size of the word in the figure is directly proportional to its count. It was observed that the word lists were not consistent across the sources. Out of the 19 sources and 82 unique stop words, the maximum count of sources in which a word appeared was 14. We also found inconsistencies in the list of lemmas. The list of top ten stop lemmas from each source can be seen in Table III. Table III consists of fewer language resources as  compared to Table II, as there exists to publicly available lists of stop lemmas at the time of writing this paper. Hence, we used the corpora in hand to generate the lemmas of the words and sorted them based on their raw frequencies.
A word cloud generated from the list of the top 10 lemmas and their count in this list across 8 sources (LR#12 to LR#19), can be seen in Fig. 3.
An interesting observation here was that among 8 sources and 22 unique stop lemmas, the maximum count of sources in which a lemma appeared was 8, which is a significant improvement over the count of the words in the stop word list. Hence, though we reject our null hypothesis and accept the alternative hypothesis, we believe that while creating an exhaustive list of stop words, we should consider the lemmas of the words and not just various morphological variations of a word.

RQ2: Is there any significant relationship between:
a) The part of speech of a word and its rank in the stop word list?
b) The part of speech of a lemma and its rank in the stop lemma list? The null hypotheses were framed based on a general assumption that parts of speech such as prepositions, postpositions and symbols are commonly marked as stop words. a) H0: There is a significant relationship between the part of speech of a word and its rank in the stop word list.
HA: There is no significant relationship between the part of speech of a word and its rank in the stop word list. b) H0: There is a significant relationship between the part of speech of a word and its rank in the stop word list.
HA: There is no significant relationship between the part of speech of a lemma and its rank in the stop lemma list.
In order to test the hypotheses, we considered the parts of speech of all the words that appeared in the top ten ranks of the stop word lists for each source from S12 to S19. The parts of speech for every word and every lemma in each list was determined for all the sources using the part of speech generated from the Hindi treebank of the stanfordnlp POS processor. The point biserial correlation coefficient between the parts of speech present in the top ten ranks of the stop words lists, and the rank of the word in the list was calculated. The parts of speech of 10722288 unique words and 335089 lemmas were analyzed. The results can be seen in Table IV. The table consists of descriptive statistics of the coefficient values across the lists generated from all the sources.
It was observed that there was no significant relationship between any part of speech and the words' ranks. Hence the null hypotheses were rejected and the alternative hypotheses were accepted. This bursts the assumption that words belonging to parts of speech such as prepositions, postpositions, conjunctions etc. would be ranked among the top stop words.
RQ3: How can the available resources be synthesized to create an exhaustive stop lemma list? In an attempt to solve this question, we collected the top 100 stop words from the existing publicly available lists, took a union of all the lists, and replaced the words with their lemmas. The lemmas were used owing to the observation made while searching for the answer to RQ1. This was named Set A. Set B was created by taking a union of the top 100 stop lemmas from all the corpora. The exhaustive stop word list C was another set that was generated by taking an intersection of Set A and Set B.
where Seti represents the set of stop words in source i; 1<=i<=11 and genLemma is a function that replaces each word in the set with its corresponding lemma where Setj represents the set of the top 100 most frequent lemmas in source j; 12<=j<=19 Set C = Set A ∩ Set B 1096 words from publicly available lists were brought down to 1071 words after removing the duplicate words and phrases. A list of unique lemmas of these words was created. The length of this list was 370.
Set B, which was created by fetching the lemmas of all the words in the corpora, consisted of 213554 lemmas. These lemmas were generated from 13811781 words that were brought down to 405111 words after removing the duplicates.
The final list of stop lemmas was created by combining both the lists and extracting the common lemmas. The list consists of 311 lemmas, as can be seen in Table V. The lemmas have been arranged in decreasing order of frequency from top to bottom.
In order to assess this list, we compared our list with the English stop word list present in Python's Natural Language ToolKit package (nltk). We used Google Translate with manual intervention to translate the English stop words into their equivalent Hindi words. We chose English as we could not ascertain the correctness of translation of other languages in Hindi. The lemmas of these translated words in Hindi were generated and compared with the exhaustive stop lemma list. 74 unique equivalent Hindi stop lemmas were generated for the 179 words in the English stop word list. We could not translate the words 'being', 'will' and 'shall'. Although 'shall' and 'will' were not present as it is in the English stop word list, we expanded the word form ''ll' to 'shall' and 'will'. An ambiguity of a particular kind was found while analyzing the words in the English stop words list. The list contained the word 'won', which was next to words such as 'shan' and 'aren'. Here, we did not consider the meaning of 'won' to be the past tense of 'win'. On the other hand, it was a form of wouldn't, i.e., won't, wherein the apostrophe and the 't' were removed from the word. We used the same approach for words such as 'shant', 'aren', etc. Following the disambiguation, translation and lemmatization phase, we observed that out of the 74 unique lemmas, 73 lemmas were present in our stop lemma list. The lemma that was absent from our list was 'जरूर', the Hindi equivalent of the English stop word 'must'.

IV. CONCLUSION
At the onset, we set out to create a corpus consisting of aesthetic texts in Hindi and to study the stop words present in this corpus as well as other existing corpora and publicly available lists, with the intent of finding a comprehensive list of stop words. Through this study, we have achieved the following broad objectives: a) To create an unbiased time-independent aesthetics corpus in Hindi b) To study the relationship between the part of speech of a word and its rank in the frequency list c) To provide an exhaustive stop lemma list in Hindi In order to achieve the first objective, we used texts belonging to fictional and non-fictional categories.
An aesthetics corpus consisting of approximately 1000 articles and 145508 words was built using text from various sources. The gender distribution of the authors was discussed. We could not find significant amount of work by female authors that is available in digital format for public consumption. The corpus consists of both, contemporary texts as well as stories that are over a hundred years old. In order to create an exhaustive stop word list, texts were collated from seven different corpora. The stop word lists that are available for public use were also used. Stop word lists were generated from each corpus based on the word count. It was found that there was an inconsistency even in the top ten stop words across all the lists. We also studied the individual lists and corpora to determine whether the part of speech of a word could be used to determine whether the word can be considered to be a stop word but the results contradicted our assumption. Owing to this observation, while creating a comprehensive list, we did not focus on the part of speech as a characteristic of a word to deem it as a stop word. Only the frequency of the word was taken into count. Through our study, we propose the concept of 'stop lemma' list, instead of mere 'stop word' list, as it is believed to yield better results when used in NLP tasks. Therefore in order to make the list robust, the lemmas of the words were considered as it was found that stop lemmas were more consistent than stop words across different corpora, as can be observed from the density of the respective word clouds generated using the top ten words and the top ten lemmas in the sources . At the time of writing this paper, such a list does not exist in the public domain. This list is limited to the texts collected from the specified sources. We assessed the quality of the list by translating, lemmatizing and comparing the English stop words present in NLTK (a Python package that is widely used in English NLP tasks), with the lemmas from our list. We chose English owing to the familiarity with the language as it helped in determining the quality of translation. We found that almost every lemma that was present in the English list mapped to a corresponding lemma in the Hindi list. Future work arising from this study could comprise evaluation of the list by using it with different NLP tasks, as well as a comparison of the results of a specific NLP task with respect to the use of stop words and stop lemmas in various languages. The corpus, the metadata as well as the analysis has been made available for public consumption at https://github.com/ gayatrivenugopal/hindi-corpus-stoplemmas.

ACKNOWLEDGMENT
This study is funded by Symbiosis Centre for Research and Innovation, Symbiosis International (Deemed University). We would also like to thank the organizations that provided the corpora that were used in the study, and the Stanford NLP group that released the stanfordnlp package for Python which was used to fetch accurate lemmas of the words in Hindi. The "Linguistic Resources" obtained from TDIL have been developed & made available by TDIL, MeitY, Government of India. The resources from CFILT, IIT-Bombay have been downloaded from http://www.cfilt.iitb.ac.in/Downloads.html.