Towards Natural Language Processing with Figures of Speech in Hindi Poetry

Poems have always been an excellent way of expressing emotions in any language. Hindi poetry, in particular, is widely popular among native and non-native speakers all over the world. A typical poem in Hindi is characterized by meter (“Chhand”), emotion (“Rasa”), and figure of speech (“Alankaar”). The present research work is the first of its kind in Hindi Natural Language Processing (NLP) to address the Hindi figure of speech. The authors have created a systematic hierarchical structure of Hindi “Alankaar” types and sub-types and extended the work to identify a few of them. A taxonomical list of 58 Hindi figures of speech is presented along with their nearest mapping to English equivalents. The paper also presents the distinct rules for each type and sub-type needed for the NLP classification task. The authors achieved 97% accuracy in these first results, with an average execution time of 0.002 seconds.
Keywords: “Alankaar”; figure of speech; Hindi; Natural Language Processing (NLP); poetry


I. INTRODUCTION
Poetry refers to the poetic creation of a person and consists of a series of verses. Different types of poetry have been written from as early as the 25th century BCE [1]. Different languages and scripts follow different rules and conventions for writing poetry while still maintaining grammar [2]. Hindi is one of the prevalent languages of the world. It is a major spoken language of communication in India and is written in the Devanagari script. It is used along with English as the official language of the Government of India [3].
Many well-known writers have composed poetry in Hindi, and new poems are written every day. In Hindi poetry, "Rasa" (i.e., emotion), "Chhand" (i.e., verse meter), and "Alankaar" (i.e., figure of speech) are the essentials of poetic composition [4]. Although some progress can be seen in research on "Rasa" and verse meter, research on the figure of speech is almost absent. The figure of speech, known as "Alankaar" in Hindi, can make any poem magical through its presence.
Our contribution through the present research work includes:
• A detailed exploration of Hindi "Alankaar",
• The standardization of a taxonomical classification structure for the different types of "Alankaar", and
• A specific methodology for identifying three popular Hindi "Alankaar".
These contributions could be well exploited for Natural Language Processing (NLP) of the Hindi language, particularly for the classification task.
Notably, all of these are reported for the first time in the scientific literature. The remainder of the paper is organized as follows: literature review, description of the rules and creation of the "Alankaar" hierarchy, "Alankaar" identification, and results. The paper ends with the conclusions derived from the work and some pointers to future work.

II. LITERATURE REVIEW
An extensive literature review was carried out for this research, covering research articles, books, blogs, and online portals, to gather information and to establish the current state of research in this specific segment. For internationally well-known languages such as Arabic, Chinese, English, and Persian, research can be found on poetry and on related areas such as emotion detection, text classification, and identification [5][6][7][8]. Some research works related to poetry were also found for Indian regional languages such as Hindi, Marathi, and Punjabi [9][10][11].
Saini and Kaur [13] worked on an annotated corpus of Punjabi poems for emotion detection based on nine different types of emotions ("Rasa"). Pal and Patel [12] introduced a model for the classification of Hindi poems. Audichya and Saini [14] worked on a unified rule-based technique for automatic metadata generation based on different meter rules in Hindi poetry. Kushwah and Joshi [15] researched the detection of a specific type of verse meter named "Rola". Bafna and Saini [16][17][18][19] used eager machine learning and concept learning algorithms to classify Hindi verses.
After this exploration, the authors can state with confidence that the figure of speech, known as "Alankaar" in Hindi, is an untouched area of research for Hindi and for other Indian regional languages. The main reason for the lack of work in this area is that it is tedious to deal with and no initial research work has been carried out so far. To fill this gap, we present a path for further work in Hindi NLP from the perspective of "Alankaar". One of the goals of this research work is to organize and manage all available information related to the figure of speech in the Hindi language after proper collection, verification, and validation, so that future researchers need not deal with insufficient knowledge or contradictory information.
The literature review showed that there is no mechanism that can detect and identify the different "Alankaar" in Hindi. With metadata generated from such detection, the vast body of Hindi content can be organized systematically. The metadata can support better search results than regular keyword-based searching. This research work can also help digital libraries manage content based on the different types of Hindi "Alankaar". From the perspective of computational linguistics, it can help analyze write-ups based on the different types of Hindi "Alankaar". The authors therefore consider it a valuable and necessary novel work, for the reasons already emphasized, and there may be many more usage scenarios as well. These aspects were the motivation for carrying out this research work.

III. "ALANKAAR": THE HINDI FIGURE OF SPEECH
The figure of speech, written as "अलंकार" ("Alankaar") in the Devanagari script of the Hindi language, is an essential part of the creation of a poem. "Alankaar" means "ornament": just as a person's beauty is enhanced by ornaments, the grace of poetry is enhanced by the figure of speech. Whatever embellishes the poem is known as a figure of speech.
To identify and detect the different types of "Alankaar", we first need standard types and rules. These are not available systematically, and wherever they are available, they are either missing some information or lacking in detail [20][21]. The first and most time-consuming task of this research work was to go through the various sources to collect, verify, and systematically structure the hierarchical classes.
The rules and all their relevant information were collected from different sources such as educational materials, websites, blogs, and portals [22][23][24]. After this process, we structured everything, with the opinions of Hindi experts from academia, so that it can be helpful for upcoming research works.

A. Types of "Alankaar"
Based on their characteristics, the types of "Alankaar" are mainly divided into three streams: "Shabd Alankaar", "Arth Alankaar", and "Ubhay Alankaar". Each has its own set of rules and further sub-types, and even sub-sub-types [25]. We discuss them briefly in the subsequent sections.
1) "Shabd Alankaar" ("शब्दालंकार"): "Shabd Alankaar" is the first type of the Hindi figure of speech. It covers those figures of speech that embellish a poem through the words themselves: the beauty comes from placing a particular word in the poem, and it is lost when a synonym is used instead. Although not all the "Alankaars" in this category are purely based on words, their primary focus is on the words, which is why they are placed in this category.
"Shabd Alankaar" has further sub-classes and sub-sub-classes according to its types.
2) "Arth Alankaar" ("अर्थालंकार"): "Arth Alankaar" is mainly related to the meaning of the words. It covers those figures of speech that embellish a poem through meaning: the poetic effect arises from the meaning of a particular word placed in the poem. Although not all the "Alankaars" in this category are purely based on the meaning of words, their primary focus is on meaning, which is why they are placed in this category.
"Arth Alankaar" also has further sub-classes and sub-sub-classes according to its types.
Table I presents the mapping to English equivalents, indicating which equivalents were found and which were not found while collecting information for this research work. As structured, we finally have 58 types of "Alankaars" along with rules and examples, as shown in Fig. 1. It is not feasible to include all the rules of the different "Alankaar" types in this research paper. Some types of "Alankaar" may still be missing and can be added in future research works. The next task is to determine the challenges that one faces while dealing with the identification and automatic detection of the figure of speech.

B. Challenges
Research is always challenging, and that is part of its beauty, but research on the figure of speech in Hindi is particularly tedious and challenging. This is likely the reason this segment was still untouched and no initial research was found. While carrying out this research work, the following challenges were faced.

1) No previous research works: Initial research in a new area requires extra effort. As already discussed, no previous research works or articles were found, so one has to create one's own path to accomplish the research tasks. This requires massive effort because no dataset, algorithm, or implementation strategy exists.
2) Missing and conflicting information: Collection, verification, and systematic arrangement of information are among the initial tasks of any research work, and they are even more important in a new segment where no past research works or articles exist. During this collection and validation process, the authors came across sources in which some types were missing or the information was incomplete.
3) Context-based meaning: To deal with the meanings of Hindi words, one can integrate the existing Hindi wordnet or other libraries. However, context-based meaning is required, and it is not yet available; it remains a topic of ongoing research in its own right [26].
4) Homonymy and polysemy: "Alankaars" are all about words and their meanings. A single word can have multiple meanings and can be used to express different things, which is polysemy. Words that are spelled alike or sound the same but have different meanings are homonyms. This is another challenge, which is still a vast open issue and is essential for this research work, too [27].
5) Multiple "Alankaars" detection: By the nature and characteristics of "Alankaar", there can be multiple "Alankaars" in the same lines or even in one part of a poem. In comparison, it has usually been observed that a part of a poem has only a single type of "Rasa" or "Chhand".
6) Unavailability of datasets for experiments: Any research work needs a dataset. Since no research has been done on this specific problem, and since datasets are a challenge in other poem-related Hindi research as well, no ready-made or open-source dataset is available. The only option is to build the dataset oneself, which requires additional effort and time.
Despite all of the listed challenges, we focused on the most suitable problem-solving methods, which are discussed in the following "Alankaar" Identification section.

IV. "ALANKAAR" IDENTIFICATION
To identify and detect the "Alankaars" used in Hindi poetry based on their different rules, we implemented three popular and widely used "Alankaars" out of the 58 types mentioned.
For example, to identify and detect two of these "Alankaars", namely "Anupras" (i.e., alliteration) and "Punrukti", we need to know the rules of both types. For "Anupras", the rule says that when a specific character occurs repeatedly, there is "Anupras". For "Punrukti", the rule says that when a word occurs twice consecutively, there is "Punrukti". Consider the following example, which fits both types: 'ठुमुकि-ठुमुकि रुनझुन धुनि-सुनि, कनक अजिर शिशु डोलत।' The input is accepted as Unicode Standard [28] text encoded in Unicode Transformation Format-8 (UTF-8). On close observation, characters such as 'क' and 'न' occur more than once, so "Anupras" is present. The word "ठुमुकि" also occurs twice consecutively, which fulfills the "Punrukti" rule. This is how a reader can understand the concept, but for a computer to identify the same computationally, a systematic process is needed. We have designed this process so that further "Alankaar" implementations can be added quickly in the future.
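The two rules stated above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the helper names `tokenize`, `is_anupras`, and `is_punrukti`, the choice of separators, the restriction to Devanagari consonants, and the `min_repeats` threshold are all assumptions made for the example.

```python
import re
import unicodedata
from collections import Counter

# Devanagari consonants occupy U+0915 (क) through U+0939 (ह).
CONSONANTS = {chr(cp) for cp in range(0x0915, 0x093A)}


def tokenize(line: str) -> list[str]:
    """Normalize to NFC and split into words (assumed preprocessing)."""
    text = unicodedata.normalize("NFC", line)
    # Whitespace, commas, hyphens, and the danda marks act as separators.
    return [w for w in re.split(r"[\s,\-।॥]+", text) if w]


def is_anupras(line: str, min_repeats: int = 2) -> bool:
    """Rule as stated: some character occurs repeatedly.

    Only consonants are counted, and the threshold is an assumption.
    """
    counts = Counter(ch for w in tokenize(line) for ch in w if ch in CONSONANTS)
    return any(n >= min_repeats for n in counts.values())


def is_punrukti(line: str) -> bool:
    """Rule as stated: the same word occurs twice consecutively."""
    words = tokenize(line)
    return any(a == b for a, b in zip(words, words[1:]))


example = "ठुमुकि-ठुमुकि रुनझुन धुनि-सुनि, कनक अजिर शिशु डोलत।"
print(is_anupras(example))   # True: 'क' and 'न' each occur more than once
print(is_punrukti(example))  # True: "ठुमुकि" occurs twice in a row
```

With a threshold of two, the "Anupras" check fires on most lines of any length, so a practical rule would probably need a stricter criterion (for instance, repetition at word beginnings); the rule is kept here exactly as it is worded in the text.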
The simplest way to understand the implementation methodology is as follows:
Step 1: Start.
Step 2: Accept the UTF-8 text as input.
Step 3: Cleaning and Preprocessing operations.
Step 4: Perform Character Count and Word Count.
Step 5: Send the data to check the "Alankaar".
Step 6: Check "Alankaar" in "Shabd Alankaar": the data is passed on to the sub-type functions, and any type that is detected is added to the output result buffer.
Step 7: Check "Alankaar" in "Arth Alankaar": the data is passed on to the sub-type functions, and any type that is detected is added to the output result buffer.
Step 8: Check "Alankaar" in "Ubhay Alankaar": the data is passed on to the sub-type functions, and any type that is detected is added to the output result buffer.
Step 9: Return appropriate output by merging all the output in the buffer.
With this methodology, many "Alankaars" can be covered easily as soon as the specific "Alankaar" rules are modeled in the functional modules of the implementation script. In the implementation, the "isanupras()" function is called while checking "Shabd Alankaar". Like "isanupras()", the methods for the other "Alankaars" are executed automatically while checking the different classes of "Alankaar". In addition, if required from a computational perspective, the position of the detected "Alankaar" can also be included in the final metadata.
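The step-wise methodology can be sketched as a small dispatch pipeline. This is a hedged illustration under stated assumptions rather than the authors' script: the `REGISTRY` layout, the `identify` and `preprocess` functions, and the simplified detectors are hypothetical, and only the two "Shabd Alankaar" rules given in this section are modeled.

```python
import re
import unicodedata
from collections import Counter
from typing import Callable


def preprocess(text: str) -> list[str]:
    """Cleaning and preprocessing, simplified: normalize and split into words."""
    text = unicodedata.normalize("NFC", text)
    return [w for w in re.split(r"[\s,\-।॥]+", text) if w]


def is_anupras(words: list[str]) -> bool:
    # A Devanagari consonant (U+0915..U+0939) occurring more than once.
    counts = Counter(ch for w in words for ch in w if "\u0915" <= ch <= "\u0939")
    return any(n > 1 for n in counts.values())


def is_punrukti(words: list[str]) -> bool:
    # The same word occurring twice in a row.
    return any(a == b for a, b in zip(words, words[1:]))


# Primary type -> {sub-type name: detector}. A new rule is added by
# registering one more function here (hypothetical structure).
REGISTRY: dict[str, dict[str, Callable[[list[str]], bool]]] = {
    "Shabd Alankaar": {"Anupras": is_anupras, "Punrukti": is_punrukti},
    "Arth Alankaar": {},   # no detectors modeled in this sketch
    "Ubhay Alankaar": {},  # no detectors modeled in this sketch
}


def identify(text: str) -> dict[str, list[str]]:
    """Check each primary type in turn and merge detections into one result."""
    words = preprocess(text)
    result: dict[str, list[str]] = {}
    for primary, detectors in REGISTRY.items():
        found = [name for name, detect in detectors.items() if detect(words)]
        if found:
            result[primary] = found
    return result


print(identify("ठुमुकि-ठुमुकि रुनझुन धुनि-सुनि, कनक अजिर शिशु डोलत।"))
# {'Shabd Alankaar': ['Anupras', 'Punrukti']}
```

Keeping the detectors in a registry keyed by primary type mirrors the hierarchy of types and sub-types, so supporting another "Alankaar" only requires writing its rule function and registering it.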

V. RESULTS
In this research work, we started from scratch, and as a final result we have systematically sorted and arranged standard hierarchical data of "Alankaars". In addition, the authors successfully executed binary classification for three "Alankaars": "Anupras", "Punrukti", and "Yamak". The same example discussed in the "Alankaar" Identification section is used to explain the result. The code for the output depicted in Fig. 2 was written in Python version 3.9 and executed on a MacBook Air (13-inch, 2017) running macOS Big Sur version 11.1, with 8 GB of 1600 MHz DDR3 memory and a 1.8 GHz dual-core Intel Core i5 processor.
Fig. 2. Output of the Automatic "Alankaar" Identification.
Fig. 2 shows that the system took the input and processed it as discussed in Section IV. On completion, it reports that the input contains two "Alankaars", which come under the primary type "Shabd Alankaar" and whose sub-types are "Anupras" and "Punrukti". Notably, the whole process took just 0.002 seconds. Apart from these two, we have also incorporated "Yamak Alankaar" identification, which works well in general but fails in some scenarios. Improving it requires integrating the wordnet so that the meanings of different words can be compared.
Table II shows the result statistics of this research work, obtained after the design and implementation of the working model. Since this study involves no data-based training mechanism, all test inputs were unseen data. The test was carried out on 78 different UTF-8 inputs, and based on the results we achieved an overall accuracy of 97.00% with an average execution time of 0.002 seconds.
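Accuracy and average execution time of a binary detector can be measured with a small harness such as the one below. The paper does not describe how these figures were computed, so this is only one plausible approach: the `evaluate` function, the toy detector, and the four labeled samples are hypothetical and are not the authors' 78 test inputs.

```python
import time
from typing import Callable


def evaluate(detector: Callable[[str], bool],
             samples: list[tuple[str, bool]]) -> tuple[float, float]:
    """Return (accuracy, mean seconds per input) for labeled samples."""
    correct = 0
    elapsed = 0.0
    for text, expected in samples:
        start = time.perf_counter()
        predicted = detector(text)
        elapsed += time.perf_counter() - start
        correct += int(predicted == expected)
    return correct / len(samples), elapsed / len(samples)


def has_consecutive_repeat(text: str) -> bool:
    # Toy stand-in for a "Punrukti"-style check on whitespace-separated words.
    words = text.split()
    return any(a == b for a, b in zip(words, words[1:]))


# Hypothetical labeled inputs; the last label is deliberately wrong.
samples = [
    ("धीरे धीरे चल", True),
    ("कनक अजिर शिशु", False),
    ("पल पल बीते", True),
    ("डोलत शिशु डोलत", True),
]
accuracy, mean_seconds = evaluate(has_consecutive_repeat, samples)
print(f"accuracy={accuracy:.2%}, mean time={mean_seconds:.6f} s")
```

Timing only the detector call with `time.perf_counter()` keeps input loading out of the measurement; the absolute value depends on the machine, so it is comparable only across runs on the same hardware.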