Measuring the Similarity between the Sanskrit Documents using the Context of the Corpus

Identifying the similarity between two documents is a challenging but important task. It benefits various applications like recommender systems, plagiarism detection and so on. To process any text document one of the popularly used approaches is document term matrix (DTM). The proposed approach processes the oldest, untouched, one of the morphologically critical languages, Sanskrit and builds a document term matrix for Sanskrit (DTMS) and Document synset matrix Sanskrit (DSMS). DTMS uses the frequency of the term whereas DSMS uses the frequency of synset instead of term and contributes to the dimension reduction. The proposed approach considers the semantics and context of the corpus to solve the problem of polysemy. More than 760 documents including Subhashitas and stories are processed together. F1 Score, precision, Matthews Correlation coefficient (MCC) which is the most balanced measure and accuracy are used to prove the betterment of the proposed approach. Keywords—Cosine; dimension reduction; sanskrit; synset; matthews correlation coefficient


I. INTRODUCTION
The degree of matching between two text pieces based on their statistics as well as semantics is termed as the similarity between text pieces [24]. Statistics of the document means the length of the document, tokens present in the document, etc. The semantics of the document means the understanding meaning of the words present in the document. These documents/ text pieces can be in the form of the word, pdf and so on. There are various measures to calculate the similarity between the two documents. Jaccard, cosine similarity and so on. Cosine similarity is independent of the statistics of the document. Cosine similarity calculates the cosine value of the angle between two vectors. These vectors comprise the frequency of words in a multi-dimensional plane. Each word present in the document represents the dimension/feature [10]. Thus the orientation of the text document gets captured by cosine similarity instead of the magnitude only. Cosine similarity [11] is better than other similarity measures eg. Euclidean distance. The value near to '1' indicates the documents are most similar. Cosine value is always between '0' and '1'. Calculating cosine similarity between English, Hindi [12][13][14], Marathi [15] text documents [5] is a common task but processing Sanskrit language [33,30,28] and its morphological analysis [35] are critical tasks, as a result finding out the mapping between Sanskrit language texts is challenging. Sanskrit is assumed to be the mother of every language. Panini has introduced this grammar rich language before 2500 years ago. The Sanskrit language has been the traditional means of communication in Hinduism, Jainism, Buddhism, and Sikhism, still, Sanskrit text mining is an untouched area. Several kinds of literature are available in Sanskrit for eg. stories, subhashits and so on. A subhashita (Sanskrit: 'सु भाषित') can be explained as a legendary kind of Sanskrit concise poems to communicate the message of advice, aphorism and so on. Generally, Sanskrit subhashit or stories are related to all aspects of life. Subhashitas are significant in Indian traditional education and are used to teach values like truthfulness, courage, etc. which are applicable for each phase of life righteousness.
To extract any information from Sanskrit text, various techniques are used. DTMS is one of the techniques using which different operations can be carried out on Sanskrit corpus. Sanskrit documents are placed in rows and significant terms are placed in columns. The entry in the matrix represents the number of times the particular Sanskrit term occurred in the document. The significance of the term is decided based on the frequency of the term. The semantics of the term is considered in DSMS. DSMS uses synset groups in which semantically similar tokens are grouped. Instead of considering term frequency, synset group frequency is considered. It facilitates to solve the polysemy problem means one word used with different senses.
Dimension reduction means the removal of unnecessary features. Several methods are available for dimension reduction like principal component analysis, latent semantic analysis, etc. In the text processing [1] [2], Different NLP tasks [5][6][7][8][9] are carried out like removal of stop words [3][4] [32] [29] results in dimension reduction. Before stopwords, removal tokenization needs to be carried out for example 'ततो मषिका उषिय गता.' meaning 'The fly flew away' In this Sanskrit statement 'ततो' meaning 'from there' is removed after separation of tokens. Tokens of the sentence are 'ततो', 'मषिका', 'उषिय', 'गता', '.'. Lemmatization coverts words into their meaningful root form [31]. On the formulated document synset matrix, several applications could be built like plagiarism detection, document clustering, etc. Till now no research is carried out to find Sanskrit document similarity using semantics and context. To evaluate machine learning algorithms different parameters are available. Eg. Precision, accuracy, Matthews's Correlation coefficient. Matthews Correlation coefficient (MCC) is a quality measure for binary classification. It is a www.ijacsa.thesai.org balanced measure because it considers all false and true positives and negatives. Precision is defined as the ratio of relevant documents to the retrieved documents. Accuracy is the ratio of a number of correctly classified documents to the total number of input documents.
The arrangement of the paper is as mentioned. The existing work carried out by other researchers is written as a literature review in the second section. The research methodology is stated in the third section. The fourth section depicts results and discussions and conclusions are presented as the fifth section to end the paper.
The proposed approach is unique because 1) It constructs synset for Sanskrit corpus.
2) It extracts the context of the corpus using semantics.
3) It builds concept space for the corpus.
II. LITERATURE REVIEW Sanskrit is inflectionally strong language and the correct morphological analyzer is required to process Sanskrit text. In spite of being identified as a good analyzed language, morphological analysis of Sanskrit is a challenging task. A morphological analyzer is built which covers wider aspects of language and covers the complexity of words. The applications of the analyzer are used for Sandhi splitting, search engine, etc. [16] Modularity is used while building the morphological analyzer. Other modules like spell check etc. can be added easily. Some modules in the proposed approach are not working, but these modules are based on the grammar which can be easily handled manually. Accuracy of the analyzer is calculated and improvements are discussed [17]. One of the important and unique features of the Sanskrit language is focused known as a dual case. Confusion between plural and dual can be easily removed if grammar rules are followed. Indian treasure is explored to state its different common applications. Various challenges related to NLP along with features of NLP are explored. There are two aspects of similarity implicit and explicit, covering both of these aspects are necessary but critical for finding out the similarity between two texts. This is also termed as the NLP challenge. Different levels that are document or paragraph or sentence or word affect differently and that is why need to be considered to calculate the similarity between two text documents [18]. Sanskrit needs to be handled differently as its morphology is strong. Short texts of Sanskrit [34] are processed semantically using the morphological based approach. Words sematic membership in the sentence is considered assuming that each word has a different significance in the sentence. Ranking of the tokens is carried out using an adaptive measuring algorithm. Sanskrit complex text was processed [19] WordNet is a lexical database that includes vital components like Glosses etc. necessary to identify different NLP features. Domain ontologies could be built using WordNet also semantic relations, polysemy problems can be handled using Wordnet. Sanskrit Wordnet can be used to solve issues like word sense disambiguation (WSD) using gloss. Different techniques that have developed Sanskrit gloss are surveyed to depict different pattern types to solve WSD [20]. Text summarization is carried out to avoid the efforts and time required to read the document. Natural language processing tools are easily available for the English language. Very less work is explored for Indian languages because of the availability of fewer resources. Existing summarization techniques for Indian languages are surveyed and opportunities for research are discussed [21]. Sentiment analysis is useful for researchers to identify the views of individuals for various services, products, etc. The Internet has allowed exploring NLP tasks by providing a huge amount of data. Machine learning techniques facilitate to provide analysis of this data. The need for domain experts is been reduced for verifying the results due to deep learning algorithms. The sentiment analysis can be done using deep learning techniques and effectively than traditional machine learning techniques for resource-scarce languages. Major challenges like word embedding, ontology building are surveyed which can act as a stepping stone for NLP research [22]. Identifying the language present in the given text is called Language Identification (LI). This could be the important initial preprocessing step to carry out NLP tasks Automatic language recognition is the challenging process. India is a country of multiple languages and there is good scope for language identification problem, It will bridge the digital gap Indian and other languages. Hindi and Sanskrit text is separated using the N-gram approach. The languages which share the same scripts, the technique can be applied [23].

III. RESEARCH METHODOLOGY
The proposed approach is already being tested for Marathi and termed as DSMM [26] and are now experimenting with Sanskrit. Sanskrit parsers are not developed fully and still are in research phase. Same way the udmodel which is being used in the proposed approach a few times gives incorrect results for a few NLP tasks it impacts other sequential NLP processes. Knowledge of Sanskrit language is necessary and manual intervention, expertise is involved to get correct results unlike Document Synset Matrix Marathi [26]. Also, MCC is used to validate the results which were not used to asses DSMM. Also, the literature available for Sanskrit text similarity is very limited as compared to Marathi text similarity.
R programming is used which provides different packages like tm, quanteda, libraries like udpipe which provides a morphological analysis of the text, for example, identifying the part of speech (PoS), tense, gender and so on. Also functions like documentTermMatrix_tfidf(), udpipe_annotate (udmodel_ Sanskrit) and so on are provided by R programming. Wordnet provided by CFILT, IIT Bombay [27] is used to identifying the semantic relationship between different tokens. Fig. 1 shows the research methodology steps.

A. Collection of a Dataset, Creation of Corpus and Preprocessing
Sanskrit corpus is not available. The first step is to collect Sanskrit subhashits and stories [25]. The corpus belongs to 340 subhashits and 421 stories. The data is about 15 MB. Different preprocessing steps are carried out. In the first step, tokenization is carried out. Total tokens are 1, 11,123. Stop words are removed, total stop words counted are 34,287. Lemmatization is carried out on the remaining 76836 words. www.ijacsa.thesai.org Unique terms/ tokens are generated which are 56,198. The tokens having similar meanings are grouped and termed as synset groups. Synsets groups are formed and a total number of terms in the form of synset groups are 46,345.

B. Identification of Significant Terms and Formulation of DSMS
The frequency of each term is calculated and then the group frequency of each synset group is calculated. Group frequency means, the addition of the frequencies of all terms present in one synset group. Significant synset groups are identified based on a threshold. The synset groups having more than 75% threshold are considered while building DSMS. It means that, if maximum synset frequency is 100 then all those synsets having frequency more than 75, are considered as significant synsets/terms. Documents (subhashits and stories) occupy the rows of the matrix and synsets act as columns in the matrix. The entry in the table/matrix shows the frequency of that particular term/synset in the corresponding document. Due to synset formation, dimensions are reduced, also the context of the term gets considered. For example 'पिे ' means in favor of as well as on the other hand, if 'क ृ पा' (kripa) and 'पिे ' (pakshe) come together, it means its meaning is 'in favor of' and thus the semantics of the words gets identified. This problem of having multiple meanings of a single word is also known as word sense disambiguation or polysemy which is solved by identifying the sense in which it is used in the corpus. Also with the help of other identified significant terms / synsets, the context of the corpus is understood.

C. Calculate Document Similarity using Cosine Measure
Thus the proposed technique is different than the traditional one which considers only significant terms as column heads and formulates DTMS. DTMS not only lacks to identify semantics and context of the corpus but also dimensions are more than DSMS. On DSMS cosine measure is applied and the similarity of the document is calculated. Two documents are said to be similar if they have cosine measure greater than 0. 65 that is the threshold of cosine measure considered for document similarity is 65%. The value of the threshold is considered after evaluating thresholds ranging from 10% to 90%, precision calculated WAS MAXIMUM FOR 65% THRESHOLD. Table I presents the morphological analysis of the entire corpus. It includes all the documents with their index with respect to the corpus. The paragraph code is assigned for every document. It represents the position of paragraph in the document. The sentence rank of each sentence of a paragraph is represented by sentence code, for example in the first documents, in the first paragraph, the second sentence is shown at the 4 th Sr. No. The table also depicts lemma, part of speech and other information like gender, the tense of verbs, singular, plural and so on. For example for term 'असौ', it's lemma is 'अदस् ', its part of speech is a pronoun and its gender is masculine whereas 'प्राह' is a verb which is singular, representing the first person, having past tense and active voice. Instead of using any stop word removal technique, only adjectives, adverbs, nouns, and verbs are selected. It resulted in the removal of stop words in the form of discarding numbers, special characters, and pronouns, etc. Also instead of using stemming, lemmatization is used and unique lemmas/terms are identified. The frequency of each term is identified. For example the frequency of 'ध्यानम् ' is 30. Group frequency is calculated by adding up the frequency of similar terms 'वृ त्त', 'वाताा ' are similar terms and are grouped called as a synset group. Table II shows the frequency of 56198 unique terms in the corpus. Table III shows the frequency of 46, 345 synset groups. In DTMS 'वृ त्त' having frequency 4 is not considered as significant terms because its frequency is less than the threshold frequency. But it is an important word. This importance is considered in DSMS as it adds up the frequency and the particular words in synset groups are treated as significant words.  Table IV shows DTMS for a sample of four documents. The frequent terms are placed as column heads and documents are placed in rows. The entry in the matrix shows the frequency of that term in the document. For example, 'धमा ' has occurred 1 time in document 1. Table V shows the DSMS for the sample of five documents in the corpus. Synset groups are placed in a column, unlike DTMS. It's clear that not only dimensions are reduced but also semantics is being considered that is 'िषिय, नृ प' having the same meaning are grouped while formulating DSMS. In fact, by looking into other significant terms/synset groups context of the corpus is also.   V. CONCLUSIONS AND FUTURE WORK DSMS is proposed which identifies the semantics of the term. Synset groups are formed to achieve dimension reduction. Total terms present in the DSMS contribute to identify the context of the corpus. Thus polysemy problem gets solved for the Sanskrit corpus processing which is strongly inflectional language. F1 Score, precision, accuracy and Matthews Correlation Coefficient (MCC) are used to prove the betterment of the technique. F1 score and precision are improved by 0.2 measure and Accuracy and MCC are improved by 0.1 measure. A total of 761 documents are processed including Subhashits and stories. The same approach can be extended for other regional languages.