Sensed-Lexicon based Approach for Identification of Similarity among Punjabi Documents

Textual similarity among documents often leads to copyright issues. Manual measurement of similarity among documents is time-consuming and infeasible. In this paper, we propose a technique for measuring similarity at the sensed-lexicon level for documents written in the Punjabi language using Gurumukhi script. 50 Punjabi document pairs were manually collected with the help of native Punjabi writers. The proposed technique consists of 4 major levels. Level 0 is the data collection phase. Level 1 consists of the noise removal and stop word removal sublevels. In level 2, extracted tokens were stemmed and lemmatized, and synonyms were replaced based on part of speech tagging. The vector space representation of each document leads to n-gram generation in level 2. Extracted n-grams were weighted based on term frequency. In level 3, string-based token-level similarity indexes, namely Jaccard Similarity Index (JSI), Cosine Similarity Index (CSI) and Levenshtein Distance Index (LDI), were applied to the weighted tokens. Human Intelligence Task (HIT) based rating was used to measure the similarity among documents on a scale of 0-100. Results obtained from HIT based rating are compared with results obtained from the proposed technique under various combinations of pre-processing levels. On the basis of majority voting, the combination of stop word removal with stemming and "noun" based synonym replacement, applied to bi-gram tokens, gave the best results. Statistical analysis indicates a strong correlation between CSI and HIT based rating.
Keywords: Cosine Similarity Index (CSI); Jaccard Similarity Index (JSI); Levenshtein Distance Index (LDI); n-gram; Punjabi; similarity checker


I. INTRODUCTION
"It is better to fail in originality than to succeed in imitation" Herman M.
Measuring similarity between words/terms, sentences, paragraphs and documents plays an important role in computational linguistics. Similarity measurement is a significant component of text classification, search engines, topic modelling, text summarization, legal document analysis, question answer generation, information retrieval, plagiarism detection and other language related research. Similarity is associated with finding the overlap between two documents. This overlap can be present at sentence level or document level. Similarity among documents can be identified at the lexical level and at the semantic level. At the lexical level, words and/or phrases are compared to identify similarity, whereas at the semantic level, contextual information associated with words or phrases is extracted and used for comparison.
In general, an automatic document similarity analyzer takes two documents and generates a similarity index for them. In this paper, document level similarity is identified at the sensed-lexicon level. The documents are written in the Punjabi language using Gurumukhi script, which adds one more layer of complexity to this task. This work has potential application in plagiarism detection in Punjabi documents. India is the land of languages. Numerous languages and their dialects are used in spoken as well as written form. Punjabi is one of them. Punjabi falls in the Indo-Aryan language category. It is reported as the first language of about 130 million people and is the 10th most spoken language in the world [1][2].
A lot of research has been carried out on measuring similarity among documents written in foreign languages, especially English. However, this area still needs to be explored for Indian languages. No work has been reported for the Punjabi language.

II. RELATED WORK
This section presents different works carried out in the area of detecting similarity among documents. Indexes for finding document similarity are broadly categorized into string based, corpus based and knowledge based measures [5]. String based algorithms perform character level or token level comparison. Corpus based methods detect similarity based on semantic information extracted from a large corpus, and knowledge based methods derive semantic similarity from information extracted from a semantic network.

A. Similarity Checking Work in Foreign Languages
Researchers [6] proposed a technique for handling semantically similar words/paraphrases in the Arabic language. Open Source Arabic Corpora (OSAC) was utilized for identifying suspected documents and Word2vec was used for experimentation. Various methods such as Term Frequency-Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), word2vec, Global Vector Representation (GloVe), and Convolutional Neural Network (CNN) were tested for paraphrase detection. Another group of researchers [10] used a deep learning based method to detect Arabic paraphrasing. This method consists of a pre-processing phase and a word2vec phase. A Convolutional Neural Network was used to generate the sentence vector. Authors [12] proposed a two layer plagiarism detection method for Arabic documents. This method consists of two layers: Fingerprinting and Word embedding. Documents were weighted using different techniques such as word alignment, POS tags, and inverse document frequency. With recall of 88% and precision of 86%, this method outperformed Plagdet. Different word embedding models were tested for capturing semantic similarity among sentences. In this work, the authors proposed a model (M-MaxLSTM-CNN) that employs multiple sets of word embeddings for evaluating sentence similarity. Multi-level comparisons among sentence embeddings, generated from multiple word embeddings, yield sentence similarity information. The proposed technique was evaluated on the STS Benchmark dataset and the SICK dataset from SemEval and outperformed all other existing methods [7].
*Corresponding Author www.ijacsa.thesai.org
Saptono et al. experimented with the Vector Space Model (VSM) for detecting plagiarism. In this work, the cosine similarity method was used to rank textual paragraphs from the query as well as the collection vector. The concept of conditional probability was utilized to extract a number of words from a paragraph. Results revealed that 54.28% average precision and 100% average recall were achieved with a threshold value of 0.3 for the conditional probability and 0.2 for cosine similarity [8]. Authors introduced the project ParaPhraser.ru for collecting a Russian paraphrase corpus and organizing a Paraphrase Detection Shared Task. Different techniques were tested for finding paraphrases in the Russian language. Results revealed that traditional classifiers with linguistic features outperformed other methods [13].

B. Similarity Checking Work in Indian Languages
The automatic plagiarism detection software Maulik was developed to check plagiarism among Hindi documents. Its approach is based on n-grams and comparison with repository and online documents. Input text was preprocessed using stop word removal and stemming. Different values of n were compared using the cosine similarity index to find the best value of n. The reported accuracy of 96.3 was better than that of existing techniques [3]. Authors proposed the Document Synset Matrix for Marathi (DSMM) technique for measuring similarity among Marathi documents. In this work, prose and verse were used for experimentation. The dataset consists of 1206 prose and verse texts. Problems such as sense identification of words and polysemy were handled by the proposed technique. The reported accuracy of 80 was better than that of existing techniques [4]. Another paper presented fuzzy semantic based and Naïve Bayes models for identifying obfuscated plagiarism in English as well as Marathi. Semantic relatedness information was analysed based on part of speech tags and WordNet measures. Results revealed that the Naïve Bayes model performed better than the fuzzy method [9]. Authors proposed a technique for detecting plagiarism in Urdu documents. Reordering of sentences and inter-textual similarity among Urdu documents were handled in this work. The proposed technique was evaluated using Support Vector Machine (SVM) and Naïve Bayes (NB). Its performance was better than that of existing techniques [11]. An author proposed deep learning based methods for handling the paraphrase detection task in Indian languages. A convolutional neural network with word embedding, WordNet score and LSTM based methods were tested [14].

III. METHODOLOGY
This section provides the detailed architecture of the system for finding similarity among Punjabi documents. Fig. 1 presents the architecture of the Punjabi document similarity analyzer, which consists of 4 levels. Each level (except level 0) takes input from the previous level and provides output to the next level. The working of each level is detailed below.

A. Level 0
The first step for any kind of analysis is the corpus. Due to the unavailability of a textual similarity corpus in Punjabi, similar document pairs were created. Two techniques were followed to create these documents. In the first, two human annotators (native Punjabi writers and speakers) were requested to write one page on a given topic. As this process was very time consuming, the internet was used as a second source for generating similar document pairs. Topics are varied: from recent topics such as the corona virus to festivals of Punjab, from motivational write-ups to short stories, from real heroes such as Bhagat Singh to real world problems such as pollution, and from religious gurus to motivational thoughts. In total, 50 document pairs (100 documents) written in the Punjabi language using Gurumukhi script were collected for further experimentation. Out of the 50 document pairs, 26 were written by native Punjabi users, whereas the remaining 24 were generated from the internet. Each document pair consists of two documents, D1 and D2, which were passed through the following levels. Table I provides the statistical details about the dataset.

B. Level 1
D1 and D2 are passed through various pre-processing sublevels. Existing similarity checkers for other languages (such as English) compare phrases and terms only. In the proposed technique, by contrast, comparison was not based on exact phrases alone; contextual information associated with phrases and terms was also checked using IndoWordNet [19].
The purpose of this sublevel is to reduce the noise in the input data. Punctuation marks and symbols were therefore removed from the documents (D1 and D2). Stop words were also removed from D1 and D2 [16] [22].
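The noise removal and stop word removal sublevels can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the small `STOP_WORDS` sample are hypothetical, and the actual stop word list is the one referenced in [16] [22]. Characters are filtered by Unicode category rather than with a `\w` regular expression, because Gurumukhi vowel signs are combining marks and must stay attached to their words.

```python
import unicodedata

# Hypothetical sample only; the real Punjabi stop word list comes from [16] [22].
STOP_WORDS = {"ਦਾ", "ਦੀ", "ਦੇ", "ਹੈ", "ਅਤੇ", "ਨੂੰ"}


def remove_noise(text):
    """Replace punctuation (category P*) and symbols (category S*) with spaces.

    Letters, digits and combining marks (Gurumukhi vowel signs) are kept.
    The danda sentence terminator has category Po, so it is removed too.
    """
    return "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch
        for ch in text
    )


def preprocess(text, stop_words=STOP_WORDS):
    """Level 1 sketch: noise removal, whitespace tokenisation, stop word removal."""
    tokens = remove_noise(text).split()
    return [tok for tok in tokens if tok not in stop_words]


if __name__ == "__main__":
    print(preprocess("ਪੰਜਾਬ ਦਾ ਇਤਿਹਾਸ, ਬਹੁਤ ਪੁਰਾਣਾ ਹੈ।"))
```

Whitespace tokenisation is an assumption made for brevity; a dedicated Punjabi tokenizer could be substituted without changing the rest of the sketch.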

C. Level 2
As mentioned earlier, document similarity in this work is identified at the sensed-lexicon level. The lexical level comprises the lexicons used in both documents. Lexical features have proven effective in Punjabi poetry classification [17] [21]. Identifying the correct sense of these lexicons leads to the sensed-lexicon.

In the next sublevel, the remaining words were normalized into their root form using Punjabi stemming rules [18]. ND1 and ND2 (normalized words from document 1 and document 2) were passed to the next sublevel.

Another important aspect of near-copy similarity is synonym replacement. An algorithm was devised to identify synonym replacement among the documents; its detailed steps are presented in Algorithm I. The effect of synonym replacement with stemming is presented separately in the results section. IndoWordNet was utilized for synonym replacement based on Part of Speech (POS) tags [19][20]. In this work, two POS tags ("noun" and "verb") were tested for identifying synonym information in a document.

These normalized words (ND1, ND2) form the Vector Space Representation of both documents (VSRD1, VSRD2) [24]. To give more weight to words that occur more often in a document, term frequency (TF) was used to weight the words in D1 and D2. In its commonly used normalized form, term frequency is computed as:

$$\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}$$

where $f_{t,d}$ is the number of occurrences of term $t$ in document $d$.
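The term frequency weighting described above can be sketched as follows. The normalized form (count divided by document length) is an assumption, since the exact variant used in this work is not recoverable from the text, and the function name `term_frequency` is hypothetical.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {token: count / total tokens} for one document.

    Assumption: normalized term frequency. A raw count variant would
    simply return the Counter itself.
    """
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {token: count / total for token, count in counts.items()}


if __name__ == "__main__":
    # Latin placeholders stand in for stemmed Punjabi tokens (ND1).
    nd1 = ["a", "b", "a", "c"]
    print(term_frequency(nd1))
```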

D. Level 3
In this level, weighted ND1 and ND2 tokens from VSRD1 and VSRD2 were divided into n-grams. Results are presented for n from 1 to 5. The generated n-grams were passed to the next sublevel, the document similarity level. Lexical similarity between documents was identified through the following techniques. Similarity of documents was reported on a scale of 0-100, where 0 means no overlap between the documents and 100 means a completely copied document.
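A minimal sketch of the n-gram generation step, assuming word-level n-grams built with a sliding window over the normalized token list (the helper name `ngrams` is hypothetical):

```python
def ngrams(tokens, n):
    """Return the list of word-level n-grams (as tuples) of a token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    nd1 = ["t1", "t2", "t3", "t4"]
    for n in range(1, 6):  # n from 1 to 5, as in this work
        print(n, ngrams(nd1, n))
```

A document shorter than n tokens yields no n-grams at all, which is one reason long n-grams rarely overlap between two documents.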
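a. Jaccard Similarity Index (JSI): This index was used to measure the similarity between two sets using the formula given below [24]:

$$\mathrm{JSI}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ represent the n-gram representations of weighted ND1 and ND2.

b. Cosine Similarity Index (CSI): This index was used to measure the similarity based on the angle between two vectors [25], where the documents were represented as vectors:

$$\mathrm{CSI}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \, \sqrt{\sum_{i} B_i^2}}$$

where $A$ and $B$ represent the n-gram representations of weighted ND1 and ND2.

c. Levenshtein Distance Index (LDI): This is an edit based similarity index. The number of edits in the form of insertions, deletions and substitutions is calculated. An overall bounded similarity index between 0 and 1 is generated [26]:

$$\mathrm{lev}_{A,B}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min \begin{cases} \mathrm{lev}_{A,B}(i-1, j) + 1 \\ \mathrm{lev}_{A,B}(i, j-1) + 1 \\ \mathrm{lev}_{A,B}(i-1, j-1) + 1_{(A_i \neq B_j)} \end{cases} & \text{otherwise} \end{cases}$$

It measures the distance between the first $i$ characters of $A$ and the first $j$ characters of $B$.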
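The three indexes can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, cosine similarity is computed on raw n-gram counts (which gives the same value as normalized term frequency, because cosine is scale invariant), and the bounded Levenshtein index is taken as one minus the edit distance divided by the longer length, a normalization the text does not specify.

```python
import math
from collections import Counter


def jaccard(a, b):
    """|A intersect B| / |A union B| over the sets of n-grams."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine of the angle between the n-gram count vectors."""
    vec_a, vec_b = Counter(a), Counter(b)
    dot = sum(vec_a[k] * vec_b[k] for k in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


if __name__ == "__main__":
    d1 = [("a", "b"), ("b", "c"), ("c", "d")]
    d2 = [("b", "c"), ("c", "d"), ("d", "e")]
    for index in (jaccard, cosine, levenshtein_similarity):
        print(index.__name__, round(100 * index(d1, d2), 2))  # 0-100 scale
```

All three functions accept any sequences of hashable items, so they work on strings as well as on lists of n-gram tuples.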
Implementation of this entire work was done in Python 3.7 [15]. Packages such as nltk, inltk and sklearn were used.

Algorithm I: Algorithm for finding synonyms of tokens based on the associated part of speech
Input: Document1 (D1) and Document2 (D2)
Output: All synonyms replaced in Document2
Step 1: Tag both documents with parts of speech using a part of speech tagger.
Step 2: Divide the document D2 into tokens (t1 … tn) and form Bag of Word2 (BOW2).
If the token is present in BOW2
    Continue with the next token in Bag of Word1 (BOW1),
Else
    a) Find the synonyms of the token using IndoWordNet and search for the presence of each synonym in BOW2
    b) If a match is found in BOW2, replace the matched synonym in BOW2 with the original token
    c) Go to Step 3
Step 5: End
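The synonym replacement loop of Algorithm I can be sketched as follows. The sketch is hedged in two ways. First, IndoWordNet is not queried here: a plain dictionary named `synonyms`, keyed by (word, POS tag), stands in for the lookup and is purely hypothetical. Second, the function name, the (word, tag) pair format and the English placeholder words are illustrative assumptions.

```python
def replace_synonyms(tokens1, tokens2, synonyms, allowed_pos=("noun",)):
    """Rewrite tokens2 so that synonyms of tokens1 words use the tokens1 word.

    tokens1, tokens2: lists of (word, pos_tag) pairs for D1 and D2.
    synonyms: hypothetical stand-in for an IndoWordNet lookup,
              mapping (word, pos_tag) to a set of synonym words.
    allowed_pos: POS tags for which replacement is attempted.
    """
    result = list(tokens2)
    words2 = {word for word, _ in result}
    for word, pos in tokens1:
        # Token already present in BOW2 (or POS not considered): next token.
        if pos not in allowed_pos or word in words2:
            continue
        for candidate in synonyms.get((word, pos), ()):
            if candidate in words2:
                # Replace the matched synonym in BOW2 with the D1 token.
                result = [(word, p) if w == candidate else (w, p)
                          for w, p in result]
                words2 = {w for w, _ in result}
                break
    return result


if __name__ == "__main__":
    d1 = [("car", "noun"), ("runs", "verb")]
    d2 = [("automobile", "noun"), ("runs", "verb")]
    lookup = {("car", "noun"): {"automobile"}}
    print(replace_synonyms(d1, d2, lookup))
```

Passing `allowed_pos=("verb",)` or `("noun", "verb")` switches between the POS settings tested in this work.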

IV. RESULTS AND ANALYSIS
The purpose of this research work was to find the most suitable similarity index for Punjabi documents. Similarity between documents can be identified either at the lexical level or at the semantic level. In this work, similarity between Punjabi documents has been measured at the lexical level (indicated with "A" in this work) with different combinations of pre-processing techniques. Document vectors of TF weighted n-grams have been used to compute the similarity index. The evaluation results are presented in two parts: the first contains the results produced by the algorithm, and the second contains the evaluation results given by human linguistic experts through HIT.

A. Results based on Algorithm
In order to find the similarity index at the lexical level (A), the different measures (as specified in the previous section) were tested with different combinations of pre-processing techniques. These combinations have been labelled with the characters a to e. Details of these measures with their codes are presented in Table II. Note that these codes have been coined by us for simplicity. Each document pair has been evaluated using the 5 combinations of pre-processing techniques (as indicated in Table II) together with n-gram values from 1 to 5. For a single document pair, 5x5x3 = 75 experiments have been run: 5 pre-processing combinations, 5 n-gram values and 3 similarity indexes. In total, 50x5x5x3 = 3750 experiments have been performed over the 50 document pairs. For each document pair, each combination from A.a to A.e was tested with n-gram values from 1 to 5. The results of each combination (considering only non-zero results for n-grams) have been averaged. Results were analyzed based on two findings:
1) Finding 1: For more than 38 document pairs, the similarity index values were 0 for n-gram values 4 and 5. These values were therefore excluded while calculating the average.
2) Finding 2: The best combination was selected by averaging the n-gram results (as per Finding 1) obtained for each combination. Combination A.a came out as the best of all combinations. However, the A.a results were ignored, because the presence of stop words accounts for its maximum overlap. Detailed results are presented in the next subsection.
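The size of the experiment grid and the averaging rule from the two findings can be sketched as follows. The names `experiment_grid` and `average_nonzero` are hypothetical, and the scores in the usage example are made-up placeholders rather than results from this work.

```python
from itertools import product

COMBINATIONS = ["A.a", "A.b", "A.c", "A.d", "A.e"]  # codes from Table II
N_VALUES = [1, 2, 3, 4, 5]
INDEXES = ["JSI", "CSI", "LDI"]
DOCUMENT_PAIRS = 50


def experiment_grid():
    """All (combination, n, index) settings tested for one document pair."""
    return list(product(COMBINATIONS, N_VALUES, INDEXES))


def average_nonzero(scores_by_n, excluded_n=(4, 5)):
    """Average the scores over n, skipping n = 4, 5 and zero values."""
    kept = [score for n, score in scores_by_n.items()
            if n not in excluded_n and score != 0]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    per_pair = len(experiment_grid())
    print(per_pair, DOCUMENT_PAIRS * per_pair)  # 75 3750
    # Placeholder scores for one combination and one index:
    print(average_nonzero({1: 60.0, 2: 40.0, 3: 0.0, 4: 0.0, 5: 0.0}))
```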

B. Results based on Human Intelligence Task (HIT)
For this work, each document pair was shared with 10 native speakers of the Punjabi language. The users selected for this research have a technical background and sound knowledge of plagiarism and similarity. They were requested to rate the similarity between two documents on a scale of 0-100. Ratings equal to 0 or 100 were treated as outliers and were not considered while calculating the Average Human Intelligence Task (AHIT) rating.
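The AHIT computation described above can be sketched as follows. The function name `average_hit` and the sample ratings are hypothetical; rounding to two decimals follows the convention used for the AHIT column in this work.

```python
def average_hit(ratings):
    """Average the human ratings, dropping the outlier values 0 and 100.

    Returns None when every rating was an outlier.
    """
    kept = [r for r in ratings if 0 < r < 100]
    if not kept:
        return None
    return round(sum(kept) / len(kept), 2)


if __name__ == "__main__":
    # Ten placeholder ratings for one document pair.
    print(average_hit([0, 35, 40, 45, 50, 55, 60, 65, 70, 100]))
```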

C. Analysis of Similarity Indexes
For each document pair, the best combination is selected on the basis of Average Jaccard Similarity Index (AJSI), Average Cosine Similarity Index (ACSI), and Average Levenshtein Distance Index (ALDI). Table III provides the results obtained with the algorithm and the index value obtained with the AHIT score. Values in column AHIT were averaged and rounded off to 2 decimal places. Table IV is derived from Table III based on the frequency count of each combination. From Table IV, it can be observed that combination A.c proves to be the best combination so far on the basis of the majority voting mechanism. The result of combination A.a is ignored as stated in Finding 2. *The total value reflected in Table IV is 54 because in DP-1, DP-11, DP-12 and DP-15, two combinations come out to be the best instead of one.
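The majority voting step that turns per-pair results into frequency counts can be sketched as follows. How the three averaged indexes are combined into one "best" decision per pair is not spelled out in the text, so the sketch assumes a single score per combination; the function names and all scores are hypothetical placeholders. Ties are counted once for each tied combination, which is how a total can exceed the number of document pairs.

```python
from collections import Counter


def best_combinations(scores, ignored=("A.a",)):
    """Return every non-ignored combination that reaches the top score."""
    eligible = {c: s for c, s in scores.items() if c not in ignored}
    top = max(eligible.values())
    return [c for c, s in eligible.items() if s == top]


def majority_vote(per_pair_scores):
    """Count how often each combination is best across document pairs."""
    tally = Counter()
    for scores in per_pair_scores.values():
        tally.update(best_combinations(scores))
    return tally


if __name__ == "__main__":
    # Placeholder scores, not values from Table III.
    pairs = {
        "DP-1": {"A.a": 90, "A.b": 50, "A.c": 70, "A.d": 70, "A.e": 10},
        "DP-2": {"A.a": 80, "A.b": 20, "A.c": 60, "A.d": 30, "A.e": 10},
    }
    print(majority_vote(pairs).most_common())
```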
In second phase of experimentation, all the results for combination A.c were compared for checking the existence of correlation with AHIT obtained. For finding the correlation among these values, distribution of data was identified.
Distribution details are presented in Fig. 2. As can be observed from Fig. 2, the data is not normally distributed, so the Spearman correlation coefficient was used to measure the correlation between the values obtained by the algorithm and the human scores [23]. Correlation strength values lie between -1 and 1. The Spearman correlation coefficient was computed between the 3 similarity index values and the average HIT score. Table VI presents the coefficient values. From Fig. 3, it can be observed that the highest coefficient value is 0.621 with p-value >0.05. The AHIT score is thus most strongly correlated with the average cosine similarity index value, and the ACSI values obtained with the algorithm have a strong association with AHIT (as indicated by the Table V values).
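The Spearman coefficient used above is the Pearson correlation of the ranks of the two series. A dependency-free sketch is given below; the function names are hypothetical, ties receive the average of their ranks, and the p-value is not computed. The input lists in the usage example are placeholders, not the ACSI or AHIT values of this work.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    algorithm_scores = [1, 2, 3, 4, 5]   # placeholder values
    human_scores = [2, 1, 4, 3, 5]       # placeholder values
    print(round(spearman(algorithm_scores, human_scores), 3))
```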

D. Analysis of n-gram
In this section, the effect of n-grams on the similarity task is studied. For this work, the value of n is taken from 1 to 5. As per Finding 1 in the results section, results for n equal to 4 and 5 are not taken into consideration. Analysis is carried out on unigrams (n=1), bigrams (n=2) and trigrams (n=3). Table VII presents the results obtained for the 50 document pairs for these n-grams.
For n-gram analysis, the n-gram wise results for each combination (A.a to A.e) are averaged. Values for trigrams in document pairs 4 and 10 are ignored and are not considered while calculating the column average. It can be observed from Table VII and Fig. 4 that bigrams (n=2) give the best result, whereas index values drop when n is increased to 3.

V. CONCLUSION
As Punjabi textual content on the web is increasing day by day, there is a need to check many such documents for similarity. Manually detecting similarity is a tedious task, so the main objective of this work was to automate the similarity detection process. As no textual similarity corpus was available, one was created manually through human annotators. 50 document pairs were collected for experimentation. Each document pair consists of information about the same topic. These document pairs were passed through various pre-processing techniques such as stop word removal, stemming, and part of speech based synonym replacement with the help of IndoWordNet. Different combinations of these techniques were tested with n-grams for values of n from 1 to 5. JSI, CSI, LDI and HIT based rating have been used for evaluation. Results indicated that the combination of stop word removal, root word conversion using stemming, and synonym replacement with the "noun" part of speech tag proved to be the best combination so far for detecting similarity among Punjabi documents. Out of the 3 indexes used for experimentation, the values obtained for CSI are the most highly correlated with HIT based rating.