Cross-Language Plagiarism Detection using Word Embedding and Inverse Document Frequency (IDF)

The purpose of cross-language textual similarity detection is to estimate the similarity of two textual units written in different languages. This paper applies the distributed representation of words to cross-language textual similarity detection using word embedding and IDF, and introduces a novel cross-language plagiarism detection approach built on the distributed representation of words in sentences. Textual similarity is first improved with a novel method called CL-CTS-CBOW; a syntax feature is then added to the approach with a novel method called CL-WES; finally, the approach is improved with the IDF weighting method. The corpora used in this study are four Arabic-English corpora, specifically books, Wikipedia, EAPCOUNT, and MultiUN, which contain more than 10,017,106 sentences and include both parallel and comparable alignments. The proposed method combines different methods to confirm their complementarity. In the experiments, the proposed system obtains 88% English-Arabic similarity detection at the word level and 82.75% at the sentence level across the various corpora.

Keywords—NLP; cross-language plagiarism detection; word embedding; similarity detection; IDF


I. INTRODUCTION
Plagiarism is a major problem today. Cross-lingual plagiarism (CLP) is a type of plagiarism that occurs when texts are translated from one language to another without citing the original sources. Monolingual plagiarism analysis, which detects plagiarism in documents written in the same language, has been addressed by many researchers, but CLP remains a challenge. Earlier studies have used approaches such as cross-lingual explicit semantic analysis (CL-ESA), syntactic alignment using character n-grams (CL-CNG), dictionaries and thesauruses, statistical machine translation, online machine translators [1] [6], and more recently, semantic networks and word embedding [7]. However, these approaches are specific to bilingual plagiarism detection tasks and are normally not sufficient for low-resource languages.
Meanwhile, word embedding is a significant representation technique used to represent sentence units in natural language processing (NLP) applications [15]. It relies on low-dimensional vector representations of words and can easily measure syntactic and semantic relationships. Currently, a variety of NLP applications depend on two word embedding models: the word2vec model [12] and the GloVe model [17]. The word2vec model is a neural network with three layers: one input layer, one hidden layer, and one output layer. In contrast, the GloVe model learns global vectors for word representation [21].
In this paper, we explore the performance of the distributed representation of word embedding to propose novel cross-lingual similarity procedures for similarity detection. We use word embeddings with the IDF weighting method.

II. RELATED WORK
Word embedding is used in natural language processing as a representation of the vocabulary of a document. This method depends on identifying the context of a word (its syntactic and semantic similarities) relative to other words using vector representation and involves two models: the word2vec and GloVe models. Recently, these two word embedding models have been used in various natural language processing applications [21].
This processing starts by converting words into vectors; the cosine similarity is then used to measure the semantic similarity between two words [13]. The previous method for representing a word vector was a "one-hot" representation, where the number of dimensions of each vector equals the size of the vocabulary. Modern word embeddings are accessible for the study of semantic and syntactic similarities.
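As an illustration, the cosine similarity between two word vectors can be computed as follows (a minimal sketch; the three-dimensional vectors are invented toy values, not real embeddings):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative toy embeddings (not learned vectors).
vec_king = [0.9, 0.1, 0.4]
vec_queen = [0.85, 0.15, 0.45]
vec_apple = [0.1, 0.9, 0.2]

print(cosine_similarity(vec_king, vec_queen))  # high (close to 1)
print(cosine_similarity(vec_king, vec_apple))  # much lower
```

Words with similar contexts receive vectors pointing in similar directions, so their cosine similarity approaches 1, while unrelated words score lower.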
Word2vec is one type of neural network with three layers: an input layer, a hidden layer, and an output layer. The number of dimensions of the vector that represents a word is the same as the number of neurons in the hidden layer. Typically, the word2vec model is trained on large datasets to capture syntax and semantics correctly. Word2vec mathematically detects similarities by clustering the vectors of similar words together in vector space. The created vectors capture word features through distributed arithmetic representations without human mediation. Additionally, given sufficient data, word2vec can make highly accurate estimates of a word's meaning based on its past appearances. These estimates can be used to establish a word's associations with other words, or to cluster documents and classify them by topic (for example, "man" is to "boy" as "woman" is to "girl"). In addition, these clusters can be used in sentiment analysis, where each item in the vocabulary has a vector attached to it that can be fed into a deep-learning network or analysed to discover the relations between words.
The main approaches of word2vec are the skip-gram model and the bag-of-words model (BOW), both of which have achieved improvements in computational cost and accuracy. The two approaches share the same hyperparameters, such as the window size, denoted by C, and the vocabulary size (the number of words in the corpus), denoted by |V|. These two approaches are explained briefly below.
The continuous bag-of-words technique (CBOW) takes the context of each word as input and, using a linear classifier, predicts the middle word corresponding to that context [10] [21]. A deeper analysis of CBOW shows that the input is a one-hot encoded C×V matrix of the context words; the hidden layer contains N neurons and takes an average over all C context inputs; and the output layer is a vector of V softmax values, as shown in Fig. 1.
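This forward pass can be sketched in a few lines of NumPy (a minimal illustration with random weights; the dimensions V, N, and C are toy values, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 6, 4, 2          # vocabulary size, hidden neurons, context words

W_in = rng.normal(size=(V, N))   # input->hidden weights (the embedding matrix)
W_out = rng.normal(size=(N, V))  # hidden->output weights

def one_hot(idx, size):
    x = np.zeros(size)
    x[idx] = 1.0
    return x

# Context words at vocabulary indices 1 and 3; CBOW averages their projections.
context = [one_hot(1, V), one_hot(3, V)]
h = np.mean([x @ W_in for x in context], axis=0)  # hidden layer: average over C inputs

scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()      # softmax over the V vocabulary entries

print(probs.shape)        # one probability per vocabulary word
```

The predicted middle word is the vocabulary index with the highest probability; training adjusts W_in and W_out so that the true middle word wins.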
The continuous skip-gram approach, or skip-gram technique (the second approach of the word2vec model), is very similar to the CBOW model; the difference between the two lies in the input and output layers. The input in CBOW is the context words and the output is the middle word, whereas the opposite occurs in the skip-gram model: the input is the present word, and the output is the context words. Fig. 2 shows that the skip-gram model has three layers. The input layer holds the input vector of length V for a single word. The hidden layer has the same definition as in the CBOW model, where h in formula (1) is simply the transposed k-th row of the weight matrix W, the row corresponding to the input word wI:

h = W^T_(k,·) := v^T_wI (1)

For the output layer, the model outputs C probability distributions, one for each context position, each with V probabilities (one for each word) [19].
The skip-gram model is efficient when training on small datasets with infrequent words, whereas the CBOW model is proficient with common words [15]. Moreover, the considerable challenge with both word2vec representations is learning the output vectors. To learn the output vectors efficiently, the hierarchical softmax and negative sampling algorithms can be used [13]. The first algorithm, hierarchical softmax, is centred on the Huffman tree (a binary tree), which uses word frequencies to arrange the words in the tree; the algorithm then normalizes at each step along the path from the root to the target word [15]. The second algorithm, negative sampling, updates only a sample of the output vectors, drawn from a noise distribution. Correspondingly, negative sampling is used in the case of low-dimensional vectors with more common words, whereas hierarchical softmax is used in the case of infrequent words.

A. Dataset
The dataset used throughout our study is the new dataset introduced by Aljuaid [2]. The characteristics of this dataset are as follows:
 written in English and Arabic;
 aligned at different levels (the document, sentence, and word chunk levels);
 includes both parallel and comparable alignments;
 covers several subjects;
 translated automatically or by humans, whether or not the translations are performed by professionals;
 collected from more than 3,000 random documents that were checked manually.
Table I shows the details of the dataset and presents the number of aligned units. Table II presents the different characteristics of the dataset within each corpus.

B. Outline of State-of-the-Art Methods
Cross-language plagiarism detection estimates the textual similarity between two textual units in two languages. In this section, the state-of-the-art methods used in this paper are discussed.
Cross-language character n-gram (CL-CnG) compares two textual units according to their n-gram vectors [11].
Cross-language conceptual thesaurus-based similarity (CL-CTS) is used to extract the roots of the textual units to measure the semantics of the words [16].
Cross-language alignment-based similarity analysis (CL-ASA) uses a bilingual unigram dictionary to estimate the probability that one textual unit is a translation of another, with the probabilities extracted from a parallel corpus [18].
Cross-language explicit semantic analysis (CL-ESA) denotes the meaning of a document by a vector based on concepts derived from Wikipedia according to the explicit semantic analysis [8].
Translation + monolingual analysis (T+MA) involves translating elements in two different languages into the same language to perform monolingual identification among the elements [3]. This state-of-the-art method is discussed in depth in our previous paper [2].

A. Model used
The word embedding representation is learned from, and is therefore compatible with, the corpus context: words with similar contexts are projected close together in a continuous multidimensional space. Consequently, word embedding can be used to detect and calculate similarities between sentences in the same or different languages.
Consequently, we used the word2vec CBOW approach toolkit offered by MultiVec [4]. To build and train the vectors, we use the large collection corpus discussed in [2].
To train the CBOW embedding system, some parameters that affect the resulting vectors must be selected. The selected parameters, shown in Table III, are a vector size of 100, a window size of 5, and 10 negative examples during training.

B. Textual Similarity
We introduce a new method to identify the similarity among textual words, in which the lexical resource of the cross-language conceptual thesaurus-based similarity (CL-CTS) method is replaced with the distributed representation of words. Using the CBOW model, each word of a pair (wi, wj) is represented by a vector (vi and vj, respectively). The similarity between wi and wj is obtained by comparing their vectors vi and vj using cosine similarity. We call this new implementation CL-CTS-CBOW, and this method is used to improve textual similarity.
Then, we implement a method, called CL-WES, that compares two sentences S′ = w1, w2, ..., wi and S″ = w1′, w2′, ..., wj′ in different languages, treated as two textual units U′ and U″. CL-WES uses the cosine similarity of the embedded vectors of all the units in the sentences to represent the distributions of the sentences [6]. CL-WES builds the vectors from the bilingual corpus of the two languages; cosine similarity is then applied to the two representation vectors V′ and V″.
The distributed representation V of a textual unit U = u1, u2, ..., un is calculated as:

V = Σi vector(ui) (2)

where vector is the function that gives the word embedding of each word ui in the textual unit. Fig. 3 shows our proposed system.
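A minimal sketch of this sentence-level comparison (the tiny embedding table is hypothetical; in the paper, the vectors come from the bilingual CBOW model trained with MultiVec):

```python
from math import sqrt

# Hypothetical toy bilingual embeddings with invented values.
embeddings = {
    "plagiarism": [0.8, 0.1, 0.3],
    "detection":  [0.7, 0.2, 0.4],
    "انتحال":      [0.78, 0.12, 0.31],  # Arabic "plagiarism"
    "كشف":        [0.69, 0.22, 0.41],  # Arabic "detection"
}

def sentence_vector(words):
    """Sum the embedding vectors of all words in the textual unit."""
    dims = len(next(iter(embeddings.values())))
    v = [0.0] * dims
    for w in words:
        for i, x in enumerate(embeddings[w]):
            v[i] += x
    return v

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

s_en = sentence_vector(["plagiarism", "detection"])
s_ar = sentence_vector(["انتحال", "كشف"])
print(cosine(s_en, s_ar))  # near 1.0 for near-parallel textual units
```

Because the two toy sentences are near-translations, their summed vectors point in almost the same direction and the score approaches 1.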

C. Syntax Similarity
In this section, the CL-WES model is improved by adding the syntax aspect, as discussed in Section 4.2, where U is a textual unit with n words. We start by applying a part-of-speech (POS) tagger to syntactically tag U; the tags are used to weight every word in the sentence representation, classifying each word into its morphosyntactic category. Then, we normalize the tags using the universal tagset [20]. Finally, a weight is assigned to each tag by the function Poswk, which determines the weight of the POS tag of wk [14].

Moreover, if U′ and U″ are two textual units in different languages, their representation vectors V′ and V″ are built using formula (4); then, cosine similarity is applied between them:

V = Σk weight(Pos(wk)) · vector(wk) (4)

where weight is a function that determines the weight of a POS tag, and vector is a function that outputs the word embedding vector.
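A minimal sketch of formula (4), the POS-weighted sentence vector (the tag weights and the two-dimensional embeddings are hypothetical illustrative values, not the weights used in the paper):

```python
# Hypothetical weights per universal POS tag (illustrative only).
pos_weight = {"NOUN": 1.0, "VERB": 0.75, "ADJ": 0.5, "DET": 0.1}

# (word, universal POS tag, embedding vector) triples for one textual unit.
tagged_unit = [
    ("the",     "DET",  [0.1, 0.2]),
    ("student", "NOUN", [0.9, 0.3]),
    ("copied",  "VERB", [0.4, 0.8]),
]

def weighted_sentence_vector(unit):
    """V = sum over k of weight(Pos(w_k)) * vector(w_k), as in formula (4)."""
    dims = len(unit[0][2])
    v = [0.0] * dims
    for _, tag, vec in unit:
        w = pos_weight.get(tag, 0.0)  # unknown tags contribute nothing
        for i, x in enumerate(vec):
            v[i] += w * x
    return v

print(weighted_sentence_vector(tagged_unit))
```

Content words (nouns, verbs) dominate the resulting vector, while function words such as determiners contribute very little, which is the intended effect of the syntax weighting.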

D. Combining Multiple Methods
To improve our method's performance in detecting cross-language similarity between English and Arabic, we combine our method with the IDF weighting method: during weight processing, the similarity score of each method is assigned, and the composite (weighted) score is calculated, as shown in Fig. 3. The distribution of the weights is optimized with the Bersini method [5]. One fold of every corpus is used to train the IDF weights, and the other folds are used to evaluate the IDF method.

1) IDF weighting method:
The IDF method constructs a compound weight for every word in a sentence. The IDF weight acts as a measure of how discriminative a term is with respect to the similarity between documents.
The method of Salton et al. [9] is employed, where one fold of each corpus is used as the input to be semantically verified. To compute the IDF weight for every word, the other folds in the corpus are used as the background collection. The IDF is calculated with the following formula:

idf(w) = log(S / WS) (5)

where S is the number of sentences in the corpus written in the two languages, Arabic and English, and WS is the number of sentences containing w. Then, the cosine similarity between V1 and V2, cos(V1, V2), is calculated to obtain the similarity between S1 and S2, where each vector is built as:

V = Σk idf(wk) · vector(wk) (6)

and idf(wk) is the weight of wk in the background collection.
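A small sketch of the IDF computation on a toy corpus (the sentences are invented; in the paper, the other folds of each corpus serve as the background collection):

```python
from math import log

# Invented toy corpus: each entry is one tokenised sentence.
corpus = [
    ["plagiarism", "is", "a", "problem"],
    ["plagiarism", "detection", "is", "hard"],
    ["word", "embedding", "detection"],
    ["a", "word", "is", "a", "unit"],
]

def idf(word, sentences):
    """idf(w) = log(S / W_S): S sentences in total, W_S containing w."""
    s = len(sentences)
    w_s = sum(1 for sent in sentences if word in sent)
    return log(s / w_s)

print(idf("plagiarism", corpus))  # log(4/2)
print(idf("embedding", corpus))   # log(4/1): rarer word, higher weight
```

Frequent words such as "is" receive low weights, while rare, discriminative words receive high weights, so they dominate the weighted sentence vectors compared in formula (6).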
Regarding the clustering capacity of the state-of-the-art methods, similar and dissimilar terms are correctly separated, and their ability to predict a (mis)match is determined. We combine these methods with IDF weighting to reduce uncertainty in the classification and to exploit their complementarity. These methods process the text differently according to their features: some are lexical syntax-based, others are semantic-based and process the aligned words, and others capture the context with word vectors.

A. Evaluation Indicators
To evaluate our method, a distance matrix of size N×M is built, where M = 1,000 and N is the size of the evaluated sub-corpus, previously denoted S. To process S, every textual unit is matched with its corresponding unit in the target language (i.e., to detect the similarity in the cross-lingual analysis); in addition, it is compared to M−1 units randomly selected from S. Each matching score obtained in the comparison is entered into the distance matrix. To identify the threshold of the matrix, the best F-score is used, defined as the harmonic mean of precision and recall, where precision is the fraction of retrieved matches that are truly similar units. All of the methods are applied to the Arabic-English corpus at the word and sentence levels. In every configuration, a particular method is applied to the sub-corpus for training and evaluation at a particular level. The evaluation folds are obtained by varying the M selected units. The F-score, precision and recall are calculated with formulas (7)-(9), respectively:

F = 2 · P · R / (P + R) (7)
P = TP / (TP + FP) (8)
R = TP / (TP + FN) (9)

where TP is the number of samples with positive similarity, TN is the number of samples with negative similarity, FP is the number of samples with negative similarity tagged as positive, and FN is the number of samples with positive similarity tagged as negative.
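Formulas (7)-(9) can be sketched directly in code (the counts below are illustrative only, not the paper's actual confusion-matrix values):

```python
def precision(tp, fp):
    """Fraction of retrieved matches that are truly similar units."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of truly similar units that are retrieved."""
    return tp / (tp + fn)

def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Illustrative counts only.
print(f_score(tp=880, fp=120, fn=120))  # 0.88
```

With these toy counts, precision and recall are both 0.88, so the F-score is also 0.88, matching the word-level figure reported for the IDF method.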

1) Use of word embedding evaluation:
Replacing the lexical resource with the distributed representation of words improves the F-score of CL-CTS-CBOW to 78% at the word level, which is better than the performance of the CL-CTS method, which obtains 59% at the word level and 54% at the sentence level, as shown in Table IV. Moreover, the use of CL-WES improves the performance at the word level to 86%, which is higher than the performances of the state-of-the-art methods, as shown in Fig. 4. Among the state-of-the-art methods, the best performance is from the CL-ASA method at both the word and sentence levels, but its overall performance is lower than that of CL-WES, which is the best single method evaluated.
2) IDF evaluation: The results of the IDF method are recorded at both the word and sentence levels in Table IV and Fig. 5. In each case, we combine five state-of-the-art approaches and the proposed novel approach. The IDF weighting method is better than the state-of-the-art approaches and the embedding-based approaches at all levels. At the word level, the IDF method has an F-score of 88%, whereas the best single method achieves an F-score of 86.5%. At the sentence level, the IDF method obtains 82.75% against 81.5% for CL-WES, which was previously the best. The results in Table IV confirm that the proposed modified approaches achieve enhanced performance. Additionally, the results in Table IV indicate that the embeddings are practical for Arabic-English cross-language similarity detection.

Finally, the performances of the methods indicate their capabilities with the dataset. In Fig. 6, we find that the precision improved by 1.54% in the Wikipedia and MultiUN corpora, the recall increased by 1.23%, and the F-score also increased by 2.05% in the Wikipedia and MultiUN corpora. By combining the performances of each method for the dataset, we find that the effect of the IDF method is better than that of the state-of-the-art methods, as discussed previously.

VI. CONCLUSION AND FUTURE WORK
A novel word embedding-based system is presented in this paper to measure cross-linguistic plagiarism between two languages. This method could be used for different cross-language similarities; in the training and evaluation phases, it is applied to the Arabic-English corpus as a special case. The proposed methodology improves upon a syntactically weighted distributed representation that operates using the cosine similarity of embedded vectors (CL-WES). The CL-WES model outperforms all of the top state-of-the-art methods. Conclusively, the outcomes achieved from the proposed system confirm that all methods are complementary and that their IDF weights are beneficial to the performance of cross-language textual similarity detection. The IDF method achieves an overall F-score of 88% at the word level, whereas the CL-WES method obtains an 86.5% F-score at the word level and the best state-of-the-art method obtains an F-score of only 64.75%. Additionally, at the sentence level, the methods show the same trends.
Our future work will be to improve the CL-WES method by exploring syntactic and semantic weights according to the plagiarist's stylometry. Additionally, a smart hybridization of the IDF weighting and POS tagging procedures will be applied to improve the results.