The Influence of Loss Function Usage in a Siamese Network for Measuring Text Similarity

In a text-matching similarity task, a model takes two sequences of text as input and predicts a category or a scalar value indicating their relationship. The model developed here measures the similarity between two such texts. It is a Siamese network built from two copies of the same CNN, taking text 1 and text 2 as the inputs of the two CNNs respectively. The output of each CNN is a feature vector of the corresponding input text, and both outputs are then fed to a loss function to calculate the loss value (i.e., similarity). This research implemented two types of loss function, Triplet loss and Contrastive loss, in order to examine their influence on the measured similarity between the two texts being compared. The metrics used for this comparison are precision, recall, and F1-score. Experiments on 1500 pairs of sentences, with the epoch value varied from 10 to 200 in increments of 10, showed that the Triplet loss function gave its best result at 180 epochs, with precision 0.8004, recall 0.6780, and F1-score 0.6713, while the Contrastive loss function performed best at 160 epochs, with precision 0.6463, recall 0.6440, and F1-score 0.6451. Hence, the Triplet loss function had a better influence than the Contrastive loss function in measuring the similarity between two given sentences.

Keywords—Sentence; similarity; triplet; contrastive; CNN; Siamese; dataframe


I. INTRODUCTION
The very fast growth of information nowadays causes a particular problem: information overload [21]. Among such huge collections of information, it is very likely that some items are similar, so they can be grouped into several classes based on their similarity. To overcome the overload, each class is then represented by a single piece of information. Identifying similar information requires a process called similarity measurement. Text similarity measurement is a text-mining approach capable of coping with information overload. The process begins with finding similar words, then sentences, then paragraphs, and finally documents [6]. A text similarity approach makes it easier for people to find relevant information. It strongly supports the success of text-mining operations such as search and information retrieval (IR), text classification, information extraction (IE), document clustering [8], sentiment analysis [4] [10] [16] [3] [13], machine translation, text summarization, and natural language processing (NLP). Text similarity may be measured by comparing texts, i.e., text matching. In text comparison tasks, a model takes two texts as inputs and predicts a category or a scalar value indicating the relationship between them. A large variety of tasks, such as natural language inference [2] [11], paraphrase identification [17], and answer selection [19], can be considered special forms of the text matching problem. Recently, deep neural networks have become the most popular choice for text matching. Semantic alignment and comparison of two sequences of text are the keys of neural text matching. Most previous deep neural networks contain a single inter-sequence alignment layer. To make full use of the alignment process, a model must take many external syntactic features or alignments as additional inputs at the alignment layer [5] [7], adopt a complex alignment mechanism [17], or build a large number of post-processing layers to analyze the alignment results [7].
This research proposes two models to compute the similarity value between two texts using a Siamese network, each using a different type of loss function: Triplet loss and Contrastive loss. The model consists of two copies of the same CNN, where each receives a text (or sentence) as input. Each CNN then produces a feature vector of its received text, and finally both vectors are fed to the loss function to compute the similarity value. Each model was tested using the same dataset, Quora Question Pairs, taken from https://www.kaggle.com/c/quora-question-pairs. To see the influence of using these two types of loss function, three metrics were computed: precision, recall, and F1-score.

II. RELATED WORKS
Deep neural networks are very dominant in the text matching area, where semantic alignment and comparison between two text sequences form the core of text matching [17]. The earliest work explored encoding each sequence individually into a vector and then building a neural network classifier over the two vectors. In this paradigm, recurrent [2], recursive [12], and convolutional networks [20] were used as sequence encoders. Because the encoding of one sequence was independent of the other, the final classifier had difficulty modeling complex relations. Therefore, subsequent work adopted the matching-aggregation framework to match two sequences at a lower level and aggregate the results based on an attention mechanism. DecomAtt used a simple form of attention for alignment and aggregated the aligned representations with feed-forward networks [15]. ESIM used a similar attention mechanism but implemented bidirectional LSTMs as encoders and aggregators [5]. To improve model performance, researchers adopted three main paradigms. The first used richer syntactic or hand-crafted features: HIM used syntactic parse trees [5]; POS tags were used in many previous works, some of them in [12] [7]; and exact match with lemmatized tokens was reported as a powerful binary feature [7] [12]. The second added complexity to the alignment computation: BiMPM exploited an advanced multi-perspective matching operation [17], and MwAN applied multiple heterogeneous attention functions to compute the alignment results. The third built heavy post-processing layers over the alignment results: CAFE extracted additional indicators from the alignment process using alignment factorization, and DIIN adopted DenseNet as a deep convolutional feature extractor to distill information from the alignment results [7]. More effective models could be built when inter-sequence matching was allowed to occur more than once. CSRAN performed multi-level attention refinement with dense connections among multiple levels [12]. DRCN stacked encoding and alignment layers; it concatenated all previous alignment results and had to use an autoencoder to cope with the resulting feature-space explosion [12]. SAN exploited recurrent networks to combine multiple alignment results [14]. Finally, a deep architecture based on a new way of connecting contiguous blocks, called augmented residual connections, was proposed to keep previously aligned information available as important features for text matching [18].

III. METHODOLOGY
This section presents the construction of two Siamese networks, one with the triplet loss function and one with the contrastive loss function. The training model for the Siamese network with triplet loss consists of three copies of the same CNN and takes text 1, text 2, and text 3 as inputs, while the one with contrastive loss consists of only two copies and takes text 1 and text 2 as inputs. However, the testing model for both variants consists of two copies of the same CNN and takes text 1 and text 2, whose similarity is to be calculated. The dataset used for training and testing was Quora Question Pairs, taken from https://www.kaggle.com/c/quora-question-pairs. Finally, three metrics were computed to see the influence of the loss function used in each network: precision, recall, and F1-score.

A. Learning Model
Based on the objective of the research, the model was built with three main components: a twin network, a similarity function, and an output layer.
1) Twin network: most of the feature extraction for text takes place in this network. It consists of two copies of the same network sharing the same set of weights, and it is capable of receiving two different inputs, i.e., two sequences of text. Each network is a convolutional text encoder; the CNN is applied twice before backpropagation is performed.
2) Similarity function: the two outputs of the twin network, representing the features learned automatically from the two compared texts, are fed to a layer in which a distance formula is used to compare the similarity of the two texts in n-dimensional space.
3) Output layer: the last layer, whose single neuron is connected to the n neurons resulting from the similarity function. The role of this part of the model is to decide the probability that the tested text belongs to the same class as the reference text. The probability is computed using the sigmoid function as in equation (1).
$$p = \operatorname{sigmoid}\left(\sum_{j} \sigma_j \left| h_{1j} - h_{2j} \right|\right) \tag{1}$$

where $h_{1j}$ and $h_{2j}$ are the values of the $j$-th neuron of the first and second twin networks respectively, and $\sigma_j$ is the weight between the output neuron and the $j$-th neuron in the similarity layer.
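As an illustration, the following sketch evaluates equation (1) for two feature vectors. The function and argument names are hypothetical, and the weights are assumed to have been learned during training.

```python
import numpy as np

def similarity_probability(h1, h2, sigma):
    """Equation (1): probability that the two inputs belong together.

    h1, h2 : feature vectors from the first and second twin network
    sigma  : learned weights between the similarity layer and the output neuron
    """
    # The component-wise absolute difference acts as the similarity layer.
    z = np.dot(sigma, np.abs(h1 - h2))
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation
```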
The structure of the Siamese network is shown in Fig. 1 [23]. Two types of loss function are implemented in this research, namely triplet and contrastive. The aim of this implementation is to determine the influence of the loss function used on the similarity value computed between two texts.

a. Triplet Network
A triplet network is comprised of three instances of the same feedforward network (with shared parameters). When fed with three samples, the network outputs two intermediate values: the L2 distances between the embedded representations of two of its inputs and the representation of the third. The three inputs are denoted x, x+, and x−, and the embedded representation produced by the network is denoted Net(x).
In words, this encodes the pair of distances of x+ and x− from the reference x: the distance between the reference input x and the positive input x+ is minimized, while the distance between the reference input x and the negative input x− is maximized. The structure of the triplet network is shown in Fig. 2.

b. Contrastive Network
The contrastive loss function takes the network's output for a positive sample, computes its distance to an example from the same class, and contrasts it with the distance to negative examples. In other words, the loss value is low if positive samples are encoded to similar (closer) representations and negative examples are encoded to different (farther) representations. The formula used to compute the distance was the cosine distance, as shown in equation (2):

$$d(s_1, s_2) = 1 - \frac{s_1 \cdot s_2}{\lVert s_1 \rVert \, \lVert s_2 \rVert} \tag{2}$$

where $s_1$ and $s_2$ are the feature vectors of the two compared sentences.
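The two distance computations can be sketched as follows. The paper does not name a framework, so PyTorch is assumed here, and `net` stands for any shared-weight encoder.

```python
import torch
import torch.nn.functional as F

def triplet_distances(net, x, x_pos, x_neg):
    """L2 distances of the positive and negative inputs from the reference x."""
    e, e_pos, e_neg = net(x), net(x_pos), net(x_neg)
    d_pos = F.pairwise_distance(e, e_pos, p=2)  # to be minimized
    d_neg = F.pairwise_distance(e, e_neg, p=2)  # to be maximized
    return d_pos, d_neg

def cosine_distance(e1, e2):
    """Equation (2): cosine distance between two sentence embeddings."""
    return 1.0 - F.cosine_similarity(e1, e2, dim=1)
```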
IV. IMPLEMENTATION AND RESULTS
As previously mentioned, the dataset used in the research was public. It was taken from https://www.kaggle.com/c/quora-question-pairs. The dataset was a CSV file containing 5000 pairs of Quora questions. The dataset was read to generate a dataframe, and the generated dataframe was divided into two parts: 70% for training and the remaining 30% for testing. The next step was building a Word2Vec model using Gensim for the embedding layer of the deep learning model, then defining a function to transform a sentence into a vector containing indices of Word2Vec's vocabulary. Two further important functions to be defined were the triplet and contrastive losses, after which the CNN Siamese network was defined with both loss functions and trained.
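A minimal sketch of this preparation step, assuming the public competition file name train.csv and its column layout:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Columns follow the public Quora Question Pairs file:
# question1, question2, is_duplicate.
df = pd.read_csv("train.csv").head(5000)

# 70% for training, 30% for testing, as described above.
train_df, test_df = train_test_split(df, test_size=0.30, random_state=42)
print(len(train_df), len(test_df))  # 3500 and 1500
```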

A. Dataframe
The dataframe was generated in several steps. First, simple preprocessing was applied to both question 1 (sent 1) and question 2 (sent 2); this used the simple_preprocess function from https://radimrehurek.com/gensim/utils.html#gensim.utils.simple_preprocess for tokenization. This process was performed for all pairs of sentences in the dataset. From all the tokenized pairs of sentences, the number of tokens of the longest sentence was calculated based on the mean and standard deviation. This value was used for padding, so that all tokenized pairs of sentences had the same length. To train the Siamese CNN model with the triplet loss function, a negative sentence (the third tokenized sentence) had to be added to each pair of tokenized sentences (i.e., tokenized sent 1 and tokenized sent 2) in the training dataframe. The format of the training dataframe is shown in Fig. 3.
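A sketch of these steps, assuming the column names question1/question2 and taking the padding length as the mean plus one standard deviation of the token counts (the exact rule is not spelled out in the paper):

```python
import numpy as np
from gensim.utils import simple_preprocess

def build_tokenized_pairs(df, pad_token="<pad>"):
    """Tokenize both questions and pad every row to a common length."""
    tok1 = [simple_preprocess(q) for q in df["question1"].astype(str)]
    tok2 = [simple_preprocess(q) for q in df["question2"].astype(str)]
    lengths = [len(t) for t in tok1 + tok2]
    max_len = int(np.mean(lengths) + np.std(lengths))  # assumed padding rule
    pad = lambda t: (t + [pad_token] * max_len)[:max_len]
    return [pad(t) for t in tok1], [pad(t) for t in tok2], max_len
```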
However, in the testing process, the Siamese CNN model, even the one with the triplet loss function, only needs two tokenized sentences (tokenized sent 1 and tokenized sent 2). The dataframe format for testing is shown in Fig. 4.

B. Word2Vec Model
Before building the CNN Siamese network, the Word2Vec model was built using Gensim for the embedding layer of the deep learning model. It was built with CBOW (https://iksinc.online/tag/continuous-bag-of-words-cbow/) with 20 iterations, a vector length of 100, and a window size of 5. The Word2Vec model was trained once using strings of words (the tokenized sentences resulting from concatenating tokenized sentence 1 and tokenized sentence 2) and then saved. During model training, a function was required to convert a sentence into a vector containing Word2Vec vocabulary indices. For the model using the triplet loss function, the converter changed three tokenized sentences (tokenized sent 1, tokenized sent 2, tokenized sent 3) into three vectors of Word2Vec vocabulary indices (sent 1 ids, sent 2 ids, sent 3 ids) during training, and two tokenized sentences (tokenized sent 1, tokenized sent 2) into two vectors of indices (sent 1 ids, sent 2 ids) during testing. For the model using the contrastive loss function, the converter changed two tokenized sentences (tokenized sent 1, tokenized sent 2) into two vectors of Word2Vec vocabulary indices (sent 1 ids, sent 2 ids) in both training and testing. A sample of the conversion results is shown in Fig. 5.
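With Gensim 4.x, training such a model could look like the sketch below; the corpus variable names and the saved file name are hypothetical, and sg=0 selects CBOW.

```python
from gensim.models import Word2Vec

# One training sentence per row: tokenized sent 1 concatenated with tokenized sent 2.
corpus = [t1 + t2 for t1, t2 in zip(tokenized_sent1, tokenized_sent2)]

# CBOW (sg=0), 20 iterations, vector length 100, window size 5, as stated above.
w2v = Word2Vec(corpus, vector_size=100, window=5, sg=0, epochs=20, min_count=1)
w2v.save("quora_w2v.model")  # trained once and reused
```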
The definition of the function that converts a list of tokens into a list of indices is shown in Fig. 6.
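Since Fig. 6 is not reproduced here, a plausible form of that converter, using the Gensim 4.x vocabulary lookup, is sketched below; the out-of-vocabulary fallback is an assumption.

```python
def tokens_to_ids(tokens, w2v, unk_id=0):
    """Map a list of tokens to Word2Vec vocabulary indices (cf. Fig. 6)."""
    vocab = w2v.wv.key_to_index  # token -> index mapping in Gensim 4.x
    # Unknown tokens fall back to unk_id; the paper does not say how
    # out-of-vocabulary words are handled.
    return [vocab.get(tok, unk_id) for tok in tokens]

sent1_ids = tokens_to_ids(tokenized_sent1[0], w2v)
```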

C. Triplet Loss
The triplet loss function takes three inputs: a baseline (an anchor sentence), a positive (a true sentence, the one closest to the anchor), and a negative (a false sentence, the one farthest from the anchor). The objective of this function is to minimize the distance between the anchor and the positive sentence while maximizing the distance between the anchor and the negative sentence.
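A common hinge formulation of this objective, sketched in PyTorch; the margin value is an assumption, since the paper's exact implementation is only given in its figures.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the positive toward the anchor and push the negative away
    by at least `margin` (margin value assumed)."""
    d_pos = F.pairwise_distance(anchor, positive, p=2)
    d_neg = F.pairwise_distance(anchor, negative, p=2)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```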

D. Contrastive Loss
Unlike the triplet loss, the contrastive loss function takes only two inputs: a positive sample and another example. The objective is to contrast the distance between the positive sample and an example from the same class with the distance between the positive sample and a negative example.
The implementation of the contrastive loss function is shown in Fig. 8. These two loss functions were used only during the training process, to update the weights of the network so that it converges; the triplet loss used two distances, while the contrastive loss used only one. During testing no loss function was needed, only the distance between the two sentences, because the model simply used the final weights resulting from the training process.
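For reference, a standard contrastive loss over the cosine distance of equation (2) can be sketched as follows; the margin and the binary-label convention are assumptions, since Fig. 8 is not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(e1, e2, label, margin=1.0):
    """Contrastive loss over a single distance.

    label = 1 for similar pairs, 0 for dissimilar (labeling convention assumed).
    """
    d = 1.0 - F.cosine_similarity(e1, e2, dim=1)  # cosine distance, eq. (2)
    loss_similar = label * d.pow(2)
    loss_dissimilar = (1 - label) * torch.clamp(margin - d, min=0.0).pow(2)
    return 0.5 * (loss_similar + loss_dissimilar).mean()
```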

E. CNN Siamese Network
The CNN Siamese network was built with the following specifications:
1) The embedding layer was the Word2Vec model that had already been trained.
2) Three 2-dimensional convolutional layers with configurable hyperparameters.
3) Every convolutional layer used tanh as the activation function, followed by dropout and a max-pooling layer.
A sketch of one twin built to these specifications is given below.
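This is a minimal PyTorch sketch, not the paper's exact model: the channel count and dropout rate are assumptions, and the kernel heights [1, 3, 5] spanning the full embedding width follow the kernel description in the discussion that follows.

```python
import torch
import torch.nn as nn

class CnnEncoder(nn.Module):
    """One twin of the Siamese network: pretrained Word2Vec embeddings
    followed by three 2-D convolutions, each with tanh, dropout, and
    max-pooling (channel count and dropout rate assumed)."""

    def __init__(self, embedding_weights, emb_dim=100, channels=32, p_drop=0.3):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(embedding_weights, freeze=False)
        self.convs = nn.ModuleList(
            nn.Conv2d(1, channels, kernel_size=(k, emb_dim)) for k in (1, 3, 5)
        )
        self.drop = nn.Dropout(p_drop)

    def forward(self, ids):
        x = self.embed(ids).unsqueeze(1)        # (batch, 1, seq_len, emb_dim)
        feats = []
        for conv in self.convs:
            h = torch.tanh(conv(x)).squeeze(3)  # (batch, channels, seq_len-k+1)
            h = torch.max(h, dim=2).values      # global max-pool over time
            feats.append(self.drop(h))
        return torch.cat(feats, dim=1)          # sentence feature vector
```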
The architecture of the CNN implemented in the CNN Siamese (or simply Siamese) model is shown in Fig. 9 [22], and the hyperparameters of the CNN Siamese network are shown in Fig. 10. The CNN Siamese model with the triplet loss function, consisting of three exact copies of the same CNN, was trained using triples of tokenized sentences (tokenized sent 1, tokenized sent 2, and tokenized sent 3) from the training dataframe; the output of each CNN was fed to the triplet loss function to calculate the loss value. The CNN Siamese model with the contrastive loss function, consisting of two exact copies of the same CNN, was trained using pairs of tokenized sentences (tokenized sent 1 and tokenized sent 2) from the training dataframe; the output of each CNN was fed to the contrastive loss function to calculate the loss value. Finally, both CNN Siamese models, with triplet and contrastive loss respectively, were tested by feeding them two tokenized sentences whose similarity value was to be calculated. Both tests were conducted on the same 1500 pairs of sentences (Quora questions), with the epoch value varied from 10 to 200 in increments of 10, to obtain the values of the three metrics: precision, recall, and F1-score. The values of the three metrics for each CNN Siamese model are shown in Table I and Table II respectively.

According to the testing results, an ordinary CNN did not fit this dataset well: each training epoch showed convergence, but the validation results were not good. This was caused either by a difference between the distribution of words in the training data and that in the testing data, or by the difficulty an ordinary CNN has in distinguishing two sentences with very similar structures that are in fact semantically different. For instance, "What's your favorite political bumper sticker?" and "What was your favorite bumper sticker in the 2000s?" were tagged "not similar" even though their structures are similar, and there were several other such pairs. Conceptually, a CNN performs convolution with kernels; the kernel sizes used in this model were [1, 3, 5] multiplied by the size of the embedding produced by Word2Vec. If two sentences have similar structures, the results of the convolution differ only in a few elements of the resulting matrices, so the similarity between the convolution results of the two input sentences will be high (i.e., the two sentences are judged similar). Even though dropout and regularization were applied, this problem persisted, because pairs of sentences with similar structures could be labeled either similar or not similar. Solving it would require a language model and a more complex network: the Word2Vec used in this experiment produced each word vector from surrounding words at training time, but it is not contextual, meaning a word is always represented by the same vector.

From Table I and Table II, charts comparing the two loss functions (triplet and contrastive) were derived for each metric. Fig. 11, 12, and 13 show the comparison between the two loss functions for precision, recall, and F1-score respectively.
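The metric computation itself can be sketched with scikit-learn; the 0.5 decision threshold on the similarity score is an assumption, as the paper does not state how distances were binarized.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# y_true: gold labels for the 1500 test pairs;
# similarities: model outputs in [0, 1] for the same pairs (NumPy array).
y_pred = (similarities >= 0.5).astype(int)  # assumed threshold

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```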

V. CONCLUSIONS
The two CNN Siamese models were successfully built, one with the triplet loss function and the other with the contrastive loss function. Of the 5000 sentence pairs in the dataset, 3500 were used for training and 1500 for testing. Both models were treated equally in the training and testing processes. In the training process, the model with the triplet loss function consisted of three exact copies of the same CNN, while the one with the contrastive loss function consisted of only two. The testing results over the 20 epoch values showed that the triplet loss function performed best at 180 epochs, with precision 0.8004, recall 0.6780, and F1-score 0.6713, while the contrastive loss function performed best at 160 epochs, with precision 0.6463, recall 0.6440, and F1-score 0.6451. Hence, the triplet loss function had a better influence than the contrastive loss function in measuring the similarity between two given sentences.