Transformer-based Model for Coherence Evaluation of Scientific Abstracts: Second Fine-tuned BERT

Abstract—Coherence evaluation is a natural language processing problem whose complexity lies mainly in analyzing the semantics and context of the words in a text. Fortunately, the Bidirectional Encoder Representations from Transformers (BERT) architecture can capture the aforementioned variables and represent them as embeddings on which fine-tuning can be performed. The present study proposes a Second Fine-Tuned model based on BERT to detect inconsistent sentences (coherence evaluation) in scientific abstracts written in English/Spanish. For this purpose, two formal methods for generating inconsistent abstracts are proposed: Random Manipulation (RM) and K-means Random Manipulation (KRM). Six experiments were performed, showing that performing a Second Fine-Tuned stage improves the detection of inconsistent sentences, reaching an accuracy of 71%. This holds even when the retraining data come from a different language or a different domain. It was also shown that using several methods for generating inconsistent abstracts and mixing them during the Second Fine-Tuned stage does not provide better results than using a single technique.


I. INTRODUCTION
Natural language processing (NLP) is a subarea of artificial intelligence that involves tasks related to the analysis of textual information by computational means. These tasks include text generation, automatic text summarization, speech analysis and information extraction. Textual coherence modeling belongs to this class of tasks; it consists of distinguishing coherent documents from incoherent ones [1]. Coherence is highly relevant in NLP today because it is implicitly involved in several applications such as speech generation, text summarization, translation, etc. Models proposed for text generation must ensure that their outputs are coherent texts, and the automatic evaluation of coherence contributes to generating such texts with quality.
According to Charolles [2], coherence operates through the thematic progression which implies that all the ideas of a coherent text must be connected to each other. Each sentence provides a piece of information that ensures thematic continuity. A coherent text must also present consistency of ideas; this implies that no idea in a text should contradict another and neither should it be incongruent with the universe of the text to which it belongs.
Coherence also concerns the type of informational and semantic connectivity that a text possesses [3]. A text is considered coherent if it is semantically consistent and provides cognitive integrity [4]; therefore, a coherent document is easier to understand than an incoherent one. Coherence is even more important when analyzing scientific papers, since a paper must communicate information effectively to reviewers and researchers. Incoherence in scientific writing directly affects both the reading experience and the comprehensibility of scientific papers [27]. Let us consider the sentence-divided scientific papers in Figs. 1 and 2.
In the left column of Fig. 1, the scientific abstract reports on the effects of candy advertising and consumption reactions to certain additives in children under 12 years old, while in the right column the third sentence reports on project-based learning, expressing an idea different from the other sentences and evidencing the incoherence of the text, since thematic progression and consistency of ideas are not met. The same phenomenon occurs in the right column of Fig. 2: the abstract deals with machine learning and data representation, while the sixth sentence reports on image processing.
As we have seen, the incoherence that occurs during scientific writing creates difficulties in transmitting and disseminating the authors' ideas [27]. This happens because the sentences produced are not strongly interconnected but isolated, leading the scientific paper to be labeled as "poorly written" or "difficult to follow" [28]. This kind of problem can be intentional or unintentional. Unfortunately, most existing systems that check for errors in scientific papers [29] lack advanced features for coherence quality control.
Given the above, identifying incoherent sentences in a scientific paper becomes a demanding problem when evaluating scientific abstracts. The abstract is the only part of the article that is usually published in conference proceedings and that readers usually review when searching electronic databases; likewise, it is the section a potential referee reads when invited by an editor to review a manuscript [5]. It often follows the structure context, methodology, results and conclusions, each of which should provide relevant and semantically consistent information.
The evaluation of coherence requires a thorough analysis of the parts of the text at the structural and semantic level, since natural language does not follow a set of rules like formal languages [4]. It is a very abstract concept [6]; however, it is a problem that has received attention in different studies and solutions since the twentieth century, as mentioned in the following paragraphs.
Foltz in 1998 [7] proposed the first machine-based coherence evaluation method. This method is based on latent semantic analysis (LSA), which compares units of textual information and determines their semantic relationship. In the following years, several coherence analysis methods were proposed by various researchers; however, no method has proved to be perfect [8].
Relational models such as Rhetorical Structure Theory [9] define relationships that hierarchically structure texts. Later, the work of Barzilay and Lapata (2008) proposed a model called Entity Grid [10] to evaluate local coherence. This approach was based on centering theory [11], which models a text as a set of segments and utterances that produce centers of attention.
Unlike the Entity Grid model [10], which evaluates coherence at the local level, the work of Guinaudeau and Strube (2013) proposed a graph-based model called Entity Graph [12] to measure text coherence at the global level. This bipartite graph allows relating non-adjacent sentences of a text.
In the work of Li and Hovy (2014), recurrent and recursive neural networks were applied to estimate the coherence of a text [13]. Recurrent neural networks simulate the processing of a text as a reading process, word by word, while in the recursive neural network the processing is represented through a binary tree.
Li and Jurafsky (2016) developed a discriminative neural model that can distinguish coherent from incoherent text. They also created two generative models that produce coherent text, one based on SEQ2SEQ and the other a Markovian model. These models capture the latent discourse dependencies of a given text [14]. Building on the foundations of the Entity Graph model [12], a semantic similarity graph model was proposed in the work of Putra and Tokunaga (2017) to address coherence from a cohesion perspective [15]. They argue that the coherence of a text is built by the cohesion between its sentences. This method employs an unsupervised learning approach.
In the work of Baiyun Cui et al. (2017), a deep coherence model (DCM) was proposed using a convolutional neural network architecture to capture text coherence [6]. The model captures the interactions between sentences by calculating the similarities of their distributional representations.
In the work of Mesgar and Strube [21], a local coherence model was developed using a unidirectional standard LSTM architecture to encode the context of an input sequence of words; the relationships between adjacent sentences were then encoded with an LSTM, and finally a vector representing the coherence of the text was produced.
The use of recurrent or recursive neural networks allows a vector representation of an input sequence, on top of which a series of dense (linear) layers can be applied to classify whether the input sequence is coherent or incoherent. Thus, in the work of Moon et al. [22], a Bidirectional LSTM (BiLSTM) sentence encoder was applied to capture the grammar of each sentence. Given the numerical representations of the sentences, a local coherence model and a global coherence model extract the respective features.
Bao et al. [23] used Recurrent Neural Networks (RNN) to semantically represent a text. For this purpose, they used bidirectional gated recurrent units (BiGRU) in conjunction with the pretrained language model Word2Vec. The results show that a complete analysis of the coherence of a text can be represented, which favors the task of binary text classification.
The creation of a dataset for training, validation and testing is also indispensable for the evaluation of coherence. Because of this, the work of Mohammadi et al. [24] proposes different techniques for generating incoherent or negative documents to be added to the coherent documents in order to train a Convolutional Neural Network. Their results indicate that artificially generating incoherent documents does not guarantee "sufficiently incoherent" documents, which negatively influences the accuracy of the model.
As seen, RNN, LSTM, gated recurrent neural network (GRNN) and BiLSTM are some of the sequence models for NLP tasks such as natural language modeling [17]. In 2017, the Google research team presented the Transformer architecture, which replaces the complex RNN and CNN (Convolutional Neural Network) architectures thanks to its better results: parallel training capability across several GPUs and a self-attention mechanism that allows information to be "remembered" over the long term [18]. Fig. 3 shows the architecture of the Transformer and the Bidirectional Encoder Representations from Transformers model. In the Transformer architecture, the encoder maps an input sequence of symbols (x_1, x_2, ..., x_{n-1}, x_n) to a sequence of continuous representations z = (z_1, z_2, ..., z_{n-1}, z_n). Given z, the decoder generates an output sequence of symbols (y_1, y_2, ..., y_{m-1}, y_m), one element at a time. The general Transformer architecture comprises the self-attention mechanism and the encoder and decoder, which are fully connected.
BERT is a pretrained open-source language model introduced in 2018 [19]. It is based on Google's Transformer architecture and is designed to pre-train text representations bidirectionally from unlabeled text [20]. BERT has two pre-trained models: BERT Base and BERT Large. The first consists of 12 encoders with a bidirectional self-attention mechanism, while the second consists of 24 encoders and 16 bidirectional attention heads. The BERT model is pre-trained on 800 million words from BooksCorpus and 2.5 billion words of unlabeled text from English Wikipedia. This model is well suited to small datasets for specific NLP tasks, for example, the evaluation of coherence in scientific papers.

Fig. 4 shows the neural network architectures of the deep bidirectional BERT and the unidirectional (left-to-right) OpenAI GPT contextual models [17]; the unidirectional model generates a representation for each word based only on the preceding words of the sentence, whereas the bidirectional BERT model represents both the preceding and following context. By contrast, the context-free models Word2vec and GloVe generate a single representation for each vocabulary word.

According to this historical review, BERT has revolutionized the field of NLP by enabling transfer learning from large language models that can capture complex textual patterns [36]. It also offers better performance and scalability than recurrent neural network architectures, since the latter operate sequentially while BERT can be parallelized. This research work proposes a Second Fine-Tuned model based on BERT to detect inconsistent sentences in order to evaluate coherence in abstracts written in Spanish/English. Two formal techniques for generating incoherent abstracts are also proposed in order to improve the training and validation of the aforementioned model.
This paper is organized as follows: in Section II the most recent and important research on the evaluation of coherence is mentioned. In Section III the methodology and the research proposal are described. Section IV presents the experiment with the results. Finally, Section V presents the discussion, conclusions and future work.

II. RELATED WORK
Discovering semantic progression and consistency of ideas is indispensable for understanding coherence. Previous research has relied on RNN, LSTM and BiLSTM architectures to evaluate coherence; however, these networks do not use a self-attention mechanism to encode sentences, so some information is lost [16]. The Transformer-based architecture receives input sequences in parallel, making it more efficient; specifically, BERT captures the context of a sentence through bidirectional analysis [18]. Some BERT-based works on coherence evaluation are described below.
In the work of Muangkammuen et al. (2020), a scoring method based on BERT was proposed to score text clarity using local coherence between adjacent sentences; cause-effect and contrast relationships were considered [25]. First, a local coherence model was trained on BERT; then the model was retrained to evaluate the clarity of a text. The results show that retraining provides positive results even if the data used in the two training stages are not from related domains.
In the work of Callan and Foster (2021), a corpus of narrative stories automatically generated by the pre-trained Transformer model GPT-NEO was proposed; the stories were analyzed by humans and by two automated metrics, BERT Score and BERT NSP, in order to evaluate the coherence and level of interest of narrative texts. The results show that greater emphasis should be placed on BERT-based evaluation techniques and that generative models do not always produce coherent texts; the more natural and coherent the generated text is, the higher its quality [26].
In a study from 2021 [4], a comparative analysis of three different types of models for the evaluation of coherence in Polish documents was developed. The first is based on a Semantic Similarity Graph (SSG); the second on Long Short-Term Memory (LSTM); and the third on BERT. The results show that the neural network methods offer better accuracy than the SSG methods; and among the neural networks, although the LSTM-based method shows better accuracy than the BERT-based method, the authors emphasize that the latter can increase that metric with additional fine-tuning [4].
In the work of Nguyen and Zaslavskiy (2021), a fine-tuning method based on the BERT model in conjunction with a clustering algorithm was proposed for the detection of discordant sentences in a corpus of scientific documents written in English and Russian, in order to detect incoherence in scientific writing. They first generated negative examples using the BERT Score metric to compute semantic similarity between sentence pairs. Then they trained a model with coherent and incoherent sentence pairs. Finally, they retrained this model on whole paragraphs. The results were positive [27].
In the work of Bendevski et al. (2021), a comparative analysis of different artificial intelligence methods for predicting the coherence score of narrative documents was proposed, in which BERT was shown to produce better results than traditional machine learning methods such as Linear Regression, Support Vector Machine and Random Forest. They establish three dimensions for coherence, namely Context, Chronology and Theme, and for each dimension the narrative texts are assigned a coherence score from 0 to 3 (a 4-class classification) [30].
According to the work of Noji and Takamura [31], negative examples improve a neural language model's ability to robustly handle complex syntactic constructs.

III. PROPOSED MODEL AND METHODOLOGY
The principal objective of this study is to build a Second Fine-Tuned model based on BERT to detect inconsistent sentences in English/Spanish abstracts (coherence evaluation).

A. Data Collection and Preprocessing
First, a program was developed in Python with the beautifulsoup4 library to perform web scraping on the website of the journal "Comunicar" [32]. From this journal, 1,493 scientific abstracts written in Spanish were extracted. Second, the corpus of "Medical Semantic Indexing in Spanish" (MESINESP) was downloaded [33], from which 51,390 scientific abstracts written in Spanish were collected. Third, a corpus from arXiv was downloaded through Kaggle [34], from which 56,181 scientific documents written in English were collected.
In addition, a corpus of 448 scientific abstracts from the "International Conference on Machine Learning and Applications" (ICMLA) was added to this English corpus [35], resulting in a corpus of 56,629 scientific documents in English. In total, three corpora were collected: two in Spanish and one in English. The downloaded corpora were subjected to two preprocessing stages: removal of blank entries and removal of duplicates. Two formal techniques were developed for the generation of negative examples (incoherent abstracts):

1) Random Manipulation (RM):
In the first method, every abstract T_i of a corpus D is tokenized into N sentences. T_i is represented as a set of sentences S_j, where j denotes the position of the sentence in T_i, with 1 ≤ j ≤ N. The variable i denotes the position of a scientific abstract in the corpus D, with 1 ≤ i ≤ size(D). Then a sentence S_j is randomly selected, with S_j ∈ T_i and T_i ∈ D, and is replaced with a sentence S'_j such that S'_j ∈ T'_i, T'_i ∈ D, and S_j ≠ S'_j.
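As a minimal sketch of this idea (not the authors' exact implementation), the RM procedure can be written as follows, where each abstract is a list of sentences and the replacement sentence is drawn from a different abstract of the corpus:

```python
import random

def random_manipulation(corpus, seed=None):
    """Random Manipulation (RM): for each abstract T_i (a list of
    sentences), replace one randomly chosen sentence S_j with a
    sentence S'_j taken from a *different* abstract of the corpus."""
    rng = random.Random(seed)
    negatives = []
    for i, abstract in enumerate(corpus):
        # Position of the sentence to corrupt.
        j = rng.randrange(len(abstract))
        # Pick a different abstract to borrow the replacement from.
        donor = rng.choice([k for k in range(len(corpus)) if k != i])
        corrupted = list(abstract)
        corrupted[j] = rng.choice(corpus[donor])
        negatives.append(corrupted)
    return negatives
```

Because the intruder sentence comes from an unrelated abstract, the resulting text violates thematic progression in exactly one position, which is the property the classifier is trained to detect.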
2) K-means Random Manipulation (KRM): In the second method, BERT embeddings are generated for all abstracts of a corpus D. Clustering with K-means (number of clusters K = 10) then produces, as a partial result, 10 clusters of scientific abstracts labeled C_i, where i is the cluster number in the corpus D. These clusters group abstracts with similar embeddings. The K-means algorithm minimizes the inertia criterion, given in the following equation 1:

inertia = Σ_{i=1}^{n} min_{μ_j ∈ C} ‖x_i − μ_j‖²    (1)

This measure indicates the internal coherence of the samples within each cluster. Taking the variables of the first method as reference, to generate a negative example a sentence S_k ∈ T_j is randomly chosen, with T_j ∈ C_i and C_i ⊆ D. Then this S_k is randomly replaced with a sentence S'_k such that S'_k ∈ T'_j, T'_j ∈ C'_i and C'_i ⊆ D, fulfilling S_k ≠ S'_k and C_i ≠ C'_i. Since each cluster contains similar scientific documents while the clusters differ from one another, the negative examples are explicitly more incoherent than those of the first method, RM. Finally, the corpora are summarized in Table I.

During the tokenization process, an input sequence must be preprocessed to produce tokens (word units). The multilingual CASED BERT preprocessor was used, which can handle special characters and upper/lower case. A classification token [CLS] is always included at the beginning of a sequence, each word being a token, and the separation token [SEP] is included to separate one sentence from another [37]. The BERT Large pre-trained model was used to perform the numerical encoding of the input sequences. This is the more complete version, with 24 encoders and 340 million parameters [37]; the model used is both multilingual and CASED.
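The KRM generation step described above can be sketched as follows. This is an illustrative version, not the authors' exact code: the abstract embeddings are assumed to have been produced beforehand by a BERT encoder (not shown), and the intruder sentence is always drawn from an abstract assigned to a different K-means cluster:

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def kmeans_random_manipulation(corpus, embeddings, k=10, seed=None):
    """KRM: cluster abstracts by their (precomputed) BERT embeddings,
    then replace one sentence of each abstract with a sentence from an
    abstract of a *different* cluster, so the intruder is off-topic.

    corpus     : list of abstracts, each a list of sentences
    embeddings : (n_abstracts, dim) array of abstract embeddings
    """
    rng = random.Random(seed)
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    negatives = []
    for i, abstract in enumerate(corpus):
        # Donor abstracts belong to a different cluster than T_j.
        donors = [j for j in range(len(corpus)) if labels[j] != labels[i]]
        donor = rng.choice(donors)
        pos = rng.randrange(len(abstract))
        corrupted = list(abstract)
        corrupted[pos] = rng.choice(corpus[donor])
        negatives.append(corrupted)
    return negatives
```

Restricting the donor to a different cluster is what makes KRM negatives "explicitly more incoherent" than RM negatives, since the replacement sentence comes from a thematically distant group.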
For each token, three embedding representations were applied. The first, token embeddings (h_s1), represents a token as a numerical vector. The second, segment embeddings (h_s2), indicates to which segment a token belongs; a segment is delimited from the next by the [SEP] separator. The third, positional embeddings (h_s3), indicates the relative position of a token in the sentence. Each word is processed simultaneously [37]. Finally, the three numerical representations are added to produce a single resulting vector h_c that is used to train the Fine-Tuned model, as expressed in the following equation 2:

h_c = h_s1 + h_s2 + h_s3    (2)

Fig. 6 represents how these embedding layers process a pre-processed input, and Fig. 7 shows the general architecture of the BERT Large model.

To fine-tune the pre-trained model in the six experiments described above, 10% of the dataset of the journal "Comunicar" was set aside for testing; 10% was assigned to validation and 80% to training. Each experiment uses the same split percentages, and each experiment was tested on the testing data of the journal "Comunicar". After defining the datasets, a dense neural network layer was built with a Rectified Linear Unit (ReLU) activation function. This function outputs the input when it is positive and zero otherwise, and is well suited to accelerating the training of deep neural networks. It is defined by the following equation 3:

ReLU(x) = max(0, x)    (3)

After the ReLU layer, a Dropout layer was added to avoid overfitting during training; several studies have shown that dropout reduces overfitting and improves the performance of deep neural networks in tasks such as document classification [38]. After this layer, a simple output layer with a sigmoid activation function was added.
This function is commonly used for binary classification. Its prediction range is [0, 1]: values closest to 0 are rounded to 0 (incoherent abstracts) and values closest to 1 are rounded to 1 (coherent abstracts). The mathematical representation of this function is shown in the following equation 4:

σ(x) = 1 / (1 + e^(−x))    (4)

This final simple layer contains a single neuron whose output represents the probability of coherence of a group of abstract sentences; it can be defined mathematically by equations 5 and 6. Each experiment has followed the same procedure described so far; consequently, the six experiments share the same architecture, which is shown in Fig. 8.
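Assuming hypothetical layer sizes and weights W1, b1, w2, b2 (the paper does not publish its exact parameters), the inference-time forward pass of this classification head on top of the pooled BERT vector h_c can be sketched with NumPy as:

```python
import numpy as np

def relu(x):
    # Equation (3): ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def sigmoid(x):
    # Equation (4): sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def classify(h_c, W1, b1, w2, b2):
    """Forward pass of the classification head placed on top of BERT.

    h_c    : pooled BERT representation of the abstract (dim d)
    W1, b1 : weights of the dense ReLU layer (hypothetical shapes)
    w2, b2 : weights of the single sigmoid output neuron
    Dropout is only active during training, so it is omitted here.
    Returns P(coherent); rounding gives the label (1 = coherent).
    """
    hidden = relu(h_c @ W1 + b1)
    return sigmoid(hidden @ w2 + b2)
```

Rounding the returned probability reproduces the decision rule described above: outputs near 0 flag an incoherent abstract, outputs near 1 a coherent one.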

IV. EXPERIMENTAL SETTINGS AND RESULTS
In this section, the six experiments mentioned in the previous section are described in depth. Six models were generated with the exact architecture of Fig. 8; the only difference between the experiments is the positive and negative datasets on which their models were trained. The following subsections detail the datasets, the experimental environments, the fine-tuning parameters and the performance measurements.

A. The Datasets
According to Table I and as described in previous sections, each experiment was assigned training, validation and testing data segments, as shown in Table II. These datasets feed each model for the corresponding fine-tuning. Each dataset generated for the experiments passed through the preprocessing stage explained in Section III, and the two formal methods of generating negative examples explained in Section III were applied as indicated in Table II.

Experiment 1 uses only the dataset of the journal "Comunicar". Experiments 2 and 3 depend on experiment 1: experiment 2 first uses the MESINESP dataset and then adds the data of the journal "Comunicar" (repeating exactly the same training and validation data). Experiment 3 follows the same flow as experiment 2 but uses its own dataset as detailed in Table II. The incoherent abstracts of experiments 1, 2 and 3 were generated with the RM method.

Experiment 4 is similar to experiment 1, with the difference that its incoherent abstracts were generated by the KRM method. Experiments 5 and 6 depend on experiment 4. Experiment 5 is similar to experiment 2, with the difference that the RM and KRM techniques are combined during training to create a varied dataset: the RM method was applied to the MESINESP dataset and the KRM method to the "Comunicar" dataset. Experiment 6 follows the same flow as experiment 5 but uses its own dataset as detailed in Table II. In summary, the incoherent abstracts of experiment 4 were generated with the KRM method, so experiments 5 and 6 ended up using both the KRM and RM methods. It should be remembered that the testing data were generated with the RM method; this set belongs only to the journal "Comunicar" and is repeated in every experiment without exception.
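The 10%/10%/80% testing/validation/training division applied to the "Comunicar" dataset can be sketched as follows. This is a generic shuffled split for illustration, not the authors' exact code:

```python
import random

def split_dataset(examples, train=0.8, val=0.1, seed=0):
    """Shuffle and split a dataset into training (80%), validation
    (10%) and testing (the remaining 10%) partitions."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])
```

Fixing the shuffle seed keeps the partitions stable across experiments, which matters here because several experiments must repeat exactly the same training and validation data.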

B. Experimental Environments
Experiments 1 and 4 were executed in the Google Colab environment, which allows machine learning models to be built for free with powerful hardware resources such as the Graphics Processing Unit (GPU) and the Tensor Processing Unit (TPU) [39]. On the other hand, experiments 2, 3, 5 and 6 were executed on a standard server. The resources used for the experiments are described in Table III.

C. Parameters Fine-Tuning
The parameters, hyper-parameters and other configurations of the 6 experiments are detailed in the following Table IV.

D. Performance Measures
The basic components used for the evaluation of the models developed during the six experiments are the following:
• True Positive (TP): When the model correctly predicts the positive class. This means it correctly predicts a coherent abstract.
• True Negative (TN): When the model correctly predicts the negative class. This means it correctly predicts an incoherent abstract.
• False Positive (FP): When the model incorrectly predicts the positive class. This means it predicts an incoherent abstract as coherent.
• False Negative (FN): When the model incorrectly predicts the negative class. This means it predicts a coherent abstract as incoherent.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 5, 2022

The basic components of the six experiments are detailed in Table V. Accuracy, F1-score, precision and recall were the metrics most frequently used to report model performance on benchmark datasets. As metrics for binary classification problems, they can be derived from a confusion matrix, a two-by-two contingency table of the predicted and observed class labels [40]. Once the evaluation components were defined, the performance measurements were calculated using the following equations 7, 8, 9 and 10:

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (7)
Precision = TP / (TP + FP)    (8)
Recall = TP / (TP + FN)    (9)
F1-score = 2 · (Precision · Recall) / (Precision + Recall)    (10)

Applying these performance measurement equations to the six generated models yields the results in Table VI. The experiment 2 model offers the best accuracy (0.71) for detecting inconsistent sentences in scientific abstracts. This shows that performing a Second Fine-Tuned stage mixing data from the same language but a different domain (MESINESP + Comunicar) improves the accuracy of coherence evaluation. Experiment 3 shows that a Second Fine-Tuned stage mixing data from a different language (English) and a different domain (arXiv + ICMLA + Comunicar) also improves the accuracy of detecting inconsistent sentences (0.70). This aspect is important, as it shows that training data in different languages can be added without degrading the coherence-evaluation results, because a multilingual pre-trained BERT model was used.
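The four performance measures can be computed directly from the confusion-matrix counts; the following sketch mirrors equations 7 to 10:

```python
def performance_measures(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score (equations 7-10)
    computed from the confusion-matrix counts of a binary classifier."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)            # of predicted-coherent, how many were coherent
    recall = tp / (tp + fn)               # of truly coherent, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

Note that the denominators assume at least one positive prediction and one true positive; degenerate confusion matrices would need explicit zero-division handling.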
Experiments 4, 5 and 6 show that the KRM method works slightly better than the RM method when little data is available. At first, the KRM method classifies incoherent abstracts better, but once a large dataset whose negative examples were generated with RM is added, it no longer offers higher performance. This happens because the testing data were created with the RM method rather than the KRM method, which negatively impacts the predictions.
The authors agree with the work of Bendevski et al. [30] in that BERT offers better results for coherence evaluation than traditional machine learning methods, since the latter cannot understand the context of a text as BERT does; in addition, BERT offers greater scalability and enables the transfer learning process.
Unlike the studies [4], [6] and [23], which focused on generating incoherent examples by varying the order of sentences (the Sentence Ordering Task), the present research has focused on the detection of inconsistent sentences for the evaluation of coherence in scientific abstracts written in English/Spanish.

V. CONCLUSION AND FUTURE WORK
In this study, it has been shown that abstracts written in different languages/domains can be used to train a model that detects inconsistent sentences in test data whose language and domain also differ from the training data. Experiment 2 proved best for detecting inconsistent sentences in abstracts written in Spanish. Experiment 3 proved most suitable for evaluating abstracts written in Spanish using combined training and validation data written in Spanish and English. Also, mixing incoherent abstracts generated with RM and KRM during the Second Fine-Tuned stage proved to deliver no better results than training with a single method of generating incoherent abstracts when the goal is to detect inconsistent sentences for coherence evaluation.
Future research will replace the current clustering method (KRM) with a BERT Score-based method for the detection of inconsistent sentences. It will also address the evaluation of coherence as a multi-class classification problem, taking into account the types of incoherence: contradiction, redundancy and thematic discontinuity.