Grammatical Error Correction with Denoising Autoencoder

A denoising autoencoder sequence-to-sequence model based on the transformer architecture has proved useful for tasks such as summarization, machine translation, and question answering. This paper investigates the possibility of using this model type for grammatical error correction and introduces a novel method of combining model checkpoint outputs based on remarks. The approach was evaluated on the BEA 2019 shared task, achieving state-of-the-art F-scores of 73.90 on the test set and 56.58 on the development set. This was done without any GEC-specific pre-training, only by fine-tuning the autoencoder model and combining checkpoint outputs, which shows that an efficient model solving GEC can be trained in a matter of hours on a single GPU.

Keywords—denoising autoencoder transformer; sequence-to-sequence; grammatical error correction; model ensembling; error remark filtering; fine-tuning


I. INTRODUCTION
Grammatical Error Correction (GEC) is a language processing task whose goal is to detect and correct any mistakes found in input text, without changing the meaning intended by the author.
According to the British Council, English is spoken at a useful level by more than a quarter of the world's population [1]. Most English users are not native speakers and possess varying levels of proficiency. Therefore, tools aimed at improving language correctness and assisting the learning process are of great importance.
There are two main approaches to solving the Grammatical Error Correction task with neural models. The first is to treat GEC as a form of Neural Machine Translation, where erroneous source texts are "translated" into correct ones (for example, [2] and [3]). The other is to treat GEC as a sequence classification task, where the model provides a probability distribution over the available corrections for every token ([4] and [5]). Among the many approaches to building a GEC system, the best results so far (on the BEA 2019 shared task [6] test set) have been reported by GECToR [4]. It proposes a sequence tagging model that classifies input text tokens over a few iterations to identify errors. It uses pre-trained transformer-based encoders with dense layers on top that select one of the possible token-level transformations. This architecture enables fast inference, since there is no need to sequentially decode output tokens as in NMT-like solutions. Model training was done in three phases: first on a large amount of parallel synthetic data, then tuning on smaller, higher-quality sets (NUCLE, LANG-8 [7], FCE, WI+LOCNESS).
The most recent sequence-to-sequence approach was introduced in [2]; it uses a transformer-based encoder model as the basis of a sequence-to-sequence system. The base encoder model is pre-trained BERT [8], which is then fine-tuned on GEC data. This fine-tuning is performed on two tasks: MLM (the masked language model objective from [8]) and GED (grammatical error detection). The encoder model adjusted this way is used to generate additional features in the sequence-to-sequence target model.
The problem of an inadequate amount of supervised training data was addressed in [9] by using confusion sets to generate pseudo-data and pre-train a sequence-to-sequence transformer. In [3], pseudo-data generation was performed via back-translation.
The main challenge of GEC is the very limited amount of annotated training data. It is relatively easy to acquire parallel texts for Machine Translation, since plenty of sources provide texts in different languages. Corrected texts, on the other hand, which are used for GEC, need to be proofread by human annotators. Preferably, every text should be reviewed multiple times, as in the test set of [6] and in [10].
Another aspect of the GEC task that deserves closer attention is making better use of quickly improving language models. Both [2] and [4] include knowledge from models like BERT or XLNet in their approaches, but they also require quite complex pre-training phases with generated pseudo-data. The main advantage of relying more on a general-purpose model is that the target GEC system improves together with constantly improving language models. This paper investigates the possibility of applying pre-trained sequence-to-sequence models to grammatical error correction and shows that fine-tuning is sufficient to achieve an efficient error correction model. This approach enables developing such models relatively quickly, with limited computational resources and limited data. Furthermore, after applying remark combination, it is possible to improve on state-of-the-art results for GEC.

II. MODEL
In this section we describe our design decisions regarding model architecture, training, and the processing of model output.

A. Architecture
In our approach we treat GEC as a sequence-to-sequence text transformation task, similar to machine translation. We choose the Transformer architecture [11] for our model because of the many successful applications of this model type to NLP problems (for example, machine translation [12], summarization [13] or question answering [14]). The most important transformer-based sequence-to-sequence models are GPT-2 [15], T5 [16] and BART [17]. Therefore, we use a pre-trained transformer-based sequence-to-sequence model; unlike encoder-only models such as BERT or XLNet, these were pre-trained on full text-to-text tasks. We choose BART as our base model because its pre-training objective is text denoising, which makes it a natural candidate for GEC, a task that may be seen as reconstructing correct text from erroneous input. Our best results were achieved with BART large, which contains 12 encoder and 12 decoder layers and an embedding dimension of 1024.
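For illustration, the pre-trained checkpoint can be loaded through Fairseq's hub interface roughly as follows; this is a minimal sketch, and the local path is an assumption rather than our exact setup:

```python
from fairseq.models.bart import BARTModel

# 'bart.large' is the directory holding the publicly released checkpoint.
bart = BARTModel.from_pretrained('bart.large', checkpoint_file='model.pt')
bart.eval()  # disable dropout outside of training
```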

B. Training
In contrast to other text generation tasks, in GEC the difference between the input and output text is relatively small, which impacts both training and inference. During training, we try to set up a configuration that leads the model to copy the input to the output, with corrected language as the only adjustment. We noticed that both the amount of training data and the training time need to be small and carefully selected to meet this goal. In Section IV we describe the impact of the data set source and size.

C. Inference
As for inference, an important distinction from other NLP tasks is the choice of decoding method. For example, in Machine Translation a common approach is to use the beam search heuristic or methods such as sampling and sampling with temperature [18] to produce diverse, human-like output. In our model, however, applying beam search or sampling led to very noisy output, and the best results required greedy selection of elements in the generated sequence (i.e., the beam parameter was set to 1).
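A sketch of greedy decoding with a fine-tuned checkpoint is shown below; the paths and the example sentence are assumptions, and `beam=1` corresponds to the greedy setting described above:

```python
from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained(
    'checkpoints',                        # directory with the fine-tuned model
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin',         # binarized data with the dictionaries
)
bart.eval()

# beam=1 selects the most probable token at every step (no sampling).
corrected = bart.sample(['She go to school every day .'], beam=1)
print(corrected[0])
```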

D. Ensembling Method
The final output of the GEC task might be considered a set of text remarks that transforms the original text into the target one. This is beneficial from an educational point of view, but can also be used to improve performance by combining remarks from multiple model instances. The ERRANT grading and annotation tool [19] makes it possible to extract atomic remarks from parallel texts. Fig. 1 shows examples of annotated sentences: lines starting with S contain original sentences, followed by lines starting with A, which contain remarks. Every remark describes an annotation span, type and value. For example, in the sentence from Fig. 1, I think that the public transport will always be in the future, the first remark suggests removing the definite article by defining the span from the 3rd to the 4th token and an empty replacement text. The second remark suggests replacing the infinitive be with exist.
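As an illustration, this two-line format can be parsed with a few lines of Python. The sketch below is for exposition only, not ERRANT code; it keeps just the span, type and replacement fields of each remark:

```python
def parse_m2(lines):
    """Parse S/A lines into sentences with (start, end, type, value) remarks."""
    sentences = []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('S '):
            sentences.append({'text': line[2:], 'remarks': []})
        elif line.startswith('A '):
            fields = line[2:].split('|||')
            start, end = map(int, fields[0].split())
            sentences[-1]['remarks'].append((start, end, fields[1], fields[2]))
    return sentences
```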
We propose a simple and effective algorithm for combining remarks. Every model returns a set of remarks for the input text, and the same remark may be produced by multiple models. Let us define the output of model $i$ as

$$O_i = \{r_1, r_2, \ldots, r_k\},$$

where $r$ is a single text remark. The multiset of model ensemble outputs $M$ is then defined as the tuple

$$M = (S, m), \qquad S = \bigcup_{i=1}^{N} O_i,$$

where $m$ is a function that assigns to every remark its number of occurrences:

$$m(r) = |\{\, i : r \in O_i \,\}|.$$

In practice, for simple texts, all sets of remarks are exactly the same; for more ambiguous texts, the outputs of different model checkpoints will differ. To combine different remark sets, we define the parameter $R$, the required remark frequency, so that the ensemble output $M_e$ is

$$M_e = \{\, r \in S : m(r) \geq R \,\}.$$

Only a remark present in at least $R$ model outputs is chosen for the combined output. For example, if $R = 1$, we take all remarks from all models, and if $R = N$, where $N$ is the number of model checkpoints, only the remarks present in all model outputs are used in the target output. Increasing $R$ forces only highly probable remarks to be selected for the target set; decreasing $R$ results in selecting more remarks (see Fig. 2, which shows the impact of $R$ on precision and recall).
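A direct implementation of this combination rule takes a few lines of Python; the sketch below (with illustrative remark tuples) treats every remark as a hashable tuple and counts the checkpoint outputs in which it occurs:

```python
from collections import Counter

def combine_remarks(model_outputs, R):
    """model_outputs: list of N remark sets, one per model checkpoint."""
    counts = Counter()                  # the multiplicity function m
    for remarks in model_outputs:
        counts.update(set(remarks))     # each checkpoint votes at most once
    return {r for r, n in counts.items() if n >= R}

# With R = 2 only the remark produced by both checkpoints survives.
outputs = [
    {(3, 4, 'U:DET', '')},
    {(3, 4, 'U:DET', ''), (8, 9, 'R:VERB', 'exist')},
]
print(combine_remarks(outputs, R=2))    # {(3, 4, 'U:DET', '')}
```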

III. DATA
Four publicly available data sets were used in the training experiments (listed in Table I), all of them described for the BEA 2019 Workshop shared task [6].
During pre-processing, all sentences whose byte-pair representation was longer than 400 tokens were removed from the training set (no more than 0.5% of all data), which allowed for using bigger batches during training and, in turn, sped up model convergence. Furthermore, we tried the approach introduced in [2] and removed sentence pairs without any corrections; we achieved the best results after all correct sentence pairs were removed from the NUCLE and LANG-8 datasets. Evaluation on WI+LOCNESS was performed with ERRANT [19]. The FCE and CONLL dataset results were measured with the M2 scorer [20]. However, except for Table VI, all results are reported with the ERRANT score; the M2 scorer was used only to allow comparison with other reported results.
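A sketch of these two pre-processing filters is shown below; `bpe_encode` stands for an assumed callable returning the byte-pair token list, and the uncorrected-pair filter would be applied only to the NUCLE and LANG-8 portions of the data:

```python
def filter_pairs(pairs, bpe_encode, max_len=400, drop_uncorrected=True):
    """Keep (source, target) pairs that fit the training constraints."""
    kept = []
    for source, target in pairs:
        if drop_uncorrected and source == target:
            continue  # pair contains no correction
        if len(bpe_encode(source)) > max_len or len(bpe_encode(target)) > max_len:
            continue  # too long for efficient batching
        kept.append((source, target))
    return kept
```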
The ERRANT and M2 evaluation methods are based on comparing text edits. For every input sentence, the measured system outputs a hypothesis, which might be considered a set of text edits $E$. Every sentence also has a set of gold-standard edits $G$. [20] defines the precision and recall of a system hypothesis as

$$P = \frac{|E \cap G|}{|E|}, \qquad R = \frac{|E \cap G|}{|G|}.$$

ERRANT and the M2 scorer display system edits and the gold standard in the format defined in Section II-D. ERRANT additionally generates results for specific error categories (such as M:PUNCT, which stands for missing punctuation).
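For illustration, these definitions, together with the $F_{0.5}$ measure reported throughout the paper, can be computed directly over sets of hashable edits. A minimal sketch (our own helper, not part of ERRANT or the M2 scorer):

```python
def edit_scores(E, G, beta=0.5):
    """Precision, recall and F-beta over a hypothesis edit set E and a
    gold-standard edit set G, following the definitions above."""
    tp = len(E & G)                    # edits shared with the gold standard
    p = tp / len(E) if E else 1.0      # empty hypothesis: perfect precision
    r = tp / len(G) if G else 1.0      # nothing to find: perfect recall
    if p + r == 0:
        return p, r, 0.0
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return p, r, f
```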

IV. EXPERIMENTS
During our experiments we measured the impact of different factors on model performance on the GEC task. In Section II we emphasised the specifics of GEC. We noticed that our model setup was very sensitive to the quality and type of training data. On the other hand, the small amount of training data required precise selection of training time and learning rate to prevent overfitting. We reduced dropout to 0.05; higher values slowed down model convergence and did not give any long-run benefits.
The model was trained using the Fairseq toolkit [21], adopting the general configuration designed for translation tasks. This setting requires providing dictionaries, which in our case were the same for both the source and target language. The baseline model was BART in two versions: base (140M parameters) and large (400M parameters). BART requires text pre-processed with the byte-pair encoder introduced in [15].
All experiments were performed on a single GPU (GeForce RTX 2080, 11 GB), using Python 3.7.6.

A. Configuration
The optimizer used for training was Adam [22] with a label-smoothed cross-entropy loss function [23]; the learning rate was set according to a polynomial schedule. All training and learning-rate schedule parameters, except those listed in Table II, were left at their default values. The polynomial schedule in its default configuration (polynomial degree equal to 1) increases the learning rate from 0 to the maximum value during the warm-up phase and then decreases it linearly. Token and sentence limits were set so that a single batch fits into the available GPU memory.
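For clarity, this schedule can be written as a simple function of the update step. The sketch below uses illustrative parameter names, not Fairseq's exact flags:

```python
def polynomial_lr(step, max_lr, warmup_updates, total_updates, degree=1.0):
    """Learning rate at a given update under warm-up + polynomial decay."""
    if step < warmup_updates:
        # Linear warm-up from 0 to the maximum learning rate.
        return max_lr * step / warmup_updates
    # With degree 1 this is a linear decay back to 0 at total_updates.
    remaining = (total_updates - step) / (total_updates - warmup_updates)
    return max_lr * max(remaining, 0.0) ** degree
```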

B. Data Set Impact
In our approach, the base model is already trained to reconstruct noisy text. During the fine-tuning phase, we show it pairs of incorrect/correct text, which alters model behavior to precisely fix a specific set of text modifications. We investigated the impact of including different data sets in the training set. Table III shows detailed results achieved on three development sets as data are added to the training set. It proved important that the training data include high-quality corrected texts (WI, FCE); adding texts from other sources may degrade model performance.
Training only on the WI dataset yields an average result of 53.65 on WI and 54.38 on CONLL-14. These results are almost as good as those achieved with bigger training sets, but on FCE it reaches only 48.49, which is significantly worse than subsequent results. Adding the FCE training set improves the score on the FCE test set to 53.22 without degrading results on the other test sets. After adding NUCLE, the average result on FCE increases slightly to 53.73 and on CONLL-14 to 55.75. However, the models trained only on WI and FCE, without NUCLE, achieve better results when multiple model outputs are combined. Data from the LANG-8 dataset caused quite a significant drop on all test sets, which might be caused by the difference in annotation quality between the training and test data. The LANG-8 annotations were created by native speakers, collaborative users of the LANG-8 learning service. The WI test set was created from selected Write and Improve service submissions, mixed with parts of the LOCNESS essay corpus, and annotated five times by Write and Improve annotators.

C. Model Size Impact
For comparison, Table IV shows results for the smaller version of pre-trained BART, containing 140M parameters. Both model types were trained from 10 different random initialization points. The results reported in the table are: the best checkpoint result, the average of all 10 checkpoints, an ensemble of 3 models (an average over 10 random combinations of size 3), and an ensemble of all 10 models. A detailed description of the ensembling method is provided in Section IV-D.
The smaller model achieves an average F-score of 41.96, which, compared to the large version's 53.36, is significantly worse; however, its inference is twice as fast, which might be an important quality for production use. Table V shows how F0.5 on the BEA-Dev dataset changes with the values of R and N (the parameters of the output combination algorithm, see Section II), and Fig. 2 shows the trade-off between precision and recall for an ensemble of 10 checkpoints and a changing value of R.

D. Combining Output
The different model checkpoints are trained using the same training sets and configuration but are initialized with different random values. Adding models to an ensemble allows for better overall correction quality but requires longer inference time.
Thanks to the method described in Section II, the overall reported performance on BEA19 increases by about 4 points: a single model achieves 69.80, and after ensembling it reaches 73.90. Detailed results for different ensembles are shown in Table V.

E. Results Summary
Single models were selected by comparing results on the development sets, so the value reported on the BEA19 test set comes from the checkpoint that achieved the best result on the BEA19 development set. In the case of CONLL14, where no development set is available, the reported value is an average over 10 randomly initialized checkpoints. Table VI compares results reported in current research papers with those achieved by our model. We report the best scores on the BEA19 test and development sets; the scores on CONLL-2014 and FCE are not far from the best reported results. These results, achieved by relatively low-resource fine-tuning, suggest that GEC models can greatly benefit from a pre-trained model. The sequence-to-sequence denoising pre-training objective uses text transformations similar to those commonly required in GEC: [17] uses token masking, token deletion, text infilling, sentence permutation and document rotation. After fine-tuning, the model identifies the subset of these transformations specific to GEC. Remark-based ensembling proved to be a reasonable method to increase correction precision, which improves the overall score; it is also important for further model applications, where false positive remarks might be very misleading, especially in educational systems.

V. CONCLUSION AND FUTURE WORK
A fine-tuned sequence-to-sequence transformer model is very effective at solving the GEC task. It was able to achieve state-of-the-art results on the BEA19 test set (73.90) and development set (56.58), proving that an efficient GEC model can be trained in a matter of hours on a single GPU, with only a limited amount of human-annotated data.
What is also beneficial in our approach is that it facilitates leveraging future progress in general-purpose language models. Monolingual language models, followed by multilingual systems, will enable solving GEC for languages other than English. Constantly expanding transformer-based models extend the limits of text comprehension and may improve GEC performance without costly data annotation, with only low-resource and fast fine-tuning.

ACKNOWLEDGMENT
Research and Development and is a part of the project "System automatycznego wykrywania bledow jezykowych i sugerowa-