An Empirical Analysis of BERT Embedding for Automated Essay Scoring

Automated Essay Scoring (AES) is one of the most challenging problems in Natural Language Processing (NLP). The significant challenges include the length of the essay, the presence of spelling mistakes that affect the quality of the essay, and the representation of the essay in terms of relevant features for efficient scoring. In this work, we present a comparative empirical analysis of AES models based on combinations of various feature sets. We use 30 manually extracted features, a 300-dimensional word2vec representation, and 768 word embedding features from the BERT model, and form different combinations of them for evaluating the performance of AES models. We formulate the automated essay scoring problem as a rescaled regression problem and as a quantized classification problem. We analyze the performance of AES models for the different combinations and compare them against existing ensemble approaches in terms of Kappa statistics for the rescaled regression problem and accuracy for the quantized classification problem. A combination of the 30 manually extracted features, the 300-dimensional word2vec representation, and the 768 BERT word embedding features yields a Kappa statistic of up to 77.2 ± 1.7 for the rescaled regression problem and an accuracy of up to 75.2 ± 1.0 for the quantized classification problem on a benchmark dataset consisting of about 12,000 essays divided into eight groups. The reported results encourage researchers in the field to use manually extracted features along with deep encoded features for developing more reliable AES models.
Keywords—Automated Essay Scoring (AES); BERT; deep learning; neural network; language model


I. INTRODUCTION
Automated Essay Scoring (AES) involves the use of statistical models for extracting useful features from an essay and assigning it a grade in a numeric range. It helps to reduce the human effort of manual grading and improves the effectiveness and efficiency of writing assessment. Several models have been proposed for automatic essay scoring in the recent past. Broadly, these models can be categorized into two classes [1]. The first class consists of feature engineering-based models. These models use features manually extracted from an essay, such as the number of words, number of grammatical errors, number of unique vocabulary words, term frequency, inverse document frequency, etc. [2][3]. Feature engineering-based models have the benefit that manually extracted features can be easily explained and modified to adapt to different scoring criteria. However, these models fail to capture some semantic features, which leads to low accuracy.
The second class consists of end-to-end models. These models are developed using machine learning or deep learning techniques [4,5] based on word embedding methods [6,7]. Word embedding methods represent an essay as low-dimensional vectors. A dense layer then transforms these low-dimensional vectors into a deep encoded vector that is used to score the essay. End-to-end models perform well at extracting semantic features and thus address the limitation of feature engineering-based models. However, these models are unable to integrate manually extracted features.
An AES engine assigns a score to an essay based upon features extracted from the raw essay data. The scoring process involves two phases [8]. The first phase consists of collecting the data to be scored by the AES engine. The engine is trained based on holistic rubrics that specify the criteria an essay must satisfy. The rubrics consider different factors such as grammatical errors, spelling mistakes, clarity, organization of the text, and cohesion of the essay [9]. A Kaggle competition has made an AES dataset available to the public. The second phase involves dividing the essay dataset into two subsets for training and testing purposes. The training dataset is a labelled dataset used for training the AES engine based upon the selected features of the essays. The trained model is then applied to the test dataset to assign each essay a label as its score.
In this paper, we focus on manually extracted features as well as word embedding features of the BERT model to analyze the performance of the BERT language model in automated essay scoring. We conduct a set of experiments using word embedding models along with the manually extracted features and compare their performance on a benchmark dataset. The performance of the different models is compared in terms of Kappa statistics and accuracy by treating the automatic scoring process as a rescaled regression problem and a quantized classification problem, respectively.
The rest of the paper is structured as follows. Section II highlights the background of AES and describes the different models developed for efficient AES. Section III describes the details of the experiments, such as the experimental setup, benchmark dataset and performance metrics, and presents the experimental methodology followed in this work. Section IV presents and analyzes the results and compares them with the existing approaches. Section V concludes the paper.

II. BACKGROUND
AES is considered one of the most challenging problems in natural language processing (NLP). The significant challenges are the length of the essay and the presence of spelling mistakes that affect the quality of the essay. Several research efforts have been invested in automated essay scoring in the recent past [8,10]. Initially, these research efforts involved statistical methods based upon bag of words (BOW), logistic regression, and other probability-based methods. Some researchers applied neural networks for automated scoring of essays using word embedding methods [6]. Embedding methods mainly work on characters, words or sentences and transform them into n-dimensional vectors while preserving semantic features. This converts character data into a sequence of n-dimensional vectors. The n-dimensional vectors can then be used as input to different neural network models such as LSTM, CNN and GRU [11]. These neural networks are nonlinear models that score a given essay based upon some scoring rubrics. The authors of [14] have summarized supervised and unsupervised learning-based embedding methods. In supervised learning methods for automatic scoring of essays, researchers have treated AES as a regression problem or as a classification problem. The regression problem involves predicting the score of an essay in a given numeric range. The classification problem involves assigning an essay to one of several predefined classes such as low, medium and high. For regression, researchers have used linear regression [15,16], support vector regression [17,18], and sequential minimal optimization (SMO, a variant of support vector machines) [19] for automatic scoring of essays based upon different features. For classification, researchers have employed SMO [19], logistic regression [20] and Bayesian network classification [21] for assigning essays to their predefined classes.
Many researchers have also used neural networks for automated scoring of essays. Taghipour and Ng [22] proposed the first neural network-based approach for scoring essays. They fed a sequence of words into a convolutional layer and extracted n-gram features from essays. The extracted features represent local text dependencies among words. These features are passed to LSTM layers for capturing long-term dependencies among the words of the essay. Further, they concatenated the vectors obtained at different time intervals and fed them to a dense layer. Finally, they predicted the score of the essay after training the model.
The above-cited research works use different types of features, implicit or explicit, for scoring essays automatically with different models. The performance of a model mainly depends on the extent to which the extracted features represent the given essay. Some researchers focused on manually extracted features, others on word2vec feature representations or embedding representations. In this work, we hypothesize that both manually extracted features and deep-encoded features can contribute to enhancing the performance of AES models. Therefore, we conducted a comprehensive set of experiments to evaluate word embedding in combination with manually extracted features and word2vec features.

III. EXPERIMENTS
This section describes the comprehensive set of experiments conducted in this work to evaluate the performance of word embedding in combination with manually extracted features and word2vec features. It presents the experimental methodology by explaining the different modules proposed in this work. A benchmark dataset is used for comparing the performance of different models based upon different feature sets. This section also defines the performance metrics used to measure the performance of the different models.

A. Experimental Methodology
To conduct a comprehensive set of experiments, we followed the experimental methodology presented in Fig. 1. The proposed methodology consists of four modules, namely, essay raw data collection, feature extraction, scoring engine, and performance evaluation.
The raw data collection module collects the raw essay data from the database and feeds it into the feature extraction module. The feature extraction module can employ different types of methods for extracting relevant features that preserve the semantics of the essay. The features can be extracted manually, through a word2vec representation, or by using a word embedding method. In this work, we focus on measuring the performance of AES models based upon different combinations of manually extracted features, word2vec representations, and word embeddings from the BERT model. We use 30 manually extracted features, a 300-dimensional word2vec representation, and 768 word embedding features from the BERT model, and form different combinations of them for evaluating the performance of AES models. Table I summarizes the manually extracted features used in this work.
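The way the three feature sets can be combined into model inputs may be sketched as follows. This is a minimal illustration and not the authors' code: the abbreviations MF (manual features), WV (word2vec) and EM (BERT embedding) follow those used in Section IV, while the function names and the random placeholder vectors are assumptions. The sketch enumerates every non-empty combination; which of these were actually evaluated is determined by Table IV.

```python
from itertools import combinations
import random

# Dimensionality of each feature set, as stated in the text.
FEATURE_DIMS = {"MF": 30, "WV": 300, "EM": 768}

def feature_combinations(names):
    """Return every non-empty combination of feature-set names."""
    combos = []
    for size in range(1, len(names) + 1):
        combos.extend(combinations(names, size))
    return combos

def concatenate(features, combo):
    """Concatenate the selected feature vectors of one essay into one input vector."""
    vector = []
    for name in combo:
        vector.extend(features[name])
    return vector

# Placeholder vectors standing in for one essay's extracted features (hypothetical data).
rng = random.Random(0)
essay = {name: [rng.random() for _ in range(dim)] for name, dim in FEATURE_DIMS.items()}

for combo in feature_combinations(list(FEATURE_DIMS)):
    # Print the combination name and the resulting input dimension.
    print("+".join(combo), len(concatenate(essay, combo)))
```

Under these assumptions, the full MF+WV+EM input has 30 + 300 + 768 = 1098 dimensions, and the MF+WV input has 330.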
The different combinations of manually extracted features, word2vec representation and word embedding features are provided as input to the AES engine, which scores the test essay dataset after the AES model has been trained on the training essay dataset. The performance evaluation module analyzes the performance of AES models based upon the different combinations of manually extracted features, word2vec representation and word embedding features, as presented in Fig. 1. Here, we fine-tuned the BERT model using different hyper-parameters. The optimal values used in this set of experiments are presented in Table II.

B. Benchmark Dataset
Most AES-related research work has used the Automated Student Assessment Prize (ASAP) dataset for evaluating AES models [1, 2, 3]. This dataset contains about 12,000 essays divided into eight groups. Essays in the dataset are not scored on a normalized range: the assigned score ranges vary from [2-12] to [10-60]. Essays in the dataset also have a variable length ranging from 120 tokens to 500 tokens. Sentences have a length from 120 to 500 tokens. Each group of the dataset contains about 700 to 1800 items.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 10, 2020, www.ijacsa.thesai.org
To use the available data as a benchmark dataset in our experiments, we normalize the essay scores to the range of 0 to 10 by applying an independent transformation for each essay group. The resulting distribution of scores in the dataset is not uniform but resembles a normal distribution. Since the current work compares the performance of AES models on the rescaled regression problem and the quantized classification problem, we divided the dataset into three subgroups by approximating two quantile cut points. Each subgroup is replaced with its number in ascending order, effectively yielding a discrete score of 0, 1, or 2. Because of the skewed score frequencies, this 3-quantile discretization produces subgroups that are far from equally populated in our experiments. The frequencies per subgroup in the complete dataset for the classification problem are presented in Table III.
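A minimal sketch of the score normalization and three-way quantization described above. The text does not specify the per-group transformation or how the cut points are approximated, so the min-max rescaling, the index-based cut-point rule, and all function names and sample scores here are assumptions made for illustration.

```python
from collections import Counter

def rescale_scores(scores, low, high, new_max=10.0):
    """Linearly map scores from [low, high] to [0, new_max] (assumed min-max transform)."""
    return [(s - low) * new_max / (high - low) for s in scores]

def tercile_cut_points(values):
    """Approximate the two cut points that split the sorted values into three parts."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]

def quantize(values, cuts):
    """Replace each value by the index (0, 1 or 2) of the subgroup it falls into."""
    low_cut, high_cut = cuts
    return [0 if v < low_cut else 1 if v < high_cut else 2 for v in values]

# Hypothetical raw scores for one essay group whose score range is [2, 12].
raw_scores = [2, 6, 8, 8, 8, 8, 9, 10, 12]
rescaled = rescale_scores(raw_scores, low=2, high=12)
labels = quantize(rescaled, tercile_cut_points(rescaled))

print(rescaled)
print(labels)
print(Counter(labels))  # tied scores make the three subgroups unequal in size
```

Because many essays share the same score, the cut points fall on tied values and the resulting classes are not equally populated, which is the effect the text attributes to skewed score frequencies.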
In this work, we use this dataset as a benchmark dataset for evaluating the performance of different models based upon different combinations of manually extracted features, word2vec representation and word embedding features.

C. Performance Metrics
This section describes the performance metrics used for measuring the performance of AES models. The most widely used performance evaluation metric is the Kappa statistic, specifically for regression problems. The Kappa statistic is an agreement metric whose value ranges from 0 to 1. It can be computed using Equation 1 [8]:
$$K = \frac{p_o - p_e}{1 - p_e} \tag{1}$$
where $p_o$ represents the observed exact agreement among AES models and $p_e$ represents the hypothetical probability of chance agreement. K = 1 indicates that the models agree completely, and K = 0 indicates agreement no better than chance. In the case of the classification problem, we measure performance in terms of the accuracy of the AES model. We computed accuracy from a confusion matrix that gives the number of essays assigned the expected score label.
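Both metrics can be computed from a confusion matrix, as the following sketch shows. It implements the unweighted form of the Kappa statistic, K = (p_o - p_e) / (1 - p_e), with p_o the observed agreement and p_e the chance agreement estimated from the row and column totals. Whether the experiments used a weighted variant, and how continuous regression outputs were mapped to discrete scores, is not stated in the text; the function names and the sample labels are assumptions.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count essays per (true label, predicted label) pair."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of essays on the diagonal, i.e. assigned the expected label."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def cohen_kappa(matrix):
    """Unweighted Kappa statistic: (p_o - p_e) / (1 - p_e)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(n)) / total
    row_totals = [sum(matrix[i]) for i in range(n)]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Chance agreement: probability that both raters pick class k independently.
    p_e = sum(row_totals[k] * col_totals[k] for k in range(n)) / (total * total)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical true and predicted labels for eight essays and three classes.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 1, 1]
cm = confusion_matrix(y_true, y_pred, n_classes=3)

print(cm)
print(accuracy(cm))
print(cohen_kappa(cm))
```

In this toy example six of eight labels match, so the accuracy is 0.75, while the Kappa value is lower because part of that agreement is expected by chance.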

IV. RESULTS AND DISCUSSION
This section presents the experimental results obtained in this work on the given benchmark dataset using different AES models. For a comprehensive comparison of AES models, we use as baseline the performance of the combination of 30 manually extracted features and 300 word2vec [23] features reported in the study [24]. In [24], the authors used these 330 features and a neural network for automated scoring of essays. Furthermore, we use 768 word embedding features of the BERT model. We use combinations of the three feature sets to evaluate the performance of AES models. The performance of the different models in terms of Kappa statistics for the rescaled regression problem and accuracy for the quantized classification problem is presented in Table IV.
The values presented for the reference model [1] in Table IV were obtained using 5-fold cross-validation on 80% of the dataset. The authors of the study [1] did not report standard deviation estimates; they only reported mean values of the Kappa statistic. In our experiments, in contrast, we used 90% of the benchmark essay dataset and conducted experiments using 10-fold cross-validation. We present the results as the mean and standard deviation of the Kappa statistic and accuracy over five iterations.
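The evaluation protocol, k-fold cross-validation summarized as mean ± standard deviation, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the helper names are hypothetical, the `evaluate` callback stands in for training and scoring a model on one split, and the dummy callback used here simply reports the share of data used for training.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, k, evaluate, seed=0):
    """Run k-fold cross-validation; evaluate(train_idx, test_idx) returns one score."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return scores

def summarize(scores):
    """Return the mean and sample standard deviation of the scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def train_share(train_idx, test_idx):
    """Dummy stand-in for model training and scoring (hypothetical)."""
    return len(train_idx) / (len(train_idx) + len(test_idx))

scores = cross_validate(n_samples=100, k=10, evaluate=train_share)
mean, std = summarize(scores)
print(f"{mean:.3f} ± {std:.3f}")
```

With 10 folds, each model is trained on 90% of the samples and tested on the remaining 10%, which is consistent with the 90% figure given above. In a real run, `evaluate` would return the Kappa statistic or accuracy of a model trained on `train_idx`.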
In these experiments, we also plotted learning curves for the regression and classification tasks considered in this work, based on different feature sets, in terms of Mean Squared Error (MSE) and accuracy, respectively. Fig. 2 presents the learning curves obtained in this set of experiments.
It can be observed from Table IV that the AES model based on the combination of 30 manually extracted features, the 300-dimensional word2vec representation, and 768 BERT word embedding features performs better than the other feature combinations. This model achieved a Kappa statistic of 77.2 ± 1.7 for the rescaled regression problem and an accuracy of 75.2 ± 1.0 for the quantized classification problem. Fig. 3 presents the confusion matrix for the rescaled regression problem based on MF+WV+EM features in this work.
It can also be observed from Table IV that the BERT word embedding model achieves performance similar to that of the combination of 30 manually extracted features and 300 word2vec features. With BERT embedding, the regression problem has better Kappa statistics than with the MF-WV combination. In contrast, BERT embedding reports slightly lower accuracy on the classification problem than the MF-WV combination. Manual features and Word2Vec embeddings each reach about 66-67% accuracy individually on the quantized classification task. It can also be noticed that the WV-EM combination performs similarly to the MF-WV combination, with minor variation. This behaviour may be due to the larger input dimension while the model capacity stays the same. The small dataset size and the curse of dimensionality can be a significant cause of the decreased accuracy. It can also be noted that the MF-EM combination of features achieves better rescaled regression and quantized classification results than those reported in [1]. It is worth mentioning that the authors of the study [1] used an ensemble of LSTM-based encoders and XGBoost, whereas we employed only a shallow feed-forward network with two hidden layers.
Nadeem et al. [25] also used BERT embedding for AES. However, they only reported results for the first and second essay groups. Their results are even worse than the results of the AES model based on MF features. They were only able to improve results slightly by using a combination of both feature inputs.
It can be observed from Table IV that the performance of all combinations on the rescaled regression problem is better than on the corresponding quantized classification problem. This can happen because the Kappa statistic can tolerate deviations from the ground truth label and credit near predictions to some degree, whereas accuracy counts only exact category matches.

V. CONCLUSION
Despite many challenges, researchers are investing continuous efforts in developing efficient and effective AES using different features of essays. In this paper, we presented a comparative empirical analysis of AES models based on different combinations of various features, namely, manually extracted features, word2vec representation and word embedding using the BERT model. The reported results support our hypothesis that both manually extracted features and deep-encoded features contribute to enhancing the performance of AES models. A combination of manually extracted features, word2vec representation and word embedding using the BERT model leads to better performance than the other feature combinations as well as the existing ensemble-based approaches. This combination of features achieved a Kappa statistic of up to 77.2 ± 1.7 for the rescaled regression problem and an accuracy of up to 75.2 ± 1.0 for the quantized classification problem on a benchmark dataset consisting of about 12,000 essays divided into eight groups.
In this paper, our main contribution is the explanation and comparison of AES models based on combinations of various feature sets. We conclude that both manually extracted features and deep-encoded features contribute to enhancing the performance of AES models, which makes AES models more reliable than human beings and helps save time and money in scoring essays.