Improving Performance of Automated Essay Scoring by Using Back-Translation Essays and Adjusted Scores
Automated essay scoring plays an important role in judging students’ language abilities in education. Traditional approaches use handcrafted features to score and are time consuming and complicated. Recently, neural network approaches have improved performance without any feature engineering. Unlike other natural language processing tasks, only a small number of datasets are publicly available for automated essay scoring, and the size of the dataset is not sufficiently large. Considering that the performance of a neural network is closely related to the size of the dataset, the lack of data limits the performance improvement of the automated essay scoring model. In this study, we proposed a method to increase the number of essay-score pairs using back translation and score adjustment and applied it to the Automated Student Assessment Prize dataset for augmentation. We evaluated the effectiveness of the augmented data using models from prior work. In addition, performance was evaluated in a model using long short-term memory, which is widely used for automated essay scoring. The performance was improved by using augmented data.
Artificial intelligence is one of the key drivers of industrial development and an important factor in accelerating the integration of emerging technologies. It is now widely used in almost all aspects of our lives [1–4].
Online education systems are being used more actively since the COVID-19 outbreak, and the role of educational assessment systems has become more important. Learning a foreign language has also become more common, and it no longer serves only to satisfy personal interests. In today’s multicultural society, being able to speak a second or third language freely has become essential. Writing is an important part of language learning, and the assessment of writing ability is included in all language tests.
Automated essay scoring (AES) is the task of evaluating one’s writing ability and assigning a score to an essay without human interference. The process of manually scoring essays is complex and time consuming. Even though there is a fixed scoring guide, the scoring process is influenced by individual factors, such as mood and personality, and assigned scores are subjective and lack credibility. Automatic scoring for the essay was proposed as a solution to manual scoring.
Early works were based on feature engineering, and performance was improved by adding more complicated features. In recent years, neural network models have been introduced in this research and have improved performance without any feature engineering. Various neural networks have been used for AES, and their performance has improved continuously. From the simplest recurrent neural network (RNN) and convolutional neural network (CNN) to a complicated large-scale pretrained language model (XLNet), almost every type of neural network has been used for AES. Prior works have improved performance by changing the structure of the neural network model or by adding other features.
However, we improved the performance by generating more useful data from the original data. Data augmentation techniques have been applied to other natural language processing (NLP) tasks and have shown good performance. However, there are no examples of data augmentation techniques applied to AES. Our study is the first attempt to augment AES data.
Data augmentation for the NLP corpus should be conducted at the sentence level or the document level. If we just replaced some words, some information in the entire text may be lost. Therefore, back translation, which is simple and augments data at the document level, was selected as our data augmentation technique.
We generated back-translation essays using Google Translator (https://translate.google.com/) and adjusted the corresponding scores in several ways. We trained and validated the model with double the number of essay-score pairs and tested it on the original data. Using the augmented data improved performance.
Our main contributions are as follows:
(1) Data augmentation was introduced into AES and improved performance. We demonstrated that data augmentation is feasible when the score is adjusted along with the essay.
(2) We analyzed the characteristics of back-translation essays and the AES task and devised a score adjustment method suitable for back-translation essays in AES data.
(3) We generated back-translation essays (English-Chinese-English, English-French-English) (https://github.com/j-y-j-109/asap-back-translation) for the Automated Student Assessment Prize (ASAP) dataset (https://www.kaggle.com/c/asap-aes).
2. Related Work
The first AES system was created in 1966 and used linguistic features to score essays. Most recent works have used neural network models for AES. In 2016, Taghipour and Ng designed a neural network model using a CNN and long short-term memory (LSTM) and showed significant improvement over traditional methods that depend on manual feature engineering (Figure 1). This is the simplest and most representative model: it generates a representation of the input essay and obtains a value from it. The convolution layer extracts local features from the essay, and the recurrent layer generates a representation of the essay. In the mean-over-time layer, the sum of the outputs of the recurrent layer is divided by the essay length. Let $H = (h_1, h_2, \ldots, h_M)$ be the outputs of the recurrent layer for an essay of length $M$. Then, the function of the mean-over-time layer is defined by the following equation:
$$ \mathrm{MoT}(H) = \frac{1}{M} \sum_{t=1}^{M} h_t. \tag{1} $$
In the last linear layer, they used the sigmoid function to get a score in the range of (0,1). The sigmoid function is given by the following equation:
$$ \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}. \tag{2} $$
Therefore, they normalized all scores to [0,1] before training the model.
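The pooling, squashing, and score normalization steps described above can be illustrated with a minimal NumPy sketch. The function names, the toy matrix, and the linear rescaling between the prompt's score bounds are illustrative assumptions, not the published implementation.

```python
import numpy as np

def mean_over_time(H: np.ndarray) -> np.ndarray:
    """Average recurrent outputs over time; H has shape (M, d)."""
    return H.sum(axis=0) / H.shape[0]

def sigmoid(x):
    """Squash a real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def normalize_score(score, low, high):
    """Map a raw score in [low, high] to [0, 1] (assumed linear rescaling)."""
    return (score - low) / (high - low)

def denormalize_score(y, low, high):
    """Map a model output in [0, 1] back to an integer score in [low, high]."""
    return int(round(y * (high - low) + low))

# Toy example: two time steps, two hidden units.
H = np.array([[1.0, 2.0], [3.0, 4.0]])
pooled = mean_over_time(H)          # element-wise mean of the two rows
target = normalize_score(8, 2, 12)  # a score of 8 on a 2-12 scale
```

Because the sigmoid output never leaves (0, 1), training targets have to be rescaled into the same interval, and predictions have to be mapped back to the prompt's score range before an agreement metric is computed.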
Since then, an increasing number of neural network models have been used for AES, and almost all types of neural networks, including GRU, BERT, and XLNet, have been used (Table 1). A plain RNN was also tried, but it was not used in the final models because its performance was lower than that of GRU or LSTM.
After such simple models, models that assign scores by capturing other features, such as coherence and relevance, have also been used. Example essays were also used to assign a score to the input essay. In references [15, 17], the relevance between the prompt and the essay and the coherence of sentences within the essay were captured. In another work, the similarity between the word distribution of the input essay and that of the example essays was used. In yet another work, the k-means algorithm was used to cluster example essays, and representative essays were selected from each cluster and used to assign scores.
Various models and embeddings have been used as representations for words and sentences. Some works used previously released pretrained word embeddings. Others used C&W embeddings [19, 20] and augmented them by considering the contribution of each word to the essay score. Character embeddings have also been attempted, and in references [13, 14], pretrained GloVe embeddings trained on Google News were used. In references [15, 17], BERT was used to obtain sentence representations.
Data augmentation is a technique to increase the size of data by slightly modifying the original data or adding newly created data to the original data. Data augmentation has received a lot of attention as it reduces the number of cases where neural network training fails due to a lack of data. It can also improve the performance of neural network models. For example, data augmentation such as flipping and rotating an image is used in the field of computer vision. Unlike images, natural language is discrete and it is more difficult to augment data, but many data augmentation techniques have been proposed recently.
Data augmentation techniques are widely used in NLP tasks. For the text classification task, Zhang et al. replaced words with synonyms found in WordNet, and for the categorization task, Wang and Yang obtained synonyms by calculating cosine similarity. Yu et al. generated augmented data for reading comprehension by translating sentences into French and back into English, and Lun et al. generated back-translation data using Japanese for automatic short answer scoring. To date, various data augmentation techniques have been proposed and used for many NLP tasks. However, no prior works have applied data augmentation to AES.
Back translation means that the original data are translated into another language and then translated back to obtain new data in the original language. This method rewrites the entire text rather than replacing individual words. In references [24, 26], an English-French translation model was used to perform back translation for each sentence. In addition to trained machine translation models, Google’s Cloud Translation API service has been widely used [27, 28]. The Baidu Translation API service has also been used. There are also other methods that add various additional features based on back translation.
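As a minimal sketch of the round trip described above, back translation can be written as two calls to a translation function. The `translate` callable and its `(text, source, target)` signature are hypothetical placeholders for whatever translation system is available; `toy_translate` is a stand-in that only tags the text so the call order is visible.

```python
from typing import Callable

# Hypothetical translator signature: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]

def back_translate(text: str, translate: Translator,
                   pivot: str = "fr", source: str = "en") -> str:
    """Translate into the pivot language, then back into the source language."""
    pivot_text = translate(text, source, pivot)
    return translate(pivot_text, pivot, source)

def toy_translate(text: str, src: str, dst: str) -> str:
    """Stand-in for a real translator: records the language pair only."""
    return f"[{src}->{dst}] {text}"

result = back_translate("Dear local newspaper, ...", toy_translate, pivot="zh")
# result carries both tags, showing the en->zh step ran before zh->en
```

Passing the translator in as an argument keeps the sketch independent of any particular service, which matters here because the quality of the back translation depends on the translator that is plugged in.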
3. Augmented Data
This section describes the original data and the augmented data in detail. A dataset for AES consists of essays and their corresponding scores. Therefore, when creating new data using data augmentation techniques, the essays and their corresponding scores should be determined together.
3.1. Original Dataset
There are several open datasets for AES, and more than 90% of prior works were evaluated using the ASAP dataset . In 2012, Kaggle hosted the ASAP competition to evaluate the capabilities of AES systems. The ASAP dataset is built with essays written by students ranging in grade levels from grade 7 to grade 10. There are approximately 13,000 essays corresponding to 8 prompts. For individual prompts, the number of essays is less than 2,000. Specific dataset information is presented in Table 2. Each prompt has a different score range and number of essays. The test set used in the competition is not publicly available.
Identifying information from the essays of the ASAP dataset was removed using the Named Entity Recognizer from the Stanford Natural Language Processing group and a variety of other approaches. The relevant entities were identified in the essay and then replaced with a string starting with “@.” Any misspelled words or grammatical errors were transcribed exactly as they occur in the original essays.
3.2. Back Translation
First, we need to obtain back-translation essays using essays of the original data. To study the general effect of back-translation essays, we generated back-translation essays using two languages. We hypothesized that back translation using different languages would help to diversify augmented data. We used the multibyte language, Chinese, and the single-byte language, French.
As mentioned in the Related Work section, in references [27, 28], the data were augmented by using the Google Translate API. We did not use the API; instead, we used the web version of Google Translator. With the web version, we do not need to write any additional code, and we do not have to make as many requests as when using the API. By dividing the entire text into documents of a size that Google Translator can process at once, the translation can be completed with one request per document.
The reasons we used Google Translator are as follows: first, the quality of the back translation is related to the translator. Our study below assumes that the quality of the translator is good enough. Second, our results must be reproducible. Google Translator is a popular translator that anyone can use.
As the amount of data that Google Translator can process at one time is limited, the original essays were divided into 8 equal-sized parts for translation. Therefore, the entire dataset was processed with 32 translation requests (8 parts × 2 translation directions for back translation × 2 languages). Google Translator preserved the special words starting with @ in the essays.
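The request count above follows directly from the partitioning. The sketch below, with hypothetical helper names, splits a list of essays into near-equal parts and multiplies out the number of translation requests. How the essays inside a part are joined and separated again is not specified in the text and is left out.

```python
def split_into_parts(essays, n_parts=8):
    """Split a list into n_parts contiguous chunks whose sizes differ by at most 1."""
    size, remainder = divmod(len(essays), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < remainder else 0)
        parts.append(essays[start:end])
        start = end
    return parts

def count_requests(n_parts=8, n_directions=2, n_languages=2):
    """One request per part, per translation direction, per pivot language."""
    return n_parts * n_directions * n_languages

parts = split_into_parts([f"essay {i}" for i in range(13)])
total_requests = count_requests()  # 8 * 2 * 2
```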
3.3. Score Adjustment
After obtaining the back-translation essays, the corresponding scores must be determined. The score setting directly affects the performance of data augmentation. This is because even if the number of essays in the data is large, the performance of the model can be degraded if the scores for the essays are not reasonably determined (see the Evaluation section). Therefore, it is essential to adjust the scores for back-translation essays.
The most intuitive way to set the score is to give the back-translation essay the score of the original essay, since back-translation essays are similar to the original essays. In this case, the new scores are given by the following equation:
$$ s'_e = s_e, \quad e \in E_p, \tag{3} $$
where $E_p$ represents all essays of prompt $p$, and $s_e$ and $s'_e$ are the original and new scores of essay $e$, respectively. Another method provides a more suitable score by finely adjusting the original score. In this case, the new scores are given by the following equations:
$$ s'_e = \min\left(s_e + \alpha \cdot P(e), \mathrm{max}_p\right), \quad e \in E_p, \tag{4} $$
or
$$ s'_e = \max\left(s_e - \alpha \cdot P(e), \mathrm{min}_p\right), \quad e \in E_p, \tag{5} $$
where $P$ is a condition. For example, let $P$ be the condition for judging whether an essay has a length greater than 300. Then $P(e) = 1$ means that the length of essay $e$ is greater than 300. $\mathrm{max}_p$ and $\mathrm{min}_p$ are the maximum and minimum scores that an essay in prompt $p$ can take, and $\alpha$ is an additional value used to adjust the score.
The essays in the ASAP dataset have certain characteristics. In the essays of the ASAP dataset, certain errors, such as grammatical and lexical errors, exist because of the characteristics of the AES task. For example, as described in Section 3.1, there are some misspelled words in the ASAP dataset, and the number of misspelled words decreases after back translation using Chinese (Figure 2). If we use a translator, the translator can correct these errors to a certain extent in the process of translating them into other languages, and we can generate translated essays with a smaller number of errors (Figure 3). Here, we assumed that the translator is good enough to do so. When generating back-translation essays using these translated essays, the translator can generate essays at a relatively higher level than the original essays. Therefore, it can be assumed that the quality of back-translation essays is slightly higher than that of original essays.
Considering that back-translation essays are similar to the original essays, scores of back-translation essays were given as the original scores according to equation (3) for all prompts. For prompts with a small score range, that is, prompts 1–6, back translation cannot change the score of an essay. For example, if there are only two scores, 0 and 1, an essay with a score of 0 cannot be back translated into an essay with a score of 1, and an essay with a score of 1 cannot be back translated into an essay with a score of 0. A change of 1 point requires large changes in the essay. For prompts 7 and 8, equation (4) was also used to determine the scores of back-translation essays. Condition $P$ returns 1 when the essay has a score higher than the highest frequency score, and $\alpha$ is given as 1. For prompts 7 and 8, the scores of back-translation essays are given by the following equation:
$$ s'_e = \min\left(s_e + P(e), \mathrm{max}_p\right), \quad P(e) = \begin{cases} 1, & s_e > f_p, \\ 0, & \text{otherwise}, \end{cases} \tag{6} $$
where $s_e$ and $s'_e$ are the original and new scores of essay $e$, $\mathrm{max}_p$ is the maximum score of prompt $p$, and $f_p$ denotes the highest frequency score of prompt $p$.
First, the additional point is set to 1 because back translation slightly raises the quality of the essay. Second, when scoring, a higher score is usually given for a more perfectly written essay than the baseline. If we want to obtain a high score, we have to complete it more precisely in all aspects, such as vocabulary and grammar. The baseline can be assumed as the level of the essay with the highest frequency score. For high-score essays, even if there is a slight improvement, the score increases. In other words, the scores of back-translation essays from low-score essays do not increase even if back translation is performed, but the scores of back-translation essays from high-score essays do increase after back translation. For each prompt, the highest frequency score was found, and additional points were given to essays with a score higher than that score.
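The rule described above (find the highest frequency score of a prompt, then add one point only to essays scored above it) can be sketched as follows. The function names, the tie-break when several scores are equally frequent, and the capping at the prompt's maximum score are assumptions made for illustration.

```python
from collections import Counter

def highest_frequency_score(scores):
    """Most frequent score; ties go to the lowest score (an assumption)."""
    counts = Counter(scores)
    top = max(counts.values())
    return min(s for s, c in counts.items() if c == top)

def adjust_scores(scores, max_score, alpha=1):
    """Add alpha to scores above the highest frequency score, capped at max_score."""
    baseline = highest_frequency_score(scores)
    return [min(s + alpha, max_score) if s > baseline else s for s in scores]

original = [10, 16, 16, 20, 30]           # toy scores on a 0-30 scale
adjusted = adjust_scores(original, max_score=30)
# Scores at or below the baseline (16) are unchanged; 20 gains a point;
# 30 is already at the maximum and stays there.
```

Essays at or below the baseline keep their original scores, which matches the argument that a small quality gain only moves the score of an already strong essay.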
We hypothesized that we could show the generality of back translation by using French and Chinese. Also, on the assumption that the quality of the translator is good, the score adjustment method was determined. In this section, we evaluate whether the performance of the model improves and whether our score adjustment method is effective.
To evaluate the effectiveness of the augmented data and for a fair comparison, we trained the models using their published code (https://github.com/sdeva14/sustai21-counter-neural-essay-length). The effectiveness of data augmentation was evaluated by training the model on the original data and on the augmented data. Performance was evaluated with Quadratic Weighted Kappa (QWK). QWK was the official criterion of the ASAP competition and has been used to evaluate and compare models in many works. The weight matrix $W$ is constructed by the following equation:
$$ W_{i,j} = \frac{(i - j)^2}{(N - 1)^2}, \tag{7} $$
where $i$ and $j$ are the gold score and predicted score, respectively, and $N$ is the number of possible scores. The QWK score is calculated by using the following equation:
$$ \kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}}. \tag{8} $$
Here, $O_{i,j}$ represents the number of essays that received score $j$ from the model and score $i$ from the human rater. The matrix $E$ is the outer product of the histogram vectors of the two scores. To make the sums of the elements of $O$ and $E$ equal, matrix $E$ is normalized.
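For concreteness, the QWK computation described above can be sketched in NumPy. This is an illustrative implementation of the standard definition, not the evaluation code used in the experiments; it assumes integer scores and at least two possible score values, and it does not guard against the degenerate case where the denominator is zero.

```python
import numpy as np

def quadratic_weighted_kappa(gold, pred, min_score, max_score):
    """QWK between integer gold and predicted scores in [min_score, max_score]."""
    n = max_score - min_score + 1
    # Observed matrix: rows index gold scores, columns index predicted scores.
    O = np.zeros((n, n))
    for g, p in zip(gold, pred):
        O[g - min_score, p - min_score] += 1
    # Quadratic disagreement weights.
    idx = np.arange(n)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    # Expected matrix: outer product of the two score histograms,
    # rescaled so that it has the same total as O.
    E = np.outer(O.sum(axis=1), O.sum(axis=0))
    E = E * O.sum() / E.sum()
    return 1.0 - (W * O).sum() / (W * E).sum()

perfect = quadratic_weighted_kappa([0, 1, 2], [0, 1, 2], 0, 2)
reversed_ = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 0, 2)
```

Perfect agreement puts all counts on the diagonal, where the weights are zero, so the numerator vanishes; large disagreements are penalized quadratically.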
We used two models to determine whether the augmented data improve the model’s performance (Figure 5). As the first model, we used “Manipulating-Length-GRU.” Unlike the mean-over-time layer of earlier models, this model does not divide the sum of the outputs of the recurrent layer by the length of the essay but by the average length of the essays included in each prompt. GRU was used as the recurrent layer. The function of the layer is given by the following equation:
$$ f(H) = \frac{1}{L_p} \sum_{t=1}^{M} h_t, \tag{9} $$
where $H = (h_1, h_2, \ldots, h_M)$ are the outputs of the recurrent layer for an essay of length $M$ and $L_p$ is the average length of the essays included in prompt $p$.
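The difference from ordinary mean pooling is only the divisor, as the following sketch shows. The function name and the toy inputs are illustrative assumptions; the published code may organize this differently.

```python
import numpy as np

def pool_by_average_length(H: np.ndarray, avg_len: float) -> np.ndarray:
    """Sum recurrent outputs over time and divide by the prompt's average essay length."""
    return H.sum(axis=0) / avg_len

short_essay = np.ones((2, 3))   # 2 time steps, 3 hidden units
long_essay = np.ones((6, 3))    # 6 time steps, 3 hidden units
avg_len = 4.0                   # hypothetical average length for the prompt

short_vec = pool_by_average_length(short_essay, avg_len)
long_vec = pool_by_average_length(long_essay, avg_len)
```

With a fixed divisor per prompt, the pooled vector of a longer essay has a larger magnitude than that of a shorter one, so length information is retained instead of being averaged away.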
As the second model, “Considering-Content-LSTM” was used. This model computes the KL divergence between the word distribution of the example essays, which are divided into three levels, and the word distribution of the input essay, and concatenates the three values to the averaged output of the recurrent layer. In this model, the function of the layer is given by the following equation:
$$ f(H) = \left[\bar{h};\ kl_1;\ kl_2;\ kl_3\right], \tag{10} $$
where $\bar{h}$ is the averaged output of the recurrent layer and $kl_1$, $kl_2$, and $kl_3$ are the KL divergences for the three levels, respectively.
The original implementation used GRU and XLNet, but we used LSTM, which is widely used for AES.
Since the recurrent layer has the highest complexity in the model, its complexity determines the complexity of the model. Therefore, the complexity of the model is $O(n \cdot d^2)$, where $d$ is the output dimension of the recurrent layer and $n$ is the length of the input essay.
As word vectors, these models used GloVe, a 100-dimensional pretrained embedding model trained on Google News.
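A rough sketch of the content features described above is given below. The vocabulary, the add-one smoothing, the direction of the KL divergence (input essay relative to each level), and all function names are assumptions made for illustration; the published code may differ on each of these points.

```python
import math
from collections import Counter

def word_distribution(tokens, vocab, smoothing=1.0):
    """Smoothed relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab) + smoothing * len(vocab)
    return [(counts[w] + smoothing) / total for w in vocab]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def content_features(pooled, essay_tokens, level_tokens, vocab):
    """Concatenate the pooled recurrent output with one KL value per level."""
    p = word_distribution(essay_tokens, vocab)
    kls = [kl_divergence(p, word_distribution(t, vocab)) for t in level_tokens]
    return list(pooled) + kls

vocab = ["a", "b"]
levels = [["a", "a", "b"], ["a", "b"], ["b", "b", "a"]]  # toy low/mid/high examples
features = content_features([0.1, 0.2], ["a", "a", "b"], levels, vocab)
# features has 2 pooled values followed by 3 KL values
```

Smoothing keeps every probability strictly positive, so the logarithm is always defined even when a word is absent from one of the example sets.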
4.2. Experimental Setup
The ASAP dataset does not have a public test set, so cross validation is used to evaluate the models. We used the same cross-validation partitions as prior work. We trained and validated the model with double the number of essay-score pairs and evaluated performance on the original test set. We trained for 50 epochs, selected the model that performed best on the validation set, and applied it to the test set. The Adam optimizer (eps = 1e-7) was used with a learning rate of 0.001. The batch size was set to 32. The cell size of the recurrent layer was 300. The loss function is given by the following equation:
$$ \mathrm{MSE}(s, s^*) = \frac{1}{N} \sum_{i=1}^{N} \left(s_i - s^*_i\right)^2, \tag{11} $$
where $N$ is the number of essays in the dataset, $s^*_i$ is a gold score, and $s_i$ is a predicted score.
We trained the model 15 times for every prompt and report the average performance.
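The settings listed above can be collected in one place, together with a mean squared error loss, as in the following sketch. The dictionary keys and the function name are illustrative, and the use of mean squared error on normalized scores is an assumption based on the sigmoid output described earlier; the published training code remains the authoritative reference.

```python
import numpy as np

# Hyperparameters as stated in the text; key names are hypothetical.
CONFIG = {
    "epochs": 50,
    "batch_size": 32,
    "learning_rate": 0.001,
    "adam_eps": 1e-7,
    "rnn_cell_size": 300,
}

def mse_loss(pred, gold):
    """Mean squared error between predicted and gold scores."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return float(np.mean((pred - gold) ** 2))

loss = mse_loss([0.5, 1.0], [0.0, 1.0])  # one prediction off by 0.5, one exact
```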
Table 4 lists the performance of the “Manipulating-Length-GRU” model for prompts with a small score range. “Ori” means the original data. “Ori + Ch” means the data augmented with back-translation essays using Chinese, and “Ori + Fr” means the same using French. The scores of back-translation essays are the same as those of the original essays. Table 5 lists the performance of the “Manipulating-Length-GRU” model for prompts 7 and 8, which have a large score range. For the augmented data using adjusted scores, we marked the score adjustment equation number and the value of the variable $\alpha$ in the equation next to the data. For example, [(4), $\alpha = 2$] means giving all back-translation essays in the prompt scores 2 points higher than the original essay scores. The value marked with “” is slightly bigger than the value marked with “^.”
For prompts 2, 3, 4, and 6, “Ori + Fr” showed the best performance. Except for the “Ori + Ch” for prompt 5, for prompts 2 to 6, the performance was improved by using back-translation essays and original scores. For prompt 1, the performance did not improve, and we suspect that this is because prompt 1 has a relatively larger score range than prompts 2 to 6. For prompt 8, since the score range is twice that of prompt 7, [(4), ] was also applied. Except for the “Ori + Ch [(5), ]” for prompt 8, if all essays of the prompt are scored using equations (4) or (5), the performance is lower than that when the score is not adjusted. In contrast, the augmented data using equation (6) showed a higher performance than when using original scores. For prompt 8, the augmented data improved the performance compared to the original data. For prompt 7, the performance of the original data is the best.
The performance of the new augmented data was slightly higher than that of “Ori + Ch [(5), ].” This indicates that equation (6) is still effective. Ori + Ch [(6)] performed worse than “Ori + Ch [(5), ]” because the number of applied essays was smaller. Equation (6) was applied to 146 essays, as listed in Table 3, but equation (5) was applied to a total of 723 essays.
For prompt 5, the performance is decreased when “Ori + Ch” is applied, and for prompt 8, the performance is improved when “Ori + Ch [(5), ]” is applied. We suspect that some information is lost when translating low-score essays using a multibyte language.
Figure 6 shows the QWK score on the validation set for every epoch of a single training run. The performance in the graph therefore differs somewhat from the final averaged performance. The performance converges within the first 10 epochs, and convergence is faster when the augmented data are used. However, there are cases where the performance still improves after epoch 10, especially for prompts 7 and 8.
Training with the augmented data takes twice as long as training with the original data. To reduce the training time, we attempted to reduce the number of epochs for the augmented data. After obtaining the result with the number of epochs set to 50, we determined the first epoch number with the best performance (Figure 7). When using the augmented data, the first epoch number of the best model was much smaller than when using the original data. This implies that the augmented data reach the best model in fewer epochs, although each epoch takes longer. Therefore, when training the next model, we set the number of epochs to 30 for the augmented data. The training time with the augmented data is then 1.2 (2 × 3/5) times that of the original data.
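The arithmetic behind the factor of 1.2, and the notion of the first epoch with the best performance, can be made explicit with a short sketch. The helper names and the toy validation curve are illustrative; the cost model simply assumes that training time scales with the amount of data times the number of epochs.

```python
def first_best_epoch(val_scores):
    """1-based index of the first epoch that reaches the best validation score."""
    return val_scores.index(max(val_scores)) + 1

def relative_training_time(data_factor, epochs_augmented, epochs_original):
    """Training cost relative to the original setup: data size times epoch count."""
    return data_factor * epochs_augmented / epochs_original

toy_curve = [0.70, 0.78, 0.76, 0.78]          # hypothetical QWK per epoch
best_epoch = first_best_epoch(toy_curve)      # the best value first appears at epoch 2
ratio = relative_training_time(2, 30, 50)     # doubled data, 30 instead of 50 epochs
```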
Second, we trained the “Considering-Content-LSTM” model (Table 6). In Table 6, the average improvement value was obtained by dividing the improvement value for 5 prompts by 8. The value marked with “” is slightly bigger than the value marked with “^.”
Table 7 lists a performance comparison between our evaluation model and previous models. Those models used the same cross-validation data to evaluate. In fact, models in reference  outperform these previous models without data augmentation.
Through the experiment on the first model, for prompts 2, 3, 4, 6, and 8, the effectiveness of the augmented data was confirmed. New results were obtained using augmented data in the second model. The performance was also improved in the second model.
The performance was improved by 0.2% on average for both models using the augmented data.
In this study, data augmentation was introduced to AES for the first time, and the score was adjusted in consideration of the characteristics of the AES task. We improved the performance of AES by using back-translation essays and adjusted scores. We generated back-translation essays and adjusted scores for the ASAP dataset and confirmed the effectiveness of the augmented data. We used different score adjustment methods for specific prompts to find a reasonable method.
We generated back-translation essays for the ASAP dataset using Chinese and French. It was effective to maintain the score as it was for prompts with a small score range. For prompts with a large score range, based on the highest frequency score, it was effective to increase the score for the high-score essays and maintain the score for the low-score essays. For prompts 2, 3, 4, 6, and 8, higher performance was obtained than when using the original data. The performance was improved by 0.2% on average. In addition, we found that the augmented data get to the best model faster than the original data, reducing the effect of increasing time from data augmentation to some extent.
By improving the performance of AES using data augmentation, it is possible to further improve the performance to a certain extent even when the dataset cannot be sufficiently established due to various limitations. In other words, it provided new research possibilities for the AES task.
Currently, our research has the following limitations. The score adjustment method lacks a rigorous mathematical basis. The experiments were not extensive enough. The comparative models and datasets used for evaluation are insufficient.
In the future, we will explore more efficient, more mathematically theoretical, and practical score adjustment methods for back-translation essays. In addition, the present method will be applied to other datasets. We also plan to research other data augmentation techniques and their corresponding score adjustment methods for the AES task.
The ASAP dataset used in this study is available at https://www.kaggle.com/c/asap-aes. The Python codes and back-translation data used to support the findings of this study are available at https://github.com/j-y-j-109/asap-back-translation and https://github.com/sdeva14/sustai21-counter-neural-essay-length.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
J. Ma, Y. Wang, X. Niu, S. Jiang, and Z. Liu, “A comparative study of mutual information-based input variable selection strategies for the displacement prediction of seepage-driven landslides using optimized support vector regression,” Stochastic Environmental Research and Risk Assessment, pp. 1–21, 2022.
J. Zhang, H. Tang, D. D. Tannant, C. Lin, X. Ding, L. Xiao, Z. Yongquan, and M. Junwei, “Combined forecasting model with CEEMD-LCSS reconstruction and the ABC-SVR method for landslide displacement prediction,” Journal of Cleaner Production, vol. 293, Article ID 126205, 2021.
Z. L. Yang, Z. H. Dai, Y. M. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: generalized autoregressive pretraining for language understanding,” in Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), pp. 5754–5764, Vancouver, Canada, December 2019.
F. Dong, Y. Zhang, and J. Yang, “Attention-based recurrent convolutional neural network for automatic essay scoring,” in Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 153–162, Vancouver, Canada, August 2017.
W. Y. Zou, R. Socher, D. Cer, and D. Christopher, “Bilingual word embeddings for phrase-based machine translation,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1393–1398, Stanford, CA, U.S.A, January 2013.
R. Collobert and J. Weston, “A unified architecture for natural language processing: deep neural networks with multitask learning,” in Proceedings of the 25th International Conference on Machine Learning, pp. 160–167, New York, NY, U.S.A, July 2008.
R. Collobert, J. Weston, L. Bottou, K. Michael, K. Koray, and K. Pavel, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in Neural Information Processing Systems, vol. 28, pp. 649–657, 2015.
W. Y. Wang and D. Yang, “That’s so annoying!!!: a lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2557–2563, Lisbon, Portugal, January 2015.
Q. Xie, Z. Dai, and E. Hovy, “Unsupervised data augmentation for consistency training,” Advances in Neural Information Processing Systems, vol. 33, pp. 6256–6268, 2020.