Collecting a large amount of training data for grammatical error correction (GEC) models has been an ongoing challenge in the field of GEC. Recently, it has become common to use data demanding deep neural models such as an encoder-decoder for GEC; thus, tackling the problem of data collection has become increasingly important. The incorporation of pseudo data in the training of GEC models is one of the main approaches for mitigating the problem of data scarcity. However, a consensus is lacking on experimental configurations, namely, (i) the methods for generating pseudo data, (ii) the seed corpora used as the source of the pseudo data, and (iii) the means of optimizing the model. In this study, these configurations are thoroughly explored through massive amount of experiments, with the aim of providing an improved understanding of pseudo data. Our main experimental finding is that pretraining a model with pseudo data generated by back-translation-based method is the most effective approach. Our findings are supported by the achievement of state-of-the-art performance on multiple benchmark test sets (the CoNLL-2014 test set and the official test set of the BEA-2019 shared task) without requiring any modifications to the model architecture. We also perform an in-depth analysis of our model with respect to the grammatical error type and proficiency level of the text. Finally, we suggest future directions for further improving model performance.
|ジャーナル||IEEE/ACM Transactions on Audio Speech and Language Processing|
|出版ステータス||Published - 2020|
ASJC Scopus subject areas
- コンピュータ サイエンス（その他）