Massive Exploration of Pseudo Data for Grammatical Error Correction

Shun Kiyono, Jun Suzuki, Tomoya Mizumoto, Kentaro Inui

Research output: Article (peer-reviewed)


Collecting a large amount of training data for grammatical error correction (GEC) models has been an ongoing challenge in the field. Recently, it has become common to use data-demanding deep neural models, such as encoder-decoder models, for GEC; thus, tackling the problem of data collection has become increasingly important. Incorporating pseudo data into the training of GEC models is one of the main approaches to mitigating data scarcity. However, a consensus is lacking on experimental configurations, namely, (i) the methods for generating pseudo data, (ii) the seed corpora used as the source of the pseudo data, and (iii) the means of optimizing the model. In this study, these configurations are thoroughly explored through a massive number of experiments, with the aim of providing an improved understanding of pseudo data. Our main experimental finding is that pretraining a model with pseudo data generated by a back-translation-based method is the most effective approach. Our findings are supported by the achievement of state-of-the-art performance on multiple benchmark test sets (the CoNLL-2014 test set and the official test set of the BEA-2019 shared task) without requiring any modifications to the model architecture. We also perform an in-depth analysis of our model with respect to the grammatical error type and the proficiency level of the text. Finally, we suggest future directions for further improving model performance.
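To make the notion of pseudo data concrete, the following is a minimal sketch of how (erroneous source, clean target) training pairs could be built from a seed corpus of clean sentences. It uses simple token-level noising (mask, delete, insert), not the back-translation-based method that the abstract reports as most effective, and it is not the authors' implementation. The function names, the noise probabilities, the small insertion vocabulary, and the `<mask>` symbol are all assumptions made for illustration.

```python
import random


def add_noise(tokens, rng, p_mask=0.1, p_delete=0.1, p_insert=0.1,
              vocab=("the", "a", "of", "to"), mask="<mask>"):
    """Corrupt a clean token list with token-level noise (hypothetical scheme).

    For each token, exactly one action is sampled:
    replace it with a mask symbol, delete it, insert a random
    vocabulary word before it, or keep it unchanged.
    """
    noisy = []
    for tok in tokens:
        r = rng.random()  # uniform in [0, 1)
        if r < p_mask:
            noisy.append(mask)
        elif r < p_mask + p_delete:
            continue  # drop the token
        elif r < p_mask + p_delete + p_insert:
            noisy.append(rng.choice(vocab))
            noisy.append(tok)
        else:
            noisy.append(tok)
    return noisy


def make_pseudo_pairs(sentences, seed=0, **noise_kwargs):
    """Turn clean seed sentences into (noisy source, clean target) pairs."""
    rng = random.Random(seed)  # fixed seed for reproducible corruption
    pairs = []
    for sentence in sentences:
        noisy = add_noise(sentence.split(), rng, **noise_kwargs)
        pairs.append((" ".join(noisy), sentence))
    return pairs


if __name__ == "__main__":
    seed_corpus = ["She goes to school every day ."]
    for source, target in make_pseudo_pairs(seed_corpus, p_mask=0.2):
        print(source, "->", target)
```

In a pretraining setup of the kind the abstract describes, pairs like these would be used to pretrain the model before it is trained on genuine learner data; a back-translation-based generator would replace `add_noise` with a model trained to map clean text to erroneous text.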

Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Publication status: Published - 2020

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

