JParaCrawl: A large scale web-based English-Japanese parallel corpus

Makoto Morishita, Jun Suzuki, Masaaki Nagata

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Recent machine translation algorithms mainly rely on parallel corpora. However, since the availability of parallel corpora remains limited, only some resource-rich language pairs can benefit from them. We constructed a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. We constructed the parallel corpus by broadly crawling the web and automatically aligning parallel sentences. Our collected corpus, called JParaCrawl, amassed over 8.7 million sentence pairs. We show how it includes a broader range of domains and how a neural machine translation model trained with it works as a good pre-trained model for fine-tuning specific domains. The pre-training and fine-tuning approaches achieved or surpassed performance comparable to model training from the initial state and reduced the training time. Additionally, we trained the model with an in-domain dataset and JParaCrawl to show how we achieved the best performance with them. JParaCrawl and the pre-trained models are freely available online for research purposes.

Original languageEnglish
Title of host publicationLREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
EditorsNicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
PublisherEuropean Language Resources Association (ELRA)
Pages3603-3609
Number of pages7
ISBN (Electronic)9791095546344
Publication statusPublished - 2020
Externally publishedYes
Event12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: 2020 May 112020 May 16

Publication series

NameLREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference

Conference12th International Conference on Language Resources and Evaluation, LREC 2020
CountryFrance
CityMarseille
Period20/5/1120/5/16

Keywords

  • English
  • Japanese
  • Machine translation
  • Parallel corpus

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Library and Information Sciences
  • Linguistics and Language

Fingerprint Dive into the research topics of 'JParaCrawl: A large scale web-based English-Japanese parallel corpus'. Together they form a unique fingerprint.

Cite this