Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition

Ryo Masumura, Seongjun Hahm, Akinori Ito

Research output: Conference article, peer-reviewed

11 citations (Scopus)

Abstract

This paper describes a language modeling method for spontaneous speech recognition that uses large-scale spoken language data retrieved from the Web. We downloaded 15 million Web pages covering a comprehensive range of topics. Next, spoken-language-like texts were selected from the downloaded Web data using a naïve Bayes classifier, and typical linguistic phenomena such as fillers and pauses were added using simulation models. A language model trained on the generated data performed as well as one trained on a large-scale spontaneous speech corpus (the Corpus of Spontaneous Japanese, CSJ). By combining the generated data with CSJ, we improved word accuracy.
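The two processing steps named in the abstract (selecting spoken-language-like text with a naïve Bayes classifier, then adding fillers) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the add-one smoothing, the whitespace tokenization, the toy English sentences, and the uniform random filler insertion are all assumptions made for illustration. The abstract does not specify the paper's simulation models, and real Japanese text would first need a morphological analyzer to split it into words.

```python
import math
import random
from collections import Counter


def train_naive_bayes(spoken, written):
    """Train a two-class multinomial naive Bayes model (hypothetical helper)."""
    docs = {"spoken": spoken, "written": written}
    # Per-class token counts; tokens are split on whitespace (an assumption).
    counts = {
        label: Counter(tok for sent in sents for tok in sent.split())
        for label, sents in docs.items()
    }
    vocab = set(counts["spoken"]) | set(counts["written"])
    total_docs = len(spoken) + len(written)
    priors = {
        label: math.log(len(sents) / total_docs) for label, sents in docs.items()
    }
    return counts, vocab, priors


def log_score(sentence, label, model):
    """Log prior plus summed log likelihoods with add-one smoothing."""
    counts, vocab, priors = model
    total = sum(counts[label].values())
    score = priors[label]
    for tok in sentence.split():
        score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
    return score


def is_spoken_like(sentence, model):
    """Keep a sentence if the 'spoken' class scores higher than 'written'."""
    return log_score(sentence, "spoken", model) > log_score(sentence, "written", model)


def insert_fillers(sentence, fillers, prob, rng):
    """Insert a random filler before each token with probability `prob`.

    A stand-in for the paper's simulation models, which are not described
    in the abstract.
    """
    out = []
    for tok in sentence.split():
        if rng.random() < prob:
            out.append(rng.choice(fillers))
        out.append(tok)
    return " ".join(out)


# Toy training data (invented for this sketch).
spoken = ["well i think it is kind of fun", "you know i really like it"]
written = [
    "the results demonstrate significant improvement",
    "this report describes the proposed method",
]
model = train_naive_bayes(spoken, written)

web_sentences = ["i think i like it", "the report describes results"]
selected = [s for s in web_sentences if is_spoken_like(s, model)]
rng = random.Random(0)
generated = [insert_fillers(s, ["uh", "um"], 0.2, rng) for s in selected]
```

In this sketch, `selected` keeps only the sentences the classifier scores as spoken-like, and `generated` is the filler-augmented text that would then serve as language model training data.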

Original language: English
Pages (from-to): 1465-1468
Number of pages: 4
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2011
Event: 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011 - Florence, Italy
Duration: 27 Aug 2011 to 31 Aug 2011

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
