Exploring rich expressive information from audiobook data using cluster adaptive training

Langzhou Chen, Mark J.F. Gales, Vincent Wan, Javier Latorre, Masami Akamine

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

28 Citations (Scopus)

Abstract

Audiobook data is a freely available source of rich expressive speech. To accurately generate speech of this form, expressiveness must be incorporated into the synthesis system. This paper investigates two parts of this process: how expressive information is represented in a statistical parametric speech synthesis system, and whether discrete expressive state labels can sufficiently represent the full diversity of expressive speech. Initially, a discrete form of expressive information was used. A new form of expressive representation is described, in which each condition maps to a point in an expressive speech space. This cluster adaptive training (CAT) system is compared to incorporating the expressive information into decision tree construction, and to a transform-based system using CMLLR and CSMAPLR. Experimental results indicate that the CAT system outperformed the contrast systems in both expressiveness and voice quality. The CAT-style representation yields a continuous expressive speech space, so utterance-level expressiveness can be treated as a point in this continuous space rather than as one of a set of discrete states. This continuous-space representation outperformed discrete clusters, indicating the limitations of discrete labels for expressiveness in audiobook data.
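The core idea behind the CAT representation described in the abstract can be sketched in a few lines: each Gaussian mean is a weighted combination of cluster means, and an expressive condition corresponds to a point (weight vector) in the resulting continuous space. The function name, toy cluster means, and weight values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cat_mean(cluster_means, weights):
    """Combine P cluster means (a P x D array) with a P-dim weight vector.

    In CAT, the synthesized mean for a component is the weighted sum of
    the cluster means; the weight vector is the point in the expressive
    speech space that represents the current expressive condition.
    """
    cluster_means = np.asarray(cluster_means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ cluster_means  # D-dim interpolated mean

# Two hypothetical expressive clusters in a 3-dim feature space.
M = [[0.0, 1.0, 2.0],
     [2.0, 1.0, 0.0]]

# Discrete expressive states correspond to fixed weight points ...
state_a = cat_mean(M, [1.0, 0.0])
state_b = cat_mean(M, [0.0, 1.0])

# ... while the continuous representation can use any intermediate point,
# which is what lets utterance-level expressiveness vary smoothly.
blended = cat_mean(M, [0.5, 0.5])
```

This is only a sketch of the interpolation step; the actual system also trains the cluster parameters and estimates per-utterance weights under an HMM framework.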

Original language: English
Title of host publication: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
Pages: 958-961
Number of pages: 4
Publication status: Published - 2012 Dec 1
Externally published: Yes
Event: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012 - Portland, OR, United States
Duration: 2012 Sep 9 - 2012 Sep 13

Publication series

Name: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
Volume: 2

Conference

Conference: 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012
Country: United States
City: Portland, OR
Period: 12/9/9 - 12/9/13

Keywords

  • Audiobook
  • Cluster adaptive training
  • Expressive speech synthesis
  • Hidden Markov model

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Communication


Cite this

Chen, L., Gales, M. J. F., Wan, V., Latorre, J., & Akamine, M. (2012). Exploring rich expressive information from audiobook data using cluster adaptive training. In 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012 (pp. 958-961). (13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012; Vol. 2).