Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification?

Javier Latorre, Mark J.F. Gales, Sabine Buchholz, Kate Knill, Masatsune Tamura, Yamato Ohtani, Masami Akamine

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

18 Citations (Scopus)

Abstract

Most HMM-based TTS systems use a hard voiced/unvoiced classification to produce a discontinuous F0 signal, which is used to generate the source excitation. When a mixed source excitation is used, this decision can be based on two different sources of information: the state-specific MSD-prior of the F0 models, and/or the frame-specific features generated by the aperiodicity model. This paper examines the meaning of these variables in the synthesis process, their interaction, and how they affect the perceived quality of the generated speech. The results of several perceptual experiments show that when using mixed excitation, subjects consistently prefer samples with very few or no false unvoiced errors, whereas a reduction in the rate of false voiced errors does not produce any perceptual improvement. This suggests that rather than using any form of hard voiced/unvoiced classification, e.g., the MSD-prior, it is better for synthesis to use a continuous F0 signal and rely on the frame-level soft voiced/unvoiced decision of the aperiodicity model.
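
The contrast between a hard voiced/unvoiced decision and a soft, aperiodicity-driven one can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single frequency band (the paper concerns multi-band mixed excitation), a simple pulse-train-plus-Gaussian-noise source, and hypothetical names and defaults (`mixed_excitation`, `sample_rate`, `frame_len`). With `voiced=None` the F0 is treated as continuous and the aperiodicity value alone sets the pulse/noise balance; passing `voiced` flags emulates a hard decision that discards the periodic component in frames marked unvoiced.

```python
import math
import random


def mixed_excitation(f0_hz, aperiodicity, voiced=None,
                     sample_rate=16000, frame_len=80, seed=0):
    """Single-band mixed excitation, generated frame by frame (illustrative only).

    f0_hz:        continuous F0 per frame in Hz (defined for every frame, > 0)
    aperiodicity: per-frame noise proportion in [0, 1] (1 = fully noisy)
    voiced:       optional per-frame hard V/UV flags; None = no hard decision
    """
    rng = random.Random(seed)
    samples = []
    phase = 0.0  # running phase in cycles; a pulse is emitted at each wrap
    for i, (f0, ap) in enumerate(zip(f0_hz, aperiodicity)):
        hard_unvoiced = voiced is not None and not voiced[i]
        # A hard "unvoiced" label removes the periodic part entirely;
        # otherwise the aperiodicity value sets the mix softly.
        periodic_gain = 0.0 if hard_unvoiced else math.sqrt(1.0 - ap)
        noise_gain = 1.0 if hard_unvoiced else math.sqrt(ap)
        for _ in range(frame_len):
            phase += f0 / sample_rate
            pulse = 0.0
            if phase >= 1.0:
                phase -= 1.0
                # Scale so the pulse train has roughly unit power.
                pulse = math.sqrt(sample_rate / f0)
            noise = rng.gauss(0.0, 1.0)
            samples.append(periodic_gain * pulse + noise_gain * noise)
    return samples


# Soft decision: continuous F0, aperiodicity decides how noisy each frame is.
soft = mixed_excitation([120.0] * 4, [0.1, 0.2, 0.9, 1.0])

# Hard decision: a frame wrongly flagged unvoiced loses all periodicity
# (a "false unvoiced" error), whatever its aperiodicity value says.
hard = mixed_excitation([120.0] * 4, [0.1, 0.2, 0.9, 1.0],
                        voiced=[True, False, False, False])
```

In this sketch a frame with aperiodicity close to 1 is already almost pure noise under the soft scheme, so a separate hard unvoiced flag adds little in that case, while a wrongly set flag removes periodic energy that the aperiodicity value would have kept.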

Original language: English
Title of host publication: 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings
Pages: 4724-4727
Number of pages: 4
DOIs: https://doi.org/10.1109/ICASSP.2011.5947410
Publication status: Published - 2011 Aug 18
Externally published: Yes
Event: 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Prague, Czech Republic
Duration: 2011 May 22 to 2011 May 27

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011
Country: Czech Republic
City: Prague
Period: 11/5/22 to 11/5/27

Keywords

  • Continuous F0
  • HMM-based synthesis
  • aperiodicity
  • multi-band mixed excitation
  • voiced/unvoiced decision

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering


  • Cite this

    Latorre, J., Gales, M. J. F., Buchholz, S., Knill, K., Tamura, M., Ohtani, Y., & Akamine, M. (2011). Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification? In 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings (pp. 4724-4727). [5947410] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). https://doi.org/10.1109/ICASSP.2011.5947410