Two-stage sequence-to-sequence neural voice conversion with low-to-high definition spectrogram mapping

Sou Miyamoto, Takashi Nose, Kazuyuki Hiroshiba, Yuri Odagiri, Akinori Ito

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

In this study, we propose a two-stage voice conversion technique realized with two models, U-Net and pix2pix. In the first stage, U-Net converts low-dimensional features while taking the time direction into account, with the aim of reproducing the intonation of the target speaker. In the second stage, pix2pix performs spectrogram enhancement: it is trained to map a low-definition spectrogram to a high-definition spectrogram (low-to-high spectrogram mapping). The low-definition spectrogram is reconstructed from the low-dimensional mel-cepstrum converted by U-Net, and the high-definition spectrogram is extracted from natural speech. Objective evaluations showed that the proposed method improved mel-cepstral distance (MCD) and log F0 RMSE. Subjective evaluations showed that the proposed method improved speech individuality to some extent while maintaining the same level of naturalness as the conventional method.
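The two objective measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in the paper, so the code below assumes the commonly used definitions: MCD in dB averaged over frames with the 0th (energy) coefficient excluded, and log F0 RMSE computed only over frames that are voiced in both sequences. The function names, and the assumption that the two sequences are already time-aligned, are illustrative and not taken from the paper.

```python
import math


def mel_cepstral_distance(ref, conv):
    """Mean MCD in dB over time-aligned frames.

    ref, conv: lists of frames, each frame a list of mel-cepstral
    coefficients. The 0th coefficient is skipped (assumed convention).
    """
    if len(ref) != len(conv) or not ref:
        raise ValueError("sequences must be non-empty and frame-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, c in zip(ref, conv):
        # Squared Euclidean distance over coefficients 1..D
        sq = sum((a - b) ** 2 for a, b in zip(r[1:], c[1:]))
        total += scale * math.sqrt(2.0 * sq)
    return total / len(ref)


def log_f0_rmse(ref_f0, conv_f0):
    """RMSE of log F0 over frames where both sequences are voiced (F0 > 0)."""
    diffs = [
        math.log(a) - math.log(b)
        for a, b in zip(ref_f0, conv_f0)
        if a > 0 and b > 0
    ]
    if not diffs:
        raise ValueError("no frames voiced in both sequences")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

For both measures a lower value indicates that the converted speech is closer to the target; identical inputs give 0 in each case.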

Original language: English
Title of host publication: Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing - Proceeding of the Fourteenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing
Editors: Lakhmi C. Jain, Pei-Wei Tsai, Akinori Ito, Jeng-Shyang Pan
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 132-139
Number of pages: 8
ISBN (Print): 9783030037475
DOIs
Publication status: Published - 2019
Event: 14th International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIH-MSP 2018 - Sendai, Japan
Duration: 2018 Nov 26 to 2018 Nov 28

Publication series

Name: Smart Innovation, Systems and Technologies
Volume: 110
ISSN (Print): 2190-3018
ISSN (Electronic): 2190-3026

Other

Other: 14th International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIH-MSP 2018
Country: Japan
City: Sendai
Period: 18/11/26 to 18/11/28

Keywords

  • CNN
  • DNN-based voice conversion
  • Pix2pix
  • Two-stage conversion
  • U-Net

ASJC Scopus subject areas

  • Decision Sciences(all)
  • Computer Science(all)
