Thesis defences

On Zero-Shot Multi-Speaker Text-to-Speech Using Deep Learning


Date & time
Wednesday, August 16, 2023
4 p.m. – 5:30 p.m.
Speaker(s)

Pradnya Kandarkar

Cost

This event is free

Organization

Department of Computer Science and Software Engineering

Contact

Mirco Ravanelli

Where

ER Building
2155 Guy St.
Room ZOOM

Wheelchair accessible

Yes

Abstract

This thesis explores various aspects of zero-shot multi-speaker text-to-speech (TTS) synthesis using deep learning to build an effective system. A deep learning model for zero-shot multi-speaker TTS takes text and a speaker identity as input and generates the corresponding speech without fine-tuning for speakers unseen during training. The experiments consider a system with three main components: a speaker encoder network, a mel-spectrogram prediction network, and a vocoder network. The speaker encoder network captures the speaker identity in a fixed-size speaker embedding. This embedding is injected into the mel-spectrogram prediction network at one or more locations to generate a mel-spectrogram conditioned on both the text and the speaker embedding. Finally, the vocoder network converts the mel-spectrogram into a waveform. All three components are trained separately. The speech synthesis aspects explored in the experiments include the speaker embedding injection method, the speaker encoder network, the speaker embedding injection location, and the mel-spectrogram prediction network. The FiLM (Feature-wise Linear Modulation) method from the visual reasoning field is adapted for the first time to inject speaker embeddings into the TTS workflow and is compared against traditional methods. The significance of speaker embeddings is highlighted by comparing two well-established speaker embedding models. New combinations of speaker embedding injection locations are explored for two mel-spectrogram prediction networks. The best-performing model generates speech with naturalness ranging from fair to good, exhibits more than moderate speaker similarity, and shows potential for further improvement. Additionally, the zero-shot multi-speaker TTS system is extended to generate fictitious voices.
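To make the FiLM-style speaker conditioning concrete, the sketch below shows how a feature-wise linear modulation layer can scale and shift the hidden features of a mel-spectrogram prediction network using parameters predicted from a speaker embedding. This is a minimal PyTorch illustration, not the thesis's actual implementation: the class name, the 512-dimensional hidden state, the 256-dimensional speaker embedding, and the injection point are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation (FiLM): modulates hidden features
    with a per-channel scale (gamma) and shift (beta) predicted from a
    conditioning vector -- here, a speaker embedding."""

    def __init__(self, speaker_dim: int, feature_dim: int):
        super().__init__()
        # Two linear projections map the speaker embedding to gamma and beta.
        self.to_gamma = nn.Linear(speaker_dim, feature_dim)
        self.to_beta = nn.Linear(speaker_dim, feature_dim)

    def forward(self, features: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim); speaker_emb: (batch, speaker_dim)
        gamma = self.to_gamma(speaker_emb).unsqueeze(1)  # (batch, 1, feature_dim)
        beta = self.to_beta(speaker_emb).unsqueeze(1)    # broadcast over time
        return gamma * features + beta

# Hypothetical usage: inject a 256-dim speaker embedding into 512-dim
# hidden states of a mel-spectrogram prediction network.
film = FiLMLayer(speaker_dim=256, feature_dim=512)
hidden = torch.randn(2, 100, 512)   # e.g. text-encoder outputs over 100 frames
spk = torch.randn(2, 256)           # fixed-size embedding from the speaker encoder
out = film(hidden, spk)             # same shape as hidden
print(out.shape)                    # torch.Size([2, 100, 512])
```

Because the layer only needs a hidden-feature tensor and a conditioning vector, the same mechanism can in principle be applied at any of the injection locations the abstract mentions, which is what makes FiLM a natural alternative to concatenation- or addition-based speaker conditioning.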

