This week’s Deep Learning Paper Review is Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
What’s Exciting About this Paper
This paper demonstrates that 5 seconds of audio from speakers unseen in the training set is enough to generate a high-quality voice clone. Previous State-of-the-Art (SOTA) models needed tens of minutes.
The researchers decoupled the speaker encoder from the TTS (Text-to-Speech) network, which relaxes the data requirements for each component and enables zero-shot voice cloning. Older TTS pipelines are typically end-to-end, require high-quality labeled speaker-audio data, and cannot generalize well to speaker voices not seen in training.
By training on a large corpus of untranscribed audio on a speaker verification task, the speaker encoder network learns to generate fixed-dimensional speaker embedding vectors that capture the characteristics of a speaker’s voice, abstracted away from the content of the audio.
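To make the idea concrete, here is a minimal sketch of what "fixed-dimensional speaker embedding" means in practice. This is not the paper's actual LSTM-based encoder; it is a toy stand-in (the `speaker_encoder` function, its mean-pooling step, and the `weights` projection are all hypothetical) that shows the key property: clips of any length map to a unit-norm vector of one fixed size.

```python
import numpy as np

def speaker_encoder(mel_frames: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Toy stand-in for a speaker encoder.

    mel_frames: (n_frames, n_mels) log-mel features from an utterance of
                arbitrary length.
    weights:    (n_mels, embed_dim) hypothetical learned projection.

    Returns a fixed-dimensional, L2-normalized speaker embedding, mirroring
    the paper's use of unit-norm embeddings for speaker verification.
    """
    pooled = mel_frames.mean(axis=0)            # pool over time -> (n_mels,)
    embedding = pooled @ weights                # project -> (embed_dim,)
    return embedding / np.linalg.norm(embedding)  # unit norm
```

Because the output is always the same shape regardless of input length, two utterances can be compared directly, e.g. by the dot product of their embeddings, which is how a verification-style similarity score is computed between unit-norm vectors.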
This speaker embedding is then concatenated with the input text encodings and fed into an otherwise standard TTS pipeline, which converts them into log-mel spectrograms before a final vocoder network transforms the spectrograms into waveforms.
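The conditioning step can be sketched as follows. This is a simplified assumption about the mechanics (the function name and dimensions are illustrative, not from the paper): the fixed speaker embedding is broadcast across every timestep of the text-encoder output and concatenated feature-wise, so the downstream spectrogram decoder sees the speaker identity at each step.

```python
import numpy as np

def condition_text_encodings(text_enc: np.ndarray, spk_emb: np.ndarray) -> np.ndarray:
    """Broadcast a speaker embedding across text-encoder timesteps.

    text_enc: (T, text_dim) per-timestep text encodings.
    spk_emb:  (embed_dim,) fixed speaker embedding.

    Returns (T, text_dim + embed_dim): the same embedding tiled onto
    every timestep, then concatenated along the feature axis.
    """
    T = text_enc.shape[0]
    tiled = np.tile(spk_emb, (T, 1))                   # (T, embed_dim)
    return np.concatenate([text_enc, tiled], axis=1)   # (T, text_dim + embed_dim)
```

Swapping in a different speaker's embedding at this point, while keeping the text encodings fixed, is what lets the same pipeline render the same sentence in different voices.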
Previous end-to-end pipelines required audio labeled with both speaker identity and transcriptions to train. By splitting up the speaker encoder network and the TTS pipeline, the speaker encoder requires only untranscribed audio and the TTS pipeline requires only transcribed audio (without speaker information), both of which are significantly more abundant than fully labeled data.
By tweaking this pipeline, for example by adding fictitious speaker embeddings, generating random input text, and augmenting the audio, could this approach be used to generate unlimited high-quality labeled data?