
Recent developments in Generative AI for Audio

The spotlight has been on language and images for Generative AI, but there's been a lot of recent progress in the audio domain. Learn everything you need to know about generative audio models in this article.


Over the past decade, we've witnessed significant advancements in AI-powered audio generation techniques, including music and speech synthesis. Until very recently, however, these improvements were still far from the outstanding progress observed in image and text generation.

This trend has recently begun to shift. With various foundational ideas from large language models and text-to-image generation being adapted and incorporated into the audio modality, the latest AI-powered audio-generative systems are reaching a new unprecedented level of quality.

Listen to the following short audio clip:

Guitar solo

This was generated in a handful of seconds by Google’s audio-generative model MusicLM. The input text prompt was “guitar solo,” together with the following melodic prompt:

Bella Ciao - Humming

Last year’s release of user-friendly interfaces for models like DALL-E 2 or Stable Diffusion for image generation, and ChatGPT for text generation, drew the world’s attention to the new wave of Generative AI. With some first steps in this direction in the past weeks (Google’s AI Test Kitchen and Meta open-sourcing its music generator), some experts are now expecting a “GPT moment” for AI-powered music generation this year.

In speech synthesis, Meta has just announced Voicebox, a new state-of-the-art model. Fully open-sourced alternatives are already available, and there’s noticeable momentum among emerging companies that are concentrating on voice generation technologies.

But how do these new audio-generative models work? And how exactly are text-based or image-based techniques being used for generating novel audio data?

In this article, we give an overview of some exciting new advances in the space of Generative AI for audio that have all happened in the past few months, explaining where the key ideas come from and how they come together to bring audio generation to a new level. This blog post is part of a series on generative AI. If you have not read the introduction article, feel free to take a minute to do that now.

Why is audio generation hard?

Audio signals represent one of the most multifaceted data types that we regularly encounter, encompassing everything from the spoken word to musical compositions, and the ambient noise of our environments.

Audio signals can be represented as waveforms, possessing specific characteristics such as frequency, amplitude, and phase, whose different combinations can encode various types of information like pitch and loudness in sound. This wave-based representation is foundational for traditional audio signal processing but also demands a certain level of complexity in the mathematical and computational tools historically applied to directly deal with waveforms.
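To make the waveform picture concrete, here is a minimal NumPy sketch that builds a sampled signal from frequency, amplitude, and phase. The sample rate and the two example tones are illustrative assumptions, not values taken from any model discussed in this article.

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second (assumed value for illustration)

def sine_wave(frequency_hz, amplitude, phase_rad, duration_s, sample_rate=SAMPLE_RATE):
    """Return a sampled sine wave as a 1-D array of amplitudes over time."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return amplitude * np.sin(2 * np.pi * frequency_hz * t + phase_rad)

# A 440 Hz tone plus a quieter tone one octave higher, shifted in phase.
low = sine_wave(440.0, amplitude=0.8, phase_rad=0.0, duration_s=1.0)
high = sine_wave(880.0, amplitude=0.2, phase_rad=np.pi / 2, duration_s=1.0)
mixture = low + high  # waveforms combine sample by sample

print(mixture.shape)  # one amplitude value per sample
```

Frequency controls the perceived pitch and amplitude the loudness, while the sum of the two tones shows how even a simple sound is already a superposition of components.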

An audio waveform is a visual representation of the changes in air pressure, or amplitude (y-axis), over time (x-axis) as sound waves travel through a medium like air or water.

Deep Neural Networks (DNNs) have proven to be exceptionally adept at processing highly complicated modalities like these, so it is unsurprising that they have revolutionized the way we approach audio data modeling. Today, we can train deep learning algorithms that can automatically extract and represent information contained in audio signals, if trained with enough data. This shift has led to dramatic improvements in speech recognition and several other applications of discriminative AI.

Traditional machine learning feature-based pipeline vs. end-to-end deep learning approach (source).

On the other hand, generative AI for speech and music data presents some important challenges. One of them is due to the multiple scales of abstraction of the information contained in these signals, which one would ultimately want to be able to edit and control with a generative model.

Speech audio signals, for example, are dense with local features, at the level of specific phonetic or acoustic elements, as well as a broader type of information, encompassing aspects like prosody (the rhythm and intonation of speech), emotional intonation, and speaker intentions.

Music signals share this need for multi-level interpretation. While individual notes and their immediate arrangement often form an important part of a musical composition, the overall structure and flow of a piece – its melody, harmony, and rhythm – constitute a separate, and essential, layer of information.

Creating synthetic audio that convincingly mirrors real-world sounds requires systems that can:

  • “Understand” (encode) all these types of information uniformly, i.e. in a way suitable to an end-to-end approach.
  • Map a prompt, i.e. a description of the desired audio qualities and content, to a generated waveform output. Note that the prompt itself could be given as free text, as audio, or both.

One could say that similar requirements also hold for text and image generation, so why should there be a stronger barrier for generative audio models?

LLMs have progressed by aggressively scaling the transformer architecture, and image generators through the invention of diffusion. Audio models, by contrast, have so far had neither an architectural breakthrough akin to diffusion models nor a preeminent architecture that lends itself to massive scaling.

The main difference lies in the amount of available high-quality data for training: while image and text data are abundant, audio data is comparatively scarce or expensive. The much higher information density of audio signals and the unique difficulties of model evaluation (especially for music generation systems) widen the gap even further.

In the rest of this article, we give an overview of how some ideas borrowed from image and text generative models have been adapted to the audio modality to solve or improve a number of issues specific to audio-generative systems.

How Generative AI for Text and Image Influenced Audio Generation Models

Text-to-audio AI models used to depend almost entirely on acoustic models, which transform text into mel-spectrograms – visual representations of varying frequencies in a sound over time – and also relied on so-called vocoder models that convert mel-spectrograms into audible sound.

Schematic pipeline of previous-generation text-to-speech models with two components: first, an acoustic model converts input text into a mel-spectrogram. The mel-spectrogram is an image representation of an audio signal which at each time step (x-axis) displays the audio pitch frequencies (y-axis) at their different intensities, or amplitudes (color gradient). Then, a vocoder model converts the spectrogram into a waveform.
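As a rough illustration of what a mel-spectrogram contains, the following NumPy sketch computes one from a waveform: short overlapping frames are transformed with an FFT and the resulting power spectrum is pooled into triangular mel-scale bands. The frame size, hop length, number of bands, and the common `2595 * log10(1 + f / 700)` mel formula are assumptions chosen for the example; production systems typically rely on a dedicated audio library and differ in many details.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i : i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_spectrogram(wave, sample_rate, n_fft=1024, hop=256, n_mels=40):
    """Return an (n_mels, n_frames) array of mel-band energies over time."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    return mel_filterbank(n_mels, n_fft, sample_rate) @ power.T

# Example input: one second of a 440 Hz tone.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
spec = mel_spectrogram(tone, sample_rate)
print(spec.shape)  # (mel bands, time frames)
```

For the pure tone above, almost all energy lands in the mel band centered near 440 Hz in every frame. In the two-stage pipeline described here, an acoustic model would predict such an array from text, and a vocoder would turn it back into a waveform.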

While these models have demonstrated some degree of effectiveness, they require high-quality audio data for training, which is both scarce and costly. As such, scaling these systems has proven to be a challenging task.

This approach also suffers from limited generalization capabilities. For example, it often necessitates fine-tuning a system using extensive data from a new speaker to reproduce their voice accurately.

An essential approach employed by large language models is the abstraction of input data to encode various forms of "meaning" like syntax and semantics. Word embeddings serve as a basic example. Some of the latest advances in this type of representation-based learning, together with smart uses of sequence-modeling capabilities enabled by transformer models, have been applied to the audio domain in interesting ways.

The next section covers the first important examples, with tokenization, quantization, and vectorization techniques being applied to learning discrete representations of audio features.

Audio Tokenization and Audio Language Models

Autoregressive transformers operate with a primary objective: predicting the next symbol, or token, in a sequence. Tokens form part of a fixed 'vocabulary' (a list of symbols) that the transformer can access.

Note how this process is abstract in that a transformer can handle diverse types of symbols not limited to human language. For instance, in the field of bioinformatics transformers have been used for learning and predicting protein sequences, treating amino acids as tokens.

In the context of a text-based language model, tokens often represent a word or sub-word unit, and the model's task is to predict the next token in a text sequence, a process known as next token prediction. The transformation of words into tokens, or tokenization, is part of this process.

But why use tokens instead of whole words?

A tokenizer may split the word multidisciplinary into three tokens: multi, -disciplin, and -ary. Different tokenization strategies exist depending on the specific model and its training data.

Employing tokens, it turns out, paves the way for a manageable and flexible vocabulary, improved language comprehension, and computational efficiency. Reducing the size of the vocabulary greatly improves a model’s accuracy on next token predictions. Manipulating tokens instead of words also enhances the model's ability to recognize rare terms and shared patterns across related words, enabling faster processing of extensive text volumes.
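The splitting of a word into sub-word tokens can be sketched with a toy greedy longest-match tokenizer. The vocabulary below is hypothetical and hand-picked for the example; real tokenizers learn their vocabulary from data (for instance with byte-pair encoding) and use more refined matching rules.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into sub-word tokens from `vocab`."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

# Hypothetical toy vocabulary, not taken from any real model.
TOY_VOCAB = {"multi", "inter", "disciplin", "ary", "nation", "al"}

print(tokenize("multidisciplinary", TOY_VOCAB))  # ['multi', 'disciplin', 'ary']
print(tokenize("interdisciplinary", TOY_VOCAB))  # ['inter', 'disciplin', 'ary']
```

Six vocabulary entries are enough to cover several related words here, which is the sense in which sub-word tokens keep the vocabulary small while still sharing patterns across words.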

One of the advantages of tokenization is that it substantially reduces the size of the vocabulary, improving a model’s efficiency and processing time.

This same principle of sequence continuation, which is central to language models, can be applied to audio data. The idea of audio continuation posits that an input audio waveform could be treated as a sequence of 'audio tokens'. A model could then be trained to generate an audio continuation that aligns with the characteristics of the input.

The idea of audio continuation is, however, charged with new technical challenges and questions: What is the equivalent of a token in an audio waveform? Can we train an AI model to tokenize any type of audio data?

Addressing these questions requires a rather sophisticated approach, primarily because realistic audio generation necessitates modeling information at various scales. An audio sequence generally contains:

  • Semantic aspects: for example, the consistency of speech, or the melodic/harmonic consistency in music.
  • Acoustic aspects: for example, the unique tone of a voice, or the timbre of a musical instrument.

An effective model must properly encode both of these information types and, at generation time, must be able to ‘understand’ the qualitative difference between them.

There’s more that adds to the complexity: the data rate of audio sequences is significantly higher than that of text sequences. Recorded human speech, even after compression, requires orders of magnitude more storage space than its corresponding text sequence.

A back-of-the-envelope calculation

On average, 1 second of human speech amounts to about 3 tokens in English. Assuming 5 characters per token and 8 bits/character, this translates to 3 tokens × 5 characters/token × 8 bits/character = 120 bits for an average second of spoken audio. Even assuming that the input audio is compressed at 6 kbit/s (6,000 bits per second), a spoken audio sequence requires roughly 6000/120 = 50 times more storage than a text sequence.
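The same estimate, written out as a few lines of Python so the assumptions are easy to change. All constants are the rough figures quoted above, not measurements.

```python
TOKENS_PER_SECOND = 3      # average speaking rate in tokens (rough estimate)
CHARS_PER_TOKEN = 5        # assumed average token length
BITS_PER_CHAR = 8          # one byte per character
AUDIO_BITS_PER_SECOND = 6_000  # speech compressed at 6 kbit/s

text_bits_per_second = TOKENS_PER_SECOND * CHARS_PER_TOKEN * BITS_PER_CHAR
ratio = AUDIO_BITS_PER_SECOND / text_bits_per_second

print(text_bits_per_second)  # 120
print(ratio)                 # 50.0
```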

Consequently, regardless of the chosen tokenization strategy, audio data yields much longer sequences of tokens than text inputs. A generative audio model must then be able to deal with long-range dependencies, the relationships between distant elements within an audio sequence.

These dependencies are crucial to capturing recurring or evolving patterns over time and can be of both semantic and acoustic type. In music, for example, the very existence of a musical motif or a rhythmic pattern evolving throughout a piece is often what we as humans seem to enjoy.

All LLM-based systems, however, notoriously struggle with (very) long-range dependencies, and a great deal of engineering effort is dedicated to finding new ways to extend their “context windows”.

The figure illustrates different strategies to improve the attention mechanism, the component of the transformer model responsible for modeling dependencies (by computing a score) between different tokens in a sequence. In standard attention, the number of scores grows quadratically with the sequence length, so these computations quickly become expensive for longer sequences (source).
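To see why long sequences are costly, here is a minimal NumPy sketch of standard (full) scaled dot-product self-attention scores. It is a simplified illustration with random inputs and an assumed embedding size of 16, without the learned projections, multiple heads, or masking of a real transformer.

```python
import numpy as np

def attention_scores(queries, keys):
    """Scaled dot-product attention weights: one score per (query, key) pair."""
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
for n_tokens in (64, 128, 256):
    x = rng.normal(size=(n_tokens, 16))  # random stand-ins for token embeddings
    scores = attention_scores(x, x)
    # The score matrix has n_tokens * n_tokens entries.
    print(n_tokens, scores.shape, scores.size)
```

Doubling the number of tokens quadruples the number of scores, which is why audio sequences, with their much higher token rates, put so much pressure on the attention mechanism.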

One pioneering attempt at creating an audio model that handles these challenges is Google's AudioLM. The approach focuses on training a system capable of performing audio continuation: given a brief audio sequence, the model generates an output that is coherent with its preceding context.

In essence, AudioLM puts together two components, each dedicated to performing distinct tasks:

Audio Embedding - A first component, w2v-BERT, is a BERT-type language model trained on speech data that generates semantic tokens. These tokens capture both local dependencies (such as phonetics in speech, or local melody in music) and global long-term structure (such as harmony and rhythm in music). This model is directly inspired by word embeddings and relies on the Conformer architecture used in the speech recognition domain.

Audio Quantization - A second component, based on SoundStream, generates acoustic tokens which capture the details of the audio waveform (such as speaker characteristics or recording conditions) and allow for high-quality synthesis.

AudioLM’s two main components w2v-BERT and SoundStream are used to represent semantic and acoustic information in audio data (source).
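The idea behind acoustic tokens can be sketched with a toy residual vector quantizer: each stage picks the codebook entry closest to what the previous stages left unexplained, and the chosen indices are the tokens. This is a simplified illustration, not SoundStream's implementation. In particular, the codebooks here are random rather than learned, and the zero vector in each codebook is a convenience added for the toy example so that a stage can never make the approximation worse.

```python
import numpy as np

def make_codebooks(n_stages, codebook_size, dim, rng):
    """Random toy codebooks whose scale shrinks per stage; entry 0 is the zero vector."""
    books = []
    for stage in range(n_stages):
        book = rng.normal(scale=0.5 ** stage, size=(codebook_size, dim))
        book[0] = 0.0  # toy-example trick: a stage may choose to add nothing
        books.append(book)
    return books

def rvq_encode(x, codebooks):
    """Return one token (codebook index) per stage for the vector x."""
    residual, tokens = x.copy(), []
    for book in codebooks:
        idx = int(np.argmin(np.linalg.norm(book - residual, axis=1)))
        tokens.append(idx)
        residual = residual - book[idx]  # pass on what is still unexplained
    return tokens

def rvq_decode(tokens, codebooks):
    """Approximate x by summing the code vectors selected at each stage."""
    return sum(book[idx] for idx, book in zip(tokens, codebooks))

rng = np.random.default_rng(0)
codebooks = make_codebooks(n_stages=4, codebook_size=256, dim=8, rng=rng)
frame = rng.normal(size=8)  # stand-in for one encoded audio frame
tokens = rvq_encode(frame, codebooks)
error = np.linalg.norm(frame - rvq_decode(tokens, codebooks))
print(tokens, error)
```

Each audio frame thus becomes a short list of integers, one per stage, and using more stages trades a longer token sequence for a finer reconstruction.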

To learn more on audio quantization and how SoundStream and EnCodec work, we recommend our dedicated article:

What is Residual Vector Quantization?

As we are going to see in the following, AudioLM’s approach has had a substantial impact on many text-to-speech and text-to-music models that have been developed throughout the last few months.

Text and Melody-Conditioning in Music Generators

Image generation methods employ text-conditioning to steer the creation of images based on textual prompts. This technique entails training the model to learn joint embeddings of text and images, capturing the relationships between words (e.g., in the form of an image caption) and image attributes.

During image synthesis (generation), the model integrates cues from the textual prompt with image encodings, thereby ensuring that the generated visuals align semantically with the given text. The following figure illustrates how this is implemented in diffusion models:

Text-conditioning within the image encodings in Imagen.

What about joint embeddings of text and music?

Describing music with words appears to be a somewhat more subtle matter. While certain musical patterns, progressions, and rhythms may evoke emotions and can be described with words like "uplifting," "high-energy," or "reflective," there is an inherent one-to-many relationship between such descriptions and the possible musical outcomes (arguably even stronger than in the image case). In addition, paired natural-language descriptions of music and corresponding music recordings are extremely scarce. How, then, to effectively train a text-music joint embedding?

MuLan, a groundbreaking approach developed by Google, is a transformer-based model trained on an extensive dataset consisting of soundtracks from 44 million online music videos alongside their text descriptions. It generates embeddings for the text prompt and a spectrogram of the target audio. By embedding both music and text into the same representation, MuLan aims to create a flexible interface through which musical concepts can be linked to related music audio, and vice versa.

A video illustrating how joint embeddings of text and images work. The case of text-audio joint embeddings by models like MuLan is entirely analogous. Once trained, MuLan can either take a piece of music as input and link it to matching textual descriptions and attributes, or take a textual description as input and output a representation of the musical elements that align with the text.
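Once text and audio live in the same embedding space, linking them reduces to a similarity search. The sketch below uses cosine similarity over tiny hand-written vectors. The 3-dimensional embeddings, prompts, and clip names are all hypothetical placeholders, whereas a real model would produce high-dimensional embeddings from a trained text encoder and audio encoder.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical embeddings in a shared 3-D space (made up for illustration).
text_labels = ["calm piano", "fast drums"]
text_emb = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.2, 0.9]])
clip_labels = ["clip_a", "clip_b"]
audio_emb = np.array([[0.1, 0.1, 0.8],
                      [0.8, 0.2, 0.1]])

similarity = cosine_similarity_matrix(text_emb, audio_emb)
best_clip = {text_labels[i]: clip_labels[int(j)]
             for i, j in enumerate(similarity.argmax(axis=1))}
print(best_clip)  # {'calm piano': 'clip_b', 'fast drums': 'clip_a'}
```

The same similarity matrix, read column-wise, answers the opposite question of which description best fits a given clip.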

MuLan serves as a core component of MusicLM, Google’s innovative music generator which can create short music clips with not only textual but also melodic conditioning, such as a whistled sketch of a melody.

Let’s now sketch the key features behind MusicLM’s approach.

MusicLM relies on a hierarchical tokenization and generation scheme (source).

MusicLM was trained to regenerate audio clips (30 seconds long, sampled at 24 kHz) from an enormous corpus consisting of 280,000 hours of recorded music. Given an audio sample as input, the main idea behind MusicLM is to model sound across three distinct aspects by using different types of tokens, each generated by a different system:

  • Textual-fidelity: The MuLan component generates a sequence of 12 audio-text tokens per audio clip, designed to represent both music and corresponding descriptions.
  • Long-term coherence (semantic modeling): A second component based on w2v-BERT generates 25 semantic tokens per second that represent features of large-scale composition, such as motifs, or consistency in the timbres. It was pre-trained to generate masked tokens in speech and fine-tuned on 8,200 hours of music.
  • Small-scale details (acoustic modeling): The encoder component of SoundStream generates 600 acoustic tokens per second, capturing fine-grained acoustic details that allow the model to output high-resolution audio.

MusicLM’s generation scheme: During inference, MuLan generates audio-text tokens from an input description. A series of transformers generates semantic tokens given the audio-text tokens. Following this, another series of transformers generates acoustic tokens given the semantic and audio-text tokens. The SoundStream decoder then generates a music clip (source).
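A quick calculation shows what these token rates mean for sequence length. The sketch below only uses the per-second rates for semantic and acoustic tokens and the 30-second clip length quoted above.

```python
CLIP_SECONDS = 30
SEMANTIC_TOKENS_PER_SECOND = 25
ACOUSTIC_TOKENS_PER_SECOND = 600

semantic_tokens = CLIP_SECONDS * SEMANTIC_TOKENS_PER_SECOND
acoustic_tokens = CLIP_SECONDS * ACOUSTIC_TOKENS_PER_SECOND

print(semantic_tokens)                     # 750
print(acoustic_tokens)                     # 18000
print(acoustic_tokens // semantic_tokens)  # 24
```

The semantic sequence is 24 times shorter than the acoustic one, which is what makes it a practical level at which to model long-term structure before filling in the acoustic detail.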

Interestingly, the researchers have added a melody-conditioning component as an integral extension of MusicLM that enables the integration of melodic prompts. This is achieved by allowing the user to provide a melody in the form of humming, singing, whistling, or playing an instrument.

For its implementation, the researchers curated a new dataset consisting of audio pairs with the same melody but different acoustics. This has been realized by using different versions of the same music clip, such as covers, instrumentals, or vocals. This dataset is used to train a joint embedding model, which learns to associate similar melodies.

During music generation, the provided melody is quantized with RVQ into a series of tokens, which are then concatenated with the MuLan tokens. This concatenated sequence is fed into MusicLM, allowing for the generation of music that not only aligns with the textual description but also follows the contour of the input melody.

Voice-Generation and Advances in Text-to-Speech

Recent text-to-speech (TTS) approaches like VALL-E, NaturalSpeech 2, and Voicebox enable controllable, “zero-shot” voice cloning and generation. Zero-shot, in this context, refers to the fact that these models can capture the unique 'voice representation' of previously unseen speakers using only a handful of seconds of input, a significant leap forward from previous TTS approaches.

VALL-E’s pipeline puts together various ideas that we discussed so far and can be summarized as follows:

  1. It takes two kinds of inputs in parallel: a text prompt (content) and an acoustic prompt (voice to be cloned).
  2. The text prompt describes the semantic content of the desired output. It is converted into a sequence of tokens that represent the phoneme structure of the text (discrete phoneme codes).
  3. The acoustic prompt consists of 3 seconds of enrolled audio of the target speaker’s voice. It is converted into discrete acoustic tokens via a neural codec encoder model (EnCodec), encoding the unique vocal characteristics of an individual speaker.
  4. Both sequences of discrete codes are then passed to a transformer-based neural codec language model, which predicts the acoustic tokens of the output speech. The neural codec decoder then maps these tokens into a waveform, which is the final audio output. To improve inference speed, the model combines autoregressive (token-by-token generation) and non-autoregressive (parallel generation of token sequences) strategies.

VALL-E’s architecture (source).
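The difference between the autoregressive and non-autoregressive strategies mentioned in step 4 can be sketched with a stub model that simply counts how often it is called. Everything here is a hypothetical stand-in: `ToyModel` returns random tokens instead of real predictions, and the split used (first codebook level generated token by token, remaining levels generated one level at a time) as well as the sizes `N_CODEBOOKS` and `VOCAB_SIZE` are assumptions made for illustration.

```python
import numpy as np

N_CODEBOOKS = 8    # assumed number of codec quantizer levels
VOCAB_SIZE = 1024  # assumed number of entries per codebook

class ToyModel:
    """Stand-in for a trained network: returns random tokens and counts calls."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.calls = 0

    def predict(self, context, n_tokens):
        # A real model would condition on `context`; this stub ignores it.
        self.calls += 1
        return self.rng.integers(0, VOCAB_SIZE, size=n_tokens)

def generate(n_frames, ar_model, nar_model):
    """Fill an (N_CODEBOOKS, n_frames) grid of acoustic tokens."""
    codes = np.zeros((N_CODEBOOKS, n_frames), dtype=np.int64)
    # Autoregressive stage: first level, one frame at a time.
    for t in range(n_frames):
        codes[0, t] = ar_model.predict(codes[0, :t], n_tokens=1)[0]
    # Non-autoregressive stage: each remaining level in a single call.
    for level in range(1, N_CODEBOOKS):
        codes[level] = nar_model.predict(codes[:level], n_tokens=n_frames)
    return codes

ar_model, nar_model = ToyModel(seed=0), ToyModel(seed=1)
codes = generate(n_frames=75, ar_model=ar_model, nar_model=nar_model)
print(codes.shape, ar_model.calls, nar_model.calls)
```

The autoregressive stage needs one model call per frame, while the non-autoregressive stage needs only one call per level regardless of clip length, which is where the speed-up comes from.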

VALL-E was trained on a large dataset (Libri-Light) of 60,000 hours of English speech from over 7,000 unique speakers, which its authors describe as hundreds of times larger than the datasets used by previous TTS systems.

NaturalSpeech 2 goes one step further than VALL-E, in that it is even able to produce a singing voice starting with only a natural speech sample of that voice. The main difference with VALL-E lies in some specific architectural choices and the way the TTS task is approached, which we can summarize as follows:

  • The audio representation is based on continuous vectors instead of sequences of discrete audio tokens, which can reduce the sequence length and thus mitigate long-range dependency issues.
  • NaturalSpeech 2 replaces the use of an autoregressive language model with a Latent Diffusion Model. “Latent” here means that the diffusion mechanism (iterative denoising) is performed at the level of the vector representation of the generated audio.
  • It incorporates a duration/pitch predictor component which enables the model’s zero-shot singing synthesis without a singing prompt.

The architecture of NaturalSpeech 2 (source).

The latent diffusion model offers the ability to generate samples from continuous vector distributions in a non-autoregressive fashion. In NaturalSpeech 2, the authors highlight the advantage of this approach, which effectively mitigates the issue of error propagation commonly observed in autoregressive models.

In subjective human preference tests, these models' voice cloning abilities showcase exceptional similarity to the original voice. When participants were asked to evaluate the naturalness of the synthesized speech, the results yielded a nearly zero CMOS (comparative mean opinion score, the average of listeners' side-by-side preference ratings) against the ground truth, indicating that listeners rated the synthesized speech as virtually indistinguishable from authentic human recordings.
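For readers unfamiliar with the metric, a comparative mean opinion score is just the average of side-by-side ratings. In the sketch below, each listener compares a synthesized sample against the real recording on a scale from -3 (much worse) to +3 (much better). The ratings are invented for illustration and are not results from any of the papers discussed here.

```python
from statistics import mean

# Hypothetical listener ratings of "synthesized vs. ground truth", each in [-3, 3].
ratings = [0, -1, 1, 0, 0, 1, -1, 0, 0, -1]

cmos = mean(ratings)
print(f"CMOS = {cmos:+.2f}")
```

A value close to zero means that, on average, listeners expressed no preference between the two samples.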

Meta’s latest TTS model Voicebox is a new, highly-efficient model which replaces diffusion with a different, non-autoregressive generative paradigm called flow-matching. Interestingly, Voicebox performs tasks such as voice editing, noise removal, and style transfer without task-specific training. We will go into more detail regarding Voicebox’s approach and architecture in a future blog post. Be sure to follow us on Twitter to get updated on our latest content.

Final Words

Generative AI for audio is making notable progress, with speech-synthesis models generating voices that are nearly indistinguishable from authentic ones and music generators now capable of creating realistic melodies and harmonies based on textual or melodic prompts.

The current development has enormous potential implications in the business of content generation and may eventually introduce completely new ways for users to experience and consume music, podcasts, and audiobooks. The progress in this field is likely to continue, and we will be exploring these developments in future blog posts.
