Deep Learning

What AI Music Generators Can Do (And How They Do It)

Text-to-Music Models are advancing rapidly with the recent release of new platforms for AI-generated music. This guide focuses on MusicLM, MusicGen, and Stable Audio, exploring the technical breakthroughs and challenges in creating music with AI.

Over the past few years, we've witnessed steady advances in AI-powered music generation. Still, these improvements lagged behind the fast-paced progress observed in the image and text domains.

Last year’s emergence of user-friendly interfaces for models like DALL-E 2 and Stable Diffusion for images, and ChatGPT for text, was key to drawing the world’s attention to generative AI. Now, the trajectory of some recent developments suggests a strongly growing focus on AI for music generation, namely:

  • This past May – Google announced an experimental framework powered by their MusicLM model.
  • In August – Meta released a tool for AI-generated audio named AudioCraft and open-sourced all of its underlying models, including MusicGen.
  • Last week – Stability AI launched Stable Audio, a subscription-based platform for creating music with AI models.

Some people expect that a “ChatGPT moment” for AI-powered music generation is just around the corner. What’s undeniable is that, with foundational ideas from LLMs and text-to-image models being adapted to the audio modality, the latest AI music generators have shown a giant leap forward in the quality of their outputs.

[Audio clip] This short audio clip was generated in a handful of seconds by Google’s MusicLM with the text prompt “A rhythmic east coast boom-bap hip hop beat, uplifting and inspirational.”

But how do these new models for music generation work? And how are text-based or image-based techniques used to generate synthetic AI music?

In this blog post, we will break down the key ideas behind the latest approaches to AI music generation and outline the main differences between current state-of-the-art models like MusicLM, MusicGen, and Stable Audio.

The goal is to give non-experts a good sense of how text-to-music models work under the hood. We’ll assume some general familiarity with machine learning concepts. Let's begin by examining the idea of text-conditioning in the next section.

Text-Conditioning: Text-Music Joint Embeddings

Many image generation methods like DALL-E 2 employ text-conditioning to steer the creation of images based on textual prompts. This technique entails training the model to learn joint embeddings of text and images, capturing the relationships between words (e.g., in the form of an image caption) and image attributes (content and qualities).

During image synthesis (generation), the model integrates cues from the textual prompt into image encodings, ensuring that the generated visuals align semantically with the given text.

How text-image joint embeddings are implemented in diffusion models.
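To make this concrete, below is a minimal sketch of the CLIP-style contrastive objective typically used to train such joint embeddings. The batch, embedding dimensions, and temperature are placeholders for illustration, not the setup of any specific model.

```python
# Minimal sketch of a CLIP-style contrastive objective for joint text-image embeddings.
# The embeddings would come from a text encoder and an image encoder (not shown here).
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, image_emb: torch.Tensor, temperature: float = 0.07):
    """text_emb, image_emb: (batch, dim) embeddings of matching text/image pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature      # pairwise similarity matrix
    targets = torch.arange(len(logits))                # the i-th caption matches the i-th image
    # Pull matching pairs together and push mismatched pairs apart, in both directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random tensors standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```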

What about joint embeddings of text and music? Training joint representations of text and music is a more difficult task than in the case of images for specific reasons:

  • Inherent difficulty: There is an inherent one-to-many relationship between natural language descriptions of music and the possible musical outcomes. While this is also true in the case of images, textual descriptors tend to map to concrete visual features more easily than to musical features. This also makes the evaluation step harder and highly subjective.
  • Data scarcity: Paired natural language descriptions of music and corresponding music recordings are extremely scarce, in contrast to the abundance of image/description pairs available online, e.g., in online art galleries or social media.

How, then, to effectively train a text-music joint embedding? Let’s discuss a couple of prominent examples.

MuLan, a groundbreaking approach by Google Research, is a transformer-based model trained on an extensive dataset consisting of soundtracks from 44 million online music videos alongside their text descriptions. Both MuLan’s dataset and code are closed-source. Here are the main takeaways:

  • MuLan generates embeddings for the text prompt and a spectrogram of the target audio.
  • By embedding both music and text into the same vector representation, MuLan aims to create an interface through which natural language descriptions can be linked to related music audio, and vice versa.
  • However, the text descriptions used to train MuLan are often only loosely connected with the corresponding audio. The main idea behind it is to leverage a large-scale dataset to extract relationships between text and music that transcend the conventional music-tags classification.

CLAP is a more recent (and fully open-source) model. It is based on a text-audio retrieval via contrastive learning paradigm and achieves state-of-the-art performance on zero-shot audio classification tasks. Here are the main takeaways:

  • The dataset (630k audio-text pairs), source code, and even a Hugging Face implementation of CLAP are available (see the sketch after this list).
  • The authors provide a pipeline that allows for easily testing different audio and text encoders within CLAP.
  • A “feature fusion mechanism” is proposed to accommodate varied audio lengths (a problem for transformer-based audio encoders, which expect fixed-length sequences). This is achieved by training the model on audio inputs of different lengths in constant computation time.
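Since a Hugging Face implementation is available, here is a minimal, hedged sketch of how text and audio might be embedded into CLAP’s shared space for retrieval. The checkpoint name and preprocessing details are assumptions based on the publicly released LAION checkpoints, not part of the original paper.

```python
# Hedged sketch: embedding text and audio with the Hugging Face CLAP implementation.
# The checkpoint name is an assumption; any public CLAP checkpoint should behave similarly.
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

texts = ["a mellow jazz piano trio", "an aggressive distorted guitar riff"]
waveform = np.random.randn(48000).astype(np.float32)  # placeholder: 1 s of noise at 48 kHz

text_inputs = processor(text=texts, return_tensors="pt", padding=True)
audio_inputs = processor(audios=[waveform], sampling_rate=48000, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)     # (2, dim) text embeddings
    audio_emb = model.get_audio_features(**audio_inputs)  # (1, dim) audio embedding

# Cosine similarity in the joint space: which description best matches the clip?
scores = torch.nn.functional.cosine_similarity(audio_emb, text_emb)
print(scores)
```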

The architecture of MusicLM

MusicLM is Google’s groundbreaking music generator. It can create short music clips with not only textual but also melodic conditioning, such as a whistled sketch of a melody (see below). Let’s now sketch the key features behind MusicLM’s approach.

MusicLM’s architecture in training mode (source).

Given an audio sample as input, the main idea behind MusicLM is to model sound across three distinct components by using three different types of tokens, each generated by a different model:

  • Textual-fidelity tokens: The MuLan component generates sequences of 12 audio-text tokens per second designed to represent both music and corresponding descriptions.
  • Long-term coherence (semantic modeling) tokens: A second component, based on w2v-BERT, generates 25 semantic tokens per second that represent features of large-scale composition, such as motifs or consistency in timbre. It was pre-trained to predict masked tokens in speech and fine-tuned on 8,200 hours of music.
  • Small-scale details (acoustic modeling) tokens: The encoder component of SoundStream generates 600 acoustic tokens per second, capturing fine-grained acoustic details that allow the model to output high-resolution audio. If you want to dive deeper, you can learn about how SoundStream works here.

MusicLM was trained to regenerate audio clips (30 seconds at 24 kHz) from a dataset of 280k hours of recorded music. During inference, MusicLM uses the MuLan tokens computed from the text prompt as the conditioning signal and converts the generated audio tokens into waveforms with the SoundStream decoder, as illustrated in the diagram below:

MusicLM’s architecture in inference mode (source).
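Putting the pieces together, the generation flow can be summarized with the heavily simplified sketch below. All components are dummy stand-ins, since the actual MusicLM components are not public; only the token rates and the overall staging reflect the paper.

```python
# Heavily simplified sketch of MusicLM-style hierarchical inference, with dummy
# stand-ins for MuLan, the semantic/acoustic stages, and the SoundStream decoder.
import numpy as np

rng = np.random.default_rng(0)

def mulan_text_tokens(prompt: str) -> np.ndarray:
    return rng.integers(0, 1024, size=12)             # stand-in for MuLan audio-text tokens

def semantic_stage(cond: np.ndarray, n_tokens: int) -> np.ndarray:
    return rng.integers(0, 1024, size=n_tokens)       # long-term structure (~25 tokens/s)

def acoustic_stage(cond, semantic, n_tokens: int) -> np.ndarray:
    return rng.integers(0, 1024, size=n_tokens)       # fine acoustic detail (~600 tokens/s)

def soundstream_decode(acoustic: np.ndarray, sr: int = 24000) -> np.ndarray:
    return np.zeros(len(acoustic) * sr // 600)        # stand-in for the SoundStream decoder

def generate_music(prompt: str, seconds: int = 30) -> np.ndarray:
    cond = mulan_text_tokens(prompt)                          # conditioning from the text prompt
    semantic = semantic_stage(cond, 25 * seconds)             # 30 s -> 750 semantic tokens
    acoustic = acoustic_stage(cond, semantic, 600 * seconds)  # 30 s -> 18,000 acoustic tokens
    return soundstream_decode(acoustic)                       # back to a 24 kHz waveform
```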

Melodic Prompts: Controlling Musical Contour

Interestingly, MusicLM integrates a melody-conditioning component that enables the integration of melodic prompts. This is achieved by allowing the user to provide a melody in the form of humming, singing, whistling, or playing an instrument.

This allows for a more natural approach to music conditioning than text prompts alone and can offer the user more control over the creative process. It may also enable iterative refinement of the model’s output.

As an example, the following short audio clip:

[Audio: Guitar solo, 0:09]

was generated by MusicLM with the input text prompt “guitar solo,” together with the following melodic prompt:

[Audio: Melodic prompt, 0:10]

For its implementation, the researchers curated a (proprietary) dataset consisting of audio pairs with the same melody but different acoustics. This was done by using different versions of the same music clip, such as covers, instrumentals, or vocals. The dataset is used to train a joint embedding model that learns to associate similar melodies.

During music generation, the provided melody is quantized with RVQ into a series of tokens, which are then concatenated with the MuLan tokens. This concatenated sequence is fed into MusicLM, allowing for the generation of music that not only aligns with the textual description but also follows the contour of the input melody.

The authors of MusicGen took a slightly different approach to melody conditioning, based on unsupervised learning. The basic idea is to condition the model on an additional variable during training: the chromagram (a specific type of spectrogram) of the input audio waveform.

More specifically, the researchers found that the optimal way to do this is to apply the following steps (a code sketch follows the list):

  • A music source separation model is applied to the original waveform to decompose the reference track into four components: drums, bass, vocals, and other. Drums and bass frequencies are discarded to recover only the melodic structure of the residual waveform.
  • Filtering is applied to the chromagram at each time step, extracting only the dominant frequency range. This is to avoid overfitting, which would lead to the reconstruction of the original sample as is.
  • Finally, the refined chromagram is quantized to create the conditioning that is later fed to the model.
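As an illustration of the chromagram steps, here is a small sketch using librosa to extract a coarse, quantized melody signal from a waveform. The preprocessing choices (sample rate, hop length) are illustrative assumptions, and the source-separation step is omitted for brevity.

```python
# Illustrative sketch of MusicGen-style melody conditioning from a chromagram.
# The source-separation step (drums/bass removal) is omitted here.
import librosa
import numpy as np

def melody_tokens(y: np.ndarray, sr: int = 32000, hop_length: int = 512) -> np.ndarray:
    """Return one discrete melody symbol (dominant chroma bin) per frame."""
    # Chromagram: energy in each of the 12 pitch classes over time.
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)
    # Keep only the dominant bin per frame, so the conditioning carries the melodic
    # contour but not enough detail to reconstruct the original audio.
    return chroma.argmax(axis=0)          # shape (n_frames,), values in [0, 12)

# Toy usage: a 2-second 440 Hz tone should map mostly to the pitch class of A.
sr = 32000
t = np.linspace(0, 2, 2 * sr, endpoint=False)
tokens = melody_tokens(np.sin(2 * np.pi * 440 * t).astype(np.float32), sr=sr)
```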

Token Interleaving Patterns

MusicLM and similar autoregressive models for music generation represent audio as sequences of tokens obtained through residual vector quantization (RVQ). This technique compresses audio into several parallel streams of discrete tokens, one per quantization block, or codebook.

RVQ utilizes multiple codebooks for efficient and high-fidelity data compression.
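To see how RVQ produces several token streams from one vector, here is a toy sketch with random codebooks; the sizes are purely illustrative, not those of SoundStream or EnCodec.

```python
# Toy sketch of residual vector quantization (RVQ) with random codebooks.
import numpy as np

rng = np.random.default_rng(0)
n_codebooks, codebook_size, dim = 4, 256, 8
codebooks = rng.normal(size=(n_codebooks, codebook_size, dim))

def rvq_encode(x: np.ndarray) -> list[int]:
    """Quantize a single vector into one token index per codebook."""
    residual, tokens = x.copy(), []
    for cb in codebooks:
        # Pick the codeword closest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual -= cb[idx]        # the next codebook quantizes what is left over
    return tokens

def rvq_decode(tokens: list[int]) -> np.ndarray:
    """Reconstruct the vector by summing the selected codewords."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

x = rng.normal(size=dim)
x_hat = rvq_decode(rvq_encode(x))  # the approximation improves with each extra codebook
```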


During inference, the transformer model can decipher this token sequence using different possible patterns, known as codebook interleaving patterns. To illustrate the idea, here are a few possible examples:

  • Parallel pattern: it predicts all codebooks simultaneously at every time step. It is computationally the most efficient, but performance is lower because the codebooks are not statistically independent of each other.
  • Flattening patterns: it sequentially predicts the first time step of each codebook, then the second time step of each codebook, and so forth for all the time steps. MusicLM follows this pattern and uses a chain of two (transformer) decoders: the first one predicting the first half of the codebooks, the second one predicting the second half of the codebooks (conditioned on the first model’s outputs). This approach is high performing but comes at a high computational cost.
  • VALL-E pattern: it processes the first codebook across all time steps in sequence, then handles the subsequent codebooks in parallel.
  • Delayed patterns: The idea is to use a pattern that sets a delay interval among the codebooks, i.e., to introduce offsets between the different streams of tokens (see figure below).

Through empirical evaluations, the authors of MusicGen highlighted the strengths and limitations of different codebook patterns and showed that a simple delayed pattern can reach performance similar to the flattening pattern at only a fraction of the computational cost.

Different token interleaving patterns were systematically analyzed in the MusicGen report.
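The delayed pattern in particular is easy to visualize in code. The toy sketch below shifts each codebook’s token stream by its index, which is the core idea; the padding symbol and sizes are illustrative.

```python
# Toy illustration of the "delayed" codebook interleaving pattern: codebook k is
# shifted by k steps, so one transformer step emits one token per codebook while
# each codebook can still condition on the previous codebooks' recent tokens.
PAD = "_"

def delay_pattern(tokens: list[list[int]]) -> list[list[object]]:
    """tokens[k][t] is the RVQ token of codebook k at time step t."""
    n_codebooks, n_steps = len(tokens), len(tokens[0])
    width = n_steps + n_codebooks - 1
    grid = [[PAD] * width for _ in range(n_codebooks)]
    for k in range(n_codebooks):
        for t in range(n_steps):
            grid[k][t + k] = tokens[k][t]   # shift codebook k right by k positions
    return grid

# 3 codebooks, 4 time steps
example = [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]]
for row in delay_pattern(example):
    print(row)
# [11, 12, 13, 14, '_', '_']
# ['_', 21, 22, 23, 24, '_']
# ['_', '_', 31, 32, 33, 34]
```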

One important aspect to mention is that, unlike other approaches to music generation that rely on cascading several models (and may suffer from error propagation), MusicGen comprises a single-stage transformer coupled with efficient token interleaving patterns. This design eliminates the need for hierarchical or upsampling models, simplifying the music generation process and providing greater control over the generated output.

Timing-Conditioning: Controlling Output Duration

Generative audio models are trained to produce audio of a predetermined size, such as a 30-second segment in the case of MusicLM, which is quite restrictive for real-world applications. Moreover, training often requires randomly cropping longer audio files to fit the model’s specified training length, at times resulting in audio snippets that start or end abruptly.

Stable Audio, a latent diffusion model for audio recently released by Stability AI, tries to overcome this limitation. Developed in the lineage of Stable Diffusion for images, Stable Audio uses a Variational Autoencoder (VAE) to convert audio into embeddings.

The model is trained with conditioning on text metadata alongside the audio file’s duration and start time. With this timing conditioning, the system can create audio of a designated length, limited only by the track lengths seen in the training data. Here’s how timing conditioning works:

  • During the training phase, when a segment of audio is selected, two parameters are extracted:
    • seconds_start – the start time of the chunk.
    • seconds_total – the cumulative duration of the original audio.
  • For example, for a 30-second snippet drawn from an 80-second audio track, beginning at the 0:14 mark, seconds_start is 14 and seconds_total is 80.
  • These values are converted into per-second discrete learned embeddings and appended to the prompt tokens before entering the U-Net’s cross-attention layers (see the sketch after this list).
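A minimal sketch of what such a timing conditioner could look like in PyTorch is shown below. The module names, embedding sizes, and the way the two values are appended are illustrative assumptions, not Stable Audio’s actual code.

```python
# Hedged sketch of timing conditioning: seconds_start / seconds_total are mapped to
# learned per-second embeddings and appended to the text-prompt tokens before
# cross-attention. Names and sizes are illustrative.
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    def __init__(self, max_seconds: int = 512, dim: int = 768):
        super().__init__()
        self.start_emb = nn.Embedding(max_seconds, dim)   # one learned embedding per second
        self.total_emb = nn.Embedding(max_seconds, dim)

    def forward(self, prompt_tokens: torch.Tensor, seconds_start: int, seconds_total: int):
        timing = torch.stack([
            self.start_emb(torch.tensor(seconds_start)),
            self.total_emb(torch.tensor(seconds_total)),
        ]).unsqueeze(0)                                   # (1, 2, dim)
        # Appended to the prompt tokens, then fed to the U-Net's cross-attention.
        return torch.cat([prompt_tokens, timing], dim=1)

cond = TimingConditioner()
prompt = torch.randn(1, 77, 768)                          # stand-in for text embeddings
tokens = cond(prompt, seconds_start=14, seconds_total=80) # the 30 s / 80 s example above
```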

The exact architecture is based on Moûsai, a cascading (2-stage) latent diffusion model. The training dataset of 800k diverse audio files includes music, sound effects, and solo instruments, each paired with descriptive annotations. The VAE in this model learns to create compact audio embeddings. A pre-trained CLAP transformer also generates text embeddings to represent musical characteristics like style, instrumentation, tempo, and mood.

As with any diffusion model, Stable Audio adds noise to the audio embedding, which a U-Net convolutional neural network learns to remove, guided by the text and timing embeddings.

Moûsai’s architecture in inference mode (source).

At inference, the model starts from a noise-dominated embedding, a user-supplied text prompt, and the desired length of the output audio. The system iteratively removes noise to refine the audio embedding, which the VAE decoder then translates back into CD-quality audio.
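Structurally, that inference loop looks roughly like the following skeleton. The U-Net, VAE decoder, and especially the denoising update are placeholders, not Stable Audio’s actual sampler or noise schedule.

```python
# Rough structural skeleton of latent-diffusion sampling for audio; the denoising
# update is a toy placeholder, not a real sampler or noise schedule.
import torch

@torch.no_grad()
def sample(unet, vae_decode, text_emb, timing_emb, latent_shape, n_steps: int = 100):
    cond = torch.cat([text_emb, timing_emb], dim=1)   # conditioning for cross-attention
    latent = torch.randn(latent_shape)                # start from pure noise
    for step in reversed(range(n_steps)):
        t = torch.full((latent_shape[0],), step)
        noise_pred = unet(latent, t, cond)            # predict the noise at this step
        latent = latent - noise_pred / n_steps        # toy update; a real sampler (e.g. DDIM)
                                                      # follows a learned noise schedule
    return vae_decode(latent)                         # decode the latent back to a waveform
```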

Final Words

The landscape of generative AI is evolving at a brisk pace. The advent of models like MusicLM, MusicGen, and Stable Audio marks a significant leap in AI-generated audio, offering promising prospects for creators across diverse domains like music, video, video games, and podcasts.

These models currently struggle to create coherent structure in extended audio outputs and to maintain the clarity and detail of individual sound elements. They also have considerable limitations in reproducing vocal sounds. However, the indications are clear that the industry is moving towards the commercial deployment of these systems, and we will probably witness further exciting advancements in audio generation soon.

If you’re interested in learning more about Machine Learning, feel free to check out some of the other articles on our blog.

Alternatively, check out our YouTube channel or follow us on Twitter to stay in the loop when we drop new tutorials and guides.