June 24, 2024

Top Speaker Diarization Libraries and APIs in 2024

In this blog post, we’ll take a closer look at how Speaker Diarization works, why it’s useful, some of its current limitations, and the top five Speaker Diarization libraries and APIs to use.

Kelsey Foster

Growth

In its simplest form, speaker diarization answers the question: who spoke when?

In the field of Automatic Speech Recognition (ASR), speaker diarization refers to (A) automatically detecting the number of speakers in an audio file, and (B) assigning the words in that file to the correct speaker.

Today, many modern Speech-to-Text APIs and Speaker Diarization libraries apply advanced deep learning models to perform tasks (A) and (B) at near human-level accuracy, significantly increasing the utility of Speaker Diarization APIs.

In this blog post, we’ll look at how speaker diarization works, why it’s useful, some of its current limitations, and the top five Speaker Diarization libraries and APIs for product teams and developers to use.

What is speaker diarization?

Speaker diarization answers the question: "Who spoke when?" It involves segmenting and labeling an audio stream by speaker, allowing for a clearer understanding of who is speaking at any given time. This process is essential for automatic speech recognition (ASR), meeting transcription, call center analytics, and more.

Speaker Diarization performs two key functions:

Speaker Detection: Identifying the number of distinct speakers in an audio file.
Speaker Attribution: Assigning segments of speech to the correct speaker.

The result is a transcript where each segment of speech is tagged with a speaker label (e.g., "Speaker A," "Speaker B"), making it easy to distinguish between different voices. This improves the readability of transcripts and increases the accuracy of analyses that depend on understanding who said what.

How does speaker diarization work?

The fundamental task of speaker diarization is to apply speaker labels (i.e., “Speaker A,” “Speaker B,” etc.) to each utterance in the transcription text of an audio/video file.

Accurate speaker diarization requires many steps. The first step is to break the audio file into a set of “utterances.” What constitutes an utterance? Generally, an utterance is between half a second and 10 seconds of speech. To illustrate this, let’s look at the examples below:

Utterance 1:

Hello my name is Bob.

Utterance 2:

I like cats and live in New York City.

In the same way that a single word wouldn’t be enough for a human to identify a speaker, Machine Learning models also need more than a word or two of audio to identify a speaker. This is why the first step is to break the audio file into a set of “utterances” that can later be assigned to a specific speaker (e.g., “Speaker A” spoke “Utterance 1”).

There are many ways to break up an audio/video file into a set of utterances, with one common way being to use silence and punctuation markers. In our research, we start seeing a drop-off in a Speaker Diarization model’s ability to correctly assign an utterance to a speaker when utterances are shorter than one second.
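As a rough illustration of silence-based splitting, here is a minimal Python sketch. It is not how any particular library segments audio: it assumes the audio has already been reduced to one energy value per fixed-length frame, it ignores punctuation markers entirely, and the function name `split_on_silence` and all of its thresholds are hypothetical values chosen for the example.

```python
from typing import List, Tuple


def split_on_silence(
    frame_energy: List[float],
    frame_sec: float = 0.1,          # assumed duration of one frame
    silence_threshold: float = 0.01,  # assumed energy below which a frame is "silent"
    min_silence_sec: float = 0.3,     # a pause this long ends the utterance
    min_utterance_sec: float = 0.5,   # shorter utterances are discarded
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for each detected utterance."""
    min_silence_frames = max(1, round(min_silence_sec / frame_sec))
    spans = []        # (first_frame, end_frame_exclusive) of each utterance
    start = None      # first frame of the utterance currently being built
    silent_run = 0    # consecutive silent frames seen inside that utterance

    for i, energy in enumerate(frame_energy):
        if energy >= silence_threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # The utterance ended where the silent run began.
                spans.append((start, i - silent_run + 1))
                start, silent_run = None, 0

    if start is not None:
        # The audio ended before a long enough pause was seen.
        spans.append((start, len(frame_energy) - silent_run))

    return [
        (round(s * frame_sec, 3), round(e * frame_sec, 3))
        for s, e in spans
        if (e - s) * frame_sec >= min_utterance_sec
    ]
```

The `min_utterance_sec` filter reflects the point above: very short stretches of speech carry too little information to attribute reliably, so this sketch simply drops them. A real system might instead merge them into a neighboring utterance.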

Once an audio file is broken into utterances, those utterances get sent through a deep learning model that has been trained to produce “embeddings” that are highly representative of a speaker’s characteristics. An embedding is a deep learning model’s low-dimensional representation of an input. For example, the image below shows what the embedding of a word looks like:

*Source: https://medium.com/@hari4om/word-embedding-d816f643140*

We perform a similar process to convert not words, but segments of audio, into embeddings as well.

Next, we need to determine how many speakers are present in the audio file--this is a key feature of a modern Speaker Diarization model. Legacy Speaker Diarization systems required knowing how many speakers were in an audio/video file ahead of time, but a major benefit of modern Speaker Diarization models is that they can accurately predict this number.

Our first goal here is to overestimate the number of speakers. Using a clustering method, we want to determine the greatest number of speakers that could reasonably be heard in the audio. Why overestimate? It's much easier to combine the utterances of one speaker that has been incorrectly identified as two than it is to disentangle the utterances of two speakers which have incorrectly been combined into one.

After this initial step, we go back and combine speakers, or disentangle speakers, as needed to get an accurate number.
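The "overestimate, then combine" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method any specific model uses: it assumes each utterance already has an embedding and a provisional (over-split) speaker label, it compares clusters by the Euclidean distance between their centroids, and the name `merge_similar_speakers` and the `merge_distance` threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def merge_similar_speakers(
    embeddings: List[Sequence[float]],
    labels: List[int],
    merge_distance: float = 0.3,  # assumed threshold; would be tuned in practice
) -> List[int]:
    """Repeatedly merge the two closest provisional speakers until no pair
    of cluster centroids is closer than `merge_distance`."""
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    while len(clusters) > 1:
        centroids = {
            label: centroid([embeddings[i] for i in members])
            for label, members in clusters.items()
        }
        keys = sorted(centroids)
        distance, keep, drop = min(
            (math.dist(centroids[a], centroids[b]), a, b)
            for pos, a in enumerate(keys)
            for b in keys[pos + 1:]
        )
        if distance >= merge_distance:
            break  # every remaining pair looks like two distinct speakers
        clusters[keep].extend(clusters.pop(drop))

    merged = list(labels)
    for label, members in clusters.items():
        for i in members:
            merged[i] = label
    return merged
```

Starting from too many clusters means the only correction this sketch ever needs to make is a merge, which is the cheap direction described above. Splitting a cluster that wrongly mixes two speakers would require re-examining every utterance inside it.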

Finally, Speaker Diarization models take the utterance embeddings (produced above), and cluster them into as many clusters as there are speakers. For example, if a Speaker Diarization model predicts there are four speakers in an audio file, the embeddings will be forced into four groups based on the “similarity” of the embeddings.

For example, in the below image, let’s assume each dot is an utterance. The utterances get clustered together based on their similarity — with the idea being that each cluster corresponds to the utterances of a unique speaker.

Source: https://github.com/tranleanh/centroid-neural-networks

There are many ways to determine similarity of embeddings, and this is a core component of accurately predicting speaker labels with a Speaker Diarization model. After this step, you now have a transcription complete with accurate speaker labels!
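To make the point about similarity measures concrete, the sketch below compares two common choices, cosine similarity and Euclidean distance, on toy two-dimensional vectors. Real speaker embeddings have far more dimensions, and the vectors, the names `u1` and `u2`, and the helper `nearest` are all made up for illustration.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.hypot(*a) * math.hypot(*b)
    return dot / norms if norms else 0.0


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.dist(a, b)


def nearest(
    query: Vector,
    candidates: Dict[str, Vector],
    score: Callable[[Vector, Vector], float],
    higher_is_better: bool,
) -> str:
    """Name of the candidate that best matches `query` under `score`."""
    pick = max if higher_is_better else min
    return pick(candidates, key=lambda name: score(query, candidates[name]))


# Toy data: u1 points in the same direction as the query but is much longer;
# u2 is close to the query in space but points in a different direction.
query = [0.2, 0.1]
candidates = {"u1": [2.0, 1.0], "u2": [0.1, 0.3]}

by_cosine = nearest(query, candidates, cosine_similarity, higher_is_better=True)
by_euclid = nearest(query, candidates, euclidean_distance, higher_is_better=False)
```

Here the two measures disagree: cosine similarity picks `u1`, while Euclidean distance picks `u2`. If embeddings are first scaled to unit length, the two measures rank candidates identically, which is one reason length normalization is a common preprocessing step.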

Today’s Speaker Diarization models can be used to determine multiple speakers in the same audio/video file with high accuracy.

Speaker diarization example process

To illustrate this process, let's consider an example:

Audio Segmentation: An audio file containing a conversation is segmented into utterances based on pauses and punctuation.
Feature Extraction: Each utterance is processed by a deep learning model to create embeddings that represent the unique vocal characteristics of each speaker.
Clustering: The embeddings are clustered into groups based on their proximity in the embedding space. Each cluster is expected to correspond to one person.
Speaker Attribution: The utterances within each cluster are labeled with the same speaker tags, and these tags are used to annotate the transcript.
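The final attribution step above can be sketched as follows. This assumes the clustering step has already produced one cluster ID per utterance; the function name `label_transcript` is hypothetical, and the lettering scheme is limited to 26 speakers for simplicity.

```python
from string import ascii_uppercase
from typing import Hashable, List, Tuple


def label_transcript(
    utterances: List[str], cluster_ids: List[Hashable]
) -> List[Tuple[str, str]]:
    """Turn per-utterance cluster IDs into a speaker-labeled transcript.

    Speakers are lettered in order of first appearance, and consecutive
    utterances from the same speaker are joined into a single turn.
    """
    names = {}   # cluster ID -> "Speaker A", "Speaker B", ...
    turns = []   # list of (speaker label, text)
    for text, cluster_id in zip(utterances, cluster_ids):
        if cluster_id not in names:
            names[cluster_id] = f"Speaker {ascii_uppercase[len(names)]}"
        label = names[cluster_id]
        if turns and turns[-1][0] == label:
            turns[-1] = (label, f"{turns[-1][1]} {text}")
        else:
            turns.append((label, text))
    return turns


# The first two utterances come from the example earlier in this post;
# the third one and the cluster IDs are invented for illustration.
transcript = label_transcript(
    [
        "Hello my name is Bob.",
        "I like cats and live in New York City.",
        "Nice to meet you, Bob.",
    ],
    cluster_ids=[2, 2, 0],
)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

Note that the cluster IDs themselves are arbitrary. Diarization tells you that two utterances came from the same voice, not who that voice belongs to, which is why the output uses generic labels.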

Why is speaker diarization useful?

Speaker diarization is useful because it takes a big wall of text and breaks it into something much more meaningful and valuable. If you were to try and read a transcription without speaker labels, your brain would automatically try and assign each word/sentence to the appropriate speaker. Speaker diarization saves you time and mental energy.

For example, compare the two transcripts below, first without and then with speaker diarization:

Without speaker diarization:

With speaker diarization:

See how much easier the transcription is to read with speaker diarization?

Speaker diarization is also a powerful analytics tool. By identifying and labeling speakers, product teams and developers can analyze each speaker’s behaviors, identify patterns/trends among individual speakers, make predictions, and more. For example:

A call center might analyze agent/customer calls, customer requests, or complaints to identify trends that could help facilitate better communication.
A podcast service might use speaker labels to identify the <host> and <guest>, making transcriptions more readable for end users.
A telemedicine platform might identify <doctor> and <patient> to create an accurate transcript, attach a readable transcript to patient files, or input the transcript into an EHR system.

Additional benefits of speaker diarization

Speaker diarization doesn't just make transcripts easier to read—it adds significant value across various applications. Here are some additional benefits:

Better Readability: Clearly identified speakers make transcripts easier to follow, reducing cognitive load and improving comprehension for readers.
Improved Meeting Records: In business settings, detailed speaker labels help track contributions, follow up on action items, and guarantee accountability.
Legal Compliance: Accurate identification of speakers in legal proceedings can be critical for maintaining accurate records and guaranteeing all parties are properly represented.
Enhanced Customer Insights: In customer service and support, analyzing speaker interactions can reveal insights into customer sentiment, agent performance, and areas for process improvement.
Training and Development: For training purposes, speaker-labeled transcripts can help trainers identify areas where employees may need further development or highlight best practices from top performers.
Content Creation: For media and entertainment, speaker labels make it easier to edit and produce content, as editors can quickly locate and differentiate between speakers.
Educational Use: In educational settings, speaker-labeled transcripts can help students follow along with lectures more easily and review material more effectively.
Research and Analysis: For researchers, being able to distinguish between different speakers can provide deeper insights into conversational dynamics and interaction patterns.

Top 5 speaker diarization libraries and APIs

While the models behind speaker diarization technology may seem complex, performing speaker diarization on audio files is simple thanks to specialized libraries and APIs.

Here are the five best libraries and APIs to consider if you would like to perform speaker diarization on an audio or video file:

AssemblyAI

AssemblyAI is a leading speech recognition startup that offers Speech-to-Text transcription with high accuracy, in addition to offering Audio Intelligence features such as Sentiment Analysis, Topic Detection, Summarization, Entity Detection, and more.

Its Core Transcription API includes an option for Speaker Diarization. Simply enable Speaker Diarization when you run an audio or video file through the API and your transcript will accurately identify speaker labels per speech segment.
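As a minimal sketch of what enabling this option can look like, the snippet below uses the `assemblyai` Python SDK (`pip install assemblyai`). The SDK calls shown (`TranscriptionConfig(speaker_labels=True)`, `Transcriber().transcribe(...)`, and `transcript.utterances`) reflect the SDK as we understand it at the time of writing, so check the current documentation before relying on them. The function names, the audio URL, and the API key are placeholders.

```python
def format_utterances(utterances):
    """Render objects exposing `.speaker` and `.text` as 'Speaker X: ...' lines."""
    return [f"Speaker {u.speaker}: {u.text}" for u in utterances]


def transcribe_with_speaker_labels(audio_url: str, api_key: str):
    """Transcribe `audio_url` with Speaker Diarization enabled.

    The import is kept inside the function so the formatting helper above
    can be used without the third-party package installed.
    """
    import assemblyai as aai  # third-party SDK, not part of the standard library

    aai.settings.api_key = api_key
    config = aai.TranscriptionConfig(speaker_labels=True)
    transcript = aai.Transcriber().transcribe(audio_url, config=config)
    return format_utterances(transcript.utterances)


# Example call (requires a real API key and a reachable audio file):
# for line in transcribe_with_speaker_labels("https://example.com/call.mp3", "YOUR_API_KEY"):
#     print(line)
```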

Test AssemblyAI's Speaker Diarization for Free in AI Playground

PyAnnote

PyAnnote is an open source Speaker Diarization toolkit written in Python and built on the PyTorch Machine Learning framework.

While PyAnnote does offer some pretrained models through PyAnnote.audio, developers may have to train its end-to-end neural building blocks to modify and perfect their own Speaker Diarization model. Note that PyAnnote.audio only supports Python 3.7 or later on Linux and macOS.

Kaldi

Kaldi is another open source option for Speaker Diarization, mostly used among researchers. With Kaldi, users can either train the models from scratch or download the pre-trained X-Vectors network or PLDA backend from the Kaldi website.

Whichever of the above methods they choose, developers will still need to put in work to achieve accurate and useful Speaker Diarization with Kaldi. This Kaldi tutorial can walk you through the necessary steps to get started with Kaldi if you are interested. Once you are familiar with Kaldi, this tutorial can help you train your own Speaker Diarization model.

NVIDIA NeMo

NVIDIA NeMo offers a robust Speaker Diarization library that segments audio recordings by speaker. NeMo’s speaker diarization pipeline includes several key modules: a Voice Activity Detector (VAD) for detecting speech, a Speaker Embedding Extractor for capturing vocal characteristics, and a Clustering Module for grouping similar speaker embeddings.

It supports both oracle VAD (using ground-truth timestamps) and system VAD (using model-generated timestamps), making it adaptable to various use cases. NeMo also provides pre-trained models and extensive configuration options to let developers tailor the diarization process to their specific needs.

This makes NVIDIA NeMo a powerful solution for applications ranging from call centers and meeting transcriptions to media production and beyond.

Speechbrain

Speechbrain is an open-source PyTorch toolkit that accelerates conversational AI development, such as speech assistants, chatbots, and large language models. It supports diarization, speech recognition, speaker recognition, speech translation, and speech enhancement.

The open-source tool provides over 200 competitive training recipes on over 40 datasets, supporting 20 tasks. These recipes enable both training from scratch and fine-tuning of pre-trained models.

With features like dynamic batching, mixed-precision training, and support for single and multi-GPU training, Speechbrain is well-suited for large-scale, high-performance AI applications.

Limitations of speaker diarization

Currently, Speaker Diarization models only work for asynchronous transcription and not real-time transcription. However, this is an active area of research.

There are also several constraints that limit the accuracy of modern Speaker Diarization models:

Speaker talk time
Conversational pace

A speaker’s talk time has the biggest impact on accuracy. If a speaker talks for less than 15 seconds in an entire audio file, it’s a toss-up whether a Speaker Diarization model will correctly identify this speaker as a unique, individual speaker. If it cannot, two outcomes may occur: the model may label the speaker as <unknown>, or it may merge their words with those of a more dominant speaker. Generally speaking, a speaker has to talk for more than 30 seconds in order to be accurately detected by a Speaker Diarization model.
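One practical use of these rules of thumb is to sanity-check diarization output by total talk time per speaker. The sketch below assumes segments are available as `(speaker, start_sec, end_sec)` tuples; the function name and status strings are hypothetical, and the 15 and 30 second thresholds are simply the figures quoted above, not universal constants.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[str, float, float]  # (speaker label, start seconds, end seconds)


def talk_time_report(
    segments: List[Segment],
    toss_up_below: float = 15.0,  # under this, detection is roughly a coin flip
    reliable_above: float = 30.0,  # over this, detection is generally reliable
) -> Dict[str, Tuple[float, str]]:
    """Map each speaker to (total talk time in seconds, reliability status)."""
    totals: Dict[str, float] = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start

    report = {}
    for speaker, seconds in totals.items():
        if seconds < toss_up_below:
            status = "toss-up"
        elif seconds <= reliable_above:
            status = "borderline"
        else:
            status = "likely reliable"
        report[speaker] = (round(seconds, 2), status)
    return report
```

Speakers flagged as "toss-up" or "borderline" are the ones whose labels are worth reviewing by hand, since their words are the most likely to have been merged into a more dominant speaker.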

The pace and type of communication have the second biggest impact on accuracy, with conversational communication being the easiest to accurately diarize. If the conversation is well-defined, with each speaker taking clear turns (think of a Joe Rogan podcast versus a phone call conversation with your best friend), has an absence of over-talking or interrupting, and minimal background noise, it is much more likely that the model will correctly label each speaker.

However, if the conversation is more energetic, with the speakers cutting each other off or speaking over one another, or has significant background noise, the model’s accuracy will decrease. If overtalk (aka crosstalk) is common, the model may even invent an imaginary third speaker and assign the portions of overtalk to it.
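If you have per-speaker time segments that preserve overlap (for example, from separately recorded channels, which is an assumption here and not something every diarization system provides), a quick way to gauge how much crosstalk a recording contains is to look for segments from different speakers that intersect in time. The function name `find_crosstalk` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (speaker label, start seconds, end seconds)


def find_crosstalk(segments: List[Segment]) -> List[Tuple[float, float, str, str]]:
    """Return (start, end, speaker_a, speaker_b) for every stretch of time
    in which two different speakers are talking at once."""
    ordered = sorted(segments, key=lambda seg: seg[1])  # sort by start time
    overlaps = []
    for i, (speaker_a, _start_a, end_a) in enumerate(ordered):
        for speaker_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # segments are sorted, so nothing later can overlap
            if speaker_a != speaker_b:
                overlaps.append((start_b, min(end_a, end_b), speaker_a, speaker_b))
    return overlaps
```

Summing the lengths of the returned intervals and dividing by the recording length gives a rough crosstalk ratio. A high ratio is a signal that speaker labels for that file deserve extra scrutiny.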

AssemblyAI's Conformer-2 model - 12% improvement in robustness to noise

While there are clearly some limitations to speaker diarization today, advances in deep learning research are helping to overcome these deficiencies and to boost speaker diarization accuracy.

Build Live Transcription Solutions with AssemblyAI

By integrating speaker diarization and other advanced Speech AI features, organizations gain:

High Accuracy: Achieve near-human level transcription accuracy with our state-of-the-art models, even in challenging audio conditions.
Scalability: Seamlessly handle large volumes of audio and video data, whether for small teams or large enterprises.
Comprehensive Features: From speaker diarization and custom vocabulary to auto punctuation and confidence scores, AssemblyAI provides all the tools you need for accurate and efficient transcription.
User-Friendly API: Our easy-to-use API allows for quick integration, letting you start transcribing and analyzing audio data in no time.
Continuous Updates and Security: Stay up-to-date with monthly product improvements and benefit from enterprise-grade security practices to keep your data safe.

Experience the benefits of live transcription and speaker diarization for yourself. Test out our API in minutes with the AssemblyAI API Playground. Sign up for a free account today and access our speech-to-text and audio intelligence models.

Start Building with AssemblyAI