Top Speaker Diarization Libraries and APIs in 2022

In its simplest form, Speaker Diarization answers the question: who spoke when?

In the field of Automatic Speech Recognition (ASR), Speaker Diarization refers to (A) automatically detecting the number of speakers in an audio file, and (B) assigning each word in that file to the correct speaker.

Today, many modern Speech-to-Text APIs and Speaker Diarization libraries apply advanced Deep Learning models to perform tasks (A) and (B) with near human-level accuracy, significantly increasing the utility of Speaker Diarization APIs.

In this blog post, we’ll take a closer look at how Speaker Diarization works, why it’s useful, some of its current limitations, and the top three Speaker Diarization libraries and APIs for product teams and developers to use.

What is Speaker Diarization and How Does It Work?

The fundamental task of Speaker Diarization is to apply speaker labels (i.e., “Speaker A,” “Speaker B,” etc.) to each utterance in the transcription text of an audio/video file.

Accurate Speaker Diarization requires many steps. The first step is to break the audio file into a set of “utterances.” What constitutes an utterance? Generally, an utterance is between half a second and 10 seconds of speech. To illustrate this, let’s look at the examples below:

Utterance 1:

Hello my name is Bob.

Utterance 2:

I like cats and live in New York City.

In the same way that a single word wouldn’t be enough for a human to identify a speaker, Machine Learning models need more than a single word to identify speakers. This is why the first step is to break the audio file into a set of “utterances” that can later be assigned to a specific speaker (e.g., “Speaker A” spoke “Utterance 1”).

There are many ways to break up an audio/video file into a set of utterances, with one common way being to use silence and punctuation markers. In our research, we start seeing a drop-off in a Speaker Diarization model’s ability to correctly assign an utterance to a speaker when utterances are shorter than one second.
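To make the silence-based approach more concrete, here is a minimal sketch in Python. It assumes the audio has already been reduced to one energy value per fixed-length frame. The function name, the frame length, and every threshold are illustrative assumptions, not values taken from a particular library or production system, and punctuation markers are not covered.

```python
from typing import List, Tuple


def split_on_silence(
    energies: List[float],
    frame_sec: float = 0.1,          # assumed frame length
    silence_threshold: float = 0.01, # assumed energy below which a frame is silent
    min_silence_sec: float = 0.3,    # assumed pause length that ends an utterance
    min_utterance_sec: float = 0.5,  # shorter utterances are dropped
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for each detected utterance."""
    min_silence_frames = round(min_silence_sec / frame_sec)
    min_utterance_frames = round(min_utterance_sec / frame_sec)

    spans = []       # (first speech frame, exclusive end frame)
    start = None     # first frame of the utterance currently being built
    silent_run = 0   # consecutive silent frames seen inside that utterance

    for i, energy in enumerate(energies):
        if energy >= silence_threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # The utterance ended at the first frame of this silent run.
                spans.append((start, i - silent_run + 1))
                start, silent_run = None, 0

    # Close an utterance that runs to the end of the audio.
    if start is not None:
        spans.append((start, len(energies) - silent_run))

    return [
        (round(s * frame_sec, 3), round(e * frame_sec, 3))
        for s, e in spans
        if e - s >= min_utterance_frames
    ]
```

With these defaults, a 0.2 second burst of speech between two pauses is discarded, which mirrors the observation that very short utterances are hard to assign to a speaker.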

Once an audio file is broken into utterances, those utterances get sent through a Deep Learning model that has been trained to produce “embeddings” that are highly representative of a speaker’s characteristics. An embedding is a Deep Learning model’s low-dimensional representation of an input. For example, the image below shows what the embedding of a word looks like:

*Source: https://medium.com/@hari4om/word-embedding-d816f643140*

We perform a similar process to convert segments of audio, rather than words, into embeddings.

Next, we need to determine how many speakers are present in the audio file. This is a key feature of a modern Speaker Diarization model. Legacy Speaker Diarization systems required knowing how many speakers were in an audio/video file ahead of time, but a major benefit of modern Speaker Diarization models is that they can accurately predict this number.

Our first goal here is to overestimate the number of speakers. Using a clustering method, we want to determine the greatest number of speakers that could reasonably be heard in the audio. Why overestimate? It's much easier to combine the utterances of one speaker who has been incorrectly identified as two than it is to disentangle the utterances of two speakers who have incorrectly been combined into one.

After this initial step, we go back and combine speakers, or disentangle speakers, as needed to get an accurate number.
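As a rough illustration of the "overestimate, then combine" idea, the sketch below merges provisional speakers whose average embeddings (centroids) are very similar. The cosine-similarity measure, the 0.9 threshold, and the function names are assumptions chosen for this example; a real system may use a different measure and will tune the threshold on data.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_similar_speakers(
    centroids: Dict[str, List[float]],
    threshold: float = 0.9,  # assumed similarity above which two labels are one speaker
) -> Dict[str, str]:
    """Map each provisional speaker label to the label it is merged into."""
    labels = sorted(centroids)
    parent = {label: label for label in labels}

    def find(label: str) -> str:
        # Follow parent links until we reach a label that represents itself.
        while parent[label] != label:
            label = parent[label]
        return label

    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if cosine_similarity(centroids[a], centroids[b]) >= threshold:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    # Keep the alphabetically first label as the survivor.
                    parent[max(root_a, root_b)] = min(root_a, root_b)

    return {label: find(label) for label in labels}
```

For example, if provisional speakers "A" and "B" have nearly identical centroids while "C" points in a different direction, the function maps both "A" and "B" to "A" and leaves "C" alone, reducing three provisional speakers to two.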

Finally, Speaker Diarization models take the utterance embeddings (produced above), and cluster them into as many clusters as there are speakers. For example, if a Speaker Diarization model predicts there are four speakers in an audio file, the embeddings will be forced into four groups based on the “similarity” of the embeddings.

For example, in the below image, let’s assume each dot is an utterance. The utterances get clustered together based on their similarity - with the idea being that each cluster corresponds to the utterances of a unique speaker.

Source: https://github.com/tranleanh/centroid-neural-networks

There are many ways to determine similarity of embeddings, and this is a core component of accurately predicting speaker labels with a Speaker Diarization model. After this step, you now have a transcription complete with accurate speaker labels!
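The clustering step described above can be sketched with a small k-means routine. This is only one of many possible clustering methods, and it is used here as a stand-in: Euclidean distance as the similarity measure, the farthest-point initialization, and the function name are all assumptions made for illustration. The points are toy two-dimensional "embeddings"; real speaker embeddings have many more dimensions.

```python
import math
from typing import List, Sequence


def kmeans(points: Sequence[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Force every point into one of k clusters; returns one cluster index per point."""
    # Farthest-point initialization: start from the first point, then repeatedly
    # add the point that is farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Six toy utterance embeddings and a predicted speaker count of three.
utterance_embeddings = [
    [0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [0.2, 0.1], [9.0, 0.0],
]
speaker_ids = kmeans(utterance_embeddings, k=3)
```

Utterances that receive the same index in `speaker_ids` are treated as coming from the same speaker.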

Today’s Speaker Diarization models can distinguish up to 26 speakers in the same audio/video file with high accuracy.

Why is Speaker Diarization Useful?

Speaker Diarization is useful because it takes a big wall of text and breaks it into something much more meaningful and valuable. If you were to try and read a transcription without speaker labels, your brain would automatically try and assign each word/sentence to the appropriate speaker. Speaker Diarization saves you time and mental energy.

For example, let’s compare the transcripts below, first without and then with Speaker Diarization:

Without Speaker Diarization:

But how did you guys first meet and how do you guys know each other? I  
actually met her not too long ago. I met her, I think last year in  
December, during pre season, we were both practicing at Carson a lot. 
And then we kind of met through other players. And then I saw her a few 
her last few torments this year, and we would just practice together 
sometimes, and she's really, really nice. I obviously already knew who 
she was because she was so good. Right. So. And I looked up to and I met 
her. I already knew who she was, but that was cool for me. And then I 
watch her play her last few events, and then I'm actually doing an 
exhibition for her charity next month. I think super cool. Yeah. I'm 
excited to be a part of that. Yeah. Well, we'll definitely highly
promote that. Vania and I are both together on the Diversity and 
Inclusion committee for the USDA, so I'm sure she'll tell me all about 
that. And we're really excited to have you as a part of that tournament. 
So thank you so much. And you have had an exciting year so far. My 
goodness. Within your first WTI 1000 doubles tournament, the Italian 
Open. Congrats to that. That's huge. Thank you.

With Speaker Diarization:

<Speaker A> But how did you guys first meet and how do you guys know each 
other?
<Speaker B> I actually met her not too long ago. I met her, I think last 
year in December, during pre season, we were both practicing at Carson a 
lot. And then we kind of met through other players. And then I saw her a 
few her last few torments this year, and we would just practice together 
sometimes, and she's really, really nice. I obviously already knew who
she was because she was so good.
<Speaker A> Right. So.
<Speaker B> And I looked up to and I met her. I already knew who she 
was, but that was cool for me. And then I watch her play her last few 
events, and then I'm actually doing an exhibition for her charity next 
month.
<Speaker A> I think super cool.
<Speaker B> Yeah. I'm excited to be a part of that.
<Speaker A> Yeah. Well, we'll definitely highly promote that. Vania and 
I are both together on the Diversity and Inclusion committee for the 
USDA. So I'm sure she'll tell me all about that. And we're really 
excited to have you as a part of that tournament. So thank you so much. 
And you have had an exciting year so far. My goodness. Within your first 
WTI 1000 doubles tournament, the Italian Open. Congrats to that. That's 
huge.
<Speaker B> Thank you.

See how much easier the transcription is to read with Speaker Diarization?
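Producing the labeled layout above from diarized output is straightforward. The sketch below assumes the diarization step yields a list of `(speaker, word)` pairs in spoken order; that input shape and the function name are assumptions for this example, since each library or API returns its own structure.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def format_transcript(words: Iterable[Tuple[str, str]]) -> List[str]:
    """Group consecutive words by speaker into '<Speaker X> text' lines."""
    lines = []
    # groupby starts a new group every time the speaker label changes.
    for speaker, group in groupby(words, key=lambda item: item[0]):
        text = " ".join(word for _, word in group)
        lines.append(f"<Speaker {speaker}> {text}")
    return lines


diarized_words = [
    ("A", "That's"), ("A", "huge."),
    ("B", "Thank"), ("B", "you."),
]
for line in format_transcript(diarized_words):
    print(line)
```

Because `groupby` only merges adjacent items, a speaker who talks, is interrupted, and talks again correctly gets two separate lines.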

Speaker Diarization is also a powerful analytic tool. By identifying and labeling speakers, product teams and developers can analyze each speaker’s behaviors, identify patterns/trends among individual speakers, make predictions, and more. For example:

A call center might analyze agent/customer calls, customer requests, or complaints to identify trends that could help facilitate better communication.
A podcast service might use speaker labels to identify the <host> and <guest>, making transcriptions more readable for end users.
A telemedicine platform might identify <doctor> and <patient> to create an accurate transcript, attach a readable transcript to patient files, or input the transcript into an EHR system.

Top 3 Speaker Diarization Libraries and APIs

While the models behind Speaker Diarization technology may seem complex, performing Speaker Diarization on audio files is quite simple thanks to specialized libraries and APIs!

Here are the three best libraries and APIs, including open source options, to consider if you would like to perform Speaker Diarization on an audio or video file:

AssemblyAI

AssemblyAI is a leading speech recognition startup that offers Speech-to-Text transcription with high accuracy, in addition to offering Audio Intelligence features such as Sentiment Analysis, Topic Detection, Summarization, Entity Detection, and more.

Its Core Transcription API includes an option for Speaker Diarization. Simply enable Speaker Diarization when you run an audio or video file through the API and your transcript will accurately identify Speaker Labels per speech segment.
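As a hedged sketch of what enabling the option can look like, the snippet below builds a transcription request with Python's standard library. The endpoint URL, the `speaker_labels` field, and the `authorization` header reflect our understanding of the v2 REST API and should be treated as assumptions to verify against the current API documentation; the helper name and the example audio URL are hypothetical.

```python
import json
import urllib.request

# Assumed v2 transcription endpoint; confirm against the current API docs.
API_URL = "https://api.assemblyai.com/v2/transcript"


def build_diarization_request(audio_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a transcription request with speaker labels enabled."""
    payload = {
        "audio_url": audio_url,
        "speaker_labels": True,  # assumed flag that turns on Speaker Diarization
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


# Hypothetical audio URL and placeholder key, for illustration only.
request = build_diarization_request("https://example.com/interview.mp3", "YOUR_API_KEY")
```

Sending the request (for example with `urllib.request.urlopen(request)`) starts an asynchronous transcription job, so a real integration would also need to poll for the finished transcript before reading the speaker labels.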


PyAnnote

PyAnnote is an open source Speaker Diarization toolkit written in Python and built on the PyTorch Machine Learning framework.

While PyAnnote does offer some pretrained models through pyannote.audio, developers may have to train its end-to-end neural building blocks to modify and perfect their own Speaker Diarization model. Note that pyannote.audio only supports Python 3.7 or later on Linux and macOS.

Kaldi

Kaldi is another open source option for Speaker Diarization. With Kaldi, developers can either train the models from scratch or download the pre-trained X-Vectors network or PLDA backend from the Kaldi website.

Whichever of the above methods you choose, you will still need to put in work to achieve accurate and useful Speaker Diarization with Kaldi. This Kaldi tutorial can walk you through the necessary steps to get started with Kaldi if you are interested. Once you are familiar with Kaldi, this tutorial can help you train your own Speaker Diarization model.

Limitations of Speaker Diarization

Currently, Speaker Diarization models only work for asynchronous transcription and not real-time transcription. However, this is an active area of research.

There are also two main constraints that limit the accuracy of modern Speaker Diarization models:

Speaker talk time
Conversational pace

A speaker’s talk time has the biggest impact on accuracy. If a speaker talks for less than 15 seconds in an entire audio file, it’s a toss-up whether a Speaker Diarization model will correctly identify this speaker as a unique, individual speaker. If it cannot, two outcomes may occur: the model may label the speaker as <unknown>, or it may merge their words with those of a more dominant speaker. Generally speaking, a speaker has to talk for more than 30 seconds in order to be accurately detected by a Speaker Diarization model.
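One practical way to act on this is to total each speaker's talk time after diarization and flag anyone below the 30-second guideline as a label to double-check. The sketch below assumes segments arrive as `(speaker, start_sec, end_sec)` tuples; that shape and the function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Segment = Tuple[str, float, float]  # (speaker, start_sec, end_sec), an assumed shape


def talk_time(segments: Iterable[Segment]) -> Dict[str, float]:
    """Total seconds of speech attributed to each speaker."""
    totals: Dict[str, float] = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)


def low_talk_time_speakers(
    segments: Iterable[Segment], min_seconds: float = 30.0
) -> List[str]:
    """Speakers whose total talk time falls below the reliability guideline."""
    return sorted(
        speaker
        for speaker, seconds in talk_time(segments).items()
        if seconds < min_seconds
    )


segments = [
    ("A", 0.0, 40.0),
    ("B", 40.0, 52.0),
    ("A", 52.0, 60.0),
    ("C", 60.0, 95.0),
]
flagged = low_talk_time_speakers(segments)  # speaker "B" spoke for only 12 seconds
```

Flagged speakers are not necessarily wrong, but their labels deserve extra scrutiny before being used in downstream analytics.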

The pace and type of communication have the second biggest impact on accuracy, with conversational communication being the easiest to accurately diarize. If the conversation is well-defined, with each speaker taking clear turns (think of a Joe Rogan podcast versus a phone call with your best friend), little over-talking or interrupting, and minimal background noise, it is much more likely that the model will correctly label each speaker.

However, if the conversation is more energetic, with the speakers cutting each other off or speaking over one another, or has significant background noise, the model’s accuracy will decrease. If overtalk (aka crosstalk) is common, the model may even invent an imaginary third speaker to which it assigns the portions of overtalk.
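If segment timestamps are available, the amount of overtalk in a file can be estimated before trusting its speaker labels. The sketch below again assumes `(speaker, start_sec, end_sec)` tuples and a hypothetical function name. It sums overlap pairwise, so a stretch where three people talk at once is counted once per pair of speakers.

```python
from typing import Iterable, Tuple

Segment = Tuple[str, float, float]  # (speaker, start_sec, end_sec), an assumed shape


def overlap_seconds(segments: Iterable[Segment]) -> float:
    """Total seconds during which segments from different speakers overlap (pairwise)."""
    ordered = sorted(segments, key=lambda seg: seg[1])  # sort by start time
    total = 0.0
    for i, (speaker_a, _start_a, end_a) in enumerate(ordered):
        for speaker_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later segment can overlap this one
            if speaker_a != speaker_b:
                # start_b is the later start, so the overlap begins there.
                total += min(end_a, end_b) - start_b
    return total


conversation = [
    ("A", 0.0, 5.0),
    ("B", 4.0, 8.0),    # overlaps A for 1.0 second
    ("A", 8.0, 10.0),
    ("B", 9.5, 12.0),   # overlaps A for 0.5 seconds
]
crosstalk = overlap_seconds(conversation)
```

A high ratio of crosstalk to total duration is a hint that the diarization output for that file may contain merged or spurious speakers.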


While there are clearly some limitations to Speaker Diarization today, advances in Deep Learning research are helping to overcome these deficiencies and to boost Speaker Diarization accuracy.