In its simplest form, Speaker Diarization answers the question: who spoke when?
In the field of Automatic Speech Recognition (ASR), Speaker Diarization refers to (A) automatically detecting the number of speakers in an audio file, and (B) assigning the words in that file to the correct speaker.
Today, many modern Speech-to-Text APIs, like AssemblyAI, use deep learning research to accurately perform tasks (A) and (B) automatically.
In this blog post, we’ll take a closer look at how Speaker Diarization works, why it’s useful, some of its current limitations, and how to easily use Speaker Diarization on audio/video files.
How Does Speaker Diarization Work?
The fundamental task of Speaker Diarization is to apply speaker labels (i.e., “Speaker A,” “Speaker B,” etc.) to each word in the transcription text of an audio/video file.
Accurate Speaker Diarization requires many steps. The first step is to break the audio file into a set of “utterances.” What makes up an utterance? Generally, utterances are between half a second and 10 seconds of speech. To illustrate this, let’s look at the examples below:
Hello my name is Bob.
I like cats and live in New York City.
In the same way that a single word wouldn’t be enough for a human to identify a speaker, machine learning models need more data to identify speakers too. This is why the first step is to break the audio file into a set of “utterances” that can later be assigned to a specific speaker (e.g., “Speaker A” spoke “Utterance 1”).
There are many ways to break up an audio/video file into a set of utterances, with one common way being to use silence and punctuation markers. In our research, we start seeing a drop-off in Speaker Diarization’s ability to correctly assign an utterance to a speaker when utterances are shorter than one second.
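To make the silence-and-punctuation idea concrete, here is a minimal sketch in Python. It assumes we already have word-level timestamps from a transcript. The `Word` class, the 0.5-second `max_gap`, and the sample words are all hypothetical choices for illustration, not the method any particular API uses.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio

def split_into_utterances(words, max_gap=0.5, terminators=".?!"):
    """Group words into utterances, breaking on long pauses or end punctuation."""
    utterances, current = [], []
    for word in words:
        # A long silence since the previous word starts a new utterance.
        if current and word.start - current[-1].end > max_gap:
            utterances.append(current)
            current = []
        current.append(word)
        # Sentence-ending punctuation closes the current utterance.
        if word.text and word.text[-1] in terminators:
            utterances.append(current)
            current = []
    if current:
        utterances.append(current)
    return utterances

def duration(utterance):
    """Length of an utterance in seconds, first word start to last word end."""
    return utterance[-1].end - utterance[0].start

# Hypothetical word timings, made up for this example.
words = [
    Word("Hello", 0.0, 0.4), Word("my", 0.5, 0.6), Word("name", 0.6, 0.9),
    Word("is", 0.9, 1.0), Word("Bob.", 1.0, 1.4),
    Word("I", 2.5, 2.6), Word("like", 2.6, 2.9), Word("cats", 2.9, 3.3),
    Word("and", 4.5, 4.7), Word("dogs.", 4.7, 5.2),
]
utterances = split_into_utterances(words)
texts = [" ".join(w.text for w in u) for u in utterances]

# Utterances under one second are the ones most likely to be mislabeled.
short = [t for t, u in zip(texts, utterances) if duration(u) < 1.0]
print(texts)
print(short)
```

In this toy input, the first utterance ends on punctuation and the second ends on a pause. The last two utterances are both under one second, which is the range where we see assignment accuracy start to drop.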
Once an audio file is broken into utterances, those utterances get sent through a deep learning model that has been trained to produce “embeddings” that are highly representative of a speaker’s characteristics. An embedding is a deep learning model’s high-dimensional representation of an input. For example, the image below shows what the embedding of a word looks like:
We do a similar process to convert not words, but segments of audio, into embeddings as well.
Next, we need to determine how many speakers are present in the audio file. This is a key feature of a modern Speaker Diarization model. Legacy Speaker Diarization systems required knowing how many speakers were in an audio/video file ahead of time, but a major benefit of modern Speaker Diarization models is that they can accurately predict this number.
Our first goal here is to overestimate the number of speakers. Using a clustering method, we want to estimate the highest number of speakers that is reasonably possible. Why overestimate? It’s much easier to combine speakers’ utterances if the model breaks them up into different speaker labels than it is to disentangle two speakers being combined into one.
After this initial step, we go back and combine speakers, or disentangle speakers, as needed to get an accurate number.
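As a rough illustration of "overestimate, then combine," the sketch below starts with one cluster per utterance (the largest possible overestimate) and greedily merges the two most similar clusters until no pair is similar enough. This is a generic agglomerative approach that we are using as a stand-in; it is an assumption for illustration, not a description of any production model. The 0.9 threshold and the two-dimensional vectors are made up. Real embeddings have many more dimensions, and this sketch only covers combining speakers, not disentangling them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def merge_speakers(embeddings, threshold=0.9):
    """Start with one cluster per utterance, then merge the most similar pairs."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None  # (similarity, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                centroid_a = centroid([embeddings[i] for i in clusters[a]])
                centroid_b = centroid([embeddings[i] for i in clusters[b]])
                similarity = cosine(centroid_a, centroid_b)
                if best is None or similarity > best[0]:
                    best = (similarity, a, b)
        # Stop once even the closest pair looks like two different speakers.
        if best[0] < threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy utterance embeddings (hypothetical values).
embeddings = [
    [1.0, 0.1], [0.9, 0.2],   # two utterances pointing the same way
    [0.1, 1.0], [0.2, 0.9],   # two utterances pointing another way
    [-1.0, 0.1],              # one utterance unlike the rest
]
clusters = merge_speakers(embeddings)
print(clusters)  # each inner list holds the utterance indices for one speaker
```

Here five initial "speakers" collapse to three. Starting high and merging downward reflects the reasoning above: joining two clusters that belong together is a simple operation, while splitting one cluster that wrongly mixes two speakers is much harder.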
Finally, Speaker Diarization models take the embeddings (produced above), and cluster them into as many clusters as there are speakers. For example, if a Speaker Diarization model predicts there are four speakers in an audio file, the embeddings will be forced into four groups based on the “similarity” of the embeddings.
For example, in the image below, let’s assume each dot is an utterance. The utterances get clustered together based on their similarity, with the idea being that each cluster is a unique speaker.
There are many ways to determine similarity of embeddings, and this is a core component of accurately predicting speaker labels with a Speaker Diarization model.
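Two of the most common similarity measures are cosine similarity and Euclidean distance. The sketch below, using made-up three-dimensional vectors, shows how they can disagree. It is only meant to illustrate why the choice matters, not to suggest which measure a given diarization model uses.

```python
import math

def cosine_similarity(a, b):
    """Compares direction only; values near 1.0 mean the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Compares position; 0.0 means the vectors are identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings for illustration.
a = [1.0, 2.0, 3.0]
scaled_a = [2.0, 4.0, 6.0]   # same direction as `a`, twice the magnitude
b = [3.0, 2.0, 1.0]          # same magnitude as `a`, different direction

print(cosine_similarity(a, scaled_a))   # very close to 1.0
print(euclidean_distance(a, scaled_a))  # clearly non-zero
print(cosine_similarity(a, b))          # noticeably lower
```

Cosine similarity treats `a` and `scaled_a` as nearly identical, while Euclidean distance treats them as far apart. Which behavior is preferable depends on how the embedding model was trained.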
After this step, you now have a transcription complete with accurate speaker labels!
Today’s speaker diarization models can be used to determine up to 26 speakers in the same audio/video file with high accuracy.
Why is Speaker Diarization Useful?
Speaker diarization is useful because it takes a big wall of text and breaks it into something much more meaningful and valuable. If you were to try to read a transcription without speaker labels, your brain would automatically try to assign each word/sentence to the appropriate speaker. Speaker diarization saves you time and mental energy.
For example, let’s compare the transcripts below, without and with Speaker Diarization:
Without Speaker Diarization:
But how did you guys first meet and how do you guys know each other? I actually met her not too long ago. I met her, I think last year in December, during pre season, we were both practicing at Carson a lot. And then we kind of met through other players. And then I saw her a few her last few torments this year, and we would just practice together sometimes, and she's really, really nice. I obviously already knew who she was because she was so good. Right. So. And I looked up to and I met her. I already knew who she was, but that was cool for me. And then I watch her play her last few events, and then I'm actually doing an exhibition for her charity next month. I think super cool. Yeah. I'm excited to be a part of that. Yeah. Well, we'll definitely highly promote that. Vania and I are both together on the Diversity and Inclusion committee for the USDA, so I'm sure she'll tell me all about that. And we're really excited to have you as a part of that tournament. So thank you so much. And you have had an exciting year so far. My goodness. Within your first WTI 1000 doubles tournament, the Italian Open. Congrats to that. That's huge. Thank you.
With Speaker Diarization:
<Speaker A> But how did you guys first meet and how do you guys know each other? <Speaker B> I actually met her not too long ago. I met her, I think last year in December, during pre season, we were both practicing at Carson a lot. And then we kind of met through other players. And then I saw her a few her last few torments this year, and we would just practice together sometimes, and she's really, really nice. I obviously already knew who she was because she was so good. <Speaker A> Right. So. <Speaker B> And I looked up to and I met her. I already knew who she was, but that was cool for me. And then I watch her play her last few events, and then I'm actually doing an exhibition for her charity next month. <Speaker A> I think super cool. <Speaker B> Yeah. I'm excited to be a part of that. <Speaker A> Yeah. Well, we'll definitely highly promote that. Vania and I are both together on the Diversity and Inclusion committee for the USDA. So I'm sure she'll tell me all about that. And we're really excited to have you as a part of that tournament. So thank you so much. And you have had an exciting year so far. My goodness. Within your first WTI 1000 doubles tournament, the Italian Open. Congrats to that. That's huge. <Speaker B> Thank you.
See how much easier the transcription is to read with Speaker Diarization?
Speaker diarization is also a powerful analytic tool. By identifying and labeling speakers, you can analyze each speaker’s behaviors, identify patterns/trends among individual speakers, make predictions, and more. For example:
- A call center might analyze agent messages versus customer requests, or complaints, to identify trends that could help facilitate better communication.
- A podcast service might use speaker labels to identify the <host> and <guest>, making transcriptions more readable for end users.
- A telemedicine platform might identify <doctor> and <patient> to create an accurate transcript, attach a readable transcript to patient files, or input the transcript into an EHR system.
Limitations of Speaker Diarization
Currently, Speaker Diarization models only work for asynchronous transcription, not real-time transcription. However, this is an active area of research.
There are also two main constraints that limit the accuracy of modern Speaker Diarization models:
- Speaker talk time
- Conversational pace
A speaker’s talk time has the biggest impact on accuracy. If a speaker talks for less than 15 seconds in an entire audio file, it’s a toss-up whether a Speaker Diarization model will correctly identify this speaker as a unique, individual speaker. If it cannot, one of two outcomes may occur: the model may label the speaker as <unknown>, or it may merge their words with those of a more dominant speaker. Generally speaking, a speaker has to talk for more than 30 seconds in order to be accurately detected by a Speaker Diarization model.
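If you already have labeled segments, a quick sanity check is to add up each speaker’s talk time and flag anyone at or below the 30-second guideline. The sketch below assumes segments arrive as `(speaker, start, end)` tuples in seconds. That shape, the function names, and the sample data are hypothetical.

```python
from collections import defaultdict

def talk_time_by_speaker(segments):
    """Sum the duration of every segment for each speaker label."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)

def low_talk_time_speakers(segments, min_seconds=30.0):
    """Return speakers whose total talk time does not exceed `min_seconds`."""
    totals = talk_time_by_speaker(segments)
    return sorted(s for s, seconds in totals.items() if seconds <= min_seconds)

# Hypothetical diarized segments: (speaker label, start, end) in seconds.
segments = [
    ("A", 0.0, 42.0),
    ("B", 42.0, 50.0),
    ("A", 50.0, 95.0),
    ("C", 95.0, 130.0),
    ("B", 130.0, 134.0),
]
totals = talk_time_by_speaker(segments)
flagged = low_talk_time_speakers(segments)
print(totals)
print(flagged)
```

In this made-up example, Speaker B talks for only 12 seconds in total, so B’s label is the one to treat with the most caution.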
Conversational pace has the second biggest impact on accuracy. If the conversation is well-defined, with each speaker taking clear turns (think of a Joe Rogan podcast versus a phone call conversation), no over-talking or interrupting, and minimal background noise, it is much more likely that the model will correctly label each speaker. However, if the conversation is more energetic, with the speakers cutting each other off or speaking over one another, or has significant background noise, the model’s accuracy will decrease. If overtalk (aka crosstalk) is common, the model may even assign the portions of overtalk to an imaginary third speaker.
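One way to get a feel for how much crosstalk a recording contains is to measure how long segments from different speakers overlap. The sketch below assumes you have `(speaker, start, end)` segments whose times are allowed to overlap, for example from a hand-annotated reference. The input format and the sample data are hypothetical.

```python
def crosstalk_seconds(segments):
    """Total overlap between segments of different speakers, summed pairwise.

    If three speakers talk at once, each overlapping pair is counted separately.
    """
    total = 0.0
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            speaker_a, start_a, end_a = segments[i]
            speaker_b, start_b, end_b = segments[j]
            if speaker_a == speaker_b:
                continue
            # Two intervals overlap when the later start precedes the earlier end.
            overlap = min(end_a, end_b) - max(start_a, start_b)
            if overlap > 0:
                total += overlap
    return total

# Hypothetical segments: clean turn-taking versus frequent interruptions.
clear_turns = [("A", 0.0, 5.0), ("B", 5.0, 9.0), ("A", 9.0, 12.0)]
interrupting = [("A", 0.0, 5.0), ("B", 4.0, 9.0), ("A", 8.0, 12.0)]

print(crosstalk_seconds(clear_turns))
print(crosstalk_seconds(interrupting))
```

The first conversation has no overlap at all, while the second has two seconds of it across two interruptions. The more of a recording’s length is taken up by overlap, the harder it is to diarize.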
While there are clearly some limitations to Speaker Diarization today, Speech-to-Text APIs like AssemblyAI are using deep learning research to overcome these deficiencies and boost Speaker Diarization accuracy.
Using Speaker Diarization Technology
While the models behind Speaker Diarization technology may seem complex, using Speaker Diarization on audio files is thankfully quite simple!
Companies like AssemblyAI offer an accurate transcription and Speaker Diarization API.
All you have to do is run your audio file through the AssemblyAI API to get an accurate transcription, with speaker labels assigned to each word in the transcription file. The API assigns a label to each word that tells us which speaker spoke when, or at least gives its best guess given the constraints outlined above! The examples above were created using the AssemblyAI API.
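Once you have a speaker label per word, turning it into a readable transcript like the one shown earlier takes only a few lines. The sketch below assumes each word is a dictionary with `text` and `speaker` keys. That shape is an assumption for illustration, so check the API’s response format before relying on it.

```python
def words_to_turns(words):
    """Collapse consecutive words with the same speaker label into one turn."""
    turns = []
    for word in words:
        if turns and turns[-1]["speaker"] == word["speaker"]:
            turns[-1]["text"] += " " + word["text"]
        else:
            turns.append({"speaker": word["speaker"], "text": word["text"]})
    return turns

def format_transcript(turns):
    """Render turns in the '<Speaker X> text' style used in this post."""
    return " ".join(f"<Speaker {turn['speaker']}> {turn['text']}" for turn in turns)

# Hypothetical word-level output (keys are assumed, not the exact API schema).
words = [
    {"text": "That's", "speaker": "A"},
    {"text": "huge.", "speaker": "A"},
    {"text": "Thank", "speaker": "B"},
    {"text": "you.", "speaker": "B"},
]
turns = words_to_turns(words)
formatted = format_transcript(turns)
print(formatted)
```

Grouping by consecutive label, rather than by sentence, keeps short interjections such as “Right. So.” as their own turns.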
What’s on the horizon for Speaker Diarization models? Increased accuracy, of course, is always top of mind. At AssemblyAI, we’re also working to create a “voice print” for each speaker that could be used to identify the same speaker across multiple audio files. This would speed up the diarization process, and make it easier for companies to identify not just “Speaker A” but actually identify who is speaking, for example, “Bob” or “Janet.” Real-time Speaker Diarization is also an active area of research, as we outlined above.