- Accuracy of speaker count estimation boosted from 79% to 97% on a benchmark meetings dataset
- Segmentation is now primarily text-based
- Longer clips while maintaining our expectation of a single speaker
- Improved our method for extracting features from the clips
- Incorporates some of the latest research in speech and vision into its trunk, pooling, and training regime
- The system outputs an ‘Unknown’ label for words it is very uncertain about, rather than forcing a guess like most other diarization systems
What is speaker diarization?
The output of a standard speech recognition system is a list of words associated with their start and end times. This list of words, when viewed as a whole, is the transcription.
If you’ve ever tried to read the transcript of a conversation, usually just a wall of text, then you’ll have been keenly aware of the absence of diarization, even if you didn’t realize it.
As you read along, your brain tries to reconstruct the flow of the conversation, who says what and when, and that can be tiring and error-prone.
In the context of speech recognition, diarization is the process of assigning a speaker label to each recognized word.
It answers the question “Who spoke when?”.
These labels not only improve readability, but they also make it possible to analyze a conversation at the participant level: roles, vocabularies, points of focus, summaries, sentiment, and many other speaker-specific metrics to suit your application.
Speaker diarization is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity.
What are the challenges with speaker diarization?
Diarization is a challenging task with several different components.
It requires segmenting a call into units (clips) of audio that we expect not only to contain speech but also speech belonging to a single speaker. The clips are then processed in the following way:
- extracting representative features
- using those features to estimate the correct number of speakers on the call
- using them again to assign each clip to its appropriate speaker
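The steps above can be sketched end to end. Everything below is a toy illustration, not our actual system: the "embedding" here is just waveform statistics, and the greedy distance-threshold clustering stands in for the real speaker count estimator and clustering algorithm.

```python
import numpy as np

def embed(clip):
    # Toy "embedding": mean and standard deviation of the waveform.
    # A real system would use a neural network here.
    return np.array([clip.mean(), clip.std()])

def estimate_num_speakers(embs, threshold=0.5):
    # Count speakers as the number of clusters found by a greedy
    # distance-threshold pass (a stand-in for the real estimator).
    centers = []
    for e in embs:
        if not any(np.linalg.norm(e - c) < threshold for c in centers):
            centers.append(e)
    return len(centers)

def assign(embs, threshold=0.5):
    # Greedily assign each clip to the nearest existing cluster,
    # creating a new cluster when nothing is close enough.
    centers, labels = [], []
    for e in embs:
        dists = [np.linalg.norm(e - c) for c in centers]
        if dists and min(dists) < threshold:
            labels.append(int(np.argmin(dists)))
        else:
            centers.append(e)
            labels.append(len(centers) - 1)
    return labels

# Two synthetic "speakers": low-amplitude and high-amplitude clips.
rng = np.random.default_rng(0)
clips = [rng.normal(0, 0.1, 1000) for _ in range(3)] + \
        [rng.normal(0, 2.0, 1000) for _ in range(3)]
embs = [embed(c) for c in clips]
print(estimate_num_speakers(embs))  # → 2
print(assign(embs))                 # → [0, 0, 0, 1, 1, 1]
```

The toy clustering separates the two synthetic speakers cleanly; the real difficulty, as discussed below, comes from voices that are close together in embedding space.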
That’s a lot to ask, so it’s useful to frame our expectations for complex machine learning systems against human performance.
If we consider a conversation between a pair of strangers who are related, or between a handful of people with an accent dramatically different from your own, the difficulty becomes clearer.
Guessing the number of distinct speakers can be challenging, and keeping track of them even more so, especially for short or fast-paced calls.
Humans are also surprisingly bad at ABX speaker tests on isolated audio clips, especially when those clips are short.
Our model shares many of these human difficulties. We’ve taken care to train and benchmark it over a range of ages, accents, and genders.
However, just like with humans, the model can confuse speakers whose voices have very similar characteristics. The model also shares our difficulty with short conversations: the longer a person’s turn in a conversation and the longer the conversation goes, the more context there is for it to make a better decision.
There are other challenges that an automated diarization system doesn’t share with humans.
Humans are extremely adept at filtering out noise and background music. We’ve augmented our model’s training to encourage it to exhibit this same robustness, but we don’t expect it to generalize perfectly.
One quirk of the system is that a person who both speaks and sings within the same call will likely be recognized as two separate speakers, even when singing a cappella!
People also have an incredible ability to separate and follow overtalk.
Effectively handling overtalk in a diarization system remains an open area of research in the field, and we are continuing to experiment with existing and novel solutions.
How is the new model an improvement?
This new system is an improvement across many fronts.
Previously, a call was segmented uniformly over the speech regions identified earlier in the speech recognition process, slicing it into overlapping clips only a fraction of a second in duration. Given that duration and overlap, it’s reasonable to assume that most of those clips contained a single speaker.
However, like people, a neural network’s feature extraction becomes significantly less discriminative when a clip is less than a second long.
Segmentation is now primarily text-based, using signals from the output of our new punctuation model over the transcribed text with additional support from shallow features of the audio data such as pauses in speech. This enables longer clips while maintaining our expectation of a single speaker.
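A minimal sketch of how punctuation- and pause-driven segmentation might work, assuming the recognizer emits (word, start, end) tuples. The 0.8-second pause threshold and the splitting rules are invented for illustration, not our production logic.

```python
PAUSE_THRESHOLD = 0.8  # seconds; an assumed value, not the real one

def segment(words):
    """words: list of (text, start, end) tuples from the recognizer.
    Split at sentence-final punctuation or at long pauses, so each
    clip is more likely to contain a single speaker's turn."""
    clips, current = [], []
    for i, (text, start, end) in enumerate(words):
        current.append((text, start, end))
        ends_sentence = text.endswith(('.', '?', '!'))
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        long_pause = (next_start is not None
                      and next_start - end > PAUSE_THRESHOLD)
        if ends_sentence or long_pause or next_start is None:
            clips.append(current)
            current = []
    return clips

words = [("Okay.", 0.0, 0.4), ("So", 0.5, 0.6), ("what's", 0.6, 0.9),
         ("for", 0.9, 1.0), ("dinner?", 1.0, 1.4),
         ("Pasta,", 2.6, 3.0), ("I", 3.0, 3.1), ("think.", 3.1, 3.4)]
print([len(c) for c in segment(words)])  # → [1, 4, 3]
```

Note how the first clip is a single word: sentence-final punctuation forces a boundary even without a pause, which is what lets the clips grow longer than fixed-duration slices while still likely containing one speaker.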
We’ve also improved our method for extracting features from the clips. The new model incorporates some of the latest research in speech and vision into its trunk, pooling, and training regime, and it can quickly extract a single meaningful feature vector (an embedding) from a clip of audio regardless of its duration.
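One common way to get a fixed-length embedding from variable-length audio is statistics pooling over per-frame features, as in x-vector-style systems from the speaker recognition literature. The sketch below uses a toy log-energy feature in place of a neural trunk; only the pooling idea carries over, and the frame sizes are assumed values.

```python
import numpy as np

def frame_features(audio, frame_len=400, hop=160):
    # Toy per-frame feature (log energy); a real trunk would be a
    # neural network producing a feature vector per frame.
    n = max(1, 1 + (len(audio) - frame_len) // hop)
    frames = [audio[i * hop:i * hop + frame_len] for i in range(n)]
    return np.array([[np.log(np.mean(f ** 2) + 1e-8)] for f in frames])

def pooled_embedding(audio):
    # Statistics pooling (mean and std over time) collapses the time
    # axis, so the output dimension is fixed no matter how long the
    # clip is.
    feats = frame_features(audio)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

short = np.random.default_rng(1).normal(size=1600)    # 0.1 s at 16 kHz
long_ = np.random.default_rng(2).normal(size=160000)  # 10 s at 16 kHz
print(pooled_embedding(short).shape, pooled_embedding(long_).shape)
```

Both clips yield embeddings of the same shape, which is what makes clips of any duration directly comparable during clustering.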
Finally, we’ve made improvements to our speaker count estimation and clustering algorithms. These changes boosted the accuracy of our speaker count estimation from 79% to 97% over a test partition of a popular meetings dataset. They also allow our system to output an ‘Unknown’ label for words it is very uncertain about, rather than forcing a guess like most other diarization systems.
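The ‘Unknown’ behavior can be illustrated with a simple distance threshold: if a clip’s embedding is too far from every speaker centroid, decline to guess. The threshold, centroids, and embeddings below are invented for the example and are not our actual decision rule.

```python
import numpy as np

def label_with_unknown(embeddings, centroids, max_dist=1.0):
    # Assign each clip to its nearest speaker centroid, but emit
    # 'Unknown' when even the best match is too far away, instead of
    # forcing a guess. max_dist is an assumed threshold.
    names = ["A", "B", "C", "D"]
    labels = []
    for e in embeddings:
        dists = [np.linalg.norm(e - c) for c in centroids]
        best = int(np.argmin(dists))
        labels.append(names[best] if dists[best] <= max_dist
                      else "Unknown")
    return labels

centroids = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
embs = [np.array([0.2, -0.1]),   # close to speaker A
        np.array([4.8, 5.3]),    # close to speaker B
        np.array([10.0, -9.0])]  # far from everyone
print(label_with_unknown(embs, centroids))  # → ['A', 'B', 'Unknown']
```

Abstaining on the outlier trades a little recall for much higher precision on the words that do receive a speaker label.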
See it in Action
We used the audio clip in this GitHub repo for the following demo.
Here is a snippet of the transcription, with the speakers labelled A, B, and C.
B Okay. So what are we having for dinner tonight?
C I think I'm gonna have pasta, but I'm not sure yet. Like, my my girlfriend said, like, he's going to prepare pasta, so probably past time.
A My God, I don't know what I should have. I guess I haven't really thought about it, but I had rice and chicken for lunch. So something that is not right and shaken for dinner.
C How about you feeling?
B Yeah, I did Bart chicken and rice last night, so it's probably going to be either spaghetti or had a tiny for me.
B I do Cook usually. I don't love it, but I don't actively dislike it, which is my wife.
C That's pretty cool. Yeah. I order most of the time outside. Don't ask. Most of the time. Well, have you even Micho? Right.
A I do do that a lot, but I've been trying to Cook more often, so I have been doing that all my cooking, like simple meals, like rice and chicken.
B Do you have a go to take out place?
C Me. Let's see. There are a bunch of places around here. Like, the one I like the most is called Deli Board. It's like a sandwich place. I've been telling Michael that every time he comes to take him there, but it's never been there. The Deli Board one. Remember Michael? Yeah, that's the best. And it's in SF. It's pretty cool.
A I still need to try that for me. Take out. I really like, I guess. I don't know. Sushi or Indian food. The Pakistani restaurant? Yeah. Pretty similar. There's Pakistani restaurant. That's Super good. And there's a Sushi restaurant nearby, too. That's really good. So that's what I'd like to take out. It's kind of expensive, though.
B Yeah, definitely. We moved away. We're close to to our SSI place, and now it's, like, 15 minutes away. Is it worth it? I guess so. Yeah.
C Do you live in like, you live in Denver, right?
C Like, do you live around downtown or, like, pretty far we did last year.
B Last year, we were in walking distance at the baseball Stadium Super downtown. But we moved, like, 15, 20 minutes East. Now we have a second bedroom. It's fine for us who didn't be right there.
C Yeah. Makes sense. That's pretty cool.