Releases
July 16, 2025

Introducing our most accurate Speaker Diarization yet—30% better in noisy, overlapping audio

Our new in-house speaker embedding model delivers a breakthrough 30% improvement in speaker tracking accuracy for noisy and far-field audio scenarios, while maintaining exceptional performance on clean recordings. This means your conversation intelligence just got significantly more reliable where it matters most.

Madison Boyd
Product Marketing Manager

Meeting recordings with overlapping voices, call center conversations with background noise, and conference room audio where speakers move around—these challenging scenarios have long been the weakness of speaker diarization technology. Today, we're changing that.


The improvements are live now for all customers—no code changes required.

Real-world audio. Real results.

Speaker diarization has always worked well in ideal conditions: clean audio, speakers close to microphones, minimal background noise. But real-world audio is messy. Conference rooms have ambient noise, multi-speaker discussions have overlapping voices, and remote meetings suffer from poor audio quality.

Our new speaker embedding model tackles these exact scenarios:

  • 30% improvement in diarization for challenging conditions - Error rates dropped from 29.1% to 20.4% in noisy, far-field scenarios
  • Breakthrough short-segment speaker identification performance - 43% improvement in very short audio segments (250ms) under noisy conditions
  • Reverberant environment excellence - 57% improvement in mid-length reverberant audio
  • Automatic deployment - All customers benefit immediately with no integration work

Here's how our enhanced speaker diarization performs compared to our previous model:

Enhanced Speaker Verification Performance: Error rates across different audio conditions and segment lengths

Audio Condition      Segment Length       Previous Model   New Model   Improvement
Clean Audio          Very Short (250ms)   18.8%            16.4%       13% better
Clean Audio          Short (500ms)        10.4%            6.4%        38% better
Noisy Audio          Very Short (250ms)   46.8%            26.4%       44% better
Noisy Audio          Short (500ms)        18.4%            14.4%       22% better
Reverberant Audio    Mid-length (1.5s)    15.2%            4.4%        71% better
Noise + Reverb       Short (500ms)        40.0%            22.8%       43% better
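The "Improvement" column is the relative reduction in error rate between the two models. A quick sanity check in Python, using the values from the table above:

```python
# Relative improvement between the previous and new model, computed
# from the error rates reported in the comparison table.

def relative_improvement(previous: float, new: float) -> int:
    """Percentage reduction in error rate, rounded to the nearest whole percent."""
    return round((previous - new) / previous * 100)

conditions = {
    "Clean, very short (250ms)": (18.8, 16.4),
    "Clean, short (500ms)":      (10.4, 6.4),
    "Noisy, very short (250ms)": (46.8, 26.4),
    "Noisy, short (500ms)":      (18.4, 14.4),
    "Reverberant, mid (1.5s)":   (15.2, 4.4),
    "Noise + reverb, short":     (40.0, 22.8),
}

for name, (prev, new) in conditions.items():
    print(f"{name}: {relative_improvement(prev, new)}% better")
```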

Listen to the difference

Example 1: Very Quiet Interactions


When speakers drop their voice or speak away from the microphone, traditional systems often miss or misattribute these quiet moments. Our enhanced model maintains speaker tracking even in quiet segments.

Old model:

A    I'm gonna sort of break the rules a bit. Go against the system.
B    Yeah.
A    I got myself a Valentine's present.
B    Sorry, what? I got myself a Valentine's present. Oh, I just actually heard it the second time a bit louder for everyone else.

New model:

A    So I thought, I'm gonna sort of break the rules a bit. Go against the system.
B    Yeah.
A    I got myself a Valentine's present.
B    Sorry, what?
A    I got myself a Valentine's present.
B    Oh, I just actually heard it the second time a bit louder for everyone else.

Example 2: Short Word Attribution


Brief acknowledgements like "okay," "yes," and "I'll do it" are now tracked accurately, enabling better conversation flow analysis even with background noise.

Old model:

A    I'm gonna sort of break the rules a bit. Go against the system.
B    Yeah.
A    I got myself a Valentine's present.
B    Sorry, what? I got myself a Valentine's present. Oh, I just actually heard it the second time a bit louder for everyone else.

New model:

A    So I thought, I'm gonna sort of break the rules a bit. Go against the system.
B    Yeah.
A    I got myself a Valentine's present.
B    Sorry, what?
A    I got myself a Valentine's present.
B    Oh, I just actually heard it the second time a bit louder for everyone else.

Example 3: Noisy In-Person Meeting


Conference room recordings with typing, paper shuffling, and ambient noise are now tracked more accurately. The enhanced model maintains consistent speaker identity throughout challenging acoustic conditions.

Old model:

C    Sorry, just explain again. You're saying that it's not.
B    We're not recording. It doesn't get recorded properly and people don't get the replies to their emails and people don't get replies to their phone calls.
A    So it's the contact shift.
D    No. Yes.
C    I think maybe I'll need to speak to

New model:

C    Sorry, just explain again. You're saying that it's not. We're not recording.
B    It doesn't get recorded properly and people don't get the replies to their emails and people don't get replies to their phone calls.
A    So it's the contact shift.
C    No. Yes. I think maybe I'll need to speak to.

Example 4: Multi-Speaker Robustness


Complex audio with multiple speakers and background noise that previously collapsed to a single speaker is now accurately separated, demonstrating improved robustness to high acoustic variability.

Old model:

A    David. I've pulled my hamstring. Can you help me out? I'll finish it up for you. Thanks. We got a great show for you tonight. When we get back,

New model:

A    David. I've pulled my hamstring. Can you help me out?
B    I'll finish it up for you.
A    Thanks.
B    We got a great show for you tonight. When we get back,

Under the hood: what makes it work

Our enhanced speaker diarization is built from the ground up to handle real-world audio challenges. We developed a completely new in-house framework for training speaker embeddings, with advanced data augmentation targeting the exact scenarios where traditional models fail.

1. Robust to acoustic variations

Traditional speaker diarization systems struggle when recording conditions change. Our new model handles:

  • Inconsistent volume levels between speakers without losing speaker identity
  • Distance variations when speakers move away from microphones during meetings
  • Natural meeting sounds like paper shuffling, typing, and ambient noise

2. Breakthrough short-segment performance

Previous models required substantial audio segments to maintain speaker identity. Our enhanced model excels at:

  • 250ms segments - Accurate speaker ID for utterances as short as one word
  • Natural conversation flow - Tracking brief acknowledgments and interruptions
  • Improved conversation intelligence - Better understanding of turn-taking patterns

3. Reverberant environment excellence

Conference rooms, large spaces, and poor acoustic environments no longer break speaker tracking:

  • 57% improvement in mid-length reverberant audio performance
  • Consistent accuracy across different room sizes and acoustic properties
  • Robust speaker embeddings trained specifically for challenging far-field and noisy conditions
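To illustrate the role embeddings play, here is a simplified sketch (not our production pipeline): each audio segment is mapped to a fixed-length vector, and a segment is attributed to the speaker whose embedding centroid is most similar by cosine similarity. Robustness means embeddings of the same speaker stay close even when the audio is noisy. The 3-dimensional vectors below are toy values for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def assign_speaker(segment_emb, speaker_centroids):
    """Label a segment with the most similar known speaker centroid."""
    return max(
        speaker_centroids,
        key=lambda spk: cosine_similarity(segment_emb, speaker_centroids[spk]),
    )

# Toy 3-dim embeddings; real speaker embeddings have hundreds of dimensions.
centroids = {"A": [0.9, 0.1, 0.0], "B": [0.1, 0.9, 0.2]}
noisy_segment = [0.8, 0.25, 0.05]  # same speaker as A, perturbed by noise
print(assign_speaker(noisy_segment, centroids))  # prints "A"
```

A noise-robust embedding model keeps the perturbed segment closer to its true speaker's centroid, which is exactly what prevents misattribution in the examples above.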
"With the latest improvements to speaker diarization, even meetings recorded in noisy rooms with many participants and a single microphone now produce transcripts as accurate as virtual calls. That level of precision is key to how Fellow uses AI to transform conversations into lasting knowledge, so teams can rely on what’s captured to make decisions, follow through, and stay aligned."

- Alexandra S., VP Engineering @
Fellow.app

Solving real problems with better accuracy

Enhanced speaker diarization directly addresses the frustrating scenarios that break existing systems, enabling applications that previously struggled with real-world audio:

Conference room & in-person collaboration

  • Reliable attribution in noisy environments where multiple conversations happen simultaneously
  • Consistent speaker tracking despite ambient noise, HVAC systems, and office activity
  • Multi-speaker recognition that doesn't collapse complex discussions to just 2-3 voices

Hybrid & remote meeting intelligence

  • Capture quiet side conversations that were previously missed or misattributed
  • Track brief acknowledgments like "yes," "okay," and "got it" for complete participation analysis
  • Maintain speaker identity even when participants type, shuffle papers, or move around the room

Field recording & interviews

  • Separate overlapping speakers in interview scenarios with background noise
  • Process real-world source audio from phones, field recordings, or portable devices
  • Handle rapid speaker transitions in talk shows, panels, and multi-guest formats

Transparent performance, automatic benefits

Unlike solutions that require complex integration or configuration changes, our enhanced speaker diarization delivers immediate value:

  • Automatic deployment - Improvements are live for all customers now
  • No code changes required - Existing integrations continue working seamlessly
  • Consistent API - Same endpoints, same response format, better results
  • Transparent improvements - Clear metrics showing exactly where performance improved

Built for real-world performance

Most speaker diarization systems perform well in ideal conditions but struggle when audio gets challenging. Our enhanced model is specifically designed to excel where others fail:

  • Robust in challenging environments - Maintains speaker diarization performance in scenarios where noise and reverberation previously compromised quality
  • Consistent across conditions - Delivers reliable performance whether audio is clean or impacted by the imperfections of real-world recordings
  • Production-ready reliability - 30% better accuracy in the noisy, far-field scenarios your applications actually encounter

Start using enhanced speaker diarization today

Speaker diarization that actually works in real-world conditions isn't just a nice-to-have—it's essential for reliable conversation intelligence. Don't settle for systems that fail when audio gets challenging.

Three ways to experience the improvements:

  1. Automatic benefits: If you're already using our speaker diarization, you're getting the improvements now
  2. Try the playground: Test enhanced speaker diarization with your specific audio files using our interactive testing environment
  3. Explore the documentation: Review our comprehensive Speaker Diarization Guide for implementation details and best practices
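If you are integrating directly against the API, diarization is enabled with the `speaker_labels` parameter on a transcription request, and each utterance in the completed transcript carries a `speaker` field. The sketch below shows the request body and a small formatter applied to a hand-written sample shaped like the `utterances` field (simplified to the `speaker` and `text` keys used here); the audio URL is a placeholder.

```python
# Transcription request with speaker diarization enabled.
# In practice, POST this JSON to https://api.assemblyai.com/v2/transcript
# with your API key in the "authorization" header, then poll the
# transcript until its status is "completed".
request_body = {
    "audio_url": "https://example.com/meeting.mp3",  # placeholder URL
    "speaker_labels": True,  # enables speaker diarization
}

def format_utterances(utterances):
    """Render diarized utterances as 'Speaker: text' lines."""
    return "\n".join(f"{u['speaker']}: {u['text']}" for u in utterances)

# Hand-written sample shaped like the "utterances" field of a
# completed transcript.
sample = [
    {"speaker": "A", "text": "I got myself a Valentine's present."},
    {"speaker": "B", "text": "Sorry, what?"},
]
print(format_utterances(sample))
```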

Try a More Powerful Speech-to-Text API

AssemblyAI delivers comprehensive Speech AI solutions including automatic transcription, speaker identification, and intelligent audio analysis. Create your free account and receive $50 in credits to explore our Speaker Diarization capabilities.

Sign up now

Performance metrics based on comprehensive evaluation across 205+ hours of audio including meeting recordings, call center conversations, and challenging acoustic environments. Improvements measured using Diarization Error Rate (DER) across multiple standard datasets including AMI, DIPCO, and VoxConverse.
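For reference, Diarization Error Rate is the fraction of total reference speech time that is handled incorrectly: the sum of missed speech, false alarm speech, and speaker confusion time, divided by total speech time. A minimal sketch (the durations below are illustrative, not measured values):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total speech time."""
    return (missed + false_alarm + confusion) / total_speech

# Illustrative durations, in seconds of audio.
der = diarization_error_rate(missed=12.0, false_alarm=5.0,
                             confusion=24.0, total_speech=200.0)
print(f"DER = {der:.1%}")  # prints "DER = 20.5%"
```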
