Releases
July 16, 2025

Introducing our most accurate Speaker Diarization yet—30% better in noisy, overlapping audio

Our new in-house speaker embedding model delivers a breakthrough 30% improvement in speaker tracking accuracy for noisy and far-field audio scenarios, while maintaining exceptional performance on clean recordings. This means your conversation intelligence just got significantly more reliable where it matters most.

Madison Boyd
Product Marketing Manager

Meeting recordings with overlapping voices, call center conversations with background noise, and conference room audio where speakers move around—these challenging scenarios have long been the weakness of speaker diarization technology. Today, we're changing that.


The improvements are live now for all customers—no code changes required.

Real-world audio. Real results.

Speaker diarization has always worked well in ideal conditions: clean audio, speakers close to microphones, minimal background noise. But real-world audio is messy. Conference rooms have ambient noise, multi-speaker discussions have overlapping voices, and remote meetings suffer from poor audio quality.

Our new speaker embedding model tackles these exact scenarios:

  • 30% improvement in diarization for challenging conditions - Error rates dropped from 29.1% to 20.4% in noisy, far-field scenarios
  • Breakthrough short-segment speaker identification performance - 43% improvement in very short audio segments (250ms) under noisy conditions
  • Reverberant environment excellence - 57% improvement in mid-length reverberant audio
  • Automatic deployment - All customers benefit immediately with no integration work

Here's how our enhanced speaker diarization performs compared to our previous model:

Enhanced Speaker Verification Performance: Error rates across different audio conditions and segment lengths

Audio Condition      Segment Length       Previous Model   New Model   Improvement
Clean Audio          Very Short (250ms)   18.8%            16.4%       13% better
Clean Audio          Short (500ms)        10.4%            6.4%        38% better
Noisy Audio          Very Short (250ms)   46.8%            26.4%       44% better
Noisy Audio          Short (500ms)        18.4%            14.4%       22% better
Reverberant Audio    Mid-length (1.5s)    15.2%            4.4%        71% better
Noise + Reverb       Short (500ms)        40.0%            22.8%       43% better
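The "Improvement" column is the relative reduction in error rate between the two models. A quick sanity check in Python, using the values from the table above:

```python
# Relative improvement between the previous and new model, computed
# from the error rates reported in the comparison table.

def relative_improvement(previous: float, new: float) -> int:
    """Percentage reduction in error rate, rounded to the nearest whole percent."""
    return round((previous - new) / previous * 100)

conditions = {
    "Clean, very short (250ms)": (18.8, 16.4),
    "Clean, short (500ms)":      (10.4, 6.4),
    "Noisy, very short (250ms)": (46.8, 26.4),
    "Noisy, short (500ms)":      (18.4, 14.4),
    "Reverberant, mid (1.5s)":   (15.2, 4.4),
    "Noise + reverb, short":     (40.0, 22.8),
}

for name, (prev, new) in conditions.items():
    print(f"{name}: {relative_improvement(prev, new)}% better")
```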

Listen to the difference

Example 1: Very Quiet Interactions


When speakers drop their voice or speak away from the microphone, traditional systems often miss or misattribute these quiet moments. Our enhanced model maintains speaker tracking even in quiet segments.

Old model:

A    I'm gonna sort of break the rules a bit. Go against the system.
B    Yeah.
A    I got myself a Valentine's present.
B    Sorry, what? I got myself a Valentine's present. Oh, I just actually heard it the second time a bit louder for everyone else.

New model:

A    So I thought, I'm gonna sort of break the rules a bit. Go against the system.
B    Yeah.
A    I got myself a Valentine's present.
B    Sorry, what?
A    I got myself a Valentine's present.
B    Oh, I just actually heard it the second time a bit louder for everyone else.

Example 2: Short Word Attribution


Brief acknowledgements like "okay," "yes," and "I'll do it" are now tracked accurately, enabling better conversation flow analysis even with background noise.

Old model:

A    I'm gonna sort of break the rules a bit. Go against the system.
B    Yeah.
A    I got myself a Valentine's present.
B    Sorry, what? I got myself a Valentine's present. Oh, I just actually heard it the second time a bit louder for everyone else.

New model:

A    So I thought, I'm gonna sort of break the rules a bit. Go against the system.
B    Yeah.
A    I got myself a Valentine's present.
B    Sorry, what?
A    I got myself a Valentine's present.
B    Oh, I just actually heard it the second time a bit louder for everyone else.

Example 3: Noisy In-Person Meeting


Conference room recordings with typing, paper shuffling, and ambient noise are now tracked more accurately. The enhanced model maintains consistent speaker identity throughout challenging acoustic conditions.

Old model:

C    Sorry, just explain again. You're saying that it's not.
B    We're not recording. It doesn't get recorded properly and people don't get the replies to their emails and people don't get replies to their phone calls.
A    So it's the contact shift.
D    No. Yes.
C    I think maybe I'll need to speak to

New model:

C    Sorry, just explain again. You're saying that it's not. We're not recording.
B    It doesn't get recorded properly and people don't get the replies to their emails and people don't get replies to their phone calls.
A    So it's the contact shift.
C    No. Yes. I think maybe I'll need to speak to.

Example 4: Multi-Speaker Robustness


Complex audio with multiple speakers and background noise that previously collapsed to a single speaker is now accurately separated, demonstrating improved robustness to high acoustic variability.

Old model:

A    David. I've pulled my hamstring. Can you help me out? I'll finish it up for you. Thanks. We got a great show for you tonight. When we get back,

New model:

A    David. I've pulled my hamstring. Can you help me out?
B    I'll finish it up for you.
A    Thanks.
B    We got a great show for you tonight. When we get back,

Under the hood: what makes it work

Our enhanced speaker diarization is built from the ground up to handle real-world audio challenges. We developed a completely new in-house framework for training speaker embeddings, with advanced data augmentation targeting the exact scenarios where traditional models fail.

1. Robust to acoustic variations

Traditional speaker diarization systems struggle when recording conditions change. Our new model handles:

  • Inconsistent volume levels between speakers without losing speaker identity
  • Distance variations when speakers move away from microphones during meetings
  • Natural meeting sounds like paper shuffling, typing, and ambient noise

2. Breakthrough short-segment performance

Previous models required substantial audio segments to maintain speaker identity. Our enhanced model excels at:

  • 250ms segments - Accurate speaker ID for utterances as short as one word
  • Natural conversation flow - Tracking brief acknowledgments and interruptions
  • Improved conversation intelligence - Better understanding of turn-taking patterns

3. Reverberant environment excellence

Conference rooms, large spaces, and poor acoustic environments no longer break speaker tracking:

  • 57% improvement in mid-length reverberant audio performance
  • Consistent accuracy across different room sizes and acoustic properties
  • Robust speaker embeddings trained specifically for challenging far-field and noisy conditions
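To illustrate the role embeddings play, here is a simplified sketch (not our production pipeline): each audio segment is mapped to a fixed-length vector, and a segment is attributed to the speaker whose embedding centroid is most similar by cosine similarity. Robustness means embeddings of the same speaker stay close even when the audio is noisy. The 3-dimensional vectors below are toy values for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def assign_speaker(segment_emb, speaker_centroids):
    """Label a segment with the most similar known speaker centroid."""
    return max(
        speaker_centroids,
        key=lambda spk: cosine_similarity(segment_emb, speaker_centroids[spk]),
    )

# Toy 3-dim embeddings; real speaker embeddings have hundreds of dimensions.
centroids = {"A": [0.9, 0.1, 0.0], "B": [0.1, 0.9, 0.2]}
noisy_segment = [0.8, 0.25, 0.05]  # same speaker as A, perturbed by noise
print(assign_speaker(noisy_segment, centroids))  # prints "A"
```

A noise-robust embedding model keeps the perturbed segment closer to its true speaker's centroid, which is exactly what prevents misattribution in the examples above.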
"With the latest improvements to speaker diarization, even meetings recorded in noisy rooms with many participants and a single microphone now produce transcripts as accurate as virtual calls. That level of precision is key to how Fellow uses AI to transform conversations into lasting knowledge, so teams can rely on what’s captured to make decisions, follow through, and stay aligned."

- Alexandra S., VP Engineering @
Fellow.app

Solving real problems with better accuracy

Enhanced speaker diarization directly addresses the frustrating scenarios that break existing systems, enabling applications that previously struggled with real-world audio:

Conference room & in-person collaboration

  • Reliable attribution in noisy environments where multiple conversations happen simultaneously
  • Consistent speaker tracking despite ambient noise, HVAC systems, and office activity
  • Multi-speaker recognition that doesn't collapse complex discussions to just 2-3 voices

Hybrid & remote meeting intelligence

  • Capture quiet side conversations that were previously missed or misattributed
  • Track brief acknowledgments like "yes," "okay," and "got it" for complete participation analysis
  • Maintain speaker identity even when participants type, shuffle papers, or move around the room

Field recording & interviews

  • Separate overlapping speakers in interview scenarios with background noise
  • Process real-world source audio from phones, field recordings, or portable devices
  • Handle rapid speaker transitions in talk shows, panels, and multi-guest formats

Transparent performance, automatic benefits

Unlike solutions that require complex integration or configuration changes, our enhanced speaker diarization delivers immediate value:

  • Automatic deployment - Improvements are live for all customers now
  • No code changes required - Existing integrations continue working seamlessly
  • Consistent API - Same endpoints, same response format, better results
  • Transparent improvements - Clear metrics showing exactly where performance improved

Built for real-world performance

Most speaker diarization systems perform well in ideal conditions but struggle when audio gets challenging. Our enhanced model is specifically designed to excel where others fail:

  • Robust in challenging environments - Maintains speaker diarization performance in scenarios where noise and reverberation previously compromised quality
  • Consistent across conditions - Delivers reliable performance whether audio is clean or impacted by the imperfections of real-world recordings
  • Production-ready reliability - 30% better accuracy in the noisy, far-field scenarios your applications actually encounter

Start using enhanced speaker diarization today

Speaker diarization that actually works in real-world conditions isn't just a nice-to-have—it's essential for reliable conversation intelligence. Don't settle for systems that fail when audio gets challenging.

Three ways to experience the improvements:

  1. Automatic benefits: If you're already using our speaker diarization, you're getting the improvements now
  2. Try the playground: Test enhanced speaker diarization with your specific audio files using our interactive testing environment
  3. Explore the documentation: Review our comprehensive Speaker Diarization Guide for implementation details and best practices
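If you are integrating directly against the API, diarization is enabled with the `speaker_labels` parameter on a transcription request, and each utterance in the completed transcript carries a `speaker` field. The sketch below shows the request body and a small formatter applied to a hand-written sample shaped like the `utterances` field (simplified to the `speaker` and `text` keys used here); the audio URL is a placeholder.

```python
# Transcription request with speaker diarization enabled.
# In practice, POST this JSON to https://api.assemblyai.com/v2/transcript
# with your API key in the "authorization" header, then poll the
# transcript until its status is "completed".
request_body = {
    "audio_url": "https://example.com/meeting.mp3",  # placeholder URL
    "speaker_labels": True,  # enables speaker diarization
}

def format_utterances(utterances):
    """Render diarized utterances as 'Speaker: text' lines."""
    return "\n".join(f"{u['speaker']}: {u['text']}" for u in utterances)

# Hand-written sample shaped like the "utterances" field of a
# completed transcript.
sample = [
    {"speaker": "A", "text": "I got myself a Valentine's present."},
    {"speaker": "B", "text": "Sorry, what?"},
]
print(format_utterances(sample))
```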

Try a More Powerful Speech-to-Text API

AssemblyAI delivers comprehensive Speech AI solutions including automatic transcription, speaker identification, and intelligent audio analysis. Create your free account and receive $50 in credits to explore our Speaker Diarization capabilities.

Sign up now

Performance metrics based on comprehensive evaluation across 205+ hours of audio including meeting recordings, call center conversations, and challenging acoustic environments. Improvements measured using Diarization Error Rate (DER) across multiple standard datasets including AMI, DIPCO, and VoxConverse.
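For reference, Diarization Error Rate is the fraction of total reference speech time that is handled incorrectly: the sum of missed speech, false alarm speech, and speaker confusion time, divided by total speech time. A minimal sketch (the durations below are illustrative, not measured values):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total speech time."""
    return (missed + false_alarm + confusion) / total_speech

# Illustrative durations, in seconds of audio.
der = diarization_error_rate(missed=12.0, false_alarm=5.0,
                             confusion=24.0, total_speech=200.0)
print(f"DER = {der:.1%}")  # prints "DER = 20.5%"
```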
