Learn how to identify who said what in your audio files by adding speaker labels to your transcript.
The Speaker Diarization model detects multiple speakers in an audio file and attributes each segment of speech to the speaker who said it.
If you enable Speaker Diarization, the resulting transcript includes a list of utterances, where each utterance corresponds to an uninterrupted segment of speech from a single speaker.
Speaker Diarization assigns generic labels like “Speaker A” and “Speaker B” to distinguish between speakers. If you want to replace these labels with actual names or roles (e.g., “John Smith” or “Customer”), use Speaker Identification. Speaker Identification analyzes the conversation content to infer who is speaking and transforms your transcript from generic labels to meaningful identifiers.
To enable Speaker Diarization, set speaker_labels to True in the POST request body:
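A minimal sketch of such a request body. Only `speaker_labels` comes from the text above; the `audio_url` field name, the example URL, and the helper name are illustrative assumptions, and sending the request with your API key is not shown.

```python
import json


def build_diarization_request(audio_url: str) -> dict:
    """Return a JSON-serializable request body with Speaker Diarization enabled."""
    return {
        "audio_url": audio_url,  # assumed field name for the audio location
        "speaker_labels": True,  # enables Speaker Diarization
    }


body = build_diarization_request("https://example.com/meeting.mp3")
# Python's True is serialized as JSON true in the POST body.
print(json.dumps(body, indent=2))
```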
The default upper limit on the number of speakers depends on the audio duration: no limit for 0–2 minutes, 10 speakers for 2–10 minutes, and 30 speakers for 10+ minutes.
If you need a different limit, you can use speaker_options to set a range of possible speakers.
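The duration-based defaults described in this guide can be written as a small lookup. This is only a sketch: the function name is hypothetical, and how audio of exactly 2 or exactly 10 minutes is bucketed is an assumption.

```python
from typing import Optional


def default_max_speakers(duration_seconds: float) -> Optional[int]:
    """Default upper limit on speakers for an audio duration; None means no limit."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = duration_seconds / 60
    # Assumption: exact boundary values fall into the longer-duration bucket.
    if minutes < 2:
        return None  # 0-2 minutes: no limit
    if minutes < 10:
        return 10    # 2-10 minutes
    return 30        # 10+ minutes


print(default_max_speakers(90))    # short clip
print(default_max_speakers(300))   # 5-minute call
print(default_max_speakers(3600))  # 1-hour meeting
```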
You can set the number of speakers expected in the audio file by setting the speakers_expected parameter.
Only use this parameter if you are certain about the number of speakers in the audio file.
Building on the Quickstart above, add speakers_expected to your transcription config:
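As a rough sketch, with a plain dictionary standing in for the transcription config (the helper name is hypothetical, and how the config is passed to the API or an SDK is not shown):

```python
def with_speakers_expected(config: dict, speakers_expected: int) -> dict:
    """Return a copy of the config with an exact expected speaker count."""
    if speakers_expected < 1:
        raise ValueError("speakers_expected must be at least 1")
    return {
        **config,
        "speaker_labels": True,  # diarization must be enabled
        "speakers_expected": speakers_expected,
    }


# Only set this when the speaker count is known for certain, e.g. a two-person call.
config = with_speakers_expected({"speaker_labels": True}, 2)
print(config)
```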
You can set a range of possible speakers in the audio file by setting the speaker_options parameter. By default, the maximum number of speakers depends on the audio duration (no limit for 0–2 minutes, 10 for 2–10 minutes, and 30 for 10+ minutes).
This parameter is suitable for use cases where there is a known minimum/maximum number of speakers in the audio file that is outside the bounds of the default limits.
Setting max_speakers_expected too high may reduce diarization accuracy, causing sentences from the same speaker to be split across multiple speaker labels.
When using multichannel with speaker_labels, the speaker_options parameters are applied per channel, not globally across the entire file. For example, setting min_speakers_expected: 5 and max_speakers_expected: 7 on a 5-channel file means the model will find 5–7 speakers on each channel, resulting in 25–35 total speakers. Adjust your speaker options accordingly when using multichannel transcription.
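The per-channel arithmetic in this example can be checked with a short helper. The function name is hypothetical; it only multiplies the per-channel range by the channel count.

```python
from typing import Tuple


def total_speaker_range(
    channels: int, min_speakers_expected: int, max_speakers_expected: int
) -> Tuple[int, int]:
    """Total (min, max) speakers when speaker_options apply to each channel separately."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    if min_speakers_expected > max_speakers_expected:
        raise ValueError("min_speakers_expected cannot exceed max_speakers_expected")
    return channels * min_speakers_expected, channels * max_speakers_expected


# 5 channels, each expected to hold 5-7 speakers.
print(total_speaker_range(5, 5, 7))
```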
Building on the Quickstart above, add speaker_options to your transcription config:
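A minimal sketch, again using a plain dictionary in place of the transcription config. The nesting of `min_speakers_expected` and `max_speakers_expected` under `speaker_options` follows the parameter names used in this guide; the helper name and validation are assumptions.

```python
def build_speaker_options(min_speakers_expected: int, max_speakers_expected: int) -> dict:
    """Build a speaker_options value describing a range of possible speakers."""
    if min_speakers_expected < 1:
        raise ValueError("min_speakers_expected must be at least 1")
    if max_speakers_expected < min_speakers_expected:
        raise ValueError("max_speakers_expected cannot be below min_speakers_expected")
    return {
        "min_speakers_expected": min_speakers_expected,
        "max_speakers_expected": max_speakers_expected,
    }


config = {
    "speaker_labels": True,  # diarization must be enabled
    "speaker_options": build_speaker_options(3, 5),
}
print(config)
```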
The response also includes the request parameters used to generate the transcript.
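To show how the utterance list might be consumed, here is a sketch that prints one line per utterance. It assumes each utterance is an object with `speaker` and `text` fields; the sample response is made up for illustration and is not real API output.

```python
from typing import List


def format_utterances(transcript: dict) -> List[str]:
    """Render each utterance as 'Speaker <label>: <text>'."""
    lines = []
    # "utterances" may be missing or null when diarization was not enabled.
    for utterance in transcript.get("utterances") or []:
        lines.append(f"Speaker {utterance['speaker']}: {utterance['text']}")
    return lines


# Hypothetical response fragment, including an echoed request parameter.
sample = {
    "speaker_labels": True,
    "utterances": [
        {"speaker": "A", "text": "Hi, thanks for calling."},
        {"speaker": "B", "text": "Hello, I have a question about my order."},
    ],
}

for line in format_utterances(sample):
    print(line)
```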
Speaker Diarization assigns generic labels like “Speaker A” and “Speaker B” to each speaker. If you want to replace these labels with actual names or roles, you can use Speaker Identification to transform your transcript.
Before Speaker Identification:
After Speaker Identification:
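The before/after difference can be illustrated with a small local sketch. This is not the Speaker Identification API: the mapping is supplied by hand here, whereas Speaker Identification infers it from the conversation. The utterances, the mapping, and the function name are all hypothetical.

```python
from typing import Dict, List


def relabel_speakers(utterances: List[dict], names: Dict[str, str]) -> List[dict]:
    """Return copies of the utterances with generic labels replaced where a name is known."""
    return [
        {**utterance, "speaker": names.get(utterance["speaker"], utterance["speaker"])}
        for utterance in utterances
    ]


before = [
    {"speaker": "A", "text": "Thanks for calling, how can I help?"},
    {"speaker": "B", "text": "I'd like to update my address."},
]
after = relabel_speakers(before, {"A": "John Smith", "B": "Customer"})

for utterance in before:
    print(f"Speaker {utterance['speaker']}: {utterance['text']}")
for utterance in after:
    print(f"{utterance['speaker']}: {utterance['text']}")
```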
The following example shows how to transcribe audio with Speaker Diarization and then apply Speaker Identification to replace the generic speaker labels with actual names.
For more details on Speaker Identification, including how to identify speakers by role and how to apply it to existing transcripts, see the Speaker Identification guide.
Follow these tips to get the best results from Speaker Diarization:
Use speaker_options instead of speakers_expected when uncertain. Only use speakers_expected when you are confident about the exact number of speakers. If this number is incorrect, the model may split single-speaker segments arbitrarily or merge multiple speakers into one. It’s generally recommended to set min_speakers_expected and set max_speakers_expected slightly higher (e.g., min_speakers_expected + 2) to allow flexibility.
Avoid setting max_speakers_expected too high. Setting the maximum too high may reduce accuracy, causing sentences from the same speaker to be split across multiple speaker labels.
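The recommendation to set max_speakers_expected slightly above the minimum can be sketched as a helper. The function name and the `headroom` parameter are hypothetical; the default of 2 mirrors the "min_speakers_expected + 2" suggestion.

```python
def speaker_options_with_headroom(min_speakers_expected: int, headroom: int = 2) -> dict:
    """Build speaker_options with the maximum set slightly above the known minimum."""
    if min_speakers_expected < 1:
        raise ValueError("min_speakers_expected must be at least 1")
    if headroom < 0:
        raise ValueError("headroom must be non-negative")
    return {
        "min_speakers_expected": min_speakers_expected,
        # Keep the headroom small: a maximum that is too high may reduce accuracy.
        "max_speakers_expected": min_speakers_expected + headroom,
    }


# At least three speakers are known to be present; allow for up to five.
options = speaker_options_with_headroom(3)
print(options)
```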
The accuracy of the Speaker Diarization model depends on several factors, including the quality of the audio, the number of speakers, and the length of the audio file. See the best practices section above for tips on improving accuracy.