Speaker Diarization
Supported languages
en
en_au
en_uk
en_us
es
fr
de
it
pt
nl
hi
ja
zh
fi
ko
pl
ru
tr
uk
vi
af
sq
am
ar
hy
as
az
ba
eu
be
bn
bs
br
bg
ca
hr
cs
da
et
fo
gl
ka
el
gu
ht
ha
haw
he
hu
is
id
jw
kn
kk
lo
la
lv
ln
lt
lb
mk
mg
ms
ml
mt
mi
mr
mn
ne
no
nn
oc
pa
ps
fa
ro
sa
sr
sn
sd
si
sk
sl
so
su
sw
sv
tl
tg
ta
tt
te
tk
ur
uz
cy
yi
yo
Supported models
slam-1
universal
Supported regions
US & EU
The Speaker Diarization model lets you detect multiple speakers in an audio file and determine what each speaker said.
If you enable Speaker Diarization, the resulting transcript will return a list of utterances, where each utterance corresponds to an uninterrupted segment of speech from a single speaker.
Speaker Diarization and Multichannel
Speaker Diarization doesn’t support multichannel transcription. Enabling both Speaker Diarization and multichannel will result in an error.
Quickstart
To enable Speaker Diarization, set speaker_labels to True in the transcription config.
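Here is a minimal sketch using the AssemblyAI Python SDK; the API key and audio file path are placeholders, and the exact SDK surface may vary between versions:
```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# Enable Speaker Diarization in the transcription config.
config = aai.TranscriptionConfig(speaker_labels=True)

transcript = aai.Transcriber().transcribe("audio.mp3", config=config)

# Each utterance is an uninterrupted segment of speech from one speaker.
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```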
Set number of speakers expected
You can set the number of speakers expected in the audio file by setting the speakers_expected parameter.
Only use this parameter if you are certain about the number of speakers in the audio file.
Example
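A sketch of the same request with a fixed speaker count, assuming the Python SDK exposes the parameter as speakers_expected on TranscriptionConfig:
```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# speakers_expected hints the model at the exact speaker count.
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=3,  # e.g. a three-person panel
)

transcript = aai.Transcriber().transcribe("audio.mp3", config=config)
```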
Set a range of possible speakers
You can set a range of possible speakers in the audio file by setting the speaker_options parameter. By default, the model will return between 1 and 10 speakers.
Use this parameter when the known minimum or maximum number of speakers in the audio file falls outside the default range of 1 to 10.
Setting max_speakers_expected higher than necessary may hurt model accuracy.
Example
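A sketch of setting a speaker range through the REST API; the audio URL is a placeholder, and the min_speakers_expected field name is an assumption alongside the documented max_speakers_expected:
```python
import requests

headers = {"authorization": "YOUR_API_KEY"}  # placeholder

# speaker_options widens the default search range of 1 to 10 speakers.
response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={
        "audio_url": "https://example.com/town-hall.mp3",  # placeholder
        "speaker_labels": True,
        "speaker_options": {
            "min_speakers_expected": 3,   # assumed field name
            "max_speakers_expected": 15,  # keep as low as your use case allows
        },
    },
)
print(response.json()["id"])
```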
API reference
In addition to the transcript text and utterances, the response includes the request parameters used to generate the transcript.
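As an illustration, a sketch of fetching a finished transcript over the REST API and reading the diarized utterances; the API key and transcript ID are placeholders:
```python
import time

import requests

headers = {"authorization": "YOUR_API_KEY"}  # placeholder
transcript_id = "YOUR_TRANSCRIPT_ID"         # placeholder

# Poll until transcription finishes, then read the diarized utterances.
while True:
    data = requests.get(
        f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
        headers=headers,
    ).json()
    if data["status"] in ("completed", "error"):
        break
    time.sleep(3)

for utterance in data.get("utterances") or []:
    print(f"Speaker {utterance['speaker']}: {utterance['text']}")
```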
Frequently asked questions & troubleshooting
How can I improve the performance of the Speaker Diarization model?
To improve the performance of the Speaker Diarization model, ensure that each speaker speaks uninterrupted for at least 30 seconds. Avoiding scenarios where a person only speaks a few short phrases like “Yeah”, “Right”, or “Sounds good” also helps, as does minimizing cross-talk where possible.
How many speakers can the model handle?
By default, the upper limit on the number of speakers for Speaker Diarization is 10. If you expect more than 10 speakers, you can use speaker_options to set a range of possible speakers. Note that setting max_speakers_expected higher than necessary may hurt model accuracy.
How accurate is the Speaker Diarization model?
The accuracy of the Speaker Diarization model depends on several factors, including the quality of the audio, the number of speakers, and the length of the audio file. Ensuring that each speaker speaks for at least 30 seconds uninterrupted and avoiding scenarios where a person only speaks a few short phrases can improve accuracy. However, it’s important to note that the model isn’t perfect and may make mistakes, especially in more challenging scenarios.
Why is the speaker diarization not performing as expected?
Speaker Diarization may perform poorly if a speaker only speaks once or infrequently throughout the audio file. Additionally, if a speaker speaks in short or single-word utterances, the model may struggle to create separate clusters for each speaker. Lastly, speakers who sound similar can be difficult to identify and separate accurately. Background noise, cross-talk, or an echo may also cause issues.