
Speaker Diarization

The AssemblyAI Speaker Diarization model detects the different speakers in an audio recording and labels each part of the transcript with the speaker who said it.


In the Identifying speakers in audio recordings guide, the client submits an audio file with multiple speakers and configures the API request to enable speaker diarization. The result is a transcript that contains not only the text but also a speaker label for each word and each speaker turn.


Specifying the number of speakers

The AssemblyAI Speaker Diarization model provides an optional parameter, speakers_expected, that specifies the expected number of speakers in an audio file. The value is an integer, and it helps the model cluster the audio data more accurately.

The speakers_expected parameter is a hint, not a hard constraint: it helps the model choose the correct number of speakers, but the model may still return a different number based on the spoken audio data. Setting it is still worthwhile whenever the number of speakers is known or can be reasonably estimated, because it tends to make diarization more accurate.
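As a minimal sketch of how such a request body could be assembled, the helper below builds a JSON payload with diarization enabled and an optional speaker count. The function name build_transcript_request and the audio_url field are assumptions made for this illustration; speaker_labels and speakers_expected are the parameter names used in this guide. The range check against 10 speakers is an optional client-side guard based on the limit mentioned in the FAQ below, not something the guide says the API requires.

```python
from typing import Any, Dict, Optional

# Upper limit on speakers stated in this guide's FAQ; used here only as a
# client-side sanity check (an assumption, not documented API behavior).
MAX_SPEAKERS = 10


def build_transcript_request(
    audio_url: str, speakers_expected: Optional[int] = None
) -> Dict[str, Any]:
    """Build a request body with speaker diarization enabled.

    `build_transcript_request` and the `audio_url` field are illustrative
    assumptions; `speaker_labels` and `speakers_expected` come from the guide.
    """
    payload: Dict[str, Any] = {"audio_url": audio_url, "speaker_labels": True}
    if speakers_expected is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(speakers_expected, bool) or not isinstance(speakers_expected, int):
            raise TypeError("speakers_expected must be an integer")
        if not 1 <= speakers_expected <= MAX_SPEAKERS:
            raise ValueError(f"speakers_expected must be between 1 and {MAX_SPEAKERS}")
        payload["speakers_expected"] = speakers_expected
    return payload


# Example: a two-person interview at a placeholder URL.
print(build_transcript_request("https://example.com/interview.mp3", speakers_expected=2))
```

Leaving speakers_expected out of the payload entirely, rather than sending a guess, keeps the model free to infer the speaker count when it is not known.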

Understanding the response

When using the AssemblyAI Speaker Diarization model, the response you receive is a transcript object that describes the transcription process and its result, including a speaker label for each word.

The utterances key contains a list of turn-by-turn utterances found in the audio, each marked with a speaker label. For each turn, the response includes the confidence level, start and end times, the text for the entire speaker turn, and the individual words from the turn.

To extract information about each individual word in the transcript with speaker labels, we can use the words key within each turn in the utterances key. This key contains a list of individual words found in the audio, along with their respective start and end times, confidence levels, and speaker labels.


{
  "confidence": 0.8704614285714285,
  "end": 21970,
  "speaker": "B",
  "start": 8630,
  "text": "Ted Talks are recorded live at Ted Conference. This episode features...",
  "words": [
    {
      "text": "Ted",
      "start": 8630,
      "end": 8798,
      "confidence": 0.89791,
      "speaker": "B"
    },
    {
      "text": "Talks",
      "start": 8834,
      "end": 9146,
      "confidence": 0.99688,
      "speaker": "B"
    },
    ...
  ]
}
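The structure above can be walked with plain dictionary access. The sketch below uses the sample turn from the response (minus the elided parts) and shows one way to print each speaker turn and collect word-level speaker labels. The helper names format_turns and words_with_speakers are hypothetical, and the code assumes that start and end are expressed in milliseconds.

```python
from typing import Any, Dict, List, Tuple

# Sample turn copied from the response above; elided words are left out.
sample_utterances: List[Dict[str, Any]] = [
    {
        "confidence": 0.8704614285714285,
        "end": 21970,
        "speaker": "B",
        "start": 8630,
        "text": "Ted Talks are recorded live at Ted Conference. This episode features...",
        "words": [
            {"text": "Ted", "start": 8630, "end": 8798, "confidence": 0.89791, "speaker": "B"},
            {"text": "Talks", "start": 8834, "end": 9146, "confidence": 0.99688, "speaker": "B"},
        ],
    }
]


def format_turns(utterances: List[Dict[str, Any]]) -> List[str]:
    """Render one line per speaker turn (timestamps assumed to be in ms)."""
    lines = []
    for turn in utterances:
        start_s = turn["start"] / 1000
        end_s = turn["end"] / 1000
        lines.append(f"Speaker {turn['speaker']} [{start_s:.2f}s-{end_s:.2f}s]: {turn['text']}")
    return lines


def words_with_speakers(utterances: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Flatten every turn's words into (speaker, word) pairs."""
    return [(word["speaker"], word["text"]) for turn in utterances for word in turn["words"]]


for line in format_turns(sample_utterances):
    print(line)
print(words_with_speakers(sample_utterances))
```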


Why is the speaker diarization not performing as expected?

Speaker diarization may perform poorly when a speaker speaks only once or infrequently in the audio file. If a speaker talks only in short or single-word utterances, the model may also struggle to create a separate cluster for that speaker. Speakers who sound similar can be difficult to identify and separate accurately. Background noise, cross-talk, or echo may also cause issues.

How can I improve the performance of the Speaker Diarization model?

To improve the performance of the Speaker Diarization model, make sure that each speaker speaks for at least 30 seconds uninterrupted. Avoiding scenarios where a person only speaks a few short phrases like “Yeah”, “Right”, or “Sounds good” also helps, as does avoiding cross-talk where possible.
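One way to check a finished transcript against the 30-second guideline is to measure each speaker's longest turn in the utterances list. The sketch below is a diagnostic under stated assumptions: the helper names are hypothetical, timestamps are assumed to be in milliseconds, the example turns are made up for illustration, and the result is only as reliable as the speaker labels it is computed from.

```python
from typing import Any, Dict, List

# Guideline from this guide: at least 30 seconds of uninterrupted speech.
MIN_UNINTERRUPTED_SECONDS = 30.0


def longest_turn_per_speaker(utterances: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return each speaker's longest single turn in seconds (ms input assumed)."""
    longest: Dict[str, float] = {}
    for turn in utterances:
        duration = (turn["end"] - turn["start"]) / 1000
        speaker = turn["speaker"]
        longest[speaker] = max(longest.get(speaker, 0.0), duration)
    return longest


def speakers_below_threshold(
    utterances: List[Dict[str, Any]], threshold: float = MIN_UNINTERRUPTED_SECONDS
) -> List[str]:
    """List speakers whose longest turn is shorter than the threshold."""
    longest = longest_turn_per_speaker(utterances)
    return sorted(speaker for speaker, seconds in longest.items() if seconds < threshold)


# Made-up turns for illustration only; real data comes from the utterances key.
example_turns = [
    {"speaker": "A", "start": 0, "end": 42000},
    {"speaker": "B", "start": 42500, "end": 43200},
    {"speaker": "A", "start": 43500, "end": 50000},
    {"speaker": "B", "start": 50500, "end": 62000},
]
print(speakers_below_threshold(example_turns))
```

Speakers flagged this way are the ones whose labels deserve a closer manual check.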

How many speakers can the model handle?

The upper limit on the number of speakers for Speaker Diarization is currently 10.

Why am I getting an error message when I enable Dual-Channel Transcription and Speaker Labels?

Speaker Labels are not supported when Dual-Channel Transcription is turned on. Enabling both will result in an error message: "error": "Both dual_channel and speaker_labels cannot be set to True."
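This conflict can be caught before the request is sent. The sketch below is a hypothetical client-side guard (the function name check_diarization_options is an assumption) that reuses the dual_channel and speaker_labels names and the message text from the error above.

```python
from typing import Any, Dict


def check_diarization_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Reject option sets that enable both dual_channel and speaker_labels.

    Hypothetical helper: it mirrors the API error described above so the
    problem surfaces locally instead of after a round trip.
    """
    if options.get("dual_channel") and options.get("speaker_labels"):
        raise ValueError("Both dual_channel and speaker_labels cannot be set to True.")
    return options


# Passes: only speaker labels are enabled.
print(check_diarization_options({"speaker_labels": True}))
```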

How accurate is the Speaker Diarization model?

The accuracy of the Speaker Diarization model depends on several factors, including the quality of the audio, the number of speakers, and the length of the audio file. Ensuring that each speaker speaks for at least 30 seconds uninterrupted and avoiding scenarios where a person only speaks a few short phrases can improve accuracy. The model is not perfect, however, and may make mistakes, especially in more challenging scenarios.