Speaker Diarization

With the AssemblyAI Speaker Diarization model, businesses and individuals can extract valuable insights from audio recordings that involve multiple speakers.

When this model is applied, the transcription not only contains the text but also includes speaker labels, enhancing the overall structure and organization of the output.

Quickstart

To enable Speaker Diarization, include the speaker_labels parameter in your request body and set it to true. For a step-by-step walkthrough, see the Identifying speakers in audio recordings guide.
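Here's a minimal sketch using the AssemblyAI Python SDK; the API key placeholder and the example audio URL are assumptions to replace with your own values.

```python
import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"  # assumption: substitute your own key

# Enable Speaker Diarization by setting speaker_labels to True
config = aai.TranscriptionConfig(speaker_labels=True)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(
    "https://example.com/audio.mp3",  # assumption: any publicly accessible audio file or URL
    config=config,
)

# Each utterance carries a speaker label ("A", "B", ...) alongside its text
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```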

You can explore the full JSON response here.

You can run this code snippet in Colab here, or you can view the full source code here.

Understanding the response

The JSON object above contains all information about the transcription. Depending on which Models are used to analyze the audio, the attributes of this object will vary. For example, in the quickstart above we did not enable Summarization, which is reflected by the summarization: false key-value pair in the JSON above. Had we enabled Summarization, the summary, summary_type, and summary_model keys would contain the file summary (and additional details) rather than the current null values.

To access the Speaker Diarization information, we use the speaker_labels and utterances keys:
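For example, here's a short sketch that reads those keys from the response, assuming it has been saved locally as response.json (the filename is an assumption):

```python
import json

# Assumption: the full JSON response shown above was saved to response.json
with open("response.json") as f:
    results = json.load(f)

print(results["speaker_labels"])  # True when Speaker Diarization was enabled

# Each utterance has a speaker label, a transcript, and millisecond timestamps
for utterance in results["utterances"]:
    start_s = utterance["start"] / 1000
    end_s = utterance["end"] / 1000
    print(f'[{start_s:.1f}s-{end_s:.1f}s] Speaker {utterance["speaker"]}: {utterance["text"]}')
```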

The reference table below lists all relevant attributes along with their descriptions, where we've called the JSON response object results. Object attributes are accessed via dot notation, and arbitrary array elements are denoted with [i]. For example, results.words[i].text refers to the text attribute of the i-th element of the words array in the JSON results object.

| Attribute | Type | Description |
| --- | --- | --- |
| results.speaker_labels | boolean | Whether Speaker Diarization was enabled in the transcription request |
| results.utterances | array | A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file |
| results.utterances[i].confidence | number | The confidence score for the transcript of this utterance |
| results.utterances[i].end | number | The ending time, in milliseconds, of the utterance in the audio file |
| results.utterances[i].speaker | string | The speaker of this utterance, where each speaker is assigned a sequential capital letter, e.g. "A" for Speaker A, "B" for Speaker B, etc. |
| results.utterances[i].start | number | The starting time, in milliseconds, of the utterance in the audio file |
| results.utterances[i].text | string | The transcript for this utterance |
| results.utterances[i].words | array | A sequential array for the words in the transcript, where the j-th element is an object containing information about the j-th word in the utterance |
| results.utterances[i].words[j].text | string | The text of the j-th word in the i-th utterance |
| results.utterances[i].words[j].start | number | The starting time, in milliseconds, for when the j-th word is spoken in the i-th utterance |
| results.utterances[i].words[j].end | number | The ending time, in milliseconds, for when the j-th word is spoken in the i-th utterance |
| results.utterances[i].words[j].confidence | number | The confidence score for the transcript of the j-th word in the i-th utterance |
| results.utterances[i].words[j].speaker | string | The speaker who uttered the j-th word in the i-th utterance |

We can also access speaker information via the top-level results.words array, as mentioned in the Speech Recognition page.
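As a quick sketch, reusing the results dictionary loaded above, the word-level speaker labels can be read like this:

```python
# Word-level speaker labels from the top-level words array
for word in results["words"]:
    print(f'{word["text"]} (speaker {word["speaker"]}, {word["start"]}-{word["end"]} ms)')
```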

Specifying the number of speakers

The AssemblyAI Speaker Diarization model provides an optional parameter, speakers_expected, that can be used to specify the expected number of speakers in an audio file. This parameter is an integer that can be set by the user to help the model more accurately cluster the audio data.

While it can help the model choose the correct number of speakers, there may be situations where the model returns a different number based on the spoken audio data. Nonetheless, specifying the expected number can be helpful in cases where it's known or can be reasonably estimated, allowing for more accurate and efficient speaker diarization.
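Here's a minimal sketch of passing speakers_expected alongside speaker_labels, again using the AssemblyAI Python SDK; the value of 3 and the audio URL are assumptions:

```python
import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"  # assumption: substitute your own key

config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=3,  # assumption: set this to the number of speakers you expect
)

transcript = aai.Transcriber().transcribe(
    "https://example.com/panel-discussion.mp3",  # assumption: example URL
    config=config,
)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```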

Troubleshooting

Why is the speaker diarization not performing as expected?

The speaker diarization may be performing poorly if a speaker only speaks once or infrequently throughout the audio file. Additionally, if the speaker speaks in short or single-word utterances, the model may struggle to create separate clusters for each speaker. Lastly, if the speakers sound similar, there may be difficulties in accurately identifying and separating them. Background noise, cross-talk, or an echo may also cause issues.

How can I improve the performance of the Speaker Diarization model?

To improve the performance of the Speaker Diarization model, it's recommended that each speaker speak uninterrupted for at least 30 seconds. Avoiding scenarios where a person only says a few short phrases like “Yeah”, “Right”, or “Sounds good” also helps, as does minimizing cross-talk where possible.

How many speakers can the model handle?

The upper limit on the number of speakers for Speaker Diarization is currently 10.

Why am I getting an error message when I enable Dual-channel transcription and Speaker labels?

Speaker labels aren't supported when Dual-channel transcription is turned on. Enabling both results in an error message: "error": "Both dual_channel and speaker_labels can't be set to True."

How accurate is the Speaker Diarization model?

The accuracy of the Speaker Diarization model depends on several factors, including the quality of the audio, the number of speakers, and the length of the audio file. Ensuring that each speaker speaks for at least 30 seconds uninterrupted and avoiding scenarios where a person only speaks a few short phrases can improve accuracy. However, it's important to note that the model isn't perfect and may make mistakes, especially in more challenging scenarios.