Speaker Diarization
With the AssemblyAI Speaker Diarization model, businesses and individuals can extract valuable insights from audio recordings that involve multiple speakers.
When this model is applied, the transcription not only contains the text but also includes speaker labels, enhancing the overall structure and organization of the output.
Quickstart
As shown in the Identifying speakers in audio recordings guide, include the speaker_labels parameter in your request body and set it to true.
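The snippet below is a minimal sketch of such a request using Python and the requests library; the API key and audio URL are placeholder values you would replace with your own.

```python
import time
import requests

API_KEY = "<YOUR_API_KEY>"                     # placeholder: your AssemblyAI API key
AUDIO_URL = "https://example.com/meeting.mp3"  # placeholder: a publicly accessible audio file

headers = {"authorization": API_KEY}

# Submit the transcription request with Speaker Diarization enabled
response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={"audio_url": AUDIO_URL, "speaker_labels": True},
)
transcript_id = response.json()["id"]

# Poll until the transcription completes (or fails)
while True:
    result = requests.get(
        f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
        headers=headers,
    ).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)
```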
You can explore the full JSON response here.
You can run this code snippet in Colab here, or you can view the full source code here.
Understanding the response
The JSON object above contains all information about the transcription. Depending on which Models are used to analyze the audio, the attributes of this object will vary. For example, in the quickstart above we did not enable Summarization, which is reflected by the summarization: false key-value pair in the JSON above. Had we activated Summarization, then the summary, summary_type, and summary_model key values would contain the file summary (and additional details) rather than the current null values.
To access the Speaker Diarization information, we use the speaker_labels and utterances keys:
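For example, assuming result holds the completed transcript JSON from the quickstart sketch above, the utterances can be printed turn by turn:

```python
# `result` is assumed to be the completed transcript JSON from the quickstart sketch
print(result["speaker_labels"])  # True when Speaker Diarization was enabled

for utterance in result["utterances"]:
    # Each utterance carries a speaker label, timing information, and its own transcript
    print(f"Speaker {utterance['speaker']} "
          f"({utterance['start']}-{utterance['end']} ms): {utterance['text']}")
```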
The reference table below lists all relevant attributes along with their descriptions, where we've called the JSON response object results. Object attributes are accessed via dot notation, and arbitrary array elements are denoted with [i]. For example, results.words[i].text refers to the text attribute of the i-th element of the words array in the JSON results object.
| Attribute | Type | Description |
| --- | --- | --- |
| results.speaker_labels | boolean | Whether Speaker Diarization was enabled in the transcription request |
| results.utterances | array | A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file |
| results.utterances[i].confidence | number | The confidence score for the transcript of this utterance |
| results.utterances[i].end | number | The ending time, in milliseconds, of the utterance in the audio file |
| results.utterances[i].speaker | string | The speaker of this utterance, where each speaker is assigned a sequential capital letter, e.g. "A" for Speaker A, "B" for Speaker B, etc. |
| results.utterances[i].start | number | The starting time, in milliseconds, of the utterance in the audio file |
| results.utterances[i].text | string | The transcript for this utterance |
| results.utterances[i].words | array | A sequential array for the words in the transcript, where the j-th element is an object containing information about the j-th word in the utterance |
| results.utterances[i].words[j].text | string | The text of the j-th word in the i-th utterance |
| results.utterances[i].words[j].start | number | The starting time, in milliseconds, for when the j-th word is spoken in the i-th utterance |
| results.utterances[i].words[j].end | number | The ending time, in milliseconds, for when the j-th word is spoken in the i-th utterance |
| results.utterances[i].words[j].confidence | number | The confidence score for the transcript of the j-th word in the i-th utterance |
| results.utterances[i].words[j].speaker | string | The speaker who uttered the j-th word in the i-th utterance |
We can also access speaker information via the results.words array, as mentioned in the Speech Recognition page.
Specifying the number of speakers
The AssemblyAI Speaker Diarization model provides an optional parameter, speakers_expected, that can be used to specify the expected number of speakers in an audio file. This parameter is an integer that can be set by the user to help the model more accurately cluster the audio data.
While it can help the model choose the correct number of speakers, there may be situations where the model returns a different number based on the spoken audio data. Nonetheless, specifying the expected number can be helpful in cases where it's known or can be reasonably estimated, allowing for more accurate and efficient speaker diarization.
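As a sketch, reusing the headers and AUDIO_URL placeholders from the quickstart snippet, the parameter can be added to the request body alongside speaker_labels:

```python
import requests

# Reuses the `headers` and `AUDIO_URL` placeholders from the quickstart sketch
response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={
        "audio_url": AUDIO_URL,
        "speaker_labels": True,
        "speakers_expected": 2,  # hint: we expect two speakers in this recording
    },
)
```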
Troubleshooting
Speaker Diarization may perform poorly if a speaker speaks only once or infrequently throughout the audio file. Additionally, if speakers talk in short or single-word utterances, the model may struggle to create separate clusters for each speaker. Lastly, if the speakers sound similar, there may be difficulties in accurately identifying and separating them. Background noise, cross-talk, or echo may also cause issues.
To improve the performance of the Speaker Diarization model, it's recommended to ensure that each speaker speaks for at least 30 seconds uninterrupted. Avoiding scenarios where a person only speaks a few short phrases like “Yeah”, “Right”, or “Sounds good” can also help. If possible, avoiding cross-talking can also improve performance.
The upper limit on the number of speakers for Speaker Diarization is currently 10.
Speaker labels aren't supported when Dual-channel transcription is turned on. Enabling both results in an error message: "error": "Both dual_channel and speaker_labels can't be set to True."
The accuracy of the Speaker Diarization model depends on several factors, including the quality of the audio, the number of speakers, and the length of the audio file. Ensuring that each speaker speaks for at least 30 seconds uninterrupted and avoiding scenarios where a person only speaks a few short phrases can improve accuracy. However, it's important to note that the model isn't perfect and may make mistakes, especially in more challenging scenarios.