Speaker Diarization

Learn how to identify who said what in your audio files by adding speaker labels to your transcript.

The Speaker Diarization model lets you detect multiple speakers in an audio file and identify what each speaker said.

If you enable Speaker Diarization, the resulting transcript will return a list of utterances, where each utterance corresponds to an uninterrupted segment of speech from a single speaker.

Want to name your speakers?

Speaker Diarization assigns generic labels like “Speaker A” and “Speaker B” to distinguish between speakers. If you want to replace these labels with actual names or roles (e.g., “John Smith” or “Customer”), use Speaker Identification. Speaker Identification analyzes the conversation content to infer who is speaking and transforms your transcript from generic labels to meaningful identifiers.

Quickstart

To enable Speaker Diarization, set speaker_labels to True in the transcription config.

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# You can use a local filepath:
# audio_file = "./example.mp3"

# Or use a publicly-accessible URL:
audio_file = "https://assembly.ai/wildfires.mp3"

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    speaker_labels=True,
)

transcript = aai.Transcriber().transcribe(audio_file, config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

Example output

Speaker A: Smoke from hundreds of wildfires in Canada is triggering air quality alerts
throughout the US. Skylines from Maine to Maryland to Minnesota are gray and smoggy.
And in some places, the weights of the air qualitative index has exceeded 300.
Speaker B: Well, we have seen in the past when smoke from prior years and prior fires
has come up from prior years to the east coast. But this particular time around,
it's been notable in just how much of the of the country has been affected.
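
Once you have the utterances, you can post-process them however you like. As an illustrative sketch (not part of the AssemblyAI SDK), here is how you might compute each speaker's total talk time from the start/end timestamps, using plain dicts shaped like the API's utterance objects:

```python
from collections import defaultdict

def talk_time_by_speaker(utterances):
    """Sum each speaker's talk time (in seconds) from utterance timestamps.

    `utterances` is a list of dicts with `speaker`, `start`, and `end` keys,
    where `start` and `end` are in milliseconds, matching the response shape.
    """
    totals = defaultdict(float)
    for u in utterances:
        totals[u["speaker"]] += (u["end"] - u["start"]) / 1000.0
    return dict(totals)

# Example with hypothetical timestamps:
utterances = [
    {"speaker": "A", "start": 0, "end": 12500},
    {"speaker": "B", "start": 12500, "end": 19000},
    {"speaker": "A", "start": 19000, "end": 24000},
]
print(talk_time_by_speaker(utterances))  # {'A': 17.5, 'B': 6.5}
```

The same pattern works directly on a completed transcript by reading `utterance.speaker`, `utterance.start`, and `utterance.end` from the SDK objects.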

By default, the upper limit on the number of speakers for Speaker Diarization is 10. If you expect more than 10 speakers, you can use speaker_options to set a range of possible speakers.

Set number of speakers expected

You can set the number of speakers expected in the audio file by setting the speakers_expected parameter.

Only use this parameter if you are certain about the number of speakers in the audio file.

Building on the Quickstart above, add speakers_expected to your transcription config:

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    speaker_labels=True,
    speakers_expected=5,
)

Set a range of possible speakers

You can set a range of possible speakers in the audio file by setting the speaker_options parameter. By default, the model will return between 1 and 10 speakers.

Use this parameter when the known minimum or maximum number of speakers in the audio file falls outside the default range of 1 to 10.

Setting max_speakers_expected too high may reduce diarization accuracy, causing sentences from the same speaker to be split across multiple speaker labels.

Building on the Quickstart above, add speaker_options to your transcription config:

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    speaker_labels=True,
    speaker_options=aai.SpeakerOptions(
        min_speakers_expected=3,
        max_speakers_expected=5,
    ),
)

API reference

Request

Speakers Expected

curl https://api.assemblyai.com/v2/transcript \
  --header "Authorization: <YOUR_API_KEY>" \
  --header "Content-Type: application/json" \
  --data '{
    "audio_url": "YOUR_AUDIO_URL",
    "speech_models": ["universal-3-pro", "universal-2"],
    "language_detection": true,
    "speaker_labels": true,
    "speakers_expected": 3
  }'

Speaker Options

curl https://api.assemblyai.com/v2/transcript \
  --header "Authorization: <YOUR_API_KEY>" \
  --header "Content-Type: application/json" \
  --data '{
    "audio_url": "YOUR_AUDIO_URL",
    "speech_models": ["universal-3-pro", "universal-2"],
    "language_detection": true,
    "speaker_labels": true,
    "speaker_options": {
      "min_speakers_expected": 3,
      "max_speakers_expected": 5
    }
  }'
| Key | Type | Description |
| --- | --- | --- |
| speaker_labels | boolean | Enable Speaker Diarization. |
| speakers_expected | number | Set number of speakers. |
| speaker_options | object | Set range of possible speakers. |
| speaker_options.min_speakers_expected | number | The minimum number of speakers expected in the audio file. |
| speaker_options.max_speakers_expected | number | The maximum number of speakers expected in the audio file. |

Response

| Key | Type | Description |
| --- | --- | --- |
| utterances | array | A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file. |
| utterances[i].confidence | number | A score between 0 and 1 indicating the model's confidence in the accuracy of the transcribed text for this utterance. |
| utterances[i].end | number | The ending time, in milliseconds, of the utterance in the audio file. |
| utterances[i].speaker | string | The speaker of this utterance, where each speaker is assigned a sequential capital letter. For example, "A" for Speaker A, "B" for Speaker B, and so on. |
| utterances[i].start | number | The starting time, in milliseconds, of the utterance in the audio file. |
| utterances[i].text | string | The transcript for this utterance. |
| utterances[i].words | array | A sequential array for the words in the transcript, where the j-th element is an object containing information about the j-th word in the utterance. |
| utterances[i].words[j].text | string | The text of the j-th word in the i-th utterance. |
| utterances[i].words[j].start | number | The starting time for when the j-th word is spoken in the i-th utterance, in milliseconds. |
| utterances[i].words[j].end | number | The ending time for when the j-th word is spoken in the i-th utterance, in milliseconds. |
| utterances[i].words[j].confidence | number | The confidence score for the transcript of the j-th word in the i-th utterance. |
| utterances[i].words[j].speaker | string | The speaker who uttered the j-th word in the i-th utterance. |

The response also includes the request parameters used to generate the transcript.
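
The word-level fields above are useful for quality checks on a diarized transcript. As an illustrative sketch (not part of the SDK), here is how you might flag low-confidence words per speaker from response-shaped dicts:

```python
def low_confidence_words(utterances, threshold=0.5):
    """Collect (speaker, word) pairs whose word-level confidence falls below
    `threshold`. `utterances` follows the response shape documented above."""
    flagged = []
    for u in utterances:
        for w in u["words"]:
            if w["confidence"] < threshold:
                flagged.append((w["speaker"], w["text"]))
    return flagged

# Example with hypothetical response data:
utterances = [
    {
        "speaker": "A",
        "words": [
            {"text": "Smoke", "speaker": "A", "confidence": 0.99},
            {"text": "smoggy", "speaker": "A", "confidence": 0.42},
        ],
    }
]
print(low_confidence_words(utterances))  # [('A', 'smoggy')]
```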

Identify speakers by name

Speaker Diarization assigns generic labels like “Speaker A” and “Speaker B” to each speaker. If you want to replace these labels with actual names or roles, you can use Speaker Identification to transform your transcript.

Before Speaker Identification:

Speaker A: Good morning, and welcome to the show.
Speaker B: Thanks for having me.

After Speaker Identification:

Michel Martin: Good morning, and welcome to the show.
Peter DeCarlo: Thanks for having me.

The following example shows how to transcribe audio with Speaker Diarization and then apply Speaker Identification to replace the generic speaker labels with actual names.

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

audio_url = "https://assembly.ai/wildfires.mp3"

# Configure transcript with speaker diarization and speaker identification
data = {
    "audio_url": audio_url,
    "speech_models": ["universal-3-pro", "universal-2"],
    "language_detection": True,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Michel Martin", "Peter DeCarlo"]
            }
        }
    }
}

# Submit the transcription request
response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results
while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

# Print utterances with identified speaker names
for utterance in transcript["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

For more details on Speaker Identification, including how to identify speakers by role and how to apply it to existing transcripts, see the Speaker Identification guide.

Best practices for accurate diarization

Follow these tips to get the best results from Speaker Diarization:

  • Ensure sufficient speech per speaker. Each speaker should speak for at least 30 seconds uninterrupted. The model may struggle to create separate clusters for speakers who only contribute short phrases like “Yeah”, “Right”, or “Sounds good”.
  • Minimize cross-talk. Overlapping speech between speakers can reduce diarization accuracy. Where possible, ensure speakers take turns.
  • Reduce background noise. Background noise, echoes, or playback of recorded audio during a conversation can interfere with speaker separation.
  • Use speaker_options instead of speakers_expected when uncertain. Only use speakers_expected when you are confident about the exact number of speakers. If this number is incorrect, the model may produce random splits of single-speaker segments or merge multiple speakers into one. It’s generally recommended to use min_speakers_expected and set max_speakers_expected slightly higher (e.g., min_speakers_expected + 2) to allow flexibility.
  • Avoid setting max_speakers_expected too high. Setting the maximum too high may reduce accuracy, causing sentences from the same speaker to be split across multiple speaker labels.
  • Be aware of speaker similarity. If speakers sound similar, the model may have difficulty distinguishing between them.
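
The guidance above on choosing min_speakers_expected and max_speakers_expected can be captured in a small helper. This is an illustrative sketch, not part of the AssemblyAI SDK: given a rough minimum speaker count, it builds a speaker_options request payload with the maximum set slightly higher, per the recommendation:

```python
def build_speaker_options(min_speakers, slack=2):
    """Build a `speaker_options` request payload from a rough minimum
    speaker count, setting the maximum slightly higher (min + slack)
    to allow flexibility, as recommended above."""
    if min_speakers < 1:
        raise ValueError("min_speakers must be at least 1")
    return {
        "speaker_options": {
            "min_speakers_expected": min_speakers,
            "max_speakers_expected": min_speakers + slack,
        }
    }

print(build_speaker_options(3))
# {'speaker_options': {'min_speakers_expected': 3, 'max_speakers_expected': 5}}
```

The resulting dict can be merged into the JSON body of a transcript request, or the same values passed to aai.SpeakerOptions in the Python SDK.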

Frequently asked questions

How many speakers can Speaker Diarization detect?

By default, the upper limit on the number of speakers for Speaker Diarization is 10. If you expect more than 10 speakers, you can use speaker_options to set a range of possible speakers.

How accurate is Speaker Diarization?

The accuracy of the Speaker Diarization model depends on several factors, including the quality of the audio, the number of speakers, and the length of the audio file. See the best practices section above for tips on improving accuracy.