Speaker Diarization - AssemblyAI

Overview

Speaker diarization identifies individual speakers in your audio and labels each segment of the transcript with the speaker. When enabled, the transcript is returned as a list of utterances, where each utterance represents an uninterrupted segment of speech from a single speaker. Accuracy improves the more each speaker talks as the model accumulates embedding context. For best results, each speaker should have at least 30 seconds of continuous speech.

Quickstart

The examples below cover Python (requests), the Python SDK, JavaScript (fetch), and the JavaScript SDK.

Python: To enable Speaker Diarization, set speaker_labels to True in the POST request body:

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

with open("./my-audio.mp3", "rb") as f:
  response = requests.post(base_url + "/v2/upload",
                          headers=headers,
                          data=f)

upload_url = response.json()["upload_url"]

data = {
    "audio_url": upload_url, # You can also use a URL to an audio or video file on the web
    "language_detection": True,
    "speaker_labels": True
}

url = base_url + "/v2/transcript"
response = requests.post(url, json=data, headers=headers)

transcript_id = response.json()['id']
polling_endpoint = base_url + "/v2/transcript/" + transcript_id

while True:
  transcription_result = requests.get(polling_endpoint, headers=headers).json()

  if transcription_result['status'] == 'completed':
    print(f"Transcript ID: {transcript_id}")
    break

  elif transcription_result['status'] == 'error':
    raise RuntimeError(f"Transcription failed: {transcription_result['error']}")

  else:
    time.sleep(3)

for utterance in transcription_result['utterances']:
  print(f"Speaker {utterance['speaker']}: {utterance['text']}")

Python SDK: To enable Speaker Diarization, set speaker_labels to True in the transcription config:

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# You can use a local filepath:
# audio_file = "./example.mp3"

# Or use a publicly-accessible URL:
audio_file = (
    "https://assembly.ai/wildfires.mp3"
)

config = aai.TranscriptionConfig(
  language_detection=True,
  speaker_labels=True,
)

transcript = aai.Transcriber().transcribe(audio_file, config)

for utterance in transcript.utterances:
  print(f"Speaker {utterance.speaker}: {utterance.text}")

JavaScript: To enable Speaker Diarization, set speaker_labels to true in the POST request body:

import fs from "fs-extra";

const baseUrl = "https://api.assemblyai.com";

const headers = {
  authorization: "<YOUR_API_KEY>",
};

const path = "./audio/audio.mp3";
const audioData = await fs.readFile(path);

let res = await fetch(`${baseUrl}/v2/upload`, {
  method: "POST",
  headers,
  body: audioData,
});
if (!res.ok) throw new Error(`Error: ${res.status}`);
const uploadResponse = await res.json();

const uploadUrl = uploadResponse.upload_url;

const data = {
  audio_url: uploadUrl, // You can also use a URL to an audio or video file on the web
  language_detection: true,
  speaker_labels: true,
};

const url = `${baseUrl}/v2/transcript`;
res = await fetch(url, {
  method: "POST",
  headers: { ...headers, "Content-Type": "application/json" },
  body: JSON.stringify(data),
});
if (!res.ok) throw new Error(`Error: ${res.status}`);
const response = await res.json();

const transcriptId = response.id;
const pollingEndpoint = `${baseUrl}/v2/transcript/${transcriptId}`;

while (true) {
  res = await fetch(pollingEndpoint, { headers });
  if (!res.ok) throw new Error(`Error: ${res.status}`);
  const transcriptionResult = await res.json();

  if (transcriptionResult.status === "completed") {
    for (const utterance of transcriptionResult.utterances) {
      console.log(`Speaker ${utterance.speaker}: ${utterance.text}`);
    }
    break;
  } else if (transcriptionResult.status === "error") {
    throw new Error(`Transcription failed: ${transcriptionResult.error}`);
  } else {
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
}

JavaScript SDK: To enable Speaker Diarization, set speaker_labels to true in the transcription config:

import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({
  apiKey: "<YOUR_API_KEY>",
});

// You can use a local filepath:
// const audioFile = "./example.mp3"

// Or use a publicly-accessible URL:
const audioFile = "https://assembly.ai/wildfires.mp3";

const params = {
  audio: audioFile,
  language_detection: true,
  speaker_labels: true,
};

const run = async () => {
  const transcript = await client.transcripts.transcribe(params);

  for (const utterance of transcript.utterances ?? []) {
    console.log(`Speaker ${utterance.speaker}: ${utterance.text}`);
  }
};

run();

Configuration

You can constrain the number of speakers with speakers_expected, or with min_speakers_expected and max_speakers_expected inside speaker_options. These are hard boundaries on the number of speaker labels, not hints. max_speakers_expected is a strict cap: if more people speak than that, the additional speakers are merged into existing labels. min_speakers_expected is a strict floor: the model won’t return fewer speaker labels than that.

If you know the exact number of speakers, set speakers_expected.
If you have a rough idea, use min_speakers_expected and max_speakers_expected to set a range. Set max_speakers_expected a little higher than the number of speakers you expect so the model has room to identify any additional speakers. Setting it too high can cause the model to over-split and return more speaker labels than are actually present.

Key	Type	Default	Description
`speaker_labels`	boolean	`false`	Enable Speaker Diarization.
`speakers_expected`	number	—	Set the exact number of speakers.
`speaker_options`	object	—	Set range of possible speakers.
`speaker_options.min_speakers_expected`	number	—	A hard lower limit on the number of speaker labels. The model won’t return fewer speakers than this.
`speaker_options.max_speakers_expected`	number	Depends on audio duration. 0–2 minutes: —; 2–10 minutes: 10 speakers; 10+ minutes: 30 speakers	A hard upper limit on the number of speaker labels. If more people speak than this, the additional speakers are merged into existing labels. Give the model a little headroom above the number of speakers you expect; setting it too high can cause over-splitting and return more speakers than are actually present.

Only set speakers_expected when you are certain of the exact speaker count. If you’re unsure, use min_speakers_expected and max_speakers_expected to describe a range instead, because an incorrect exact count can reduce diarization accuracy.
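As an illustration of the two configuration styles, the sketch below builds a request body that uses either an exact count or a range. The helper name build_diarization_params and its arguments are hypothetical, and treating the two styles as mutually exclusive is an assumption of this sketch rather than documented API behavior. The field names are the ones listed in the table above.

```python
from typing import Any, Dict, Optional


def build_diarization_params(
    audio_url: str,
    speakers_expected: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a transcript request body with optional speaker-count limits."""
    params: Dict[str, Any] = {"audio_url": audio_url, "speaker_labels": True}

    has_range = min_speakers is not None or max_speakers is not None
    if speakers_expected is not None and has_range:
        # Assumption of this sketch: use an exact count or a range, never both.
        raise ValueError("Pass either speakers_expected or a min/max range")

    if speakers_expected is not None:
        # Exact count: only when the number of speakers is known for certain.
        params["speakers_expected"] = speakers_expected
    elif has_range:
        if (
            min_speakers is not None
            and max_speakers is not None
            and min_speakers > max_speakers
        ):
            raise ValueError("min_speakers must not exceed max_speakers")
        # Range: nested under speaker_options, as in the table above.
        options: Dict[str, int] = {}
        if min_speakers is not None:
            options["min_speakers_expected"] = min_speakers
        if max_speakers is not None:
            options["max_speakers_expected"] = max_speakers
        params["speaker_options"] = options

    return params


# A rough guess of 2 to 4 speakers, leaving some headroom on the maximum.
data = build_diarization_params(
    "https://assembly.ai/wildfires.mp3", min_speakers=2, max_speakers=4
)
```

The resulting dictionary can be passed as the json argument of the POST request shown in the Quickstart.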

Reading the response

When diarization is enabled, the transcript includes an utterances array alongside the full transcript text. Each object in the array represents one uninterrupted segment of speech from a single speaker.

{
  "utterances": [
    {
      "speaker": "A",
      "text": "Smoke from hundreds of wildfires in Canada is triggering air quality alerts.",
      "start": 250,
      "end": 28840,
      "confidence": 0.97,
      "words": [
        {
          "text": "Smoke",
          "speaker": "A",
          "start": 250,
          "end": 650,
          "confidence": 0.99
        }
      ]
    }
  ]
}

The utterances array contains objects with the following fields:

Field	Type	Description
`speaker`	string	The speaker label, assigned as sequential letters such as A, B, and C.
`text`	string	The transcribed text for the utterance.
`start`	number	The start time of the utterance in milliseconds.
`end`	number	The end time of the utterance in milliseconds.
`confidence`	number	A score between 0 and 1 indicating the model’s confidence in the transcribed text.
`words`	array	A word-level breakdown of the utterance.
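Because start and end are in milliseconds, the utterance fields are enough to estimate how long each speaker talked. The following is a minimal sketch, assuming the utterances have already been loaded as a list of dictionaries shaped like the JSON above. The function name talk_time_by_speaker and the sample values are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List


def talk_time_by_speaker(utterances: List[dict]) -> Dict[str, float]:
    """Return total speaking time per speaker label, in seconds."""
    totals_ms: Dict[str, int] = defaultdict(int)
    for utterance in utterances:
        # start and end are millisecond offsets from the start of the audio.
        totals_ms[utterance["speaker"]] += utterance["end"] - utterance["start"]
    return {speaker: ms / 1000 for speaker, ms in totals_ms.items()}


# Made-up sample data in the same shape as the utterances array.
sample_utterances = [
    {"speaker": "A", "text": "First turn.", "start": 250, "end": 28840},
    {"speaker": "B", "text": "Second turn.", "start": 29000, "end": 31000},
    {"speaker": "A", "text": "Third turn.", "start": 31500, "end": 40000},
]

for speaker, seconds in sorted(talk_time_by_speaker(sample_utterances).items()):
    print(f"Speaker {speaker}: {seconds:.1f} s")
```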

Each object in the words array contains the following fields:

Field	Type	Description
`text`	string	The transcribed word.
`speaker`	string	The speaker label for the word.
`start`	number	The start time of the word in milliseconds.
`end`	number	The end time of the word in milliseconds.
`confidence`	number	A score between 0 and 1 indicating the model’s confidence in the word.
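The word-level fields can be used to flag words that may need manual review. The sketch below collects words whose confidence falls under a threshold. The function name low_confidence_words and the 0.5 default are arbitrary choices for illustration, not recommended values.

```python
from typing import List, Tuple


def low_confidence_words(
    utterances: List[dict], threshold: float = 0.5
) -> List[Tuple[str, str, int, float]]:
    """Collect (speaker, text, start_ms, confidence) for words below threshold."""
    flagged = []
    for utterance in utterances:
        # Fall back to an empty list if an utterance carries no words array.
        for word in utterance.get("words", []):
            if word["confidence"] < threshold:
                flagged.append(
                    (word["speaker"], word["text"], word["start"], word["confidence"])
                )
    return flagged


# Made-up sample data in the same shape as the response above.
sample = [
    {
        "speaker": "A",
        "words": [
            {"text": "Smoke", "speaker": "A", "start": 250, "end": 650, "confidence": 0.99},
            {"text": "from", "speaker": "A", "start": 650, "end": 800, "confidence": 0.42},
        ],
    }
]

for speaker, text, start_ms, confidence in low_confidence_words(sample):
    print(f"Speaker {speaker} at {start_ms} ms: {text!r} ({confidence:.2f})")
```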

Identify speakers by name

Speaker Diarization assigns generic labels like “Speaker A” and “Speaker B” to each speaker. If you want to replace these labels with actual names or roles, you can use Speaker Identification to transform your transcript.

Before Speaker Identification:

Speaker A: Good morning, and welcome to the show.
Speaker B: Thanks for having me.

After Speaker Identification:

Michel Martin: Good morning, and welcome to the show.
Peter DeCarlo: Thanks for having me.

The following example shows how to transcribe audio with Speaker Diarization and then apply Speaker Identification to replace the generic speaker labels with actual names.

Python:

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

audio_url = "https://assembly.ai/wildfires.mp3"

# Configure transcript with speaker diarization and speaker identification
data = {
    "audio_url": audio_url,
    "language_detection": True,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Michel Martin", "Peter DeCarlo"]
            }
        }
    }
}

# Submit the transcription request
response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results
while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

# Print utterances with identified speaker names
for utterance in transcript["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

JavaScript:

const baseUrl = "https://api.assemblyai.com";

const headers = {
  authorization: "<YOUR_API_KEY>",
  "content-type": "application/json",
};

const audioUrl = "https://assembly.ai/wildfires.mp3";

// Configure transcript with speaker diarization and speaker identification
const data = {
  audio_url: audioUrl,
  language_detection: true,
  speaker_labels: true,
  speech_understanding: {
    request: {
      speaker_identification: {
        speaker_type: "name",
        known_values: ["Michel Martin", "Peter DeCarlo"],
      },
    },
  },
};

async function main() {
  // Submit the transcription request
  const response = await fetch(`${baseUrl}/v2/transcript`, {
    method: "POST",
    headers: headers,
    body: JSON.stringify(data),
  });
  if (!response.ok) throw new Error(`Error: ${response.status}`);

  const { id: transcriptId } = await response.json();
  const pollingEndpoint = `${baseUrl}/v2/transcript/${transcriptId}`;

  // Poll for transcription results
  while (true) {
    const pollingResponse = await fetch(pollingEndpoint, { headers });
    if (!pollingResponse.ok) throw new Error(`Error: ${pollingResponse.status}`);
    const transcript = await pollingResponse.json();

    if (transcript.status === "completed") {
      // Print utterances with identified speaker names
      for (const utterance of transcript.utterances) {
        console.log(`${utterance.speaker}: ${utterance.text}`);
      }
      break;
    } else if (transcript.status === "error") {
      throw new Error(`Transcription failed: ${transcript.error}`);
    } else {
      await new Promise((resolve) => setTimeout(resolve, 3000));
    }
  }
}

main().catch(console.error);

For more details on Speaker Identification, including how to identify speakers by role and how to apply it to existing transcripts, see the Speaker Identification guide.

Best practices for accurate diarization

Follow these tips to get the best results from Speaker Diarization:

Ensure sufficient speech per speaker. Each speaker should speak for at least 30 seconds uninterrupted. The model may struggle to create separate clusters for speakers who only contribute short phrases like “Yeah”, “Right”, or “Sounds good”.
Minimize cross-talk. Overlapping speech between speakers can reduce diarization accuracy. Where possible, ensure speakers take turns.
Reduce background noise. Background noise, echoes, or playback of recorded audio during a conversation can interfere with speaker separation.
Use speaker_options instead of speakers_expected when uncertain. Only use speakers_expected when you are confident about the exact number of speakers. If this number is incorrect, the model may produce random splits of single-speaker segments or merge multiple speakers into one. It’s generally recommended to use min_speakers_expected and set max_speakers_expected slightly higher (e.g., min_speakers_expected + 2) to allow flexibility.
Avoid setting max_speakers_expected too high. Setting the maximum too high may reduce accuracy, causing sentences from the same speaker to be split across multiple speaker labels.
Be aware of speaker similarity. If speakers sound similar, the model may have difficulty distinguishing between them.
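To check the first tip against a finished transcript, one rough approach is to look at each speaker's longest single utterance. This is a sketch under the assumption that utterance length is a reasonable proxy for uninterrupted speech. The function name speakers_with_short_turns is hypothetical, and the 30-second default comes from the guidance above.

```python
from typing import Dict, List


def speakers_with_short_turns(
    utterances: List[dict], min_ms: int = 30_000
) -> List[str]:
    """Return speaker labels whose longest utterance is shorter than min_ms."""
    longest_ms: Dict[str, int] = {}
    for utterance in utterances:
        duration = utterance["end"] - utterance["start"]
        speaker = utterance["speaker"]
        longest_ms[speaker] = max(longest_ms.get(speaker, 0), duration)
    return sorted(s for s, ms in longest_ms.items() if ms < min_ms)


# Made-up sample: speaker A has one long turn, speaker B only short replies.
turns = [
    {"speaker": "A", "start": 0, "end": 35_000},
    {"speaker": "B", "start": 35_200, "end": 36_000},
    {"speaker": "A", "start": 36_100, "end": 41_000},
    {"speaker": "B", "start": 41_200, "end": 43_200},
]

print(speakers_with_short_turns(turns))
```

Speakers returned by this check are the ones whose labels are most worth reviewing, since the model had little continuous speech to work with.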
