Speaker Identification

Supported languages

Language | Code
Global English | en
Australian English | en_au
British English | en_uk
US English | en_us
Spanish | es
French | fr
German | de
Italian | it
Portuguese | pt
Dutch | nl
Hindi | hi
Japanese | ja
Chinese | zh
Finnish | fi
Korean | ko
Polish | pl
Russian | ru
Turkish | tr
Ukrainian | uk
Vietnamese | vi
Afrikaans | af
Albanian | sq
Amharic | am
Arabic | ar
Armenian | hy
Assamese | as
Azerbaijani | az
Bashkir | ba
Basque | eu
Belarusian | be
Bengali | bn
Bosnian | bs
Breton | br
Bulgarian | bg
Catalan | ca
Croatian | hr
Czech | cs
Danish | da
Estonian | et
Faroese | fo
Galician | gl
Georgian | ka
Greek | el
Gujarati | gu
Haitian | ht
Hausa | ha
Hawaiian | haw
Hebrew | he
Hungarian | hu
Icelandic | is
Indonesian | id
Javanese | jw
Kannada | kn
Kazakh | kk
Lao | lo
Latin | la
Latvian | lv
Lingala | ln
Lithuanian | lt
Luxembourgish | lb
Macedonian | mk
Malagasy | mg
Malay | ms
Malayalam | ml
Maltese | mt
Maori | mi
Marathi | mr
Mongolian | mn
Nepali | ne
Norwegian | no
Norwegian Nynorsk | nn
Occitan | oc
Panjabi | pa
Pashto | ps
Persian | fa
Romanian | ro
Sanskrit | sa
Serbian | sr
Shona | sn
Sindhi | sd
Sinhala | si
Slovak | sk
Slovenian | sl
Somali | so
Sundanese | su
Swahili | sw
Swedish | sv
Tagalog | tl
Tajik | tg
Tamil | ta
Tatar | tt
Telugu | te
Turkmen | tk
Urdu | ur
Uzbek | uz
Welsh | cy
Yiddish | yi
Yoruba | yo

Supported models

Model | ID
Slam 1 | slam-1
Universal | universal

Availability: US only

Overview

Speaker Identification allows you to identify speakers by their actual names or roles, transforming generic labels like “Speaker A” or “Speaker B” into meaningful identifiers that you provide. Speaker identities are inferred based on the conversation content.

Example transformation:

Before:

Speaker A: Good morning, and welcome to the show.
Speaker B: Thanks for having me.
Speaker A: Let's dive into today's topic...

After:

Michel Martin: Good morning, and welcome to the show.
Peter DeCarlo: Thanks for having me.
Michel Martin: Let's dive into today's topic...

Speaker Identification requires that a file be transcribed with Speaker Diarization enabled. See the Speaker Diarization section of our documentation to learn more about that feature.

To reliably identify speakers, your audio should contain clear, distinguishable voices and sufficient spoken audio from each speaker. The accuracy of Speaker Diarization depends on the quality of the audio and the distinctiveness of each speaker’s voice, which will have a downstream effect on the quality of Speaker Identification.

How to use Speaker Identification

There are two ways to use Speaker Identification:

  1. Transcribe and identify in one request - Best when you’re starting a new transcription and want speaker identification included automatically
  2. Transcribe and identify in separate requests - Best when you already have a completed transcript or for more complex workflows where you might want to perform other tasks between the transcription and speaker identification process

Method 1: Transcribe and identify in one request

This method is ideal when you’re starting fresh and want both transcription and speaker identification in a single workflow.

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

# Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file
upload_url = "https://assembly.ai/wildfires.mp3"

# Configure transcript with speaker identification
data = {
    "audio_url": upload_url,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Michel Martin", "Peter DeCarlo"]  # Change these values to match the names of the speakers in your file
            }
        }
    }
}

# Submit the transcription request
response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results
while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break

    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")

    else:
        time.sleep(3)

# Access the results and print utterances to the terminal
for utterance in transcript["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

Method 2: Transcribe and identify in separate requests

This method is useful when you already have a completed transcript or for more complex workflows where you need to separate transcription from speaker identification.

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

# Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file
upload_url = "https://assembly.ai/wildfires.mp3"

data = {
    "audio_url": upload_url,
    "speaker_labels": True
}

# Transcribe file
response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)

transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results
while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break

    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")

    else:
        time.sleep(3)

# Enable speaker identification
understanding_body = {
    "transcript_id": transcript_id,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Michel Martin", "Peter DeCarlo"]  # Change these values to match the names of the speakers in your file
            }
        }
    }
}

# Send the completed transcript to the Speech Understanding API
result = requests.post(
    "https://llm-gateway.assemblyai.com/v1/understanding",
    headers=headers,
    json=understanding_body
).json()

# Access the results and print utterances to the terminal
for utterance in result["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

Output format details

Here is how the contents of the utterances key differ when only Speaker Diarization is used versus when Speaker Identification is also applied:

Before (Speaker Diarization only):

Speaker A: ... We wanted to better understand what's happening here and why, so we called Peter DeCarlo, an associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University. Good morning, Professor.
Speaker B: Good morning.
Speaker A: So what is it about the conditions right now that have caused this round of wildfires to affect so many people so far away?
Speaker B: Well, there's a couple of things. The season has been pretty dry already, and then the fact that we're getting hit in the US is because there's a couple weather systems that are essentially channeling the smoke from those Canadian wildfires through Pennsylvania into the mid Atlantic and the Northeast and kind of just dropping the smoke there.

After (with Speaker Identification):

Michel Martin: ... We wanted to better understand what's happening here and why, so we called Peter DeCarlo, an associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University. Good morning, Professor.
Peter DeCarlo: Good morning.
Michel Martin: So what is it about the conditions right now that have caused this round of wildfires to affect so many people so far away?
Peter DeCarlo: Well, there's a couple of things. The season has been pretty dry already, and then the fact that we're getting hit in the US is because there's a couple weather systems that are essentially channeling the smoke from those Canadian wildfires through Pennsylvania into the mid Atlantic and the Northeast and kind of just dropping the smoke there.
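
If you want to sanity-check the identification step programmatically, you can compare the speaker labels before and after. Here is a minimal sketch, assuming transcript is the completed diarized transcript and result is the Speech Understanding response from Method 2 above; summarize_speakers is an illustrative helper, not part of the API:

from collections import Counter

def summarize_speakers(utterances):
    # Count how many utterances are attributed to each speaker label
    return Counter(u["speaker"] for u in utterances)

# Generic labels before identification, provided names after
print("Before:", summarize_speakers(transcript["utterances"]))  # e.g. Counter({'A': ..., 'B': ...})
print("After:", summarize_speakers(result["utterances"]))       # e.g. Counter({'Michel Martin': ..., 'Peter DeCarlo': ...})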

Advanced usage

Identifying speakers by role

Instead of identifying speakers by name as shown in the examples above, you can also identify speakers by role.

This can be useful in customer service calls, AI interactions, or any scenario where you may not know the specific names of the speakers but still want to identify them by something more than a generic identifier like A, B, or C.

To identify speakers by role, use the speaker_type parameter with a value of “role”:

Example

# For Method 1 (transcribe and identify in one request):
data = {
    "audio_url": upload_url,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "known_values": ["Interviewer", "Interviewee"]  # Roles instead of names
            }
        }
    }
}

# For Method 2 (add identification to an existing transcript):
understanding_body = {
    "transcript_id": transcript_id,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "known_values": ["Interviewer", "Interviewee"]  # Roles instead of names
            }
        }
    }
}

# Send the request to the Speech Understanding API
result = requests.post(
    "https://llm-gateway.assemblyai.com/v1/understanding",
    headers=headers,
    json=understanding_body
).json()

Common role combinations

  • ["Agent", "Customer"] - Customer service calls
  • ["AI Assistant", "User"] - AI chatbot interactions
  • ["Support", "Customer"] - Technical support calls
  • ["Interviewer", "Interviewee"] - Interview recordings
  • ["Host", "Guest"] - Podcast or show recordings
  • ["Moderator", "Panelist"] - Panel discussions

API reference

Request

Method 1: Transcribe and identify in one request

When creating a new transcription, include the speech_understanding parameter directly in your transcription request:

curl -X POST \
  "https://api.assemblyai.com/v2/transcript" \
  -H "Authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://assembly.ai/wildfires.mp3",
    "speaker_labels": true,
    "speech_understanding": {
      "request": {
        "speaker_identification": {
          "speaker_type": "name",
          "known_values": ["Michel Martin", "Peter DeCarlo"]
        }
      }
    }
  }'

Method 2: Add identification to existing transcripts

For existing transcripts, retrieve the completed transcript and send it to the Speech Understanding API:

# Step 1: Submit transcription job
curl -X POST "https://api.assemblyai.com/v2/transcript" \
  -H "authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://assembly.ai/wildfires.mp3",
    "speaker_labels": true
  }'

# Save the transcript_id from the response above, then use it in the following commands

# Step 2: Poll for transcription status (repeat until status is "completed")
curl -X GET "https://api.assemblyai.com/v2/transcript/{transcript_id}" \
  -H "authorization: <YOUR_API_KEY>"

# Step 3: Once transcription is completed, enable speaker identification
curl -X POST "https://llm-gateway.assemblyai.com/v1/understanding" \
  -H "authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "transcript_id": "{transcript_id}",
    "speech_understanding": {
      "request": {
        "speaker_identification": {
          "speaker_type": "name",
          "known_values": ["Michel Martin", "Peter DeCarlo"]
        }
      }
    }
  }'

Request parameters

Key | Type | Required? | Description
speech_understanding | object | Yes | Container for speech understanding requests.
speech_understanding.request | object | Yes | The understanding request configuration.
speech_understanding.request.speaker_identification | object | Yes | Speaker identification configuration.
speaker_identification.speaker_type | string | Yes | The type of speaker identifier: "name" for actual names or "role" for roles/titles.
speaker_identification.known_values | array | Conditional | List of speaker names or roles. Required when speaker_type is "role"; optional when speaker_type is "name".
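
To make the conditional requirement on known_values concrete, here are two example speaker_identification objects that follow the table above (the values themselves are illustrative):

# Identifying by name: known_values is optional
by_name = {
    "speaker_type": "name",
    "known_values": ["Michel Martin", "Peter DeCarlo"]
}

# Identifying by role: known_values is required
by_role = {
    "speaker_type": "role",
    "known_values": ["Host", "Guest"]
}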

Response

The Speaker Identification API returns a modified version of your transcript with updated speaker labels in the utterances key.

{
  "utterances": [
    {
      "speaker": "Michel Martin",
      "text": "Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US Skylines from Maine to Maryland to Minnesota are gray and smoggy. And in some places, the air quality warnings include the warning to stay inside. We wanted to better understand what's happening here and why, so we called Peter DeCarlo, an associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University. Good morning, Professor.",
      "start": 240,
      "end": 26560,
      "confidence": 0.9815734,
      "words": [
        {
          "text": "Smoke",
          "start": 240,
          "end": 640,
          "confidence": 0.90152997,
          "speaker": "Michel Martin"
        }
        // ... more words
      ]
    },
    {
      "speaker": "Peter DeCarlo",
      "text": "Good morning.",
      "start": 28060,
      "end": 28620,
      "confidence": 0.98217773,
      "words": [
        {
          "text": "Good",
          "start": 28060,
          "end": 28260,
          "confidence": 0.96484375,
          "speaker": "Peter DeCarlo"
        }
        // ... more words
      ]
    }
  ]
}

Response fields

Key | Type | Description
utterances | array | A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file.
utterances[i].confidence | number | The confidence score for the transcript of this utterance.
utterances[i].end | number | The ending time, in milliseconds, of the utterance in the audio file.
utterances[i].speaker | string | The identified speaker name or role for this utterance.
utterances[i].start | number | The starting time, in milliseconds, of the utterance in the audio file.
utterances[i].text | string | The transcript for this utterance.
utterances[i].words | array | A sequential array for the words in the transcript, where the j-th element is an object containing information about the j-th word in the utterance.
utterances[i].words[j].text | string | The text of the j-th word in the i-th utterance.
utterances[i].words[j].start | number | The starting time, in milliseconds, for when the j-th word is spoken in the i-th utterance.
utterances[i].words[j].end | number | The ending time, in milliseconds, for when the j-th word is spoken in the i-th utterance.
utterances[i].words[j].confidence | number | The confidence score for the transcript of the j-th word in the i-th utterance.
utterances[i].words[j].speaker | string | The identified speaker name or role who uttered the j-th word in the i-th utterance.

Key differences from standard transcription

Field | Standard Transcription | With Speaker Identification
utterances[].speaker | Generic labels ("A", "B", "C") | Identified names ("Michel Martin", "Peter DeCarlo") or roles ("Agent", "Customer")
utterances[].words[].speaker | Generic labels ("A", "B", "C") | Identified names or roles matching the utterance speaker

All other fields (text, start, end, confidence, words) remain unchanged from the original transcript.
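
Because the timing and speaker fields are unchanged in shape, you can post-process the result the same way you would a standard transcript. Here is a minimal sketch that renders each utterance with a timestamp, using the start times (in milliseconds) and speaker labels described above; format_transcript is an illustrative helper, not part of the API, and result is the Speech Understanding response from earlier:

def format_transcript(utterances):
    # Render each utterance as "[MM:SS] Speaker: text"
    lines = []
    for u in utterances:
        seconds = u["start"] // 1000
        timestamp = f"{seconds // 60:02d}:{seconds % 60:02d}"
        lines.append(f"[{timestamp}] {u['speaker']}: {u['text']}")
    return "\n".join(lines)

print(format_transcript(result["utterances"]))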