Speaker Identification

Supported languages

Language | Code
Global English | en
Australian English | en_au
British English | en_uk
US English | en_us
Spanish | es
French | fr
German | de
Italian | it
Portuguese | pt
Dutch | nl
Hindi | hi
Japanese | ja
Chinese | zh
Finnish | fi
Korean | ko
Polish | pl
Russian | ru
Turkish | tr
Ukrainian | uk
Vietnamese | vi
Afrikaans | af
Albanian | sq
Amharic | am
Arabic | ar
Armenian | hy
Assamese | as
Azerbaijani | az
Bashkir | ba
Basque | eu
Belarusian | be
Bengali | bn
Bosnian | bs
Breton | br
Bulgarian | bg
Catalan | ca
Croatian | hr
Czech | cs
Danish | da
Estonian | et
Faroese | fo
Galician | gl
Georgian | ka
Greek | el
Gujarati | gu
Haitian | ht
Hausa | ha
Hawaiian | haw
Hebrew | he
Hungarian | hu
Icelandic | is
Indonesian | id
Javanese | jw
Kannada | kn
Kazakh | kk
Lao | lo
Latin | la
Latvian | lv
Lingala | ln
Lithuanian | lt
Luxembourgish | lb
Macedonian | mk
Malagasy | mg
Malay | ms
Malayalam | ml
Maltese | mt
Maori | mi
Marathi | mr
Mongolian | mn
Nepali | ne
Norwegian | no
Norwegian Nynorsk | nn
Occitan | oc
Panjabi | pa
Pashto | ps
Persian | fa
Romanian | ro
Sanskrit | sa
Serbian | sr
Shona | sn
Sindhi | sd
Sinhala | si
Slovak | sk
Slovenian | sl
Somali | so
Sundanese | su
Swahili | sw
Swedish | sv
Tagalog | tl
Tajik | tg
Tamil | ta
Tatar | tt
Telugu | te
Turkmen | tk
Urdu | ur
Uzbek | uz
Welsh | cy
Yiddish | yi
Yoruba | yo

Supported models

Model | ID
Slam 1 | slam-1
Universal | universal

Availability: US only

Overview

Speaker Identification allows you to identify speakers by their actual names or roles, transforming generic labels like “Speaker A” or “Speaker B” into meaningful identifiers that you provide. Speaker identities are inferred based on the conversation content.

Example transformation:

Before:

Speaker A: Good morning, and welcome to the show.
Speaker B: Thanks for having me.
Speaker A: Let's dive into today's topic...

After:

Michel Martin: Good morning, and welcome to the show.
Peter DeCarlo: Thanks for having me.
Michel Martin: Let's dive into today's topic...

Speaker Identification requires that a file be transcribed with Speaker Diarization enabled. See the Speaker Diarization section of our documentation to learn more about that feature.

To reliably identify speakers, your audio should contain clear, distinguishable voices and sufficient spoken audio from each speaker. The accuracy of Speaker Diarization depends on the quality of the audio and the distinctiveness of each speaker’s voice, which will have a downstream effect on the quality of Speaker Identification.

How to use Speaker Identification

There are two ways to use Speaker Identification:

  1. Transcribe and identify in one request - Best when you’re starting a new transcription and want speaker identification included automatically
  2. Transcribe and identify in separate requests - Best when you already have a completed transcript or for more complex workflows where you might want to perform other tasks between the transcription and speaker identification process

Method 1: Transcribe and identify in one request

This method is ideal when you’re starting fresh and want both transcription and speaker identification in a single workflow.

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

# Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file
upload_url = "https://assembly.ai/wildfires.mp3"

# Configure transcript with speaker identification
data = {
    "audio_url": upload_url,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Michel Martin", "Peter DeCarlo"]  # Change these values to match the names of the speakers in your file
            }
        }
    }
}

# Submit the transcription request
response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results
while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break

    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")

    else:
        time.sleep(3)

# Access the results and print utterances to the terminal
for utterance in transcript["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

Method 2: Transcribe and identify in separate requests

This method is useful when you already have a completed transcript or for more complex workflows where you need to separate transcription from speaker identification.

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

# Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file
upload_url = "https://assembly.ai/wildfires.mp3"

data = {
    "audio_url": upload_url,
    "speaker_labels": True
}

# Transcribe file
response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)

transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results
while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break

    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")

    else:
        time.sleep(3)

# Enable speaker identification
understanding_body = {
    "transcript_id": transcript_id,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Michel Martin", "Peter DeCarlo"]  # Change these values to match the names of the speakers in your file
            }
        }
    }
}

# Send the completed transcript to the Speech Understanding API
result = requests.post(
    "https://llm-gateway.assemblyai.com/v1/understanding",
    headers=headers,
    json=understanding_body
).json()

# Access the results and print utterances to the terminal
for utterance in result["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

Output format details

Here is how the contents of the utterances key differ when only Speaker Diarization is used versus when Speaker Identification is also applied:

Before (Speaker Diarization only):

Speaker A: ... We wanted to better understand what's happening here and why, so we called Peter DeCarlo, an associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University. Good morning, Professor.
Speaker B: Good morning.
Speaker A: So what is it about the conditions right now that have caused this round of wildfires to affect so many people so far away?
Speaker B: Well, there's a couple of things. The season has been pretty dry already, and then the fact that we're getting hit in the US is because there's a couple weather systems that are essentially channeling the smoke from those Canadian wildfires through Pennsylvania into the mid Atlantic and the Northeast and kind of just dropping the smoke there.

After (with Speaker Identification):

Michel Martin: ... We wanted to better understand what's happening here and why, so we called Peter DeCarlo, an associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University. Good morning, Professor.
Peter DeCarlo: Good morning.
Michel Martin: So what is it about the conditions right now that have caused this round of wildfires to affect so many people so far away?
Peter DeCarlo: Well, there's a couple of things. The season has been pretty dry already, and then the fact that we're getting hit in the US is because there's a couple weather systems that are essentially channeling the smoke from those Canadian wildfires through Pennsylvania into the mid Atlantic and the Northeast and kind of just dropping the smoke there.
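
If you want to sanity-check the identification step programmatically, you can compare the speaker labels before and after. Here is a minimal sketch, assuming transcript is the completed diarized transcript and result is the Speech Understanding response from Method 2 above; summarize_speakers is an illustrative helper, not part of the API:

from collections import Counter

def summarize_speakers(utterances):
    # Count how many utterances are attributed to each speaker label
    return Counter(u["speaker"] for u in utterances)

# Generic labels before identification, provided names after
print("Before:", summarize_speakers(transcript["utterances"]))  # e.g. Counter({'A': ..., 'B': ...})
print("After:", summarize_speakers(result["utterances"]))       # e.g. Counter({'Michel Martin': ..., 'Peter DeCarlo': ...})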

Advanced usage

Identifying speakers by role

Instead of identifying speakers by name as shown in the examples above, you can also identify speakers by role.

This can be useful in customer service calls, AI interactions, or any scenario where you may not know the specific names of the speakers but still want to identify them by something more than a generic identifier like A, B, or C.

To identify speakers by role, use the speaker_type parameter with a value of “role”:

Example

# For Method 1 (transcribe and identify in one request):
data = {
    "audio_url": upload_url,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "known_values": ["Interviewer", "Interviewee"]  # Roles instead of names
            }
        }
    }
}

# For Method 2 (add identification to an existing transcript):
understanding_body = {
    "transcript_id": transcript_id,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "known_values": ["Interviewer", "Interviewee"]  # Roles instead of names
            }
        }
    }
}

# Send the request to the Speech Understanding API
result = requests.post(
    "https://llm-gateway.assemblyai.com/v1/understanding",
    headers=headers,
    json=understanding_body
).json()

Common role combinations

  • ["Agent", "Customer"] - Customer service calls
  • ["AI Assistant", "User"] - AI chatbot interactions
  • ["Support", "Customer"] - Technical support calls
  • ["Interviewer", "Interviewee"] - Interview recordings
  • ["Host", "Guest"] - Podcast or show recordings
  • ["Moderator", "Panelist"] - Panel discussions

API reference

Request

Method 1: Transcribe and identify in one request

When creating a new transcription, include the speech_understanding parameter directly in your transcription request:

curl -X POST \
  "https://api.assemblyai.com/v2/transcript" \
  -H "Authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://assembly.ai/wildfires.mp3",
    "speaker_labels": true,
    "speech_understanding": {
      "request": {
        "speaker_identification": {
          "speaker_type": "name",
          "known_values": ["Michel Martin", "Peter DeCarlo"]
        }
      }
    }
  }'

Method 2: Add identification to existing transcripts

For existing transcripts, retrieve the completed transcript and send it to the Speech Understanding API:

# Step 1: Submit transcription job
curl -X POST "https://api.assemblyai.com/v2/transcript" \
  -H "authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://assembly.ai/wildfires.mp3",
    "speaker_labels": true
  }'

# Save the transcript_id from the response above, then use it in the following commands

# Step 2: Poll for transcription status (repeat until status is "completed")
curl -X GET "https://api.assemblyai.com/v2/transcript/{transcript_id}" \
  -H "authorization: <YOUR_API_KEY>"

# Step 3: Once transcription is completed, enable speaker identification
curl -X POST "https://llm-gateway.assemblyai.com/v1/understanding" \
  -H "authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "transcript_id": "{transcript_id}",
    "speech_understanding": {
      "request": {
        "speaker_identification": {
          "speaker_type": "name",
          "known_values": ["Michel Martin", "Peter DeCarlo"]
        }
      }
    }
  }'

Request parameters

Key | Type | Required? | Description
speech_understanding | object | Yes | Container for speech understanding requests.
speech_understanding.request | object | Yes | The understanding request configuration.
speech_understanding.request.speaker_identification | object | Yes | Speaker identification configuration.
speaker_identification.speaker_type | string | Yes | The type of speaker identifier: "name" for actual names or "role" for roles/titles.
speaker_identification.known_values | array | Conditional | List of speaker names or roles. Required when speaker_type is "role"; optional when speaker_type is "name".
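
To make the conditional requirement on known_values concrete, here are two example speaker_identification objects that follow the table above (the values themselves are illustrative):

# Identifying by name: known_values is optional
by_name = {
    "speaker_type": "name",
    "known_values": ["Michel Martin", "Peter DeCarlo"]
}

# Identifying by role: known_values is required
by_role = {
    "speaker_type": "role",
    "known_values": ["Host", "Guest"]
}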

Response

The Speaker Identification API returns a modified version of your transcript with updated speaker labels in the utterances key.

{
  "utterances": [
    {
      "speaker": "Michel Martin",
      "text": "Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US Skylines from Maine to Maryland to Minnesota are gray and smoggy. And in some places, the air quality warnings include the warning to stay inside. We wanted to better understand what's happening here and why, so we called Peter DeCarlo, an associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University. Good morning, Professor.",
      "start": 240,
      "end": 26560,
      "confidence": 0.9815734,
      "words": [
        {
          "text": "Smoke",
          "start": 240,
          "end": 640,
          "confidence": 0.90152997,
          "speaker": "Michel Martin"
        }
        // ... more words
      ]
    },
    {
      "speaker": "Peter DeCarlo",
      "text": "Good morning.",
      "start": 28060,
      "end": 28620,
      "confidence": 0.98217773,
      "words": [
        {
          "text": "Good",
          "start": 28060,
          "end": 28260,
          "confidence": 0.96484375,
          "speaker": "Peter DeCarlo"
        }
        // ... more words
      ]
    }
  ]
}

Response fields

Key | Type | Description
utterances | array | A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file.
utterances[i].confidence | number | The confidence score for the transcript of this utterance.
utterances[i].end | number | The ending time, in milliseconds, of the utterance in the audio file.
utterances[i].speaker | string | The identified speaker name or role for this utterance.
utterances[i].start | number | The starting time, in milliseconds, of the utterance in the audio file.
utterances[i].text | string | The transcript for this utterance.
utterances[i].words | array | A sequential array for the words in the transcript, where the j-th element is an object containing information about the j-th word in the utterance.
utterances[i].words[j].text | string | The text of the j-th word in the i-th utterance.
utterances[i].words[j].start | number | The starting time, in milliseconds, for when the j-th word is spoken in the i-th utterance.
utterances[i].words[j].end | number | The ending time, in milliseconds, for when the j-th word is spoken in the i-th utterance.
utterances[i].words[j].confidence | number | The confidence score for the transcript of the j-th word in the i-th utterance.
utterances[i].words[j].speaker | string | The identified speaker name or role who uttered the j-th word in the i-th utterance.

Key differences from standard transcription

Field | Standard Transcription | With Speaker Identification
utterances[].speaker | Generic labels ("A", "B", "C") | Identified names ("Michel Martin", "Peter DeCarlo") or roles ("Agent", "Customer")
utterances[].words[].speaker | Generic labels ("A", "B", "C") | Identified names or roles matching the utterance speaker

All other fields (text, start, end, confidence, words) remain unchanged from the original transcript.
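
Because the timing and speaker fields are unchanged in shape, you can post-process the result the same way you would a standard transcript. Here is a minimal sketch that renders each utterance with a timestamp, using the start times (in milliseconds) and speaker labels described above; format_transcript is an illustrative helper, not part of the API, and result is the Speech Understanding response from earlier:

def format_transcript(utterances):
    # Render each utterance as "[MM:SS] Speaker: text"
    lines = []
    for u in utterances:
        seconds = u["start"] // 1000
        timestamp = f"{seconds // 60:02d}:{seconds % 60:02d}"
        lines.append(f"[{timestamp}] {u['speaker']}: {u['text']}")
    return "\n".join(lines)

print(format_transcript(result["utterances"]))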