Streaming Diarization lets you identify and label individual speakers in real
time directly from the Streaming API. Each Turn event includes a speaker_label
field (e.g. A, B) indicating the dominant speaker for that turn. Each final
word in the words array also carries a speaker field, enabling mid-turn speaker
change detection. Speaker accuracy improves over the course of a session as the model
accumulates embedding context — so the longer the conversation, the better the labels.
You can enable Streaming Diarization by adding speaker_labels: true to your
connection parameters. No other changes are required — the speaker_label field
will appear on every Turn event, and each final word in the words array will
include a speaker field automatically.
Get started with Streaming Diarization using the code below. This example streams audio from your microphone and prints each turn with its speaker label.
Enable Streaming Diarization by adding speaker_labels: true to your connection
parameters. You can optionally cap the number of speakers with max_speakers.
Diarization is supported on all streaming models: u3-rt-pro,
universal-streaming-english, universal-streaming-multilingual, and
whisper-rt. You do not need to change your speech model to use it — just
add speaker_labels: true.
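As a minimal sketch of what this looks like in practice, the helper below assembles a connection URL with speaker_labels and an optional max_speakers cap. The endpoint URL, the sample_rate parameter, and the build_streaming_url helper name are assumptions made for illustration; check them against your Streaming API reference before use.

```python
from urllib.parse import urlencode

# Assumption: base URL of the streaming WebSocket endpoint.
STREAMING_ENDPOINT = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(sample_rate=16000, max_speakers=None):
    """Return a connection URL with Streaming Diarization enabled."""
    params = {
        "sample_rate": sample_rate,   # assumed parameter name
        "speaker_labels": "true",     # enables Streaming Diarization
    }
    if max_speakers is not None:
        # Optional cap on the number of distinct speakers.
        params["max_speakers"] = max_speakers
    return f"{STREAMING_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_streaming_url(max_speakers=2))
```

Leaving max_speakers unset keeps the parameter out of the URL entirely, so the service default applies.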
When diarization is enabled, every Turn event includes a speaker_label field
reflecting the dominant speaker for that turn.
Each final word in the words array also carries a speaker field. This allows
you to detect speaker changes within a single turn — for example, a turn where one
speaker finishes another’s sentence, or where a brief interjection appears mid-turn.
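One way to surface mid-turn speaker changes is to group consecutive final words by their speaker field. The sketch below assumes each word object carries its text under a text key, and the sample turn is hypothetical data made up for illustration. Words without a speaker field fall back to the turn-level speaker_label.

```python
def speaker_segments(turn):
    """Group a Turn event's final words into runs of same-speaker words."""
    fallback = turn.get("speaker_label")
    runs = []
    for word in turn.get("words", []):
        if not word.get("word_is_final"):
            continue  # non-final words never carry a speaker field
        # A missing speaker field falls back to the turn-level label.
        speaker = word.get("speaker", fallback)
        if runs and runs[-1][0] == speaker:
            runs[-1][1].append(word["text"])  # "text" key is an assumption
        else:
            runs.append((speaker, [word["text"]]))
    return [(speaker, " ".join(words)) for speaker, words in runs]


# Hypothetical Turn event, for illustration only.
example_turn = {
    "type": "Turn",
    "speaker_label": "A",
    "words": [
        {"text": "So", "word_is_final": True, "speaker": "A"},
        {"text": "we", "word_is_final": True, "speaker": "A"},
        {"text": "ship", "word_is_final": True, "speaker": "B"},
        {"text": "Friday", "word_is_final": True},   # speaker omitted
        {"text": "right", "word_is_final": False},   # still in progress
    ],
}

if __name__ == "__main__":
    for speaker, text in speaker_segments(example_turn):
        print(f"{speaker}: {text}")
```

A turn that yields more than one segment contains at least one speaker change.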
A few things to keep in mind when consuming speaker:
- The speaker field only appears on words where word_is_final: true. Non-final (in-progress) words never carry it.
- speaker can be absent on individual words. If the field is missing from a word entirely, treat that word as unattributed and fall back to the turn-level speaker_label if you need a label. Absent means the field is omitted from the JSON; it will never be null.
- UNKNOWN at word level means the model couldn’t confidently attribute that word to any specific speaker, which is common for short backchannels (“uh huh”, “yeah”) or brief low-quality audio segments. It is not an ambiguity flag between two known speakers; words in a confidently attributed stretch carry the speaker’s letter, not UNKNOWN.
If a turn contains less than approximately 1 second of audio, the turn-level speaker_label
will be set to "UNKNOWN". This is because the model needs at least ~1 second of
audio to generate a reliable diarization embedding — without enough audio, embeddings
may be inaccurate and could lead to a single speaker being labeled as multiple
speakers. Labeling short turns as "UNKNOWN" ensures that speaker labels remain
as accurate as possible.
Your application should handle this case gracefully.
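A simple way to handle "UNKNOWN" is to render it as an unattributed turn and keep it out of any per-speaker bookkeeping. The sketch below is one possible approach, not a prescribed one; the display_name and known_speakers helpers and the sample turns are hypothetical.

```python
UNKNOWN = "UNKNOWN"


def display_name(turn, names=None):
    """Map a turn-level speaker_label to a human-readable name."""
    label = turn.get("speaker_label", UNKNOWN)
    if label == UNKNOWN:
        # Too little audio for a reliable label: show it as unattributed.
        return "Unknown speaker"
    # Optional mapping from labels to real names, e.g. {"A": "Agent"}.
    return (names or {}).get(label, f"Speaker {label}")


def known_speakers(turns):
    """Return the distinct labels seen so far, ignoring UNKNOWN turns."""
    labels = {t.get("speaker_label", UNKNOWN) for t in turns}
    labels.discard(UNKNOWN)
    return sorted(labels)


if __name__ == "__main__":
    # Hypothetical turns, for illustration only.
    turns = [
        {"speaker_label": "A", "transcript": "Thanks for calling."},
        {"speaker_label": "UNKNOWN", "transcript": "Hi."},
        {"speaker_label": "B", "transcript": "I have a billing question."},
    ]
    for t in turns:
        print(f"{display_name(t, {'A': 'Agent'})}: {t['transcript']}")
    print(known_speakers(turns))
```

Excluding "UNKNOWN" from the speaker count avoids treating a short interjection as an extra participant.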
A typical multi-speaker exchange looks like this:
Streaming Diarization builds a speaker profile incrementally as audio flows in. In practice this means:
For long-form use cases (call center, clinical scribe, meeting transcription), the model will settle into accurate, stable labels well before the end of the conversation.
Real-time diarization is an inherently harder problem than diarization for async transcription on pre-recorded audio. The following limitations apply to the current beta:
- Turns with less than approximately 1 second of audio are labeled "UNKNOWN" because there is insufficient audio to generate a reliable speaker embedding. This prevents inaccurate embeddings from causing a single speaker to be split across multiple labels.
For the best results, use a microphone setup that minimizes cross-talk and background noise, and ensure each speaker produces at least a few complete sentences before you rely on per-turn labels for downstream processing.
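If downstream logic depends on per-turn labels, one option is to hold off until each speaker has produced a minimum amount of speech. The sketch below counts words per label; the SpeakerWarmup class, the 20-word default threshold, and the transcript field name are assumptions chosen for illustration, not documented values.

```python
from collections import Counter


class SpeakerWarmup:
    """Track how much speech each label has produced so far."""

    def __init__(self, min_words=20):
        self.min_words = min_words    # hypothetical threshold, tune per use case
        self.words_heard = Counter()

    def observe(self, turn):
        """Record a turn; return True once its label has enough history."""
        label = turn.get("speaker_label")
        if label and label != "UNKNOWN":
            # "transcript" is assumed to hold the turn's text.
            self.words_heard[label] += len(turn.get("transcript", "").split())
        return self.is_stable(label)

    def is_stable(self, label):
        # Labels that were never counted (including UNKNOWN) report 0 words.
        return self.words_heard[label] >= self.min_words


if __name__ == "__main__":
    warmup = SpeakerWarmup(min_words=5)
    print(warmup.observe({"speaker_label": "A", "transcript": "hello there"}))
    print(warmup.observe({"speaker_label": "A", "transcript": "this is a longer sentence"}))
```

Word count is only a rough proxy for "a few complete sentences"; accumulated audio duration would work equally well if your events expose word timings.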
To transcribe multichannel streaming audio, we recommend creating a separate session for each channel. This approach allows you to maintain clear speaker separation and get accurate diarized transcriptions for conversations, phone calls, or interviews where speakers are recorded on two different channels.
The following code example demonstrates how to transcribe a dual-channel audio file with diarized, speaker-separated transcripts. This same approach can be applied to any multi-channel audio stream, including those with more than two channels.
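The channel-splitting step of this approach can be sketched as follows, assuming the source audio is interleaved 16-bit PCM with two channels. The split_stereo_pcm16 helper is hypothetical; each returned byte string would then be sent to its own streaming session.

```python
from array import array


def split_stereo_pcm16(frames: bytes):
    """De-interleave 16-bit stereo PCM into (left, right) mono byte strings."""
    samples = array("h")        # signed 16-bit samples
    samples.frombytes(frames)   # raises ValueError on a partial sample
    left = samples[0::2]        # even positions: channel 1
    right = samples[1::2]       # odd positions: channel 2
    return left.tobytes(), right.tobytes()


if __name__ == "__main__":
    # Two stereo frames: (L=1, R=-1), (L=2, R=-2), little-endian.
    chunk = bytes([1, 0, 255, 255, 2, 0, 254, 255])
    left, right = split_stereo_pcm16(chunk)
    print(len(left), len(right))
```

Because the samples are only reordered and never reinterpreted, the output keeps the byte order of the input.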
The examples above use turn detection settings optimized for short responses and rapid back-and-forth conversations. To optimize for your specific audio scenario, you can adjust the turn detection parameters.
For configuration examples tailored to different use cases, refer to our Configuration examples.
Modify the turn detection parameters in API_PARAMS: