
Universal-3 Pro Streaming: Message Sequence Breakdown

For a description of each message field, refer to our Turn object explanation.

Understanding transcript vs utterance

Before walking through the message sequence, it’s important to understand the difference between the transcript and utterance fields:

  • transcript — The full transcript of the current turn up to this point in time.
  • utterance — Only populated on the end_of_turn: true message, where it always equals transcript. On all other Turn messages, utterance is an empty string "".

Key takeaway: For Universal-3 Pro Streaming, you can always use transcript; the utterance field provides no additional information beyond what transcript already contains. The field exists for API consistency with Universal Streaming, where utterance boundaries can fire independently of turn boundaries, typically to support eager LLM inference.
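As a minimal sketch of this rule, a client can read transcript on every Turn message and treat utterance as redundant. The helper names turn_text and utterance_matches_rule are illustrative and not part of any SDK.

```python
def turn_text(message: dict) -> str:
    """Return the text of a Turn message; transcript is populated on every Turn."""
    return message.get("transcript", "")


def utterance_matches_rule(message: dict) -> bool:
    """Check the rule described above: utterance is "" on partials and
    equals transcript on the end_of_turn message."""
    if message.get("end_of_turn"):
        return message.get("utterance") == message.get("transcript")
    return message.get("utterance") == ""


# Trimmed-down versions of the two Turn messages from this walkthrough.
partial = {"type": "Turn", "end_of_turn": False,
           "transcript": "My name is\u2014", "utterance": ""}
final = {"type": "Turn", "end_of_turn": True,
         "transcript": "My name is Sonny.", "utterance": "My name is Sonny."}

for message in (partial, final):
    print(turn_text(message), utterance_matches_rule(message))
```

Reading only transcript keeps the client code identical for partial and final messages.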

Universal-3 Pro Streaming handles message sequences differently from Universal Streaming. Instead of emitting word-by-word partial transcripts as audio is processed, Universal-3 Pro Streaming produces stable, fully transcribed segments. Key differences include:

  • Early partial + silence-based partials — an early partial is emitted after 750ms of continuous speech to provide a fast transcript signal for barge-in and speculative inference. After that, additional partials are emitted when the speaker pauses.
  • Formatting is built in: turn_is_formatted is true on end-of-turn transcripts. There is no separate formatting step.
  • Punctuation-based turn detection — turns end when terminal punctuation (. ? !) is detected, not based on a confidence threshold.
  • end_of_turn_confidence is always 1 when triggered by terminal punctuation.
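The punctuation rule can be illustrated on the client side. The sketch below only mirrors the rule as stated in the list above; it is not the server's turn detection logic, and the names TERMINAL_PUNCTUATION and ends_with_terminal_punctuation are hypothetical.

```python
# The three terminal punctuation marks named in the list above.
TERMINAL_PUNCTUATION = (".", "?", "!")


def ends_with_terminal_punctuation(transcript: str) -> bool:
    """Return True if the transcript ends with '.', '?' or '!'
    (ignoring trailing whitespace)."""
    return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)


print(ends_with_terminal_punctuation("My name is Sonny."))
print(ends_with_terminal_punctuation("My name is\u2014"))
```

A check like this can be useful in tests or logging, for example to flag an end-of-turn transcript that does not end the way you expect.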

For this example, we walk through a user saying: My name is Sonny.

The speaker pauses briefly mid-sentence (after “is”) before 750ms of continuous speech has elapsed, so the first partial is a silence-based partial rather than an early partial. The speaker then finishes the sentence, producing a final end-of-turn transcript.

If the speaker had spoken continuously for 750ms or more without pausing, an early partial would have been emitted first. See Turn Detection and Partials for details on early partials.

Session initialization

When the session begins, you receive a Begin message with the session ID and expiration time.

{
  "type": "Begin",
  "id": "3207b601-2054-48df-ba77-8784dfcf9fb8",
  "expires_at": 1772570132
}
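A client might parse the Begin message along these lines. This is a sketch under one assumption that the page does not state explicitly: expires_at is treated as a Unix timestamp in seconds. The function name parse_begin is illustrative.

```python
import json
from datetime import datetime, timezone


def parse_begin(raw: str) -> dict:
    """Extract the session ID and expiration time from a Begin message."""
    message = json.loads(raw)
    if message.get("type") != "Begin":
        raise ValueError(f"expected a Begin message, got {message.get('type')!r}")
    # Assumption: expires_at is a Unix timestamp in seconds.
    expires_at = datetime.fromtimestamp(message["expires_at"], tz=timezone.utc)
    return {"session_id": message["id"], "expires_at": expires_at}


raw = ('{"type": "Begin", "id": "3207b601-2054-48df-ba77-8784dfcf9fb8", '
       '"expires_at": 1772570132}')
session = parse_begin(raw)
print(session["session_id"], session["expires_at"].isoformat())
```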

Speech detected

Before any Turn messages are sent, the server sends a SpeechStarted message indicating that speech has been detected. The timestamp field indicates when the speech was detected, in milliseconds relative to the beginning of the audio stream. The confidence field is the confidence score that speech has started.

SpeechStarted is only emitted when the model produces a transcript.

{
  "type": "SpeechStarted",
  "timestamp": 1216,
  "confidence": 0.987654
}
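One way to use this message is as a barge-in signal. The sketch below converts the timestamp to seconds and ignores low-confidence events; the min_confidence threshold of 0.5 and the function name speech_start_seconds are assumptions chosen for illustration, not values recommended by the API.

```python
from typing import Optional


def speech_start_seconds(message: dict, min_confidence: float = 0.5) -> Optional[float]:
    """Return the speech start offset in seconds, or None if the message is
    not a SpeechStarted event or falls below the (hypothetical) threshold."""
    if message.get("type") != "SpeechStarted":
        return None
    if message.get("confidence", 0.0) < min_confidence:
        return None
    # timestamp is in milliseconds relative to the start of the audio stream.
    return message["timestamp"] / 1000.0


event = {"type": "SpeechStarted", "timestamp": 1216, "confidence": 0.987654}
offset = speech_start_seconds(event)
if offset is not None:
    print(f"speech detected {offset:.3f}s into the stream")
```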

Partial transcript

The speaker says “My name is” and pauses briefly. Because the speaker has stopped talking but no terminal punctuation has been detected, Universal-3 Pro Streaming emits a partial transcript.

Notice that:

  • end_of_turn is false — the turn has not ended yet.
  • turn_is_formatted is false — this is not a finalized transcript.
  • end_of_turn_confidence is 0 — no terminal punctuation detected.
  • All words have word_is_final: false — the transcript may be revised in the final message.
  • The transcript ends with an em dash (—), indicating the utterance is incomplete.
  • The utterance field is an empty string because the turn has not ended. Use transcript to access the current partial text.

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "My name is—",
  "end_of_turn_confidence": 0,
  "words": [
    {
      "start": 1216,
      "end": 1627,
      "text": "My",
      "confidence": 0.956314,
      "word_is_final": false
    },
    {
      "start": 1668,
      "end": 2490,
      "text": "name",
      "confidence": 0.999393,
      "word_is_final": false
    },
    {
      "start": 2531,
      "end": 3067,
      "text": "is—",
      "confidence": 0.753325,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

Each silence period produces at most one partial. If the speaker continues pausing without finishing the sentence, no additional partial is emitted until new speech is detected.
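A client that shows partials to a user may want to detect them and hide the trailing em dash. The following is a minimal sketch based on the fields shown in the partial message above; the helper names are illustrative.

```python
EM_DASH = "\u2014"  # the marker that ends an incomplete partial transcript


def is_partial(message: dict) -> bool:
    """A Turn message is a partial while end_of_turn is false."""
    return message.get("type") == "Turn" and not message.get("end_of_turn", False)


def display_text(transcript: str) -> str:
    """Strip the trailing em dash (and surrounding whitespace) for display."""
    return transcript.rstrip().rstrip(EM_DASH).rstrip()


def all_words_provisional(message: dict) -> bool:
    """True if no word in the message is marked final yet."""
    return all(not word["word_is_final"] for word in message.get("words", []))


partial = {
    "type": "Turn",
    "end_of_turn": False,
    "transcript": "My name is\u2014",
    "words": [
        {"text": "My", "word_is_final": False},
        {"text": "name", "word_is_final": False},
        {"text": "is\u2014", "word_is_final": False},
    ],
}

if is_partial(partial):
    print(display_text(partial["transcript"]))
```

Because every word in a partial may still be revised, treat the displayed text as provisional until the end-of-turn message arrives.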

End of turn (Final transcript)

The speaker continues and says “Sonny.” — completing the sentence with a period. Universal-3 Pro Streaming detects the terminal punctuation and ends the turn with a fully formatted final transcript.

Notice how the final transcript differs from the partial:

  • end_of_turn is now true — the turn has ended.
  • turn_is_formatted is true — this is a finalized, formatted transcript.
  • end_of_turn_confidence is 1 — terminal punctuation triggered the end of turn.
  • All words now have word_is_final: true — the transcript is final and will not be revised.
  • The word timestamps and confidences have been refined compared to the partial.
  • The utterance field now contains the complete finalized text.
  • The incomplete “is—” from the partial has been resolved to “is” and “Sonny.” in the final transcript.

{
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "My name is Sonny.",
  "end_of_turn_confidence": 1,
  "words": [
    {
      "start": 1216,
      "end": 1635,
      "text": "My",
      "confidence": 0.956583,
      "word_is_final": true
    },
    {
      "start": 1676,
      "end": 2515,
      "text": "name",
      "confidence": 0.999199,
      "word_is_final": true
    },
    {
      "start": 2556,
      "end": 2975,
      "text": "is",
      "confidence": 0.999535,
      "word_is_final": true
    },
    {
      "start": 3016,
      "end": 4155,
      "text": "Sonny.",
      "confidence": 0.316031,
      "word_is_final": true
    }
  ],
  "utterance": "My name is Sonny.",
  "type": "Turn"
}

Unlike Universal Streaming, there is no separate formatting message. The end-of-turn transcript is always formatted.
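To see what changed between the partial and the final message, a client can compare the two words arrays. The sketch below pairs words by position, which is an assumption that holds for this example but is not guaranteed by anything on this page; word_changes is a hypothetical helper.

```python
def word_changes(partial_words: list, final_words: list):
    """Pair words by position and report which of start, end, and text changed.
    Returns (changes, added), where added lists words present only in the final."""
    changes = []
    for index, (before, after) in enumerate(zip(partial_words, final_words)):
        changed = {
            key: (before[key], after[key])
            for key in ("start", "end", "text")
            if before[key] != after[key]
        }
        if changed:
            changes.append((index, changed))
    added = [word["text"] for word in final_words[len(partial_words):]]
    return changes, added


# Word timings and text copied from the two Turn messages in this walkthrough.
partial_words = [
    {"start": 1216, "end": 1627, "text": "My"},
    {"start": 1668, "end": 2490, "text": "name"},
    {"start": 2531, "end": 3067, "text": "is\u2014"},
]
final_words = [
    {"start": 1216, "end": 1635, "text": "My"},
    {"start": 1676, "end": 2515, "text": "name"},
    {"start": 2556, "end": 2975, "text": "is"},
    {"start": 3016, "end": 4155, "text": "Sonny."},
]

changes, added = word_changes(partial_words, final_words)
for index, changed in changes:
    print(index, changed)
print("added:", added)
```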

Keep alive

KeepAlive messages are not required. By default, sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached.

KeepAlive is only relevant if you have configured the inactivity_timeout connection parameter, which closes the session after a period of no audio or messages being sent. If you are using inactivity_timeout and want to keep the session open during periods where no audio is being sent, send a KeepAlive message to reset the inactivity timer:

{ "type": "KeepAlive" }
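If you do use inactivity_timeout, the client needs some rule for when to send KeepAlive. The sketch below sends one once half of the timeout has elapsed without activity. The safety_factor of 0.5, the function name keepalive_due, and the choice to express both times in seconds are assumptions made for illustration.

```python
import json

# The KeepAlive payload shown above, serialized for sending over the socket.
KEEPALIVE_MESSAGE = json.dumps({"type": "KeepAlive"})


def keepalive_due(last_activity: float, now: float,
                  inactivity_timeout: float, safety_factor: float = 0.5) -> bool:
    """Return True once the idle time reaches a fraction of the timeout.
    All times are assumed to be in seconds."""
    return (now - last_activity) >= inactivity_timeout * safety_factor


# With a hypothetical 30-second timeout, a KeepAlive becomes due after 15 idle seconds.
print(keepalive_due(last_activity=0.0, now=10.0, inactivity_timeout=30.0))
print(keepalive_due(last_activity=0.0, now=15.0, inactivity_timeout=30.0))
```

Sending well before the timeout leaves room for network delay; the exact margin is a design choice for your application.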

Session termination

To end a session, the client must send a Terminate message. The server then responds with a Termination message containing the total audio and session durations, and closes the connection.

Client sends:

{ "type": "Terminate" }

Server responds:

{
  "type": "Termination",
  "audio_duration_seconds": 13,
  "session_duration_seconds": 13
}

Always terminate sessions explicitly. Sessions that are not terminated remain open and continue to accrue charges until the server auto-closes them after 3 hours (error code 3008). See Common errors for more details.
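One way to make explicit termination hard to forget is to send Terminate from a finally block. The sketch below uses a stand-in RecordingSocket class instead of a real WebSocket client, so the send method and the stream_session function are hypothetical; only the two message shapes come from this page.

```python
import json


class RecordingSocket:
    """Stand-in for a real WebSocket client; it only records what is sent."""

    def __init__(self):
        self.sent = []

    def send(self, payload):
        self.sent.append(payload)


def stream_session(socket, audio_chunks):
    """Send audio, then always send Terminate, even if streaming raises."""
    try:
        for chunk in audio_chunks:
            socket.send(chunk)
    finally:
        socket.send(json.dumps({"type": "Terminate"}))


def parse_termination(raw: str) -> dict:
    """Read the durations reported in the server's Termination message."""
    message = json.loads(raw)
    if message.get("type") != "Termination":
        raise ValueError(f"expected a Termination message, got {message.get('type')!r}")
    return {
        "audio_seconds": message["audio_duration_seconds"],
        "session_seconds": message["session_duration_seconds"],
    }


socket = RecordingSocket()
stream_session(socket, [b"\x00\x01", b"\x02\x03"])
print(socket.sent[-1])

totals = parse_termination(
    '{"type": "Termination", "audio_duration_seconds": 13, "session_duration_seconds": 13}'
)
print(totals)
```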

Summary

The complete message flow for this example is:

  1. Begin — session initialized
  2. SpeechStarted — speech detected at 1216ms
  3. Turn (partial) — speaker pauses mid-sentence; end_of_turn: false, turn_is_formatted: false
  4. Turn (final) — speaker finishes with terminal punctuation; end_of_turn: true, turn_is_formatted: true
  5. Termination — session ended

For more details on how partials work and how to tune turn detection timing, see Turn Detection and Partials.
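The flow in this summary can be sketched as a small dispatcher that labels each incoming server message by its type field. This is an illustrative outline with trimmed-down messages, not a complete client; the function name summarize is hypothetical.

```python
import json


def summarize(raw_messages):
    """Turn a sequence of raw server messages into short event labels."""
    events = []
    for raw in raw_messages:
        message = json.loads(raw)
        kind = message.get("type")
        if kind == "Begin":
            events.append("Begin")
        elif kind == "SpeechStarted":
            events.append(f"SpeechStarted at {message['timestamp']}ms")
        elif kind == "Turn":
            label = "final" if message["end_of_turn"] else "partial"
            events.append(f"Turn ({label}): {message['transcript']}")
        elif kind == "Termination":
            events.append("Termination")
        else:
            events.append(f"Unhandled: {kind}")
    return events


# Trimmed-down versions of the five server messages from this example.
raw_messages = [
    '{"type": "Begin", "id": "3207b601-2054-48df-ba77-8784dfcf9fb8", "expires_at": 1772570132}',
    '{"type": "SpeechStarted", "timestamp": 1216, "confidence": 0.987654}',
    '{"type": "Turn", "end_of_turn": false, "transcript": "My name is\\u2014"}',
    '{"type": "Turn", "end_of_turn": true, "transcript": "My name is Sonny."}',
    '{"type": "Termination", "audio_duration_seconds": 13, "session_duration_seconds": 13}',
]

for event in summarize(raw_messages):
    print(event)
```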

Comparison with Universal Streaming

| Behavior | Universal-3 Pro Streaming | Universal Streaming |
| --- | --- | --- |
| Partial frequency | One early partial after 750ms of continuous speech, plus at most one per silence period | Every audio frame (word-by-word) |
| Formatting | Built in to every end-of-turn transcript | Separate turn_is_formatted message when format_turns=true |
| Turn detection | Punctuation-based (min_turn_silence / max_turn_silence) | Confidence-based (end_of_turn_confidence_threshold) |
| end_of_turn_confidence | Always 1 when triggered by punctuation | Varies based on model confidence |
| Words in partials | All word_is_final: false | Mix of true and false as words are finalized incrementally |

For the Universal Streaming message sequence, see Message Sequence.