
Streaming API: Message Sequence Breakdown

For a description of each message field, refer to our Turn object explanation.

Understanding transcript vs utterance

Before walking through the message sequence, it’s important to understand the difference between the transcript and utterance fields:

  • transcript — The running transcript of the current turn. Updated with each Turn message, it accumulates finalized words progressively. Use this field for displaying live captions or getting the full turn text at any point.
  • utterance — Populated when the system detects a pause in speech (based on a combination of silence duration and model confidence). Contains the finalized text of the current utterance at that moment. On all other Turn messages, utterance is an empty string "". This enables eager LLM inference, allowing you to start processing text before the turn officially ends. Note that the message where utterance is populated may or may not also be an end_of_turn: true message, depending on whether the turn ends at the same time.

Key takeaway: To get the complete text of a turn, always read transcript. The utterance field is an optimization for low-latency voice agent pipelines — it lets you start processing text as soon as an utterance boundary is detected, before the turn officially ends.
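As a rough illustration of the distinction above, the following Python sketch pulls both fields out of a Turn message. The function name split_turn_text is hypothetical, the sample messages are trimmed down for illustration, and the message is assumed to be an already-parsed dict with the fields shown on this page.

```python
from __future__ import annotations

import json
from typing import Optional


def split_turn_text(message: dict) -> tuple[str, Optional[str]]:
    """Return (turn_text, eager_text) for a parsed Turn message.

    turn_text  -- the running transcript of the current turn.
    eager_text -- the utterance text if this message carries one, else None.
    """
    turn_text = message.get("transcript", "")
    # utterance is "" on every message except the one where a pause is detected.
    eager_text = message.get("utterance") or None
    return turn_text, eager_text


# Hypothetical, trimmed-down messages for illustration only.
partial = json.loads('{"type": "Turn", "transcript": "hi my", "utterance": ""}')
boundary = json.loads(
    '{"type": "Turn", "transcript": "hi my name is sonny",'
    ' "utterance": "Hi my name is sonny"}'
)

print(split_turn_text(partial))
print(split_turn_text(boundary))
```

A caller would typically render turn_text as a live caption on every message and hand eager_text to downstream processing only when it is not None.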

For this example, we are going to walk through a user saying: Hi my name is Sonny. I am a voice agent.

Session initialization

When the session begins, the server sends a Begin message containing the session ID and the expiration time.

{
  "type": "Begin",
  "id": "de5d9927-73a6-4be8-b52d-b4c07be37e6b",
  "expires_at": 1759796682
}
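A minimal sketch of reading this message in Python follows. The helper name parse_begin is hypothetical, and the code assumes that expires_at is a Unix timestamp in seconds, which the sample value suggests but this page does not state.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone


def parse_begin(raw: str) -> tuple[str, datetime]:
    """Extract the session ID and expiry time from a Begin message."""
    message = json.loads(raw)
    if message.get("type") != "Begin":
        raise ValueError(f"expected a Begin message, got {message.get('type')!r}")
    # Assumption: expires_at is a Unix timestamp in seconds (UTC).
    expires_at = datetime.fromtimestamp(message["expires_at"], tz=timezone.utc)
    return message["id"], expires_at


raw_begin = (
    '{"type": "Begin", "id": "de5d9927-73a6-4be8-b52d-b4c07be37e6b",'
    ' "expires_at": 1759796682}'
)
session_id, expires_at = parse_begin(raw_begin)
print(session_id, expires_at.isoformat())
```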

Initial utterance and turn

Start of utterance (Partial transcript), and Start of turn

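The user begins by saying: Hi my name is Sonny. Each Turn message finalizes more words as the audio comes in. Words marked word_is_final: true appear in the transcript field, while the newest word is still a non-final partial and may change on the next message (here, "son" becomes "sonny").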

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "",
  "end_of_turn_confidence": 0.017454,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi",
  "end_of_turn_confidence": 1.4e-5,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi my",
  "end_of_turn_confidence": 0.001037,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi my name",
  "end_of_turn_confidence": 0.001456,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi my name is",
  "end_of_turn_confidence": 0.00545,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": true
    },
    {
      "start": 3600,
      "end": 3680,
      "text": "son",
      "confidence": 0.701546,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi my name is",
  "end_of_turn_confidence": 0.057058,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": true
    },
    {
      "start": 3600,
      "end": 3840,
      "text": "sonny",
      "confidence": 0.822963,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}
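In the sample messages above, transcript matches the text of the finalized words joined by spaces, and the last word is the one still subject to revision. The Python sketch below separates the two groups. This relationship is an observation about these samples, not a documented guarantee, and the helper names finalized_text and pending_words are hypothetical.

```python
from __future__ import annotations


def finalized_text(words: list[dict]) -> str:
    """Join the text of the words marked final, in order."""
    return " ".join(w["text"] for w in words if w["word_is_final"])


def pending_words(words: list[dict]) -> list[str]:
    """Return the text of words that may still be revised."""
    return [w["text"] for w in words if not w["word_is_final"]]


# Words from the fifth message above (timings and confidences omitted).
words = [
    {"text": "hi", "word_is_final": True},
    {"text": "my", "word_is_final": True},
    {"text": "name", "word_is_final": True},
    {"text": "is", "word_is_final": True},
    {"text": "son", "word_is_final": False},
]

print(finalized_text(words))
print(pending_words(words))
```

A captioning UI could, for example, render the pending words in a lighter style so viewers can see which text is still unstable.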

End of utterance (Final transcript), and End of turn

When the utterance is complete, the turn may or may not be predicted to end (this depends entirely on the end_of_turn_confidence value). Since the utterance is complete, we populate the utterance field with its text so you can process it for pre-emptive generation. This is especially useful for getting the fastest possible response from our streaming STT model.

In this case, the turn also ends ("end_of_turn": true) because end_of_turn_confidence is greater than 0.5. If the confidence had stayed below 0.5, the turn would end on a later message; see the End of utterance (Final Transcript), not End of turn section below.

Because the utterance end and turn end occur on the same message here, both utterance and transcript contain the finalized text. When the utterance ends before the turn (see below), utterance is populated on the earlier message and will be empty on the subsequent end_of_turn: true message.

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": true,
  "transcript": "hi my name is sonny",
  "end_of_turn_confidence": 0.5005,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": true
    },
    {
      "start": 3600,
      "end": 3840,
      "text": "sonny",
      "confidence": 0.822963,
      "word_is_final": true
    }
  ],
  "utterance": "Hi my name is sonny",
  "type": "Turn"
}

End of turn formatting

Once the turn ends, if formatting is enabled, the server sends one additional message for the same turn with "turn_is_formatted": true. It carries the formatted (punctuated and capitalized) version of the final transcript.

Since LLMs can parse unformatted text, waiting for this message is NOT recommended for voice agents as it adds additional latency. It is recommended for other use cases like closed captioning where the transcript will be displayed to the end user.

{
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "Hi, my name is Sonny.",
  "end_of_turn_confidence": 0.5005,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "Hi,",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": true
    },
    {
      "start": 3600,
      "end": 3840,
      "text": "Sonny.",
      "confidence": 0.822963,
      "word_is_final": true
    }
  ],
  "utterance": "",
  "type": "Turn"
}
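For a captioning display, one way to handle the formatted message is to key the displayed text by turn_order, so the formatted transcript simply replaces the unformatted one for the same turn. The Python sketch below shows that idea. The CaptionLog class is hypothetical, and it assumes that all messages for a turn share the same turn_order, as they do in the samples on this page.

```python
from __future__ import annotations


class CaptionLog:
    """Keeps one line of text per turn; later messages overwrite earlier ones."""

    def __init__(self) -> None:
        self._lines: dict[int, str] = {}

    def update(self, message: dict) -> None:
        if message.get("type") != "Turn":
            return
        # The formatted message arrives last for a turn, so its transcript
        # ends up replacing the unformatted text stored under the same key.
        self._lines[message["turn_order"]] = message["transcript"]

    def render(self) -> str:
        return "\n".join(self._lines[k] for k in sorted(self._lines))


log = CaptionLog()
# Trimmed-down versions of the two end-of-turn messages shown above.
log.update({"type": "Turn", "turn_order": 0, "turn_is_formatted": False,
            "transcript": "hi my name is sonny"})
log.update({"type": "Turn", "turn_order": 0, "turn_is_formatted": True,
            "transcript": "Hi, my name is Sonny."})
print(log.render())
```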

Additional utterance(s) and turn(s)

Start of a new utterance (Partial transcript), and Start of a new turn

In this case, the user is not actually done speaking yet. Normally you would have a VAD set up to cancel the end of turn above and let the user keep speaking even though end of turn was predicted. The user now continues with: I am a voice agent.

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "",
  "end_of_turn_confidence": 7.1e-5,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i",
  "end_of_turn_confidence": 0.001201,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am",
  "end_of_turn_confidence": 1e-5,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am a",
  "end_of_turn_confidence": 0.00727,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am a voice",
  "end_of_turn_confidence": 0.000474,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6000,
      "text": "agen",
      "confidence": 0.955449,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am a voice",
  "end_of_turn_confidence": 0.332227,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6080,
      "text": "agent",
      "confidence": 0.970552,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

End of utterance (Final Transcript), not End of turn

In many cases, the turn may not end but we will have finalized an utterance. For this reason, we supply the utterance field to finalize each utterance as fast as possible for downstream LLM pre-emptive generation.

If you’re using a voice agent provider like LiveKit or Pipecat, this logic is pre-configured for you with the fastest performance settings.

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am a voice",
  "end_of_turn_confidence": 0.454477,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6080,
      "text": "agent",
      "confidence": 0.970552,
      "word_is_final": false
    }
  ],
  "utterance": "I am a voice agent.",
  "type": "Turn"
}

End of turn, after End of utterance

Once end_of_turn_confidence passes 0.5, the turn ends and the message carries end_of_turn: true.

Notice that utterance is an empty string in this message. The utterance field was already populated on the previous message when the utterance boundary was first detected. On the end_of_turn: true message, use transcript to get the complete turn text.

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": true,
  "transcript": "i am a voice agent",
  "end_of_turn_confidence": 0.750889,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6080,
      "text": "agent",
      "confidence": 0.970552,
      "word_is_final": true
    }
  ],
  "utterance": "",
  "type": "Turn"
}
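Both orderings seen so far (utterance and end of turn on the same message, or utterance first and end of turn later) can be handled by one small state machine. The Python sketch below is one possible shape for it, not a prescribed implementation. EagerTurnHandler and its start_llm and commit callbacks are hypothetical names, and the messages are trimmed-down versions of the samples above.

```python
from __future__ import annotations

from typing import Callable, Optional


class EagerTurnHandler:
    """Starts LLM work when utterance appears, commits it on end_of_turn.

    start_llm and commit are hypothetical callbacks supplied by the caller.
    """

    def __init__(self, start_llm: Callable[[int, str], None],
                 commit: Callable[[int, str], None]) -> None:
        self._start_llm = start_llm
        self._commit = commit
        self._started_for: Optional[int] = None  # turn_order with work in flight

    def on_message(self, message: dict) -> None:
        if message.get("type") != "Turn" or message.get("turn_is_formatted"):
            return  # skip non-Turn messages and the optional formatted repeat
        turn = message["turn_order"]
        if message["utterance"] and self._started_for != turn:
            self._start_llm(turn, message["utterance"])
            self._started_for = turn
        if message["end_of_turn"]:
            if self._started_for != turn:
                # No utterance was seen for this turn; fall back to transcript.
                self._start_llm(turn, message["transcript"])
            self._commit(turn, message["transcript"])
            self._started_for = None


events: list[tuple] = []
handler = EagerTurnHandler(
    start_llm=lambda turn, text: events.append(("start", turn, text)),
    commit=lambda turn, text: events.append(("commit", turn, text)),
)
for msg in [
    {"type": "Turn", "turn_order": 0, "turn_is_formatted": False, "end_of_turn": True,
     "transcript": "hi my name is sonny", "utterance": "Hi my name is sonny"},
    {"type": "Turn", "turn_order": 1, "turn_is_formatted": False, "end_of_turn": False,
     "transcript": "i am a voice", "utterance": "I am a voice agent."},
    {"type": "Turn", "turn_order": 1, "turn_is_formatted": False, "end_of_turn": True,
     "transcript": "i am a voice agent", "utterance": ""},
]:
    handler.on_message(msg)
print(events)
```

In a real agent, start_llm would kick off speculative generation and commit would release (or regenerate) the response once the turn is confirmed; how to reconcile the two texts is left to the application.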

End of turn formatting

Lastly, the transcript is formatted if formatting is enabled. Again, this is only recommended for use cases where the transcript will be shown to the end user, since it adds latency that voice agents should avoid.

{
  "turn_order": 1,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "I am a voice agent.",
  "end_of_turn_confidence": 0.750889,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "I",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6080,
      "text": "agent.",
      "confidence": 0.970552,
      "word_is_final": true
    }
  ],
  "utterance": "",
  "type": "Turn"
}

Best practices

Voice agents

  • Read the utterance field as soon as it is populated for the fastest STT model response. This is particularly useful if you are using an external turn detection model. You can read more in the guide on configuring for third-party turn detection.
  • Avoid using format_turns, as it will significantly increase latency. LLMs don’t need formatting, so you can pass raw text as soon as it’s ready.
  • Use the end_of_turn parameter to determine when a user is likely to have ended their turn. Combined with a VAD, you can determine whether the user is continuing to speak or the turn has ended and your voice agent can safely respond.
    • Note that the AssemblyAI model has a built-in VAD. If you modify min_turn_silence, latency will increase, but you can then use the model in place of a separate VAD entirely.

Notetakers and Closed captioning

  • Use format_turns, since the text will be displayed to the end user and formatting makes it human-readable. The added latency is less noticeable to the human eye in these applications.
  • Use keyterms_prompt to increase the accuracy of rare words and entities. Since the text will be shown to the end user, this improves their perception of accuracy.
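The two profiles above can be captured as a small settings helper. The Python sketch below is illustrative only: the parameter names format_turns and keyterms_prompt come from this page, but passing them as URL query parameters, encoding keyterms_prompt as a JSON list, and the placeholder endpoint are all assumptions that should be checked against the API reference.

```python
from __future__ import annotations

import json
from typing import Optional
from urllib.parse import urlencode

# Placeholder only; substitute the real streaming endpoint from the API reference.
PLACEHOLDER_URL = "wss://example.invalid/ws"


def session_params(use_case: str, keyterms: Optional[list[str]] = None) -> dict[str, str]:
    """Return connection settings for a hypothetical use-case profile."""
    if use_case == "voice_agent":
        # Raw text is enough for an LLM, so skip formatting to save latency.
        return {"format_turns": "false"}
    if use_case == "captions":
        params = {"format_turns": "true"}
        if keyterms:
            # Assumption: the term list is sent as a JSON-encoded array.
            params["keyterms_prompt"] = json.dumps(keyterms)
        return params
    raise ValueError(f"unknown use case: {use_case!r}")


print(PLACEHOLDER_URL + "?" + urlencode(session_params("voice_agent")))
print(PLACEHOLDER_URL + "?" + urlencode(session_params("captions", ["Sonny"])))
```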

Session termination

When the session ends (either by sending a Terminate message or when the session expires), the server sends a Termination message with the total audio and session durations.

{
  "type": "Termination",
  "audio_duration_seconds": 7,
  "session_duration_seconds": 7
}

After receiving the Termination message, no further messages will be sent and the WebSocket connection will be closed.
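Putting the message types together, a receive loop can dispatch on the type field and stop once Termination arrives. The Python sketch below shows only the dispatch logic on raw JSON strings, with no WebSocket client involved. The function name handle_message is hypothetical, and the exact JSON shape of the client's Terminate message is an assumption, since this page names the message but does not show it.

```python
import json

# Assumption: the client-side Terminate message has this shape.
TERMINATE_REQUEST = json.dumps({"type": "Terminate"})


def handle_message(raw: str) -> bool:
    """Process one server message; return False once the session is over."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "Begin":
        print("session started:", message["id"])
    elif kind == "Turn":
        print("turn", message["turn_order"], "->", message["transcript"])
    elif kind == "Termination":
        print("audio seconds:", message["audio_duration_seconds"],
              "session seconds:", message["session_duration_seconds"])
        return False  # no further messages will arrive
    return True


incoming = [
    '{"type": "Begin", "id": "de5d9927-73a6-4be8-b52d-b4c07be37e6b", "expires_at": 1759796682}',
    '{"type": "Turn", "turn_order": 0, "transcript": "hi"}',
    '{"type": "Termination", "audio_duration_seconds": 7, "session_duration_seconds": 7}',
]
results = [handle_message(raw) for raw in incoming]
print(results)
```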