Streaming API: Message Sequence Breakdown

For a description of each message field, refer to our Turn object explanation.

For this example, we will walk through a user saying: "Hi my name is Sonny. I am a voice agent."

Session initialization

When the session begins, you will receive a message containing the session ID and expiration time.

{
  "type": "Begin",
  "id": "de5d9927-73a6-4be8-b52d-b4c07be37e6b",
  "expires_at": 1759796682
}
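
As a minimal sketch, the Begin message could be handled like this in Python. It assumes expires_at is a Unix timestamp in seconds (the value above is consistent with that, but the unit is not stated here), and parse_begin is an illustrative helper name, not part of any SDK.

```python
import json
from datetime import datetime, timezone

def parse_begin(raw: str) -> tuple[str, datetime]:
    """Return (session_id, expiry) from a Begin message."""
    msg = json.loads(raw)
    if msg.get("type") != "Begin":
        raise ValueError(f"expected a Begin message, got {msg.get('type')!r}")
    # Assumption: expires_at is seconds since the Unix epoch.
    expiry = datetime.fromtimestamp(msg["expires_at"], tz=timezone.utc)
    return msg["id"], expiry

raw = '{"type": "Begin", "id": "de5d9927-73a6-4be8-b52d-b4c07be37e6b", "expires_at": 1759796682}'
session_id, expiry = parse_begin(raw)
print(session_id, expiry.isoformat())
```

Keeping the expiry as a timezone-aware datetime makes it straightforward to compare against the current time before sending more audio.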

Initial utterance and turn

Start of utterance (Partial transcript), and Start of turn

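The user begins saying "Hi my name is Sonny". Each message finalizes more words as audio comes in: the newest word in the words array arrives with word_is_final set to false, and once it is finalized it is appended to the transcript field. A non-final word can still change, as when son becomes sonny below.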

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "",
  "end_of_turn_confidence": 0.017454,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi",
  "end_of_turn_confidence": 1.4e-05,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi my",
  "end_of_turn_confidence": 0.001037,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi my name",
  "end_of_turn_confidence": 0.001456,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi my name is",
  "end_of_turn_confidence": 0.00545,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": true
    },
    {
      "start": 3600,
      "end": 3680,
      "text": "son",
      "confidence": 0.701546,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "hi my name is",
  "end_of_turn_confidence": 0.057058,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": true
    },
    {
      "start": 3600,
      "end": 3840,
      "text": "sonny",
      "confidence": 0.822963,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}
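
To make the relationship between words and transcript concrete, here is a small Python sketch that separates finalized words from the pending one. The helper name split_words is illustrative, and the check that transcript equals the finalized words joined by spaces is an observation from the sample messages above rather than a documented guarantee.

```python
def split_words(turn: dict) -> tuple[list[str], list[str]]:
    """Return (finalized, pending) word texts from a Turn message."""
    final = [w["text"] for w in turn["words"] if w["word_is_final"]]
    pending = [w["text"] for w in turn["words"] if not w["word_is_final"]]
    return final, pending

# Trimmed copy of the third message above (only the fields used here).
turn = {
    "type": "Turn",
    "transcript": "hi my",
    "words": [
        {"text": "hi", "word_is_final": True},
        {"text": "my", "word_is_final": True},
        {"text": "name", "word_is_final": False},
    ],
}

final, pending = split_words(turn)
# Observed in the samples: transcript matches the finalized words.
assert " ".join(final) == turn["transcript"]
print("final:", final, "pending:", pending)
```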

End of utterance (Final transcript), and End of turn

When the utterance is complete, the turn may or may not be predicted to end, depending on the end_of_turn_confidence value. Because the utterance is complete, its text is pushed to the utterance key so you can process it for pre-emptive generation. This is especially useful when you want the fastest possible response from our streaming STT model.

In this case, the turn also ends ("end_of_turn": true) since end_of_turn_confidence is greater than 0.5. If the confidence were lower, the turn would end later; see the "End of utterance (Final Transcript), not End of turn" section below.

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": true,
  "transcript": "hi my name is sonny",
  "end_of_turn_confidence": 0.5005,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "hi",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": true
    },
    {
      "start": 3600,
      "end": 3840,
      "text": "sonny",
      "confidence": 0.822963,
      "word_is_final": true
    }
  ],
  "utterance": "Hi my name is sonny",
  "type": "Turn"
}
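
One way to act on this message is sketched below in Python. The two callbacks, start_generation and commit_turn, are hypothetical hooks into your own application, and on_turn is an illustrative name; only the message fields come from the samples above.

```python
from typing import Callable

def on_turn(msg: dict,
            start_generation: Callable[[str], None],
            commit_turn: Callable[[int, str], None]) -> None:
    """Dispatch one Turn message to two hypothetical callbacks."""
    if msg.get("type") != "Turn":
        return
    # A non-empty utterance means the utterance is final: start the LLM early.
    if msg.get("utterance"):
        start_generation(msg["utterance"])
    # The unformatted end-of-turn message closes the turn.
    if msg.get("end_of_turn") and not msg.get("turn_is_formatted"):
        commit_turn(msg["turn_order"], msg["transcript"])

events = []
msg = {
    "type": "Turn",
    "turn_order": 0,
    "turn_is_formatted": False,
    "end_of_turn": True,
    "transcript": "hi my name is sonny",
    "utterance": "Hi my name is sonny",
}
on_turn(msg,
        lambda text: events.append(("generate", text)),
        lambda order, text: events.append(("commit", order, text)))
print(events)
```

Here both callbacks fire on the same message, because the utterance and the turn end together.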

End of turn formatting

Once the turn ends, if formatting is enabled, an additional message arrives with "turn_is_formatted": true. It carries the formatted version of the final transcript.

Since LLMs can parse unformatted text, waiting for this message is NOT recommended for voice agents, as it adds latency. It is recommended for other use cases, like closed captioning, where the transcript will be displayed to the end user.

{
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "Hi, my name is Sonny.",
  "end_of_turn_confidence": 0.5005,
  "words": [
    {
      "start": 1920,
      "end": 2000,
      "text": "Hi,",
      "confidence": 0.874618,
      "word_is_final": true
    },
    {
      "start": 2800,
      "end": 2880,
      "text": "my",
      "confidence": 0.812831,
      "word_is_final": true
    },
    {
      "start": 2960,
      "end": 3040,
      "text": "name",
      "confidence": 0.999999,
      "word_is_final": true
    },
    {
      "start": 3040,
      "end": 3120,
      "text": "is",
      "confidence": 0.795226,
      "word_is_final": true
    },
    {
      "start": 3600,
      "end": 3840,
      "text": "Sonny.",
      "confidence": 0.822963,
      "word_is_final": true
    }
  ],
  "utterance": "",
  "type": "Turn"
}
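
Because each turn can produce two end-of-turn messages when formatting is enabled, a client needs to decide which one to act on. The following Python sketch shows one possible rule; is_final_for and wait_for_formatting are illustrative names, not API parameters.

```python
def is_final_for(msg: dict, wait_for_formatting: bool) -> bool:
    """Decide whether this message is the one to act on for a finished turn.

    wait_for_formatting=False suits voice agents (act on the first,
    unformatted end-of-turn message); True suits captioning.
    """
    if msg.get("type") != "Turn" or not msg.get("end_of_turn"):
        return False
    return bool(msg.get("turn_is_formatted")) == wait_for_formatting

unformatted = {"type": "Turn", "end_of_turn": True, "turn_is_formatted": False,
               "transcript": "hi my name is sonny"}
formatted = {"type": "Turn", "end_of_turn": True, "turn_is_formatted": True,
             "transcript": "Hi, my name is Sonny."}

for label, wait in (("voice agent", False), ("captions", True)):
    chosen = [m["transcript"] for m in (unformatted, formatted) if is_final_for(m, wait)]
    print(label, chosen)
```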

Additional utterance(s) and turn(s)

Start of a new utterance (Partial transcript), and Start of a new turn

In this case, the user is not actually done speaking. Normally you would have a VAD set up to cancel the end of turn above and let the user keep speaking even though End of Turn was predicted. The user now continues with "I am a voice agent."

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "",
  "end_of_turn_confidence": 7.1e-05,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i",
  "end_of_turn_confidence": 0.001201,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am",
  "end_of_turn_confidence": 1e-05,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am a",
  "end_of_turn_confidence": 0.00727,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am a voice",
  "end_of_turn_confidence": 0.000474,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6000,
      "text": "agen",
      "confidence": 0.955449,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am a voice",
  "end_of_turn_confidence": 0.332227,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6080,
      "text": "agent",
      "confidence": 0.970552,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

End of utterance (Final Transcript), not End of turn

In many cases, an utterance is finalized before the turn ends. For this reason, we supply the utterance field so that each utterance is available as early as possible for downstream LLM pre-emptive generation.

If you’re using a voice agent provider like LiveKit or Pipecat, this logic is pre-configured for you with the fastest performance settings.

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "i am a voice",
  "end_of_turn_confidence": 0.454477,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6080,
      "text": "agent",
      "confidence": 0.970552,
      "word_is_final": false
    }
  ],
  "utterance": "I am a voice agent.",
  "type": "Turn"
}
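
When the utterance arrives before the turn ends, a client may want to hold on to it until the turn is confirmed. The Python sketch below shows one way to do that; UtteranceTracker is a hypothetical helper, and keying pending utterances by turn_order is a design assumption, not documented behavior.

```python
from typing import Optional

class UtteranceTracker:
    """Remember the early utterance of each turn until that turn ends."""

    def __init__(self) -> None:
        self.pending: dict[int, str] = {}

    def feed(self, msg: dict) -> Optional[str]:
        """Return the stored utterance once its turn ends, else None."""
        order = msg["turn_order"]
        if msg.get("utterance"):
            self.pending[order] = msg["utterance"]
        if msg.get("end_of_turn"):
            # pop() so a later formatted message for the same turn yields None.
            return self.pending.pop(order, None)
        return None

tracker = UtteranceTracker()
early = {"turn_order": 1, "end_of_turn": False, "utterance": "I am a voice agent."}
ended = {"turn_order": 1, "end_of_turn": True, "utterance": ""}

first = tracker.feed(early)   # utterance stored, turn still open
second = tracker.feed(ended)  # turn closed, stored utterance released
print(first, second)
```

In a voice agent you would typically start generation when the utterance is stored and only play the response once feed returns it.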

End of turn, after End of utterance

Once end_of_turn_confidence passes 0.5, the turn ends.

{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": true,
  "transcript": "i am a voice agent",
  "end_of_turn_confidence": 0.750889,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "i",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6080,
      "text": "agent",
      "confidence": 0.970552,
      "word_is_final": true
    }
  ],
  "utterance": "",
  "type": "Turn"
}
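
As a worked illustration, the Python sketch below replays the end_of_turn_confidence values of the unformatted turn 1 messages and finds the first one above the 0.5 threshold mentioned in the text. The function first_crossing is illustrative only; in practice the server sets end_of_turn for you, so a client should not need to recompute it.

```python
from typing import Optional, Sequence

def first_crossing(confidences: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first confidence strictly above the threshold, or None."""
    for i, value in enumerate(confidences):
        if value > threshold:
            return i
    return None

# end_of_turn_confidence of the eight unformatted turn 1 messages, in order.
turn_1 = [7.1e-05, 0.001201, 1e-05, 0.00727, 0.000474, 0.332227, 0.454477, 0.750889]

index = first_crossing(turn_1)
print(index, turn_1[index])
```

The crossing falls on the last message, which is the one that carries end_of_turn set to true.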

End of turn formatting

Lastly, the transcript is formatted if formatting is enabled. Again, this is only recommended for use cases where the transcript will be shown to the end user, as it adds some latency for voice agents.

{
  "turn_order": 1,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "I am a voice agent.",
  "end_of_turn_confidence": 0.750889,
  "words": [
    {
      "start": 5200,
      "end": 5280,
      "text": "I",
      "confidence": 0.561945,
      "word_is_final": true
    },
    {
      "start": 5440,
      "end": 5520,
      "text": "am",
      "confidence": 0.804111,
      "word_is_final": true
    },
    {
      "start": 5600,
      "end": 5680,
      "text": "a",
      "confidence": 0.998217,
      "word_is_final": true
    },
    {
      "start": 5680,
      "end": 5760,
      "text": "voice",
      "confidence": 0.910144,
      "word_is_final": true
    },
    {
      "start": 5920,
      "end": 6080,
      "text": "agent.",
      "confidence": 0.970552,
      "word_is_final": true
    }
  ],
  "utterance": "",
  "type": "Turn"
}
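
For display use cases, one simple approach is to keep the latest transcript per turn and let the formatted message overwrite the unformatted one. The Python sketch below assumes messages arrive in order and uses turn_order as the key; render_captions is an illustrative helper name.

```python
def render_captions(messages: list[dict]) -> str:
    """Join the most recent transcript of each turn, in turn order."""
    latest: dict[int, str] = {}
    for msg in messages:
        if msg.get("type") != "Turn":
            continue
        # Later messages for the same turn replace earlier ones,
        # so the formatted transcript wins once it arrives.
        latest[msg["turn_order"]] = msg["transcript"]
    return " ".join(latest[k] for k in sorted(latest) if latest[k])

# Trimmed copies of the four end-of-turn messages from this walkthrough.
messages = [
    {"type": "Turn", "turn_order": 0, "transcript": "hi my name is sonny"},
    {"type": "Turn", "turn_order": 0, "transcript": "Hi, my name is Sonny."},
    {"type": "Turn", "turn_order": 1, "transcript": "i am a voice agent"},
    {"type": "Turn", "turn_order": 1, "transcript": "I am a voice agent."},
]

captions = render_captions(messages)
print(captions)
```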

Best practices

Voice agents

  • Grab the utterance field as soon as it is available for the fastest STT model response. This is particularly useful if you are using an external Turn Detection model. You can read more about configuring for third-party turn detection here.
  • Avoid using format_turns, as it will significantly increase latency. LLMs don’t need formatting, so you can pass raw text as soon as it’s ready.
  • Use the end_of_turn field to determine when a user has likely finished their turn. Combined with a VAD, you can determine whether the user is continuing to speak or the turn has ended and your voice agent can safely respond.
    • Note that the AssemblyAI model has a built-in VAD. If you modify min_end_of_turn_silence_when_confident, latency will increase, but the model can then take the place of a separate VAD entirely.
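
The end_of_turn plus VAD check described above can be sketched in Python as follows. The vad_speech_active flag is a hypothetical boolean supplied by your own VAD, and agent_may_speak is an illustrative name; neither is part of the API.

```python
def agent_may_speak(msg: dict, vad_speech_active: bool) -> bool:
    """Let the agent respond only when the turn ended and the VAD hears no speech."""
    return bool(msg.get("end_of_turn")) and not vad_speech_active

ended = {"type": "Turn", "end_of_turn": True}
open_turn = {"type": "Turn", "end_of_turn": False}

print(agent_may_speak(ended, vad_speech_active=False))      # turn over, silence
print(agent_may_speak(ended, vad_speech_active=True))       # user resumed speaking
print(agent_may_speak(open_turn, vad_speech_active=False))  # turn still open
```

This mirrors the walkthrough, where the user kept talking after the first predicted end of turn.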

Notetakers and Closed captioning

  • Use format_turns, since text will be displayed to the end user and formatting makes it human-readable. The added latency will be less noticeable to the human eye in these applications.
  • Use keyterms_prompt to increase the accuracy of rare words and entities. Again, since the text will be shown to the end user, this will improve their perception of accuracy.
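
As a rough sketch, these two settings could be assembled into connection query parameters in Python as shown below. Several details are assumptions: the base URL is a placeholder rather than the real endpoint, and both the lowercase "true" value and the JSON-encoded list for keyterms_prompt are guesses at the encoding, so check the API reference before relying on them.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder only; substitute the real streaming endpoint.
BASE_URL = "wss://streaming.example.invalid/ws"

def build_url(keyterms: list[str]) -> str:
    """Build a connection URL with formatting and key terms enabled."""
    params = {
        "format_turns": "true",                   # assumed boolean encoding
        "keyterms_prompt": json.dumps(keyterms),  # assumed list encoding
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_url(["Sonny", "AssemblyAI"])
print(url)
print(parse_qs(urlsplit(url).query))
```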