Streaming Audio

Quickstart

In this quick guide you will learn how to use AssemblyAI’s Streaming Speech-to-Text feature to transcribe audio from your microphone.

To run this quickstart you will need:

  • Python installed
  • A valid AssemblyAI API key

To run the quickstart:

1. Create a new Python file (for example, main.py) and paste the code provided below into it.

2. Insert your API key on line 17, replacing <YOUR_API_KEY>.

3. Install the necessary libraries:

```bash
pip install assemblyai pyaudio
```

4. Run the script with python main.py.

```python
import logging
from typing import Type

import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    StreamingSessionParameters,
    TerminationEvent,
    TurnEvent,
)

api_key = "<YOUR_API_KEY>"

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def on_begin(self: Type[StreamingClient], event: BeginEvent):
    print(f"Session started: {event.id}")

def on_turn(self: Type[StreamingClient], event: TurnEvent):
    print(f"{event.transcript} ({event.end_of_turn})")

    if event.end_of_turn and not event.turn_is_formatted:
        params = StreamingSessionParameters(
            format_turns=True,
        )

        self.set_params(params)

def on_terminated(self: Type[StreamingClient], event: TerminationEvent):
    print(
        f"Session terminated: {event.audio_duration_seconds} seconds of audio processed"
    )

def on_error(self: Type[StreamingClient], error: StreamingError):
    print(f"Error occurred: {error}")

def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=api_key,
            api_host="streaming.assemblyai.com",
        )
    )

    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)

    client.connect(
        StreamingParameters(
            sample_rate=16000,
            format_turns=True,
        )
    )

    try:
        client.stream(
            aai.extras.MicrophoneStream(sample_rate=16000)
        )
    finally:
        client.disconnect(terminate=True)

if __name__ == "__main__":
    main()
```
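If everything is set up correctly, you should see "Session started" followed by transcripts printed as you speak, each suffixed with its end_of_turn flag: (False) while the turn is in progress, then (True) once the turn ends.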

Core concepts

For a message-by-message breakdown of a turn, see our Streaming API: Message Sequence Breakdown guide.

Universal-Streaming is built upon two core concepts: Turn objects and immutable transcriptions.

Turn object

A Turn object is intended to correspond to a speaking turn in the context of voice agent applications; in a broader context, it roughly corresponds to an utterance. We assign a unique ID to each Turn object, which is included in our response. Specifically, the Universal-Streaming response is formatted as follows:

```json
{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "modern medicine is",
  "end_of_turn_confidence": 0.7,
  "words": [
    { "text": "modern", "word_is_final": true, ... },
    { "text": "medicine", "word_is_final": true, ... },
    { "text": "is", "word_is_final": true, ... },
    { "text": "amazing", "word_is_final": false, ... }
  ]
}
```
  • turn_order: Integer that increments with each new turn
  • turn_is_formatted: Boolean indicating whether the text in the transcript field is formatted. Text formatting is enabled when format_turns is set to true. It adds punctuation, applies casing, and performs inverse text normalization to display entities such as dates, times, and phone numbers in a human-friendly format
  • end_of_turn: Boolean indicating if this is the end of the current turn
  • transcript: String containing only finalized words
  • end_of_turn_confidence: Floating-point number (0–1) representing the confidence that the current turn has finished, i.e., that the current speaker has completed their turn
  • words: List of Word objects with individual metadata

Each Word object in the words array includes:

  • text: The string representation of the word
  • word_is_final: Boolean indicating if the word is finalized, where a finalized word means the word won’t be altered in future transcription responses
  • start: Timestamp for word start
  • end: Timestamp for word end
  • confidence: Confidence score for the word
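To make the structure concrete, here is a minimal sketch of extracting the finalized words from a Turn payload, assuming it has already been parsed from JSON into a Python dict. Because the transcript field contains only finalized words, joining the final words reproduces it:

```python
def finalized_text(turn: dict) -> str:
    """Join only the words the model has marked final."""
    return " ".join(w["text"] for w in turn["words"] if w["word_is_final"])

# The example payload from above, abbreviated to the fields used here.
turn = {
    "transcript": "modern medicine is",
    "words": [
        {"text": "modern", "word_is_final": True},
        {"text": "medicine", "word_is_final": True},
        {"text": "is", "word_is_final": True},
        {"text": "amazing", "word_is_final": False},
    ],
}

assert finalized_text(turn) == turn["transcript"]
```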

Immutable transcription

AssemblyAI’s streaming system receives audio in a streaming fashion and returns transcription responses in real time using the format specified above. Unlike many other streaming speech-to-text models that implement the concept of partial/variable transcriptions to show transcripts in an ongoing manner, Universal-Streaming transcriptions are immutable. In other words, text that has already been produced will not be overwritten in future transcription responses. Therefore, with Universal-Streaming, transcriptions are delivered in the following way:

```
→ hello my na
→ hello my name
→ hello my name
→ hello my name is
→ hello my name is zac
→ hello my name is zack
```

When the end of the current turn is detected, you receive a message with end_of_turn set to true. Additionally, if you enable text formatting, you will also receive a transcription response with turn_is_formatted set to true.

```
→ hello my name is zack (unformatted)
→ Hello my name is Zack. (formatted)
```

As you may have noticed in this example, the last word of a transcript can occasionally be a subword ("zac" in the example shown above). Each Word object has the word_is_final field to indicate whether the model is confident that the last word is a completed word. Note that, except for the last word, word_is_final is always true.
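One way to take advantage of immutability when rendering captions is to treat finalized words as stable text and mark only the trailing non-final word as tentative. A minimal sketch, assuming the SDK's TurnEvent exposes each word's text and word_is_final fields as attributes:

```python
def render_line(event) -> str:
    # Finalized words will never be rewritten in later responses.
    stable = [w.text for w in event.words if w.word_is_final]
    # Only the trailing word can still change ("zac" -> "zack").
    tentative = [w.text for w in event.words if not w.word_is_final]
    line = " ".join(stable)
    if tentative:
        line += " " + " ".join(tentative) + "…"
    return line
```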

Use-case specific recommendations

Live captioning

The default settings for Streaming Speech-to-Text are optimized for the voice agent use case, where you expect one person speaking at a time, with long silences during the agent's speaking turn. For applications such as live captioning, where the input audio stream typically contains multiple people speaking, it is usually beneficial to wait longer before detecting the end of a turn, which triggers text formatting.

When captioning conversations with multiple speakers, we recommend setting min_end_of_turn_silence_when_confident to 560 ms. By default, this is set to 160 ms.
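With the Python SDK from the quickstart, that recommendation would look something like the following sketch, assuming StreamingParameters accepts min_end_of_turn_silence_when_confident (mirroring the connection parameter documented below):

```python
from assemblyai.streaming.v3 import StreamingParameters

# `client` is the StreamingClient constructed in the quickstart above.
client.connect(
    StreamingParameters(
        sample_rate=16000,
        format_turns=True,
        # Wait 560 ms of silence before ending a turn (multi-speaker
        # captioning), instead of the 160 ms voice-agent default.
        min_end_of_turn_silence_when_confident=560,
    )
)
```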

Voice agents

To optimize for latency when building a voice agent, we recommend using the unformatted transcript, as it arrives more quickly than the formatted version. In typical voice agent applications involving large language models (LLMs), the lack of formatting has little impact on the subsequent LLM processing. For more information, see Voice agents.
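In practice, that means acting on the end-of-turn transcript before the formatted follow-up arrives. A minimal sketch of an on_turn handler in the style of the quickstart; forward_to_llm is a hypothetical downstream call, not part of the SDK:

```python
from typing import Type

from assemblyai.streaming.v3 import StreamingClient, TurnEvent

def on_turn(self: Type[StreamingClient], event: TurnEvent):
    # The unformatted transcript for a finished turn arrives before the
    # formatted version, so hand it to the LLM immediately.
    if event.end_of_turn and not event.turn_is_formatted:
        forward_to_llm(event.transcript)  # hypothetical downstream call
```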

API Reference

Connection parameters

token (string)
Authenticate the session using a generated temporary token.

sample_rate (int, required)
The sample rate of the audio stream.

encoding (string, required)
The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.

format_turns (boolean, defaults to false)
Whether to return formatted final transcripts. If enabled, formatted final transcripts will be emitted shortly after an end-of-turn detection.

end_of_turn_confidence_threshold (float, defaults to 0.7)
The confidence threshold (0.0 to 1.0) used to determine whether the end of a turn has been reached. Raise or lower the threshold depending on how confident you'd like the model to be before end of turn is triggered by the confidence score.

min_end_of_turn_silence_when_confident (int, defaults to 160 ms)
The minimum amount of silence, in milliseconds, required to detect end of turn when confident. Increase or decrease this value to adjust how long we wait before triggering end of turn when confident.

max_turn_silence (int, defaults to 2400 ms)
The maximum amount of silence, in milliseconds, allowed in a turn before end of turn is triggered. Lower or raise this value to adjust how much silence is needed to trigger end of turn when it isn't triggered by a high confidence score.
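If you connect without the SDK, these parameters are passed as query parameters on the streaming WebSocket URL. Below is a minimal sketch assuming the v3 endpoint at wss://streaming.assemblyai.com/v3/ws, the third-party websocket-client package, and API-key authentication via the Authorization header:

```python
import urllib.parse

import websocket  # pip install websocket-client

query = urllib.parse.urlencode({
    "sample_rate": 16000,
    "encoding": "pcm_s16le",
    "format_turns": "true",
    "min_end_of_turn_silence_when_confident": 560,
})
ws = websocket.create_connection(
    "wss://streaming.assemblyai.com/v3/ws?" + query,
    header={"Authorization": "<YOUR_API_KEY>"},
)
print(ws.recv())  # the first message should be the "Begin" event
ws.close()
```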

Audio requirements

The audio format must conform to the following requirements:

  • PCM16 or Mu-law encoding (See Specify the encoding)
  • A sample rate that matches the value of the sample_rate parameter
  • Single-channel
  • 50 milliseconds of audio per message (recommended)
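The recommended 50 ms per message translates directly into a byte count for each audio chunk. For example, for single-channel PCM16 audio at 16 kHz:

```python
sample_rate = 16000   # Hz; must match the sample_rate parameter
bytes_per_sample = 2  # PCM16 = 16 bits, single channel
chunk_ms = 50         # recommended audio per message

chunk_bytes = sample_rate * bytes_per_sample * chunk_ms // 1000
print(chunk_bytes)    # 1600 bytes of audio per message
```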

Message types

You send:

1"UklGRtjIAABXQVZFZ"
1{
2 "type": "UpdateConfiguration",
3 "end_of_turn_confidence_threshold": 0.5
4}
1{ "type": "Terminate" }
1{ "type": "ForceEndpoint" }

You receive:

```json
{
  "type": "Begin",
  "id": "cfd280c7-5a9b-4dd6-8c05-235ccfa3c97f",
  "expires_at": 1745483367
}
```

```json
{
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "Hi, my name is Sonny.",
  "end_of_turn_confidence": 0.8095446228981018,
  "words": [
    {
      "start": 1440,
      "end": 1520,
      "text": "Hi,",
      "confidence": 0.9967870712280273,
      "word_is_final": true
    },
    {
      "start": 1600,
      "end": 1680,
      "text": "my",
      "confidence": 0.999546468257904,
      "word_is_final": true
    },
    {
      "start": 1600,
      "end": 1680,
      "text": "name",
      "confidence": 0.9597182273864746,
      "word_is_final": true
    },
    {
      "start": 1680,
      "end": 1760,
      "text": "is",
      "confidence": 0.8261497616767883,
      "word_is_final": true
    },
    {
      "start": 2320,
      "end": 3040,
      "text": "Sonny.",
      "confidence": 0.5737350583076477,
      "word_is_final": true
    }
  ],
  "type": "Turn"
}
```

For the full breakdown of the message sequence for a turn, see the Message sequence breakdown guide.

```json
{
  "type": "Termination",
  "audio_duration_seconds": 2000,
  "session_duration_seconds": 2000
}
```