Streaming Audio
Quickstart
In this quick guide you will learn how to use AssemblyAI’s Streaming Speech-to-Text feature to transcribe audio from your microphone.
To run this quickstart you will need:
- Python or JavaScript installed
- A valid AssemblyAI API key
To run the quickstart:
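Below is a minimal Python sketch that talks to the streaming WebSocket directly rather than through an SDK. It assumes the third-party `websocket-client` and `pyaudio` packages, a 16 kHz microphone, and the v3 streaming endpoint; treat it as a starting point rather than a drop-in implementation.

```python
# Minimal sketch: stream microphone audio and print Turn transcripts.
# Assumes the `websocket-client` and `pyaudio` packages.
import json
import threading

import pyaudio
import websocket

API_KEY = "<YOUR_API_KEY>"
SAMPLE_RATE = 16000
ENDPOINT = (
    "wss://streaming.assemblyai.com/v3/ws"
    f"?sample_rate={SAMPLE_RATE}&format_turns=true"
)

def on_open(ws):
    # Stream 50 ms chunks of 16-bit PCM mono audio from the microphone.
    def stream_audio():
        frames_per_chunk = int(SAMPLE_RATE * 0.05)  # 50 ms of audio
        mic = pyaudio.PyAudio().open(
            format=pyaudio.paInt16,
            channels=1,
            rate=SAMPLE_RATE,
            input=True,
            frames_per_buffer=frames_per_chunk,
        )
        while True:
            ws.send(mic.read(frames_per_chunk), websocket.ABNF.OPCODE_BINARY)

    threading.Thread(target=stream_audio, daemon=True).start()

def on_message(ws, message):
    msg = json.loads(message)
    # Turn messages carry the fields described under "Core concepts" below.
    if msg.get("type") == "Turn":
        print(msg["transcript"])

ws = websocket.WebSocketApp(
    ENDPOINT,
    header={"Authorization": API_KEY},
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()
```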
Core concepts
See also: Streaming API: Message Sequence Breakdown
Universal-Streaming is built on two core concepts: Turn objects and immutable transcriptions.
Turn object
A Turn object corresponds to a speaking turn in the context of voice agent applications, and therefore roughly corresponds to an utterance in a broader context. We assign a unique ID to each Turn object, which is included in our response. Specifically, the Universal-Streaming response is formatted as follows:
- `turn_order`: Integer that increments with each new turn
- `turn_is_formatted`: Boolean indicating whether the text in the `transcript` field is formatted. Text formatting is enabled when `format_turns` is set to `true`. It adds punctuation, applies casing, and performs inverse text normalization to display entities, such as dates, times, and phone numbers, in a human-friendly format
- `end_of_turn`: Boolean indicating whether this is the end of the current turn
- `transcript`: String containing only finalized words
- `end_of_turn_confidence`: Float between 0 and 1 representing the confidence that the current turn has finished, i.e., that the current speaker has completed their turn
- `words`: List of Word objects with individual metadata
Each Word object in the `words` array includes:
- `text`: The string representation of the word
- `word_is_final`: Boolean indicating whether the word is finalized; a finalized word won’t be altered in future transcription responses
- `start`: Timestamp for the start of the word
- `end`: Timestamp for the end of the word
- `confidence`: Confidence score for the word
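Putting these together, a single Turn payload might look like the following hypothetical example (all values invented for illustration; shown as a Python literal):

```python
# A hypothetical Turn payload. The transcript contains only finalized
# words, while the words list may end with a not-yet-final subword.
turn = {
    "turn_order": 1,
    "turn_is_formatted": False,
    "end_of_turn": False,
    "transcript": "hi my name is",
    "end_of_turn_confidence": 0.12,
    "words": [
        {"text": "hi", "word_is_final": True, "start": 120, "end": 280, "confidence": 0.99},
        {"text": "my", "word_is_final": True, "start": 300, "end": 440, "confidence": 0.98},
        {"text": "name", "word_is_final": True, "start": 460, "end": 620, "confidence": 0.99},
        {"text": "is", "word_is_final": True, "start": 640, "end": 760, "confidence": 0.97},
        {"text": "zac", "word_is_final": False, "start": 780, "end": 940, "confidence": 0.62},
    ],
}
```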
Immutable transcription
As AssemblyAI’s streaming system receives audio, it returns transcription responses in real time using the format specified above. Unlike many other streaming speech-to-text models that implement partial/variable transcriptions to show transcripts in an ongoing manner, Universal-Streaming transcriptions are immutable: text that has already been produced will not be overwritten in future transcription responses. With Universal-Streaming, transcriptions are therefore delivered in the following way:
When the end of the current turn is detected, you receive a message with `end_of_turn` set to `true`. Additionally, if you enable text formatting, you will also receive a transcription response with `turn_is_formatted` set to `true`.
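For example (utterance text invented for illustration), the responses for a single turn might arrive as follows, with each transcript only ever growing:

```
transcript: "hi my name is"            end_of_turn=false
transcript: "hi my name is zac"        end_of_turn=false
transcript: "hi my name is zachary"    end_of_turn=true
transcript: "Hi, my name is Zachary."  end_of_turn=true, turn_is_formatted=true
```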
In this example, you may have noticed that the last word of a transcript can occasionally be a subword (“zac” in the example shown above). Each Word object has the `word_is_final` field to indicate whether the model is confident that the last word is a completed word. Note that, except for the last word, `word_is_final` is always `true`.
Recommendations
- Use an audio chunk size of 50ms. Larger chunk sizes are workable, but may result in latency fluctuations.
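For example, at a 16 kHz sample rate with 16-bit PCM mono audio, a 50 ms chunk works out to 16,000 × 0.05 × 2 = 1,600 bytes per message.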
Voice agent use case
Possible implementation strategy
Since all our transcripts are immutable, the data is immediately ready to be sent through the voice agent pipeline. Here’s one way to handle the conversation flow:
- When you receive a transcription response with the `end_of_turn` value being `true` but your voice agent (i.e., your own turn detection logic) hasn’t detected the end of the turn, save this data in a variable (let’s call it `running_transcript`).
- When the voice agent detects the end of the turn, combine the `running_transcript` with the latest partial transcript and send it to the LLM.
- Clear the `running_transcript` after sending, and be sure to ignore the next transcription with `end_of_turn` of `true` that will eventually arrive for the latest partial you used. This prevents duplicate information from being processed in future turns.
What you send to the voice agent should look like: `running_transcript + ' ' + latest_partial`
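One possible shape for that bookkeeping, as a minimal Python sketch: the Turn message fields follow the Core concepts section above, while `latest_partial` and the LLM hand-off are stand-ins for your own pipeline.

```python
# Minimal sketch of the running_transcript strategy described above.
class RunningTranscript:
    def __init__(self):
        self.running_transcript = ""
        self.ignore_next_end_of_turn = False

    def on_turn(self, msg):
        # Called for each parsed Turn message from the streaming API.
        if not msg["end_of_turn"]:
            return
        if self.ignore_next_end_of_turn:
            # End-of-turn for a partial we already flushed; skip it.
            self.ignore_next_end_of_turn = False
            return
        # STT detected end of turn before our own logic did: buffer it.
        self.running_transcript = (
            self.running_transcript + " " + msg["transcript"]
        ).strip()

    def flush(self, latest_partial):
        # Called when your own turn-detection logic fires.
        utterance = (self.running_transcript + " " + latest_partial).strip()
        self.running_transcript = ""
        # Only ignore the next end_of_turn if a partial was consumed.
        self.ignore_next_end_of_turn = bool(latest_partial)
        return utterance
```

When your own endpointing fires, call `flush(latest_partial)` and pass the returned utterance to your LLM.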
Example flow
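A hypothetical sequence (utterance text invented for illustration):

- A Turn response arrives with transcript “what’s the weather” and `end_of_turn` of `true`, but your agent hasn’t detected the end of the turn, so you store it in `running_transcript`.
- Further partial transcripts arrive, the latest being “in new york”.
- Your agent detects the end of the turn, so you send “what’s the weather in new york” to the LLM.
- You clear `running_transcript` and ignore the upcoming transcription with `end_of_turn` of `true` that arrives for “in new york”.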
Utilizing our ongoing transcriptions in this manner will allow you to achieve the fastest possible latency for this step of your Voice Agent. Please reach out to the AssemblyAI team with any questions.
Voice agent orchestrators
View our Livekit integration guide.
View our Pipecat integration guide.
View our Vapi integration guide.
Reference
Connection parameters
- `token`: Authenticate the session using a generated temporary token.
- `sample_rate`: The sample rate of the audio stream.
- `encoding`: The encoding of the audio stream. Allowed values: `pcm_s16le`, `pcm_mulaw`.
- `format_turns`: Whether to return formatted final transcripts. If enabled, formatted final transcripts will be emitted shortly following an end-of-turn detection.
- `end_of_turn_confidence_threshold`: The confidence threshold to use when determining if the end of a turn has been reached.
- `min_end_of_turn_silence_when_confident`: The minimum amount of silence required to detect end of turn when confident.
- `max_turn_silence`: The maximum amount of silence allowed in a turn before end of turn is triggered.
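Assuming the v3 WebSocket endpoint used in the quickstart sketch above, a connection URL with these parameters might look like `wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&encoding=pcm_s16le&format_turns=true`.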
Audio requirements
The audio format must conform to the following requirements:
- PCM16 or Mu-law encoding (see Specify the encoding)
- A sample rate that matches the value of the `sample_rate` parameter
- Single-channel
- 50 milliseconds of audio per message (recommended)
Message types
You send:
- Audio data
- Endpointing config
- Session termination
- Force endpoint
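As a rough sketch of the send side, reusing the `ws` connection from the quickstart above. The JSON shape and the exact `type` strings here are assumptions, so consult the message references in this section for the authoritative schemas.

```python
# Hypothetical control messages; the "type" values are assumptions.
import json

ws.send(json.dumps({"type": "ForceEndpoint"}))  # force an end-of-turn
ws.send(json.dumps({"type": "Terminate"}))      # cleanly end the session
```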
You receive:
- Session Begin
- Turn
For the full breakdown of the message sequence for a turn, see the Message sequence breakdown guide.