Universal Streaming

By default, Universal-Streaming is set to transcribe English audio. If you’d like to enable multilingual streaming (support for English, Spanish, French, German, Italian, and Portuguese), enable multilingual transcription instead.

Streaming is now available in EU-West via streaming.eu.assemblyai.com. To use the EU streaming endpoint, replace streaming.assemblyai.com with streaming.eu.assemblyai.com in your connection configuration.
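For example, using the Python SDK's connection options shown in the quickstart below, selecting the EU endpoint is a one-line change in the connection configuration (a sketch; the API key value is a placeholder):

```python
from assemblyai.streaming.v3 import StreamingClient, StreamingClientOptions

client = StreamingClient(
    StreamingClientOptions(
        api_key="<YOUR_API_KEY>",
        # EU-West endpoint instead of the default streaming.assemblyai.com
        api_host="streaming.eu.assemblyai.com",
    )
)
```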

Quickstart

In this quick guide you will learn how to use AssemblyAI’s Streaming Speech-to-Text feature to transcribe audio from your microphone.

To run this quickstart you will need:

  • Python installed
  • A valid AssemblyAI API key

To run the quickstart:

1. Create a new Python file (for example, main.py) and paste the code provided below inside.

2. Insert your API key on line 17 (the api_key variable).

3. Install the necessary libraries:

```shell
pip install assemblyai pyaudio
```

4. Run the file with python main.py.

```python
import logging
from typing import Type

import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    StreamingSessionParameters,
    TerminationEvent,
    TurnEvent,
)

api_key = "<YOUR_API_KEY>"

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def on_begin(self: Type[StreamingClient], event: BeginEvent):
    print(f"Session started: {event.id}")

def on_turn(self: Type[StreamingClient], event: TurnEvent):
    print(f"{event.transcript} ({event.end_of_turn})")

def on_terminated(self: Type[StreamingClient], event: TerminationEvent):
    print(
        f"Session terminated: {event.audio_duration_seconds} seconds of audio processed"
    )

def on_error(self: Type[StreamingClient], error: StreamingError):
    print(f"Error occurred: {error}")

def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=api_key,
            api_host="streaming.assemblyai.com",
        )
    )

    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)

    client.connect(
        StreamingParameters(
            sample_rate=16000,
            format_turns=True,
        )
    )

    try:
        client.stream(
            aai.extras.MicrophoneStream(sample_rate=16000)
        )
    finally:
        client.disconnect(terminate=True)

if __name__ == "__main__":
    main()
```

Core concepts

For a message-by-message breakdown of a turn, see our Streaming API: Message Sequence Breakdown guide.

Universal-Streaming is built upon two core concepts: Turn objects and immutable transcriptions.

Turn object

A Turn object corresponds to a speaking turn in the context of voice agent applications, and roughly to an utterance in a broader context. Each Turn object is assigned a unique ID, which is included in the response. Specifically, the Universal-Streaming response is formatted as follows:

```json
{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "modern medicine is",
  "end_of_turn_confidence": 0.7,
  "words": [
    { "text": "modern", "word_is_final": true, ... },
    { "text": "medicine", "word_is_final": true, ... },
    { "text": "is", "word_is_final": true, ... },
    { "text": "amazing", "word_is_final": false, ... }
  ]
}
```
  • turn_order: Integer that increments with each new turn
  • turn_is_formatted: Boolean indicating whether the text in the transcript field is formatted. Text formatting is enabled when format_turns is set to true. It adds punctuation, applies casing, and performs inverse text normalization so that entities such as dates, times, and phone numbers are displayed in a human-friendly format
  • end_of_turn: Boolean indicating if this is the end of the current turn
  • transcript: String containing only finalized words
  • end_of_turn_confidence: Floating-point number (0–1) representing the confidence that the current turn has finished, i.e., that the current speaker has completed their turn
  • words: List of Word objects with individual metadata

Each Word object in the words array includes:

  • text: The string representation of the word
  • word_is_final: Boolean indicating if the word is finalized, where a finalized word means the word won’t be altered in future transcription responses
  • start: Timestamp for word start
  • end: Timestamp for word end
  • confidence: Confidence score for the word
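To illustrate how these fields fit together, here is a minimal sketch in plain Python, using an abridged version of the sample payload above (timestamps and confidence scores omitted). Since the transcript field contains only finalized words, rebuilding it from the words with word_is_final set to true should match:

```python
# Abridged sample Turn payload (field names from the message format above).
turn = {
    "turn_order": 1,
    "turn_is_formatted": False,
    "end_of_turn": False,
    "transcript": "modern medicine is",
    "end_of_turn_confidence": 0.7,
    "words": [
        {"text": "modern", "word_is_final": True},
        {"text": "medicine", "word_is_final": True},
        {"text": "is", "word_is_final": True},
        {"text": "amazing", "word_is_final": False},
    ],
}

# The transcript field contains only finalized words, so joining the words
# with word_is_final=True reproduces it; "amazing" is still tentative.
finalized = " ".join(w["text"] for w in turn["words"] if w["word_is_final"])
assert finalized == turn["transcript"]
print(finalized)  # modern medicine is
```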

Immutable transcription

AssemblyAI’s streaming system receives audio in a streaming fashion and returns transcription responses in real time using the format specified above. Unlike many other streaming speech-to-text models that implement the concept of partial/variable transcriptions to show transcripts in an ongoing manner, Universal-Streaming transcriptions are immutable. In other words, text that has already been produced will not be overwritten in future transcription responses. Therefore, with Universal-Streaming, the transcriptions are delivered in the following way:

```
→ hello my na
→ hello my name
→ hello my name
→ hello my name is
→ hello my name is zac
→ hello my name is zack
```
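One practical consequence of immutability: because a finalized word never changes in later responses, a consumer can commit each word the first time it arrives with word_is_final set to true and never revisit it. A sketch in plain Python, using made-up responses modeled on the sequence above:

```python
# Successive (abridged) word lists from one turn; "na" is a tentative subword
# in the first response and is not yet finalized.
responses = [
    [
        {"text": "hello", "word_is_final": True},
        {"text": "my", "word_is_final": True},
        {"text": "na", "word_is_final": False},
    ],
    [
        {"text": "hello", "word_is_final": True},
        {"text": "my", "word_is_final": True},
        {"text": "name", "word_is_final": True},
        {"text": "is", "word_is_final": True},
        {"text": "zack", "word_is_final": True},
    ],
]

committed = []
for words in responses:
    finalized = [w["text"] for w in words if w["word_is_final"]]
    # Finalized words are append-only across responses, so only the tail is new.
    committed.extend(finalized[len(committed):])

print(" ".join(committed))  # hello my name is zack
```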

When the end of the current turn is detected, you then receive a message with end_of_turn set to true. Additionally, if you enable text formatting, you also receive a transcription response with turn_is_formatted set to true.

```
→ hello my name is zack (unformatted)
→ Hello my name is Zack. (formatted)
```

In this example, you may have noticed that the last word of a transcript can occasionally be a subword (“zac” above). Each Word object includes the word_is_final field to indicate whether the model is confident that the last word is a complete word. Note that word_is_final is always true for every word except the last.
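Putting the turn flags together: with formatting enabled, a turn ends with an unformatted response followed by a formatted one, so a consumer that only wants final, human-readable text per turn can filter on both flags. A sketch in plain Python with made-up sample responses:

```python
# Abridged end-of-turn responses, modeled on the formatted/unformatted pair above.
responses = [
    {"transcript": "hello my name is zack", "end_of_turn": True, "turn_is_formatted": False},
    {"transcript": "Hello my name is Zack.", "end_of_turn": True, "turn_is_formatted": True},
]

# Keep only the formatted, end-of-turn transcript for each turn.
final_turns = [
    r["transcript"]
    for r in responses
    if r["end_of_turn"] and r["turn_is_formatted"]
]
print(final_turns)  # ['Hello my name is Zack.']
```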