Using real-time streaming
AssemblyAI's real-time transcription service allows you to transcribe live audio streams with high accuracy and low latency. By streaming your audio data to our secure WebSocket API, you can receive transcripts back within a few hundred milliseconds, and our system will continue to revise these transcripts with greater accuracy over time as more context arrives.
In this guide, you'll learn how to establish a WebSocket connection, send audio data, and receive partial and final transcription results. Real-time transcription requires 16-bit signed integer PCM-encoded, single-channel audio, matching a predefined sample rate. You can send between 100 and 2000 milliseconds of audio over a WebSocket at a time.
You can also learn the content on this page by watching Real-time Speech Recognition in 15 minutes with AssemblyAI on AssemblyAI's YouTube channel.
Get started
Before we begin, make sure you have an AssemblyAI account and an API token. You can sign up for a free account and get your API token from your dashboard. Please note that this feature is available for paid accounts only. If you are on the free plan, you will need to upgrade.
The entire source code of this guide can be viewed here.
Step-by-step instructions
1. Establish a WebSocket connection with the real-time endpoint by using a WebSocket client and connecting to `wss://api.assemblyai.com/v2/realtime/ws`. Authenticate your request by including your API token in the authorization header of your WebSocket connection, and provide the sample rate of your audio data as a query parameter to the real-time endpoint.
2. Optional: Add up to 2,500 characters of custom vocabulary to your real-time session by including the `word_boost` parameter as an optional query parameter in the URL. See also Adding Custom Vocabulary.
3. Send audio data over the WebSocket connection as a JSON payload, with the raw audio data base64 encoded (see the sketch after this list).
4. Update the `on_message` callback to receive partial transcription results immediately after sending audio data. These results include parameters such as message type, session ID, audio start and end times, confidence scores, and transcription text.
5. Continue sending audio data until the API detects the end of an utterance and sends final transcription results with higher accuracy, punctuation, and casing.
6. Update the `on_error` callback to handle WebSocket errors and application-level errors, including bad sample rate, authentication failure, insufficient funds, and more. See also Closing and Status Codes for a list of errors.
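To tie these steps together, here's a minimal sketch using the `websocket-client` Python package. It's an illustration under stated assumptions rather than a definitive implementation: `YOUR_AAI_TOKEN` is a placeholder for your API token, and `get_audio_chunk()` is a hypothetical helper that returns 100-2000 milliseconds of PCM16 audio per call (or `None` when finished).

```python
import base64
import json
import threading

import websocket  # pip install websocket-client

SAMPLE_RATE = 16000
URL = f"wss://api.assemblyai.com/v2/realtime/ws?sample_rate={SAMPLE_RATE}"

def on_open(ws):
    # Step 3: stream base64-encoded audio chunks as JSON payloads.
    def stream():
        while True:
            chunk = get_audio_chunk()  # hypothetical helper: 100-2000 ms of PCM16 audio
            if chunk is None:
                ws.send(json.dumps({"terminate_session": True}))
                break
            ws.send(json.dumps({"audio_data": base64.b64encode(chunk).decode("utf-8")}))
    threading.Thread(target=stream, daemon=True).start()

def on_message(ws, message):
    # Steps 4-5: partial transcripts arrive immediately; a final transcript
    # follows the end of each utterance.
    msg = json.loads(message)
    print(msg["message_type"], msg.get("text", ""))

def on_error(ws, error):
    # Step 6: handle WebSocket and application-level errors.
    print("Error:", error)

ws = websocket.WebSocketApp(
    URL,
    header={"Authorization": "YOUR_AAI_TOKEN"},  # Step 1: authenticate
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()
```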
Audio Requirements
The raw audio data in the `audio_data` field above must comply with a strict encoding format, because we don't transcode your data; we send it directly to the model for transcription to reduce latency. Your audio must be:

- WAV PCM16
- A sample rate that matches the value of the `sample_rate` query param you supply
- Single-channel
- 100 to 2000 milliseconds of audio per message
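One way to produce compliant audio is to capture it with the PyAudio package, as in this sketch (an assumption; any source of single-channel PCM16 bytes at your declared sample rate works):

```python
import pyaudio  # pip install pyaudio

SAMPLE_RATE = 16000
FRAMES_PER_BUFFER = 3200  # 200 ms of audio at 16 kHz, within the 100-2000 ms window

p = pyaudio.PyAudio()
stream = p.open(
    format=pyaudio.paInt16,  # PCM16
    channels=1,              # single-channel
    rate=SAMPLE_RATE,        # must match the sample_rate query param
    input=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)

chunk = stream.read(FRAMES_PER_BUFFER)  # raw bytes, ready to base64 encode
```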
Request Types
These are the types of requests that can be sent to the WebSocket API.
Opening a Session
When opening a session, you can pass the following query parameters to the WebSocket URL:

Parameter | Description |
---|---|
sample_rate | The sample rate of the streamed audio. Example: `wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000` |
word_boost | See Adding Custom Vocabulary |
token | See Creating Temporary Authentication Tokens |
Sending Audio
When sending audio over the WebSocket connection, you should send a JSON payload with the following parameters:
Field | Example | Description |
---|---|---|
audio_data | "UklGRtjIAABXQVZFZ" | Raw audio data, base64 encoded. This can be the raw data recorded directly from a microphone or read from an audio file. |
Terminating a Session
When you've completed your session, your client should send a JSON message with the following field.
Field | Example | Description |
---|---|---|
terminate_session | true | A boolean value to communicate that you wish to end your real-time session forever. |
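For example, assuming `ws` is your open WebSocket connection:

```python
import json

# Tell the API the session is finished; no further audio will be sent
ws.send(json.dumps({"terminate_session": True}))
```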
Response Types
These are the types of responses that can be received from the WebSocket API.
Session Start
Once your request is authorized and the connection is established, your client will receive a SessionBegins
message with the following JSON data:
Field | Example | Description |
---|---|---|
message_type | "SessionBegins" | Describes the type of the message. |
session_id | "d3e8c537-2f11-494b-b497-e59a434588bd" | Unique identifier for the established session. |
expires_at | "2023-05-24T08:09:10.161850" | Timestamp when this session will expire. |
Transcripts
Our real-time transcription pipeline uses a two-phase transcription strategy, broken into partial and final results.
Partial Transcripts
As you send audio data to the API, the API will immediately start responding with Partial Results. The following keys will be in the JSON response from the WebSocket API.
Field | Example | Description |
---|---|---|
message_type | "PartialTranscript" | Describes the type of message. |
audio_start | 0 | Start time of audio sample relative to session start, in milliseconds. |
audio_end | 1500 | End time of audio sample relative to session start, in milliseconds. |
confidence | 0.987190506414702 | The confidence score of the entire transcription, between 0 and 1. |
text | "there is a house in new orleans" | The partial transcript for your audio. |
words | [{"start": 0, "end": 440, "confidence": 1.0, "text": "there"}, ...] | An array of objects, with the information for each word in the transcription text. Will include the start /end time (in milliseconds) of the word, the confidence score of the word, and the text (i.e. the word itself). |
created | "2023-05-24T08:09:10.161850" | The timestamp for the partial transcript. |
Final Transcripts
After you've received your partial results, our model will continue to analyze incoming audio and, when it detects the end of an "utterance" (usually a pause in speech), it will finalize the results sent to you so far with higher accuracy, as well as add punctuation and casing to the transcription text.
The following keys will be in the JSON response from the WebSocket API when Final Results are sent:
Field | Example | Description |
---|---|---|
message_type | "FinalTranscript" | Describes the type of message. |
audio_start | 0 | Start time of audio sample relative to session start, in milliseconds. |
audio_end | 1500 | End time of audio sample relative to session start, in milliseconds. |
confidence | 0.997190506414702 | The confidence score of the entire transcription, between 0 and 1. |
text | "There is a house in New Orleans" | The partial transcript for your audio. |
words | [{"start": 0, "end": 440, "confidence": 1.0, "text": "There"}, ...] | An array of objects, with the information for each word in the transcription text. Will include the start /end time (in milliseconds) of the word, the confidence score of the word, and the text (i.e. the word itself). |
created | "2023-05-24T08:09:10.161850" | The timestamp for the final transcript. |
punctuated | true | Whether the text has been punctuated and cased. |
text_formatted | true | Whether the text has been formatted (e.g. Dollar -> $) |
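Putting the response types together, here's a minimal sketch of an `on_message` handler that distinguishes them, using only the fields documented above:

```python
import json

def on_message(ws, message):
    msg = json.loads(message)
    if msg["message_type"] == "SessionBegins":
        print("Session started:", msg["session_id"])
    elif msg["message_type"] == "PartialTranscript":
        print("Partial:", msg["text"])  # unpunctuated; revised as more audio arrives
    elif msg["message_type"] == "FinalTranscript":
        print("Final:", msg["text"])  # punctuated and cased
```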
Closing and Status Codes
The WebSocket specification provides standard errors.
Our API provides application-level WebSocket errors for well-known scenarios:
Error Condition | Status Code | Message |
---|---|---|
bad sample rate | 4000 | "Sample rate must be a positive integer" |
auth failed | 4001 | "Not Authorized" |
insufficient funds | 4002 | "Insufficient Funds" |
free tier user | 4002 | "This feature is paid-only and requires you to add a credit card. Please visit https://app.assemblyai.com/ to add a credit card to your account" |
attempt to connect to nonexistent session id | 4004 | "Session not found" |
session expired | 4008 | "Session Expired" |
attempt to connect to closed session | 4010 | "Session previously closed" |
rate limited | 4029 | "Client sent audio too fast" |
unique session violation | 4030 | "Session is handled by another websocket" |
session times out | 4031 | "Session idle for too long" |
audio too short | 4032 | "Audio duration is too short" |
audio too long | 4033 | "Audio duration is too long" |
bad json | 4100 | "Endpoint received invalid JSON" |
bad schema | 4101 | "Endpoint received a message with an invalid schema" |
too many streams | 4102 | "This account has exceeded the number of allowed streams" |
reconnected | 4103 | "This session has been reconnected. This websocket is no longer valid." |
reconnect attempts exhausted | 1013 | "Temporary server condition forced blocking client's request" |
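With the `websocket-client` package, these application-level codes arrive in the `on_close` callback; a hedged sketch:

```python
def on_close(ws, close_status_code, close_msg):
    # Application-level errors are delivered as WebSocket close codes
    if close_status_code == 4029:
        print("Rate limited: audio was sent faster than real time")
    elif close_status_code in (4001, 4002):
        print("Check your API token and account balance:", close_msg)
    else:
        print("Connection closed:", close_status_code, close_msg)
```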
Quotas and Limits
The following limits are imposed to ensure performance and service quality. Please contact us if you need to increase these limits.
- Idle Sessions - Sessions that do not receive audio within 1 minute will be terminated.
- Session Limit - 32 sessions at a time for paid users. Free-tier users must upgrade their account to use real-time streaming.
- Session Uniqueness - Only one WebSocket per session.
- Audio Sampling Rate Limit - Customers must send data in near real-time. If a client sends data faster than 1 second of audio per second for longer than 1 minute, we will terminate the session.
Adding Custom Vocabulary
Developers can also add up to 2,500 characters of custom vocabulary to their real-time session by adding the optional query parameter `word_boost` to the URL. The parameter should map to a JSON-encoded list of strings, as shown in this Python example:

import json
from urllib.parse import urlencode

sample_rate = 16000
word_boost = ["foo", "bar"]

# word_boost must be a JSON-encoded list of strings
params = {"sample_rate": sample_rate, "word_boost": json.dumps(word_boost)}
url = f"wss://api.assemblyai.com/v2/realtime/ws?{urlencode(params)}"
Creating Temporary Authentication Tokens
In some cases, a developer will need to authenticate on the client side and won't want to expose their AssemblyAI token. You can do this by sending a POST request to https://api.assemblyai.com/v2/realtime/token with the parameter `expires_in: {TTL in seconds}`. Below is a quick example in curl.
The `expires_in` parameter must be greater than or equal to 60 seconds.
curl --request POST \
--url https://api.assemblyai.com/v2/realtime/token \
--header 'authorization: YOUR_AAI_TOKEN' \
--header 'Content-Type: application/json' \
--data '{"expires_in": 60}'
In response you will receive the following JSON output:
{
"token": "b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd"
}
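The same request in Python, sketched with the `requests` package (an assumption; any HTTP client works):

```python
import requests

response = requests.post(
    "https://api.assemblyai.com/v2/realtime/token",
    headers={"authorization": "YOUR_AAI_TOKEN"},
    json={"expires_in": 60},  # TTL in seconds; must be >= 60
)
token = response.json()["token"]
```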
A developer can now use this temporary token in the browser to authenticate a new WebSocket session with the following endpoint: wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={New Temp Token}. For example:
const token = 'b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd'
const socket = new WebSocket(
  `wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token=${token}`
)
Conclusion
Real-time transcription is a powerful feature with even more powerful possibilities for integration. On the AssemblyAI blog, you can learn about using real-time transcription to:
- Automatically Transcribe Zoom Calls in Real Time
- Transcribe Twilio Phone Calls
- Connect to the real-time transcription API using a PyAudio stream
You can also find an example of using Express.js for real-time transcription on GitHub.