Real-Time API reference

Real-Time Transcription uses WebSockets to stream audio.

Open a session

Request

To open a session, connect to the following WebSocket URL:

wss://api.assemblyai.com/v2/realtime/ws

To configure the session, you can add the following query parameters. The sample_rate parameter is mandatory and must match the sample rate of the streamed audio.

Parameter | Type | Description | Required
sample_rate | number | The sample rate of the streamed audio. | Yes
encoding | string | The encoding of the streamed audio (options: pcm_s16le and pcm_mulaw). See also Specify the encoding. | No
word_boost | array | A list of custom vocabulary to boost transcription probability for. See also Add custom vocabulary. | No
token | string | A temporary authentication token. See also Authenticate with a temporary token. | No

Example: wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000
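
For example, here is a minimal connection sketch using the Python websocket-client package. The Authorization header and the 16000 Hz sample rate are assumptions; use your own API key (or a temporary token) and the sample rate of your audio.

# pip install websocket-client
import websocket

# YOUR_API_KEY is a placeholder; authenticate with your API key or a
# temporary token (see Temporary token below).
ws = websocket.create_connection(
    "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000",
    header={"Authorization": "YOUR_API_KEY"},
)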

Response

Once your request is authorized and connection established, your client receives a SessionBegins message with the following JSON data:

Field | Example | Description
message_type | "SessionBegins" | Describes the type of the message.
session_id | "d3e8c537-2f11-494b-b497-e59a434588bd" | Unique identifier for the established session.
expires_at | "2023-05-24T08:09:10.161850" | Timestamp when this session will expire.
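
Continuing the sketch above, the first message on the socket should be this SessionBegins payload, which you can read and parse as JSON:

import json

message = json.loads(ws.recv())
assert message["message_type"] == "SessionBegins"
print("Session", message["session_id"], "expires at", message["expires_at"])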

Send audio data

Request

When sending audio over the WebSocket connection, you can use the WebSocket's binary mode to send raw audio data. This can be the raw data recorded directly from a microphone or read from an audio file.
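
In the snippet below, stream is an audio input stream and ws is the WebSocket opened earlier. For example, stream could be a PyAudio microphone stream opened like this (a sketch, assuming pyaudio and mono 16-bit PCM at 16 kHz to match the session's sample_rate and default encoding):

import pyaudio

FRAMES_PER_BUFFER = 3200  # 200 ms of audio at 16 kHz; the buffer size is a free choice
p = pyaudio.PyAudio()
stream = p.open(
    format=pyaudio.paInt16,  # 16-bit PCM, matching the default pcm_s16le encoding
    channels=1,
    rate=16000,              # must match the sample_rate query parameter
    input=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)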

# read from the microphone
data = stream.read(FRAMES_PER_BUFFER)

# binary data can be sent directly
ws.send(data)

# Note: Some WebSocket clients require that you specify the type:
# ws.send(data, opcode=websocket.ABNF.OPCODE_BINARY)
Heads up

Sending audio_data via JSON is also supported but will be deprecated in the future. Use the binary mode instead.

Field | Example | Description
audio_data | "UklGRtjIAABXQVZFZ" | Raw audio data, base64 encoded.

Response

The real-time transcriber returns two types of transcripts:

  • Partial transcripts
  • Final transcripts

As you send audio data to the API, the API immediately starts responding with partial transcripts.

When the model detects the end of an utterance (usually a pause in speech), it'll finalize the results sent to you so far with higher accuracy, as well as add punctuation and casing to the transcription text.

The following keys are returned from the WebSocket API:

Field | Example | Description
message_type | "PartialTranscript" or "FinalTranscript" | Describes the type of the message.
audio_start | 0 | Start time of the audio sample relative to the session start, in milliseconds.
audio_end | 1500 | End time of the audio sample relative to the session start, in milliseconds.
confidence | 0.987190506414702 | The confidence score of the entire transcription, between 0 and 1.
text | "there is a house in new orleans" | The partial/final transcript for your audio.
words | [{"start": 0, "end": 440, "confidence": 1.0, "text": "there"}, ...] | An array of objects with information for each word in the transcription text: the start/end time (in milliseconds) of the word, the confidence score of the word, and the text (i.e. the word itself).
created | "2023-05-24T08:09:10.161850" | The timestamp for the partial/final transcript.
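
For example, here is a minimal receive loop (a sketch, assuming the blocking connection from the earlier examples) that distinguishes the two transcript types:

import json

while True:
    message = json.loads(ws.recv())
    if message["message_type"] == "PartialTranscript":
        print("partial:", message["text"])
    elif message["message_type"] == "FinalTranscript":
        # Final transcripts arrive punctuated and cased
        print("final:", message["text"], f"(confidence {message['confidence']:.2f})")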

Final transcripts also contain the following fields:

Field | Example | Description
punctuated | true | Whether the text has been punctuated and cased.
text_formatted | true | Whether the text has been formatted (e.g. Dollar -> $).

Terminate a session

Request

When you've completed your session, the client should send a JSON message with the following field:

Field | Example | Description
terminate_session | true | A boolean value to communicate that you wish to end your real-time session forever.

After requesting session termination, the server will send the remaining transcript messages, followed by a SessionTerminated message.
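
For example (a sketch, reusing the blocking connection from the earlier examples):

import json

ws.send(json.dumps({"terminate_session": True}))

# Drain the remaining messages until the SessionTerminated message arrives.
while json.loads(ws.recv()).get("message_type") != "SessionTerminated":
    pass
ws.close()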

Response

Your client receives a SessionTerminated message with the following JSON data:

Field | Example | Description
message_type | "SessionTerminated" | Describes the type of the message.

Manually end current utterance

Request

To manually end an utterance, the client should send a JSON message with the following field:

Field | Example | Description
force_end_utterance | true | A boolean value to communicate that you wish to force the end of the utterance.

Manually ending an utterance immediately produces a final transcript.
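
For example (a sketch, assuming the open connection from the earlier examples):

import json

ws.send(json.dumps({"force_end_utterance": True}))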

Configure the threshold for automatic utterance detection

Request

Real-Time Transcription uses the duration of silence to determine the end of an utterance. To configure the threshold for how long to wait before ending an utterance, the client should send a JSON message with the following fields:

Field | Example | Description
end_utterance_silence_threshold | 300 | The duration threshold in milliseconds. Default is 700.

By default, Real-Time Transcription ends an utterance after 700 milliseconds of silence. You can configure the duration threshold any number of times during a session after the session has started.
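
For example, to lower the threshold to 300 milliseconds (a sketch, assuming the open connection from the earlier examples):

import json

ws.send(json.dumps({"end_utterance_silence_threshold": 300}))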

Closing and status codes

The WebSocket specification provides standard close codes.

Our API provides application-level WebSocket errors for well-known scenarios:

Error Condition | Status Code | Message
bad sample rate | 4000 | "Sample rate must be a positive integer"
auth failed | 4001 | "Not Authorized"
insufficient funds | 4002 | "Insufficient Funds"
free tier user | 4002 | "This feature is paid-only and requires you to add a credit card. Please visit https://app.assemblyai.com/ to add a credit card to your account"
attempt to connect to nonexistent session id | 4004 | "Session not found"
session expired | 4008 | "Session Expired"
attempt to connect to closed session | 4010 | "Session previously closed"
rate limited | 4029 | "Client sent audio too fast"
unique session violation | 4030 | "Session is handled by another WebSocket"
session times out | 4031 | "Session idle for too long"
audio too short | 4032 | "Audio duration is too short"
audio too long | 4033 | "Audio duration is too long"
bad json | 4100 | "Endpoint received invalid JSON"
bad schema | 4101 | "Endpoint received a message with an invalid schema"
too many streams | 4102 | "This account has exceeded the number of allowed streams"
reconnected | 4103 | "This session has been reconnected. This WebSocket is no longer valid."
reconnect attempts exhausted | 1013 | "Temporary server condition forced blocking client's request"
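
With websocket-client's callback-style API, the close code and message are delivered to the on_close handler. A minimal sketch (the handler body is an assumption; log or retry as suits your application):

import websocket

def on_close(ws, status_code, message):
    # Codes in the 4000 range are the application-level errors listed above.
    print(f"Connection closed: {status_code} {message}")

app = websocket.WebSocketApp(
    "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000",
    header={"Authorization": "YOUR_API_KEY"},  # placeholder key
    on_close=on_close,
)
app.run_forever()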

Quotas and limits

The following limits are imposed to ensure performance and service quality.

  • Idle sessions - Sessions that don't receive audio within 1 minute will be terminated.
  • Session limit - 100 sessions at a time for paid users. Please contact us if you need to increase this limit. Free-tier users must upgrade their account to use real-time streaming.
  • Session uniqueness - Only one WebSocket per session.
  • Audio sampling rate limit - Customers must send data in near real-time. If a client sends data faster than 1 second of audio per second for longer than 1 minute, we'll terminate the session.

Temporary token

To generate a temporary token, send a POST request to https://api.assemblyai.com/v2/realtime/token. Use the expires_in parameter to specify how long the token should be valid for.

See also Authenticate with a temporary token.

Method | Endpoint | Description
POST | https://api.assemblyai.com/v2/realtime/token | Creates a temporary token to authenticate on the client.

Key | Type | Description
expires_in | integer | Specifies how long the token should be valid for, in seconds. Valid values are in the range [60, 360000] inclusive.

Example request:

curl --request POST \
  --url https://api.assemblyai.com/v2/realtime/token \
  --header 'authorization: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{"expires_in": 360000}'

Example response:

{
  "token": "b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd"
}
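
The same request in Python, followed by opening a session with the token passed as the token query parameter (a sketch using the requests package; the one-hour expiry is an arbitrary choice):

import requests

response = requests.post(
    "https://api.assemblyai.com/v2/realtime/token",
    headers={"authorization": "YOUR_API_KEY"},  # placeholder key
    json={"expires_in": 3600},
)
token = response.json()["token"]

ws_url = f"wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={token}"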