Universal-3 Pro

Stream audio and receive real-time transcription results using the Universal-3 Pro model. It is the most accurate streaming model, built for voice agents that demand the highest quality, and it supports advanced prompting.

Supported languages: English, Spanish, German, French, Portuguese, and Italian.

<Note> To use our EU server for Streaming STT, replace `streaming.assemblyai.com` with `streaming.eu.assemblyai.com`. </Note>

Handshake

WSS
wss://streaming.assemblyai.com/v3/ws

Headers

Authorization string Optional

Use your API key for authentication, or generate a temporary token and pass it via the `token` query parameter.

Query parameters

speech_model enum Required
The speech model to use.
Allowed values:
encoding enum Optional Defaults to pcm_s16le
Encoding of the audio stream.
Allowed values:
inactivity_timeout integer Optional Range 5-3600
Time in seconds of inactivity before the session is terminated. If not set, no inactivity timeout is applied.
keyterms_prompt list of strings Optional
A list of words and phrases whose recognition accuracy should be improved.
language_detection enum Optional Defaults to false

Whether to return `language_code` and `language_confidence` in turn messages. Universal-3 Pro natively code-switches between English, Spanish, German, French, Portuguese, and Italian by default, with no configuration required.

Allowed values:
max_turn_silence integer Optional Defaults to 1000
Maximum silence in milliseconds before the turn is forced to end, regardless of punctuation.
min_turn_silence integer Optional Defaults to 100

Silence duration in milliseconds before a speculative end-of-turn check. If terminal punctuation is found, the turn ends. Otherwise, a partial is emitted and the turn continues.
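The interaction between `min_turn_silence` and `max_turn_silence` can be illustrated with a small decision function. This is only a sketch of the behaviour described above, not the server's implementation: the function name `turn_decision` is hypothetical, and the set of characters treated as terminal punctuation is an assumption, since this page does not list them.

```python
# Assumption: the page does not say which characters count as terminal punctuation.
TERMINAL_PUNCTUATION = (".", "?", "!")


def turn_decision(silence_ms, transcript, min_turn_silence=100, max_turn_silence=1000):
    """Return 'end', 'partial', or 'continue' for the current silence duration."""
    if silence_ms >= max_turn_silence:
        # Forced end of turn, regardless of punctuation.
        return "end"
    if silence_ms >= min_turn_silence:
        # Speculative end-of-turn check: terminal punctuation ends the turn,
        # otherwise a partial is emitted and the turn continues.
        if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
            return "end"
        return "partial"
    return "continue"
```

With the defaults, 150 ms of silence after "Hello." ends the turn, while the same silence after "Hello and" only produces a partial; the turn is then forced to end once silence reaches 1000 ms.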

prompt string Optional Beta
Custom transcription instructions for the model. When not provided, a default prompt optimized for native turn detection is used automatically. Prompting is a beta feature.
sample_rate integer Required Defaults to 16000
Sample rate of the audio stream.
token string Optional

Temporary token for authentication, if you are using one instead of an API key.

vad_threshold double Optional Defaults to 0.3

The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection.
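As a rough sketch of how the handshake inputs fit together, the helper below assembles the WebSocket URL and headers from the parameters listed above. The helper name `build_handshake` is hypothetical, the `speech_model` value is left to the caller because the allowed values are not listed here, and two things are assumptions: that the `Authorization` header carries the raw API key, and that exactly one of API key or token should be supplied.

```python
from urllib.parse import urlencode


def build_handshake(speech_model, sample_rate=16000, api_key=None, token=None,
                    eu=False, **optional):
    """Return (url, headers) for the streaming handshake."""
    # Design choice of this sketch: require exactly one authentication method.
    if (api_key is None) == (token is None):
        raise ValueError("provide exactly one of api_key or token")

    # inactivity_timeout is documented with a 5-3600 second range.
    timeout = optional.get("inactivity_timeout")
    if timeout is not None and not 5 <= timeout <= 3600:
        raise ValueError("inactivity_timeout must be between 5 and 3600 seconds")

    # The EU server uses a different hostname, as described in the note above.
    host = "streaming.eu.assemblyai.com" if eu else "streaming.assemblyai.com"

    query = {"speech_model": speech_model, "sample_rate": sample_rate, **optional}
    headers = {}
    if token is not None:
        query["token"] = token  # temporary token goes in the query string
    else:
        headers["Authorization"] = api_key  # assumption: raw key, no scheme prefix

    return f"wss://{host}/v3/ws?{urlencode(query)}", headers
```

The returned URL and headers can then be passed to whichever WebSocket client you use. List-valued parameters such as `keyterms_prompt` are left out on purpose, because this page does not specify how they are encoded in the query string.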

Send

sendAudio string Required format: "binary"
Send audio data chunks for transcription. The payload must be of type bytes, and each chunk must contain between 50 ms and 1000 ms of audio.
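The 50 ms to 1000 ms limit translates into a byte count that depends on the sample rate and encoding. The sketch below splits a buffer into chunks of a chosen duration, assuming mono `pcm_s16le` audio (2 bytes per sample); the mono assumption and the function name `chunk_pcm_s16le` are not from this page. A trailing chunk shorter than 50 ms is padded with zero bytes, which is silence in this encoding.

```python
def chunk_pcm_s16le(audio, sample_rate=16000, chunk_ms=100):
    """Split mono 16-bit PCM bytes into chunks of chunk_ms milliseconds."""
    if not 50 <= chunk_ms <= 1000:
        raise ValueError("chunk_ms must be between 50 and 1000")

    bytes_per_sample = 2  # pcm_s16le, assuming a single channel
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    min_bytes = sample_rate * 50 // 1000 * bytes_per_sample

    chunks = []
    for start in range(0, len(audio), chunk_bytes):
        chunk = audio[start:start + chunk_bytes]
        if len(chunk) < min_bytes:
            # Pad a short final chunk with silence up to the 50 ms minimum.
            chunk += b"\x00" * (min_bytes - len(chunk))
        chunks.append(chunk)
    return chunks
```

At 16000 Hz, a 100 ms chunk is 3200 bytes and the 50 ms minimum is 1600 bytes, so one second of audio yields ten chunks.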
OR
sendUpdateConfiguration object Required

Update streaming configuration parameters during an active session. You can update prompt, keyterms_prompt, min_turn_silence, and max_turn_silence.
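A configuration update could be built along the following lines. Only the four updatable field names come from this page; the JSON envelope, including the `"type": "UpdateConfiguration"` discriminator, is an assumption about the wire format, and the helper name is hypothetical. Check the message schema before relying on it.

```python
import json

# The four parameters this page lists as updatable during an active session.
UPDATABLE_FIELDS = {"prompt", "keyterms_prompt", "min_turn_silence", "max_turn_silence"}


def update_configuration_message(**changes):
    """Serialize a mid-session configuration update, rejecting other fields."""
    if not changes:
        raise ValueError("at least one field is required")
    unknown = set(changes) - UPDATABLE_FIELDS
    if unknown:
        raise ValueError(f"not updatable mid-session: {sorted(unknown)}")
    # Assumption: messages are JSON objects tagged with a "type" field.
    return json.dumps({"type": "UpdateConfiguration", **changes})
```

Rejecting unknown fields on the client side means a mistake such as trying to change `sample_rate` mid-session fails before anything is sent.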

OR
sendForceEndpoint object Required
Manually force an endpoint in the transcription.
OR
sendSessionTermination object Required
Gracefully terminate the streaming session.

Receive

receiveSessionBegins object Required
Receive confirmation that the streaming session has successfully started.
OR
receiveSpeechStarted object Required
Receive a notification that speech has been detected in the audio stream.
OR
receiveTurn object Required

Receive a formatted turn-based transcription result.

OR
receiveTermination object Required
Receive confirmation that the session has been terminated by the server.
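A client usually routes the four server messages above to separate handlers. The sketch below shows one way to classify an incoming message. It assumes each message is a JSON object with a `type` field and that the values are `Begin`, `SpeechStarted`, `Turn`, and `Termination`; neither the field nor the values are specified on this page, so treat them as placeholders, along with the function name `classify_message`.

```python
import json

# Assumption: hypothetical "type" values mapped to the four messages listed above.
MESSAGE_KINDS = {
    "Begin": "session_begins",
    "SpeechStarted": "speech_started",
    "Turn": "turn",
    "Termination": "termination",
}


def classify_message(raw):
    """Parse a server message and return (kind, payload)."""
    message = json.loads(raw)
    # Unrecognized types are reported as "unknown" instead of raising,
    # so new message types do not break the receive loop.
    kind = MESSAGE_KINDS.get(message.get("type"), "unknown")
    return kind, message
```

A receive loop can then branch on `kind`, for example appending text from `turn` messages and closing the socket on `termination`.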