Stream audio and receive real-time transcription results. Streaming transcription is fast, cost-effective, and available in three variants:
- **Universal-Streaming English** — the fastest real-time English transcription
- **Universal-Streaming Multilingual** — multilingual support (English, Spanish, German, French, Portuguese, and Italian) at the same speed and price
- **Whisper-Streaming** — open-source Whisper powered by AssemblyAI's infrastructure with 99+ languages
<Note> To use our EU server for Streaming STT, replace `streaming.assemblyai.com` with
`streaming.eu.assemblyai.com`. </Note>
Handshake
WSS
wss://streaming.assemblyai.com/v3/ws
Headers
`Authorization` (string, optional)
Use your API key for authentication. Alternatively, generate a temporary token and pass it via the `token` query parameter.
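The two authentication options above can be sketched as follows. This is a minimal illustration, not an official client: the helper names `auth_by_header` and `auth_by_token` are hypothetical, and it assumes the API key is sent as the raw `Authorization` header value with no scheme prefix.

```python
from urllib.parse import urlencode

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def auth_by_header(api_key: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for authenticating with the Authorization header.

    Assumption: the header value is the raw API key, without a scheme prefix.
    """
    return BASE_URL, {"Authorization": api_key}


def auth_by_token(temp_token: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for authenticating with the `token` query parameter."""
    return f"{BASE_URL}?{urlencode({'token': temp_token})}", {}
```

Either pair can then be handed to a WebSocket client of your choice. The token variant is typically the one to use from a browser or other untrusted client, since it keeps the API key off the device.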
Query parameters
`speech_model` (enum, required)
The speech model used for your Streaming session.
`encoding` (enum, optional; defaults to `pcm_s16le`)
Encoding of the audio stream.
`format_turns` (enum, optional; defaults to `false`)
Whether to return formatted final transcripts.
`inactivity_timeout` (integer, optional; range 5-3600)
Time in seconds of inactivity before the session is terminated. If not set, no inactivity timeout is applied.
`keyterms_prompt` (list of strings, optional)
A list of words and phrases whose recognition accuracy should be improved. See [Keyterms Prompting](https://www.assemblyai.com/docs/streaming/keyterms-prompting) for more details.
`language_detection` (enum, optional; defaults to `false`)
Whether to detect the language and return language metadata on utterances and final turns. Only available for the multilingual model.
`max_turn_silence` (integer, optional; defaults to 1280)
The maximum amount of silence in milliseconds allowed in a turn before end of turn is triggered. See [Turn Detection](https://www.assemblyai.com/docs/streaming/universal-streaming/turn-detection) for configuration details.
`min_turn_silence` (integer, optional; defaults to 400)
The minimum amount of silence in milliseconds required to detect end of turn when confident. See [Turn Detection](https://www.assemblyai.com/docs/streaming/universal-streaming/turn-detection) for configuration details.
`sample_rate` (integer, required; defaults to 16000)
Sample rate of the audio stream, in Hz.
`speaker_labels` (enum, optional; defaults to `false`)
Whether to enable [Streaming Speaker Diarization](https://www.assemblyai.com/docs/streaming/diarization-and-multichannel). When enabled, each Turn event will include a `speaker_label` field indicating the speaker.
`max_speakers` (integer, optional; range 1-10)
The maximum number of speakers expected in the audio stream (1-10). Setting this can improve speaker label accuracy when you know the number of speakers in advance. Only used when `speaker_labels` is enabled. See [Streaming Diarization](https://www.assemblyai.com/docs/streaming/diarization-and-multichannel) for more details.
The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection.
`end_of_turn_confidence_threshold` (double, optional; defaults to 0.4)
The confidence threshold (0.0 to 1.0) to use when determining if the end of a turn has been reached. See [Turn Detection](https://www.assemblyai.com/docs/streaming/universal-streaming/turn-detection) for configuration details.
Note: This parameter is only supported for the Universal-Streaming model.
`domain` (enum, required)
Enable domain-specific transcription models to improve accuracy for specialized terminology. Set to `"medical-v1"` to enable [Medical Mode](https://www.assemblyai.com/docs/streaming/medical-mode) for improved accuracy of medical terms such as medications, procedures, conditions, and dosages. Supported languages: English (`en`), Spanish (`es`), German (`de`), French (`fr`). If used with an unsupported language, the parameter is ignored and a warning is returned.
`language` (enum, optional, deprecated; defaults to `en`)
The language of your audio stream.
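As a rough sketch of how the query parameters above combine into a connection URL, the helper below builds the `wss://` address and enforces the documented numeric ranges. The function name `build_streaming_url` and the `eu` flag are hypothetical. Two encoding choices are assumptions not stated on this page: booleans are serialized as lowercase `true`/`false`, and list values such as `keyterms_prompt` are JSON-encoded. The `speech_model` value shown in the usage line is a placeholder, since the allowed values are not listed here.

```python
import json
from urllib.parse import urlencode


def build_streaming_url(speech_model: str, sample_rate: int = 16000, *,
                        eu: bool = False, **options) -> str:
    """Build a Streaming STT WebSocket URL from query parameters."""
    # EU host swap as described in the note at the top of the page.
    host = "streaming.eu.assemblyai.com" if eu else "streaming.assemblyai.com"
    params = {"speech_model": speech_model, "sample_rate": sample_rate, **options}

    # Range checks taken from the parameter descriptions.
    timeout = params.get("inactivity_timeout")
    if timeout is not None and not 5 <= timeout <= 3600:
        raise ValueError("inactivity_timeout must be between 5 and 3600 seconds")
    speakers = params.get("max_speakers")
    if speakers is not None and not 1 <= speakers <= 10:
        raise ValueError("max_speakers must be between 1 and 10")
    threshold = params.get("end_of_turn_confidence_threshold")
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("end_of_turn_confidence_threshold must be in [0.0, 1.0]")

    encoded = {}
    for key, value in params.items():
        if isinstance(value, bool):
            # Assumption: booleans travel as lowercase strings.
            encoded[key] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Assumption: list parameters are sent as a JSON array.
            encoded[key] = json.dumps(list(value))
        else:
            encoded[key] = str(value)
    return f"wss://{host}/v3/ws?{urlencode(encoded)}"


# Placeholder model name; substitute one of the allowed speech_model values.
url = build_streaming_url("YOUR_SPEECH_MODEL", format_turns=True, max_speakers=2)
```

Validating on the client side is optional, but it surfaces a bad value before the handshake rather than as a rejected connection.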
Send
`sendAudio` (string, required; format: "binary")
Send audio data chunks for transcription. The payload must be raw bytes containing between 50 ms and 1000 ms of audio.
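To make the 50 ms to 1000 ms constraint concrete, here is a small sketch that converts a chunk duration into a byte count and slices a PCM buffer accordingly. It assumes mono audio in the default `pcm_s16le` encoding (2 bytes per sample); the function names are hypothetical.

```python
from typing import Iterator


def bytes_per_chunk(sample_rate: int = 16000, chunk_ms: int = 50,
                    bytes_per_sample: int = 2) -> int:
    """Number of bytes in one chunk of mono PCM audio of the given duration."""
    if not 50 <= chunk_ms <= 1000:
        raise ValueError("chunk duration must be between 50 ms and 1000 ms")
    return sample_rate * chunk_ms // 1000 * bytes_per_sample


def iter_chunks(pcm: bytes, sample_rate: int = 16000,
                chunk_ms: int = 50) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a PCM buffer.

    The final slice may be shorter than chunk_ms; callers may want to pad
    or drop it if it falls below the 50 ms minimum.
    """
    size = bytes_per_chunk(sample_rate, chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


one_second = bytes(16000 * 2)  # 1 s of silence at 16 kHz, 16-bit mono
chunks = list(iter_chunks(one_second, chunk_ms=100))
```

At 16 kHz, a 50 ms chunk is 800 samples, or 1600 bytes; each slice would be sent as one binary WebSocket frame.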
OR
`sendUpdateConfiguration` (object, required)
Update streaming configuration parameters during an active session.
OR
`sendForceEndpoint` (object, required)
Manually force an endpoint in the transcription.
OR
`sendSessionTermination` (object, required)
Gracefully terminate the streaming session.
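The three control messages above are JSON objects sent as text frames, in contrast to the binary audio frames. This page does not spell out their payloads, so the sketch below is illustrative only: the `type` values (`UpdateConfiguration`, `ForceEndpoint`, `Terminate`) and the helper names are assumptions and should be checked against the message schemas before use.

```python
import json


def update_configuration_message(**settings) -> str:
    """Serialize a mid-session configuration update.

    Assumption: the message carries a "type" discriminator plus the
    parameters being changed, e.g. max_turn_silence.
    """
    return json.dumps({"type": "UpdateConfiguration", **settings})


def force_endpoint_message() -> str:
    """Serialize a request to force an endpoint (assumed payload shape)."""
    return json.dumps({"type": "ForceEndpoint"})


def terminate_message() -> str:
    """Serialize a graceful session termination request (assumed payload shape)."""
    return json.dumps({"type": "Terminate"})
```

For instance, `update_configuration_message(max_turn_silence=2000)` would produce a text frame that raises the silence allowance for the rest of the session, under the assumptions stated above.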
Receive
`receiveSessionBegins` (object, required)
Receive confirmation that the streaming session has successfully started.
OR
`receiveTurn` (object, required)
Receive a formatted turn-based transcription result.
OR
`receiveTermination` (object, required)
Receive confirmation that the session has been terminated by the server.
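A client typically routes each incoming text frame by event kind. The dispatcher below is a hedged sketch of that pattern: the `type` values (`Begin`, `Turn`, `Termination`) and the fields `id`, `transcript`, and `end_of_turn` are assumptions, because the event schemas are not reproduced on this page. Only `speaker_label` is mentioned above, for sessions with `speaker_labels` enabled.

```python
import json


def handle_message(raw: str) -> str:
    """Return a short human-readable summary of one server event."""
    event = json.loads(raw)
    kind = event.get("type")  # assumed discriminator field

    if kind == "Begin":
        return f"session started: {event.get('id')}"
    if kind == "Turn":
        # Assumed fields: `transcript` text and an `end_of_turn` flag.
        marker = "final" if event.get("end_of_turn") else "partial"
        speaker = event.get("speaker_label")
        prefix = f"[{speaker}] " if speaker else ""
        return f"{marker}: {prefix}{event.get('transcript', '')}"
    if kind == "Termination":
        return "session terminated"
    # Unknown events are reported rather than raising, so new event
    # types do not crash the receive loop.
    return f"unhandled event type: {kind!r}"
```

Returning a string keeps the sketch easy to test; a real client would more likely invoke callbacks or push events onto a queue.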