AssemblyAI's Real-Time Transcription allows you to transcribe live audio streams with high accuracy and low latency. By streaming your audio data to our secure WebSocket API, you can receive transcripts back within a few hundred milliseconds.
Real-Time Transcription is only available for English. See Supported languages.
The audio format must conform to the following requirements:
- PCM16 or Mu-law encoding (See Specify the encoding)
- A sample rate that matches the value of the supplied sample rate parameter
- 100 to 2000 milliseconds of audio per message
Audio segments with a duration between 100 ms and 450 ms produce the best results in transcription accuracy.
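The message-size requirement above translates directly into a byte count per message. The following is a minimal sketch of that arithmetic, assuming mono audio; the helper names `chunk_size_bytes` and `split_audio` are hypothetical and are not part of any AssemblyAI SDK.

```python
PCM16_BYTES_PER_SAMPLE = 2  # PCM16 stores each sample in 16 bits
MULAW_BYTES_PER_SAMPLE = 1  # Mu-law stores each sample in 8 bits

MIN_CHUNK_MS = 100   # shortest message allowed, per the requirements above
MAX_CHUNK_MS = 2000  # longest message allowed, per the requirements above


def chunk_size_bytes(sample_rate: int, duration_ms: int,
                     bytes_per_sample: int = PCM16_BYTES_PER_SAMPLE) -> int:
    """Return the size in bytes of one message of mono audio (assumption: one channel)."""
    if not MIN_CHUNK_MS <= duration_ms <= MAX_CHUNK_MS:
        raise ValueError(
            f"duration_ms must be between {MIN_CHUNK_MS} and {MAX_CHUNK_MS}"
        )
    samples = sample_rate * duration_ms // 1000
    return samples * bytes_per_sample


def split_audio(pcm: bytes, sample_rate: int, duration_ms: int = 250) -> list:
    """Split raw PCM16 bytes into messages of duration_ms each.

    The final message may be shorter than duration_ms if the audio
    does not divide evenly.
    """
    size = chunk_size_bytes(sample_rate, duration_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

For example, `chunk_size_bytes(16000, 250)` gives the number of bytes in a 250 ms PCM16 message at 16 kHz, which falls inside the 100 ms to 450 ms range recommended above.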
Add custom vocabulary
You can add up to 2500 characters of custom vocabulary to boost the transcription probability of those words and phrases.
For this, create a list of strings and set the
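Before passing a vocabulary list to the transcriber, it can help to check it against the 2500-character limit. The sketch below assumes the limit is counted as the sum of the lengths of the individual strings, which the text above does not state; `check_vocabulary` is a hypothetical helper, not an SDK function.

```python
MAX_VOCAB_CHARS = 2500  # limit stated above


def check_vocabulary(words: list) -> list:
    """Strip blank entries and verify the list fits the character limit.

    Assumption: the limit applies to the summed length of all strings.
    """
    cleaned = [w.strip() for w in words if w.strip()]
    total = sum(len(w) for w in cleaned)
    if total > MAX_VOCAB_CHARS:
        raise ValueError(
            f"custom vocabulary has {total} characters; the limit is {MAX_VOCAB_CHARS}"
        )
    return cleaned
```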
Authenticate with a temporary token
If you need to authenticate on the client, you can avoid exposing your API key by using temporary authentication tokens.
To generate a temporary token, call
Use the expires_in parameter to specify how long the token should be valid for, in seconds.
Note: The expires_in parameter must have a value between 60 and 360000 seconds.
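Validating the expiry before requesting a token avoids a round trip that would be rejected. This is a minimal sketch; `token_request_body` is a hypothetical helper, and the shape of the returned dictionary is an assumption for illustration rather than a documented request format.

```python
MIN_EXPIRES_IN = 60      # seconds, lower bound stated above
MAX_EXPIRES_IN = 360000  # seconds, upper bound stated above


def token_request_body(expires_in: int) -> dict:
    """Return a request payload after checking expires_in against the allowed range.

    Assumption: the payload is a mapping with a single expires_in field.
    """
    if not MIN_EXPIRES_IN <= expires_in <= MAX_EXPIRES_IN:
        raise ValueError(
            f"expires_in must be between {MIN_EXPIRES_IN} and {MAX_EXPIRES_IN} seconds"
        )
    return {"expires_in": expires_in}
```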
You can now use this temporary token to authenticate a new WebSocket session.
Note: Each token has a one-time use restriction and can only be used for a single session.
To use it, specify the token parameter when initializing the real-time transcriber.
Manually end current utterance
To manually end an utterance, call
Manually ending an utterance immediately produces a final transcript.
Configure the threshold for automatic utterance detection
You can configure the threshold for how long to wait before ending an utterance.
To change the threshold, you can specify the end_utterance_silence_threshold parameter when initializing the real-time transcriber.
After the session has started, you can change the threshold by calling
By default, Real-Time Transcription ends an utterance after 700 milliseconds of silence. Once the session has started, you can change the threshold any number of times. The valid range is 0 to 20000 milliseconds.
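To make the effect of the threshold concrete, the following sketch simulates silence-based utterance detection on the client side. It is an illustration only and does not reproduce how the service detects utterances internally; the `UtteranceTracker` class and its methods are hypothetical names, and the sketch assumes the threshold is measured in milliseconds.

```python
from dataclasses import dataclass

DEFAULT_THRESHOLD_MS = 700  # default stated above
MIN_THRESHOLD_MS = 0        # lower bound of the valid range
MAX_THRESHOLD_MS = 20000    # upper bound of the valid range


@dataclass
class UtteranceTracker:
    threshold_ms: int = DEFAULT_THRESHOLD_MS
    silence_ms: int = 0  # silence accumulated since the last speech chunk

    def __post_init__(self) -> None:
        self.set_threshold(self.threshold_ms)

    def set_threshold(self, threshold_ms: int) -> None:
        """Change the threshold; may be called any number of times."""
        if not MIN_THRESHOLD_MS <= threshold_ms <= MAX_THRESHOLD_MS:
            raise ValueError(
                f"threshold must be between {MIN_THRESHOLD_MS} and {MAX_THRESHOLD_MS}"
            )
        self.threshold_ms = threshold_ms

    def feed(self, chunk_ms: int, is_silent: bool) -> bool:
        """Record one audio chunk; return True if it ends the utterance."""
        if not is_silent:
            self.silence_ms = 0  # speech resets the silence counter
            return False
        self.silence_ms += chunk_ms
        if self.silence_ms >= self.threshold_ms:
            self.silence_ms = 0  # start counting afresh for the next utterance
            return True
        return False
```

With the default threshold and 250 ms chunks, the third consecutive silent chunk ends the utterance, because the accumulated silence first reaches 700 ms at that point. Lowering the threshold mid-session makes the tracker end utterances sooner.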
To learn about using Real-Time Transcription, see the following resources: