Walkthroughs
Authentication is handled via the authorization header. This header must be included in all of your API requests, and its value must be your API token. All endpoints require authentication.
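For example, a minimal Python sketch of attaching the header, using the requests library (the token and transcript ID are placeholders):

import requests

# Every request to the API carries the same authorization header
headers = {"authorization": "YOUR_API_TOKEN"}

# For example, fetching a transcript by its id
response = requests.get(
    "https://api.assemblyai.com/v2/transcript/TRANSCRIPT_ID",
    headers=headers,
)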
The AssemblyAI API can transcribe audio/video files that are accessible via a URL, such as audio files in an S3 bucket, on your server, or available via the Twilio API.
Local files?
Need to upload files directly to the API? Jump to the tutorial on uploading files.
In the code sample to the right, we show how to submit the URL of your audio/video file to the API for transcription. After submitting your POST request, you will get a response that includes an id key and a status key.

The status key shows the status of your transcription. It will start as "queued", then go to "processing", and finally to "completed".
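If you're not following along with a code sample, a minimal Python sketch of the submission request might look like this (the token and audio URL are placeholders):

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"

headers = {
    "authorization": "YOUR_API_TOKEN",
    "content-type": "application/json",
}

# The audio_url must point to a file our servers can reach
json_payload = {"audio_url": "https://example.com/path/to/audio.mp3"}

response = requests.post(endpoint, json=json_payload, headers=headers)
print(response.json()["id"], response.json()["status"])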
To check on the status of your transcription, see the docs for Getting the Transcription Result.
Note
There is a minimum and maximum audio duration for files submitted to the API for transcription. The minimum duration is 160 milliseconds and the maximum duration is 10 hours. If you submit a file with a duration outside those bounds, the transcription will be unsuccessful, and the JSON response for the GET request will contain a status of error and an error message saying the duration of the audio is too short or too long.
After you submit an audio file for processing, the "status" key will go from "queued" to "processing" to "completed". You can make a GET request, as shown on the right, to check for updates on the status of your transcription.

You'll have to make repeated GET requests until the status is "completed" or "error". Once the status key is "completed", the text, words, and other keys, including the results of any Audio Intelligence features you enabled, will be populated in the JSON response with the results of your transcription.
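A simple polling loop in Python might look like the following sketch, assuming the requests library and the id returned by your POST request:

import time
import requests

headers = {"authorization": "YOUR_API_TOKEN"}
transcript_id = "YOUR_TRANSCRIPT_ID"  # the id returned by the POST request
polling_endpoint = f"https://api.assemblyai.com/v2/transcript/{transcript_id}"

while True:
    result = requests.get(polling_endpoint, headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)  # wait between polls rather than hammering the API

print(result.get("text") or result.get("error"))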
The language_code key can be used to specify the language of the speech in your audio file, for example, English or Spanish. For a full list of supported languages, see the Supported Languages page.

In the code examples to the right, you can see how to submit an audio file to the API for transcription with the language_code key included.
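As a sketch, here's what the request body might look like with a language_code included (the Spanish code shown is illustrative; see the Supported Languages page for exact values):

import requests

headers = {"authorization": "YOUR_API_TOKEN"}

json_payload = {
    "audio_url": "https://example.com/path/to/audio.mp3",
    "language_code": "es",  # Spanish; omit this key for the en_us default
}

response = requests.post("https://api.assemblyai.com/v2/transcript", json=json_payload, headers=headers)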
If you are unsure of the dominant language spoken in your audio file, you can use our Automatic Language Detection feature to automatically identify the dominant language in your file.
Pro tip
The language_code parameter is optional. If you do not include a language_code parameter in your request, the default value will be en_us.
If your audio files aren't accessible via a URL already (like in an S3 bucket, static file server, or via an API like Twilio), you can upload your files directly to the AssemblyAI API.
Once your upload finishes, you'll get back a JSON response that includes an upload_url key. The upload_url points to a private URL, accessible only to AssemblyAI's backend servers, that you can submit for processing via the /v2/transcript endpoint.

Once your audio file is uploaded, you can submit it for transcription as you would any normal audio file. The URL in the upload_url key is what you'll use as the audio_url when Submitting Files for Transcription.
Pro tip
If you're not using our code examples, keep in mind the API expects the upload to be streamed to the API using Chunked Transfer Encoding. Most HTTP libraries have a nice interface for handling this; for example, in Python, the requests library has a simple way to do Chunked Transfer Encoding uploads.
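As a sketch, passing a generator as the request body makes the requests library stream the upload with Chunked Transfer Encoding (the file path and chunk size here are illustrative):

import requests

headers = {"authorization": "YOUR_API_TOKEN"}

def read_file(filename, chunk_size=5242880):
    # Yielding chunks makes requests use Chunked Transfer Encoding
    with open(filename, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield data

response = requests.post(
    "https://api.assemblyai.com/v2/upload",
    headers=headers,
    data=read_file("/path/to/your/file.mp3"),
)
print(response.json()["upload_url"])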
Heads up
For privacy and security reasons, all uploads are immediately deleted after transcription completes. Because of this, the upload_url you receive when uploading a local file can only be used once. If you try to use that URL more than once, you will get an error.
Instead of polling for the result of your transcription, you can receive a webhook once your transcript is complete, or if there was an error transcribing your audio file.
When submitting an audio file for transcription, you can include the additional parameter webhook_url in your POST request. This must be a URL that our backend can reach.
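A hedged Python sketch of including webhook_url in the request (the webhook URL is a placeholder for an endpoint on your server):

import requests

headers = {"authorization": "YOUR_API_TOKEN"}

json_payload = {
    "audio_url": "https://example.com/path/to/audio.mp3",
    # Our backend will POST here when the transcript completes or errors
    "webhook_url": "https://foo.com/webhook",
}

response = requests.post("https://api.assemblyai.com/v2/transcript", json=json_payload, headers=headers)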
You'll receive a webhook when your transcription goes to status "completed", or when your transcription goes to status "error". In either case, a POST request will be made to the webhook URL you supplied. AssemblyAI sends webhook requests from a static IP address: 44.238.19.20. The headers and body will look like this:
headers
---
content-type: application/json
content-length: 82
accept: */*
accept-encoding: gzip, deflate
user-agent: python-requests/2.25.1
request body
---
transcript_id: 5552493-16d8-42d8-8feb-c2a16b56f6e8
status: completed
Once you receive the webhook, you can make a GET
request to /v2/transcript
to fetch the final result of your transcription.
Note
Static IP addresses are used for added security to verify webhook requests sent from AssemblyAI. This feature allows you to optionally whitelist IP addresses for incoming webhooks sent from our servers, giving you the ability to validate the source of all incoming webhook requests.
A Custom Header can be used for added security to authenticate webhook requests from AssemblyAI. This feature allows a developer to optionally provide a value to be used as an authorization header on the returning webhook from AssemblyAI, giving you the ability to validate incoming webhook requests.

To use a Custom Header, include two additional parameters in your POST request: webhook_auth_header_name and webhook_auth_header_value. The webhook_auth_header_name parameter accepts a string containing the name of the header that will be inserted into the webhook request. The webhook_auth_header_value parameter accepts a string with the value of that header. See the code examples to the right for more information on including a Custom Header in your request.
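A hedged Python sketch of a request that includes both Custom Header parameters:

import requests

headers = {"authorization": "YOUR_API_TOKEN"}

json_payload = {
    "audio_url": "https://example.com/path/to/audio.mp3",
    "webhook_url": "https://foo.com/webhook",
    # These two fields control the header attached to the webhook request
    "webhook_auth_header_name": "Authorization",
    "webhook_auth_header_value": "Bearer foobar",
}

response = requests.post("https://api.assemblyai.com/v2/transcript", json=json_payload, headers=headers)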
Here is what the headers and body of the webhook request will look like if you include a Custom Header with a webhook_auth_header_name of "Authorization" and a webhook_auth_header_value of "Bearer foobar":
headers
---
content-type: application/json
content-length: 82
authorization: Bearer foobar
accept: */*
accept-encoding: gzip, deflate
user-agent: python-requests/2.25.1
request body
---
transcript_id: 5552493-16d8-42d8-8feb-c2a16b56f6e8
status: completed
Oftentimes, you'll want to associate certain metadata with your transcription request, such as a customer ID, and have that passed back to your webhook. The easiest way to do this is to include these parameters in your webhook URL as query parameters, for example:
https://foo.com/webhook?myParam1=foo&myParam2=bar
Then, when you receive the webhook, you can parse these parameters out of the webhook URL.
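For example, a minimal Python sketch of parsing those query parameters out of the URL string:

from urllib.parse import urlparse, parse_qs

webhook_url = "https://foo.com/webhook?myParam1=foo&myParam2=bar"

params = parse_qs(urlparse(webhook_url).query)
print(params["myParam1"][0])  # "foo"
print(params["myParam2"][0])  # "bar"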
If we get a non-2xx response when we POST to your webhook URL, we'll retry the request 10 times, with a 10-second interval between each retry. After all 10 retries fail, we'll consider the webhook to be permanently failed.
If we are unable to reach your webhook URL (usually caused by a timeout, or your server being offline), no retries will be attempted.
If you're working with live audio streams, you can stream your audio data in real-time to our secure WebSocket API found at wss://api.assemblyai.com/v2/realtime/ws. We will stream transcripts back to you within a few hundred milliseconds, and additionally, revise these transcripts with more accuracy over time as more context arrives.
Here are some open-source examples of our real-time endpoint to help you get started.
Websocat is a CLI for testing out WebSocket APIs. We'll use this tool in our examples. You can find more info on Websocat here.
To connect with the real-time endpoint, you must use a WebSocket client and establish a connection with wss://api.assemblyai.com/v2/realtime/ws.
Authentication
Authentication is handled via the authorization header. The value of this header should be your API token. For example, in websocat:
$ websocat wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000 -H Authorization:<API_TOKEN>
{
"message_type": "SessionBegins",
"session_id": "d3e8c537-2f11-494b-b497-e59a434588bd",
"expires_at": "2021-04-07T11:32:25.300329"
}
If you would like to create a temporary token for in-browser authentication, you can learn more about that here.
Required Query Params
This endpoint also requires a query param sample_rate that defines the sample rate of your audio data, as shown in the websocat example above.
Once your request is authorized and the connection established, your client will receive a "SessionBegins" message with the following JSON data:
Parameter | Example | Info |
---|---|---|
message_type | SessionBegins | Describes the message type. |
session_id | d3e8c537-2f11-494b-b497-e59a434588bd | Unique identifier for the established session. Can be used to reestablish session. |
expires_at | 2021-04-07T11:32:25.300329 | Timestamp when this session will expire. |
When sending audio over the WebSocket connection, you should send a JSON payload with the following parameters.
Parameter | Example | Info |
---|---|---|
audio_data | UklGRtjIAABXQVZFZ… | Raw audio data, base64 encoded. This can be the raw data recorded directly from a microphone or read from an audio file. |
base64 encoding
base64 encoding is a simple way to encode your raw audio data so that it can be included as a JSON parameter in your websocket message. Most programming languages have very simple built-in functions for encoding binary data to base64.
For example, a message payload would look like this:
{
"audio_data": "UklGRtjIAABXQVZFZ..."
}
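As a sketch, base64-encoding a chunk of raw audio and sending it with Python's websockets library might look like this (ws is assumed to be an established connection):

import base64
import json

async def send_audio(ws, raw_audio):
    # raw_audio: bytes captured from a microphone or read from a file
    payload = {"audio_data": base64.b64encode(raw_audio).decode("utf-8")}
    await ws.send(json.dumps(payload))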
The raw audio data in the audio_data field above must comply with a strict encoding format. This is because we don't do any transcoding of your data; we send it directly to the model for transcription to reduce latency. In particular, the sample rate of your audio must match the sample_rate query param you supply.

Our real-time transcription pipeline uses a two-phase transcription strategy, broken into partial and final results.
As you send audio data to the API, the API will immediately start responding with Partial Results. The following keys will be in the JSON response from the WebSocket API.
Parameter | Example | Info |
---|---|---|
message_type | PartialTranscript | Describes the type of message. |
session_id | "5551722-f677-48a6-9287-39c0aafd9ac1" | The unique id of your transcription. |
audio_start | 1200 | Start time of audio sample relative to session start, in milliseconds. |
audio_end | 1850 | End time of audio sample relative to session start, in milliseconds. |
confidence | 0.956 | The confidence score of the entire transcription, between 0 and 1. |
text | "You know Demons on TV like..." | The complete transcription for your audio. |
words | [{"confidence": 1.0, "end": 440, "start": 0, "text": "You"}, ...] | An array of objects, with the information for each word in the transcription text. Will include the start/end time (in milliseconds) of the word, and the confidence score of the word. |
created | "2019-06-27 22:26:47.048512" | The timestamp for your request. |
After you've received your partial results, our model will continue to analyze incoming audio and, when it detects the end of an "utterance" (usually a pause in speech), it will finalize the results sent to you so far with higher accuracy, as well as add punctuation and casing to the transcription text.
The following keys will be in the JSON response from the WebSocket API when Final Results are sent:
Parameter | Example | Info |
---|---|---|
message_type | FinalTranscript | Describes the type of message. |
session_id | "5551722-f677-48a6-9287-39c0aafd9ac1" | The unique id of your transcription. |
audio_start | 1200 | Start time of audio sample relative to session start, in milliseconds. |
audio_end | 1850 | End time of audio sample relative to session start, in milliseconds. |
confidence | 0.956 | The confidence score of the entire transcription, between 0 and 1. |
text | "You know Demons on TV like..." | The complete transcription for your audio. |
words | [{"confidence": 1.0, "end": 440, "start": 0, "text": "You"}, ...] | An array of objects, with the information for each word in the transcription text. Will include the start/end time (in milliseconds) of the word, and the confidence score of the word. |
created | "2019-06-27 22:26:47.048512" | The timestamp for your request. |
When you've completed your session, your client should send a JSON message with the following field.
Parameter | Example | Info |
---|---|---|
terminate_session | true | A boolean value to communicate that you wish to end your real-time session forever. |
This JSON message can be sent to the websocket as shown in this Python example:
import json

# Create the termination message
data = {'terminate_session': True}

# Convert the message to a JSON string
json_data = json.dumps(data)

# Send the message over your open WebSocket connection
# (here, `websocket` is an established connection object)
await websocket.send(json_data)
If you have outstanding final transcripts, they will be sent to you. To finalize the session, a SessionTerminated message is sent to confirm that our API has terminated your session. A terminated session cannot be reused.
$ websocat wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000 -H authorization:<API_TOKEN>
{
"message_type": "SessionBegins",
"session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"
}
...send audio...
...receive results...
{"message_type": "SessionTerminated"}
{"message_type": "FinalTranscript", ...}
{"message_type": "SessionTerminated", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}
The WebSocket specification provides standard errors; you can find a brief breakdown of them here.

Our API provides application-level WebSocket errors for well-known scenarios. Here's a breakdown of them.
Error Condition | Status Code | Message |
---|---|---|
bad sample rate | 4000 | "Sample rate must be a positive integer" |
auth failed | 4001 | "Not Authorized" |
insufficient funds | 4002 | "Insufficient Funds" |
free tier user | 4002 | "This feature is paid-only and requires you to add a credit card. Please visit https://app.assemblyai.com/ to add a credit card to your account" |
attempt to connect to nonexistent session id | 4004 | "Session not found" |
session expired | 4008 | "Session Expired" |
attempt to connect to closed session | 4010 | "Session previously closed" |
rate limited | 4029 | "Client sent audio too fast" |
unique session violation | 4030 | "Session is handled by another websocket" |
session times out | 4031 | "Session idle for too long" |
audio too short | 4032 | "Audio duration is too short" |
audio too long | 4033 | "Audio duration is too long" |
bad json | 4100 | "Endpoint received invalid JSON" |
bad schema | 4101 | "Endpoint received a message with an invalid schema" |
too many streams | 4102 | "This account has exceeded the number of allowed streams" |
reconnected | 4103 | "This session has been reconnected. This websocket is no longer valid." |
reconnect attempts exhausted | 1013 | "Temporary server condition forced blocking client's request" |
The following limits are imposed to ensure performance and service quality. Please contact us if you'd like to increase these limits.
Developers can also add up to 2500 characters of custom vocabulary to their real-time session by adding the optional query parameter word_boost in the URL. The parameter should map to a JSON-encoded list of strings, as shown in this Python example:
import json
from urllib.parse import urlencode
sample_rate = 16000
word_boost = ["foo", "bar"]
params = {"sample_rate": sample_rate, "word_boost": json.dumps(word_boost)}
url = f"wss://api.assemblyai.com/v2/realtime/ws?{urlencode(params)}"
In some cases, a developer will need to authenticate on the client side and won't want to expose their AssemblyAI token. You can do this by sending a POST request to https://api.assemblyai.com/v2/realtime/token with the parameter expires_in: {TTL in seconds}. Below is a quick example in curl.
The `expires_in` parameter must be greater than or equal to 60 seconds.
curl --request POST \
--url https://api.assemblyai.com/v2/realtime/token \
--header 'authorization: YOUR_AAI_TOKEN' \
--header 'Content-Type: application/json' \
--data '{"expires_in": 60}'
In response you will receive the following JSON output:
{
"token": "b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd"
}
A developer can now use this temporary token in the browser to authenticate a new WebSocket session with the following endpoint: wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={New Temp Token}. An example of JavaScript in the browser would be as follows.
let socket;
const token =
"b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd";
socket = new WebSocket(
`wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token=${token}`
);