Walkthroughs

# Authentication

Authentication is handled via the authorization header. Include this header in every API request you make; its value should be your API token. All endpoints require authentication.
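
For example, using Python's requests library (a minimal sketch; YOUR_API_TOKEN is a placeholder, and the GET call is just to show where the header goes):

import requests

# The same header goes on every request you make to the API
headers = {"authorization": "YOUR_API_TOKEN"}

response = requests.get("https://api.assemblyai.com/v2/transcript", headers=headers)
print(response.status_code)  # 401 means the token was missing or invalid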

# Submitting Files for Transcription

The AssemblyAI API can transcribe audio/video files that are accessible via a URL. For example, audio files in an S3 bucket, on your server, via the Twilio API, etc.

Local files?

Need to upload files directly to the API? Jump to Uploading Local Files for Transcription below.

In the code sample below, we show how to submit the URL of your audio/video file to the API for transcription. After submitting your POST request, you will get back a response that includes an id key and a status key.
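
A minimal sketch of that POST request in Python (the audio URL and token are placeholders):

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"
headers = {"authorization": "YOUR_API_TOKEN"}
json_data = {"audio_url": "https://example.com/audio/my-file.mp3"}

response = requests.post(endpoint, json=json_data, headers=headers)
transcript = response.json()
print(transcript["id"], transcript["status"])  # status starts as "queued"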

The status key shows the status of your transcription. It will start with "queued", and then go to "processing", and finally to "completed".

To check on the status of your transcription, see the docs for Getting the Transcription Result below.

Note

There is a minimum and a maximum audio duration for files submitted to the API for transcription. The minimum duration is 160 milliseconds and the maximum is 10 hours. If you submit a file shorter or longer than those limits, the transcription will be unsuccessful, and the JSON response to your GET request will contain a status of "error" with an error message saying that the duration of the audio is too short or too long.

# Getting the Transcription Result

After you submit an audio file for processing, the "status" key will go from "queued" to "processing" to "completed". You can make a GET request, as shown below, to check for updates on the status of your transcription.

You'll have to make repeated GET requests until the status is "completed" or "error". Once the status is "completed", the JSON response will be populated with the results of your transcription, including the text and words keys and the results of any Audio Intelligence features you enabled.
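
A sketch of a simple polling loop in Python (the transcript id and token are placeholders, and the poll interval is arbitrary):

import time
import requests

transcript_id = "YOUR_TRANSCRIPT_ID"  # the id returned when you submitted the file
endpoint = f"https://api.assemblyai.com/v2/transcript/{transcript_id}"
headers = {"authorization": "YOUR_API_TOKEN"}

while True:
    result = requests.get(endpoint, headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)  # wait between polls

print(result["status"], result.get("text"))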

# Specifying a Language

The language_code key can be used to specify the language of the speech in your audio file. For example, English or Spanish. For a full list of supported languages, see the Supported Languages page.

In the code example below, you can see how to submit an audio file to the API for transcription with the language_code key included.
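
For example, in Python ("es" is assumed here to be the code for Spanish; check the Supported Languages page for the exact values):

import requests

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    json={
        "audio_url": "https://example.com/audio/spanish-interview.mp3",
        "language_code": "es",  # assumed code for Spanish
    },
    headers={"authorization": "YOUR_API_TOKEN"},
)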

If you are unsure of the dominant language spoken in your audio file, you can use our Automatic Language Detection feature to automatically identify the dominant language in your file.

Pro tip

The language_code parameter is optional. If you do not include it in your request, the default value is en_us.

# Uploading Local Files for Transcription

If your audio files aren't accessible via a URL already (like in an S3 bucket, static file server, or via an API like Twilio), you can upload your files directly to the AssemblyAI API.

Once your upload finishes, you'll get back a JSON response that includes an upload_url key. The upload_url points to a private URL, accessible only to AssemblyAI's backend servers, that you can submit for processing via the /v2/transcript endpoint.

Submit your Upload for Transcription

Once your audio file is uploaded, you can submit it for transcription as you would any normal audio file. The URL in the upload_url key is what you'll use as the audio_url when Submitting Files for Transcription.

Pro tip

If you're not using our code examples, keep in mind that the API expects the upload to be streamed using Chunked Transfer Encoding. Most HTTP libraries have a convenient interface for this; in Python, for example, the requests library performs a Chunked Transfer Encoding upload whenever you pass it a generator.
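
As a sketch, assuming the upload endpoint is https://api.assemblyai.com/v2/upload (check the API reference for the exact path):

import requests

def read_file(filename, chunk_size=5242880):
    # Yielding chunks makes requests stream the body
    # with Chunked Transfer Encoding
    with open(filename, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield data

response = requests.post(
    "https://api.assemblyai.com/v2/upload",  # assumed endpoint path
    headers={"authorization": "YOUR_API_TOKEN"},
    data=read_file("./my-audio.mp3"),
)
print(response.json()["upload_url"])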

Heads up

For privacy and security reasons, all uploads are immediately deleted after transcription completes. Because of this, the upload_url you receive when uploading a local file can only be used once; if you try to use that URL again, you will get an error.

# Using Webhooks

Instead of polling for the result of your transcription, you can receive a webhook once your transcript is complete, or if there was an error transcribing your audio file.

Specify Your Webhook URL

When submitting an audio file for transcription, you can include the additional parameter webhook_url in your POST request. This must be a URL that can be reached by our backend.
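
For example, adding webhook_url to the POST request from Submitting Files for Transcription (the URLs and token are placeholders):

import requests

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    json={
        "audio_url": "https://example.com/audio/my-file.mp3",
        "webhook_url": "https://foo.com/webhook",  # must be reachable by our backend
    },
    headers={"authorization": "YOUR_API_TOKEN"},
)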

Receiving the Webhook

You'll receive a webhook when your transcription goes to status "completed" or to status "error". In either case, a POST request will be made to the webhook URL you supplied. AssemblyAI sends webhook requests from a static IP address: 44.238.19.20. The headers and body will look like this:

headers
---
content-type: application/json
content-length: 82
accept: */*
accept-encoding: gzip, deflate
user-agent: python-requests/2.25.1

request body
---
{
  "transcript_id": "5552493-16d8-42d8-8feb-c2a16b56f6e8",
  "status": "completed"
}

Once you receive the webhook, you can make a GET request to /v2/transcript to fetch the final result of your transcription.
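
As an illustration only, a minimal webhook receiver could look like this sketch using Flask (the /webhook route and the token are assumptions, not part of our API):

from flask import Flask, request
import requests

app = Flask(__name__)
API_TOKEN = "YOUR_API_TOKEN"

@app.route("/webhook", methods=["POST"])
def webhook():
    payload = request.get_json()
    if payload["status"] == "completed":
        # Fetch the full result using the transcript_id from the webhook body
        result = requests.get(
            f"https://api.assemblyai.com/v2/transcript/{payload['transcript_id']}",
            headers={"authorization": API_TOKEN},
        ).json()
        print(result["text"])
    return "", 200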

Note

Static IP addresses are used for added security, so that you can verify webhook requests really were sent from AssemblyAI. This feature lets you optionally whitelist the IP address for incoming webhooks from our servers, giving you the ability to validate the source of all incoming webhook requests.

Including a Custom Header for Authentication

A Custom Header can be used for added security to authenticate webhook requests from AssemblyAI. With this optional feature, you provide a header name and value that AssemblyAI will attach to the webhook request it sends back, giving you another way to validate incoming webhook requests.

To use a Custom Header, include two additional parameters in your POST request: webhook_auth_header_name and webhook_auth_header_value. The webhook_auth_header_name parameter accepts a string containing the name of the header to insert into the webhook request, and webhook_auth_header_value accepts a string with that header's value. See the code example below for how to include a Custom Header in your request.
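
For example, in Python (the URLs and token are placeholders):

import requests

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    json={
        "audio_url": "https://example.com/audio/my-file.mp3",
        "webhook_url": "https://foo.com/webhook",
        "webhook_auth_header_name": "Authorization",
        "webhook_auth_header_value": "Bearer foobar",
    },
    headers={"authorization": "YOUR_API_TOKEN"},
)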

Here is what the headers and body of the webhook request will look like if you include a Custom Header with a webhook_auth_header_name of "Authorization" and a webhook_auth_header_value of "Bearer foobar":

headers
---
content-type: application/json
content-length: 82
authorization: Bearer foobar
accept: */*
accept-encoding: gzip, deflate
user-agent: python-requests/2.25.1

request body
---
{
  "transcript_id": "5552493-16d8-42d8-8feb-c2a16b56f6e8",
  "status": "completed"
}

Including Custom Parameters in Your Webhook Request

Oftentimes you'll want to associate certain metadata with your transcription request, such as a customer ID, and have it passed back to your webhook. The easiest way to do this is to include these values as query parameters in your webhook URL, for example:

https://foo.com/webhook?myParam1=foo&myParam2=bar

Then, when you receive the webhook, you can parse these parameters out of the webhook URL.
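
For instance, with Python's standard library (the parameter names come from the hypothetical URL above):

from urllib.parse import urlparse, parse_qs

# The webhook URL that received the POST request
url = "https://foo.com/webhook?myParam1=foo&myParam2=bar"

params = parse_qs(urlparse(url).query)
print(params["myParam1"][0])  # "foo"
print(params["myParam2"][0])  # "bar"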

Failed webhooks and retries

If we get a non-2xx response when we POST to your webhook URL, we'll retry the request 10 times, with a 10-second interval between retries. After all 10 retries fail, we'll consider the webhook permanently failed.

If we are unable to reach your webhook URL (usually caused by a timeout, or your server being offline), no retries will be attempted.

# Real-Time Streaming Transcription

If you're working with live audio streams, you can stream your audio data in real-time to our secure WebSocket API found at wss://api.assemblyai.com/v2/realtime/ws. We will stream transcripts back to you within a few hundred milliseconds, and additionally, revise these transcripts with more accuracy over time as more context arrives.

Open Source Example Code

Here are some open-source examples of our real-time endpoint to help you get started.

Establishing a WebSocket Connection

Websocat is a CLI for testing out WebSocket APIs; we'll use it in our examples. You can find more info on Websocat here.

To connect with the real-time endpoint, you must use a WebSocket client and establish a connection with wss://api.assemblyai.com/v2/realtime/ws.

Authentication

Authentication is handled via the authorization header. The value of this header should be your API token. For example, in websocat:

$ websocat wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000 -H Authorization:<API_TOKEN>
{
    "message_type": "SessionBegins",
    "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd",
    "expires_at": "2021-04-07T11:32:25.300329"
}

If you would like to create a temporary token for in-browser authentication, see Creating Temporary Authentication Tokens below.

Required Query Params

This endpoint also requires a sample_rate query param that defines the sample rate of your audio data; you can see it included in the websocat example above.

Session Descriptor Message

Once your request is authorized and connection established, your client will receive a "SessionBegins" message with the following JSON data:

| Parameter | Example | Info |
| --- | --- | --- |
| message_type | SessionBegins | Describes the message type. |
| session_id | d3e8c537-2f11-494b-b497-e59a434588bd | Unique identifier for the established session. Can be used to reestablish the session. |
| expires_at | 2021-04-07T11:32:25.300329 | Timestamp when this session will expire. |

Sending Audio

Input Message

When sending audio over the WebSocket connection, you should send a JSON payload with the following parameters.

| Parameter | Example | Info |
| --- | --- | --- |
| audio_data | UklGRtjIAABXQVZFZ… | Raw audio data, base64 encoded. This can be raw data recorded directly from a microphone or read from an audio file. |

base64 encoding

base64 encoding is a simple way to encode your raw audio data so that it can be included as a JSON parameter in your websocket message. Most programming languages have very simple built-in functions for encoding binary data to base64.

For example, a message payload would look like this:

{
  "audio_data": "UklGRtjIAABXQVZFZ..."
}
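
As a sketch, assuming an open connection object from the Python websockets library, a chunk of raw PCM audio could be encoded and sent like this:

import base64
import json

async def send_audio(websocket, raw_pcm_bytes):
    # base64-encode the raw audio so it fits in a JSON string field
    payload = {"audio_data": base64.b64encode(raw_pcm_bytes).decode("utf-8")}
    await websocket.send(json.dumps(payload))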

Audio Requirements

The raw audio data in the audio_data field above must comply with a strict encoding format, because we don't transcode your data; we send it directly to the model for transcription to reduce latency. The encoding of your audio must be in:

  • 16-bit Signed Integer PCM encoding
  • A sample rate that matches the value of the sample_rate query param you supply
  • 16-bit Precision
  • Single-channel
  • 100 to 2000 milliseconds of audio per message (see the sizing sketch below)
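
For example, with 16-bit (2-byte) samples at a 16000 Hz sample rate, a 250 ms message works out to 16000 × 2 × 0.25 = 8000 bytes. A sketch of slicing a mono PCM buffer into message-sized pieces:

SAMPLE_RATE = 16000   # must match the sample_rate query param
BYTES_PER_SAMPLE = 2  # 16-bit signed integer PCM
CHUNK_MS = 250        # anywhere in the allowed 100-2000 ms range

# 16000 samples/s * 2 bytes/sample * 0.25 s = 8000 bytes per message
chunk_size = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

def iter_chunks(pcm_bytes):
    for i in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[i : i + chunk_size]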

Transcription Response Types

Our real-time transcription pipeline uses a two-phase transcription strategy, broken into partial and final results.

Partial Results

As you send audio data to the API, the API will immediately start responding with Partial Results. The following keys will be in the JSON response from the WebSocket API.

| Parameter | Example | Info |
| --- | --- | --- |
| message_type | PartialTranscript | Describes the type of message. |
| session_id | "5551722-f677-48a6-9287-39c0aafd9ac1" | The unique id of your transcription. |
| audio_start | 1200 | Start time of the audio sample relative to the session start, in milliseconds. |
| audio_end | 1850 | End time of the audio sample relative to the session start, in milliseconds. |
| confidence | 0.956 | The confidence score of the entire transcription, between 0 and 1. |
| text | "You know Demons on TV like..." | The complete transcription for your audio. |
| words | [{"confidence": 1.0, "end": 440, "start": 0, "text": "You"}, ...] | An array of objects with the information for each word in the transcription text, including the start/end time (in milliseconds) and the confidence score of each word. |
| created | "2019-06-27 22:26:47.048512" | The timestamp for your request. |

Final Results

After you've received your partial results, our model will continue to analyze incoming audio and, when it detects the end of an "utterance" (usually a pause in speech), it will finalize the results sent to you so far with higher accuracy, as well as add punctuation and casing to the transcription text.

The following keys will be in the JSON response from the WebSocket API when Final Results are sent:

| Parameter | Example | Info |
| --- | --- | --- |
| message_type | FinalTranscript | Describes the type of message. |
| session_id | "5551722-f677-48a6-9287-39c0aafd9ac1" | The unique id of your transcription. |
| audio_start | 1200 | Start time of the audio sample relative to the session start, in milliseconds. |
| audio_end | 1850 | End time of the audio sample relative to the session start, in milliseconds. |
| confidence | 0.956 | The confidence score of the entire transcription, between 0 and 1. |
| text | "You know Demons on TV like..." | The complete transcription for your audio. |
| words | [{"confidence": 1.0, "end": 440, "start": 0, "text": "You"}, ...] | An array of objects with the information for each word in the transcription text, including the start/end time (in milliseconds) and the confidence score of each word. |
| created | "2019-06-27 22:26:47.048512" | The timestamp for your request. |

Ending a Session

When you've completed your session, send a JSON message with the following field.

| Parameter | Example | Info |
| --- | --- | --- |
| terminate_session | true | A boolean value to communicate that you wish to end your real-time session forever. |

This JSON message can be sent to the websocket as shown in this Python example:

import json

# Create the data to send
data = {'terminate_session': True}

# Convert the data to a JSON string
json_data = json.dumps(data)

# Send the data through the open WebSocket connection
# (here, `websocket` is a connection object from the websockets library)
await websocket.send(json_data)

If you have outstanding final transcripts, they will be sent to you, followed by a SessionTerminated message confirming that our API has terminated your session. A terminated session cannot be reused.

$ websocat wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000 -H authorization:<API_TOKEN>
{
    "message_type": "SessionBegins",
    "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"
}
...send audio...
...receive results...
{"message_type": "SessionTerminated"}
{"message_type": "FinalTranscript", ...}
{"message_type": "SessionTerminated", "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd"}

Closing and Status Codes

The WebSocket specification provides standard close codes, which you can find in the specification itself. On top of those, our API provides application-level WebSocket errors for well-known scenarios:

| Error Condition | Status Code | Message |
| --- | --- | --- |
| bad sample rate | 4000 | "Sample rate must be a positive integer" |
| auth failed | 4001 | "Not Authorized" |
| insufficient funds | 4002 | "Insufficient Funds" |
| free tier user | 4002 | "This feature is paid-only and requires you to add a credit card. Please visit https://app.assemblyai.com/ to add a credit card to your account" |
| attempt to connect to nonexistent session id | 4004 | "Session not found" |
| session expired | 4008 | "Session Expired" |
| attempt to connect to closed session | 4010 | "Session previously closed" |
| rate limited | 4029 | "Client sent audio too fast" |
| unique session violation | 4030 | "Session is handled by another websocket" |
| session times out | 4031 | "Session idle for too long" |
| audio too short | 4032 | "Audio duration is too short" |
| audio too long | 4033 | "Audio duration is too long" |
| bad json | 4100 | "Endpoint received invalid JSON" |
| bad schema | 4101 | "Endpoint received a message with an invalid schema" |
| too many streams | 4102 | "This account has exceeded the number of allowed streams" |
| reconnected | 4103 | "This session has been reconnected. This websocket is no longer valid." |
| reconnect attempts exhausted | 1013 | "Temporary server condition forced blocking client's request" |

Quotas and Limits

The following limits are imposed to ensure performance and service quality. Please contact us if you'd like to increase these limits.

  • Idle Sessions - Sessions that do not receive audio within 1 minute will be terminated
  • Session Limit - 32 sessions at a time for paid users. Free-tier users must upgrade their account to use real-time streaming.
  • Session Uniqueness - Only one WebSocket per session
  • Audio Sampling Rate Limit - Customers must send data in near real-time. If a client sends data faster than 1 second of audio per second for longer than 1 minute, we will terminate the session.

Adding Custom Vocabulary

Developers can also add up to 2500 characters of custom vocabulary to their real-time session via the optional word_boost query parameter in the URL. The parameter should map to a JSON-encoded list of strings, as shown in this Python example:

import json
from urllib.parse import urlencode

sample_rate = 16000
word_boost = ["foo", "bar"]
params = {"sample_rate": sample_rate, "word_boost": json.dumps(word_boost)}

url = f"wss://api.assemblyai.com/v2/realtime/ws?{urlencode(params)}"

Creating Temporary Authentication Tokens

In some cases, a developer will need to authenticate on the client side and won't want to expose their AssemblyAI token. You can create a temporary token by sending a POST request to https://api.assemblyai.com/v2/realtime/token with the parameter expires_in: {TTL in seconds}. Below is a quick example in curl.

The `expires_in` parameter must be greater than or equal to 60 seconds.

curl --request POST \
  --url https://api.assemblyai.com/v2/realtime/token \
  --header 'authorization: YOUR_AAI_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{"expires_in": 60}'

In response, you will receive the following JSON output:

{
  "token": "b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd"
}

A developer can now use this temporary token in the browser to authenticate a new WebSocket session with the endpoint wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={New Temp Token}. For example, in browser JavaScript:

let socket;
const token =
  "b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd";

socket = new WebSocket(
  `wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token=${token}`
);