Our Speech Recognition model enables you to convert spoken words into written text.
In the Transcribing an audio file guide, the client initiates a transcription request and periodically checks for the completed result. By polling the API at regular intervals, the client can track the transcription progress and receive the final output as soon as it's ready, ensuring a seamless and efficient transcription experience.
You can explore the full JSON response here:
Understanding the response
The JSON object above contains all information about the transcription. Depending on which Models are used to analyze the audio, the attributes of this object will vary. For example, in the quickstart above we did not enable Summarization, which is reflected by the `summarization: false` key-value pair in the JSON above. Had we activated Summarization, then the summary-related keys such as `summary_model` would contain the file summary (and additional details) rather than their current `null` values.
In our example above, we performed a simple transcription, the results of which we access through the `text` key.
The reference table below lists all relevant attributes along with their descriptions, where we've called the JSON response object `results`. Object attributes are accessed via dot notation, and arbitrary array elements are denoted with `[i]`. For example, `results.words[i].text` refers to the `text` attribute of the i-th element of the `words` array in the JSON `results` object.
| Attribute | Type | Description |
| --- | --- | --- |
| `results.text` | string | The transcript of the audio file. |
| `results.words` | array | An array containing information about each word. |
| `results.words[i].text` | string | The text of the i-th word in the transcript. |
| `results.words[i].start` | number | The start of when this word is spoken in the audio file, in milliseconds. |
| `results.words[i].end` | number | The end of when this word is spoken in the audio file, in milliseconds. |
| `results.words[i].confidence` | number | The confidence score for the transcript of the i-th word. |
| `results.words[i].speaker` | string | If Speaker Diarization is enabled, the speaker who uttered the i-th word. |
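To illustrate the dot-notation convention above, here's a minimal Python sketch that reads these attributes from a parsed response. The sample data is fabricated for demonstration; only the field names come from the reference table.

```python
# Sample parsed JSON response (fabricated for illustration); the field
# names follow the reference table above.
results = {
    "text": "I ate 10 hamburgers at Burger King.",
    "words": [
        {"text": "I", "start": 0, "end": 120, "confidence": 0.99, "speaker": None},
        {"text": "ate", "start": 130, "end": 400, "confidence": 0.97, "speaker": None},
    ],
}

# results.text -> full transcript; results.words[i].text -> i-th word
transcript = results["text"]
first_word = results["words"][0]["text"]
first_confidence = results["words"][0]["confidence"]
```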
Automatic punctuation and casing
By default, the AssemblyAI API automatically punctuates the transcription text, formats proper nouns, and converts numbers to their numerical form. For example, the text "i ate ten hamburgers at burger king" would be transcribed as "I ate 10 hamburgers at Burger King".
If you prefer to disable these features, you can include the optional `punctuate` and `format_text` parameters in your request. Setting either parameter to `false` will disable the corresponding feature.
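As a sketch, a request that disables both features might look like the following. The endpoint and parameter names follow the AssemblyAI v2 API, but the audio URL and API key are placeholders.

```python
import json

# Placeholder values; substitute your own audio URL and API key.
payload = {
    "audio_url": "https://example.com/audio.mp3",
    "punctuate": False,    # disable automatic punctuation and casing
    "format_text": False,  # disable proper-noun and number formatting
}

# A real submission would POST this payload, e.g. with the requests library:
# requests.post("https://api.assemblyai.com/v2/transcript",
#               json=payload, headers={"authorization": "YOUR_API_KEY"})
body = json.dumps(payload)
```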
Automatic language detection
With automatic language detection, the AssemblyAI API can identify the dominant language spoken in an audio file and automatically route it to the appropriate model for that language.
Keep in mind that the model needs approximately 50 seconds of spoken audio over the course of the file to reliably identify the dominant language. Once the transcription is completed, you can find the detected language in the response by looking at the value of the `language_code` key.
Automatic language detection is currently supported in English, Spanish, French, German, Italian, Portuguese, and Dutch.
To enable it, include the `language_detection` parameter with a value of `true` in your request when submitting a file for processing.
In addition, the `language_code` key can be used to specify the language of the speech in your audio file.
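A small sketch of both options follows. The parameter names match the AssemblyAI v2 API; the helper function itself is illustrative and not part of the API.

```python
def language_options(language_code=None):
    # Hypothetical helper: let the API detect the language automatically,
    # or pin it explicitly with a language_code such as "es".
    if language_code is None:
        return {"language_detection": True}
    return {"language_code": language_code}

auto = {"audio_url": "https://example.com/audio.mp3", **language_options()}
spanish = {"audio_url": "https://example.com/audio.mp3", **language_options("es")}
```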
Custom Spelling

The Custom Spelling feature allows you to customize how words are spelled or formatted in the transcription text. This can be useful for correcting common misspellings or formatting inconsistencies. For instance, it could be used to change "Hans Zimer" to "Hans Zimmer", or "k8s" to "Kubernetes".
To use Custom Spelling, include the `custom_spelling` parameter in your API request. The parameter should be an array of objects, with each object specifying a mapping from a word or phrase to a new spelling or format. Each object should include a `from` key, which specifies the word or phrase to be replaced, and a `to` key, which specifies the new spelling or format.
Note that the value in the `to` key is case sensitive, but the value in the `from` key is not. Additionally, the `to` key should only contain one word, while the `from` key can contain multiple words.
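The mapping semantics can be sketched locally like this. The replacement itself happens server-side; `apply_custom_spelling` is just an illustrative stand-in that mimics the case rules described above.

```python
import re

# Each rule maps one or more "from" variants (matched case-insensitively)
# to a single, case-sensitive "to" value -- the same shape the
# custom_spelling parameter expects.
rules = [
    {"from": ["k8s"], "to": "Kubernetes"},
    {"from": ["Hans Zimer"], "to": "Hans Zimmer"},
]

def apply_custom_spelling(text, rules):
    # Local sketch only: the real substitution is done by the API.
    for rule in rules:
        for source in rule["from"]:
            text = re.sub(re.escape(source), rule["to"], text, flags=re.IGNORECASE)
    return text

corrected = apply_custom_spelling("hans zimer runs K8S", rules)
```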
Custom vocabulary

Including the optional `word_boost` parameter in your API request is an easy way to improve transcription accuracy when you know certain words or phrases appear frequently in your audio file.
You can also include the optional `boost_param` parameter in your API request to control how much weight should be applied to your keywords/phrases. This value can be either low, default, or high.
It's important to follow the formatting guidelines for custom vocabulary to ensure the best results. Remove all punctuation except apostrophes, and make sure each word is in its spoken form (e.g. iphone seven instead of iphone 7). Acronyms should have no spaces between letters. Additionally, the model still accepts words with unique characters such as é, but converts them to their ASCII equivalents.
There are some limitations to this parameter. You can pass a maximum of 1,000 unique keywords/phrases in your list, and each of them can contain up to 6 words.
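A sketch that enforces these limits before submitting is shown below. The validation helper is hypothetical; only `word_boost` and `boost_param` are actual request parameters.

```python
def build_word_boost(keywords, boost_param="default"):
    # Hypothetical client-side check of the documented limits:
    # at most 1,000 unique entries, each at most 6 words.
    unique = list(dict.fromkeys(keywords))  # dedupe, preserving order
    if len(unique) > 1000:
        raise ValueError("word_boost accepts at most 1,000 unique entries")
    for phrase in unique:
        if len(phrase.split()) > 6:
            raise ValueError(f"phrase exceeds 6 words: {phrase!r}")
    return {"word_boost": unique, "boost_param": boost_param}

params = build_word_boost(["aws", "azure", "aws"], boost_param="high")
```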
Dual channel transcription
If you have a dual channel audio file with multiple speakers, the AssemblyAI API supports transcribing each of them separately. This can be useful for phone call recordings or any other audio file with distinct channels.
To enable it, include the `dual_channel` parameter in your request when submitting a file for transcription and set it to `true`. Keep in mind that dual-channel transcription takes approximately 25% longer to complete than a normal transcription.
Once your transcription is complete, the API's response includes an additional `utterances` key containing a list of turn-by-turn utterances, identified by audio channel. Each object in the list contains channel information (either "1" or "2") so you can tell which channel each utterance is from. Additionally, each word in the `words` array contains the channel identifier.
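For example, here's a sketch that groups utterances by channel in a response shaped as described above. The sample data is fabricated.

```python
# Fabricated dual-channel response fragment.
response = {
    "utterances": [
        {"channel": "1", "text": "Thanks for calling, how can I help?"},
        {"channel": "2", "text": "Hi, I have a question about my bill."},
        {"channel": "1", "text": "Sure, let me pull up your account."},
    ]
}

# Collect each channel's utterances in order.
by_channel = {}
for utterance in response["utterances"]:
    by_channel.setdefault(utterance["channel"], []).append(utterance["text"])
```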
Export SRT or VTT caption files
You can use the AssemblyAI API to export your completed transcripts in SRT or VTT format, which can be used for subtitles and closed captions in videos. Once your transcript status shows as completed, you can make a request to the appropriate endpoint to export your transcript in VTT or SRT format.
You can also customize the maximum number of characters per caption using the `chars_per_caption` URL parameter in your API requests to either the SRT or VTT endpoint. For example, adding `?chars_per_caption=32` to the SRT endpoint URL ensures that each caption has no more than 32 characters.
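A sketch of building the export URL follows. The `/srt` and `/vtt` endpoint paths follow the AssemblyAI v2 pattern; the helper function itself is illustrative.

```python
def caption_export_url(transcript_id, fmt="srt", chars_per_caption=None):
    # fmt is "srt" or "vtt"; chars_per_caption is the optional URL parameter.
    if fmt not in ("srt", "vtt"):
        raise ValueError("fmt must be 'srt' or 'vtt'")
    url = f"https://api.assemblyai.com/v2/transcript/{transcript_id}/{fmt}"
    if chars_per_caption is not None:
        url += f"?chars_per_caption={chars_per_caption}"
    return url

url = caption_export_url("abc123", fmt="srt", chars_per_caption=32)
```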
Exporting paragraphs and sentences
AssemblyAI provides two endpoints for retrieving transcripts that are automatically segmented into paragraphs or sentences for a more reader-friendly experience. These endpoints return the text of the transcript broken down by either paragraphs or sentences, along with additional metadata such as the start and end times of each segment, confidence scores, and more.
The response is a JSON object containing an array of objects, each representing a sentence or a paragraph in the transcript. Each object contains a `text` key with the text, `start` and `end` keys with the start and end times of the segment in milliseconds, a `confidence` score, and a `words` array, with each element representing a word in the sentence or paragraph.
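For instance, here's a sketch computing each segment's duration from a response shaped as described above. The sample data is fabricated.

```python
# Fabricated response fragment from the sentences endpoint.
response = {
    "sentences": [
        {"text": "Hello there.", "start": 0, "end": 1200, "confidence": 0.97,
         "words": [{"text": "Hello"}, {"text": "there."}]},
        {"text": "How are you?", "start": 1300, "end": 2600, "confidence": 0.95,
         "words": [{"text": "How"}, {"text": "are"}, {"text": "you?"}]},
    ]
}

# Duration of each sentence, in milliseconds.
durations_ms = [s["end"] - s["start"] for s in response["sentences"]]
```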
Filler words

Filler words, such as "um" and "uh", are commonly used in everyday speech. By default, the AssemblyAI API removes these words from transcripts to create a cleaner output. However, if you wish to keep them in your transcript, you can set the `disfluencies` parameter to `true` in your request when submitting files for processing.
The filler words the API transcribes are: "um", "uh", "hmm", "mhm", and "uh huh".
Profanity filtering

Profanity can be a concern for certain use cases, and the AssemblyAI API allows you to automatically filter it out of transcripts. By default, the API provides a verbatim transcription, including any swear words spoken in the audio. However, you can enable filtering by including the `filter_profanity` parameter in your request.
After your transcription is completed, you'll receive a response from the API as usual, but any profanity in the text is replaced with asterisks. Note that profanity filtering isn't perfect, and certain words may still be missed or improperly filtered.
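As a sketch, enabling the filter is a boolean request parameter like the others (placeholder URL; parameter name per the AssemblyAI v2 API). The `mask` helper below is only a local illustration of the asterisk replacement described above; the real filtering happens server-side.

```python
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder
    "filter_profanity": True,  # replace profanity with asterisks
}

def mask(word):
    # Local illustration of the effect: a filtered word comes back
    # as asterisks in the transcript text.
    return "*" * len(word)

masked = mask("darn")
```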
Specifying when to start and end the transcript
If you have a long audio file and only want to transcribe a portion of it, you can set the `audio_start_from` parameter to the time, in milliseconds, at which you want the transcription to start, and the `audio_end_at` parameter to the time at which you want it to end.
These parameters are optional. If you don't include them, the API transcribes the entire audio file from beginning to end.
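For example, transcribing only the window from 1:00 to 2:30 could be sketched as follows (placeholder URL; the `ms` helper is just for readability).

```python
def ms(minutes, seconds=0):
    # Small helper to express timestamps in milliseconds.
    return (minutes * 60 + seconds) * 1000

payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder
    "audio_start_from": ms(1),   # start at 1:00
    "audio_end_at": ms(2, 30),   # end at 2:30
}
```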
Speech threshold

Set the optional `speech_threshold` parameter to only transcribe files that contain at least a specified percentage of spoken audio, represented as a ratio in the range [0, 1]. If the percentage of speech in the audio file doesn't meet or surpass the provided threshold, then the value of `text` is `None` and the transcription results in an error:
Audio speech threshold 0.9461 is below the requested speech threshold value 1.0
Understanding the status of your transcript
When working with the AssemblyAI API, it's important to understand the different statuses that a transcription can have. They let you know whether the job is processing, queued, completed, or has encountered an error.
| Status | Description |
| --- | --- |
| processing | The audio file is being processed by the API. |
| queued | The audio file is in the queue to be processed by the API. |
| completed | The transcription job has been completed successfully. |
| error | An error occurred while processing the audio file. |
Transcription jobs can fail for various reasons, such as an unsupported audio file format, a file with no audio data, or an unreachable URL. Whenever a transcription job fails, the status of the transcription is `error`, and an `error` key is included in the response from the API when fetching the transcription.
If a transcription job fails due to an error on the API side (a server error), it's recommended to resubmit the file for transcription. When you resubmit the file, usually a different server in the API cluster can process your audio file successfully.
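A polling sketch over these statuses is shown below. Here `fetch_transcript` is a stub standing in for a real GET to the transcript endpoint, so the example is self-contained and runnable.

```python
import time

def fetch_transcript(transcript_id, _state={"calls": 0}):
    # Stub: pretends the job finishes on the third poll. A real client
    # would GET https://api.assemblyai.com/v2/transcript/{transcript_id}.
    _state["calls"] += 1
    done = _state["calls"] >= 3
    return {
        "id": transcript_id,
        "status": "completed" if done else "processing",
        "text": "Done." if done else None,
    }

def wait_for_transcript(transcript_id, interval_s=0.01):
    while True:
        transcript = fetch_transcript(transcript_id)
        if transcript["status"] == "completed":
            return transcript
        if transcript["status"] == "error":
            # Server-side failure: consider resubmitting the file.
            raise RuntimeError(transcript.get("error", "transcription failed"))
        time.sleep(interval_s)  # poll at a modest interval

result = wait_for_transcript("abc123")
```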
Word search

The AssemblyAI API allows you to search through a completed transcript for a specific set of keywords, which is useful for quickly finding relevant information. You can search for individual words, numbers, or phrases of up to five words.
The request returns a response with the following keys.

| Key | Description |
| --- | --- |
| `id` | The ID of the transcript. |
| `total_count` | The total number of all matched instances. For example, if "word 1" matched 2 times and "word 2" matched 3 times, the value equals 5. |
| `matches` | Contains an array of all matched words and associated data. |
| `matches[i].text` | The word itself. |
| `matches[i].count` | The total number of times the word appears in the transcript. |
| `matches[i].timestamps` | An array of timestamps for each matched instance. |
| `matches[i].indexes` | An array of all index locations for that word within the words array of the completed transcript. |
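Here's a parsing sketch over a response shaped like the keys above. The sample data is fabricated.

```python
# Fabricated word-search response fragment.
response = {
    "id": "abc123",
    "total_count": 5,
    "matches": [
        {"text": "hello", "count": 2,
         "timestamps": [[100, 480], [9000, 9400]], "indexes": [0, 24]},
        {"text": "world", "count": 3,
         "timestamps": [[600, 950], [1200, 1550], [9500, 9900]],
         "indexes": [1, 5, 25]},
    ],
}

# total_count should equal the sum of the per-word counts.
summed = sum(match["count"] for match in response["matches"])
first_hit_ms = response["matches"][0]["timestamps"][0][0]
```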
Frequently asked questions

Can I boost phrases as well as individual words?

You can include words, phrases, or both in the `word_boost` parameter. Any term included has its likelihood of being transcribed boosted.
Can Custom Spelling change the formatting of words as well as their spelling?

Yes. The Custom Spelling feature gives you the ability to specify how words are spelled or formatted in the transcript text. For example, Custom Spelling could be used to change the spelling of all instances of the word "Ariana" to "Arianna". It could also be used to change the formatting of "CS 50" to "CS50".
Why am I receiving a "400 Bad Request" error when making an API request?

A "400 Bad Request" error typically indicates that there's a problem with the formatting or content of the API request. Double-check the syntax of your request and ensure that all required parameters are included as described in the API Reference. If the issue persists, contact our support team for assistance.