
Speech Recognition

Our Speech Recognition model enables you to convert spoken words into written text.


In the Transcribing an audio file guide, the client initiates a transcription request and periodically checks for the completed result. By polling the API at regular intervals, the client can track the transcription progress and receive the final output as soon as it's ready, ensuring a seamless and efficient transcription experience.
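That submit-and-poll flow can be sketched in Python. This is a minimal, stdlib-only sketch, not the official client: the v2 endpoint paths follow this guide, the polling interval is arbitrary, and the API key and audio URL are placeholders.

```python
import json
import time
import urllib.request

BASE = "https://api.assemblyai.com/v2"

def is_terminal(status):
    # polling can stop once the job has either succeeded or failed
    return status in ("completed", "error")

def _call(url, api_key, payload=None):
    # tiny JSON helper over urllib; a real client might use requests
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url, data=data,
        headers={"authorization": api_key,
                 "content-type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def transcribe(audio_url, api_key, interval=3.0):
    """Submit a transcription job, then poll until it finishes."""
    job = _call(f"{BASE}/transcript", api_key, {"audio_url": audio_url})
    while not is_terminal(job["status"]):   # "queued" or "processing"
        time.sleep(interval)
        job = _call(f"{BASE}/transcript/{job['id']}", api_key)
    return job
```

Calling `transcribe("https://example.com/audio.mp3", "YOUR_API_KEY")` blocks until the job reaches a terminal status and returns the full JSON response.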




Understanding the response

The JSON object above contains all information about the transcription. Depending on which models are used to analyze the audio, the attributes of this object will vary. For example, in the quickstart above we did not enable Summarization, which is reflected by the summarization: false key-value pair in the JSON above. Had we enabled Summarization, the summary, summary_type, and summary_model keys would contain the file summary (and additional details) rather than their current null values.

In our example above, we performed a simple transcription, the results of which we access through the text and words keys.

The reference table below lists all relevant attributes along with their descriptions, where we've called the JSON response object results. Object attributes are accessed via dot notation, and arbitrary array elements are denoted with [i]. For example, results.words[i].text refers to the text attribute of the i-th element of the words array in the JSON results object.

results.text | string | The transcript of the audio file
results.words | array | An array containing information about each word
results.words[i].text | string | The text of the i-th word in the transcript
results.words[i].start | number | The start time of when this word is spoken in the audio file, in milliseconds
results.words[i].end | number | The end time of when this word is spoken in the audio file, in milliseconds
results.words[i].confidence | number | The confidence score for the transcript of the i-th word
results.words[i].speaker | string | If Speaker Diarization is enabled, the speaker who uttered the i-th word
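Using the attributes above, a completed response can be walked like this. The sample object is illustrative and truncated, not real API output:

```python
# an illustrative, truncated transcript response (not real API output)
results = {
    "text": "Hello world.",
    "words": [
        {"text": "Hello", "start": 440, "end": 770, "confidence": 0.99},
        {"text": "world.", "start": 790, "end": 1120, "confidence": 0.98},
    ],
}

print(results["text"])                      # the full transcript
for word in results["words"]:
    # start/end are offsets into the audio file, in milliseconds
    print(word["text"], word["start"], word["end"], word["confidence"])
```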

Automatic punctuation and casing

By default, the AssemblyAI API automatically punctuates the transcription text, formats proper nouns, and converts numbers to their numeric form. For example, the text "i ate ten hamburgers at burger king" would be transcribed as "I ate 10 hamburgers at Burger King".

If you prefer to disable these features, you can include the optional punctuate and/or format_text parameters in your request. Setting these parameters to false will disable the corresponding feature.
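As a sketch, a request body with both features disabled might look like the following (the helper function and audio URL are illustrative; the parameter names are from this guide):

```python
def build_request(audio_url, punctuate=True, format_text=True):
    # both features are enabled by default; pass False to turn them off
    return {
        "audio_url": audio_url,
        "punctuate": punctuate,
        "format_text": format_text,
    }

# raw, verbatim output: no punctuation, casing, or number formatting
body = build_request("https://example.com/meeting.mp3",
                     punctuate=False, format_text=False)
```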

Automatic language detection

With automatic language detection, the AssemblyAI API can identify the dominant language spoken in an audio file and automatically route it to the appropriate model for that language.

Keep in mind that the model needs approximately 50 seconds of spoken audio across the file to reliably identify the dominant language. Once the transcription is complete, you can find the detected language in the response under the language_code key.

Automatic language detection is currently supported in English, Spanish, French, German, Italian, Portuguese, and Dutch.

To enable it, include the language_detection parameter with a value of true in your request when submitting a file for processing.

In addition, the language_code key can be used to specify the language of the speech in your audio file.
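The two options are alternatives: either let the API detect the language or pin it yourself. A hedged sketch of building either request body (helper and URLs are illustrative):

```python
def build_request(audio_url, language_code=None, language_detection=False):
    body = {"audio_url": audio_url}
    if language_detection:
        # let the API identify the dominant language on its own
        body["language_detection"] = True
    elif language_code:
        # or pin the language explicitly, e.g. "es" for Spanish
        body["language_code"] = language_code
    return body

auto = build_request("https://example.com/interview.mp3",
                     language_detection=True)
fixed = build_request("https://example.com/interview.mp3",
                      language_code="es")
```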

Custom spelling

The Custom Spelling feature allows you to customize how words are spelled or formatted in the transcription text. This can be useful for correcting common misspellings or formatting inconsistencies. For instance, it could be used to change "Hans Zimer" to "Hans Zimmer", or "k8s" to "Kubernetes".

To use Custom Spelling, include the custom_spelling parameter in your API request. The parameter should be an array of objects, with each object specifying a mapping from a word or phrase to a new spelling or format. Each object should include a from key, which specifies the word or phrase to be replaced, and a to key, which specifies the new spelling or format.

Note that the value in the to key is case sensitive, but the value in the from key is not. Additionally, the to key should only contain one word, while the from key can contain multiple words.
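Putting those rules together, a request body using Custom Spelling might look like this (the URL and word mappings are illustrative examples, not required values):

```python
body = {
    "audio_url": "https://example.com/devops-talk.mp3",
    "custom_spelling": [
        # "from" entries match case-insensitively and may list several
        # variants; "to" is case-sensitive and applied as written
        {"from": ["k8s"], "to": "Kubernetes"},
        {"from": ["assembly ai", "assemblyai"], "to": "AssemblyAI"},
    ],
}
```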

Custom vocabulary

Including the word_boost parameter in your API request is an easy way to improve transcription accuracy when you know certain words or phrases appear frequently in your audio file.

You can also include the optional boost_param parameter in your API request to control how much weight should be applied to your keywords/phrases. This value can be either low, default, or high.

It's important to follow formatting guidelines for custom vocabulary to ensure the best results. Remove all punctuation, except apostrophes, and make sure each word is in its spoken form (e.g. iphone seven instead of iphone 7). Acronyms should have no spaces between letters. Additionally, the model still accepts words with unique characters such as é, but converts them to their ASCII equivalent.

There are some limitations to the parameter. You can pass a maximum of 1,000 unique keywords/phrases in your list, and each of them can contain up to 6 words.
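Those limits can be checked client-side before submitting. The helper below is a hypothetical convenience, not part of the API; only the word_boost and boost_param keys come from this guide:

```python
def add_word_boost(body, terms, boost_param="default"):
    terms = list(dict.fromkeys(terms))      # drop duplicates, keep order
    if len(terms) > 1000:
        raise ValueError("word_boost accepts at most 1,000 unique terms")
    for term in terms:
        if len(term.split()) > 6:
            raise ValueError(f"term exceeds 6 words: {term!r}")
    body["word_boost"] = terms
    body["boost_param"] = boost_param       # "low", "default", or "high"
    return body

body = add_word_boost({"audio_url": "https://example.com/keynote.mp3"},
                      ["aws", "iphone seven", "aws"], boost_param="high")
```

Note that the terms follow the formatting guidelines above: lowercase, spoken form, no punctuation.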

Dual channel transcription

If you have a dual channel audio file with multiple speakers, the AssemblyAI API supports transcribing each of them separately. This can be useful for phone call recordings or any other audio file with distinct channels.

To enable it, include the dual_channel parameter in your request when submitting a file for transcription and set it to true. Keep in mind that it'll take approximately 25% longer to complete than normal transcriptions.

Once your transcription is complete, the API's response includes an additional utterances key, containing a list of turn-by-turn utterances, identified by each audio channel. Each object in the list contains channel information (either "1" or "2") so you can tell which channel each utterance is from. Additionally, each word in the words array contains the channel identifier.
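For example, the utterances list can be grouped by channel to reconstruct each side of a call. The response fragment below is illustrative, not real API output:

```python
# an illustrative fragment of a dual-channel response
response = {
    "utterances": [
        {"channel": "1", "text": "Thanks for calling, how can I help?"},
        {"channel": "2", "text": "Hi, I'd like to check an order."},
        {"channel": "1", "text": "Sure, one moment."},
    ],
}

by_channel = {}
for utt in response["utterances"]:
    # group the turn-by-turn utterances by their audio channel
    by_channel.setdefault(utt["channel"], []).append(utt["text"])
```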

Export SRT or VTT caption files

You can use the AssemblyAI API to export your completed transcripts in SRT or VTT format, which can be used for subtitles and closed captions in videos. Once your transcript status shows as completed, you can make a request to the appropriate endpoint to export your transcript in VTT or SRT format.

You can also customize the maximum number of characters per caption using the chars_per_caption URL parameter in your API requests to either the SRT or VTT endpoints. For example, adding ?chars_per_caption=32 to the SRT endpoint URL ensures that each caption has no more than 32 characters.
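A sketch of building those export URLs, assuming the endpoints hang off the transcript ID as /srt and /vtt (the helper itself is illustrative):

```python
def caption_url(transcript_id, fmt="srt", chars_per_caption=None):
    # fmt selects the export format: "srt" or "vtt"
    if fmt not in ("srt", "vtt"):
        raise ValueError("fmt must be 'srt' or 'vtt'")
    url = f"https://api.assemblyai.com/v2/transcript/{transcript_id}/{fmt}"
    if chars_per_caption is not None:
        url += f"?chars_per_caption={chars_per_caption}"
    return url

url = caption_url("abc123", "srt", chars_per_caption=32)
```

A GET request to the resulting URL (with your authorization header) returns the caption file as plain text.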

Exporting paragraphs and sentences

AssemblyAI provides two endpoints for retrieving transcripts that are automatically segmented into paragraphs or sentences for a more reader-friendly experience. These endpoints return the text of the transcript broken down by either paragraphs or sentences, along with additional metadata such as the start and end times of each segment, confidence scores, and more.

The response is a JSON object containing an array of objects, each representing a sentence or a paragraph in the transcript. Each object contains a text parameter with the text, start and end parameters with the start and end times of the sentence in milliseconds, a confidence score, and an array of word objects, each representing a word in the sentence or paragraph.
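For instance, the per-segment metadata makes it easy to compute each sentence's duration. The response fragment below is illustrative, not real API output:

```python
# an illustrative fragment of a sentences-endpoint response
response = {
    "sentences": [
        {"text": "Hello and welcome.", "start": 440, "end": 1900,
         "confidence": 0.97, "words": []},
        {"text": "Let's get started.", "start": 2000, "end": 3400,
         "confidence": 0.95, "words": []},
    ],
}

for s in response["sentences"]:
    duration_ms = s["end"] - s["start"]   # segment length in milliseconds
    print(f'{s["text"]} ({duration_ms} ms, confidence {s["confidence"]})')
```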

Filler words

Filler Words, such as "um" and "uh", are commonly used in everyday speech. By default, the AssemblyAI API removes these words from transcripts to create a cleaner output. However, if you wish to keep them in your transcript, you can set the disfluencies parameter to true in your request when submitting files for processing.

The Filler Words the API can transcribe are: "um", "uh", "hmm", "mhm", and "uh huh".
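Keeping filler words is a single request flag. A minimal sketch (the audio URL is a placeholder):

```python
body = {
    "audio_url": "https://example.com/podcast.mp3",
    # keep "um", "uh", and similar filler words in the transcript
    "disfluencies": True,
}
```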

Profanity filtering

Profanity can be a concern for certain use cases, and the AssemblyAI API allows you to automatically filter it out from the transcripts. By default, the API provides a verbatim transcription, including any swear word that was spoken in the audio. However, you can enable filtering by including the filter_profanity parameter in your request.

After your transcription is completed, you'll receive a response from the API as usual, but any profanity in the text is replaced with asterisks. Note that profanity filtering isn't perfect, and certain words may still be missed or improperly filtered.
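A sketch of enabling the filter and spotting masked words in the returned text (the sample transcript line is illustrative, not real API output):

```python
body = {"audio_url": "https://example.com/standup.mp3",
        "filter_profanity": True}

# in a filtered transcript, each profane word comes back as asterisks
# (illustrative text, not real API output):
text = "I can't believe that **** movie won."
masked = [w for w in text.split() if set(w.strip(".,!?")) == {"*"}]
```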

Specifying when to start and end the transcript

If you have a long audio file and only want to transcribe a portion of it, you can set the audio_start_from parameter to the time, in milliseconds, at which you want the transcription to start and the audio_end_at parameter to the time at which you want the transcription to end. These parameters are optional. If you don't include them, the API transcribes the entire audio file from beginning to end.
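Since both offsets are in milliseconds, clipping a window out of a long file looks like this (the helper and URL are illustrative):

```python
def clip_request(audio_url, start_ms=None, end_ms=None):
    # both parameters are optional; omitting them transcribes the whole file
    body = {"audio_url": audio_url}
    if start_ms is not None:
        body["audio_start_from"] = start_ms
    if end_ms is not None:
        body["audio_end_at"] = end_ms
    return body

# transcribe only the second minute of a long recording
body = clip_request("https://example.com/lecture.mp3", 60_000, 120_000)
```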

Speech Threshold

Set the optional speech_threshold parameter to only transcribe files that contain at least a specified percentage of spoken audio, represented as a ratio in the range [0, 1].

If the percentage of speech in the audio file doesn't meet or exceed the provided threshold, the transcript text is null and the response contains an error like the following:

Audio speech threshold 0.9461 is below the requested speech threshold value 1.0
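Setting the threshold is a single parameter in the request body; a sketch, with an illustrative URL and a 50% threshold:

```python
body = {
    "audio_url": "https://example.com/mostly-music.mp3",
    # only transcribe if at least half of the file is spoken audio
    "speech_threshold": 0.5,
}
assert 0.0 <= body["speech_threshold"] <= 1.0   # must be a ratio in [0, 1]
```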

Understanding the status of your transcript

When working with the AssemblyAI API, it's important to understand the different statuses that a transcription can have. They let you know whether the job is processing, queued, completed, or has encountered an error.

processing | The audio file is being processed by the API.
queued | The audio file is in the queue to be processed by the API.
completed | The transcription job has been completed successfully.
error | An error occurred while processing the audio file.

Handling errors

Transcription jobs can fail for various reasons, such as an unsupported audio file format, a file with no audio data, or an unreachable URL. Whenever a transcription job fails, the status of the transcription is error and an error key is included in the response from the API when fetching the transcription.

If a transcription job fails due to an error on the API side (a server error), it's recommended to resubmit the file for transcription. When you resubmit the file, usually a different server in the API cluster can process your audio file successfully.
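One way to decide whether a resubmission is worthwhile is to inspect the failed job. The heuristic below is hypothetical, and the "server error" message substring is an assumption, not a documented API contract:

```python
def should_resubmit(job):
    # a hypothetical heuristic: only retry jobs that failed, and only when
    # the error message suggests a transient, API-side (server) problem
    if job.get("status") != "error":
        return False
    return "server error" in job.get("error", "").lower()
```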

Word search

The AssemblyAI API allows you to search through a completed transcript for a specific set of keywords, which is useful for quickly finding relevant information. You can search for individual words, numbers, or phrases of up to five words.

The request returns a response with the following keys.

id | The ID of the transcript
total_count | The total number of matched instances. For example, if "word 1" matched 2 times and "word 2" matched 3 times, the value is 5.
matches | An array of all matched words and associated data.
matches.text | The matched word itself.
matches.count | The total number of times the word appears in the transcript.
matches.timestamps | An array of timestamps structured as [start_time, end_time].
matches.indexes | An array of all index locations for that word within the words array of the completed transcript.
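A sketch of building the search request URL, assuming the endpoint hangs off the transcript ID as /word-search with a comma-separated words query parameter (the helper itself is illustrative):

```python
from urllib.parse import quote

def word_search_url(transcript_id, words):
    # URL-encode each term, then join them into the ?words= list
    terms = ",".join(quote(w) for w in words)
    return (f"https://api.assemblyai.com/v2/transcript/"
            f"{transcript_id}/word-search?words={terms}")

url = word_search_url("abc123", ["order", "refund policy"])
```

A GET request to the resulting URL (with your authorization header) returns the response described above.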


How can I make certain words more likely to be transcribed?

You can include words, phrases, or both in the word_boost parameter. Any term included has its likelihood of being transcribed boosted.

Can I customize how words are spelled by the model?

Yes. The Custom Spelling feature gives you the ability to specify how words are spelled or formatted in the transcript text. For example, Custom Spelling could be used to change the spelling of all instances of the word "Ariana" to "Arianna". It could also be used to change the formatting of "CS 50" to "CS50".

Why am I receiving a "400 Bad Request" error when making an API request?

A "400 Bad Request" error typically indicates that there's a problem with the formatting or content of the API request. Double-check the syntax of your request and ensure that all required parameters are included as described in the API Reference. If the issue persists, contact our support team for assistance.