
Speech recognition

Our Speech recognition model enables you to convert spoken words into written text.

Quickstart

In the Transcribing an audio file guide, the client submits a transcription request and then polls the API at regular intervals. This lets the client track the transcription's progress and retrieve the final output as soon as it's ready.
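Here's a minimal sketch of that flow in Python, assuming the v2 REST endpoint; YOUR_API_KEY and the audio URL are placeholders to replace with your own values:

import time
import requests

base_url = "https://api.assemblyai.com/v2"
headers = {"authorization": "YOUR_API_KEY"}  # placeholder API key

# Submit the audio file for transcription.
response = requests.post(
    f"{base_url}/transcript",
    headers=headers,
    json={"audio_url": "https://example.com/audio.mp3"},  # placeholder URL
)
transcript_id = response.json()["id"]

# Poll at regular intervals until the job completes or errors out.
while True:
    transcript = requests.get(
        f"{base_url}/transcript/{transcript_id}", headers=headers
    ).json()
    if transcript["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(transcript.get("text") or transcript.get("error"))

The later sketches on this page reuse base_url, headers, transcript_id, and the completed transcript object from this snippet.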


Automatic punctuation and casing

By default, the AssemblyAI API will automatically punctuate the transcription text and format proper nouns, as well as convert written-out numbers to their numerical form. For example, the text "i ate ten hamburgers at burger king" would be transcribed as "I ate 10 hamburgers at Burger King".

If you prefer to disable these features, you can include the optional punctuate and/or format_text parameters in your request. Setting these parameters to false will disable the corresponding feature.
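For example, a request body disabling both features might look like this (a sketch reusing the quickstart setup):

payload = {
    "audio_url": "https://example.com/audio.mp3",
    "punctuate": False,    # disable automatic punctuation
    "format_text": False,  # disable proper-noun and number formatting
}
requests.post(f"{base_url}/transcript", headers=headers, json=payload)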

Automatic language detection

With automatic language detection, the AssemblyAI API can identify the dominant language spoken in an audio file and automatically route it to the appropriate model for that language.

Keep in mind that the model needs approximately 50 seconds of spoken audio, accumulated over the course of the file, to reliably identify the dominant language. Once the transcription is completed, you can find the detected language in the response by looking at the value of the language_code key.

Automatic language detection is currently supported in English, Spanish, French, German, Italian, Portuguese, and Dutch.

To enable it, include the language_detection parameter with a value of true in your request when submitting a file for processing.


Alternatively, if you already know the language spoken in your audio file, you can specify it directly with the language_code parameter.
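As a sketch, reusing the quickstart setup, a request enabling detection might look like this:

payload = {
    "audio_url": "https://example.com/audio.mp3",
    "language_detection": True,  # detect the dominant language
    # alternatively, if the language is already known:
    # "language_code": "es",
}
requests.post(f"{base_url}/transcript", headers=headers, json=payload)

# once the transcript is completed:
# print(transcript["language_code"])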

Custom spelling

The Custom Spelling feature allows you to customize how words are spelled or formatted in the transcription text. This can be useful for correcting common misspellings or formatting inconsistencies. For instance, it could be used to change "Hans Zimer" to "Hans Zimmer", or "k8s" to "Kubernetes".

To use Custom Spelling, include the custom_spelling parameter in your API request. The parameter should be an array of objects, with each object specifying a mapping from a word or phrase to a new spelling or format. Each object should include a from key, which specifies the word or phrase to be replaced, and a to key, which specifies the new spelling or format.

Note that the value in the to key is case sensitive, but the value in the from key is not. Additionally, the to key should only contain one word, while the from key can contain multiple words.
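A sketch of a request with Custom Spelling, reusing the quickstart setup; note that giving the from value as a list of phrases is an assumption about the exact payload shape:

payload = {
    "audio_url": "https://example.com/audio.mp3",
    "custom_spelling": [
        {"from": ["Hans Zimer"], "to": "Hans Zimmer"},
        {"from": ["k8s"], "to": "Kubernetes"},
    ],
}
requests.post(f"{base_url}/transcript", headers=headers, json=payload)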

Custom vocabulary

Including the word_boost parameter in your API request is an easy way to improve transcription accuracy when you know certain words or phrases will appear frequently in your audio file.

You can also include the optional boost_param parameter in your API request to control how much weight should be applied to your keywords/phrases. This value can be either low, default, or high.

It's important to follow formatting guidelines for custom vocabulary to ensure the best results. Remove all punctuation, except apostrophes, and make sure each word is in its spoken form. Acronyms should have no spaces between letters. Additionally, the model will still accept words with unique characters such as é, but will convert them to their ASCII equivalent.

There are some limitations to the parameter. You can pass a maximum of 1,000 unique keywords/phrases in your list, and each of them must contain six words or fewer.
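A sketch of a request with boosted vocabulary, reusing the quickstart setup:

payload = {
    "audio_url": "https://example.com/audio.mp3",
    "word_boost": ["aws", "azure", "kubernetes"],  # spoken-form terms, no punctuation
    "boost_param": "high",  # low, default, or high
}
requests.post(f"{base_url}/transcript", headers=headers, json=payload)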

Dual channel transcription

If you have a dual channel audio file with multiple speakers, the AssemblyAI API supports transcribing each of them separately. This can be useful for phone call recordings or any other audio file with distinct channels.

To enable it, include the dual_channel parameter in your request when submitting a file for transcription and set it to true. Keep in mind that it will take approximately 25% longer to complete than normal transcriptions.

Once your transcription is complete, the API's response will include an additional utterances key, containing a list of turn-by-turn utterances, identified by each audio channel. Each object in the list contains channel information (either "1" or "2") so you can tell which channel each utterance is from. Additionally, each word in the words array contains the channel identifier.
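A sketch of submitting a dual channel file and reading the per-channel utterances afterwards, reusing the quickstart setup:

payload = {
    "audio_url": "https://example.com/phone_call.mp3",  # placeholder URL
    "dual_channel": True,  # transcribe each channel separately
}
requests.post(f"{base_url}/transcript", headers=headers, json=payload)

# once the transcript is completed:
for utterance in transcript["utterances"]:
    print(f"Channel {utterance['channel']}: {utterance['text']}")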

Export SRT or VTT caption files

You can use the AssemblyAI API to export your completed transcripts in SRT or VTT format, which can be used for subtitles and closed captions in videos. Once your transcript status shows as completed, you can make a request to the appropriate endpoint to export your transcript in VTT or SRT format.

You can also customize the maximum number of characters per caption using the chars_per_caption URL parameter in your API requests to either the SRT or VTT endpoints. For example, adding ?chars_per_caption=32 to the SRT endpoint URL will ensure that each caption has no more than 32 characters.
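For example, a sketch of exporting SRT captions with a 32-character limit, reusing the quickstart setup (the VTT endpoint works the same way):

srt = requests.get(
    f"{base_url}/transcript/{transcript_id}/srt",
    headers=headers,
    params={"chars_per_caption": 32},  # optional caption length limit
)
with open("captions.srt", "w") as f:
    f.write(srt.text)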

Exporting paragraphs and sentences

AssemblyAI provides two endpoints for retrieving transcripts that are automatically segmented into paragraphs or sentences for a more reader-friendly experience. These endpoints return the text of the transcript broken down by either paragraphs or sentences, along with additional metadata such as the start and end times of each segment, confidence scores, and more.

The response will be a JSON object containing an array of objects, each representing a sentence or a paragraph in the transcript. Each object contains a text parameter with the text, start and end parameters with the start and end times of the sentence in milliseconds, a confidence score, and an array of word objects, each representing a word in the sentence or paragraph.
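A sketch of fetching the sentence-level segments, reusing the quickstart setup and assuming the endpoint lives at /v2/transcript/{id}/sentences (the paragraphs endpoint works the same way):

sentences = requests.get(
    f"{base_url}/transcript/{transcript_id}/sentences", headers=headers
).json()
for sentence in sentences["sentences"]:
    print(sentence["start"], sentence["end"], sentence["text"])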

Filler words

Filler words, such as "um" and "uh", are commonly used in everyday speech. By default, the AssemblyAI API removes these words from transcripts to create a cleaner output. However, if you wish to keep them in your transcript, you can set the disfluencies parameter to true in your request when submitting files for processing.
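For example, a request body keeping filler words might look like this (a sketch reusing the quickstart setup):

payload = {
    "audio_url": "https://example.com/audio.mp3",
    "disfluencies": True,  # keep filler words like "um" and "uh"
}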

Profanity filtering

Profanity can be a concern for certain use cases, and the AssemblyAI API allows you to automatically filter it out from the transcripts. By default, the API will provide a verbatim transcription, including any swear word that was spoken in the audio. However, you can enable filtering by including the filter_profanity parameter in your request.

After your transcription is completed, you will receive a response from the API as usual, but any profanity in the text will be replaced with asterisks. Note that profanity filtering is not perfect, and certain words may still be missed or improperly filtered.
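A sketch of enabling the filter, reusing the quickstart setup:

payload = {
    "audio_url": "https://example.com/audio.mp3",
    "filter_profanity": True,  # replace profanity with asterisks
}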

Specifying when to start and end the transcript

If you have a long audio file and only want to transcribe a portion of it, you can set the audio_start_from parameter to the time, in milliseconds, at which you want the transcription to start and the audio_end_at parameter to the time at which you want the transcription to end.

It's important to note that these parameters are optional. If you don't include them, the API will transcribe the entire audio file from beginning to end.
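For example, a sketch of transcribing only the portion between the 1- and 3-minute marks, reusing the quickstart setup:

payload = {
    "audio_url": "https://example.com/audio.mp3",
    "audio_start_from": 60000,  # start at 1 minute (in milliseconds)
    "audio_end_at": 180000,     # end at 3 minutes (in milliseconds)
}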

Understanding the status of your transcript

When working with the AssemblyAI API, it's important to understand the different statuses that a transcription can have. They will let you know whether the job is processing, queued, completed, or has encountered an error.

processing: The audio file is being processed by the API.
queued: The audio file is in the queue to be processed by the API.
completed: The transcription job has been completed successfully.
error: An error occurred while processing the audio file.
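A sketch of handling these statuses when polling, reusing the transcript object from the quickstart:

status = transcript["status"]
if status == "completed":
    print(transcript["text"])
elif status == "error":
    print(transcript["error"])
else:
    # "queued" or "processing": wait and poll again
    pass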

Handling errors

Transcription jobs can fail for various reasons, such as an unsupported audio file format, a file with no audio data, or an unreachable URL. Whenever a transcription job fails, the status of the transcription will be error and there will be an error key in the response from the API when fetching the transcription.

If a transcription job fails due to an error on the API side (a server error), it's recommended to resubmit the file for transcription. When you resubmit the file, usually a different server in the API cluster will be able to process your audio file successfully.
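A sketch of that pattern, reusing the quickstart setup:

if transcript["status"] == "error":
    print(transcript["error"])
    # for server-side failures, resubmitting the same file usually succeeds
    response = requests.post(
        f"{base_url}/transcript",
        headers=headers,
        json={"audio_url": "https://example.com/audio.mp3"},
    )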

Understanding the response

When using the AssemblyAI API, the response you receive is a transcript object containing information about the transcription process and its result. This object has various attributes, each with its own purpose and value. Depending on which features you enable, some attributes may or may not be set. For example, if you enable dual channel transcription, the utterances attribute will contain turn-by-turn results for each channel. Likewise, if you enable automatic punctuation, the punctuate attribute will be set to true and the resulting text attribute will include punctuation.

Of all the attributes, text is at the core of AssemblyAI. It contains the transcription of the audio file you submitted, and can be accessed directly from the response.

response['text']

# What's your computer set up? What's like the perfect are you somebody
# that's flexible to no matter what laptop, four screens? Or do you
# prefer a certain setup that you're most productive? I guess the one
# that I'm familiar with is one large screen, 27 inch and my laptop on
# the side.

We can also get information about each individual word in the transcript using the words attribute. Note that the speaker information will only be included if using the Speaker Diarization model.

response['words']

# [
#   {
#     'text': 'Ted',
#     'start': 8650,
#     'end': 8826,
#     'confidence': 1.0,
#     'speaker': 'A'
#   },
#   {
#     'text': 'talks',
#     'start': 8858,
#     'end': 9178,
#     'confidence': 0.98086,
#     'speaker': 'A'
#   },
#   …
# ]

Word search

The AssemblyAI API allows you to search through a completed transcript for a specific set of keywords, which is useful for quickly finding relevant information. You can search for individual words, numbers, or phrases.

The request will return a response with the following keys.

id: The ID of the transcript.
total_count: The total number of matched instances. For example, if "word 1" matched 2 times and "word 2" matched 3 times, the value will be 5.
matches: An array of all matched words and associated data.
matches.text: The word itself.
matches.count: The total number of times the word appears in the transcript.
matches.timestamps: An array of timestamps structured as [start_time, end_time].
matches.indexes: An array of all index locations for that word within the words array of the completed transcript.
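A sketch of a word search request, reusing the quickstart setup and assuming a word-search endpoint of the form /v2/transcript/{id}/word-search that takes a comma-separated words parameter:

results = requests.get(
    f"{base_url}/transcript/{transcript_id}/word-search",
    headers=headers,
    params={"words": "hamburgers,burger king"},  # comma-separated search terms
).json()

print(results["total_count"])
for match in results["matches"]:
    print(match["text"], match["count"], match["timestamps"])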

Troubleshooting

How can I make certain words more likely to be transcribed?

You can include words, phrases, or both in the word_boost parameter. Any term included will have its likelihood of being transcribed boosted.

Can I customize how words are spelled by the model?

Yes. The Custom Spelling feature gives you the ability to specify how words are spelled or formatted in the transcript text. For example, Custom Spelling could be used to change the spelling of all instances of the word "Ariana" to "Arianna". It could also be used to change the formatting of "CS 50" to "CS50".

Why am I receiving a "400 Bad Request" error when making an API request?

A "400 Bad Request" error typically indicates that there is a problem with the formatting or content of the API request. Double-check the syntax of your request and ensure that all required parameters are included as described in the API Reference. If the issue persists, contact our support team for assistance.