Frequently asked questions

What are the API limits on file size or file duration?

Currently, the maximum file size that can be submitted to the /v2/transcript endpoint for transcription is 5GB, and the maximum duration is 10 hours.

The maximum file size for a local file uploaded to the API via the /v2/upload endpoint is 2.2GB.

How long does transcription take?

Our asynchronous transcription API performs at approximately 5x the realtime speed of audio.

Files submitted for asynchronous transcription (i.e., prerecorded audio or video files submitted to our /v2/transcript endpoint) will complete in 15-30% of the file's duration. For example, a 10-minute file would complete in approximately one and a half to three minutes.

All submissions also take a minimum of 15-20 seconds to transcode and process through our API.

Files submitted for real-time transcription will receive a response within a few hundred milliseconds.

Can I get timestamps for individual words? How do timestamps work?

The response for a completed request will include start and end keys. These keys are timestamp values for when a given word, phrase, or sentence starts and ends. These values are in milliseconds (seconds * 1000) and are accurate to within about 400 milliseconds.
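As an illustrative sketch (using the requests library, with placeholder values for the API key and transcript ID), you can read the per-word timestamps from the words array of a completed transcript like this:

```python
import requests

API_KEY = "YOUR_API_KEY"              # placeholder
TRANSCRIPT_ID = "YOUR_TRANSCRIPT_ID"  # placeholder

# Fetch a completed transcript from the /v2/transcript endpoint
response = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{TRANSCRIPT_ID}",
    headers={"authorization": API_KEY},
)
transcript = response.json()

# Each entry in "words" carries start/end timestamps in milliseconds
for word in transcript["words"]:
    print(f'{word["text"]}: {word["start"]}ms - {word["end"]}ms')
```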

How do the Custom Vocabulary and Custom Spelling features work?

The Custom Vocabulary feature allows you to submit a list of words or phrases to boost the likelihood that the model will predict those words. This is intended to help with words or terms that might be under-represented in the training data.

The Custom Spelling feature allows you to control how words are spelled or formatted in the transcript text. It works like a find and replace feature — anytime you would see X in the API output, it will be replaced with Y.
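As a rough sketch of how both features are set on a transcript request (the word_boost and custom_spelling parameter names follow the v2 request format; the API key, audio URL, and terms are placeholders):

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

# Custom Vocabulary boosts the listed terms; Custom Spelling maps the
# "from" variants to the desired "to" spelling in the transcript text.
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder URL
    "word_boost": ["AssemblyAI", "word error rate"],
    "custom_spelling": [
        {"from": ["assembly ai", "assemblyai"], "to": "AssemblyAI"},
    ],
}

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    json=payload,
    headers={"authorization": API_KEY},
)
print(response.json()["id"])  # ID used to poll for the completed transcript
```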

How long are audio or video files submitted to the API stored?

Files submitted to our API are encrypted in transit, and all submitted audio or video data is deleted from our servers as soon as processing is completed. You can also delete the transcript itself at any time by making a DELETE request to the API, as described in the next question.

Can completed transcripts be deleted?

Completed transcripts are stored in our database, encrypted at rest, so that we can serve them to you and your application.

To permanently delete the transcription from our database once you've retrieved it, you can make a DELETE request to the API.
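For example, a minimal sketch of the deletion call (with placeholder values for the API key and transcript ID):

```python
import requests

API_KEY = "YOUR_API_KEY"              # placeholder
TRANSCRIPT_ID = "YOUR_TRANSCRIPT_ID"  # placeholder

# Permanently removes the transcript from AssemblyAI's database
response = requests.delete(
    f"https://api.assemblyai.com/v2/transcript/{TRANSCRIPT_ID}",
    headers={"authorization": API_KEY},
)
print(response.status_code)
```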

Can I get a list of all transcripts I have created?

You can retrieve a list of all transcripts that you have created by making a GET request to the API.
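A minimal sketch of that request (the limit query parameter and the transcripts key in the response are assumptions based on the v2 list endpoint's paginated format):

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

# Returns a paginated collection of transcripts created under your account
response = requests.get(
    "https://api.assemblyai.com/v2/transcript",
    params={"limit": 10},
    headers={"authorization": API_KEY},
)
for transcript in response.json()["transcripts"]:
    print(transcript["id"], transcript["status"])
```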

Do you offer any discounts on pricing?

If you plan to send a large amount of audio or video content through our API, please reach out to support@assemblyai.com to see if you qualify for a volume discount.

How can I get more information about an error? How do I contact support?

Any time you make a request to the API, you should receive a JSON response. If you don't receive the expected output, the JSON will contain an error key with a message value describing the error.
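As an illustrative sketch (the audio URL is a placeholder and the exact wording of error messages will vary), checking for that key looks like:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    json={"audio_url": "https://example.com/missing.mp3"},  # placeholder URL
    headers={"authorization": API_KEY},
)
result = response.json()

# An unsuccessful request includes an "error" key describing what went wrong
if "error" in result:
    print(f"Request failed: {result['error']}")
```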

You can also reach out to our support team any time by sending an email to support@assemblyai.com. When reaching out, please include a detailed description of any issues you're experiencing as well as transcript IDs for affected requests, if possible.

How do custom language models compare with general models?

In the field of automatic speech recognition (ASR), custom models are rarely more accurate than the best general models (one common measure of accuracy is Word Error Rate, or WER). This is because general models are trained on huge datasets and are constantly maintained and updated using the latest deep learning research.

For example, at AssemblyAI, we train large deep neural networks on over 600,000 hours of speech data. This training data is a mix of many different types of audio (broadcast TV recordings, phone calls, Zoom meetings, videos, etc), accents, and speakers. This massive amount of diverse training data helps our ASR models to generalize extremely well across all types of audio/data, speakers, recording quality, and accents when converting Speech-to-Text in the real world.

Custom models usually come into the mix when dealing with audio data that have unique characteristics unseen by a general model. However, because large, accurate general models see most types of audio data during training, there are not many “unique characteristics” that would trip up a general model - or that a custom model would even be able to learn.

To learn more about this and related topics, check out the AssemblyAI blog.