Frequently asked questions
The AssemblyAI API supports most common audio and video file formats. We recommend that you submit your audio in its native format without additional transcoding or file conversion. Transcoding or converting it to another format can sometimes result in a loss of quality, especially if you're converting compressed formats like .mp3. The AssemblyAI API converts all files to 16 kHz uncompressed audio as part of our transcription pipeline.
Note that when you upload a video to our API, the audio will be extracted from it and processed independently, so the list of supported video formats isn't exhaustive. If you need support for a format that isn't listed below, please contact our team at email@example.com.
|Audio formats||Video formats|
|.8svx||.mts, .m2ts, .ts|
|.aif||.mp4, .m4p (with DRM), .m4v|
|.ogg, .oga, .mogg|| |
Currently, the maximum file size that can be submitted to the /v2/transcript endpoint for transcription is 5 GB, and the maximum duration is 10 hours.
The maximum file size for a local file uploaded to the API via the /v2/upload endpoint is 2.2 GB.
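As an illustration, a client could check a file against these limits before submitting it. The following is a minimal sketch, not part of the API: the function name is hypothetical, and treating "GB" as decimal gigabytes is an assumption.

```python
# Limits quoted in this FAQ. Interpreting "GB" as 10**9 bytes is an assumption.
MAX_UPLOAD_BYTES = 2_200_000_000        # local file sent to /v2/upload
MAX_TRANSCRIPT_BYTES = 5_000_000_000    # file submitted to /v2/transcript
MAX_DURATION_SECONDS = 10 * 60 * 60     # 10 hours


def check_limits(size_bytes, duration_seconds, local_upload=True):
    """Return a list of problems; an empty list means the file fits the limits."""
    problems = []
    limit = MAX_UPLOAD_BYTES if local_upload else MAX_TRANSCRIPT_BYTES
    if size_bytes > limit:
        problems.append(f"file is {size_bytes} bytes, limit is {limit}")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append(
            f"duration is {duration_seconds}s, limit is {MAX_DURATION_SECONDS}s"
        )
    return problems
```

A 3 GB file would fail this check as a local upload but pass when submitted to /v2/transcript by URL.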
Our asynchronous transcription API performs at approximately 5x the real-time speed of audio.
Files submitted for asynchronous transcription (i.e. prerecorded audio or video files submitted to our /v2/transcript endpoint) complete in 15-30% of the file's duration. For example, a 10-minute file would complete in 1.5 to 3 minutes.
All submissions also take a minimum of 15-20 seconds to transcode and process through our API.
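The arithmetic above can be sketched as a small helper. This is only a rough estimate under stated assumptions: the function name is hypothetical, and the 15-20 second minimum is treated here as a floor on the result rather than as time added on top.

```python
def estimate_turnaround(duration_seconds):
    """Estimate (low, high) asynchronous processing time in seconds.

    Uses the 15-30% range quoted above and treats the 15-20 second
    minimum as a floor (an assumption, not documented behavior).
    """
    low = max(duration_seconds * 15 / 100, 15.0)
    high = max(duration_seconds * 30 / 100, 20.0)
    return low, high


# A 10-minute file: between 90 and 180 seconds, i.e. 1.5 to 3 minutes.
low, high = estimate_turnaround(10 * 60)
```

For very short files the floor dominates, so a 10-second clip would still be estimated at 15 to 20 seconds.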
Files submitted for real-time transcription receive a response within a few hundred milliseconds.
The response for a completed request includes start and end keys. These keys are timestamp values for when a given word, phrase, or sentence starts and ends. These values are in milliseconds and are accurate to within about 400 milliseconds.
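Because the timestamps are plain millisecond offsets, they are easy to convert for display. The sketch below uses a made-up response excerpt: the "words" list and its "text" field are assumptions for illustration, and only the start and end keys come from the description above.

```python
# Hypothetical excerpt of a completed response (shape assumed for illustration).
sample_response = {
    "words": [
        {"text": "Hello", "start": 250, "end": 650},
        {"text": "world", "start": 700, "end": 1100},
    ]
}


def format_ms(ms):
    """Format a millisecond offset as MM:SS.mmm."""
    minutes, rem = divmod(ms, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def word_spans(response):
    """Return (text, start, end, duration_ms) for each word in the response."""
    return [
        (w["text"], format_ms(w["start"]), format_ms(w["end"]), w["end"] - w["start"])
        for w in response["words"]
    ]
```

Given the stated accuracy of about 400 milliseconds, displaying more than millisecond precision would not add information.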
The Custom Vocabulary feature allows you to submit a list of words or phrases to boost the likelihood that the model predicts those words. This is intended to help with words or terms that might be under-represented in the training data.
The Custom Spelling feature allows you to control how words are spelled or formatted in the transcript text. It works like a find and replace feature: any time X would appear in the API output, it is replaced with Y.
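To make the find-and-replace behavior concrete, here is a local simulation in Python. It does not call the API, and the rule format, function name, and matching rules (whole words, case-insensitive) are all assumptions chosen for illustration; the real feature may match differently.

```python
import re


def apply_custom_spelling(text, rules):
    """Replace each 'from' spelling with its 'to' spelling (local simulation only)."""
    for rule in rules:
        for source in rule["from"]:
            # Match whole words only, ignoring case (an assumed matching rule).
            pattern = r"\b" + re.escape(source) + r"\b"
            text = re.sub(
                pattern, lambda _m, to=rule["to"]: to, text, flags=re.IGNORECASE
            )
    return text


# Hypothetical rules: every X listed under "from" becomes the Y under "to".
rules = [
    {"from": ["sequel"], "to": "SQL"},
    {"from": ["decarlo"], "to": "DeCarlo"},
]
result = apply_custom_spelling("decarlo wrote the sequel query", rules)
```

This differs from Custom Vocabulary, which influences what the model predicts rather than rewriting the text after the fact.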
Files submitted to our API are encrypted in transit, and all submitted audio or video data is deleted from our servers as soon as the transcription has completed.
If you upload a local file but don't transcribe it, we delete it after 24 hours. The corresponding upload_url will no longer be valid.
You can also delete the transcript itself at any time.
Completed transcripts are stored in our database, encrypted at rest, so that we can serve them to you and your application.
To permanently delete the transcription from our database once you've retrieved it, you can make a DELETE request to the API.
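A DELETE request of this kind might be built as follows with Python's standard library. This is a sketch under assumptions: the base URL, the path shape with the transcript ID appended, and the authorization header name are not stated in this FAQ, so check them against the API reference before use.

```python
import urllib.request

# Assumed base URL and path shape; only the /v2/transcript endpoint name
# appears in this FAQ.
BASE_URL = "https://api.assemblyai.com/v2/transcript"


def build_delete_request(transcript_id, api_key):
    """Build (but do not send) a DELETE request for one transcript."""
    return urllib.request.Request(
        f"{BASE_URL}/{transcript_id}",
        headers={"authorization": api_key},  # header name is an assumption
        method="DELETE",
    )


# To actually send it:
#   urllib.request.urlopen(build_delete_request("your-transcript-id", "your-api-key"))
```

Building the request separately from sending it makes it easy to inspect or log the call before anything is permanently deleted.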
You can retrieve a list of all transcripts that you have created by making a GET request to the API.
If you plan to send a large amount of audio or video content through our API, please reach out to firstname.lastname@example.org to see if you qualify for a volume discount.
Any time you make a request to the API, you should receive a JSON response. If you don't receive the expected output, the JSON contains an error key with a message value describing the error.
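One way to handle this in client code is to check every parsed response for the error key before using it. The following is a minimal sketch; the exception class and function name are hypothetical, and the sample bodies are made up.

```python
import json


class ApiError(Exception):
    """Raised when a response body carries an 'error' key."""


def parse_response(body):
    """Parse a JSON response body and raise ApiError if it reports an error."""
    data = json.loads(body)
    if isinstance(data, dict) and "error" in data:
        raise ApiError(data["error"])
    return data


# Made-up bodies for illustration.
ok = parse_response('{"id": "abc123", "status": "queued"}')
try:
    parse_response('{"error": "example error message"}')
except ApiError as exc:
    message = str(exc)
```

Surfacing the message this way also makes it easy to include in a support request.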
You can also reach out to our support team at any time by sending an email to email@example.com. When reaching out, please include a detailed description of any issues you're experiencing as well as transcript IDs for affected requests, if possible.
In the field of automatic speech recognition (ASR), custom models are rarely more accurate than the best general models (one common measure of accuracy is Word Error Rate). This is because general models are trained on huge datasets and are constantly maintained and updated using the latest deep learning research.
For example, at AssemblyAI, we train large deep neural networks on over 600,000 hours of speech data. This training data is a mix of many different types of audio (broadcast TV recordings, phone calls, Zoom meetings, videos, etc.), accents, and speakers. This massive amount of diverse training data helps our ASR models to generalize extremely well across all types of audio/data, speakers, recording quality, and accents when converting Speech-to-Text in the real world.
Custom models usually come into the mix when dealing with audio data that has unique characteristics unseen by a general model. However, because large, accurate general models see most types of audio data during training, there aren't many “unique characteristics” that would trip up a general model, or that a custom model could even learn.
To learn more about this and related topics, check out the AssemblyAI blog.