Core Transcription
The AssemblyAI Speaker Diarization model can automatically detect the number of speakers in your audio file. Speakers will be labeled as Speaker A, Speaker B, etc., and each word in the transcription text will be associated with its speaker.
To use this feature, include the speaker_labels parameter with a value of true in your POST request as shown on the right.
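For example, here is a minimal Python sketch of such a request; the requests library, the audio URL, and the API token are assumptions or placeholders:
import requests

# Submit a file for transcription with Speaker Diarization enabled
response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers={"authorization": "YOUR-ASSEMBLYAI-TOKEN"},
    json={
        "audio_url": "https://example.com/audio.mp3",  # placeholder audio file URL
        "speaker_labels": True,
    },
)
print(response.json()["id"])  # keep the transcript id so you can poll for completion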
Note
Speaker Labels is not supported when Dual Channel Transcription is turned on. You can have either Speaker Labels or Dual Channel enabled when submitting a file for transcription, but not both.
When your transcription is completed, you'll see an utterances
key in the JSON response, as shown on the right. This key will contain a list of "turn-by-turn" utterances, as they appeared in the audio recording. A "turn" refers to a change in speakers during the conversation.
Heads Up
Our Speaker Diarization model can detect up to 10 unique speakers. In order to be reliably identified as a unique speaker, a person will need to speak for approximately 30 seconds over the course of the audio file.
If a person doesn't speak much over the duration of the audio file, or they tend to speak in short or single-word statements like "okay" or "sounds good", the model may have difficulty identifying them as a unique speaker.
The Speaker Diarization model accepts an optional speakers_expected parameter, which can be used to specify the expected number of speakers in a file. The value of the speakers_expected parameter should be an integer.
Note
The speakers_expected parameter is not fully deterministic, meaning it will help the model choose the correct number of speakers, but there could be situations where the model returns a different number of speakers based on the spoken audio data.
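As a sketch, the parameter is simply added to the same JSON payload used above; the value here is purely illustrative:
# Hint that roughly three people speak in the file (illustrative value)
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder
    "speaker_labels": True,
    "speakers_expected": 3,
}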
On the right, we show you how to include the word_boost parameter in your POST requests to enable support for Custom Vocabulary. You can include words, phrases, or both in the word_boost parameter. Any term included will have its likelihood of being transcribed boosted.
You can also include the optional boost_param parameter in your POST request to control how much weight should be applied to your keywords/phrases. This value can be either low, default, or high.
When building your word boost list, keep in mind how each term is written compared to how it is spoken, for example:
- triple a versus aaa
- iphone seven versus iphone 7
- abc versus a b c
Sometimes your word boost list may contain a unique character that the model is not expecting, such as the é in Andrés. In these cases, our model will still accept the word, convert the special character to its ASCII equivalent if there is one (in this case, Andres), and then return the word in the transcript (if detected) without the accented/unique character.
You can pass a maximum of 1,000 unique keywords/phrases in your word_boost list. Each keyword/phrase in the list must be 6 words or less.
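As a rough sketch, the parameters would be added to the POST payload like this; the boosted terms and the audio URL are placeholders:
# Boost the likelihood that these terms are transcribed (terms are illustrative)
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder
    "word_boost": ["aws", "azure", "google cloud"],
    "boost_param": "high",  # optional: low, default, or high
}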
The Custom Spelling feature gives you the ability to specify how words are spelled or formatted in the transcript text. For example, Custom Spelling could be used to change the spelling of all instances of the word "Ariana" to "Arianna". It could also be used to change the formatting of "CS 50" to "CS50".
The custom_spelling parameter, along with from and to values, is used to define how the spelling of a word or words should be customized. The from value is how the word would normally be predicted in the transcript. The to value is how you would like the word to be spelled or formatted. Here is a JSON object showing how the custom_spelling parameter could be used:
"custom_spelling": [
{
"from": ["cs 50", "cs fifty"],
"to": "CS50"
},
{
"from": ["Ariana"],
"to": "Arianna"
},
{
"from": ["Carla"],
"to": "Karla"
},
{
"from": ["Sarah"],
"to": "Sara"
}
]
Note:
The value in the to key is case sensitive, but the value in the from key is not. Additionally, the from key can contain multiple words; however, the to key must contain one word.
You can reference the code examples on the right to see how the custom_spelling parameter is used in a POST request.
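For instance, here is a minimal sketch of a POST payload that reuses part of the custom_spelling object above (the audio URL is a placeholder):
# Apply the Custom Spelling rules when transcribing the file
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder
    "custom_spelling": [
        {"from": ["cs 50", "cs fifty"], "to": "CS50"},
        {"from": ["Ariana"], "to": "Arianna"},
    ],
}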
If you have a dual channel audio file, for example a phone call recording with the agent on one channel and the customer on the other, the API supports transcribing each channel separately.
When submitting this type of file for transcription, include the dual_channel parameter in your POST request and set it to true.
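A minimal sketch of the payload, with a placeholder recording URL:
# Transcribe each channel of a dual channel recording separately
payload = {
    "audio_url": "https://example.com/call-recording.mp3",  # placeholder
    "dual_channel": True,
}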
Heads up
Dual channel transcriptions take ~25% longer to complete than normal, since we need to transcribe each channel separately, which adds a little extra overhead!
Once your transcription is complete, there will be an additional utterances key in the API's JSON response. The utterances key will contain a list of turn-by-turn utterances, as they appeared in the audio recording, identified by each audio channel.
Each JSON object in the utterances list contains the channel information (this will be either "1" or "2"), so you can tell which channel each utterance is from. Each word in the words array will also contain the channel key.
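As a rough Python sketch, assuming the requests library, placeholder values for the transcript id and API token, and that each utterance object also carries its text, the completed transcript could be fetched and its utterances printed per channel like this:
import requests

transcript_id = "YOUR-TRANSCRIPT-ID"  # placeholder
transcript = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
    headers={"authorization": "YOUR-ASSEMBLYAI-TOKEN"},
).json()

for utterance in transcript["utterances"]:
    # Each utterance is labeled with the channel it was spoken on ("1" or "2")
    print(f'Channel {utterance["channel"]}: {utterance["text"]}')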
By default, the API will remove Filler Words, like "um" and "uh", from transcripts.
To include Filler Words in your transcripts, set the disfluencies parameter to true in your POST request when submitting files for processing to the /v2/transcript endpoint, as shown on the right.
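A minimal sketch of the payload (the audio URL is a placeholder):
# Keep filler words like "um" and "uh" in the transcript
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder
    "disfluencies": True,
}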
The list of Filler Words the API will transcribe is:
Once the transcription has been completed, you will get a response from the API as per usual, but Filler Words will be present in the transcription text and words array along with any other spoken word.
The Automatic Language Detection feature can identify the dominant language that’s spoken in an audio file, and route the file to the appropriate model for the detected language.
Note
Automatic Language Detection is supported for the following languages:
The Automatic Language Detection model will always detect the language of the audio file as one of the above languages.
To enable this feature, include the language_detection parameter with a value of true in your POST request when submitting a file for processing.
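As a sketch, the payload would look like the following (placeholder audio URL):
# Let the API detect the dominant spoken language and route the file accordingly
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder
    "language_detection": True,
}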
If you know the language of the spoken audio in a file, you can specify that in your POST request as shown in the documentation for Specifying a Language.
Heads Up
In order to reliably identify the dominant language in a file, the model needs approximately 50 seconds of spoken audio in that language over the course of the audio file.
In the case where the file does not contain any spoken audio, the user will receive a language_detection cannot be performed on files with no spoken audio error.
As seen in the JSON response to the right, the language that was detected by the model can be found via the value of the language_code key once the transcript is completed.
By default, the API will punctuate the transcription text, automatically case proper nouns, and convert numbers to their numerical format.
For example, i ate ten hamburgers at burger king will be converted to I ate 10 hamburgers at Burger King. If you want to turn these features off, you can disable either, or both, of them by including a few additional parameters in your API request.
By setting the punctuate parameter to false and format_text to false, you can disable the punctuation and text formatting features; in the above example, the transcript returned to you would be i ate ten hamburgers at burger king.
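A minimal sketch of a payload with both features disabled (placeholder audio URL):
# Return raw transcription text: no punctuation, casing, or number formatting
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder
    "punctuate": False,
    "format_text": False,
}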
Heads up
The transcript must be completed before using these API endpoints.
You can export your complete transcripts in SRT or VTT format, to be plugged into a video player for subtitles and closed captions.
Once your transcript status shows as completed, you can make a GET request to the following endpoints, as shown on the right, to export your transcript in VTT or SRT format.
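As a rough Python sketch, assuming the requests library, placeholder values for the transcript id and API token, and that the /vtt path mirrors the /srt path shown further below, the captions could be fetched like this:
import requests

transcript_id = "YOUR-TRANSCRIPT-ID"  # placeholder
headers = {"authorization": "YOUR-ASSEMBLYAI-TOKEN"}

# Export the completed transcript as SRT or VTT captions (plain-text responses)
srt = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{transcript_id}/srt", headers=headers
).text
vtt = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{transcript_id}/vtt", headers=headers
).text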
The API will return a plain-text response in VTT format, like below:
WEBVTT
00:12.340 --> 00:16.220
Last year I showed these two slides said that demonstrate
00:16.200 --> 00:20.040
that the Arctic ice cap which for most of the last 3,000,000 years has been the
00:20.020 --> 00:25.040
size of the lower 48 States has shrunk by 40% but this understates
...
The API will return a plain-text response in SRT format, like below:
1
00:00:12,340 --> 00:00:16,380
Last year I showed these two slides said that demonstrate that
2
00:00:16,340 --> 00:00:19,920
the Arctic ice cap which for most of the last 3,000,000 years has been
3
00:00:19,880 --> 00:00:23,120
the size of the lower 48 States has shrunk by 40%
...
To control the maximum number of characters per caption, you can use the chars_per_caption URL parameter in your API requests to either the SRT or VTT endpoints. For example:
https://api.assemblyai.com/v2/transcript/<your transcript id>/srt?chars_per_caption=32
In the above example, the API will make sure each caption has no more than 32 characters.
Heads up
The transcript must be completed before using these API endpoints.
You can use either of the following endpoints to retrieve a completed transcript automatically broken down into paragraphs or sentences. Using these endpoints, the API will attempt to semantically segment your transcript into paragraphs/sentences to create more reader-friendly transcripts.
/v2/transcript/{TRANSCRIPT-ID}/sentences
/v2/transcript/{TRANSCRIPT-ID}/paragraphs
The JSON response for these endpoints is shown on the right.
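A rough Python sketch of calling these endpoints, assuming the requests library and placeholder values for the transcript id and API token:
import requests

transcript_id = "YOUR-TRANSCRIPT-ID"  # placeholder
headers = {"authorization": "YOUR-ASSEMBLYAI-TOKEN"}

# Fetch the transcript segmented into sentences
sentences = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{transcript_id}/sentences",
    headers=headers,
).json()

# Fetch the transcript segmented into paragraphs
paragraphs = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{transcript_id}/paragraphs",
    headers=headers,
).json()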
By default, the API will return a verbatim transcription of the audio, meaning profanity will be present in the transcript if spoken in the audio.
To replace profanity with asterisks, as shown below, include the additional filter_profanity parameter in your request when submitting files for transcription, and set this to true.
It was some tough s*** that they had to go through. But they did it. I mean, it blows my f****** mind every time I hear the story.
The JSON for your completed transcript will come back as usual, but the text will contain asterisks where profanity was spoken.
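A minimal sketch of the payload that enables the filter (placeholder audio URL):
# Replace profanity in the transcript text with asterisks
payload = {
    "audio_url": "https://example.com/audio.mp3",  # placeholder
    "filter_profanity": True,
}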
Once a transcript has been completed, you can search through the transcript for a specific set of keywords. You can search for individual words, numbers, or phrases containing up to five words or numbers.
The words query parameter should map to a JSON encoded list of strings as shown in this Python example:
import json
from urllib.parse import urlencode

# Build the word-search URL with a JSON-encoded list of search terms
words = ["foo", "bar", "foo bar", "42"]
params = {"words": json.dumps(words)}
url = f"https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID-HERE/word-search?{urlencode(params)}"
This request returns a JSON response with the following keys:
Key | Value
---|---
id | The id of the transcript
total_count | The total number of matched instances. For example, if word 1 matched 2 times and word 2 matched 3 times, total_count will equal 5
matches | Contains a list/array of all matched words and associated data
Within matches, each matched word is returned with the following keys:
Key | Value
---|---
text | The word itself
count | The total number of times the word appears in the transcript
timestamps | An array of timestamps structured as [start_time, end_time]
indexes | An array of all index locations for that word within the words array of the completed transcript
The search terms can also be passed as a plain comma-separated list, for example: ?words=ice,21,fire,5
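Putting it together, here is a rough sketch that runs the search and walks the matches; the requests library, the search terms, the transcript id, and the API token are assumptions or placeholders:
import json
from urllib.parse import urlencode

import requests

words = ["foo", "bar", "foo bar", "42"]  # illustrative search terms
params = {"words": json.dumps(words)}
url = f"https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID-HERE/word-search?{urlencode(params)}"

result = requests.get(url, headers={"authorization": "YOUR-ASSEMBLYAI-TOKEN"}).json()
print(result["total_count"])
for match in result["matches"]:
    # Each match reports the matched text, how often it occurred, and where
    print(match["text"], match["count"], match["timestamps"], match["indexes"])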
With the list endpoint, you can retrieve a list of all the transcripts you have created. This list can also be filtered by the transcript status.
Make a GET request, as shown to the right, with the following query parameters in your request. In the cURL statement to the right, for example, we are querying for the most recent 200 transcripts with the status of completed.
Query | Description | Constraints | Optional |
---|---|---|---|
limit | Max results to return in a single response | Between 1 and 200 (defaults to 10) | Yes |
status | Filter by transcript status | Must be "queued", "processing", "completed", or "error" | Yes |
The API response will contain two top-level keys: transcripts and page_details.
The transcripts key will contain an array of objects (your list of transcripts), with each object containing the following information:
key | value
---|---
id | ID of the transcript
resource_url | Make a GET request to this URL to get the complete information for this transcript
status | The current status of the transcript
created | The date and time the transcript was created
completed | The date and time your transcript finished processing
audio_url | The audio_url that was submitted in your initial POST request when creating the transcript
Heads up
Once you have deleted a transcript using our DELETE endpoint, the audio URL will no longer be available via the historical endpoint. The audio_url key will show "deleted by user".
Since the API only returns a maximum of 200 transcripts per response, it treats each response as a "page" of results. The page_details key will give you information about the current "page" you are on, and how to navigate to the next "page" of results.
To navigate to the next "page" of results, you will want to grab the value of prev_url in the page_details object from your initial GET request. You can then make the same API call as before, replacing the endpoint with the value of prev_url. You can continue to do this until prev_url is null, meaning you have pulled all your transcripts from the API!
Transcripts are listed from newest to oldest, so prev_url will always point to the prior "page" of older transcripts.
Here is the cURL request from earlier, for example:
curl --request GET \
--url "https://api.assemblyai.com/v2/transcript?limit=200&status=completed" \
--header 'authorization: YOUR-ASSEMBLYAI-TOKEN'
Once we have the response, we can make a subsequent request below using the value of prev_url to get the next "page" of results:
curl --request GET \
--url "https://api.assemblyai.com/v2/transcript?limit=200&status=completed&before_id=8w5chxgaz-dcf5-4647-8cb4-cdfeaccdaa7d" \
--header 'authorization: YOUR-ASSEMBLYAI-TOKEN'
You can continue to do this until the value of prev_url is null, meaning you have successfully retrieved all transcripts in your account!
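Here is a rough Python sketch of that pagination loop, assuming the requests library and a placeholder API token; it simply follows prev_url until it is null:
import requests

headers = {"authorization": "YOUR-ASSEMBLYAI-TOKEN"}
url = "https://api.assemblyai.com/v2/transcript?limit=200&status=completed"

all_transcripts = []
while url is not None:
    page = requests.get(url, headers=headers).json()
    all_transcripts.extend(page["transcripts"])
    # prev_url is null (None in Python) once there are no older transcripts left
    url = page["page_details"]["prev_url"]

print(len(all_transcripts))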
When making a GET request to list transcripts, you can include any of the following parameters with your GET request to further filter the results you get back.
Query | Description | Constraints | Optional |
---|---|---|---|
limit | Max results to return in a single response | Between 1 and 200 (inclusive with a default value of 10) | Yes |
status | Filter by transcript status | Must be queued, processing, completed, or error | Yes |
created_on | Only return transcripts created on this date | Format: YYYY-MM-DD | Yes |
before_id | Return transcripts that were created before this id | Valid transcript id | Yes |
after_id | Return transcripts that were created after this id | Valid transcript id | Yes |
throttled_only | Only return throttled transcripts, overrides status filter | Boolean; true or false | Yes |
By default, AssemblyAI never stores a copy of the files you submit to the API for transcription. The transcription, however, is stored in our database, encrypted at rest, so that we can serve it to you and your application.
If you'd like to permanently delete the transcription from our database once you've retrieved it, you can do so by making a DELETE request to the API as shown on the right.
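A minimal Python sketch of the request, assuming the requests library and placeholder values for the transcript id and API token:
import requests

transcript_id = "YOUR-TRANSCRIPT-ID"  # placeholder
# Permanently delete the transcription text from AssemblyAI's database
response = requests.delete(
    f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
    headers={"authorization": "YOUR-ASSEMBLYAI-TOKEN"},
)
print(response.status_code)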
Heads up
Transcripts can only be deleted after the transcription is completed!
Once a transcript is deleted, all of the sensitive information will be deleted, but certain metadata like the transcript id and audio_duration will remain for billing purposes.