Core Transcription

Speaker Labels (Speaker Diarization)

The AssemblyAI Speaker Diarization model can automatically detect the number of speakers in your audio file. Speakers will be labeled as Speaker A, Speaker B, etc., and each word in the transcription text will be associated with its speaker.

To use this feature, include the speaker_labels parameter with a value of true in your POST request, as shown in the example below.
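
For example, a minimal cURL request enabling Speaker Labels might look like this (the audio_url value is just a placeholder, and the content-type header is assumed for the JSON body):

# The audio_url below is a placeholder - replace it with a link to your own audio file.
curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://example.com/audio.mp3", "speaker_labels": true}'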

Note

Speaker Labels is not supported when Dual Channel Transcription is turned on. You can have either Speaker Labels or Dual Channel enabled when submitting a file for transcription, but not both.

When your transcription is completed, you'll see an utterances key in the JSON response, as shown on the right. This key will contain a list of "turn-by-turn" utterances, as they appeared in the audio recording. A "turn" refers to a change in speakers during the conversation.

Heads Up

Our Speaker Diarization model can detect up to 10 unique speakers. In order to be reliably identified as a unique speaker, a person will need to speak for approximately 30 seconds over the course of the audio file.

If a person doesn't speak much over the duration of the audio file, or tends to speak in short or single-word statements like "okay" or "sounds good", the model may have difficulty identifying them as a unique speaker.

Specifying Number of Speakers

The Speaker Diarization model accepts an optional speakers_expected parameter, which can be used to specify the expected number of speakers in a file. The value of the speakers_expected parameter should be an integer.

Note

The speakers_expected parameter is not fully deterministic. It helps the model choose the correct number of speakers, but in some cases the model may still return a different number of speakers based on the spoken audio data.
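
For example, to tell the model that a file likely contains three speakers, a request might look like this (audio_url is a placeholder):

# audio_url is a placeholder - replace it with a link to your own audio file.
curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://example.com/audio.mp3", "speaker_labels": true, "speakers_expected": 3}'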

Custom Vocabulary

The example below shows how to include the word_boost parameter in your POST request to add support for Custom Vocabulary. You can include words, phrases, or both in the word_boost parameter. Any term included will have its likelihood of being transcribed boosted.

Control the Weight of the Boost

You can also include the optional boost_param parameter in your POST request to control how much weight should be applied to your keywords/phrases. This value can be either low, default, or high.
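
For example, a request that boosts a few terms with a high weight might look like this (the audio_url value and the boosted terms are placeholders):

# audio_url and the boosted terms below are placeholders.
curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://example.com/audio.mp3", "word_boost": ["aws", "azure", "google cloud"], "boost_param": "high"}'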

Formatting Guidelines for Custom Vocabulary

  • Remove all punctuation, except apostrophes
  • Each word should be in its spoken form, for example: triple a versus aaa, and iphone seven versus iphone 7
  • Acronyms should have no spaces between letters, for example: abc versus a b c

Special Characters

Sometimes your word boost list may contain a unique character that the model is not expecting, such as the é in Andrés. In these cases, our model will still accept the word, convert the special character to its ASCII equivalent if one exists (Andres, in this case), and return the word in the transcript (if detected) without the accented/unique character.

Limitations

You can pass a maximum of 1,000 unique keywords/phrases in your word_boost list. Each keyword/phrase in the list must be 6 words or less.

Custom Spelling

The Custom Spelling feature gives you the ability to specify how words are spelled or formatted in the transcript text. For example, Custom Spelling could be used to change the spelling of all instances of the word "Ariana" to "Arianna". It could also be used to change the formatting of "CS 50" to "CS50".

The custom_spelling parameter along with from and to values are used to define how the spelling of a word or words should be customized. The from value is how the word would normally be predicted in the transcript. The to value is how you would like the word to be spelled or formatted. Here is a JSON object showing how the custom_spelling parameter could be used:

"custom_spelling": [
  {
    "from": ["cs 50", "cs fifty"],
    "to": "CS50"
  },
  {
    "from": ["Ariana"],
    "to": "Arianna"
  },
  {
    "from": ["Carla"],
    "to": "Karla"
  },
  {
    "from": ["Sarah"],
    "to": "Sara"
  }
]

Note:

The value in the to key is case-sensitive, but the values in the from key are not. Additionally, the from key can contain multiple words, while the to key must contain only one word.

You can reference the code example below to see how the custom_spelling parameter is used in a POST request.
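
A sketch of such a request, reusing part of the JSON object above with a placeholder audio_url, might look like this:

# audio_url is a placeholder; the custom_spelling array mirrors the example above.
curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://example.com/audio.mp3", "custom_spelling": [{"from": ["cs 50", "cs fifty"], "to": "CS50"}, {"from": ["Ariana"], "to": "Arianna"}]}'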

Dual Channel Transcription

If you have a dual channel audio file, for example a phone call recording with the agent on one channel and the customer on the other, the API supports transcribing each channel separately.

When submitting this type of file for transcription, include the dual_channel parameter in your POST request and set it to true.
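
For example, a request for a dual channel recording might look like this (audio_url is a placeholder):

# audio_url is a placeholder - replace it with a link to your own dual channel file.
curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://example.com/phone-call.mp3", "dual_channel": true}'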

Heads up

Dual channel transcriptions take ~25% longer to complete than normal, since we need to transcribe each channel separately, which adds a little extra overhead!

Once your transcription is complete, there will be an additional utterances key in the API's JSON response. The utterances key will contain a list of turn-by-turn utterances, as they appeared in the audio recording, identified by each audio channel.

Each JSON object in the utterances list contains the channel information (this will be either "1" or "2"), so you can tell which channel each utterance is from. Each word in the words array will also contain the channel key.

Filler Words

By default, the API will remove Filler Words, like "um" and "uh", from transcripts.

To include Filler Words in your transcripts, set the disfluencies parameter to true in your POST request when submitting files for processing to the /v2/transcript endpoint, as shown below.
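
A minimal request with Filler Words enabled might look like this (audio_url is a placeholder):

# audio_url is a placeholder - replace it with a link to your own audio file.
curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://example.com/audio.mp3", "disfluencies": true}'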

Supported Filler Words

The Filler Words the API will transcribe are:

  • "um"
  • "uh"
  • "hmm"
  • "mhm"
  • "uh huh"

Once the transcription has been completed, you will get a response from the API as usual, but Filler Words will be present in the transcription text and the words array along with all other spoken words.

Automatic Language Detection

The Automatic Language Detection feature can identify the dominant language that’s spoken in an audio file, and route the file to the appropriate model for the detected language.

Note

Automatic Language Detection is supported for the following languages:

  • English
  • Spanish
  • French
  • German
  • Italian
  • Portuguese
  • Dutch

The Automatic Language Detection model will always detect the language of the audio file as one of the above languages.

To enable this feature, include the language_detection parameter with a value of true in your POST request when submitting a file for processing.
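
For example, a request with Automatic Language Detection enabled might look like this (audio_url is a placeholder):

# audio_url is a placeholder - replace it with a link to your own audio file.
curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://example.com/audio.mp3", "language_detection": true}'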

If you know the language of the spoken audio in a file, you can specify that in your POST request as shown in the documentation for Specifying a Language.

Heads Up

In order to reliably identify the dominant language in a file, the model needs approximately 50 seconds of spoken audio in that language over the course of the audio file.

If the file does not contain any spoken audio, you will receive the error "language_detection cannot be performed on files with no spoken audio".

As seen in the JSON response to the right, the language that was detected by the model can be found via the value of the language_code key once the transcript is completed.

Automatic Punctuation and Casing

By default, the API will punctuate the transcription text and will automatically case proper nouns, as well as convert numbers to their numerical format.

For example, i ate ten hamburgers at burger king will be converted to I ate 10 hamburgers at Burger King. If you want to turn these features off, you can disable either, or both, of them by including a few additional parameters in your API request.

By setting the punctuate parameter to false and the format_text parameter to false, you can disable the punctuation and text formatting features; in the above example, the transcript returned to you would be i ate ten hamburgers at burger king.
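
For example, a request with both features disabled might look like this (audio_url is a placeholder):

# audio_url is a placeholder - replace it with a link to your own audio file.
curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://example.com/audio.mp3", "punctuate": false, "format_text": false}'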

Export SRT or VTT Caption Files

Heads up

The transcript must be completed before using these API endpoints.

You can export your complete transcripts in SRT or VTT format, to be plugged into a video player for subtitles and closed captions.

Once your transcript status shows as completed, you can make a GET request to the following endpoints, as shown below, to export your transcript in VTT or SRT format.
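
For example, the SRT and VTT requests might look like this (the transcript ID is a placeholder):

# Replace YOUR-TRANSCRIPT-ID with the id of a completed transcript.
curl --request GET \
  --url https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID/srt \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN'

curl --request GET \
  --url https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID/vtt \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN'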

API Response for VTT

The API will return a plain-text response in VTT format, like below:

WEBVTT

00:12.340 --> 00:16.220
Last year I showed these two slides said that demonstrate

00:16.200 --> 00:20.040
that the Arctic ice cap which for most of the last 3,000,000 years has been the

00:20.020 --> 00:25.040
size of the lower 48 States has shrunk by 40% but this understates

...

API Response for SRT

The API will return a plain-text response in SRT format, like below:

1
00:00:12,340 --> 00:00:16,380
Last year I showed these two slides said that demonstrate that

2
00:00:16,340 --> 00:00:19,920
the Arctic ice cap which for most of the last 3,000,000 years has been

3
00:00:19,880 --> 00:00:23,120
the size of the lower 48 States has shrunk by 40%

...

Customize Caption Lengths

To control the maximum number of characters per caption, you can use the chars_per_caption URL parameter in your API requests to either the SRT or VTT endpoints. For example:

https://api.assemblyai.com/v2/transcript/<your transcript id>/srt?chars_per_caption=32

In the above example, the API will make sure each caption has no more than 32 characters.
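
As a cURL request, that might look like this (the transcript ID is a placeholder):

# Replace YOUR-TRANSCRIPT-ID with the id of a completed transcript.
curl --request GET \
  --url 'https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID/srt?chars_per_caption=32' \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN'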

Exporting Paragraphs and Sentences

Heads up

The transcript must be completed before using these API endpoints.

You can use either of the following endpoints to retrieve a completed transcript automatically broken down into paragraphs or sentences. Using these endpoints, the API will attempt to semantically segment your transcript into paragraphs/sentences to create more reader-friendly transcripts.

/v2/transcript/{TRANSCRIPT-ID}/sentences
/v2/transcript/{TRANSCRIPT-ID}/paragraphs

The JSON response for these endpoints is shown on the right.
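
For example, to fetch a transcript broken into paragraphs, a request might look like this (the transcript ID is a placeholder):

# Replace YOUR-TRANSCRIPT-ID with the id of a completed transcript.
curl --request GET \
  --url https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID/paragraphs \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN'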

Profanity Filtering

By default, the API will return a verbatim transcription of the audio, meaning profanity will be present in the transcript if spoken in the audio.

To replace profanity with asterisks, as shown below, include the filter_profanity parameter in your request when submitting files for transcription, and set it to true.

It was some tough s*** that they had to go through. But they did it. I mean, it blows my f****** mind every time I hear the story.

The JSON for your completed transcript will come back as usual, but the text will contain asterisks wherever profanity was spoken.
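
A minimal request with Profanity Filtering enabled might look like this (audio_url is a placeholder):

# audio_url is a placeholder - replace it with a link to your own audio file.
curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://example.com/audio.mp3", "filter_profanity": true}'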

Listing Historical Transcripts

With the list endpoint, you can retrieve a list of all the transcripts you have created. This list can also be filtered by the transcript status.

How to Query the List Endpoint

Make a GET request, as shown in the example below, with the following query parameters in your request. In that cURL statement, for example, we are querying for the most recent 200 transcripts with the status of completed.

  • limit: Maximum number of results to return in a single response. Must be between 1 and 200 (defaults to 10). Optional.
  • status: Filter by transcript status. Must be "queued", "processing", "completed", or "error". Optional.
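
For example:

curl --request GET \
  --url 'https://api.assemblyai.com/v2/transcript?limit=200&status=completed' \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN'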

The API response will contain two top-level keys: transcripts and page_details.

The transcripts key will contain an array of objects (your list of transcripts), with each object containing the following information:

  • id: The ID of the transcript
  • resource_url: Make a GET request to this URL to get the complete information for this transcript
  • status: The current status of the transcript
  • created: The date and time the transcript was created
  • completed: The date and time the transcript finished processing
  • audio_url: The audio_url that was submitted in your initial POST request when creating the transcript

Heads up

Once you have deleted a transcript using our DELETE endpoint, the audio_url will no longer be available via the historical endpoint. The audio_url key will show "deleted by user".

Since the API only returns a maximum of 200 transcripts per response, it treats each response as a "page" of results. The page_details key will give you information about the current "page" you are on, and how to navigate to the next "page" of results.

Paginate Through Multiple Lists of Transcripts

To navigate to the next "page" of results, grab the value of prev_url from the page_details object in your initial GET request. You can then make the same API call as before, replacing the endpoint with the value of prev_url. You can continue to do this until prev_url is null, meaning you have pulled all of your transcripts from the API.

Transcripts are listed from newest to oldest, so prev_url will always point to the prior "page" of older transcripts.

Here is the cURL request from earlier, for example:

curl --request GET \
  --url 'https://api.assemblyai.com/v2/transcript?limit=200&status=completed' \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN'

Once we have the response, we can make a subsequent request below using the value of prev_url to get the next "page" of results:

curl --request GET \
  --url 'https://api.assemblyai.com/v2/transcript?limit=200&status=completed&before_id=8w5chxgaz-dcf5-4647-8cb4-cdfeaccdaa7d' \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN'

You can continue to do this until the value of prev_url is null, meaning you have successfully retrieved all transcripts in your account!

Advanced Filter Parameters

When making a GET request to list transcripts, you can include any of the following parameters with your GET request to further filter the results you get back.

  • limit: Maximum number of results to return in a single response. Must be between 1 and 200, inclusive (defaults to 10). Optional.
  • status: Filter by transcript status. Must be queued, processing, completed, or error. Optional.
  • created_on: Only return transcripts created on this date. Format: YYYY-MM-DD. Optional.
  • before_id: Return transcripts that were created before this ID. Must be a valid transcript ID. Optional.
  • after_id: Return transcripts that were created after this ID. Must be a valid transcript ID. Optional.
  • throttled_only: Only return throttled transcripts; overrides the status filter. Boolean: true or false. Optional.

Deleting Transcripts From the API

By default, AssemblyAI never stores a copy of the files you submit to the API for transcription. The transcription, however, is stored in our database, encrypted at rest, so that we can serve it to you and your application.

If you'd like to permanently delete the transcription from our database once you've retrieved it, you can do so by making a DELETE request to the API as shown below.
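
A sketch of that request might look like this, assuming the same /v2/transcript/{TRANSCRIPT-ID} path used elsewhere in this document (the transcript ID is a placeholder):

# Replace YOUR-TRANSCRIPT-ID with the id of the completed transcript you want to delete.
curl --request DELETE \
  --url https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID \
  --header 'authorization: YOUR-ASSEMBLYAI-TOKEN'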

Heads up

Transcripts can only be deleted after the transcription is completed!

Once a transcript is deleted, all of the sensitive information will be deleted, but certain metadata, like the transcript id and audio_duration, will remain for billing purposes.