Audio Intelligence


PII Redaction

With PII Redaction, the API can automatically remove Personally Identifiable Information (PII), such as phone numbers and social security numbers, from the transcription text before it is returned to you.

All redacted text will be replaced with # characters. For example, if the phone number 111-2222 was spoken in the audio, it would be transcribed as ###-#### in the text.

Control Which Types of PII to Redact

To fit PII Redaction to your use case and data, you can select from a set of redaction policies. Include any or all of the policy names below in the redact_pii_policies array when making your POST request.

Note

The redact_pii_policies parameter is required and must contain at least one policy name from the list below.

Policy Name Description
medical_process Medical process, including treatments, procedures, and tests (e.g., heart surgery, CT scan)
medical_condition Name of a medical condition, disease, syndrome, deficit, or disorder (e.g., chronic fatigue syndrome, arrhythmia, depression)
blood_type Blood type (e.g., O-, AB positive)
drug Medications, vitamins, or supplements (e.g., Advil, Acetaminophen, Panadol)
injury Bodily injury (e.g., I broke my arm, I have a sprained wrist)
number_sequence A "lazy" rule that will redact any sequence of two or more numbers
email_address Email address (e.g., support@assemblyai.com)
date_of_birth Date of Birth (e.g., Date of Birth: March 7, 1961)
phone_number Telephone or fax number
us_social_security_number Social Security Number or equivalent
credit_card_number Credit card number
credit_card_expiration Expiration date of a credit card
credit_card_cvv Credit card verification code (e.g., CVV: 080)
date Specific calendar date (e.g., December 18)
nationality Terms indicating nationality, ethnicity, or race (e.g., American, Asian, Caucasian)
event Name of an event or holiday (e.g., Olympics, Yom Kippur)
language Name of a natural language (e.g., Spanish, French)
location Any Location reference including mailing address, postal code, city, state, province, or country
money_amount Name and/or amount of currency (e.g., 15 pesos, $94.50)
person_name Name of a person (e.g., Bob, Doug Jones)
person_age Number associated with an age (e.g., 27, 75)
organization Name of an organization (e.g., CNN, McDonalds, University of Alaska)
political_affiliation Terms referring to a political party, movement, or ideology (e.g., Republican, Liberal)
occupation Job title or profession (e.g., professor, actors, engineer, CPA)
religion Terms indicating religious affiliation (e.g., Hindu, Catholic)
drivers_license Driver’s license number (e.g., DL# 356933-540)
banking_information Banking information, including account and routing numbers
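
As a rough illustration, a request body that selects redaction policies could be assembled like this. Only redact_pii_policies and the policy names come from the table above (a subset is listed for brevity). The audio_url field, the redact_pii flag, and the helper name build_pii_request are assumptions that this section does not state.

```python
import json

# Subset of the policy names from the table above; extend as needed.
KNOWN_POLICIES = {
    "phone_number",
    "us_social_security_number",
    "credit_card_number",
    "email_address",
    "person_name",
    "date_of_birth",
}


def build_pii_request(audio_url, policies):
    """Return a JSON-serializable body with PII Redaction policies set.

    `audio_url` and `redact_pii` are assumed field names (not given in
    this section); `redact_pii_policies` is the documented parameter.
    """
    if not policies:
        raise ValueError("redact_pii_policies needs at least one policy name")
    unknown = sorted(set(policies) - KNOWN_POLICIES)
    if unknown:
        raise ValueError(f"unknown policy names: {unknown}")
    return {
        "audio_url": audio_url,   # assumed field name
        "redact_pii": True,       # assumed enabling flag
        "redact_pii_policies": list(policies),
    }


body = build_pii_request(
    "https://example.com/call.mp3",  # placeholder URL
    ["phone_number", "us_social_security_number"],
)
print(json.dumps(body, indent=2))
```

Checking the policy names locally catches typos before the request is sent.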

Customize How Redacted PII is Transcribed

By default, any PII that is spoken will be transcribed with a hash #. For example, the credit card number 1111-2222-3333-4444 will be transcribed as ####-####-####-####.

By including the redact_pii_sub parameter in your POST request, you can customize how the PII is replaced.

Here are the options for the redact_pii_sub parameter:

Value Description
hash PII that is detected is replaced with a hash (#). For example, in I'm calling for John, the name John is replaced with ####. (Applied by default)
entity_name PII that is detected is replaced with the associated policy name. For example, John is replaced with [PERSON_NAME]. This is recommended for readability.
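
To make the difference between the two options concrete, the sketch below imitates the documented output format on a local string. It is only an illustration of what the transcript text looks like under each option, not how the API detects or redacts PII. The function name substitute_pii and its arguments are hypothetical.

```python
import re


def substitute_pii(text, span, policy, mode="hash"):
    """Replace the first occurrence of `span` in `text` in the style of
    each redact_pii_sub option described in the table above."""
    if mode == "hash":
        # Letters and digits become '#'; separators such as '-' are kept,
        # matching the ###-#### example earlier in this section.
        replacement = re.sub(r"[A-Za-z0-9]", "#", span)
    elif mode == "entity_name":
        replacement = f"[{policy.upper()}]"
    else:
        raise ValueError("mode must be 'hash' or 'entity_name'")
    return text.replace(span, replacement, 1)


sentence = "I'm calling for John"
print(substitute_pii(sentence, "John", "person_name"))
# I'm calling for ####
print(substitute_pii(sentence, "John", "person_name", mode="entity_name"))
# I'm calling for [PERSON_NAME]
```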

Create a PII Redacted Audio File

When you transcribe a file with PII Redaction, the API can optionally generate a version of your original audio file with the PII "beeped" out where it is spoken. To do so, include the redact_pii_audio parameter in your POST request when submitting files for transcription.

Get the Redacted Audio File

You can make a GET request to the API endpoint below to retrieve a URL that points to your redacted audio file:

https://api.assemblyai.com/v2/transcript/<your transcript id>/redacted-audio

In the JSON response, you'll see a redacted_audio_url key. This key contains a URL that points to your redacted audio file.

Please note that the redacted_audio_url link is only accessible for 30 minutes. You'll want to download the redacted audio file from the URL, and save a copy on your end.

Alternative Status Codes and Responses

Sometimes, you may not receive a 200 status code from the below API endpoint. In the table below, we outline the non-200 status codes you may receive, and what they mean.

https://api.assemblyai.com/v2/transcript/<your transcript id>/redacted-audio
Response Code Description
202 A 202 status code will be returned if audio redaction is still in progress. Depending on the length of the file, it can take several minutes after the audio file finishes transcribing for the redacted audio file to be created.
400 A 400 will be returned if something is wrong with your request or if the redacted audio file no longer exists on our servers.
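
One way to handle these status codes in client code is sketched below. The function takes a status code and a parsed JSON body, so it works with any HTTP library. The function name, the returned tuple shape, and the "error" key in the 400 body are assumptions; this section does not show what a 400 response contains.

```python
def interpret_redacted_audio_response(status_code, body):
    """Map a response from the redacted-audio endpoint to an action.

    Returns a (state, value) tuple:
      ("ready", url)     for a 200 carrying redacted_audio_url
      ("pending", None)  for a 202 while redaction is still in progress
      ("failed", detail) for a 400
    """
    if status_code == 200:
        return ("ready", body["redacted_audio_url"])
    if status_code == 202:
        return ("pending", None)
    if status_code == 400:
        # Assumed key: the section does not show the 400 response body.
        detail = body.get("error", "bad request or file no longer exists")
        return ("failed", detail)
    raise ValueError(f"unexpected status code: {status_code}")


print(interpret_redacted_audio_response(202, {}))
print(interpret_redacted_audio_response(
    200, {"redacted_audio_url": "https://link-to-redacted-audio"}))
```

On "pending", wait and repeat the GET request. On "ready", download the file promptly, since the link is only accessible for 30 minutes.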

Receive a Webhook

If a webhook_url was provided in your POST request when submitting your audio file for transcription, we will send a POST to your webhook_url when the redacted audio is ready. The POST request headers and JSON body will look like this:

headers
---
content-length: 79
accept-encoding: gzip, deflate
accept: */*
user-agent: python-requests/2.21.0
content-type: application/json

params
--
status: 'redacted_audio_ready'
redacted_audio_url: 'https://link-to-redacted-audio'
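
A webhook receiver could pick the URL out of that JSON body along these lines. This is a minimal sketch that only parses the payload; the function name is hypothetical, and how the raw body reaches it depends on your web framework.

```python
import json


def handle_redaction_webhook(raw_body):
    """Return the redacted audio URL from a webhook POST body, or None
    if the notification has a different status."""
    payload = json.loads(raw_body)
    if payload.get("status") != "redacted_audio_ready":
        return None
    return payload["redacted_audio_url"]


sample = (
    b'{"status": "redacted_audio_ready", '
    b'"redacted_audio_url": "https://link-to-redacted-audio"}'
)
print(handle_redaction_webhook(sample))
```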

Detect Important Phrases and Words

With Automatic Transcript Highlights, the AssemblyAI API can automatically detect important phrases and words in your transcription text.

For example, consider the following text:

We smirk because we believe that synthetic happiness is not of the same quality as what we might call natural happiness. What are these terms? Natural happiness is what we get when we get what we wanted. And synthetic happiness is what we make when we don't get what we wanted. And in our society...

Automatic Transcript Highlights will automatically detect the following key phrases/words in the text:

"synthetic happiness"
"natural happiness"
...

To enable this feature, include the auto_highlights parameter in your POST request when submitting files for transcription, and set this parameter to true.

Heads up

Your files can take up to 60 seconds longer to process when Automatic Transcript Highlights is enabled.

Once your transcription is complete, and you GET the result, you'll see an auto_highlights_result key in the JSON response.

The auto_highlights_result key in the JSON response will contain the key phrases/words the API found in your transcription text. Here is a close-up of only that key's response, and what each value means:

Response Key Description
status Will be either "success", or "unavailable" in the rare case that the Automatic Transcript Highlights model failed
results A list of all the highlights found in your transcription text
results.text The phrase/word itself that was detected
results.count How many times this phrase occurred in the text
results.rank The relevancy of this phrase - the higher the score, the better
results.timestamps A list of all the timestamps, in milliseconds, in the audio where each phrase/word is spoken
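
As an example of consuming this key, the sketch below ranks the detected phrases by their rank value. The sample values are made up for illustration, and the start/end shape of each timestamp entry is an assumption, since the table above only says the timestamps are given in milliseconds.

```python
def top_highlights(auto_highlights_result, limit=5):
    """Return up to `limit` (text, count, rank) tuples, highest rank first."""
    if auto_highlights_result.get("status") != "success":
        return []  # "unavailable": the model failed, nothing to rank
    ranked = sorted(
        auto_highlights_result["results"],
        key=lambda item: item["rank"],
        reverse=True,
    )
    return [(item["text"], item["count"], item["rank"]) for item in ranked[:limit]]


# Hypothetical response fragment with made-up numbers.
sample_highlights = {
    "status": "success",
    "results": [
        {"text": "natural happiness", "count": 2, "rank": 0.06,
         "timestamps": [{"start": 6100, "end": 7200}]},   # assumed shape
        {"text": "synthetic happiness", "count": 2, "rank": 0.08,
         "timestamps": [{"start": 1200, "end": 2500}]},   # assumed shape
    ],
}

for text, count, rank in top_highlights(sample_highlights):
    print(f"{rank:.2f}  {text} (x{count})")
```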

Content Moderation

With Content Safety Detection, AssemblyAI can detect if any of the following sensitive content is spoken in your audio/video files, and pinpoint exactly when and what was spoken:

Label | Description | Model Output | Supported by Severity Scores
Accidents | Any man-made incident that happens unexpectedly and results in damage, injury, or death. | accidents | Yes
Alcohol | Content that discusses any alcoholic beverage or its consumption. | alcohol | Yes
Company Financials | Content that discusses any sensitive company financial information. | financials | No
Crime Violence | Content that discusses any type of criminal activity or extreme violence that is criminal in nature. | crime_violence | Yes
Drugs | Content that discusses illegal drugs or their usage. | drugs | Yes
Gambling | Includes gambling on casino-based games such as poker, slots, etc. as well as sports betting. | gambling | Yes
Hate Speech | Content that is a direct attack against people or groups based on their sexual orientation, gender identity, race, religion, ethnicity, national origin, disability, etc. | hate_speech | Yes
Health Issues | Content that discusses any medical or health-related problems. | health_issues | Yes
Manga | Mangas are comics or graphic novels originating from Japan with some of the more popular series being "Pokemon", "Naruto", "Dragon Ball Z", "One Punch Man", and "Sailor Moon". | manga | No
Marijuana | This category includes content that discusses marijuana or its usage. | marijuana | Yes
Natural Disasters | Phenomena that happen infrequently and result in damage, injury, or death, such as hurricanes, tornadoes, earthquakes, volcano eruptions, and firestorms. | disasters | Yes
Negative News | News content with a negative sentiment which typically will occur in the third person as an unbiased recapping of events. | negative_news | No
NSFW (Adult Content) | Content considered "Not Safe for Work", meaning content that a viewer would not want heard or seen in a public environment. | nsfw | No
Pornography | Content that discusses any sexual content or material. | pornography | Yes
Profanity | Any profanity or cursing. | profanity | Yes
Sensitive Social Issues | This category includes content that may be considered insensitive, irresponsible, or harmful to certain groups based on their beliefs, political affiliation, sexual orientation, or gender identity. | sensitive_social_issues | No
Terrorism | Includes terrorist acts as well as terrorist groups. Examples include bombings, mass shootings, and ISIS. Note that many texts corresponding to this topic may also be classified into the crime violence topic. | terrorism | Yes
Tobacco | Text that discusses tobacco and tobacco usage, including e-cigarettes, nicotine, vaping, and general discussions about smoking. | tobacco | Yes
Weapons | Text that discusses any type of weapon including guns, ammunition, shooting, knives, missiles, torpedoes, etc. | weapons | Yes

Include the content_safety parameter in your POST request when submitting audio files for transcription, and set this parameter to true.

Interpreting Content Safety Detection Results

Once the transcription is complete, and you get the result, there will be an additional key content_safety_labels in the JSON response. Below, we'll drill into the data that is returned in the content_safety_labels key.

Response Key Description
status Will be either "success", or "unavailable" in the rare case that the Content Safety Detection model failed
results A list of all the spoken audio the Content Safety Detection model flagged
results.text The text transcription of what was spoken that triggered the Content Safety Detection Model
results.labels A list of labels the Content Safety Detection model predicted for the flagged content, as well as the confidence and severity of each label. The confidence score ranges between 0 and 1, and is how confident the model was in the label it predicted. The severity score also ranges between 0 and 1, and indicates how severe the flagged content is, with 1 being most severe.
results.timestamp The start and end time, in milliseconds, for where the flagged content was spoken in the audio
summary The summary key provides the confidence of the most common labels in relation to the entire audio file.
severity_score_summary The severity_score_summary key provides the overall severity of the most common labels in relation to the entire audio file.

Understanding Severity Scores and Confidence Scores

Each label will be returned with a confidence score and a severity score. It is important to note that these two keys measure two very different things. The severity key will produce a score that shows how severe the flagged content is on a scale of 0–1. For example, a natural disaster with mass casualties would be a 1, whereas a wind storm that broke a lamppost would be a 0.1.

In comparison, confidence displays how confident the model was in predicting the label it predicted, also on a scale of 0-1.

We can break this down further by reviewing the following label:

"labels": [
    {
        "label": "health_issues",
        "confidence": 0.8225132822990417,
        "severity": 0.15090347826480865
    }
],

In the above example, the Content Safety model is 82.25% confident that the spoken content is about Health Issues; however, it is measured at a low severity of 0.1509. This means the model is very confident the content is about health issues, but the content was not severe in nature (i.e., it was likely about a minor health issue).
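
Because the two scores measure different things, client code will often check them separately. The sketch below keeps only labels that clear both a confidence and a severity threshold. The function name, the default thresholds, the sample sentence, and the handling of labels that have no severity value are assumptions made for illustration.

```python
def severe_labels(content_safety_labels, min_confidence=0.5, min_severity=0.5):
    """Return (text, label, confidence, severity) tuples for labels that
    meet both thresholds."""
    if content_safety_labels.get("status") != "success":
        return []
    hits = []
    for segment in content_safety_labels["results"]:
        for item in segment["labels"]:
            # Assumption: labels not supported by severity scores carry no
            # usable severity value, so they are treated as 0.0 here.
            severity = item.get("severity") or 0.0
            if item["confidence"] >= min_confidence and severity >= min_severity:
                hits.append(
                    (segment["text"], item["label"], item["confidence"], severity)
                )
    return hits


# Hypothetical segment built around the health_issues label shown above.
sample_safety = {
    "status": "success",
    "results": [
        {
            "text": "I had a mild headache this morning.",
            "labels": [
                {
                    "label": "health_issues",
                    "confidence": 0.8225132822990417,
                    "severity": 0.15090347826480865,
                }
            ],
        }
    ],
}

print(severe_labels(sample_safety))                    # confident, but not severe
print(severe_labels(sample_safety, min_severity=0.1))  # passes a lower bar
```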

Understanding the Severity Score Summary

The severity_score_summary key lists each label that was detected along with low, medium, and high keys.

"severity_score_summary": {
    "health_issues": {
        "low": 0.7210625030587972,
        "medium": 0.2789374969412028,
        "high": 0.0
    }
}

The values of the low, medium, and high keys reflect the API's confidence that the label is "low," "medium," or "high" in severity throughout the entire audio file. This score is based on a combination of the length of the audio file, the frequency of low/medium/high severity tags throughout the file, and the confidence score for each of those occurrences.
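
For a quick per-file verdict, you could reduce each label to its most likely severity bucket. This is a small sketch using the severity_score_summary shown above; the function name is hypothetical.

```python
def dominant_severity(severity_score_summary):
    """Map each label to the bucket (low/medium/high) with the highest score."""
    return {
        label: max(scores, key=scores.get)
        for label, scores in severity_score_summary.items()
    }


summary = {
    "health_issues": {
        "low": 0.7210625030587972,
        "medium": 0.2789374969412028,
        "high": 0.0,
    }
}
print(dominant_severity(summary))
```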

Controlling the Threshold of Surfaced Results

By default, the content safety model will return any label with a confidence of 50% or greater. If you wish to set a higher or lower threshold, add the content_safety_confidence: {number} parameter to your POST request. This parameter will accept an integer value between 25 and 100, inclusive.
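
A small guard like the following can validate the threshold before it is added to a request body. The helper name is hypothetical; the parameter names and the 25 to 100 range come from the paragraph above.

```python
def with_confidence_threshold(body, threshold):
    """Return a copy of `body` with Content Safety Detection enabled and
    the confidence threshold set (integer from 25 to 100, inclusive)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, int):
        raise TypeError("content_safety_confidence must be an integer")
    if not 25 <= threshold <= 100:
        raise ValueError("content_safety_confidence must be between 25 and 100")
    return {**body, "content_safety": True, "content_safety_confidence": threshold}


print(with_confidence_threshold({}, 75))
```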


Topic Detection (IAB Classification)

With Topic Detection, AssemblyAI can label the topics that are spoken in your audio/video files. The predicted topic labels follow the standardized IAB Taxonomy, which makes them suitable for Contextual Targeting use cases. The below table shows the 698 potential topics the API can predict.

To use our Topic Detection feature, include the iab_categories parameter in your POST request when submitting audio files for transcription, and set this parameter to true.

Once the transcription is complete, and you get the transcription result, there will be an additional key iab_categories_result in the JSON response. Below, we drill into that key and what data it includes.

Response Key Description
status Will be either "success", or "unavailable" in the rare case that the Topic Detection model failed
results The list of topics that were predicted for the audio file, including the text that influenced each topic label prediction, and other metadata about relevancy and timestamps
results.text The transcription text for the portion of audio that was classified with topic labels
results.timestamp The start and end time, in milliseconds, for where the portion of text in results.text was spoken in the audio file
results.labels The list of labels that were predicted for this portion of text. The relevance key gives a score between 0 and 1.0 for how relevant each label is for the portion of text.
summary The twenty topic labels from the results array with the highest relevancy score across the entire audio file. For example, if the Science>Environment label is detected only once in a 60-minute audio file, the summary key will show a low relevancy score for that label, since the entire transcription was not found to consistently be about Science>Environment.
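
The sketch below pulls the strongest topics out of the summary key. It assumes that summary is a mapping from label to relevancy score, which this section does not spell out. The function name, the sample scores, and the second label are made up for illustration.

```python
def top_topics(iab_categories_result, n=3):
    """Return the `n` (label, score) pairs with the highest relevancy."""
    if iab_categories_result.get("status") != "success":
        return []
    # Assumed shape: {"<label>": <relevancy between 0 and 1>, ...}
    summary = iab_categories_result["summary"]
    return sorted(summary.items(), key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical response fragment with made-up scores.
sample_topics = {
    "status": "success",
    "summary": {
        "Science>Environment": 0.04,
        "Example>PlaceholderTopic": 0.91,  # illustrative label, not a real one
    },
}
print(top_topics(sample_topics, n=1))
```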

Sentiment Analysis

With Sentiment Analysis, AssemblyAI can detect the sentiment of each sentence of speech spoken in your audio files. Sentiment Analysis returns a result of POSITIVE, NEGATIVE, or NEUTRAL for each sentence in the transcript.

To include Sentiment Analysis in your transcript results, add the sentiment_analysis parameter in your POST request when submitting files for transcription and set this parameter to true.

Once the transcription is complete, and you get the result, there will be an additional key sentiment_analysis_results in the JSON response.

For each sentence in the transcription text, the API will return the sentiment, confidence score, the start and end time for when that sentence was spoken, and, if applicable, the speaker label for that sentence. A detailed explanation of each key in the list of objects returned in the sentiment_analysis_results array can be found in the below table.

Key Value
text The transcription text of the sentence being analyzed
start Starting timestamp (in milliseconds) of the text in the transcript
end Ending timestamp (in milliseconds) of the text in the transcript
sentiment The detected sentiment POSITIVE, NEGATIVE, or NEUTRAL
confidence Confidence score for the detected sentiment
speaker If using dual_channel or speaker_labels, then the speaker that spoke this sentence
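
As one example of working with this array, the sketch below counts sentiments per speaker. The sample sentences and values are hypothetical, and treating a missing speaker as "unknown" is a choice made here, not API behavior.

```python
from collections import Counter, defaultdict


def sentiment_by_speaker(sentiment_analysis_results):
    """Count POSITIVE/NEGATIVE/NEUTRAL sentences for each speaker."""
    counts = defaultdict(Counter)
    for item in sentiment_analysis_results:
        # speaker is only present with dual_channel or speaker_labels.
        speaker = item.get("speaker") or "unknown"
        counts[speaker][item["sentiment"]] += 1
    return {speaker: dict(counter) for speaker, counter in counts.items()}


# Hypothetical results with made-up values.
sample_sentiment = [
    {"text": "Thanks, that was really helpful.", "start": 0, "end": 1800,
     "sentiment": "POSITIVE", "confidence": 0.93, "speaker": "A"},
    {"text": "The order number is on the receipt.", "start": 1900, "end": 4100,
     "sentiment": "NEUTRAL", "confidence": 0.88, "speaker": "B"},
    {"text": "I waited far too long for this.", "start": 4200, "end": 6300,
     "sentiment": "NEGATIVE", "confidence": 0.81, "speaker": "A"},
]
print(sentiment_by_speaker(sample_sentiment))
```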

Summarization

With Summarization, AssemblyAI can generate a single abstractive summary of entire audio files submitted for transcription.

When submitting a file for transcription, include the summarization parameter in your POST request, and set this to true. By default, Summarization produces a bullet-point style summary using our Informative model. Optionally, you can customize the summarization to best fit your use case using the summary_model and summary_type parameters.

The summary_model parameter specifies which Summarization model will be used. The summary_type parameter determines the type of summarization results you will receive.

The table below shows more information on the options available for the summary_model parameter:

Summary Model | Recommended Use Cases | Supported Summary Types | Required Parameters
informative (default) | Best for files with a single speaker, such as presentations or lectures | bullets, bullets_verbose, headline, or paragraph | punctuate and format_text set to true
conversational | Best for any two-person conversation, such as customer/agent or interviewer/interviewee calls | bullets, bullets_verbose, headline, or paragraph | punctuate, format_text, and speaker_labels set to true
catchy | Best for creating video, podcast, or media titles | headline or gist | punctuate and format_text set to true

The table below shows more information on the options available for the summary_type parameter:

Summary Type Description Example
bullets (default) A bulleted summary with the most important points. - The human brain has nearly tripled in mass in two million years.\n- One of the main reasons that our brain got so big is because it got a new part, called the frontal lobe.\n
bullets_verbose A longer bullet point list summarizing the entire transcription text. - Dan Gilbert is a psychologist and a happiness expert. His talk is recorded live at Ted conference. He explains why the human brain has nearly tripled in size in 2 million years. He also explains the difference between winning the lottery and becoming a paraplegic.\n- In 1994, Pete Best said he's happier than he would have been with the Beatles. In the free choice paradigm, monet prints are ranked from the one they like the most to the one that they don't. People prefer the third one over the fourth one because it's a little better.\n- People synthesize happiness when they change their affective. Hedonic aesthetic reactions to a poster. The ability to make up your mind and change your mind is the friend of natural happiness. But it's the enemy of synthetic happiness. The psychological immune system works best when we are stuck. This is the difference between dating and marriage. People don't know this about themselves and it can work to their disadvantage.\n- In a photography course at Harvard, 66% of students choose not to take the course where they have the opportunity to change their mind. Adam Smith said that some things are better than others. Dan Gilbert recorded at Ted, 2004 in Monterey, California, 2004.
gist A few words summarizing the entire transcription text. A big brain
headline A single sentence summarizing the entire transcription text. The human brain has nearly tripled in mass in two million years.
paragraph A single paragraph summarizing the entire transcription text. The human brain has nearly tripled in mass in two million years. It went from the one-and-a-quarter-pound brain of our ancestor, habilis, to the almost three-pound meatloaf everybody here has between their ears.

When the transcription finishes, you will see a summary key in the JSON response containing the summary of your submitted audio/video file in the format (summary_type) you requested.
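
The model and type combinations from the two tables above can be checked before a request is sent. The sketch below encodes those tables and sets the required parameters for the chosen model. The audio_url field and the helper name are assumptions not stated in this section.

```python
# Supported summary types and required parameters per model,
# transcribed from the summary_model table above.
SUMMARY_MODELS = {
    "informative": {
        "types": {"bullets", "bullets_verbose", "headline", "paragraph"},
        "required": ("punctuate", "format_text"),
    },
    "conversational": {
        "types": {"bullets", "bullets_verbose", "headline", "paragraph"},
        "required": ("punctuate", "format_text", "speaker_labels"),
    },
    "catchy": {
        "types": {"headline", "gist"},
        "required": ("punctuate", "format_text"),
    },
}


def build_summary_request(audio_url, summary_model="informative",
                          summary_type="bullets"):
    """Return a request body for Summarization, or raise ValueError if
    the model does not support the requested summary type."""
    spec = SUMMARY_MODELS.get(summary_model)
    if spec is None:
        raise ValueError(f"unknown summary_model: {summary_model}")
    if summary_type not in spec["types"]:
        raise ValueError(
            f"{summary_model} does not support summary_type {summary_type}"
        )
    body = {
        "audio_url": audio_url,  # assumed field name
        "summarization": True,
        "summary_model": summary_model,
        "summary_type": summary_type,
    }
    for name in spec["required"]:
        body[name] = True
    return body


print(build_summary_request(
    "https://example.com/interview.mp3",  # placeholder URL
    "conversational", "paragraph"))
```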

Learn more

For more information on how to use our Summarization models, check out the announcement on our blog.


Auto Chapters

Auto Chapters provides a "summary over time" for audio files transcribed with AssemblyAI. It works by first segmenting your audio files into logical "chapters" as the topic of conversation changes, and then provides an automatically generated summary for each "chapter" of content. For more information on Auto Chapters, check out this announcement from our blog.

When submitting a file for transcription, include the auto_chapters parameter in your POST request, and set this to true.

Note

Punctuation must be enabled in your request to use the Auto Chapters feature. If you attempt to use Auto Chapters with punctuate set to false, you will receive the following error: "error": "punctuate must be enabled when auto_chapters is enabled."

When your transcription is completed, you'll see a chapters key in the JSON response. For each chapter that was detected, the API will include the start and end timestamps (in milliseconds); a summary, which is a few sentences describing the content spoken during that timeframe; a short headline, which can be thought of as a "summary of the summary"; and a gist, which is an ultra-short summary of the chapter in just a few words.

Key Value
start Starting timestamp (in milliseconds) of the portion of audio being summarized
end Ending timestamp (in milliseconds) of the portion of audio being summarized
summary A one paragraph summary of the content spoken during this timeframe
headline A single sentence summary of the content spoken during this timeframe
gist An ultra-short summary, just a few words, of the content spoken during this timeframe
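
For example, the chapters list could be turned into a simple table of contents by converting the millisecond timestamps to minutes and seconds. The function names and the sample chapters below are hypothetical.

```python
def format_ms(milliseconds):
    """Format a millisecond offset as MM:SS (fractions of a second dropped)."""
    minutes, seconds = divmod(milliseconds // 1000, 60)
    return f"{minutes:02d}:{seconds:02d}"


def chapter_outline(chapters):
    """Return one 'start-end  headline' line per chapter."""
    return [
        f"{format_ms(chapter['start'])}-{format_ms(chapter['end'])}  {chapter['headline']}"
        for chapter in chapters
    ]


# Hypothetical chapters with made-up values.
sample_chapters = [
    {"start": 0, "end": 95000, "gist": "Opening",
     "headline": "The speaker introduces the topic.",
     "summary": "A short introduction to the talk."},
    {"start": 95000, "end": 250500, "gist": "Main argument",
     "headline": "The speaker lays out the main argument.",
     "summary": "The central claims are explained with examples."},
]
for line in chapter_outline(sample_chapters):
    print(line)
```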

Entity Detection

With Entity Detection, you can identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.

To include Entity Detection in your transcript response, add the entity_detection parameter in your POST request when submitting audio files for transcription, and set this parameter to true.

When your transcription is complete, you will see an entities key in the JSON response. Below, we drill into the data that is returned within the list of results in the entities key.

Key Value
entity_type The entity type detected
text The text containing the entity
start Starting timestamp, in milliseconds, of the entity in the transcript
end Ending timestamp, in milliseconds, of the entity in the transcript
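
A common first step with this list is to group the detected text by entity type, as in the sketch below. The function name and the sample entities are hypothetical.

```python
from collections import defaultdict


def group_entities(entities):
    """Group entity text by entity_type, keeping transcript order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_type"]].append(entity["text"])
    return dict(grouped)


# Hypothetical entities with made-up values.
sample_entities = [
    {"entity_type": "person_name", "text": "Doug Jones", "start": 1200, "end": 2100},
    {"entity_type": "organization", "text": "CNN", "start": 5300, "end": 5800},
    {"entity_type": "person_name", "text": "Bob", "start": 9100, "end": 9400},
]
print(group_entities(sample_entities))
```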

Entity Types Detected

When Entity Detection is enabled, the entity types listed below are automatically detected and, if found in the transcription text, will be included in the entities key as shown above. They will be listed individually in the order that they appear in the transcript.

Entity Name Description
blood_type Blood type (e.g., O-, AB positive)
credit_card_cvv Credit card verification code (e.g., CVV: 080)
credit_card_expiration Expiration date of a credit card
credit_card_number Credit card number
date Specific calendar date (e.g., December 18)
date_of_birth Date of Birth (e.g., Date of Birth: March 7, 1961)
drug Medications, vitamins, or supplements (e.g., Advil, Acetaminophen, Panadol)
event Name of an event or holiday (e.g., Olympics, Yom Kippur)
email_address Email address (e.g., support@assemblyai.com)
injury Bodily injury (e.g., I broke my arm, I have a sprained wrist)
language Name of a natural language (e.g., Spanish, French)
location Any Location reference including mailing address, postal code, city, state, province, or country
medical_condition Name of a medical condition, disease, syndrome, deficit, or disorder (e.g., chronic fatigue syndrome, arrhythmia, depression)
medical_process Medical process, including treatments, procedures, and tests (e.g., heart surgery, CT scan)
money_amount Name and/or amount of currency (e.g., 15 pesos, $94.50)
nationality Terms indicating nationality, ethnicity, or race (e.g., American, Asian, Caucasian)
occupation Job title or profession (e.g., professor, actors, engineer, CPA)
organization Name of an organization (e.g., CNN, McDonalds, University of Alaska)
person_age Number associated with an age (e.g., 27, 75)
person_name Name of a person (e.g., Bob, Doug Jones)
phone_number Telephone or fax number
political_affiliation Terms referring to a political party, movement, or ideology (e.g., Republican, Liberal)
religion Terms indicating religious affiliation (e.g., Hindu, Catholic)
us_social_security_number Social Security Number or equivalent
drivers_license Driver’s license number (e.g., DL# 356933-540)
banking_information Banking information, including account and routing numbers