
What is Audio Intelligence?


In addition to our Core Transcription API, AssemblyAI offers a host of Audio Intelligence APIs such as Sentiment Analysis, Summarization, Entity Detection, PII Redaction, and more.

What is Audio Intelligence?

Built using the latest Deep Learning, Machine Learning, and NLP research, Audio Intelligence enables customers to quickly build high ROI features and applications on top of their audio data. For example, customers are using our Audio Intelligence APIs to power enterprise call center AI platforms, smarter ad targeting inside of audio and video, and content moderation at scale, to name a few use cases.

Together, Audio Intelligence APIs work as powerful building blocks for more useful analytics, smarter applications, and increased ROI.

AssemblyAI’s Audio Intelligence APIs

AssemblyAI currently offers seven Audio Intelligence APIs: Automatic Transcript Highlights, Topic Detection, Entity Detection, Auto Chapters (Summarization), Content Moderation, PII Redaction, and Sentiment Analysis.

Let’s take a closer look at each of these.

1. Automatic Transcript Highlights

The Automatic Transcript Highlights API automatically detects important keywords and phrases in your transcription text.

For example, in the text,

We smirk because we believe that synthetic happiness is not of the same 
quality as what we might call natural happiness. What are these terms? 
Natural happiness is what we get when we get what we wanted. And 
synthetic happiness is what we make when we don't get what we wanted. 
And in our society...

The Automatic Transcript Highlights API would flag the following as important:

"synthetic happiness"
"natural happiness"
...
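In practice, highlights are requested by adding a single flag when submitting the transcription job. Here is a minimal sketch of building such a request body; the `/v2/transcript` endpoint and the `auto_highlights` parameter name are assumptions about AssemblyAI's API, not details taken from this article:

```python
import json

# Hypothetical helper: builds the JSON body for a POST to
# https://api.assemblyai.com/v2/transcript (endpoint and parameter
# names are assumptions, not confirmed by this article).
def build_transcript_request(audio_url, **features):
    body = {"audio_url": audio_url}
    body.update(features)
    return json.dumps(body)

payload = build_transcript_request(
    "https://example.com/talk.mp3",  # hypothetical audio file URL
    auto_highlights=True,            # enable Automatic Transcript Highlights
)
print(payload)
```

The same pattern extends to the other Audio Intelligence features: each is enabled by an extra boolean flag in the request body.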
View Automatic Transcript Highlight Docs

2. Topic Detection

The Topic Detection API accurately predicts topics spoken in an audio or video file. We recently released our latest version of Topic Detection, v4, which boasts an 8.37% increase in relative accuracy over v3.

How does it work? Leveraging large NLP models, the API understands the context of what is being spoken across your audio files and uses this information to predict the topics being discussed. The predicted topic labels follow the standardized IAB Taxonomy, which gives the API 698 potential topics it can predict.

Let's look at the example below created using the AssemblyAI Topic Detection API.

Here is the transcription text:

In my mind, I was basically done with Robbie Ray. He had shown flashes 
in the past, particularly with the strike. It was just too inefficient 
walk too many guys and got hit too hard too.

And here are the Topic Detection results:

Sports>Baseball: 100%

The model knows that Robbie Ray is a pitcher for the Toronto Blue Jays and that the Toronto Blue Jays are a baseball team. Thus, it accurately concludes that the topic discussed is baseball.
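Because each IAB label is a ">"-separated path through the taxonomy, the topic hierarchy is easy to recover downstream. A short sketch, using scores that mirror the example above (the exact response field names are not shown in this article):

```python
# Each IAB label is a ">"-separated path through the taxonomy, so the
# hierarchy can be recovered with a simple split. The score below
# mirrors the Sports>Baseball example above.
topic_scores = {"Sports>Baseball": 1.0}

for label, relevance in sorted(topic_scores.items(), key=lambda kv: -kv[1]):
    path = label.split(">")  # e.g. ["Sports", "Baseball"]
    print(f"{' > '.join(path)}: {relevance:.0%}")
```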

View Topic Detection Docs

3. Entity Detection

The Entity Detection API identifies and then categorizes key information in a transcription text. For example, Washington, D.C. is an entity that is classified as a location.

Here's an example of what a transcription response looks like with the Entity Detection API enabled:

{
    "audio_duration": 1282,
    "confidence": 0.930096506561678,
    "id": "oris9w0oou-f581-4c2e-9e4e-383f91f7f14d",
    "status": "completed",
    "text": "Ted Talks are recorded live at Ted Conference...",
    "entities": [
        {
            "entity_type": "event",
            "text": "Ted Talks",
            "start": 8630,
            "end": 9146
        },
        {
            "entity_type": "event",
            "text": "Ted Conference",
            "start": 10104,
            "end": 10946
        },
        {
            "entity_type": "occupation",
            "text": "psychologist",
            "start": 12146,
            "end": 12782
        },
        ...
    ],
    ...
}

As you can see, the API is able to determine two entity types for the transcription text – event and occupation.
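Because every detected entity carries an `entity_type`, grouping the response is straightforward. A small sketch using the entities from the sample response above:

```python
from collections import defaultdict

# Group the entities from the sample response above by entity_type.
entities = [
    {"entity_type": "event", "text": "Ted Talks", "start": 8630, "end": 9146},
    {"entity_type": "event", "text": "Ted Conference", "start": 10104, "end": 10946},
    {"entity_type": "occupation", "text": "psychologist", "start": 12146, "end": 12782},
]

by_type = defaultdict(list)
for entity in entities:
    by_type[entity["entity_type"]].append(entity["text"])

print(dict(by_type))
# {'event': ['Ted Talks', 'Ted Conference'], 'occupation': ['psychologist']}
```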

There are currently 25 entity types that can be detected in a transcription, including the location, event, and occupation types mentioned above.

View Entity Detection Docs

4. Auto Chapters

The Auto Chapters, or Summarization, API provides a “summary over time” for a transcription text. Generating Auto Chapters is a two-step process. First, the API breaks an audio or video file into logical chapters, e.g., when the conversation naturally changes topics. Second, the API generates a short summary for each of the predetermined chapters.

Auto Chapters is especially useful for making long transcription texts more digestible.

Here's an example of what a transcription response looks like with the Auto Chapters API enabled:

{
    "audio_duration": 1282,
    "confidence": 0.930096506561678,
    "id": "oris9w0oou-f581-4c2e-9e4e-383f91f7f14d",
    "status": "completed",
    "text": "Ted Talks are recorded live at Ted Conference...",
    "chapters": [
        {
            "summary": "Ted talks are recorded live at ted conference. This episode features psychologist and happiness expert dan gilbert. Download the video @ ted.com here's dan gilbert.",
            "headline": "This episode features psychologist and happiness expert dan gilbert.",
            "start": 8630,
            "end": 21970,
            "gist": "live at ted conference"
        }
        ...
    ],
    ...
}   

Note that you will receive a summary, headline, and gist for each chapter, in addition to the start and end timestamps.
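Since each chapter carries millisecond start and end timestamps, turning the response into a clickable table of contents is a few lines of code. A sketch using the chapter from the sample response above:

```python
# Turn the chapters array above into a simple table of contents,
# converting the millisecond timestamps into m:ss form.
chapters = [
    {
        "headline": "This episode features psychologist and happiness expert dan gilbert.",
        "start": 8630,
        "end": 21970,
    },
]

def fmt_ms(ms):
    """Format a millisecond offset as m:ss."""
    seconds = ms // 1000
    return f"{seconds // 60}:{seconds % 60:02d}"

for chapter in chapters:
    print(f"[{fmt_ms(chapter['start'])} - {fmt_ms(chapter['end'])}] {chapter['headline']}")
# [0:08 - 0:21] This episode features psychologist and happiness expert dan gilbert.
```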

View Auto Chapters Docs

5. Content Moderation

The Content Moderation API automatically detects potentially sensitive or harmful content in an audio or video file.

Topics that can currently be flagged include health issues (seen in the example below), among others.

Here's an example of what a transcription response looks like with the Content Moderation API enabled:

{
    ...
    "text": "You're listening to Ted Talks Daily. I'm Elise Hume. Neuroscientist Lisa Genova says...",
    "id": "ori4dib4sx-1dec-4386-aeb2-0e65add27049",
    "status": "completed",
    "content_safety_labels": {
        "status": "success",
        "results": [
            {
                "text": "Yes, that's it. Why does that happen? By calling off the Hunt, your brain can stop persevering on the ugly sister, giving the correct set of neurons a chance to be activated. Tip of the tongue, especially blocking on a person's name, is totally normal. 25 year olds can experience several tip of the tongues a week, but young people don't sweat them, in part because old age, memory loss, and Alzheimer's are nowhere on their radars.",
                "labels": [
                    {
                        "label": "health_issues",
                        "confidence": 0.8225132822990417,
                        "severity": 0.15090347826480865
                    }
                ],
                "timestamp": {
                    "start": 358346,
                    "end": 389018
                }
            },
            ...
        ],
        "summary": {
            "health_issues": 0.8750781728032808
            ...
        },
        "severity_score_summary": {
            "health_issues": {
                "low": 0.7210625030587972,
                "medium": 0.2789374969412028,
                "high": 0.0
            }
        }
    },
    ...
}

The API will output the flagged transcription text, the predicted content label (in the above example, health_issues), and the accompanying timestamp. It will also determine confidence and severity scores for each flagged topic.
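A common pattern is to act only on labels whose summary confidence clears an application-specific threshold, then look at the dominant severity bucket. A sketch using the scores from the sample response above (the threshold value is an illustrative choice, not an API default):

```python
# Apply a confidence threshold to the summary scores above to decide
# which flagged topics warrant review. The threshold is illustrative.
summary = {"health_issues": 0.8750781728032808}
severity = {"health_issues": {"low": 0.72, "medium": 0.28, "high": 0.0}}

CONFIDENCE_THRESHOLD = 0.5

flagged = {}
for label, confidence in summary.items():
    if confidence >= CONFIDENCE_THRESHOLD:
        # Pick the severity bucket with the highest probability.
        dominant = max(severity[label], key=severity[label].get)
        flagged[label] = dominant
        print(f"{label}: confidence={confidence:.2f}, dominant severity={dominant}")
```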

View Content Moderation Docs

6. PII Redaction

The PII Redaction API identifies and removes (redacts) Personally Identifiable Information (PII) in a transcription text. When enabled, each piece of PII is replaced either with a # for every redacted character or with its entity name (for example, [PERSON_NAME] instead of John Smith).
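The two redaction styles can be illustrated with a toy substitution; the real API performs this server-side on the transcript, so the snippet below only mimics the shape of the output:

```python
# Toy illustration of the two redaction styles described above. The
# real API redacts server-side; this only mimics the output format.
text = "My name is John Smith."
pii = "John Smith"

entity_style = text.replace(pii, "[PERSON_NAME]")
hash_style = text.replace(pii, "#" * len(pii))

print(entity_style)  # My name is [PERSON_NAME].
print(hash_style)    # My name is ##########.
```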

PII that can be redacted includes person names, among many other types.

View PII Redaction Docs

7. Sentiment Analysis

The Sentiment Analysis API detects positive, negative, and neutral sentiments in speech segments in an audio or video file.

When using AssemblyAI’s Sentiment Analysis API, you will receive a predicted sentiment, time stamp, and confidence score for each sentence spoken.

Here's an example of what a transcription response looks like with the Sentiment Analysis API enabled:

{
    "id": "oris9w0oou-f581-4c2e-9e4e-383f91f7f14d",
    "status": "completed",
    "text": "Ted Talks are recorded live...",
    "words": [...],
    // sentiment analysis results are below
    "sentiment_analysis_results":[
        {
            "text": "Ted Talks are recorded live at Ted Conference.",
            "start": 8630,
            "end": 10946,
            "sentiment": "NEUTRAL",
            "confidence": 0.91366046667099,
            "speaker": null
         },
         {
            "text": "his episode features psychologist and happiness expert Dan Gilbert.",
            "start": 11018,
            "end": 15626,
            "sentiment": "POSITIVE",
            "confidence": 0.6465124487876892,
            "speaker": null
         },
         ...
    ],
    ...
}  
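Since the response includes one labeled entry per sentence, aggregating the overall tone of a file is a one-liner. A sketch using the results from the sample response above:

```python
from collections import Counter

# Tally sentence-level sentiments from the results array above.
results = [
    {"text": "Ted Talks are recorded live at Ted Conference.",
     "sentiment": "NEUTRAL", "confidence": 0.91366046667099},
    {"text": "This episode features psychologist and happiness expert Dan Gilbert.",
     "sentiment": "POSITIVE", "confidence": 0.6465124487876892},
]

sentiment_counts = Counter(r["sentiment"] for r in results)
print(sentiment_counts)  # Counter({'NEUTRAL': 1, 'POSITIVE': 1})
```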

View Sentiment Analysis Docs

What Can You Do With Audio Intelligence?

Forward-thinking businesses are leveraging AssemblyAI’s Audio Intelligence APIs to quickly build innovative features into their products and services that drive higher ROI and more value for end users.

For example, a marketing analytics SaaS solution uses Automatic Transcript Highlights and PII Redaction to help power its Conversational Intelligence software. With Audio Intelligence, the company can help its customers optimize marketing spend and increase ROI with more targeted ad placements, as well as charge more for this intelligent product.

A lead tracking and reporting company uses Audio Intelligence to help qualify its leads, identify quotable leads, and flag leads for follow-up, speeding up its qualification process and increasing conversion rates.

Podcast, video, and media companies use Topic Detection to facilitate smarter content recommendations and more strategically place advertisements on videos.

Medical professionals use Entity Detection to automatically identify important patient information such as names, conditions, drugs administered, injuries, and more, helping them sort information faster and then perform more intelligent analysis on the collected data.

Telephony companies use Sentiment Analysis to label sentiments in customer-agent conversations, identify trends, analyze behavior, and improve customer service.

Additional Audio Intelligence Features Coming Soon

At AssemblyAI, our in-house team of Deep Learning researchers and engineers is constantly looking for ways to improve our Audio Intelligence APIs and introduce new ones. Updates and improvements are shipped weekly, as detailed in our changelog.

We are particularly excited to be sharing new Audio Intelligence features soon. Be on the lookout for:

  • Emotion Detection: Identify a greater range of emotions in a transcription text, such as elated, thrilled, disappointed, etc.
  • Ad Detection: Identify start/end time stamps for voice and video ads, as well as all related sponsors and offers.
  • Translation: Convert any transcription text into 80+ different languages.
  • Intent Recognition: Identify the intent, or what a speaker hopes to achieve, in your audio or video file.

And more.