
Content moderation on audio files with Python

Modern AI models make it easy to automatically detect the presence of sensitive topics in speech data. Learn how to perform configurable content moderation with Python in this tutorial.


With a growing percentage of human communication happening online, ensuring that this communication is content-appropriate for a given platform is critical to maintaining the integrity and safety of online spaces. Content moderation is essential in these efforts, helping to detect and manage inappropriate or sensitive material in media files.

In this tutorial, we'll learn how you can use Python and state-of-the-art AI models to automatically perform content moderation on audio files at scale with just a few lines of code. 

We'll use this example file:

[Example audio: "Canadian Wildfires" (4:41)]

Below is an excerpt from the output. It shows a section of the file that discusses the sensitive topic of health issues, along with a severity score for the content, a confidence estimate for the prediction, and the timestamps of the relevant section:

So what is it in this haze that makes it harmful? And I'm assuming it is harmful. It is. It is. The levels outside right now in Baltimore are considered unhealthy. And most of that is due to what's called particulate matter, which are tiny particles, microscopic, smaller than the width of your hair, that can get into your lungs and impact your respiratory system, your cardiovascular system, and even your neurological, your brain. What makes this particularly harmful?
Timestamp: 56.3s - 85.4s
Label: health_issues - Confidence: 94% - Severity: 88%

Step 1: Set up your environment

Before we start coding, you'll need to make sure your environment is properly configured. First, check that Python is installed on your computer; if it isn't, you can download and install it from the official Python website.

Next, install the assemblyai Python package, which allows us to submit files to AssemblyAI for rapid content moderation. Install the package with pip by running the following command in your terminal or command prompt:

pip install assemblyai

After installing the assemblyai package, you'll need to set your API key as an environment variable. Your AssemblyAI API key is a unique identifier that allows you access to AssemblyAI's AI models. You can get an API key for free here, or copy it from your dashboard if you already have one.

Once you've copied your API key, set it as an environment variable. For Mac and Linux users, use the terminal to run:

export ASSEMBLYAI_API_KEY=YOUR_KEY_HERE

For Windows users, use the Command Prompt to execute:

set ASSEMBLYAI_API_KEY=YOUR_KEY_HERE
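
If you prefer not to use environment variables, the assemblyai SDK also lets you set the key directly in code (just be careful not to commit a real key to version control). A minimal sketch:

import assemblyai as aai

# Set the API key in code instead of via the ASSEMBLYAI_API_KEY environment variable.
aai.settings.api_key = "YOUR_KEY_HERE"  # placeholder value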

Step 2: Transcribe the file with content moderation

Now that your environment is set up, the next step is to transcribe your audio file and apply content moderation, allowing you to detect potentially sensitive or inappropriate content within the file.

First, create a file called main.py, import the assemblyai package, and specify the location of the audio file you would like to use. This location can be either a local file path or a publicly-accessible download URL. If you don't want to use your own file, you can keep the default example specified below:

import assemblyai as aai

audio_url = "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"
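
If your audio lives on your machine rather than at a URL, a local file path works the same way and the rest of the tutorial is unchanged. For example (the filename below is just a placeholder):

# Alternatively, point at a local file (placeholder filename):
audio_url = "./my_recording.mp3"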

Before we transcribe the audio file, we need to specify the configuration for the transcription. Create an aai.TranscriptionConfig object and enable content moderation via content_safety=True. This setting instructs AssemblyAI to analyze the audio for any content that may be considered sensitive during the transcription. You can check out the AssemblyAI docs to see other available models you can enable through the TranscriptionConfig. Add the following line to main.py:

config = aai.TranscriptionConfig(content_safety=True)
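
You can also enable several models in a single configuration. As an illustrative sketch, the config below requests speaker labels alongside content moderation (check the docs for the full, current list of parameters):

# Example of combining content moderation with another model (speaker diarization).
config = aai.TranscriptionConfig(
    content_safety=True,
    speaker_labels=True,
)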

Next, pass this config into an aai.Transcriber object, and then pass the audio file into the Transcriber's transcribe method. This submits the audio file for transcription according to the settings defined in the TranscriptionConfig. Add the following lines to main.py:

transcriber = aai.Transcriber(config=config)

transcript = transcriber.transcribe(audio_url)
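
Before going further, it's worth confirming that the transcription actually succeeded. A minimal check using the transcript's status and error fields might look like this:

# Stop early if the transcription failed (e.g. unreachable file or invalid API key).
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(f"Transcription failed: {transcript.error}")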

The resulting transcript is an aai.Transcript object, which contains, among other things, information about any potentially sensitive segments in the file. Let's take a look at what's returned.

Step 3: Print the result

After transcribing the audio file and analyzing it for sensitive content, we can print the Content Moderation results to see what information is returned. You can then include some logic in your application to automatically handle sensitive content according to your content policies.

All of the content moderation information for the transcript is found in the transcript.content_safety object. The results attribute of this object contains a list of objects, one for each section in the audio file that the Content Moderation model flagged as sensitive content.

Below, we iterate through each element in this list and print the text for the corresponding section of the file, as well as the timestamps for the beginning and end of the section. Then we print the information for each content moderation label assigned to the section. Each label specifies a different type of sensitive content detected in that section, along with a confidence score and a severity rating. Add the following lines to main.py:

# Get the parts of the transcript which were flagged as sensitive.
for result in transcript.content_safety.results:
    print(result.text)
    print(f"Timestamp: {result.timestamp.start/1000:.1f}s - {result.timestamp.end/1000:.1f}s")

    # Get category, confidence, and severity.
    for label in result.labels:
      print(f"Label: {label.label} - Confidence: {label.confidence*100:.0f}% - Severity: {label.severity*100:.0f}%")  # content safety category
    print()

Here is one of the items that will be output when we run the script:

Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US. Skylines from Maine to Maryland to Minnesota are gray and smoggy. And in some places, the air quality warnings include the warning to stay inside. We wanted to better understand what's happening here and why. So he called Peter DiCarlo, an associate professor in the department of Environmental Health and Engineering at Johns Hopkins University. Good morning. Professor. Good morning.
Timestamp: 0.2s - 28.8s
Label: disasters - Confidence: 81% - Severity: 39%

We can see that this section was identified, with 81% confidence, as touching on the sensitive topic of disasters, with a severity of 39%.

You can use these results to identify sections of audio that are considered sensitive according to your own internal criteria. In the code block below, we've added the criterion that the product of a label's confidence and severity must exceed a threshold for the section to be reported; since both values are at most 1, only sections that are both reasonably confident and reasonably severe make the cut.

THRESHOLD = 0.7

# Get the parts of the transcript which were flagged as sensitive.
for result in transcript.content_safety.results:
    if not any(label.confidence * label.severity > THRESHOLD for label in result.labels):
        continue
    print(result.text)
    print(f"    Timestamps: {result.timestamp.start/1000:0.1f}s - {result.timestamp.end/1000:0.1f}s")

    # Get category, confidence, and severity.
    for label in result.labels:
        if label.confidence * label.severity > THRESHOLD:
            print(f"    Label: {label.label} (Confidence: {label.confidence:.02f}, Severity: {label.severity:.02f})")  # content safety category
    print()

Here is the full output of this code block:

So what is it in this haze that makes it harmful? And I'm assuming it is harmful. It is. It is. The levels outside right now in Baltimore are considered unhealthy. And most of that is due to what's called particulate matter, which are tiny particles, microscopic, smaller than the width of your hair, that can get into your lungs and impact your respiratory system, your cardiovascular system, and even your neurological, your brain. What makes this particularly harmful?
    Timestamps: 56.3s - 85.4s
    Label: health_issues (Confidence: 0.94, Severity: 0.88)
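
Once you know which sections exceed your threshold, you can hand them off to the rest of your application, for example a human review queue or an automated redaction step. Here is one possible (hypothetical) shape for that hand-off, reusing the THRESHOLD defined above:

# Collect flagged sections into simple records for downstream handling.
flagged_sections = []
for result in transcript.content_safety.results:
    for label in result.labels:
        if label.confidence * label.severity > THRESHOLD:
            flagged_sections.append({
                "label": label.label,
                "start_ms": result.timestamp.start,
                "end_ms": result.timestamp.end,
                "text": result.text,
            })

print(f"{len(flagged_sections)} section(s) flagged for review")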

Summarizing overall findings

You can also summarize the overall findings of the Content Moderation model to get a broader view of the nature of the audio content. Add the following lines to main.py:

# Get the confidence of the most common labels in relation to the entire audio file.
for label, confidence in transcript.content_safety.summary.items():
    print(f"{confidence * 100:.2f}% confident that the audio contains {label}")

print()

When you run the script, you will see this output, which indicates that the Content Moderation model is highly confident that this audio file as a whole concerns disasters and health issues.

98.88% confident that the audio contains disasters
90.83% confident that the audio contains health_issues
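
You could also use these file-level confidences to gate an entire upload, for example escalating any file whose summary confidence for some label exceeds a policy threshold. A small sketch of that idea (the 0.9 cutoff is an arbitrary example):

# Flag the whole file if any file-level label confidence exceeds a policy threshold.
POLICY_THRESHOLD = 0.9
flagged_topics = [
    label
    for label, confidence in transcript.content_safety.summary.items()
    if confidence > POLICY_THRESHOLD
]
if flagged_topics:
    print(f"File flagged for manual review: {', '.join(flagged_topics)}")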

Additionally, you can get a finer-grained breakdown of these findings by accessing a severity score summary. This breakdown again considers the audio file as a whole, but for each label it gives a probability distribution across three discrete severity levels: low, medium, and high. Add the following lines to main.py:

# Get the overall severity of the most common labels in relation to the entire audio file.
for label, severity_confidence in transcript.content_safety.severity_score_summary.items():
    print(f"{severity_confidence.low * 100:.2f}% confident that the audio contains low-severity {label}")
    print(f"{severity_confidence.medium * 100:.2f}% confident that the audio contains medium-severity {label}")
    print(f"{severity_confidence.high * 100:.2f}% confident that the audio contains high-severity {label}")

Here is the output:

53.14% confident that the audio contains low-severity disasters
46.86% confident that the audio contains medium-severity disasters
0.00% confident that the audio contains high-severity disasters
20.70% confident that the audio contains low-severity health_issues
46.23% confident that the audio contains medium-severity health_issues
33.07% confident that the audio contains high-severity health_issues

We can see that, for each distinct label that applies to the entire file, the probabilities for low, medium, and high sum to 100%.
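
If a single number per label is easier to work with downstream, you can collapse each distribution into a weighted severity score. The weights below (0.0, 0.5, and 1.0 for low, medium, and high) are an arbitrary choice for illustration:

# Collapse each severity distribution into one weighted score (weights are illustrative only).
for label, severity_confidence in transcript.content_safety.severity_score_summary.items():
    weighted = 0.0 * severity_confidence.low + 0.5 * severity_confidence.medium + 1.0 * severity_confidence.high
    print(f"{label}: weighted severity score {weighted:.2f}")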

Run python main.py in the terminal in which you set your AssemblyAI API key as an environment variable to see all of these outputs printed to the console.

Final words

In this tutorial, you learned how to perform a Content Moderation analysis of an audio file using AI. With the results printed and analyzed, you can make informed decisions to ensure your audio content aligns with your organization's safety and content standards.

If you want to learn more about how to analyze audio and video files with AI, check out more of our blog, like this article on filtering profanity from audio files with Python. Alternatively, feel free to check out our YouTube channel for educational videos on AI and AI-adjacent projects, like this video on how to automatically extract phone call insights using LLMs and Python.