Automatically redact PII from audio and video with Python

In this tutorial, we’ll learn how to automatically redact Personally Identifiable Information (PII) from audio and video files in 5 minutes using Python and AssemblyAI.

Personally Identifiable Information, or PII, is information that can be linked to a specific individual. How PII is handled, and who may access it, is regulated by laws such as HIPAA, GDPR, and CCPA.

Redacting PII from video or audio files (for example, a doctor/patient visit) is a very common need. Luckily, we can use AI to easily redact PII at scale.

In this tutorial, we’ll learn how to redact a wide range of PII categories, like medical conditions, email addresses, and credit card numbers, from audio and video files as well as their textual transcripts. Here you can see the final redacted transcript we’ll create, as well as the associated redacted audio file:

Good afternoon, MGK design. Hi. I'm looking to have plans drawn up for an addition in my house. Okay, let me have one of our architects return your call. May I have your name, please? My name is ####. ####. And your last name? My last name is #####. Would you spell that for me, please? # # # # # #. Okay, and your telephone number? Area code? ###-###-#### that's ###-###-#### yes, ma'am. Is there a good time to reach you? That's my cell, so he could catch me anytime on that. Okay, great. I'll have him return your call as soon as possible. Great. Thank you very much. You're welcome. Bye.
[Audio player: Redacted call, 0:49]

Code relating to this tutorial can be found in this GitHub repository. Let’s get started!

Step 1: Set up environment

First, create a project directory and navigate into it. Then, create a virtual environment:

# Mac/Linux:
python3 -m venv venv
. venv/bin/activate

# Windows:
python -m venv venv
.\venv\Scripts\activate.bat

Next, install the AssemblyAI Python SDK:

pip install assemblyai

Then set your AssemblyAI API key as an environment variable. You can get an AssemblyAI API key for free. Note that you'll need to add funds to your account to go beyond Speech Recognition and use our Audio Intelligence models like PII Redaction.

# Mac/Linux:
export ASSEMBLYAI_API_KEY=<YOUR_KEY>

# Windows:
set ASSEMBLYAI_API_KEY=<YOUR_KEY>
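
Before running anything, it can help to confirm that the variable is actually visible to Python. The following is a minimal sketch and not part of the tutorial's main.py; require_api_key is a hypothetical helper name, and it only reads the ASSEMBLYAI_API_KEY variable set above:

```python
import os


def require_api_key(env=None):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    key = env.get("ASSEMBLYAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ASSEMBLYAI_API_KEY is not set; run the export/set command first."
        )
    return key
```

Calling require_api_key() at the top of a script fails early with a readable message if the variable was set in a different terminal session.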

Step 2: Transcribe the file

Now that our environment is set up, we can submit an audio file for transcription. For this tutorial, we’ll use a short phone conversation between a man and an architecture firm. First, we’ll import the AssemblyAI Python SDK, and then specify the publicly-accessible URL of our audio file (you can also use a video file). Create a file called main.py and add the following lines:

import assemblyai as aai

audio_url = "https://storage.googleapis.com/aai-web-samples/architecture-call.mp3"

Next, we need to configure our transcription to redact PII, and set the types of PII we want to redact. We specify this in an aai.TranscriptionConfig. Add the following lines to main.py:

config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_audio=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.phone_number,
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,
)

We enable PII Redaction through redact_pii, and specify that we also want a redacted version of the audio file itself (in addition to the redacted transcript) through redact_pii_audio.

Then, we set the categories, or policies, of PII we want to redact through redact_pii_policies. For a list of all available policies, see PII policies in our docs. Finally, we set how the PII is redacted in the transcript through redact_pii_sub. In this case, we use hash, so that each redacted piece of text is replaced with a sequence of hashes (####). You can find the possible values for this parameter in the PII Redaction API reference in our docs.
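
To get a feel for what hash substitution looks like, here is a toy illustration in plain Python. It is only an assumption-based sketch that mimics the appearance of the redacted transcript shown in this tutorial; it is not how AssemblyAI detects or redacts PII, and mask_with_hashes is a hypothetical name:

```python
import re


def mask_with_hashes(text):
    """Replace every letter or digit with '#', keeping spaces and punctuation."""
    return re.sub(r"[A-Za-z0-9]", "#", text)


print(mask_with_hashes("610-265-1714"))  # ###-###-####
print(mask_with_hashes("John."))         # ####.
```

The real service decides which spans count as PII based on the policies you select; the sketch simply masks whatever string it is given.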

Now we can transcribe the audio file by using the transcribe method of an aai.Transcriber, passing in the configuration we just created. Add the following line to main.py:

transcript = aai.Transcriber().transcribe(audio_url, config)
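
Transcription can fail, for example if the URL is not reachable. A minimal sketch of a guard is shown below, assuming the returned transcript object exposes an error attribute that is empty on success; ensure_transcribed is a hypothetical helper, and the demo object is a stand-in rather than a real transcript:

```python
from types import SimpleNamespace


def ensure_transcribed(transcript):
    """Raise if the transcript reports an error; otherwise return it unchanged."""
    # Assumption: the transcript object has an `error` attribute (None on success).
    error = getattr(transcript, "error", None)
    if error:
        raise RuntimeError(f"Transcription failed: {error}")
    return transcript


# Stand-in object for illustration only; in main.py, pass the real transcript.
demo = SimpleNamespace(text="Hello.", error=None)
print(ensure_transcribed(demo).text)
```

In main.py, this would be called right after the transcribe line, before reading transcript.text.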

Step 3: Print the redacted transcript

When the program executes, the transcribe method submits the file for transcription. The resulting transcript is available through transcript.text. Add a line to main.py to print out the redacted transcript when we execute the program:

print(transcript.text, '\n\n')

Step 4: Fetch the redacted audio files

You can print the URL of the redacted audio file by adding the line below to main.py:

print(transcript.get_redacted_audio_url())
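
If you would rather save the file than open the URL in a browser, the standard library is enough. This is a sketch under the assumption that the printed URL can be fetched with a plain GET request; download_file and the output filename are hypothetical:

```python
import shutil
import urllib.request


def download_file(url, dest_path):
    """Stream the resource at `url` to `dest_path` and return the path."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# In main.py, after transcription has completed:
# download_file(transcript.get_redacted_audio_url(), "redacted_call.mp3")
```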

Step 5: Run the program

In a terminal (with your AssemblyAI API key set as an environment variable as shown in Step 1), execute python main.py or python3 main.py to run the program. After a few moments, the redacted transcript will be printed:

Good afternoon, MGK design. Hi. I'm looking to have plans drawn up for an addition in my house. Okay, let me have one of our architects return your call. May I have your name, please? My name is ####. ####. And your last name? My last name is #####. Would you spell that for me, please? # # # # # #. Okay, and your telephone number? Area code? ###-###-#### that's ###-###-#### yes, ma'am. Is there a good time to reach you? That's my cell, so he could catch me anytime on that. Okay, great. I'll have him return your call as soon as possible. Great. Thank you very much. You're welcome. Bye.

As we can see, the specified PII types were redacted in the transcript. You can see the unredacted transcript below for comparison:

Good afternoon, MGK design. Hi. I'm looking to have plans drawn up for an addition in my house. Okay, let me have one of our architects return your call. May I have your name, please? My name is John. John. And your last name? My last name is Lowry. Would you spell that for me, please? L o w e r y. Okay, and your telephone number? Area code? 610-265-1714 that's 610-265-1714 yes, ma'am. Is there a good time to reach you? That's my cell, so he could catch me anytime on that. Okay, great. I'll have him return your call as soon as possible. Great. Thank you very much. You're welcome. Bye.

Finally, the URL of your redacted audio file will be printed to the console. Go to the URL, and you’ll be able to download the redacted audio file:

[Audio player: Redacted call, 0:49]

Note that a redacted audio file will be returned, even if you submitted a video file for transcription. In this case, you can use a tool like FFmpeg to replace the original audio in the file with the redacted version.
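
As a rough sketch of that FFmpeg step, the snippet below builds a command that keeps the video stream from the original file and takes the audio from the redacted file. It assumes ffmpeg is installed and on your PATH; the function names and file paths are hypothetical, and re-encoding the audio to AAC is one reasonable choice for an MP4 output rather than a requirement:

```python
import subprocess


def build_ffmpeg_command(video_path, redacted_audio_path, output_path):
    """Build an ffmpeg command that swaps in the redacted audio track."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,            # input 0: original video
        "-i", redacted_audio_path,   # input 1: redacted audio
        "-map", "0:v:0",             # take video from input 0
        "-map", "1:a:0",             # take audio from input 1
        "-c:v", "copy",              # do not re-encode the video stream
        "-c:a", "aac",               # encode audio for the MP4 container
        "-shortest",                 # stop at the shorter of the two streams
        output_path,
    ]


def replace_audio(video_path, redacted_audio_path, output_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(
        build_ffmpeg_command(video_path, redacted_audio_path, output_path),
        check=True,
    )


# Example (hypothetical file names):
# replace_audio("visit.mp4", "redacted_call.mp3", "visit_redacted.mp4")
```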

Optionally, you can check out the code in compare.py in the project repository to print the differences between the unredacted and redacted versions of the transcript.
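
For a quick comparison without the repository, a word-level diff can be sketched with Python's difflib. This is not the code from compare.py; redacted_spans is a hypothetical helper that simply pairs each changed run of words in the original with its replacement:

```python
import difflib


def redacted_spans(original, redacted):
    """Return (original_words, redacted_words) pairs where the texts differ."""
    a = original.split()
    b = redacted.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


print(redacted_spans("My name is John. Okay", "My name is ####. Okay"))
# [('John.', '####.')]
```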

Final words

In this tutorial, we learned how to automatically redact PII from audio and video files in 5 minutes using AssemblyAI and Python. You can check out our docs on PII redaction to learn more about it or browse some of our other AI models.

Alternatively, feel free to check out our blog or YouTube channel for educational content on AI and Machine Learning, such as our video on how to build an AI voice bot with Python.