PII Redaction Policies for Speech-to-Text

Our new PII Redaction Policies feature is here! Learn more about how it works and how to run some tests on your audio and video files in the walkthrough video below.

Introduction

Personally Identifiable Information (PII) is any data that can be used to identify an individual, including details that offer insight into who someone is. This can include information like:

  • Email addresses
  • Social Security Numbers
  • Credit card numbers
  • Account numbers
  • Phone numbers
  • Birthdays

PII creates security and privacy challenges, especially when specific and stringent safeguards for it are spelled out in regulations like the European Union’s (EU’s) General Data Protection Regulation (GDPR).

The loss of PII can also result in substantial losses for both businesses and individuals. IBM’s 2020 Cost of a Data Breach Report found that customer data was the most commonly compromised type of record, with 80% of breached organizations saying that customer PII was affected.

With PII now more accessible and shareable across multiple channels, companies have to bolster their security practices to ensure proper handling of their customers' data.

Securing PII Through Redaction

PII redaction is one of the most effective solutions to secure data, as it provides another layer of protection to make sure customer information is hidden. This is especially important when using AI, automated speech recognition, and Speech-to-Text APIs, as there is no human review.

With PII redaction, a phone number like 412-412-4124 would become ###-###-#### in the text, and the corresponding words in the audio would be replaced with silence. We've included some additional examples below.
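Before turning to those use cases, the text-side masking can be made concrete with a short Python sketch. This is only an illustration of the output format, not how the API detects PII: the regular expression and the function names are hypothetical, and the pattern only covers US-style numbers written like the example above.

```python
import re

# Hypothetical pattern for illustration only: matches numbers written as 412-412-4124.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def mask_digits(match: re.Match) -> str:
    """Replace every digit in the matched text with '#', keeping the separators."""
    return re.sub(r"\d", "#", match.group(0))


def redact_phone_numbers(text: str) -> str:
    """Mask phone numbers in the text while leaving the surrounding words intact."""
    return PHONE_PATTERN.sub(mask_digits, text)


print(redact_phone_numbers("You can reach me at 412-412-4124 after five."))
# prints: You can reach me at ###-###-#### after five.
```

Keeping the separators and the length of the masked value makes it clear what kind of information was removed without revealing any of it.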

Conference Call Platforms

Customers often call into Conference Call Platforms and share their email address, credit card number, phone number, and other highly sensitive PII. With PII Redaction, this sensitive data can be automatically detected and redacted, so you can be confident you're not storing or processing any PII from call recordings.

Call Tracking Platforms

Call Tracking Platforms primarily record agent-customer calls for marketing, sales, and support. In many cases, companies using these platforms require verification information from customers including account numbers, email addresses, and phone numbers while making a purchase or getting support. With PII redaction, all of this personal data will be removed from the call recording and the automated transcription.

Telemedicine

When patients visit their doctor, they are likely to share personal medical details like health insurance policy numbers, group numbers, or account numbers. With automated recording and transcription now being used for notes both in person and over virtual calls, there is a real risk that these details could be compromised. PII Redaction can be leveraged to protect patients' medical information by removing it from the audio (or video) recording and transcript notes.

Hiring platforms

Common hiring platforms like Applicant Tracking Systems, Video Hiring Software, and even Human Resources Information Systems allow recruitment, HR, and management teams to efficiently manage their candidate pipeline and employee onboarding. These platforms often leverage call and video recording to make the process more effective. However, this often surfaces candidate and new hire information like emails, phone numbers, and compensation amounts. To help protect this information, PII Redaction will automatically detect and remove candidate and new hire information from the recordings and transcriptions.

Enabling PII Redaction Policies

AssemblyAI enables you to automatically detect and redact Personally Identifiable Information (PII) from the automated transcription produced by our API.

Below is a code sample that shows how easy it is to enable PII Redaction when submitting audio or video files for transcription. You can view code samples in more programming languages in our API Docs.

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body: the file to transcribe plus the PII redaction settings
payload = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "redact_pii": True,
    "redact_pii_policies": ["all"]
}

# Replace YOUR-API-TOKEN with your own API token
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}

# Submit the file for transcription with PII redaction enabled
response = requests.post(endpoint, json=payload, headers=headers)

print(response.json())

Specifying Which Types of Data to Redact

To fit redaction to your application, you can choose from a set of redaction policies when PII Redaction is enabled. Include one or more of these policy names in the redact_pii_policies parameter when making your POST request, as shown above.

[Image: Redact Personally Identifiable Information (PII) from transcript text, AssemblyAI]
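As a rough sketch of selecting specific policies instead of "all", the request body could be built with a small helper like the one below. The helper name build_redaction_request is hypothetical, and the policy names "phone_number" and "email_address" are assumptions used for illustration; check the API Docs for the exact names before relying on them.

```python
from typing import Dict, List


def build_redaction_request(audio_url: str, policies: List[str]) -> Dict[str, object]:
    """Build a transcription request body that redacts only the given policies."""
    if not policies:
        raise ValueError("Provide at least one redaction policy name")
    return {
        "audio_url": audio_url,
        "redact_pii": True,
        "redact_pii_policies": list(policies),
    }


body = build_redaction_request(
    "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    ["phone_number", "email_address"],  # assumed policy names, verify in the API Docs
)
print(body)
```

The resulting dictionary can be sent as the JSON body of the same POST request used in the sample above.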

Redact PII from Audio

When you request a transcription that has PII redacted, you also have an option to request audio redaction. In that case, we will mute the parts of your audio where PII is spoken, and will make a downloadable URL available for the redacted audio file.

Important Considerations

  • The muted portions of the audio will correspond to the timestamps where the PII was detected and replaced with # characters in the transcription text.
  • We will store the redacted audio file for 24 hours after your transcription has completed. After this time it will expire, so you'll need to download this file and store it in your own server/S3 bucket/etc.
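Because the redacted file expires, it can help to track the 24-hour window and copy the file to your own storage early. The sketch below is one possible approach using only the Python standard library. It assumes you record the completion time yourself (for example, when the webhook arrives); the function names are hypothetical and not part of the API.

```python
import shutil
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention window for the redacted audio file, as stated in the considerations above.
REDACTED_AUDIO_TTL = timedelta(hours=24)


def redacted_audio_deadline(completed_at: datetime) -> datetime:
    """Return the moment after which the redacted audio is expected to expire."""
    return completed_at + REDACTED_AUDIO_TTL


def is_still_downloadable(completed_at: datetime, now: Optional[datetime] = None) -> bool:
    """Check whether the current time is still inside the 24-hour window."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current < redacted_audio_deadline(completed_at)


def save_redacted_audio(redacted_audio_url: str, destination_path: str) -> None:
    """Stream the redacted audio to a local file so it outlives the expiry."""
    with urllib.request.urlopen(redacted_audio_url) as response:
        with open(destination_path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
```

Pass timezone-aware datetimes for completed_at so the comparison against the current UTC time is valid.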

The code sample below shows how you can submit an audio or video file for transcription and enable PII Audio Redaction. You can view code samples in more programming languages in our API Docs.

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body: redact PII in both the transcript text and the audio
payload = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "redact_pii": True,
    "redact_pii_audio": True,
    # optional; receive a webhook when redacted audio is ready
    "webhook_url": "http://myserver.com/receive"
}

# Replace YOUR-API-TOKEN with your own API token
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}

response = requests.post(endpoint, json=payload, headers=headers)

Get the redacted audio URL

If a webhook_url was provided in your API request, we will send a POST to your webhook_url when the redacted audio is ready. The POST request to your webhook will look like this:

headers
---
content-length: 79
accept-encoding: gzip, deflate
accept: */*
user-agent: python-requests/2.21.0
content-type: application/json

params
---
status: 'redacted_audio_ready'
redacted_audio_url: 'https://link-to-redacted-audio'
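A minimal webhook receiver for this request might look like the sketch below, written with Python's standard library. It assumes the POST body is a JSON object carrying the status and redacted_audio_url fields shown above; the handler class, the function names, and the port are hypothetical choices for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


def extract_redacted_audio_url(raw_body: bytes) -> Optional[str]:
    """Return the redacted audio URL if the payload signals it is ready, else None."""
    try:
        payload = json.loads(raw_body)
    except ValueError:
        return None  # body was not valid JSON
    if not isinstance(payload, dict):
        return None
    if payload.get("status") != "redacted_audio_ready":
        return None
    return payload.get("redacted_audio_url")


class RedactedAudioWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read exactly the number of bytes announced in the content-length header.
        length = int(self.headers.get("Content-Length", 0))
        audio_url = extract_redacted_audio_url(self.rfile.read(length))
        if audio_url is not None:
            print("Redacted audio ready:", audio_url)
        # Acknowledge receipt of the webhook.
        self.send_response(200)
        self.end_headers()


def run_webhook_server(port: int = 8000) -> None:
    """Serve webhooks until interrupted (assumed port; adjust for your deployment)."""
    HTTPServer(("0.0.0.0", port), RedactedAudioWebhookHandler).serve_forever()
```

Calling run_webhook_server() starts the listener. Keeping the parsing in its own function makes it easy to test without opening a socket.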

If you can't receive a webhook, you can also make a GET request to the following endpoint to retrieve a URL for your redacted audio file:

https://api.assemblyai.com/v2/transcript/<your transcript id>/redacted-audio

This will return the following response:

{
    "status": "redacted_audio_ready",
    "redacted_audio_url": "https://link-to-redacted-audio"
}
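The GET request can be sketched as follows, again using only the standard library. It assumes the same authorization header as the POST samples above and that you already have the transcript id; what the endpoint returns before the audio is ready is not described here, so the sketch simply returns None for any other status. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://api.assemblyai.com/v2/transcript"


def redacted_audio_endpoint(transcript_id: str) -> str:
    """Build the redacted-audio endpoint URL for a given transcript id."""
    return f"{API_BASE}/{transcript_id}/redacted-audio"


def fetch_redacted_audio_url(transcript_id: str, api_token: str) -> Optional[str]:
    """GET the endpoint and return the redacted audio URL once it is ready."""
    request = urllib.request.Request(
        redacted_audio_endpoint(transcript_id),
        headers={"authorization": api_token},  # assumed: same header as the POST samples
    )
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    if body.get("status") == "redacted_audio_ready":
        return body.get("redacted_audio_url")
    return None
```

For example, fetch_redacted_audio_url("<your transcript id>", "YOUR-API-TOKEN") would return the download link once the status is redacted_audio_ready.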

