PII Redaction Policies for Speech-to-Text

Our new PII Redaction Policies feature is here!

Introduction

Personally Identifiable Information (PII) is any data that can be used to identify an individual, including any details that provide insight into who someone is. This can include information like:

Email addresses
Social Security Numbers
Credit card numbers
Account numbers
Phone numbers
Birthdays

PII creates security and privacy challenges, especially when specific and stringent safeguards for it are spelled out in regulations like the European Union’s (EU’s) General Data Protection Regulation (GDPR).

The loss of PII can also result in substantial losses for both businesses and individuals. IBM’s 2020 Cost of a Data Breach Report found that customer data was the most commonly compromised type of record, with 80% of breached organizations saying that customer PII was affected.

With PII now more accessible and shareable through multiple channels, companies are having to bolster their security practices to ensure proper handling of their customers' data.

Securing PII Through Redaction

PII redaction is one of the most effective solutions to secure data, as it provides another layer of protection to make sure customer information is hidden. This is especially important when using AI, automated speech recognition, and Speech-to-Text APIs, as there is no human review.

With PII redaction, a phone number like 412-412-4124 would become ###-###-#### in the text, and the corresponding words in the audio would be replaced with silence. Some common use cases are described below.
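As a rough illustration of the masked text format in the phone number example, the sketch below replaces each digit in a string with a # character. The helper name mask_digits is hypothetical, and this only shows what the output looks like; it is not how the API detects PII, which happens automatically on the transcript.

```python
import re


def mask_digits(text: str) -> str:
    """Replace every digit with '#', leaving separators such as '-' untouched."""
    return re.sub(r"\d", "#", text)


print(mask_digits("412-412-4124"))  # ###-###-####
```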

Conference Call Platforms

Customers often call into Conference Call Platforms and share their email address, credit card number, phone number, and other very sensitive PII. With PII Redaction, this sensitive data can be automatically detected and redacted, so that you're confident you're not storing or processing any PII from call recordings.

Call Tracking Platforms

Call Tracking Platforms primarily record agent-customer calls for marketing, sales, and support. In many cases, companies using these platforms require verification information from customers including account numbers, email addresses, and phone numbers while making a purchase or getting support. With PII redaction, all of this personal data will be removed from the call recording and the automated transcription.

Telemedicine

When patients visit their doctor, they are likely to share personal medical details like health insurance policy numbers, group numbers, or account numbers. With automated recording and transcription now being used for notes both in person and over virtual calls, these details are at risk of being compromised. PII Redaction can be leveraged to protect patients' medical information by removing it from the audio (or video) recording and transcript notes.

Hiring platforms

Common hiring platforms like Applicant Tracking Systems, Video Hiring Software, and even Human Resources Information Systems allow recruitment, HR, and management teams to efficiently manage their candidate pipeline and employee onboarding. These platforms often leverage call and video recording to make the process more effective. However, this often surfaces candidate and new hire information like emails, phone numbers, and compensation amounts. To help protect this information, PII Redaction will automatically detect and remove all candidate and new hire information from the recordings and transcriptions.

Enabling PII Redaction Policies

AssemblyAI enables you to automatically detect and redact Personally Identifiable Information (PII) from the automated transcription produced by our API.

Below is a code sample that shows how easy it is to enable PII Redaction when submitting audio or video files for transcription. You can view code samples in more programming languages in our API Docs.

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body: the file to transcribe, with PII redaction enabled
payload = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "redact_pii": True
}

# Replace YOUR-API-TOKEN with your own API token
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}

response = requests.post(endpoint, json=payload, headers=headers)

print(response.json())

Specifying Which Types of Data to Redact

To fit the redaction to your application, you can select from a set of redaction policies when PII Redaction is enabled. Include one or more of these policy names in the redact_pii_policies parameter of the same POST request that sets redact_pii.

For the full list of PII policies, see our API docs.
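A minimal sketch of a request body that selects specific policies might look like the following. The helper build_transcript_request is hypothetical, the policy names phone_number and email_address are used as examples that should be checked against the API docs, and passing the policies as a list of strings is an assumption about the parameter's shape.

```python
def build_transcript_request(audio_url, policies):
    """Build the JSON body for a transcription request with selected PII policies."""
    if not policies:
        raise ValueError("provide at least one policy name")
    return {
        "audio_url": audio_url,
        "redact_pii": True,
        # Assumed to be a list of policy-name strings; confirm in the API docs.
        "redact_pii_policies": list(policies),
    }


payload = build_transcript_request(
    "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    ["phone_number", "email_address"],  # example policy names, verify before use
)
print(payload)
```

The resulting dictionary can be passed as the json argument of requests.post, with the same endpoint and headers used when enabling PII Redaction.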

Redact PII from Audio

When you request a transcription that has PII redacted, you also have the option to request audio redaction. In that case, we will mute the parts of your audio where PII is spoken, and will make a URL available for downloading the redacted audio file.

Important Considerations

The muted portions of the audio will correspond to the timestamps where the PII was detected and replaced with # characters in the transcription text.
We will store the redacted audio file for 24 hours after your transcription has completed. After this time it will expire, so you'll need to download this file and store it in your own server/S3 bucket/etc.

The code sample below shows how you can submit an audio or video file for transcription and enable PII Audio Redaction. You can view code samples in more programming languages in our API Docs.

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body: enable redaction for both the transcript text and the audio
payload = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "redact_pii": True,
    "redact_pii_audio": True,
    # optional; receive a webhook when redacted audio is ready
    "webhook_url": "http://myserver.com/receive"
}

# Replace YOUR-API-TOKEN with your own API token
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}

response = requests.post(endpoint, json=payload, headers=headers)

print(response.json())

Get the redacted audio URL

If a webhook_url was provided in your API request, we will send a POST to your webhook_url when the redacted audio is ready. The POST request to your webhook will look like this:

headers
---
content-length: 79
accept-encoding: gzip, deflate
accept: */*
user-agent: python-requests/2.21.0
content-type: application/json

params
---
status: 'redacted_audio_ready'
redacted_audio_url: 'https://link-to-redacted-audio'
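A webhook receiver for this notification could be sketched as below, using only the Python standard library. It assumes the POST body is a JSON object carrying the status and redacted_audio_url fields shown above, which the application/json content type suggests but which should be confirmed against a real request. The names extract_redacted_audio_url, WebhookHandler, and serve are hypothetical, and port 8000 is an arbitrary choice.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


def extract_redacted_audio_url(body: bytes) -> Optional[str]:
    """Return the redacted audio URL if the payload says it is ready, else None."""
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # body was not valid JSON
    if not isinstance(payload, dict):
        return None
    if payload.get("status") != "redacted_audio_ready":
        return None
    return payload.get("redacted_audio_url")


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("content-length", 0))
        url = extract_redacted_audio_url(self.rfile.read(length))
        if url:
            print("Redacted audio ready:", url)
        # Respond with 200 whether or not a URL was found.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling serve() starts the receiver. Keeping the parsing in its own function makes it easy to check against sample payloads without running a server.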

If you can't receive a webhook, you can also make a GET request to the following endpoint to retrieve a URL for your redacted audio file:

https://api.assemblyai.com/v2/transcript/<your transcript id>/redacted-audio

This will return the following response:

{
    "status": "redacted_audio_ready",
    "redacted_audio_url": "https://link-to-redacted-audio"
}
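Because the redacted audio file expires 24 hours after the transcription completes, it can help to fetch the URL and save the file in one step. The sketch below does this with the standard library's urllib rather than requests, so it has no extra dependencies. The function names are hypothetical, the authorization header mirrors the one used in the POST samples, and the response shape is taken from the example above.

```python
import json
import shutil
import urllib.request
from typing import Optional

API_BASE = "https://api.assemblyai.com/v2/transcript"


def redacted_audio_endpoint(transcript_id: str) -> str:
    """Build the GET endpoint for a transcript's redacted audio."""
    return f"{API_BASE}/{transcript_id}/redacted-audio"


def fetch_redacted_audio_url(transcript_id: str, api_token: str) -> Optional[str]:
    """Return the redacted audio URL, or None if the status is not ready."""
    request = urllib.request.Request(
        redacted_audio_endpoint(transcript_id),
        headers={"authorization": api_token},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    if payload.get("status") == "redacted_audio_ready":
        return payload.get("redacted_audio_url")
    return None


def download_file(url: str, dest_path: str) -> None:
    """Stream the file at url to dest_path so it outlives the hosted copy."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

A typical flow would call fetch_redacted_audio_url with your transcript id and API token, then pass the returned URL to download_file with a path in your own storage.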
