PII Redaction for Speech-to-Text

Our new PII Redaction Policies feature is here! Learn more about how it works and how to run some tests on your audio and video files.

Introduction

Personally Identifiable Information (PII) is any data that can be used to identify an individual, or that offers insight into who someone is. This can include information like:

  • Email addresses
  • Social Security Numbers
  • Credit card numbers
  • Account numbers
  • Phone numbers
  • Birthdays

PII creates security and privacy challenges, especially when specific and stringent safeguards for it are spelled out in regulations like the European Union’s (EU’s) General Data Protection Regulation (GDPR).

Losing PII can also be costly for both businesses and individuals. IBM’s 2020 Cost of a Data Breach Report found that customer data was the most commonly compromised type of record, with 80% of breached organizations saying that customer PII was affected.

With PII now more accessible and shareable across multiple channels, companies are having to bolster their security practices to ensure proper handling of their customers' data.

Securing PII through redaction

PII redaction is one of the most effective solutions to secure data, as it provides another layer of protection to make sure customer information is hidden. This is especially important when using AI, automated speech recognition, and Speech-to-Text APIs, as there is no human review.

With PII redaction, a phone number like 412-412-4124 would become ###-###-#### in the text, and the portion of the audio where those words are spoken would be replaced with silence. We've included some additional examples below.
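
To make the text side of this concrete, here is a minimal local sketch of the masking step only. It is not how AssemblyAI detects PII; the regular expression and the function name mask_phone_numbers are assumptions made purely for illustration, and they only cover numbers written in the 412-412-4124 style.

```python
import re

# Illustrative pattern only: matches numbers written like 412-412-4124.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def mask_phone_numbers(text: str) -> str:
    """Replace every digit of each matched phone number with '#', keeping the dashes."""
    return PHONE_PATTERN.sub(
        lambda match: re.sub(r"\d", "#", match.group(0)),
        text,
    )


# Example call (hypothetical transcript text):
masked = mask_phone_numbers("You can reach me at 412-412-4124 after five.")
print(masked)
```

The dashes are preserved so the redacted text keeps the shape of the original number, which matches the ###-###-#### form described above.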

Conference Call Platforms

Customers often call into Conference Call Platforms and share their email address, credit card number, phone number, and other very sensitive PII. With PII Redaction, this sensitive data can be automatically detected and redacted, so you can be confident you're not storing or processing any PII from call recordings.

[Image: AssemblyAI Speech-to-Text PII Redaction for Conference Call Platforms]

Call Tracking Platforms

Call Tracking Platforms primarily record agent-customer calls for marketing, sales, and support. In many cases, companies using these platforms require verification information from customers including account numbers, email addresses, and phone numbers while making a purchase or getting support. With PII redaction, all of this personal data will be removed from the call recording and the automated transcription.

[Image: AssemblyAI Speech-to-Text API PII Redaction for Call Tracking Platforms]

Telemedicine

When patients visit their doctor, they are likely to share personal medical details like health insurance policy numbers, group numbers, or account numbers. With automated recording and transcription now being used for notes both in person and over virtual calls, those details could be compromised. PII Redaction can be leveraged to protect patients' medical information by removing it from the audio (or video) recording and transcript notes.

[Image: AssemblyAI Speech-to-Text API PII Redaction for Telemedicine]

Hiring platforms

Common hiring platforms like Applicant Tracking Systems, Video Hiring Software, and even Human Resources Information Systems allow recruitment, HR, and management teams to efficiently manage their candidate pipeline and employee onboarding. These platforms often leverage call and video recording to make the process more effective. However, this often surfaces candidate and new hire information like emails, phone numbers, and compensation amounts. To help protect this information, PII Redaction will automatically detect and remove all candidate and new hire information from the recordings and transcriptions.

[Image: AssemblyAI Speech-to-Text API PII Redaction for Hiring Platforms]

Enabling AssemblyAI's PII Redaction Policies

AssemblyAI allows you to automatically detect and redact Personally Identifiable Information (PII) across your automated transcription, along with your audio or video file.

Redact PII from transcripts (Python)

View code samples in more programming languages here

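import requests

# Transcript endpoint; a POST here submits a new transcription job.
endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body. Named `payload` so it does not shadow Python's standard `json` module.
payload = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "redact_pii": True,              # enable PII redaction in the transcript text
    "redact_pii_policies": ["all"],  # which PII policies to apply
}

headers = {
    "authorization": "YOUR-API-TOKEN",  # replace with your own API token
    "content-type": "application/json",
}

response = requests.post(endpoint, json=payload, headers=headers)

print(response.json())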

Specify which types of data to redact

To fit the redaction to your application, you can select from a set of redaction policies when PII redaction is enabled. You can include any or all of these policy names in the redact_pii_policies array when making your POST request, as shown above.

[Image: Redact Personally Identifiable Information (PII) from transcript text, AssemblyAI]
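
As a rough sketch of selecting specific policies, the helper below builds the request body with a chosen subset. The helper name build_redaction_payload and the example policy names ("phone_number", "email_address") are assumptions for illustration only; check them against AssemblyAI's published policy list before use.

```python
from typing import Dict, List, Union

Payload = Dict[str, Union[str, bool, List[str]]]


def build_redaction_payload(audio_url: str, policies: List[str]) -> Payload:
    """Build a transcript request body that redacts only the given policies."""
    if not policies:
        raise ValueError("at least one redaction policy is required")
    return {
        "audio_url": audio_url,
        "redact_pii": True,
        # Copy the list so later changes by the caller do not alter the payload.
        "redact_pii_policies": list(policies),
    }


# Policy names here are hypothetical placeholders, not confirmed API values.
example_payload = build_redaction_payload(
    "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    ["phone_number", "email_address"],
)
print(example_payload)
```

The resulting dictionary can be passed as the JSON body of the POST request to the transcript endpoint in place of the hand-written dictionary.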

Redact PII from Audio

When you request a transcription that has PII redacted, you also have an option to request audio redaction. In that case, we will mute the parts of your audio where PII numbers are spoken, and make a downloadable URL available for the redacted audio file.

Important Considerations

  • The muted portions of the audio will correspond to the timestamps where the PII was detected and replaced with # characters in the transcription text.
  • We will store the redacted audio file for 24 hours after your transcription has completed. After that it expires, so you'll need to download the file within this window and store it yourself (for example, on your own server or in an S3 bucket).
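
Because of the 24-hour window, it can help to compute the expiry time and copy the file to your own storage as soon as it is ready. The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes the redacted audio URL can be fetched with a plain GET and no extra headers, which the text above does not confirm.

```python
import shutil
import urllib.request
from datetime import datetime, timedelta, timezone

# Retention window for the redacted audio, as stated above.
REDACTED_AUDIO_TTL = timedelta(hours=24)


def redacted_audio_expiry(completed_at: datetime) -> datetime:
    """Return the time after which the redacted audio is no longer stored."""
    return completed_at + REDACTED_AUDIO_TTL


def download_redacted_audio(redacted_audio_url: str, destination_path: str) -> None:
    """Stream the redacted audio to a local file (assumes a plain GET works)."""
    with urllib.request.urlopen(redacted_audio_url) as response:
        with open(destination_path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)


# Example: a transcription that completed at noon UTC on 1 January 2021.
completed = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
print(redacted_audio_expiry(completed).isoformat())
```

Calling download_redacted_audio with the URL you receive and a local path would save the file; from there it can be uploaded to whatever long-term storage you use.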

Submit an audio or video file for transcription and enable audio redaction (Python)

View code samples in more programming languages here

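import requests

# Transcript endpoint; a POST here submits a new transcription job.
endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body. Named `payload` so it does not shadow Python's standard `json` module.
payload = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "redact_pii": True,        # redact PII in the transcript text
    "redact_pii_audio": True,  # also produce a redacted copy of the audio
    # optional; receive a webhook when redacted audio is ready
    "webhook_url": "http://myserver.com/receive",
}

headers = {
    "authorization": "YOUR-API-TOKEN",  # replace with your own API token
    "content-type": "application/json",
}

response = requests.post(endpoint, json=payload, headers=headers)

print(response.json())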

Get the redacted audio URL

If a webhook_url was provided in your API request, we will send a POST to your webhook_url when the redacted audio is ready. The POST request headers and JSON body will look like this:

headers
---
content-length: 79
accept-encoding: gzip, deflate
accept: */*
user-agent: python-requests/2.21.0
content-type: application/json

params
--
status: 'redacted_audio_ready'
redacted_audio_url: 'https://link-to-redacted-audio'
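
A minimal sketch of a receiver for this webhook, using only the Python standard library, might look like the following. The handler class, function names, and port are assumptions for illustration; a production service would normally sit behind HTTPS and validate the request further.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


def extract_redacted_audio_url(body: bytes) -> Optional[str]:
    """Return the redacted audio URL from a webhook body, or None if not ready."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(payload, dict):
        return None
    if payload.get("status") != "redacted_audio_ready":
        return None
    return payload.get("redacted_audio_url")


class RedactedAudioWebhookHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that logs the URL and acknowledges the POST."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        url = extract_redacted_audio_url(self.rfile.read(length))
        if url is not None:
            print("Redacted audio ready:", url)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted (port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), RedactedAudioWebhookHandler).serve_forever()
```

Calling serve() starts the listener; the address it is reachable at is what you would pass as webhook_url in the transcription request.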

Retrieving the redacted audio URL directly from the API

If you can't receive a webhook, you can also make a GET request to the following endpoint to retrieve a URL for your redacted audio file:

https://api.assemblyai.com/v2/transcript/<your transcript id>/redacted-audio

This will return the following status codes and responses:

200 status code (successful)

{
    "status": "redacted_audio_ready",
    "redacted_audio_url": "https://link-to-redacted-audio"
}
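
The GET request above could be wrapped roughly as follows. This is a sketch using the Python standard library, not an official client: the function names are hypothetical, it assumes the same authorization header used for the POST request also applies here, and since only the 200 response is shown above, every HTTP error is simply treated as "not available".

```python
import json
import urllib.error
import urllib.request
from typing import Optional

API_BASE = "https://api.assemblyai.com/v2/transcript"


def redacted_audio_endpoint(transcript_id: str) -> str:
    """Build the redacted-audio endpoint URL for a transcript id."""
    return f"{API_BASE}/{transcript_id}/redacted-audio"


def fetch_redacted_audio_url(transcript_id: str, api_token: str) -> Optional[str]:
    """Return the redacted audio URL, or None if it is not available."""
    request = urllib.request.Request(
        redacted_audio_endpoint(transcript_id),
        # Assumption: same authorization header as the transcript POST.
        headers={"authorization": api_token},
    )
    try:
        with urllib.request.urlopen(request) as response:
            body = json.load(response)
    except urllib.error.HTTPError:
        # Non-2xx responses are not documented here, so treat them as "no URL".
        return None
    if body.get("status") == "redacted_audio_ready":
        return body.get("redacted_audio_url")
    return None
```

A caller would pass the transcript id returned by the original POST request, for example fetch_redacted_audio_url(transcript_id, "YOUR-API-TOKEN"), and retry later if the result is None.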

Sources

https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A32016R0679

https://www.csoonline.com/article/3215864/how-to-protect-personally-identifiable-information-pii-under-gdpr.html

https://www.welivesecurity.com/2020/08/12/what-is-cost-data-breach/

https://www.ibm.com/security/digital-assets/cost-data-breach-report/#/