PII Redaction

Personally Identifiable Information (PII) Redaction is an AI model that automatically removes sensitive information that could be used to uniquely identify an individual from your transcript text.

Quickstart

When submitting files for transcription, include the redact_pii parameter in your request body and set it to true, as well as the required parameter redact_pii_policies, listing all policies that should be redacted.
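
As a rough illustration, the request body described above could be assembled as follows. This is a minimal sketch rather than the official quickstart: the endpoint URL, the authorization header, the audio_url field, and the helper name build_transcript_request are assumptions that do not appear in this section, and the actual network call is left commented out.

```python
import json
import urllib.request

# Assumption: the transcription endpoint is not stated in this section.
API_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url, policies, api_key):
    """Build (but do not send) a transcription request with PII Redaction enabled."""
    if not policies:
        # redact_pii_policies is required whenever redact_pii is true.
        raise ValueError("redact_pii_policies must list at least one policy")
    payload = {
        "audio_url": audio_url,  # assumed field name for the file to transcribe
        "redact_pii": True,
        "redact_pii_policies": list(policies),
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the API key is passed in an "authorization" header.
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


request = build_transcript_request(
    "https://example.com/audio.mp3",
    ["person_name", "phone_number"],
    "YOUR_API_KEY",
)
print(json.loads(request.data))

# To actually submit the request (requires a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     transcript = json.load(response)
```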

You can run the quickstart code snippet in Colab here, or you can view the full source code here.

Understanding the response

The JSON response object contains all information about the transcription. Depending on which models are used to analyze the audio, the attributes of this object will vary. For example, in the quickstart above we did not enable Summarization, which is reflected by the summarization: false key-value pair in the JSON response. Had we activated Summarization, the summary, summary_type, and summary_model keys would contain the file summary (and additional details) rather than null values.

To access the PII Redaction information, we use the redact_pii, redact_pii_audio, redact_pii_audio_quality, redact_pii_policies, and redact_pii_sub keys:
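
A minimal sketch of reading these keys, assuming the JSON response has already been parsed into a Python dictionary named results. The helper name pii_settings and the sample values are hypothetical and only illustrate the shape of the data.

```python
PII_KEYS = (
    "redact_pii",
    "redact_pii_audio",
    "redact_pii_audio_quality",
    "redact_pii_policies",
    "redact_pii_sub",
)


def pii_settings(results):
    """Return only the PII Redaction keys from a parsed response (None if a key is absent)."""
    return {key: results.get(key) for key in PII_KEYS}


# Hypothetical response excerpt; the values are illustrative, not real API output.
results = {
    "text": "Hi, my name is ####.",
    "redact_pii": True,
    "redact_pii_audio": False,
    "redact_pii_audio_quality": None,
    "redact_pii_policies": ["person_name"],
    "redact_pii_sub": "hash",
}

for key, value in pii_settings(results).items():
    print(f"{key}: {value}")
```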

The reference table below lists all relevant attributes along with their descriptions, where we've called the JSON response object results. Object attributes are accessed via dot notation, and arbitrary array elements are denoted with [i]. For example, results.words[i].text refers to the text attribute of the i-th element of the words array in the JSON results object.

| Key | Type | Description |
| --- | --- | --- |
| results.redact_pii | boolean | Whether PII Redaction was enabled in the transcription request |
| results.redact_pii_audio | boolean | Whether to return a redacted version of the audio file |
| results.redact_pii_audio_quality | string | The quality of the redacted PII audio file |
| results.redact_pii_policies | array | An array of PII policies that were requested for redaction in the transcription request |
| results.redact_pii_sub | string | What type of substitution is used to redact PII (see below for details) |

All policies supported by the model

With PII Redaction, the API can automatically remove Personally Identifiable Information (PII) such as phone numbers and social security numbers from the transcription text before it is returned. The redacted text replaces any sensitive information with "#" characters.

Below is a table that lists all the available PII Redaction policies and their descriptions:

| Policy | Description |
| --- | --- |
| medical_process | Medical process, including treatments, procedures, and tests (e.g., heart surgery, CT scan) |
| medical_condition | Name of a medical condition, disease, syndrome, deficit, or disorder (e.g., chronic fatigue syndrome, arrhythmia, depression) |
| blood_type | Blood type (e.g., O-, AB positive) |
| drug | Medications, vitamins, or supplements (e.g., Advil, Acetaminophen, Panadol) |
| injury | Bodily injury (e.g., I broke my arm, I have a sprained wrist) |
| number_sequence | A "lazy" rule that redacts any sequence of two or more numbers |
| email_address | Email address (e.g., support@assemblyai.com) |
| date_of_birth | Date of birth (e.g., Date of Birth: March 7, 1961) |
| phone_number | Telephone or fax number |
| us_social_security_number | Social Security Number or equivalent |
| credit_card_number | Credit card number |
| credit_card_expiration | Expiration date of a credit card |
| credit_card_cvv | Credit card verification code (e.g., CVV: 080) |
| date | Specific calendar date (e.g., December 18) |
| nationality | Terms indicating nationality, ethnicity, or race (e.g., American, Asian, Caucasian) |
| event | Name of an event or holiday (e.g., Olympics, Yom Kippur) |
| language | Name of a natural language (e.g., Spanish, French) |
| location | Any location reference, including mailing address, postal code, city, state, province, or country |
| money_amount | Name and/or amount of currency (e.g., 15 pesos, $94.50) |
| person_name | Name of a person (e.g., Bob, Doug Jones) |
| person_age | Number associated with an age (e.g., 27, 75) |
| organization | Name of an organization (e.g., CNN, McDonald's, University of Alaska) |
| political_affiliation | Terms referring to a political party, movement, or ideology (e.g., Republican, Liberal) |
| occupation | Job title or profession (e.g., professor, actors, engineer, CPA) |
| religion | Terms indicating religious affiliation (e.g., Hindu, Catholic) |
| drivers_license | Driver’s license number (e.g., DL# 356933-540) |
| banking_information | Banking information, including account and routing numbers |
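
Because redact_pii_policies must contain policy names from the list above, a small client-side check can catch typos before a request is sent. The following is a sketch only: the helper name unknown_policies is hypothetical, and the set simply mirrors the policy names listed in this section, so it would need updating if the supported policies change.

```python
# Policy names copied from the list in this section.
SUPPORTED_POLICIES = {
    "medical_process", "medical_condition", "blood_type", "drug", "injury",
    "number_sequence", "email_address", "date_of_birth", "phone_number",
    "us_social_security_number", "credit_card_number", "credit_card_expiration",
    "credit_card_cvv", "date", "nationality", "event", "language", "location",
    "money_amount", "person_name", "person_age", "organization",
    "political_affiliation", "occupation", "religion", "drivers_license",
    "banking_information",
}


def unknown_policies(policies):
    """Return the requested policy names that are not in the supported list, sorted."""
    return sorted(set(policies) - SUPPORTED_POLICIES)


requested = ["person_name", "phone_nubmer", "location"]  # note the deliberate typo
problems = unknown_policies(requested)
if problems:
    print("Unknown policy names:", problems)
```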

In addition to the redact_pii_policies parameter, users can also use the redact_pii_sub parameter to further customize PII Redaction. This parameter controls what detected PII is replaced with in the transcript text, regardless of which PII policies are being used.

| Value | Description |
| --- | --- |
| redact_pii_sub.hash | PII that is detected is replaced with hash characters (#). For example, in "I'm calling for John", the name John is replaced with ####. (Applied by default) |
| redact_pii_sub.entity_name | PII that is detected is replaced with the associated policy name. For example, John is replaced with [PERSON_NAME]. This is recommended for readability. |
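
To make the difference between the two substitution types concrete, the sketch below mimics them locally on an already-detected span. It is an illustration, not the API's implementation: the function name substitute, the (start, end, policy) span format, the mode strings "hash" and "entity_name", and the one-# per-character behavior are assumptions based on the examples above.

```python
def substitute(text, spans, mode="hash"):
    """Replace each (start, end, policy) span in text according to the substitution mode."""
    # Work from the rightmost span backwards so earlier offsets stay valid.
    for start, end, policy in sorted(spans, reverse=True):
        if mode == "hash":
            replacement = "#" * (end - start)  # assumed: one "#" per redacted character
        elif mode == "entity_name":
            replacement = f"[{policy.upper()}]"
        else:
            raise ValueError(f"unsupported substitution mode: {mode}")
        text = text[:start] + replacement + text[end:]
    return text


text = "I'm calling for John"
start = text.index("John")
spans = [(start, start + len("John"), "person_name")]

print(substitute(text, spans, mode="hash"))         # I'm calling for ####
print(substitute(text, spans, mode="entity_name"))  # I'm calling for [PERSON_NAME]
```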

Create a redacted audio file

In addition to redacting sensitive information from the transcription text, the API can also generate a version of the original audio file with the PII "beeped" out wherever it is spoken. To do so, include the redact_pii_audio parameter in your request and set it to true when submitting files for transcription.

When the transcription is complete, you can retrieve a URL that points to your redacted audio file by making a GET request to the /v2/transcript/<transcript-id>/redacted-audio endpoint.
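
The retrieval request might look like the following sketch. The /v2/transcript/<transcript-id>/redacted-audio path is the one named later in this section; the base URL, the authorization header, and the helper names are assumptions, and the shape of the JSON returned by this endpoint is not documented here, so the function simply returns the parsed body.

```python
import json
import urllib.request

# Assumption: the API host is not stated in this section.
BASE_URL = "https://api.assemblyai.com"


def redacted_audio_endpoint(transcript_id):
    """Build the redacted-audio endpoint URL for a given transcript ID."""
    return f"{BASE_URL}/v2/transcript/{transcript_id}/redacted-audio"


def fetch_redacted_audio_info(transcript_id, api_key):
    """Send the GET request and return the parsed JSON body (requires network access)."""
    request = urllib.request.Request(
        redacted_audio_endpoint(transcript_id),
        headers={"authorization": api_key},  # assumed header name
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


print(redacted_audio_endpoint("YOUR_TRANSCRIPT_ID"))
# info = fetch_redacted_audio_info("YOUR_TRANSCRIPT_ID", "YOUR_API_KEY")
```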

Using webhooks

Webhooks allow you to receive real-time updates about the status of your PII redacted audio file.

If a webhook_url was provided in your request when submitting your audio file for transcription, we'll send a POST request to that URL. Note that if you're using webhooks along with PII audio redaction, you'll receive two webhook calls. The first call is for the redacted audio. The second one comes a few seconds later and is for the completed transcript.

  1. When you receive the request from AssemblyAI, it'll include the following headers.

    content-type: application/json
    content-length: 79
    accept: */*
    accept-encoding: gzip, deflate
    user-agent: python-requests/2.21.0
  2. And the first request body includes the following parameters.

    status: 'redacted_audio_ready'
    redacted_audio_url: 'https://link-to-redacted-audio'

    The status field indicates whether the PII Redaction was completed successfully or if there was an error. The redacted_audio_url field contains a URL to the redacted audio file.

    The second request body includes the following parameters.

    transcript_id: 5552493-16d8-42d8-8feb-c2a16b56f6e8
    status: completed

    The transcript_id field contains the ID of the completed transcription, and the status field indicates whether the transcription was completed successfully or if there was an error.

    The redacted audio URL is accessible for 30 minutes while the redacted audio file itself is available for 24 hours. If you don't access the original URL within 30 minutes, you can make another GET request to the /v2/transcript/<transcript-id>/redacted-audio endpoint to get a new URL as long as it's within 24 hours.
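
Since two webhook calls arrive at the same URL, a handler needs to tell them apart. The sketch below does so using only the body fields shown above (status, redacted_audio_url, transcript_id); the function name classify_webhook and the returned labels are hypothetical, and a real handler would sit behind whatever web framework you use.

```python
def classify_webhook(body):
    """Classify a parsed webhook body as the redacted-audio call or the transcript call."""
    status = body.get("status")
    if status == "redacted_audio_ready":
        # First call: the redacted audio file is ready to download.
        return {"kind": "redacted_audio", "url": body.get("redacted_audio_url")}
    if "transcript_id" in body:
        # Second call: the transcript finished (status reports success or error).
        return {"kind": "transcript", "id": body["transcript_id"], "status": status}
    return {"kind": "unknown"}


first_call = {
    "status": "redacted_audio_ready",
    "redacted_audio_url": "https://link-to-redacted-audio",
}
second_call = {
    "transcript_id": "5552493-16d8-42d8-8feb-c2a16b56f6e8",
    "status": "completed",
}

print(classify_webhook(first_call))
print(classify_webhook(second_call))
```

Because the redacted audio URL is only accessible for 30 minutes, a handler like this would typically download the file as soon as the first call arrives.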

Troubleshooting

Why is the PII not redacted in my transcription?
PII Redaction only redacts words under the text key. When PII Redaction is enabled together with other features such as Entity Detection and Summarization, PII can still show up under the entities or summary keys. Also ensure that the redact_pii_policies parameter is included in your request with the desired policy names. If you're still experiencing issues, please reach out to our support team for assistance.
Why is my webhook not being sent?
There could be several reasons why your webhook isn't being sent, such as a misconfigured URL, an unreachable endpoint, or an issue with the authentication headers. Double-check your request and ensure that the webhook_url parameter is included with a valid URL that can be reached by AssemblyAI's API. If you're using custom authentication headers, ensure that the webhook_auth_header_name and webhook_auth_header_value parameters are included and are correct. If you're still having issues, please contact our support team for assistance.