PII Redaction and Accuracy Improvements

Last month, we introduced PII Redaction Policies as part of a major overhaul of our PII Redaction feature. The overhaul makes the feature more flexible and powerful, so you can specify exactly what you want redacted from your transcriptions.

Enhanced PII (PCI) Redaction: more policies and customization

We’ve now expanded the list of available Redaction Policies to 20, with more on the way before the end of the year. The new policies include credit_card_cvv, credit_card_expiration, organization, nationality, event, and location - the full list is shown below:

[Image: full list of available PII Redaction Policies, from the AssemblyAI docs page "Redact Personally Identifiable Information (PII) from transcript text"]

Using these new policies, you can take even greater control over how you redact the transcriptions produced by our API, in order to comply with your security standards and those of your customers.

Customize how Redacted PII is replaced

By default, any PII that is detected is replaced with a hash - #. For example, the credit card number 1111-2222-3333-4444 is replaced with ####-####-####-####. To make the redaction more user-friendly and readable, the redacted text can now be replaced with the policy name instead. In that case, the credit card number 1111-2222-3333-4444 is replaced with [CREDIT_CARD_NUMBER], and the social security number 111-11-1111 is replaced with [US_SOCIAL_SECURITY_NUMBER].

When you have a lot of redaction policies enabled, this new feature maintains the readability of your transcriptions for your end-users compared to replacing all sensitive information with hash characters.
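To make the difference between the two replacement styles concrete, here is a small local illustration in Python. It does not call the API and is not how AssemblyAI implements redaction. The helper name redact_span is hypothetical, and the rule of keeping separators such as "-" while masking letters and digits is an assumption inferred from the credit card example above.

```python
import re


def redact_span(text, start, end, policy, mode="hash"):
    """Replace text[start:end] in the style the post describes for each mode.

    Hypothetical helper for illustration only; it does not call the API.
    """
    original = text[start:end]
    if mode == "hash":
        # Assumption: separators stay, every letter or digit becomes "#".
        replacement = re.sub(r"[A-Za-z0-9]", "#", original)
    elif mode == "entity_name":
        # The policy name, upper-cased and wrapped in square brackets.
        replacement = "[" + policy.upper() + "]"
    else:
        raise ValueError("mode must be 'hash' or 'entity_name'")
    return text[:start] + replacement + text[end:]


sentence = "My card number is 1111-2222-3333-4444."
start = sentence.index("1111")
end = start + len("1111-2222-3333-4444")

print(redact_span(sentence, start, end, "credit_card_number", "hash"))
# My card number is ####-####-####-####.
print(redact_span(sentence, start, end, "credit_card_number", "entity_name"))
# My card number is [CREDIT_CARD_NUMBER].
```

With several policies enabled, the second style keeps it clear which kind of information was removed at each position.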

To enable this new feature, include the new redact_pii_sub parameter in your POST request. Below are the available options for that parameter and the behavior of each option:

  • Value: hash

    • PII that is detected is replaced with a hash - #. For example, the credit card number 1111-2222-3333-4444 is replaced with ####-####-####-####. (Applied by default)

  • Value: entity_name

    • PII that is detected is replaced with the associated policy name.
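As a rough sketch of how the two options above fit into a request body, the Python below builds the JSON payload without sending it. The parameter names and the two policy names come from this post; the helper build_redaction_payload and its validation are hypothetical and only there for illustration.

```python
import json

# The two values this post lists for redact_pii_sub.
ALLOWED_SUBSTITUTIONS = ("hash", "entity_name")


def build_redaction_payload(audio_url, policies, substitution="hash"):
    """Build a transcript request body with PII redaction enabled.

    Hypothetical helper: it only assembles the dictionary and sends nothing.
    """
    if substitution not in ALLOWED_SUBSTITUTIONS:
        raise ValueError("redact_pii_sub must be one of " + ", ".join(ALLOWED_SUBSTITUTIONS))
    return {
        "audio_url": audio_url,
        "redact_pii": True,
        "redact_pii_policies": list(policies),
        "redact_pii_sub": substitution,
    }


payload = build_redaction_payload(
    "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    ["credit_card_cvv", "credit_card_expiration"],
    substitution="entity_name",
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as the JSON body of the POST request to the transcript endpoint.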

How PII Redaction works in AssemblyAI  

Testing these policies using AssemblyAI’s API only takes a couple of minutes to set up. Using the snippet below, you can run redaction on your own audio or video files:

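import requests

# Transcript endpoint of the AssemblyAI API
endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body: the file to transcribe and the redaction settings
payload = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "redact_pii": True,
    "redact_pii_policies": ["all"]
}

# Replace YOUR-API-TOKEN with your own API token
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}

# Submit the transcription request
response = requests.post(endpoint, json=payload, headers=headers)

# Print the JSON response returned by the API
print(response.json())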

(shown in Python - more languages)

PII Redaction for audio files

These same PII Redaction Policies also apply to audio redaction. We will mute the parts of your audio where PII is spoken, and make a downloadable URL available for the redacted audio file.

To test audio redaction on your files, follow our guide here.
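For intuition about what muting the PII segments means, here is a minimal conceptual sketch in Python. It is not AssemblyAI's implementation and does not use the API. The function mute_ranges, the list-of-samples representation, and the millisecond ranges are all assumptions made for illustration.

```python
def mute_ranges(samples, sample_rate, pii_ranges_ms):
    """Return a copy of `samples` with every PII time range set to silence.

    samples:        mono audio as a list of numeric sample values
    sample_rate:    samples per second
    pii_ranges_ms:  (start_ms, end_ms) pairs where PII is spoken
    """
    muted = list(samples)
    for start_ms, end_ms in pii_ranges_ms:
        # Convert milliseconds to sample indices and clamp to the audio length.
        first = max(0, start_ms * sample_rate // 1000)
        last = min(len(muted), end_ms * sample_rate // 1000)
        for i in range(first, last):
            muted[i] = 0
    return muted


# One second of dummy audio at 10 samples per second, to keep the output readable.
audio = [5] * 10
print(mute_ranges(audio, sample_rate=10, pii_ranges_ms=[(200, 500)]))
# [5, 5, 0, 0, 0, 5, 5, 5, 5, 5]
```

The rest of the recording is left untouched, so the redacted file keeps its original length and timing.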

Improved accuracy: new update to acoustic and language models

We released another set of accuracy updates to our neural network - these include significant improvements to call, video, and podcast content. 

To help benchmark our current model’s accuracy, we’ve included a comparison versus common providers like Google Cloud’s Speech-to-Text (Premium Video Model) and AWS Transcribe below. 

These sample video podcast transcripts (from Joe Rogan’s Podcast) are shown alongside the Word Error Rate (WER), which measures how far an automated transcript deviates from a human transcription of the same audio. A lower WER means higher accuracy.
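For readers unfamiliar with the metric, Word Error Rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the automated transcript into the human reference, divided by the number of words in the reference. The Python sketch below implements that textbook definition. This post does not describe how its own figures were normalized or computed, so the lowercasing and whitespace splitting here are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


human = "the cat sat on the mat"
automated = "the cat sit on mat"
# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(round(word_error_rate(human, automated), 3))
# 0.333
```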

As you can see in the table, we still consistently outperform big tech providers like Google and AWS on these benchmarks. Other providers like Microsoft and IBM were included in our analysis, but recorded the lowest accuracy.

AssemblyAI | WhatConverts Case Study: call tracking transcription

WhatConverts is a call tracking SaaS product that helps customers answer the question, “What marketing works?”

https://www.assemblyai.com/blog/what-converts-call-tracking-software-assembly-ai-case-study

Their platform integrates with all marketing channels (e.g. Google Ads, Facebook Ads, Intercom, etc.) and makes it simple to understand which campaigns are working and which campaigns aren’t delivering leads. Leads are then prioritized, ranked, and managed within the WhatConverts software. 

Read the full case study here >>

With the switch to AssemblyAI, they experienced a significant accuracy improvement, improved security through PII (PCI) Redaction, and more affordable pricing.

WhatConverts also wrote about their switch to AssemblyAI; you can read their update here >>
