In today’s connected, data-driven world, privacy around sensitive and confidential data is a top concern for many companies. But, given the wealth of data and documents available, manually redacting all sensitive data just isn’t feasible. That’s why many companies are turning to advanced AI models to remove or redact such confidential or sensitive data automatically.
In the past few years, significant advances in Deep Learning research have pushed the AI models powering tasks like PII Redaction to new levels of accessibility, accuracy, and availability. Product teams in industries such as call tracking, conversational intelligence, hiring software, and more are looking to integrate this advanced technology into their platforms.
This article will explore how AI models are being used to build Personally Identifiable Information (PII) redaction tools. First, it will explain what PII redaction is before comparing some of the best PII Redaction APIs and AI Models on the market today. Finally, it will explore some of the top use cases for industries and product teams.
What is PII Redaction?
PII, or Personally Identifiable Information, refers either to information that could be used as an identity marker for an individual, or to sensitive or confidential information associated with an individual. Typically, this includes information such as street addresses, phone numbers, credit card numbers, social security numbers, and birth dates. It could also include information related to religion, medical conditions, nationality, and more.
Here is an example of commonly redacted PII:
PII Redaction models typically can be applied to any database or data flow, such as online documents, hiring paperwork, or even video/audio that has been transcribed with a Speech Transcription API.
Why is PII Redaction Important?
Many companies must meet internal compliance requirements and/or external compliance regulations such as GDPR, CPRA, and HIPAA. Failure to redact or exclude sensitive data as outlined in these regulations could result in penalties, fines, or criminal charges. PII Redaction APIs, backed by cutting-edge AI models, can help companies meet these requirements by automatically and accurately redacting each piece of data.
How Does PII Redaction Work?
PII Redaction models are typically designed as a three-step process. First, the model must identify the desired entities in the text. An entity could be a personal name, address, medical condition, or other specified information. Then, the model must classify the entities that are identified into a broader category, such as credit card number. Finally, the model will use this classification to determine if the entity needs to be redacted, and if it does, replace the entity with a # for each redacted character. PII Redaction models can also be designed to replace the redacted characters with synthetic PII to minimize re-identification risks if data security is a concern.
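The three steps above can be sketched in a few lines of Python. This is only an illustration: the regular expressions, the category names, and the helper names (PATTERNS, find_entities, redact) are assumptions made for this example, and production systems rely on trained models rather than hand-written patterns.

```python
import re
from dataclasses import dataclass

# Hypothetical, simplified patterns used only for illustration.
PATTERNS = {
    "CREDIT_CARD_NUMBER": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "US_SOCIAL_SECURITY_NUMBER": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


@dataclass(frozen=True)
class Entity:
    start: int
    end: int
    category: str


def find_entities(text):
    """Steps 1 and 2: locate entity spans and label each with a category."""
    entities = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(match.start(), match.end(), category))
    return sorted(entities, key=lambda e: e.start)


def redact(text, policy, use_label=False):
    """Step 3: replace every entity whose category appears in `policy`."""
    pieces = []
    cursor = 0
    for ent in find_entities(text):
        # Skip categories the caller did not ask for, and overlapping spans.
        if ent.category not in policy or ent.start < cursor:
            continue
        pieces.append(text[cursor:ent.start])
        if use_label:
            pieces.append(f"[{ent.category}]")
        else:
            pieces.append("#" * (ent.end - ent.start))  # one # per character
        cursor = ent.end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    sample = "Card 4111 1111 1111 1111, SSN 123-45-6789."
    print(redact(sample, {"CREDIT_CARD_NUMBER"}))
    print(redact(sample, {"US_SOCIAL_SECURITY_NUMBER"}, use_label=True))
```

Keeping the policy separate from detection mirrors the description above: the model finds and classifies everything it can, and the caller decides which categories actually get redacted.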
Two main model designs are used to achieve the above outcomes: Ontological and Deep Learning.
Ontological PII Redaction models use a knowledge-based recognition process. This approach requires a list of relevant datasets that the model uses when making inferences. Because the datasets are static, accuracy can vary based on how well the dataset matches the input text. However, ontological PII Redaction models can work well for jargon-heavy industries such as medicine and science.
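A knowledge-based approach can be pictured as a lookup against a static term list. The sketch below is a deliberately minimal illustration, not how any particular vendor implements it; the KNOWLEDGE_BASE contents and the function names are hypothetical.

```python
import re

# Hypothetical static knowledge base mapping categories to known terms.
KNOWLEDGE_BASE = {
    "MEDICAL_CONDITION": ["type 2 diabetes", "asthma", "hypertension"],
    "DRUG": ["metformin", "albuterol"],
}


def build_matcher(knowledge_base):
    """Compile one case-insensitive pattern covering every known term."""
    term_to_category = {
        term.lower(): category
        for category, terms in knowledge_base.items()
        for term in terms
    }
    # Longest terms first so multi-word entries win over shorter ones.
    ordered = sorted(term_to_category, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return pattern, term_to_category


def redact_known_terms(text, knowledge_base=KNOWLEDGE_BASE):
    """Replace each known term with its category label."""
    pattern, term_to_category = build_matcher(knowledge_base)
    return pattern.sub(
        lambda m: f"[{term_to_category[m.group(0).lower()]}]", text
    )


if __name__ == "__main__":
    note = "Patient has Asthma and eczema, and takes Metformin."
    print(redact_known_terms(note))
```

In this sketch "eczema" passes through untouched because it is not in the term list, which is the static-dataset limitation described above.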
Deep Learning PII Redaction models use trained neural networks to make inferences. These networks, which can contain thousands, millions, or billions of parameters, are trained on large text datasets to learn the semantic and syntactic relationships between words and phrases in an input text. This extensive training typically boosts model accuracy significantly, which is why Deep Learning is generally the preferred approach to PII Redaction.
Note that input texts could be anything from online documents to textual databases to transcription texts from audio/video files.
Best APIs for PII Redaction
Now that we have an understanding of what PII Redaction is and how PII Redaction models work, we’ll explore five of the best PII Redaction models on the market today for text, audio, and video.
1. AssemblyAI’s PII Redaction API
AssemblyAI is an API platform for State-of-the-Art AI models. The company researches and develops cutting-edge AI models that help product teams better understand text, audio, and video. In addition to offering industry-leading speech transcription accuracy, its current suite of Audio Intelligence APIs includes PII Redaction, Content Moderation, Text Summarization, Sentiment Analysis, Entity Detection, and more.
AssemblyAI’s PII Redaction API helps product teams and developers build tools that redact sensitive or confidential information in a transcription text. With the API, product teams can customize which types of sensitive entities to redact in order to best fit PII redaction to their specific use case. Additionally, developers can decide how the redacted characters are displayed, either with a # or with the specified entity_type such as PERSON_NAME. AssemblyAI’s PII Redaction API can also return the original audio file with the PII “beeped” out when spoken.
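As a rough illustration of the options described above, the sketch below builds the JSON body for a transcription request with redaction enabled. The field names (redact_pii, redact_pii_policies, redact_pii_sub, redact_pii_audio), the policy names, and the substitution values are assumptions based on AssemblyAI's public documentation at the time of writing and should be checked against the current API reference. The audio URL is a placeholder, the helper name is hypothetical, and no request is actually sent.

```python
import json

# Assumed substitution modes: "hash" for # characters,
# "entity_name" for labels such as [PERSON_NAME].
ALLOWED_SUBSTITUTIONS = ("hash", "entity_name")


def build_redaction_request(audio_url, policies, substitution="hash",
                            redact_audio=False):
    """Build a request body for a transcript with PII redaction turned on.

    Field names are assumptions; verify them against the API reference.
    """
    if substitution not in ALLOWED_SUBSTITUTIONS:
        raise ValueError(f"substitution must be one of {ALLOWED_SUBSTITUTIONS}")
    return {
        "audio_url": audio_url,
        "redact_pii": True,
        "redact_pii_policies": list(policies),  # entity types to redact
        "redact_pii_sub": substitution,         # how redacted text is shown
        "redact_pii_audio": redact_audio,       # also request beeped audio
    }


if __name__ == "__main__":
    body = build_redaction_request(
        "https://example.com/call.mp3",  # placeholder URL
        ["person_name", "credit_card_number"],
        substitution="entity_name",
        redact_audio=True,
    )
    print(json.dumps(body, indent=2))
```

The resulting body would be sent, together with an API key, to the platform's transcription endpoint; that step is omitted here so the example runs without credentials.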
Pricing for bulk use begins at $0.00083 per second for its Audio Intelligence APIs, in addition to Core Transcription pricing. Those with smaller usage needs can also take advantage of the API’s free usage tier.
2. Amazon Transcribe’s PII Redaction
Amazon Transcribe offers PII Redaction for English-only text and audio/video streams. It also offers PII Redaction of real-time streams. PII that can be redacted includes bank account and routing numbers, credit card numbers, CVV codes, expiration dates, PINs, email addresses, U.S. mailing addresses, and social security numbers. Its redaction feature does not meet the requirements for de-identification under medical privacy laws such as HIPAA.
Pricing for Amazon Transcribe can be complex, but interested users can see the breakdown or use the pricing calculator here.
3. Super.ai Redact
Super.ai’s Redact API supports image, video, and document redaction. PII that can be redacted includes dates, vehicle identification numbers, license plate numbers, phone numbers, and more. Instead of replacing each redacted character with a #, the API airbrushes out each character or replaces the characters with pseudonyms.
Those looking for further information about pricing or how the API works can set up time for a demo here.
4. Azure PII Redaction
Azure PII Redaction lets users with an Azure account and the Visual Studio IDE automatically redact sensitive information in texts. Azure has a separate API that can also be used to detect and redact PII in conversations. Currently, Azure only supports PII Redaction in English. The API can redact confidential information such as routing numbers, SWIFT codes, dates, times, IP addresses, mailing addresses, and more. See the full list of entities that can be redacted here.
5. Private AI
Private AI lets users identify, redact, or replace PII in documents, images, audio, and video. Its API supports redaction across 39 languages and supports HIPAA, CPRA, and GDPR compliance requirements. Users can choose to redact confidential entities with a series of # or replace each entity with synthetic data for more stringent security needs. Private AI can also blur out faces or bleep out mentions of sensitive data in videos.
Interested users can contact Private AI for more information about usage and pricing tiers.
PII Redaction Use Cases
PII Redaction is used across a wide range of industries and use cases.
Product teams at leading Customer Research Platforms are integrating PII Redaction tools in order to redact confidential or identifiable information when publishing user data online. They also use PII Redaction to remove sensitive data in customer surveys or responses prior to use for analysis.
Hiring Intelligence Platforms are using PII Redaction APIs to help build Ethical Hiring tools. When administering audio or video interviews at scale, PII redaction can be used to automatically remove attributes of potential bias, such as an interviewee’s age or gender, to create a fairer interview process. It can also be used to remove from the transcription any confidential information that may be mentioned.
Product teams at Call Tracking Solutions are building PII Redaction tools to remove or redact PII from customer/agent phone calls and chats. This helps their end users meet privacy compliance requirements, laws, and/or regulations such as GDPR and HIPAA.