Personally Identifiable Information, or PII, is any data that can be used to identify a specific individual. Regulations like HIPAA, GDPR, and CCPA strictly govern how this information is handled, and failure to comply can result in significant fines and a loss of customer trust.
Manually redacting PII from audio or video files, such as support calls or telehealth sessions, is slow, expensive, and prone to human error. Using an AI model to detect and redact PII automatically is the most practical way to stay compliant at scale.
In this tutorial, we'll build a simple Python script to automatically redact a wide range of PII from audio files and their transcripts. The final redacted transcript is shown in Step 5.
PII redaction automatically identifies and removes sensitive information from audio, video, and text files using AI models. This includes names, phone numbers, credit cards, medical data, and other personally identifiable information.
Manual redaction doesn't scale for production applications processing thousands of audio files. AssemblyAI's PII Redaction model detects dozens of types of sensitive data and either replaces it with hashes or entity labels or removes it entirely.
AssemblyAI's PII Redaction model can detect and remove a wide range of sensitive information. You specify which types to redact by enabling different policies:
You can enable any combination of these policies based on your compliance requirements. For a complete list, check our PII policies documentation.
First, create a project directory and navigate into it. Then, create a virtual environment:
Then set your AssemblyAI API Key as an environment variable. You can get an AssemblyAI API key here for free. Our free tier includes access to transcription and all our Speech Understanding models, including PII Redaction. You only need to add a credit card to use our Large Language Model framework, LeMUR.
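As a minimal sketch, the script can check that the key is actually present before making any API calls. The variable name ASSEMBLYAI_API_KEY and the helper load_api_key are assumptions made for this example; use whatever name you exported in your shell.

```python
import os
from typing import Mapping, Optional

# Assumption: the key was exported under this variable name.
API_KEY_ENV_VAR = "ASSEMBLYAI_API_KEY"


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"{API_KEY_ENV_VAR} is not set; export it before running main.py"
        )
    return key
```

If you prefer to set the key explicitly rather than rely on the environment, the returned value can be assigned to aai.settings.api_key at the top of main.py.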
Step 2: Transcribe the file
Now that our environment is set up, we can submit an audio file for transcription. For this tutorial, we'll use a short phone conversation between a man and an architecture firm. First, we'll import the AssemblyAI Python SDK, and then specify the publicly accessible URL of our audio file (you can also use a video file). Create a file called main.py and add the following lines:
import assemblyai as aai
audio_url = "https://storage.googleapis.com/aai-web-samples/architecture-call.mp3"
Next, we need to configure our transcription to redact PII, and set the types of PII we want to redact. We specify this in an aai.TranscriptionConfig. Add the following lines to main.py:
config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_audio=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.phone_number,
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,
)
We enable PII Redaction through redact_pii, and specify that we also want a redacted version of the audio file itself (in addition to the redacted transcript) through redact_pii_audio.
Then, we set the categories, or policies, of PII we want to redact through redact_pii_policies. For a list of all available policies, see PII policies in our docs. Finally, we set how the PII is redacted through redact_pii_sub. In this case, we use hash, so each redacted span in the transcript is replaced with a sequence of hash characters (####). You can find the possible values for this parameter in the API reference for creating a transcript.
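To make the difference between substitution styles concrete, here is a toy illustration in plain Python. It is not how the service works internally: the substitute helper is hypothetical, and the character spans are written by hand rather than produced by a PII detector. It only mimics the shape of the output for the hash and entity_name styles.

```python
import re
from typing import List, Tuple

# (start, end, label) character spans. In a real pipeline these would come
# from a PII detector; here they are hard-coded for illustration.
Span = Tuple[int, int, str]


def substitute(text: str, spans: List[Span], policy: str = "hash") -> str:
    """Replace each span using a hash-style or entity-name-style substitution."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        if policy == "hash":
            # Mask letters and digits, keep punctuation such as dashes.
            out.append(re.sub(r"[A-Za-z0-9]", "#", text[start:end]))
        elif policy == "entity_name":
            out.append(f"[{label.upper()}]")
        else:
            raise ValueError(f"unknown policy: {policy}")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


sentence = "My name is John. Call 610-265-1714."
spans = [(11, 15, "person_name"), (22, 34, "phone_number")]

print(substitute(sentence, spans, "hash"))
print(substitute(sentence, spans, "entity_name"))
```

The hash style hides the value but keeps its length and punctuation, while the entity-name style keeps the type of information that was removed, which can help downstream readers follow the conversation.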
Now we can transcribe the audio file by using the transcribe method of an aai.Transcriber, passing in the configuration we just created. Add the following line to main.py:
transcript = aai.Transcriber().transcribe(audio_url, config)
Step 3: Print the redacted transcript
When the program executes, the transcribe method submits the file for transcription. The resulting transcript is available through transcript.text. Add a line to main.py to print out the redacted transcript when we execute the program:
print(transcript.text, '\n\n')
Step 4: Fetch the redacted audio files
You can print the URL of the redacted audio file by adding the line below to main.py:
print(transcript.get_redacted_audio_url())
Step 5: Run the program
In a terminal (with your AssemblyAI API key set as an environment variable, as described during setup), execute python main.py or python3 main.py to run the program. After a few moments, the redacted transcript will be printed:
Good afternoon, MGK design. Hi. I'm looking to have plans drawn up for an addition in my house. Okay, let me have one of our architects return your call. May I have your name, please? My name is ####. ####. And your last name? My last name is #####. Would you spell that for me, please? # # # # # #. Okay, and your telephone number? Area code? ###-###-#### that's ###-###-#### yes, ma'am. Is there a good time to reach you? That's my cell, so he could catch me anytime on that. Okay, great. I'll have him return your call as soon as possible. Great. Thank you very much. You're welcome. Bye.
As we can see, the specified PII types were redacted in the transcript. You can see the unredacted transcript below for comparison:
Good afternoon, MGK design. Hi. I'm looking to have plans drawn up for an addition in my house. Okay, let me have one of our architects return your call. May I have your name, please? My name is John. John. And your last name? My last name is Lowry. Would you spell that for me, please? L o w e r y. Okay, and your telephone number? Area code? 610-265-1714 that's 610-265-1714 yes, ma'am. Is there a good time to reach you? That's my cell, so he could catch me anytime on that. Okay, great. I'll have him return your call as soon as possible. Great. Thank you very much. You're welcome. Bye.
Finally, the URL of your redacted audio file will be printed to the console. Go to the URL, and you'll be able to download the redacted audio file:
Note that a redacted audio file will be returned, even if you submitted a video file for transcription. In this case, you can use a tool like FFmpeg to replace the original audio in the file with the redacted version.
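One way to do this from Python is to build the FFmpeg command and run it with subprocess. This is a sketch under assumptions: the file names are placeholders, FFmpeg must already be installed and on your PATH, and the redacted audio is assumed to have been downloaded locally first.

```python
import subprocess
from typing import List


def build_mux_command(video_in: str, redacted_audio: str, video_out: str) -> List[str]:
    """Build an FFmpeg command that swaps a video's audio track."""
    return [
        "ffmpeg",
        "-i", video_in,        # input 0: original video
        "-i", redacted_audio,  # input 1: redacted audio
        "-map", "0:v:0",       # take the video stream from input 0
        "-map", "1:a:0",       # take the audio stream from input 1
        "-c:v", "copy",        # copy the video stream without re-encoding
        "-shortest",           # stop at the shorter of the two streams
        video_out,
    ]


def replace_audio(video_in: str, redacted_audio: str, video_out: str) -> None:
    """Run FFmpeg; raises CalledProcessError if FFmpeg exits with an error."""
    subprocess.run(build_mux_command(video_in, redacted_audio, video_out), check=True)
```

Calling replace_audio("call.mp4", "call_redacted.mp3", "call_redacted.mp4") would write a new video file and leave the original untouched.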
Optionally, you can check out the code in compare.py in the project repository to print the differences between the unredacted and redacted versions of the transcript.
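If you just want a quick comparison without the repository, a word-level diff can be sketched with the standard library's difflib. This is not the repository's compare.py; the redacted_tokens helper is a hypothetical name, and the sample strings are excerpts from the transcripts above.

```python
import difflib
from typing import List, Tuple


def redacted_tokens(original: str, redacted: str) -> List[Tuple[str, str]]:
    """Return (original, redacted) word-run pairs where the transcripts differ."""
    a, b = original.split(), redacted.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


original = "My name is John. John. And your last name? My last name is Lowry."
redacted = "My name is ####. ####. And your last name? My last name is #####."

for before, after in redacted_tokens(original, redacted):
    print(f"{before!r} -> {after!r}")
```

Keep in mind that a diff like this needs the unredacted transcript, so it belongs in a controlled test environment rather than in the production pipeline.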
Handle errors and validate redaction results
Production applications require robust error handling for PII redaction failures. Common issues include:
- Invalid API keys
- Network timeouts
- Transcription failures
- File format errors
Wrap API calls in try/except blocks:
import assemblyai as aai

audio_url = "https://storage.googleapis.com/aai-web-samples/architecture-call.mp3"

config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_audio=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.phone_number,
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,
)

try:
    transcript = aai.Transcriber().transcribe(audio_url, config)
    if transcript.status == aai.TranscriptStatus.error:
        print(f"Transcription failed: {transcript.error}")
    else:
        print(transcript.text)
        print(f"\nRedacted audio URL: {transcript.get_redacted_audio_url()}")
except Exception as e:
    print(f"An error occurred: {e}")
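For transient problems such as network timeouts, a retry with exponential backoff is a common complement to the try/except above. The helper below is a generic sketch: with_retries is a hypothetical name, and the choice of TimeoutError and ConnectionError as retryable errors is an assumption, since the exact exception types raised depend on the SDK and your network stack.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

You could wrap the transcription call as with_retries(lambda: aai.Transcriber().transcribe(audio_url, config)). Errors that are not transient, such as an invalid API key, are better reported immediately than retried.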
Validate redaction completeness
You can access detailed information about what was redacted through the pii_redaction property:
if transcript.pii_redaction:
    print("\nRedaction summary:")
    print(f"Total PII instances found: {len(transcript.pii_redaction)}")

    # Count by PII type
    pii_counts = {}
    for pii in transcript.pii_redaction:
        entity_type = pii.entity_type
        pii_counts[entity_type] = pii_counts.get(entity_type, 0) + 1

    for entity_type, count in pii_counts.items():
        print(f"  {entity_type}: {count} instances")
This validation helps you confirm that the redaction is working as expected and provides an audit trail for compliance purposes.
Optimize performance and manage costs
When processing audio at scale, performance optimization becomes critical. Here are key strategies to maximize efficiency while managing costs.
File size and format considerations
AssemblyAI supports most common audio and video formats, but preprocessing your files can improve processing speed:
# For large files, consider using async transcription
import asyncio
import assemblyai as aai

async def transcribe_large_file(audio_url):
    config = aai.TranscriptionConfig(
        redact_pii=True,
        redact_pii_audio=True,
        redact_pii_policies=[
            aai.PIIRedactionPolicy.person_name,
            aai.PIIRedactionPolicy.phone_number,
            aai.PIIRedactionPolicy.email_address,
        ],
    )
    transcriber = aai.Transcriber()
    # transcribe_async returns a concurrent.futures.Future, so wrap it
    # before awaiting it inside a coroutine
    future = transcriber.transcribe_async(audio_url, config)
    transcript = await asyncio.wrap_future(future)
    return transcript
Choose the right redaction strategy
Choose your substitution policy based on compliance requirements and data utility needs:
| Policy | Output | Best For | Technical Impact |
|---|---|---|---|
| hash | My name is #### | Maximum privacy | Smallest file size |
| entity_name | My name is [PERSON_NAME] | Context preservation | Larger file size |
Cost optimization tips
Building custom PII models requires 6-12 months of development plus ongoing maintenance. AssemblyAI's models achieve 95%+ accuracy across 15+ PII types with zero setup time.
Optimize API usage with these strategies:
- Selective policies: Use redact_pii_policies=[aai.PIIRedactionPolicy.person_name] instead of enabling all policies
- Batch processing: Submit multiple files with a single SDK call, transcriber.transcribe_group(urls, config)
- Caching: Store results in Redis or a database so that repeated files are not reprocessed
- File preprocessing: Convert to optimal format before API calls
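The caching tip can be sketched as follows. This is a minimal in-memory version under assumptions: cache_key and redact_with_cache are hypothetical helpers, the cache is a plain dict standing in for Redis or a database, and redact is any function you supply that performs the actual API call.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


def cache_key(audio_url: str, policies: List[str], substitution: str) -> str:
    """Derive a stable key from the file URL and the redaction settings."""
    payload = json.dumps(
        {"url": audio_url, "policies": sorted(policies), "sub": substitution},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def redact_with_cache(
    cache: Dict[str, Any],
    audio_url: str,
    policies: List[str],
    substitution: str,
    redact: Callable[[str], Any],
) -> Any:
    """Return a cached result if present; otherwise call redact() and store it."""
    key = cache_key(audio_url, policies, substitution)
    if key not in cache:
        cache[key] = redact(audio_url)
    return cache[key]
```

Including the policies and substitution style in the key matters: the same file redacted with different settings produces a different result, so it should not share a cache entry. Keying on the URL also assumes the file behind a URL does not change.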
Process multiple files with batch redaction
In production environments, you'll often need to process multiple files. Here's how to efficiently handle batch processing with proper error handling for each file.
Basic batch processing
For batching a list of URLs, the most efficient method is to use transcribe_group(). This submits all files for processing at once and waits for them to complete.
import assemblyai as aai
from typing import Any, Dict, List

def batch_redact_files(audio_urls: List[str]) -> Dict[str, Any]:
    """
    Process multiple audio files for PII redaction using transcribe_group.
    Returns a dictionary mapping URLs to their transcripts or errors.
    """
    config = aai.TranscriptionConfig(
        redact_pii=True,
        redact_pii_audio=True,
        redact_pii_policies=[
            aai.PIIRedactionPolicy.person_name,
            aai.PIIRedactionPolicy.phone_number,
            aai.PIIRedactionPolicy.email_address,
            aai.PIIRedactionPolicy.credit_card_number,
        ],
        redact_pii_sub=aai.PIISubstitutionPolicy.hash,
    )
    transcriber = aai.Transcriber()
    transcript_group = transcriber.transcribe_group(audio_urls, config)

    results = {}
    for transcript in transcript_group:
        if transcript.status == aai.TranscriptStatus.error:
            results[transcript.audio_url] = {'error': transcript.error}
        else:
            results[transcript.audio_url] = {
                'text': transcript.text,
                'redacted_audio_url': transcript.get_redacted_audio_url(),
                'pii_count': len(transcript.pii_redaction) if transcript.pii_redaction else 0
            }
    return results

# Example usage
audio_files = [
    "https://storage.googleapis.com/aai-web-samples/architecture-call.mp3",
    "https://storage.googleapis.com/aai-web-samples/meeting_chunk_1.mp3",
    "https://storage.googleapis.com/aai-web-samples/meeting_chunk_2.mp3",
]
results = batch_redact_files(audio_files)
for url, result in results.items():
    print(f"\n{url}:")
    if 'error' in result:
        print(f"  Error: {result['error']}")
    else:
        print(f"  PII instances found: {result['pii_count']}")
        print(f"  Redacted audio: {result['redacted_audio_url']}")
Async batch processing for better performance
For improved performance when processing many files, use async processing:
import asyncio
import assemblyai as aai
from typing import List

async def async_batch_redact(audio_urls: List[str]):
    """
    Process multiple files concurrently for better performance.
    For batching a list of URLs, `transcribe_group` is a simpler alternative.
    This `asyncio.gather` pattern is useful for more complex async workflows.
    """
    config = aai.TranscriptionConfig(
        redact_pii=True,
        redact_pii_audio=True,
        redact_pii_policies=[
            aai.PIIRedactionPolicy.person_name,
            aai.PIIRedactionPolicy.phone_number,
            aai.PIIRedactionPolicy.email_address,
        ],
    )
    transcriber = aai.Transcriber()

    # Create tasks for concurrent processing. transcribe_async returns a
    # concurrent.futures.Future, so wrap each one to make it awaitable.
    tasks = []
    for url in audio_urls:
        future = transcriber.transcribe_async(url, config)
        tasks.append(asyncio.wrap_future(future))

    # Wait for all transcriptions to complete
    transcripts = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for url, transcript in zip(audio_urls, transcripts):
        if isinstance(transcript, Exception):
            print(f"Error processing {url}: {transcript}")
        elif transcript.status == aai.TranscriptStatus.error:
            print(f"Transcription failed for {url}: {transcript.error}")
        else:
            print(f"Successfully redacted {url}")
            print(f"  Redacted text preview: {transcript.text[:100]}...")

# Run the async function
audio_files = [
    "https://storage.googleapis.com/aai-web-samples/architecture-call.mp3",
    "https://storage.googleapis.com/aai-web-samples/meeting_chunk_1.mp3",
]
asyncio.run(async_batch_redact(audio_files))
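When the list of files is long, you may not want every request in flight at once. A semaphore is one way to cap concurrency. The sketch below is generic asyncio code: gather_limited is a hypothetical helper, worker stands for any coroutine function that processes one URL, and the default limit of 5 is an arbitrary assumption rather than a documented account limit.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def gather_limited(
    items: List[str],
    worker: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 5,
) -> List[Any]:
    """Run worker(item) for every item, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: str) -> Any:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(item)

    # Results come back in input order; failures are returned as exceptions.
    return await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )
```

Because results are returned in the same order as the input list, you can still zip them with the original URLs when reporting successes and failures.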
Build secure and compliant applications
In this tutorial, we built a Python script to automatically find and redact sensitive information in audio files and transcripts. This is a critical capability for any application handling user data, ensuring you remain compliant with privacy regulations like HIPAA and GDPR while protecting your users.
PII redaction is just one part of building secure applications. AssemblyAI provides additional security features including:
- SOC 2 Type 2 certification for enterprise-grade security
- HIPAA compliance with Business Associate Agreements available
- End-to-end encryption for all data in transit and at rest
- Data retention controls to automatically delete your data after processing
With PII redaction handled by a reliable API, you can focus on building your application's core features instead of maintaining complex AI infrastructure. Our models are continuously improved by our research team, ensuring you always have access to the latest advances in Voice AI technology.
Ready to add PII redaction to your own application? Try our API for free and see how easy it is to build compliant voice applications.
Frequently asked questions about PII redaction implementation
How do I handle files that exceed the size limits?
Split large files using FFmpeg or use our Streaming Speech-to-Text endpoint for real-time processing without size limits.
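As a rough sketch of the FFmpeg approach, the command below uses FFmpeg's segment muxer to cut a file into fixed-length chunks. The helper name, file names, and 10-minute chunk length are placeholders chosen for this example, and FFmpeg must be installed separately.

```python
import subprocess
from typing import List


def build_split_command(audio_in: str, out_pattern: str, chunk_seconds: int = 600) -> List[str]:
    """Build an FFmpeg command that cuts a file into fixed-length chunks."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    return [
        "ffmpeg",
        "-i", audio_in,
        "-f", "segment",                      # use the segment muxer
        "-segment_time", str(chunk_seconds),  # target chunk length in seconds
        "-c", "copy",                         # copy the stream without re-encoding
        out_pattern,                          # numbered outputs, e.g. chunk_%03d.mp3
    ]


def split_file(audio_in: str, out_pattern: str, chunk_seconds: int = 600) -> None:
    """Run FFmpeg; raises CalledProcessError if FFmpeg exits with an error."""
    subprocess.run(build_split_command(audio_in, out_pattern, chunk_seconds), check=True)
```

Fixed-length cuts can land in the middle of a word or a phone number, so it is worth spot-checking the redaction around chunk boundaries.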
What should I do if PII detection confidence is low?
Enable multiple related PII policies and use the word boost feature for domain-specific terms. For critical applications, implement manual review for low-confidence results.
How can I customize redaction for domain-specific PII types?
While AssemblyAI provides comprehensive PII policies out of the box, you can enhance detection for your specific domain by combining multiple policies (for example, using medical_information for healthcare applications), using the entity name substitution to maintain context while redacting, and post-processing the transcript with your own regex patterns for industry-specific identifiers. You can also use our keyterms_prompt feature to improve recognition of domain-specific terms before redaction.
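The regex post-processing step might look like the sketch below. The identifier formats (MBR-123456, CASE-2024-001) and labels are invented for illustration; replace them with the formats your organization actually uses.

```python
import re
from typing import Dict

# Hypothetical identifier formats, invented for this example only.
CUSTOM_PATTERNS: Dict[str, re.Pattern] = {
    "MEMBER_ID": re.compile(r"\bMBR-\d{6}\b"),
    "CASE_NUMBER": re.compile(r"\bCASE-\d{4}-\d{3}\b"),
}


def redact_custom(text: str, patterns: Dict[str, re.Pattern] = CUSTOM_PATTERNS) -> str:
    """Replace every match of each pattern with its bracketed label."""
    for label, pattern in patterns.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact_custom("Member MBR-123456 opened CASE-2024-001 today."))
```

Note that this only changes the transcript text. Identifiers caught this way are still audible in the redacted audio file, so treat it as a supplement to the built-in policies rather than a replacement.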
What are the performance differences between audio and text-only redaction?
When you enable redact_pii_audio=True, the API generates a new audio file with the PII portions replaced by silence or tones, which takes additional processing time. Text-only redaction is faster since it only modifies the transcript. For most use cases, the difference is minimal, but if you're processing large volumes and only need redacted transcripts, disabling audio redaction can improve throughput.
How do I validate that all sensitive information was successfully redacted?
Validation is crucial for compliance. You can verify redaction completeness by checking the pii_redaction property of the transcript object to see all detected PII instances, implementing spot checks by comparing samples of original and redacted transcripts, setting up alerts when no PII is detected in files that typically contain sensitive information, and maintaining audit logs of all redaction operations including timestamps and PII types found. For high-stakes applications, consider implementing a two-pass approach where you run the redacted output through a separate PII detection system to verify completeness.
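A lightweight version of the second pass and the audit log could look like this. It is a sketch under assumptions: the phone-number regex covers only one common US-style format and stands in for a real secondary PII detector, and the helper names, record fields, and file_id are illustrative rather than part of any API.

```python
import json
import re
from datetime import datetime, timezone
from typing import Dict, List

# Assumption: a simple US-style phone pattern as a stand-in second-pass check.
PHONE_LIKE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def residual_phone_numbers(redacted_text: str) -> List[str]:
    """Return phone-like strings that survived redaction."""
    return PHONE_LIKE.findall(redacted_text)


def audit_record(file_id: str, pii_counts: Dict[str, int], redacted_text: str) -> str:
    """Build one JSON audit-log line for a redaction run."""
    record = {
        "file_id": file_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pii_counts": pii_counts,
        # Flag files where nothing was detected so they can be reviewed.
        "no_pii_detected": sum(pii_counts.values()) == 0,
        "residual_phone_like": len(residual_phone_numbers(redacted_text)),
    }
    return json.dumps(record, sort_keys=True)
```

Each record stores counts and flags only, never the PII itself, so the audit log does not become a new place where sensitive data accumulates.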