Redact PII from Text Using LeMUR

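This guide shows how to use AssemblyAI’s LeMUR framework to redact personally identifiable information (PII) from text. The example transcribes an audio file, asks LeMUR to list the named entities in each sentence, and replaces every match with # characters.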

Quickstart

import assemblyai as aai
import json

aai.settings.api_key = 'YOUR API KEY'

def generate_ner(transcript_text):
    # Ask LeMUR to return the named entities in the text as a JSON object.
    prompt = '''
    You will be given a transcript of a conversation or text. Your task is to generate named entities from the given transcript text.

    Please identify and extract the following named entities from the transcript:

    1. Person names
    2. Organization names
    3. Email addresses
    4. Phone numbers
    5. Full addresses

    When extracting these entities, make sure to return the exact spelling and formatting as they appear in the transcript. Do not modify or standardize the entities in any way.

    Present your results in a JSON format with a single field named "named_entities". This field should contain an array of strings, where each string is a named entity you've identified. For example:
    {
      "named_entities": ["John Doe", "Acme Corp", "john.doe@example.com", "123-456-7890", "123 Main St, Anytown, USA 12345"]
    }

    Important: Do not include any other information, explanations, or text in your response. Your output should consist solely of the JSON object containing the named entities.

    If you do not find any named entities of a particular type, simply return an empty array for the "named_entities" field.
    '''

    response = aai.Lemur().task(
        prompt=prompt,
        input_text=transcript_text,
        max_output_size=4000,
        temperature=0.0,
        final_model=aai.LemurModel.claude3_5_sonnet
    ).response

    # Fall back to an empty entity list if the response is not valid JSON.
    try:
        res_json = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        res_json = {'named_entities': []}

    named_entities = res_json.get('named_entities', [])

    return named_entities

transcriber = aai.Transcriber(config=aai.TranscriptionConfig(language_code='en'))
transcript = transcriber.transcribe('YOUR_AUDIO_URL')

redacted_transcript = ''

for sentence in transcript.get_sentences():
    # Extract the entities for this sentence only.
    generated_entities = generate_ner(sentence.text)

    redacted_sentence = sentence.text

    # Mask each entity with '#' characters of the same length.
    for entity in generated_entities:
        redacted_sentence = redacted_sentence.replace(entity, '#' * len(entity))

    redacted_transcript += redacted_sentence + ' '
    print(redacted_sentence)

print('Full redacted transcript:')
print(redacted_transcript)

Get Started

Before we begin, make sure you have an AssemblyAI account and an API key. You can sign up for an account and get your API key from your dashboard.

For information about LeMUR pricing, see our pricing page.

Step-by-Step Instructions

Install the SDK.

pip install assemblyai

Import the assemblyai package and set your API key.

import assemblyai as aai
import json

aai.settings.api_key = 'YOUR API KEY'
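Hard-coding the key is fine for a quick test, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead follows. The helper name load_api_key and the variable name ASSEMBLYAI_API_KEY are assumptions chosen for illustration and are not part of the original guide.

```python
import os


def load_api_key(env=None, name="ASSEMBLYAI_API_KEY"):
    """Return the API key stored in the environment variable `name`.

    Both the helper and the variable name are hypothetical choices for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your API key")
    return key


# Usage (assumes the variable is set in your shell):
# aai.settings.api_key = load_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error from the first API request.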

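Define a function generate_ner that uses LeMUR to identify named entities (person names, organizations, emails, phone numbers, addresses) in a given text. The function returns the entities as a list of strings, or an empty list if LeMUR's response cannot be parsed as JSON.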

def generate_ner(transcript_text):
    # Ask LeMUR to return the named entities in the text as a JSON object.
    prompt = '''
    You will be given a transcript of a conversation or text. Your task is to generate named entities from the given transcript text.

    Please identify and extract the following named entities from the transcript:

    1. Person names
    2. Organization names
    3. Email addresses
    4. Phone numbers
    5. Full addresses

    When extracting these entities, make sure to return the exact spelling and formatting as they appear in the transcript. Do not modify or standardize the entities in any way.

    Present your results in a JSON format with a single field named "named_entities". This field should contain an array of strings, where each string is a named entity you've identified. For example:
    {
      "named_entities": ["John Doe", "Acme Corp", "john.doe@example.com", "123-456-7890", "123 Main St, Anytown, USA 12345"]
    }

    Important: Do not include any other information, explanations, or text in your response. Your output should consist solely of the JSON object containing the named entities.

    If you do not find any named entities of a particular type, simply return an empty array for the "named_entities" field.
    '''

    response = aai.Lemur().task(
        prompt=prompt,
        input_text=transcript_text,
        max_output_size=4000,
        temperature=0.0,
        final_model=aai.LemurModel.claude3_5_sonnet
    ).response

    # Fall back to an empty entity list if the response is not valid JSON.
    try:
        res_json = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        res_json = {'named_entities': []}

    named_entities = res_json.get('named_entities', [])

    return named_entities
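Even with a strict prompt, a model can occasionally wrap its JSON in a code fence or add a sentence around it, in which case json.loads fails and every entity in that sentence is silently left unredacted. The following is a sketch of a more tolerant parser, not part of the original guide. The name parse_named_entities is hypothetical, and the function works on the plain response string, so it needs no SDK calls.

```python
import json


def parse_named_entities(response):
    """Extract the "named_entities" list from a model response string.

    Tolerates text around the JSON object (for example a Markdown code fence)
    and returns an empty list when nothing usable is found.
    """
    # Take the span from the first "{" to the last "}" and try to parse it.
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return []
    entities = data.get("named_entities", [])
    if not isinstance(entities, list):
        return []
    # Keep only non-empty strings so later str.replace calls cannot fail.
    return [e for e in entities if isinstance(e, str) and e]
```

Inside generate_ner, the try/except block could then be replaced with a single call such as named_entities = parse_named_entities(response).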

Transcribe an audio file using the AssemblyAI Transcriber. Replace YOUR_AUDIO_URL with the URL or local path of your audio file.

transcriber = aai.Transcriber(config=aai.TranscriptionConfig(language_code='en'))
transcript = transcriber.transcribe('YOUR_AUDIO_URL')

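Iterate through each sentence in the transcript, identify named entities using generate_ner, and replace them with # characters. Each entity is replaced by a run of # characters of the same length, so the redacted sentence keeps its original length.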

redacted_transcript = ''

for sentence in transcript.get_sentences():
    generated_entities = generate_ner(sentence.text)

    redacted_sentence = sentence.text

    for entity in generated_entities:
        redacted_sentence = redacted_sentence.replace(entity, '#' * len(entity))

    redacted_transcript += redacted_sentence + ' '
    print(redacted_sentence)
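Plain str.replace has two edge cases worth knowing about. If LeMUR returns both "John" and "John Doe", masking the shorter one first leaves "Doe" exposed, and an entity whose casing differs from the sentence is not matched at all. A minimal sketch of a helper that handles both is shown here. The name redact is hypothetical, the entities are assumed to be strings, and the case-insensitive matching is a deliberate change from the exact matching used in this guide.

```python
import re


def redact(text, entities, mask="#"):
    """Replace every occurrence of each entity in `text` with `mask` characters."""
    # Longest first: masking "John" before "John Doe" would leave "Doe" exposed.
    ordered = sorted({e for e in entities if e}, key=len, reverse=True)
    if not ordered:
        return text
    # re.escape keeps characters such as "." or "+" in emails and phone numbers literal.
    pattern = re.compile("|".join(re.escape(e) for e in ordered), re.IGNORECASE)
    # Mask each match with as many characters as the matched text has.
    return pattern.sub(lambda m: mask * len(m.group(0)), text)
```

In the loop, redacted_sentence = redact(sentence.text, generated_entities) would take the place of the inner for loop.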

Print the redacted transcript.

print('Full redacted transcript:')
print(redacted_transcript)
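The loop in this guide sends one LeMUR request per sentence, so a long transcript produces many requests. One possible variation, sketched here under assumptions, extracts the entities once from the full text and then masks them in every sentence. The function redact_sentences and its extract_entities parameter are hypothetical; any callable that takes a string and returns a list of strings (such as generate_ner) could be passed in. The trade-off is that the model sees much more text per request, so results may differ from the per-sentence approach.

```python
def redact_sentences(sentences, extract_entities, mask="#"):
    """Redact a list of sentence strings using a single entity-extraction call.

    `extract_entities` is any callable that takes text and returns a list of
    strings, for example the generate_ner function defined in this guide.
    """
    # One extraction call over the joined text instead of one per sentence.
    entities = extract_entities(" ".join(sentences))
    # Longest first so a short entity cannot break up a longer one.
    entities = sorted((e for e in entities if e), key=len, reverse=True)
    redacted = []
    for sentence in sentences:
        for entity in entities:
            sentence = sentence.replace(entity, mask * len(entity))
        redacted.append(sentence)
    return redacted


# Usage with the objects from this guide (not executed here):
# sentences = [s.text for s in transcript.get_sentences()]
# print(" ".join(redact_sentences(sentences, generate_ner)))
```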