Redacting Sensitive Medical Information from Transcriptions

AssemblyAI's Speech-to-Text API now supports automatically detecting and redacting medical information, like drug names, injuries, and medical conditions, from transcription text!

Many of our customers use our PII Detection & Redaction feature, which can detect over 20 types of PII, such as credit card numbers, names, and dates of birth. Over the past few months, we've heard from many customers whose products operate in industries where private medical information must remain confidential as well.

That's why today we are excited to add 5 new medical redaction policies to our PII Detection & Redaction feature. These new policies are:

  • medical_process - Medical process, including treatments, procedures, and tests. E.g., heart surgery, CT scan.
  • medical_condition - A medical condition. Includes diseases, syndromes, deficits, and disorders. E.g., chronic fatigue syndrome, arrhythmia, depression.
  • blood_type - A person's blood type.
  • drug - Medical drug, including vitamins and minerals. E.g., Advil, Acetaminophen, Panadol.
  • injury - Human injury. E.g., I broke my arm, I have a sprained wrist. Includes mutations, miscarriages, and dislocations.

For example, here is the redacted transcription of a sample audio file:

"Hi. I'd like to schedule an appointment for an [MEDICAL_PROCESS]. The
doctor wanted to take a look at the [INJURY] while working had fallen off
of a roof that I was actually working on. He mentioned to go ahead and
give you my blood type, which is [BLOOD_TYPE] for coming in and also
request for a stronger prescription of [DRUG]. Currently I've just been
taking some [DRUG] and [DRUG] [DRUG] and whenever laying around. So when
you get a second, please call me back. We'll love to get that set up.
Thank you. Bye."
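
Each bracketed tag in the transcript above names the category of the text it replaced. If you want a quick summary of what was redacted, one option is to count those tags. The snippet below is a minimal sketch, not part of the API: the helper name count_redactions is hypothetical, and it assumes tags always look like [UPPERCASE_NAME], as they do in the example above.

```python
import re
from collections import Counter


def count_redactions(text):
    """Count bracketed redaction tags such as [DRUG] in a redacted transcript.

    Assumes tags are written as [UPPERCASE_NAME]; returns lowercase policy names.
    """
    tags = re.findall(r'\[([A-Z_]+)\]', text)
    return Counter(tag.lower() for tag in tags)


# A fragment of the redacted transcript shown above.
sample = (
    "give you my blood type, which is [BLOOD_TYPE] for coming in and also "
    "request for a stronger prescription of [DRUG]. Currently I've just been "
    "taking some [DRUG] and [DRUG] [DRUG]"
)

for policy, count in count_redactions(sample).items():
    print(policy, count)
```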

Using these new medical redaction policies is very simple! Here's how you'd transcribe an audio file with these new redaction policies enabled in Python:

import requests

endpoint = 'https://api.assemblyai.com/v2/transcript'

headers = {
    'authorization': 'YOUR-API-TOKEN',
    'content-type': 'application/json',
}

json = {
    # The path to our audio file in an S3 or GCP bucket, for example.
    'audio_url': 'https://s3-us-west-2.amazonaws.com/blog.assemblyai.com/audio/8-7-2018-post/7510.mp3',

    # This will tell the API we want a redacted transcript.
    'redact_pii': True,

    # Setting this parameter tells the API to replace the redacted text with the category
    # of what is being redacted. This will make the transcript more readable! 
    'redact_pii_sub': 'entity_name',

    # Here we specify what we want to redact; in this case we are specifying the
    # new medical policies.
    'redact_pii_policies': ['medical_process', 'medical_condition', 'blood_type', 'drug', 'injury'],
}

response = requests.post(endpoint, json=json, headers=headers)

print(response.json())
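
Transcription runs asynchronously, so the response to the POST request will not contain the finished transcript yet. A common pattern is to poll the transcript by its id until it is ready. The snippet below is a rough sketch of that pattern rather than an official client: the helper names wait_for_transcript and make_fetcher are hypothetical, and it assumes the transcript JSON carries 'id', 'status' (ending in 'completed' or 'error'), 'error', and 'text' fields, and that a transcript can be fetched with a GET request to the same endpoint followed by its id. Check the documentation for the exact response format.

```python
import time


def wait_for_transcript(fetch, poll_seconds=3.0, max_polls=200):
    """Call fetch() repeatedly until the transcript reaches a terminal status.

    fetch is any zero-argument callable that returns the transcript JSON as a dict.
    """
    for _ in range(max_polls):
        result = fetch()
        status = result.get('status')
        if status == 'completed':
            return result
        if status == 'error':
            raise RuntimeError(result.get('error', 'transcription failed'))
        # Still queued or processing: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError('transcript was not ready after %d polls' % max_polls)


def make_fetcher(transcript_id, headers):
    """Build a fetch callable for one transcript (assumed GET endpoint)."""
    # Imported here so the polling helper itself has no third-party dependency.
    import requests
    url = 'https://api.assemblyai.com/v2/transcript/' + transcript_id
    return lambda: requests.get(url, headers=headers).json()
```

With the request above, usage would look something like transcript = wait_for_transcript(make_fetcher(response.json()['id'], headers)), after which transcript['text'] should hold the redacted text. Passing the fetch step in as a callable keeps the polling loop easy to test without network access.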

For more information, check out the PII Detection & Redaction Documentation, or send us an email at support@assemblyai.com!