Create an ambient AI scribe that works during telehealth video calls
Ambient AI scribe tutorial for telehealth: learn to transcribe visits, label speakers, and generate SOAP notes in Python for provider review, plus HIPAA tips.



This tutorial walks through building an ambient AI scribe for telehealth visits in Python. AssemblyAI's Universal-3 Pro model with Medical Mode handles the accuracy requirements of clinical audio, including medications, procedures, and dosages. You'll need Python 3.8 or later, an AssemblyAI API key, and an OpenAI API key to follow along.
What is an ambient AI scribe?
An ambient AI scribe is software that listens to a patient-provider conversation, transcribes it, and turns it into a structured clinical note—automatically. This means the provider never has to type a single word during or after the visit.
It's worth distinguishing from two things it's often confused with:
- Ambient dictation: The provider actively speaks to a recorder to narrate notes after the visit. An ambient scribe is passive—it listens to the natural conversation as it happens.
- Human scribes: A person (in the room or remote) manually types notes in real time. An ambient AI scribe replaces this step entirely with software.
Every ambient AI scribe follows the same three-stage pipeline: audio capture → speech-to-text transcription → clinical note generation. That's exactly what you'll build here.
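The three stages can be previewed as a function skeleton. This is only a sketch: the `transcribe` and `generate_note` stages are injected as placeholder functions here, and the rest of the tutorial fills them in with real AssemblyAI and OpenAI calls.

```python
# Sketch of the three-stage pipeline. Stage 1 (audio capture) is handled by
# the telehealth platform's recording; stages 2 and 3 are passed in as
# functions so the skeleton stays independent of any particular vendor.
def run_scribe(audio_path, transcribe, generate_note):
    utterances = transcribe(audio_path)   # stage 2: speech-to-text + diarization
    return generate_note(utterances)      # stage 3: LLM clinical note generation
```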
Why ambient AI scribes matter in clinical settings
Providers spend a significant portion of their workday on documentation rather than direct patient care—and telehealth compounds this problem since they're already managing a screen-based interaction.
Building an ambient scribe addresses three outcomes health systems care most about:
- Less time on documentation: Providers reclaim time previously spent on after-hours charting and manual note entry.
- More present patient interactions: With ambient documentation handling notes, providers make more eye contact and engage more naturally during visits.
- Lower burnout: Administrative burden from EHR documentation is one of the leading drivers of clinician burnout, and ambient scribes reduce that load directly.
Telehealth makes ambient scribing even more practical—the visit is already recorded by the platform, so audio capture is a natural part of the existing workflow rather than an added step.
Build an ambient AI scribe in Python
Here's what you'll build, step by step:
- Set up your environment
- Transcribe the telehealth session audio
- Separate patient and provider speech with speaker diarization
- Generate a structured clinical note with an LLM
Set up your environment
You'll need Python 3.8 or later, an AssemblyAI API key, and an OpenAI API key.
Install the required packages:
pip install assemblyai openai
Then configure your AssemblyAI API key:
import assemblyai as aai
aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"
This tutorial assumes the telehealth session has already been recorded and saved as an audio file. MP3, WAV, and MP4 are all supported—these are the standard output formats from telehealth platforms like Zoom, Microsoft Teams, and Doxy.me. No video SDK integration is required.
Transcribe the telehealth session audio
Speech-to-text converts the telehealth recording into a transcript the LLM can work with. Submit your recorded audio file to AssemblyAI's transcription API like this:
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"],
    domain="medical-v1",
    speaker_labels=True,
    keyterms_prompt=["metformin", "lisinopril", "hypertension", "SOAP note"]
)
transcriber = aai.Transcriber(config=config)
transcript = transcriber.transcribe("telehealth_session.mp3")
print(transcript.text)
Two configuration options here are specifically valuable for clinical audio:
- speech_models=["universal-3-pro"] with domain="medical-v1": This selects AssemblyAI's Universal-3 Pro model with Medical Mode enabled—an add-on that improves recognition of medications, procedures, conditions, and dosages. Medical terminology is notoriously hard for general speech-to-text models to get right, so this matters.
- keyterms_prompt: Keyterms prompting lets you pass a list of specialty-specific words—medication names, procedure names, lab values—that the model should pay close attention to. Universal-3 Pro supports up to 1,000 keyterms.
If you're building for a multilingual practice or need broader language coverage, Universal-2 supports 99 languages with keyterms prompting up to 200 words.
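Keyterms lists are usually assembled from formularies and specialty vocabulary files, so it helps to clean them before passing them to the config. Here's a small hypothetical helper (not part of the AssemblyAI SDK) that dedupes case-insensitively and enforces the 1,000-term cap:

```python
# Hypothetical helper: clean a specialty vocabulary into a keyterms list.
# Dedupes case-insensitively and caps at Universal-3 Pro's 1,000-term limit.
def build_keyterms(terms, limit=1000):
    seen, keyterms = set(), []
    for term in terms:
        cleaned = term.strip()
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            keyterms.append(cleaned)
    return keyterms[:limit]
```

The returned list can be passed directly as the keyterms_prompt argument shown earlier.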
Separate patient and provider speech with speaker diarization
Speaker diarization identifies who said what in a conversation—so the transcript gets labeled by speaker rather than appearing as one undifferentiated block of text. This is what makes the LLM's job possible: it needs to know which lines are the provider's observations and which are the patient's complaints.
Setting speaker_labels=True in the config (which you already did above) enables this automatically. Access the labeled transcript through the utterances property:
for u in transcript.utterances:
    print(f"{'Provider' if u.speaker == 'A' else 'Patient'}: {u.text}")
AssemblyAI labels the first detected speaker as "A." In a telehealth session, this is typically the provider who opens the call. You can confirm this by checking the first utterance, or set it manually if your practice has a consistent call structure.
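If you'd rather centralize that assumption than repeat the inline conditional, a small helper makes the speaker-to-role mapping explicit and easy to flip. This is a hypothetical convenience function, not part of the SDK—it assumes speaker "A" is the provider, per the call structure described above:

```python
# Map diarization labels to clinical roles. Assumes speaker "A" is the
# provider (the first voice on the call); swap the dict entries if your
# practice's call structure differs.
ROLE_BY_SPEAKER = {"A": "Provider", "B": "Patient"}

def label_line(speaker, text):
    role = ROLE_BY_SPEAKER.get(speaker, f"Speaker {speaker}")
    return f"{role}: {text}"
```

Unknown speaker labels fall through to a generic "Speaker X" tag rather than being misattributed, which matters if a third party (a caregiver, say) joins the call.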
Here's what the diarized output looks like:
Provider: Good morning. What brings you in today?
Patient: I've been having chest tightness for about three days.
Provider: Is it worse with exertion or at rest?
Patient: It gets worse when I climb stairs or walk quickly.
Provider: Any shortness of breath or pain radiating to your arm?
Patient: No, just the tightness in my chest.
This structured, speaker-labeled transcript is what you'll pass to the LLM in the next step.
Generate a structured clinical note with an LLM
Now you'll use an LLM to read the diarized transcript and generate a SOAP note—a standard clinical documentation format that organizes information into four sections: Subjective, Objective, Assessment, and Plan.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

conversation = "\n".join(
    f"{'Provider' if u.speaker == 'A' else 'Patient'}: {u.text}"
    for u in transcript.utterances
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "You are a clinical documentation assistant. Generate accurate SOAP notes from patient-provider conversation transcripts. Be concise and use standard clinical language."
        },
        {
            "role": "user",
            "content": f"Generate a SOAP note from this telehealth visit transcript:\n\n{conversation}"
        }
    ]
)

soap_note = response.choices[0].message.content
print(soap_note)
Here's what the output looks like for the example conversation above:
SUBJECTIVE:
Patient is a [age] presenting with chest tightness for 3 days.
Symptoms worsen with exertion (climbing stairs, walking quickly).
Denies shortness of breath, radiation to arm, nausea, or diaphoresis.
OBJECTIVE:
Telehealth visit. Patient appears comfortable at rest. No visible distress.
ASSESSMENT:
Exertional chest tightness. Differential includes stable angina,
musculoskeletal pain, or anxiety-related chest symptoms.
PLAN:
1. Order ECG and troponin levels
2. Start aspirin 81mg daily pending workup
3. Follow up in person within 48 hours
4. Patient advised to go to ED if symptoms worsen or change in character
Alternative: Use AssemblyAI's LLM Gateway instead of OpenAI. LLM Gateway provides access to 20+ models—including Claude, GPT, Gemini, and more—through an OpenAI-compatible chat completions API. The advantage for clinical workflows is keeping your entire pipeline on one vendor: transcription, diarization, and LLM-powered note generation all through AssemblyAI, with one bill and one set of logs. The endpoint is llm-gateway.assemblyai.com/v1/chat/completions, so switching from the OpenAI SDK requires only changing the base URL and API key.
SOAP is the most common format, but you can adapt the system prompt for other clinical note types depending on your specialty.
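One way to do that is to parameterize the system prompt by note format. The prompts below are illustrative only—the SOAP wording comes from this tutorial, and DAP (Data, Assessment, Plan) is shown as one common alternative used in behavioral health; adjust the wording to your specialty's documentation standards:

```python
# Illustrative system prompts keyed by note format (hypothetical helper).
NOTE_FORMAT_PROMPTS = {
    "SOAP": ("You are a clinical documentation assistant. Generate accurate "
             "SOAP notes (Subjective, Objective, Assessment, Plan) from "
             "patient-provider conversation transcripts. Be concise and use "
             "standard clinical language."),
    "DAP": ("You are a clinical documentation assistant. Generate accurate "
            "DAP notes (Data, Assessment, Plan) from patient-provider "
            "conversation transcripts. Be concise and use standard clinical "
            "language."),
}

def system_prompt(note_format="SOAP"):
    return NOTE_FORMAT_PROMPTS[note_format]
```

You'd then pass `system_prompt("DAP")` as the system message content instead of the hardcoded SOAP string.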
Privacy and compliance requirements for clinical ambient scribes
Any ambient AI scribe handling telehealth recordings must meet three requirements before it goes anywhere near a real patient visit.
- Patient consent: Patients must know the visit is being recorded and transcribed, and must be able to opt out. In telehealth, this is typically a verbal confirmation at the start of the call, documented in the patient record.
- Provider review and sign-off: AI-generated notes are drafts, not final records. The provider reviews, edits for accuracy, and signs before the note enters the EHR.
- HIPAA and data handling: Telehealth recordings and transcripts contain protected health information (PHI). AssemblyAI enables covered entities and their business associates subject to HIPAA to use AssemblyAI services to process PHI—AssemblyAI is considered a business associate under HIPAA and offers a Business Associate Addendum (BAA) that is required under HIPAA to ensure AssemblyAI appropriately safeguards PHI. Any LLM provider you use for note generation must also support BAA execution.
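The consent requirement can be enforced in code rather than by convention. Here's a minimal sketch—the function and field names are hypothetical, and a real system would pull consent status from the patient record—that refuses to process a recording without documented consent:

```python
# Gate transcription on documented patient consent (hypothetical shape;
# "recording_consent" stands in for whatever your visit record stores).
def start_scribe_session(audio_path, visit_record):
    if not visit_record.get("recording_consent"):
        raise PermissionError(
            "No documented consent to record/transcribe; refusing to process."
        )
    return {"audio": audio_path, "status": "queued_for_transcription"}
```

Failing closed like this keeps an un-consented recording from ever reaching the transcription API.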
What you've built
You now have a working ambient AI scribe pipeline: a recorded telehealth session goes in, a speaker-labeled transcript comes out, and a structured clinical note gets generated—ready for provider review.
AssemblyAI's Universal-3 Pro model, Medical Mode, and speaker diarization handle the hardest parts of clinical Voice AI: getting medical terminology right, telling patient from provider, and doing it accurately enough that the generated note is actually useful. Teams building on this foundation can extend it to add post-call analytics, patient sentiment analysis, or automated after-visit summaries using AssemblyAI's LLM Gateway for applying LLMs to transcribed audio.
What about real-time ambient scribing? This tutorial processes recorded audio after the visit—and that's how most production ambient scribes work today. But if you're building toward live, in-visit documentation where the scribe generates notes as the conversation happens, AssemblyAI's Voice Agent API is worth exploring. It provides a single WebSocket connection that handles speech understanding, LLM reasoning, and voice generation at a $4.50/hr flat rate—one API, one bill, no multi-vendor orchestration. The same Universal-3 Pro foundation powers the speech understanding layer, so the medical terminology accuracy you get in this tutorial carries over. It's designed as invisible infrastructure: you configure your agent's behavior and build your product without managing the voice plumbing underneath.
Frequently asked questions
What is the difference between an ambient AI scribe and traditional medical dictation?
Traditional medical dictation requires the provider to speak directly to a recorder after the visit to narrate notes—it's an active, intentional step that still takes time. An ambient AI scribe passively listens to the natural patient-provider conversation and generates documentation automatically, without the provider changing their behavior at all.
How accurate is speech-to-text transcription for medical terminology?
General-purpose speech-to-text models often misrecognize medications, procedures, and clinical shorthand because these terms rarely appear in general training data. AssemblyAI's Universal-3 Pro model with Medical Mode is specifically optimized for clinical vocabulary, and keyterms prompting (up to 1,000 terms) lets you further tune recognition for your specialty.
Does an ambient AI scribe built with this tutorial need to integrate with an EHR?
No—the scribe generates a structured text note that providers can copy into any EHR manually. Direct EHR integration via APIs like Epic's FHIR is a common next step for production deployments, but it's outside the scope of this tutorial.
What audio file formats does AssemblyAI's transcription API accept?
AssemblyAI accepts all common audio and video formats including MP3, MP4, WAV, M4A, FLAC, and WebM—which covers the export formats of every major telehealth platform. You can submit files as a local file path or a publicly accessible URL.
Do ambient AI scribes need to process audio in real time during the telehealth visit?
Not necessarily—and for most clinical workflows you wouldn't want them to. The async approach in this tutorial processes the recorded audio file after the visit concludes, which is how most ambient scribe products work in production. For teams that do want live, in-visit transcription and note generation, AssemblyAI's Voice Agent API provides a single WebSocket that handles speech understanding, LLM reasoning, and voice generation in one connection—no separate STT, LLM, and TTS providers to stitch together. That said, real-time streaming adds complexity and latency tradeoffs, so evaluate whether your use case truly requires it or if post-visit processing is the better fit.



