Create an ambient AI scribe that works during telehealth video calls
Ambient AI scribe tutorial for telehealth: learn to transcribe visits, label speakers, and generate SOAP notes in Python for provider review, plus HIPAA tips.



This tutorial walks through building an ambient AI scribe for telehealth visits in Python. AssemblyAI's Universal-3 Pro model with Medical Mode handles the accuracy requirements of clinical audio, including medications, procedures, and dosages. You'll need Python 3.8 or later, an AssemblyAI API key, and an OpenAI API key to follow along.
What is an ambient AI scribe?
An ambient AI scribe is software that listens to a patient-provider conversation, transcribes it, and turns it into a structured clinical note—automatically. This means the provider never has to type a single word during or after the visit.
It's worth distinguishing from two things it's often confused with:
- Ambient dictation: The provider actively speaks to a recorder to narrate notes after the visit. An ambient scribe is passive—it listens to the natural conversation as it happens.
- Human scribes: A person (in the room or remote) manually types notes in real time. An ambient AI scribe replaces this step entirely with software.
Every ambient AI scribe follows the same three-stage pipeline: audio capture → speech-to-text transcription → clinical note generation. That's exactly what you'll build here.
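The three stages can be previewed as a function skeleton. This is only a sketch: the `transcribe` and `generate_note` stages are injected as placeholder functions here, and the rest of the tutorial fills them in with real AssemblyAI and OpenAI calls.

```python
# Sketch of the three-stage pipeline. Stage 1 (audio capture) is handled by
# the telehealth platform's recording; stages 2 and 3 are passed in as
# functions so the skeleton stays independent of any particular vendor.
def run_scribe(audio_path, transcribe, generate_note):
    utterances = transcribe(audio_path)   # stage 2: speech-to-text + diarization
    return generate_note(utterances)      # stage 3: LLM clinical note generation
```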
Why ambient AI scribes matter in clinical settings
Providers spend a significant portion of their workday on documentation rather than direct patient care—and telehealth compounds this problem since they're already managing a screen-based interaction.
Building an ambient scribe addresses three outcomes health systems care most about:
- Less time on documentation: Providers reclaim time previously spent on after-hours charting and manual note entry.
- More present patient interactions: With ambient documentation handling notes, providers make more eye contact and engage more naturally during visits.
- Lower burnout: Administrative burden from EHR documentation is one of the leading drivers of clinician burnout, and ambient scribes reduce that load directly.
Telehealth makes ambient scribing even more practical—the visit is already recorded by the platform, so audio capture is a natural part of the existing workflow rather than an added step.
Build an ambient AI scribe in Python
Here's what you'll build, step by step:
- Set up your environment
- Transcribe the telehealth session audio
- Separate patient and provider speech with speaker diarization
- Generate a structured clinical note with an LLM
Set up your environment
You'll need Python 3.8 or later, an AssemblyAI API key, and an OpenAI API key.
Install the required packages:
pip install assemblyai openai
Then configure your AssemblyAI API key:
import assemblyai as aai
aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"
This tutorial assumes the telehealth session has already been recorded and saved as an audio file. MP3, WAV, and MP4 are all supported—these are the standard output formats from telehealth platforms like Zoom, Microsoft Teams, and Doxy.me. No video SDK integration is required.
Transcribe the telehealth session audio
Speech-to-text converts the telehealth recording into a transcript the LLM can work with. Submit your recorded audio file to AssemblyAI's transcription API like this:
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"],
    domain="medical-v1",
    speaker_labels=True,
    keyterms_prompt=["metformin", "lisinopril", "hypertension", "SOAP note"]
)
transcriber = aai.Transcriber(config=config)
transcript = transcriber.transcribe("telehealth_session.mp3")
print(transcript.text)
Two configuration options here are specifically valuable for clinical audio:
- speech_models=["universal-3-pro"] with domain="medical-v1": This selects AssemblyAI's Universal-3 Pro model with Medical Mode enabled—an add-on that improves recognition of medications, procedures, conditions, and dosages. Medical terminology is notoriously hard for general speech-to-text models to get right, so this matters.
- keyterms_prompt: Keyterms prompting lets you pass a list of specialty-specific words—medication names, procedure names, lab values—that the model should pay close attention to. Universal-3 Pro supports up to 1,000 keyterms.
If you're building for a multilingual practice or need broader language coverage, Universal-2 supports 99 languages with keyterms prompting up to 200 words.
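Keyterms lists are usually assembled from formularies and specialty vocabulary files, so it helps to clean them before passing them to the config. Here's a small hypothetical helper (not part of the AssemblyAI SDK) that dedupes case-insensitively and enforces the 1,000-term cap:

```python
# Hypothetical helper: clean a specialty vocabulary into a keyterms list.
# Dedupes case-insensitively and caps at Universal-3 Pro's 1,000-term limit.
def build_keyterms(terms, limit=1000):
    seen, keyterms = set(), []
    for term in terms:
        cleaned = term.strip()
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            keyterms.append(cleaned)
    return keyterms[:limit]
```

The returned list can be passed directly as the keyterms_prompt argument shown earlier.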
Separate patient and provider speech with speaker diarization
Speaker diarization identifies who said what in a conversation—so the transcript gets labeled by speaker rather than appearing as one undifferentiated block of text. This is what makes the LLM's job possible: it needs to know which lines are the provider's observations and which are the patient's complaints.
Setting speaker_labels=True in the config (which you already did above) enables this automatically. Access the labeled transcript through the utterances property:
for u in transcript.utterances:
    print(f"{'Provider' if u.speaker == 'A' else 'Patient'}: {u.text}")
AssemblyAI labels the first detected speaker as "A." In a telehealth session, this is typically the provider who opens the call. You can confirm this by checking the first utterance, or set it manually if your practice has a consistent call structure.
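If you'd rather centralize that assumption than repeat the inline conditional, a small helper makes the speaker-to-role mapping explicit and easy to flip. This is a hypothetical convenience function, not part of the SDK—it assumes speaker "A" is the provider, per the call structure described above:

```python
# Map diarization labels to clinical roles. Assumes speaker "A" is the
# provider (the first voice on the call); swap the dict entries if your
# practice's call structure differs.
ROLE_BY_SPEAKER = {"A": "Provider", "B": "Patient"}

def label_line(speaker, text):
    role = ROLE_BY_SPEAKER.get(speaker, f"Speaker {speaker}")
    return f"{role}: {text}"
```

Unknown speaker labels fall through to a generic "Speaker X" tag rather than being misattributed, which matters if a third party (a caregiver, say) joins the call.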
Here's what the diarized output looks like:
Provider: Good morning. What brings you in today?
Patient: I've been having chest tightness for about three days.
Provider: Is it worse with exertion or at rest?
Patient: It gets worse when I climb stairs or walk quickly.
Provider: Any shortness of breath or pain radiating to your arm?
Patient: No, just the tightness in my chest.
This structured, speaker-labeled transcript is what you'll pass to the LLM in the next step.
Generate a structured clinical note with an LLM
Now you'll use an LLM to read the diarized transcript and generate a SOAP note—a standard clinical documentation format that organizes information into four sections: Subjective, Objective, Assessment, and Plan.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

conversation = "\n".join(
    f"{'Provider' if u.speaker == 'A' else 'Patient'}: {u.text}"
    for u in transcript.utterances
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "You are a clinical documentation assistant. Generate accurate SOAP notes from patient-provider conversation transcripts. Be concise and use standard clinical language."
        },
        {
            "role": "user",
            "content": f"Generate a SOAP note from this telehealth visit transcript:\n\n{conversation}"
        }
    ]
)

soap_note = response.choices[0].message.content
print(soap_note)
Here's what the output looks like for the example conversation above:
SUBJECTIVE:
Patient is a [age] presenting with chest tightness for 3 days.
Symptoms worsen with exertion (climbing stairs, walking quickly).
Denies shortness of breath, radiation to arm, nausea, or diaphoresis.
OBJECTIVE:
Telehealth visit. Patient appears comfortable at rest. No visible distress.
ASSESSMENT:
Exertional chest tightness. Differential includes stable angina,
musculoskeletal pain, or anxiety-related chest symptoms.
PLAN:
1. Order ECG and troponin levels
2. Start aspirin 81mg daily pending workup
3. Follow up in person within 48 hours
4. Patient advised to go to ED if symptoms worsen or change in character
Alternative: Use AssemblyAI's LLM Gateway instead of OpenAI. LLM Gateway provides access to 20+ models—including Claude, GPT, Gemini, and more—through an OpenAI-compatible chat completions API. The advantage for clinical workflows is keeping your entire pipeline on one vendor: transcription, diarization, and LLM-powered note generation all through AssemblyAI, with one bill and one set of logs. The endpoint is llm-gateway.assemblyai.com/v1/chat/completions, so switching from the OpenAI SDK requires only changing the base URL and API key.
SOAP is the most common format, but you can adapt the system prompt for other clinical note types depending on your specialty.
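One way to do that is to parameterize the system prompt by note format. The prompts below are illustrative only—the SOAP wording comes from this tutorial, and DAP (Data, Assessment, Plan) is shown as one common alternative used in behavioral health; adjust the wording to your specialty's documentation standards:

```python
# Illustrative system prompts keyed by note format (hypothetical helper).
NOTE_FORMAT_PROMPTS = {
    "SOAP": ("You are a clinical documentation assistant. Generate accurate "
             "SOAP notes (Subjective, Objective, Assessment, Plan) from "
             "patient-provider conversation transcripts. Be concise and use "
             "standard clinical language."),
    "DAP": ("You are a clinical documentation assistant. Generate accurate "
            "DAP notes (Data, Assessment, Plan) from patient-provider "
            "conversation transcripts. Be concise and use standard clinical "
            "language."),
}

def system_prompt(note_format="SOAP"):
    return NOTE_FORMAT_PROMPTS[note_format]
```

You'd then pass `system_prompt("DAP")` as the system message content instead of the hardcoded SOAP string.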
Privacy and compliance requirements for clinical ambient scribes
Any ambient AI scribe handling telehealth recordings must meet three requirements before it goes anywhere near a real patient visit.
- Patient consent: Patients must know the visit is being recorded and transcribed, and must be able to opt out. In telehealth, this is typically a verbal confirmation at the start of the call, documented in the patient record.
- Provider review and sign-off: AI-generated notes are drafts, not final records. The provider reviews, edits for accuracy, and signs before the note enters the EHR.
- HIPAA and data handling: Telehealth recordings and transcripts contain protected health information (PHI). AssemblyAI enables covered entities and their business associates subject to HIPAA to use AssemblyAI services to process PHI—AssemblyAI is considered a business associate under HIPAA and offers a Business Associate Addendum (BAA) that is required under HIPAA to ensure AssemblyAI appropriately safeguards PHI. Any LLM provider you use for note generation must also support BAA execution.
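The consent requirement can be enforced in code rather than by convention. Here's a minimal sketch—the function and field names are hypothetical, and a real system would pull consent status from the patient record—that refuses to process a recording without documented consent:

```python
# Gate transcription on documented patient consent (hypothetical shape;
# "recording_consent" stands in for whatever your visit record stores).
def start_scribe_session(audio_path, visit_record):
    if not visit_record.get("recording_consent"):
        raise PermissionError(
            "No documented consent to record/transcribe; refusing to process."
        )
    return {"audio": audio_path, "status": "queued_for_transcription"}
```

Failing closed like this keeps an un-consented recording from ever reaching the transcription API.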
What you've built
You now have a working ambient AI scribe pipeline: a recorded telehealth session goes in, a speaker-labeled transcript comes out, and a structured clinical note gets generated—ready for provider review.
AssemblyAI's Universal-3 Pro model, Medical Mode, and speaker diarization handle the hardest parts of clinical Voice AI: getting medical terminology right, telling patient from provider, and doing it accurately enough that the generated note is actually useful. Teams building on this foundation can extend it to add post-call analytics, patient sentiment analysis, or automated after-visit summaries using AssemblyAI's LLM Gateway for applying LLMs to transcribed audio.
What about real-time ambient scribing? This tutorial processes recorded audio after the visit—and that's how most production ambient scribes work today. But if you're building toward live, in-visit documentation where the scribe generates notes as the conversation happens, AssemblyAI's Voice Agent API is worth exploring. It provides a single WebSocket connection that handles speech understanding, LLM reasoning, and voice generation at a $4.50/hr flat rate—one API, one bill, no multi-vendor orchestration. The same Universal-3 Pro foundation powers the speech understanding layer, so the medical terminology accuracy you get in this tutorial carries over. It's designed as invisible infrastructure: you configure your agent's behavior and build your product without managing the voice plumbing underneath.
Frequently asked questions
What is the difference between an ambient AI scribe and traditional medical dictation?
Traditional medical dictation requires the provider to speak directly to a recorder after the visit to narrate notes—it's an active, intentional step that still takes time. An ambient AI scribe passively listens to the natural patient-provider conversation and generates documentation automatically, without the provider changing their behavior at all.
How accurate is speech-to-text transcription for medical terminology?
General-purpose speech-to-text models often misrecognize medications, procedures, and clinical shorthand because these terms rarely appear in general training data. AssemblyAI's Universal-3 Pro model with Medical Mode is specifically optimized for clinical vocabulary, and keyterms prompting (up to 1,000 terms) lets you further tune recognition for your specialty.
Does an ambient AI scribe built with this tutorial need to integrate with an EHR?
No—the scribe generates a structured text note that providers can copy into any EHR manually. Direct EHR integration via APIs like Epic's FHIR is a common next step for production deployments, but it's outside the scope of this tutorial.
What audio file formats does AssemblyAI's transcription API accept?
AssemblyAI accepts all common audio and video formats including MP3, MP4, WAV, M4A, FLAC, and WebM—which covers the export formats of every major telehealth platform. You can submit files as a local file path or a publicly accessible URL.
Do ambient AI scribes need to process audio in real time during the telehealth visit?
Not necessarily—and for most clinical workflows you wouldn't want them to. The async approach in this tutorial processes the recorded audio file after the visit concludes, which is how most ambient scribe products work in production. For teams that do want live, in-visit transcription and note generation, AssemblyAI's Voice Agent API provides a single WebSocket that handles speech understanding, LLM reasoning, and voice generation in one connection—no separate STT, LLM, and TTS providers to stitch together. That said, real-time streaming adds complexity and latency tradeoffs, so evaluate whether your use case truly requires it or if post-visit processing is the better fit.



