What is an ambient AI scribe and how does it work?

An ambient AI scribe is a voice-aware system that listens passively to a patient-provider conversation and automatically generates a structured clinical note (SOAP, DAP, or specialty-specific template) without disrupting the encounter. The pipeline has three stages: speech-to-text transcribes the conversation in real time with speaker diarization separating provider and patient turns; PHI redaction masks sensitive identifiers before downstream processing; and an LLM (e.g., via AssemblyAI's LLM Gateway with 25+ models across Claude, GPT, and Gemini) takes the diarized transcript and produces a draft note. The provider reviews and signs off before the note syncs to the EHR — typically Epic, Cerner, or another system via API integration.
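
As a rough illustration of how the three stages chain together, the sketch below models them as plain Python functions. Everything in it is hypothetical: `Turn`, `redact_phi`, and `draft_note` are illustrative names, the regex masking is a toy stand-in for a real PHI redaction service, and the LLM call is replaced by a stub so the example runs offline.

```python
import re
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str  # "provider" or "patient", as assigned by diarization
    text: str

# Toy patterns only; a production system would rely on a dedicated PHI redaction service.
PHI_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def redact_phi(turns):
    """Stage 2: mask identifiers before the transcript reaches the LLM."""
    redacted = []
    for turn in turns:
        text = turn.text
        for label, pattern in PHI_PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        redacted.append(Turn(turn.speaker, text))
    return redacted

def draft_note(turns, llm=None):
    """Stage 3: hand the diarized, redacted transcript to an LLM (stubbed here)."""
    transcript = "\n".join(f"{t.speaker}: {t.text}" for t in turns)
    if llm is None:
        return "DRAFT (unreviewed)\n" + transcript
    return llm(transcript)

# Stage 1 (speech-to-text with diarization) is assumed to have produced these turns.
turns = [
    Turn("provider", "Any side effects with the 500mg metformin?"),
    Turn("patient", "Some nausea. You can reach me at 555-123-4567."),
]
note = draft_note(redact_phi(turns))
print(note)
```

Redaction runs before the LLM stage so that masked identifiers never leave the transcription boundary, which mirrors the ordering described above.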

How do I build an AI medical scribe with speech-to-text?

Start with AssemblyAI's Voice Agent API for a fully managed pipeline (~1s end-to-end) or Universal-3.5 Pro Realtime ($0.45/hr) for BYO orchestration with LiveKit or Pipecat. Capture audio from a room mic, smartphone, or telehealth stream. Enable Medical Mode (87% fewer medical entity errors) and speaker_labels=True so provider and patient turns are separated automatically. Pass medication names, dosages, ICD codes, and specialty terms as keyterms_prompt to boost recognition. Pipe the diarized transcript into your LLM — or AssemblyAI's LLM Gateway — with a SOAP-note prompt template, then push the approved note to your EHR.
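
The last step, piping the diarized transcript into an LLM with a SOAP-note prompt template, might look roughly like the sketch below. The template wording, the `build_soap_prompt` helper, and the `"A"`/`"B"` speaker labels are assumptions made for illustration; the mapping from diarization labels to roles is assumed to be supplied by your application.

```python
SOAP_PROMPT = """You are a clinical documentation assistant.
Write a SOAP note (Subjective, Objective, Assessment, Plan) from the transcript below.
Use only information stated in the transcript and mark anything uncertain for review.

Transcript:
{transcript}
"""

def build_soap_prompt(utterances, roles):
    """Render diarized utterances as 'Role: text' lines and fill the template.

    utterances: list of (speaker_label, text) pairs, e.g. ("A", "...")
    roles: mapping from diarization label to a human-readable role
    """
    lines = []
    for label, text in utterances:
        role = roles.get(label, f"Speaker {label}")
        lines.append(f"{role}: {text.strip()}")
    return SOAP_PROMPT.format(transcript="\n".join(lines))

utterances = [
    ("A", "Let's review your metformin dosage. Any side effects with the 500mg?"),
    ("B", "Some nausea in the morning, but it's getting better."),
    ("A", "Good. We'll keep the current dose and recheck A1C in 3 months."),
]
prompt = build_soap_prompt(utterances, roles={"A": "Provider", "B": "Patient"})
print(prompt)
```

The resulting string can be sent to whichever LLM you use; keeping the role labels in the transcript gives the model what it needs to attribute statements to the right participant.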

What speech-to-text accuracy do I need for clinical documentation?

Clinical documentation needs accuracy on two distinct axes: general transcription accuracy (target <10% word error rate on conversational speech) and medical entity accuracy (correctly capturing drug names, dosages, anatomical terms, and ICD codes). AssemblyAI's Medical Mode reduces medical entity errors by 87% versus the base model — critical because a mis-transcribed medication name or dosage can introduce patient-safety risk into the EHR. Pair it with built-in speaker diarization so notes attribute statements to the right participant. For ambient settings, the model also has to hold accuracy in far-field audio — Universal-3.5 Pro Realtime captures clean transcripts from 20+ feet as providers move around the exam room.
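
For reference, word error rate is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below is a minimal word-level edit-distance implementation for checking your own audio against a hand-corrected reference; the function name and the lowercase, whitespace-split normalization are assumptions, and real evaluations usually normalize punctuation and numerals more carefully.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "we'll keep the current dose and recheck a1c in 3 months"
hypothesis = "we'll keep the current dose and recheck a once in 3 months"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.3f}")  # 2 edits over 11 reference words -> 0.182
```

In this example the only mistake is the lab name, yet it is the one word that matters clinically, which is why medical entity accuracy is worth tracking separately from overall WER.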

Can ambient AI scribes work for telehealth visits?

Yes. For telehealth, the scribe ingests the call's audio track in real time — most teams bridge from Twilio Programmable Video, LiveKit, or a Zoom/Meet bot rather than capturing the video stream directly. Stream the audio into AssemblyAI's WebSocket with Medical Mode and speaker_labels enabled. The same SOAP-note pipeline works whether the encounter is in-person or telehealth, with the same Medical Mode accuracy and diarization quality. AssemblyAI is a business associate under HIPAA and offers a Business Associate Addendum (BAA) for covered entities processing PHI — contact sales to execute one.

Should I use the Voice Agent API or Streaming STT for my ambient scribe?

It depends on whether your scribe needs to talk back. The Voice Agent API ($4.50/hr all-in) is the fastest path to an interactive scribe — one WebSocket handles STT, LLM, and TTS, so the agent can read back orders, confirm medications, or prompt the provider for missing fields during the encounter. Universal-3.5 Pro Realtime ($0.45/hr) is the right pick for passive ambient scribes where you bring your own LLM (or use AssemblyAI's LLM Gateway) for SOAP-note generation and your own EHR integration logic. Both share the same Universal-3.5 Pro speech foundation, so transcription quality is identical — the difference is whether you want a managed conversational pipeline or full architectural control.

What does it cost to build versus buy an AI medical scribe?

The build cost on AssemblyAI is transparent: $0.45/hr for Universal-3.5 Pro Realtime (transcription + diarization + PHI redaction, BYO LLM and EHR integration) or $4.50/hr for the Voice Agent API (STT + LLM + TTS managed end-to-end). The LLM Gateway gives you access to 25+ models — Claude, GPT, Gemini, and more — through one unified API for SOAP-note generation. Buying a vertically integrated scribe product typically costs $200–600 per provider per month with vendor lock-in on the model and EHR integration. Building on AssemblyAI gives you control over the LLM, the note template, and the EHR push — which matters as you scale across specialties or customize for your workflow.
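
To compare against a per-provider subscription, it helps to convert the hourly rates into a monthly figure. The sketch below does that arithmetic for a hypothetical workload (20 visits a day, 15 minutes each, 21 clinic days a month); those workload numbers are assumptions, and the Universal-3.5 Pro Realtime figure leaves out LLM usage and your own engineering and EHR integration costs.

```python
STT_RATE = 0.45          # USD per audio hour, Universal-3.5 Pro Realtime (rate quoted above)
VOICE_AGENT_RATE = 4.50  # USD per audio hour, Voice Agent API (rate quoted above)

def monthly_audio_hours(visits_per_day, minutes_per_visit, clinic_days):
    """Total encounter audio per provider per month, in hours."""
    return visits_per_day * minutes_per_visit * clinic_days / 60

def monthly_cost(hours, rate_per_hour):
    """Usage cost in USD, rounded to cents."""
    return round(hours * rate_per_hour, 2)

# Hypothetical workload: 20 visits a day, 15 minutes each, 21 clinic days a month.
hours = monthly_audio_hours(20, 15, 21)
stt_cost = monthly_cost(hours, STT_RATE)
agent_cost = monthly_cost(hours, VOICE_AGENT_RATE)
print(hours, stt_cost, agent_cost)  # 105.0 47.25 472.5
```

Under these assumptions a provider generates 105 audio hours a month, so usage comes to about $47 on the transcription-only path and about $473 on the Voice Agent API, which can then be set against the $200–600 per provider per month range for packaged products.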

Voice agents for AI medical scribe & ambient documentation

Build ambient AI scribes that listen to patient-provider conversations and automatically generate structured clinical notes. Powered by Medical Mode with 87% fewer medical entity errors, speaker diarization, and LLM Gateway for SOAP note generation.

SOAP note — auto-generated

Visit: Annual wellness · Dr. Patel · 14 min

Subjective

Patient reports persistent fatigue over 3 weeks. Denies chest pain, SOB. Sleep quality poor…

Objective

BP 128/82, HR 74, Temp 98.6°F. BMI 27.3. No lymphadenopathy…

Assessment & plan

R53.83 Fatigue. Order CBC, CMP, TSH, ferritin. F/U 2 weeks…

The problem

Documentation is burning out your clinicians

Providers spend two hours on documentation for every one hour of patient care. That overhead drives burnout, shrinks appointment availability, and costs health systems thousands per provider annually in lost revenue. Ambient AI scribes — built on clinical-grade speech-to-text, speaker diarization, and LLM-powered note generation — eliminate the typing so providers can focus on the patient in front of them.

Built for clinical documentation accuracy

Medical accuracy: 87% fewer medical entity errors with Medical Mode.

Ambient range: 20+ ft far-field capture as providers move around the room.

Latency: ~150 ms median (P50) streaming latency on Universal-3.5 Pro.

LLM Gateway: 25+ models for note generation (Claude, GPT, Gemini, and more) through one API.

Two ways to build

Pick the API that fits your scribe architecture

Ship an ambient scribe with our managed pipeline, or drop medical-grade STT into the orchestrator you already run.

Recommended

Voice Agent API

Our proprietary voice stack with Medical Mode via one WebSocket. Real-time ambient transcription with built-in speaker diarization, LLM reasoning, and TTS for interactive scribes.

Best for

Interactive ambient scribes with voice confirmation
Medical Mode with 87% fewer entity errors built in
Teams shipping fast — working scribe in an afternoon
Business Associate Addendum (BAA) available for PHI workloads

$4.50/hr — speech, LLM, and voice all included

Free tier available · No credit card required

Bring Your Own Stack

Universal-3.5 Pro Realtime STT API

The medical-grade STT layer for your ambient scribe pipeline. Pair with your own LLM for SOAP generation and your own EHR integration logic.

Best for

Teams using LiveKit, Pipecat, or custom orchestration
Cascading architectures (STT → LLM → note generation)
Medical Mode add-on with keyterm prompting for formulary
Complex EHR integrations (Epic, Cerner, custom)
BAA-eligible, SOC 2 Type 2 — bring your own compliance infra

$0.45/hr — transcription only, unlimited streams

No concurrency caps · Autoscaling included

Your ambient scribe pipeline

Capture clinical audio

With the Voice Agent API, stream audio over a single WebSocket. With a BYO stack, send audio from a smartphone, tablet, or room mic to Universal-3.5 Pro Realtime. Far-field capture works from 20+ feet.

Transcribe with Medical Mode

87% fewer medical entity errors. Speaker diarization labels provider and patient speech automatically at ~150ms P50.

Generate structured notes

LLM Gateway organizes the diarized transcript into SOAP, DAP, or specialty-specific templates. 25+ models across Claude, GPT, and Gemini.

Review and sync to EHR

Provider reviews draft note, edits as needed, approves. Push to Epic, Cerner, or any EHR via API integration.
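
One way to enforce the review step in code is to make the EHR push refuse any note that has not been approved. The sketch below is a minimal illustration of that gate; `DraftNote`, `sync_to_ehr`, and the payload fields are hypothetical names, and `push` stands in for whatever EHR client your integration actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

@dataclass
class DraftNote:
    encounter_id: str
    body: str
    approved_by: Optional[str] = None
    approved_at: Optional[datetime] = None

    def edit(self, new_body: str) -> None:
        # Any edit clears a previous approval so the final text is always reviewed.
        self.body = new_body
        self.approved_by = None
        self.approved_at = None

    def approve(self, provider: str) -> None:
        self.approved_by = provider
        self.approved_at = datetime.now(timezone.utc)

def sync_to_ehr(note: DraftNote, push: Callable[[dict], None]) -> None:
    """Send the note through `push` only if a provider has approved it."""
    if note.approved_by is None:
        raise PermissionError("note must be approved by a provider before syncing")
    push({
        "encounter_id": note.encounter_id,
        "body": note.body,
        "approved_by": note.approved_by,
        "approved_at": note.approved_at.isoformat(),
    })

sent = []  # stands in for the EHR; a real `push` would call your integration
note = DraftNote("enc-001", "S: Fatigue for 3 weeks...")
note.edit("S: Persistent fatigue over 3 weeks. Denies chest pain.")
note.approve("Dr. Patel")
sync_to_ehr(note, push=sent.append)
```

Clearing the approval on every edit is a design choice rather than a requirement; it keeps the signed-off text and the synced text identical.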

Encounter timeline

Provider

"Let's review your metformin dosage — any side effects with the 500mg?"

Patient

"Some nausea in the morning, but it's getting better."

Provider

"Good. We'll keep the current dose and recheck A1C in 3 months."

Quickstart

Build a medical scribe in minutes

Voice Agent API — recommended

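# Voice Agent API: ambient scribe with Medical Mode
# Requires the third-party `websockets` package. Recent releases take
# `additional_headers`; older releases used `extra_headers` instead.
import asyncio, json, websockets

API_KEY = "YOUR_API_KEY"

def handle(event):
    # Placeholder handler: route transcript.user, reply.audio, tool.call, ... as needed
    print(event.get("type"))

async def run_scribe():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Configure the session: scribe instructions, key terms, and TTS voice
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": (
                    "You are an ambient medical scribe. Listen to the "
                    "encounter and generate a SOAP note when the visit ends."
                ),
                "input": {"keyterms": ["metformin", "lisinopril", "A1C", "Dr. Patel"]},
                "output": {"voice": "ivy"},
            },
        }))
        # Sending encounter audio is omitted here; this loop only reads server events
        async for msg in ws:
            handle(json.loads(msg))

if __name__ == "__main__":
    asyncio.run(run_scribe())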

Universal-3.5 Pro Realtime + LiveKit — BYO stack

# LiveKit + AssemblyAI Medical Mode in a cascading scribe pipeline
from livekit.agents import Agent, AgentSession
from livekit.plugins import assemblyai, cartesia, openai, silero

class MedicalScribe(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are an ambient scribe for Dr. Patel's clinic. "
                "Generate SOAP notes from the encounter transcript."
            ),
        )

async def entrypoint(ctx):
    session = AgentSession(
        stt=assemblyai.STT(
            model="u3-rt-pro",
            domain="medical-v1",                       # Enable Medical Mode
            keyterms_prompt=["metformin", "lisinopril", "A1C", "Dr. Patel"],
            min_turn_silence=800,                      # Clinicians pause to think
            max_turn_silence=2000,                     # Don't fragment chart-review pauses
        ),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )
    await session.start(room=ctx.room, agent=MedicalScribe())

Medical Mode accuracy

87% fewer medical entity errors — correctly captures drug names, dosages, anatomical terms, and ICD codes from ambient exam room audio.

Speaker diarization

Real-time speaker diarization separates provider and patient speech automatically — essential for mapping conversation segments to SOAP note sections.

LLM Gateway

Access 25+ models through one unified API — Claude, GPT, Gemini, and more — for SOAP note generation. Customizable templates for any specialty: primary care, psych, surgery, radiology.

Teams shipping ambient scribes on AssemblyAI

90% reduction in documentation time for clinicians

JotPsych uses AssemblyAI to power ambient clinical documentation for mental health providers — capturing nuanced patient conversations accurately and securely.

30% cost savings on real-time transcription

We require a leading edge speech-to-text provider that can meet our specialized needs: fast, accurate, targeted, and multilingual.

Build your ambient AI scribe today

Free tier, no credit card. From raw clinical audio to structured SOAP notes in an afternoon.