What is the best voice agent API for healthcare apps that need a BAA?

AssemblyAI's Voice Agent API works for healthcare voice agents that process protected health information (PHI). AssemblyAI enables covered entities and their business associates subject to HIPAA to use our services to process PHI, and we offer a standard Business Associate Addendum (BAA) on request via the sales team. One WebSocket replaces separate STT, LLM, and TTS providers at $4.50/hr per session, with in-stream PHI redaction. Universal-3.5 Pro Realtime with Medical Mode reduces medical entity errors by 87% versus the base model — pharma names, dosages, and ICD references come through clean on phone and far-field audio. Teams already running LiveKit, Pipecat, or Vapi can use Universal-3.5 Pro Realtime STT standalone at $0.45/hr.

How do I build a voice agent for telehealth triage?

To build a voice agent for telehealth triage, start with AssemblyAI's Voice Agent API for a fully managed pipeline (~1s end-to-end) or Universal-3.5 Pro Realtime (~150ms P50) with LiveKit or Pipecat. Configure a system prompt for your triage protocol, enable Medical Mode for clinical vocabulary, and add medication names, symptoms, and ICD references as keyterms. Set min_turn_silence to 100ms and max_turn_silence to 3000ms — patients pause longer when recalling symptoms or medications. Use tool calls into your EHR to keep structured fields (chief complaint, allergies, severity) in sync with the conversation, and escalate to a clinician via Twilio SIP when triage rules trigger handoff.
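As a rough illustration of keeping structured fields in sync through tool calls, the sketch below applies tool-call payloads to an in-memory triage record. The tool names (`set_chief_complaint`, `add_allergy`, `set_severity`), their argument keys, and the 1 to 5 severity scale are hypothetical; a real agent would use whatever tool schema you register and would write to your EHR rather than to a dataclass.

```python
from dataclasses import dataclass, field


@dataclass
class TriageRecord:
    """Structured fields collected during the call (hypothetical schema)."""
    chief_complaint: str = ""
    allergies: list[str] = field(default_factory=list)
    severity: int = 0  # 0 = not yet assessed, otherwise 1 (mild) to 5 (emergency)


def apply_tool_call(record: TriageRecord, name: str, args: dict) -> TriageRecord:
    """Apply one tool call to the record and return the updated record."""
    if name == "set_chief_complaint":
        record.chief_complaint = args["text"].strip()
    elif name == "add_allergy":
        allergy = args["substance"].strip().lower()
        if allergy not in record.allergies:  # avoid duplicates on repeated mentions
            record.allergies.append(allergy)
    elif name == "set_severity":
        # Clamp to the assumed 1..5 scale so a bad value cannot corrupt the record.
        record.severity = max(1, min(5, int(args["level"])))
    else:
        raise ValueError(f"unknown tool: {name}")
    return record
```

Rejecting unknown tool names loudly, instead of ignoring them, makes a drifting prompt or schema visible during testing.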

How do I build an AI voice agent for insurance claims intake?

To build an AI voice agent for insurance claims intake, use AssemblyAI's Voice Agent API ($4.50/hr all-in) or Universal-3.5 Pro Realtime ($0.45/hr) with LiveKit, Pipecat, or Vapi. Enable in-stream PHI redaction to mask policy numbers, dates of birth, and addresses before transcripts reach your claims platform. Universal-3.5 Pro Realtime delivers 28% better consecutive-number recognition versus other providers — critical for capturing policy numbers, claim IDs, and provider NPIs over phone audio. Connect via Twilio Media Streams or SIP for telephony, and use tool calls to look up policy details, file FNOL records, and route complex claims to a live adjuster.
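Even with strong number recognition, it can help to validate captured identifiers before filing a claim. The sketch below checks a transcribed provider NPI using the Luhn check digit computed over the number prefixed with 80840. The function names are illustrative and this is not part of any AssemblyAI API; it is post-processing you would run in your own tool-call handler.

```python
def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def npi_valid(transcribed: str) -> bool:
    """Validate a 10-digit NPI taken from transcript text.

    Spaces and dashes in the transcript are ignored; any other length fails.
    """
    digits = "".join(c for c in transcribed if c in "0123456789")
    if len(digits) != 10:
        return False
    return luhn_valid("80840" + digits)
```

A failed check is a cue for the agent to read the number back and ask the caller to confirm it, instead of passing a bad identifier downstream.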

How accurately does AssemblyAI capture medical terminology and medications?

AssemblyAI's Medical Mode reduces medical entity errors by 87% versus the base model — pharmaceutical names, anatomical terms, dosages, and ICD references are recognized on far-field and phone audio. You can also pass a Keyterms prompt to bias decoding toward your specialty's vocabulary (e.g., behavioral health, cardiology, urgent care) without retraining. Speaker diarization separates patient, provider, and staff turns even when speakers move in and out of frame — critical for telehealth visits, ambient documentation, and multi-party intake calls. Universal-3.5 Pro Realtime achieves ~150ms P50 latency, so patients don't notice the model is listening.

Can AssemblyAI sign a BAA for healthcare voice agents?

Yes. AssemblyAI is considered a business associate under HIPAA, and we offer a standard Business Associate Addendum (BAA) that is required for covered entities and their business associates to use our services to process protected health information (PHI). Contact our sales team to execute a BAA. PHI redaction runs in-stream and is configurable per session — enable redact_pii: true to automatically mask PHI (names, DOBs, MRNs, addresses, account IDs) in the transcript, use redact_pii_policies to scope which categories are masked, and redact_pii_audio: true to mute PHI in the audio. AssemblyAI maintains SOC 2 Type 2 certification, AES-128/256 encryption at rest, and TLS 1.2+ in transit. Both the Voice Agent API ($4.50/hr) and Universal-3.5 Pro Realtime STT ($0.45/hr) support BAA-backed deployments.
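A minimal sketch of assembling the redaction settings named above. The three keys (`redact_pii`, `redact_pii_policies`, `redact_pii_audio`) come from the text; the policy names passed in the usage example and the place these settings go in your request are assumptions, so check the current documentation for the supported values.

```python
def redaction_settings(policies, mute_audio=False):
    """Build a per-session redaction config as a plain dict.

    `policies` is an iterable of category names (hypothetical values here).
    """
    policies = sorted(set(policies))  # de-duplicate and keep a stable order
    if not policies:
        raise ValueError("choose at least one redaction policy")
    settings = {
        "redact_pii": True,
        "redact_pii_policies": policies,
    }
    if mute_audio:
        settings["redact_pii_audio"] = True  # also mute PHI in the audio
    return settings


# Example with assumed policy names:
config = redaction_settings(["person_name", "date_of_birth"], mute_audio=True)
```

Building the config in one function keeps the scope of redaction reviewable in a single place, which helps when a compliance team needs to audit what is masked.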

How do I build a healthcare voice agent that transfers to a clinician?

To build a healthcare voice agent that transfers to a clinician, use AssemblyAI's Voice Agent API to run STT, LLM, and TTS over a single WebSocket — your system prompt defines transfer triggers (escalation keywords, high-acuity symptoms, failed tool calls) and the agent hands off via Twilio SIP REFER or whatever your telephony layer supports. The full conversation transcript and detected intent are preserved, so the receiving clinician sees the patient's symptoms, medications, and triage context without re-asking. For BYO stacks, the same handoff pattern works with Universal-3.5 Pro Realtime inside LiveKit or Pipecat — the orchestrator handles the SIP transfer while AssemblyAI keeps transcribing through the warm handoff.
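One way to mirror those transfer triggers in application code is sketched below. The keyword list, the failure threshold, and the shape of the handoff payload are all hypothetical, and the actual SIP transfer is left to your telephony layer.

```python
from dataclasses import dataclass, field

# Hypothetical triggers; tune these to your own triage protocol.
ESCALATION_KEYWORDS = ("chest pain", "can't breathe", "speak to a nurse")
MAX_FAILED_TOOL_CALLS = 2


@dataclass
class CallState:
    transcript: list[str] = field(default_factory=list)
    failed_tool_calls: int = 0


def transfer_reason(state: CallState):
    """Return a short reason string if the call should be handed off, else None."""
    text = " ".join(state.transcript).lower()
    for keyword in ESCALATION_KEYWORDS:
        if keyword in text:
            return f"keyword: {keyword}"
    if state.failed_tool_calls >= MAX_FAILED_TOOL_CALLS:
        return "repeated tool failures"
    return None


def handoff_context(state: CallState, reason: str) -> dict:
    """Bundle what the receiving clinician should see, so nothing is re-asked."""
    return {"reason": reason, "transcript": list(state.transcript)}
```

Returning the reason alongside the transcript lets the clinician's screen show why the call arrived as well as what was said.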

Solutions

Voice agents for healthcare

Automate scheduling, insurance verification, and patient intake with AI voice agents powered by the fastest, most accurate medical speech-to-text. Build end-to-end with our Voice Agent API, or drop Universal-3.5 Pro Realtime with Medical Mode into your existing stack.


The problem

Phone trees and paper intake are losing patients

Patients abandon long IVR menus and fill out paper intake forms for tasks that should take seconds. Generic speech-to-text mangles medication names, dosages, and ICD references, and PHI handling is an afterthought instead of a primitive. Modern healthcare voice agents, built on medical-grade streaming STT with in-stream PHI redaction, a managed LLM, and natural TTS, handle patient intake, telehealth triage, appointment scheduling, and prescription refills before a human staff member ever picks up. They replace legacy IVR, free clinical staff from low-acuity calls, and capture structured data into the EHR in the same conversation.

Why accuracy matters in healthcare voice

Medical entities: 87%

Fewer medical entity errors with Medical Mode for pharma, anatomy, and ICD references.

Latency: ~150ms

P50 streaming latency, for natural turn-taking with patients.

Compliance: BAA

Business Associate Addendum available for covered entities processing PHI. SOC 2 Type 2 certified.

Scale: 40TB+

Audio processed daily in production across healthcare and contact-center workloads.

Two ways to build

Pick the API that fits your healthcare stack

Ship a working intake or triage agent in an afternoon, or drop medical-grade STT into the orchestrator you already run.

Recommended

Voice Agent API

Our proprietary voice stack via one WebSocket. Connect, stream patient audio in, get audio back — we handle the rest.

Best for

Best-in-class healthcare voice agents — the preferred way to build with AssemblyAI
Patient intake, telehealth triage, prescription refills, appointment scheduling, post-discharge follow-up
Teams shipping fast — working agent in an afternoon, no infra to manage
Business Associate Addendum (BAA) available, in-stream PHI redaction

$4.50/hr — speech, LLM, and voice all included

Get started for free

Free tier available · No credit card required

Bring Your Own Stack

Universal-3.5 Pro Realtime STT API

The STT layer for your cascading voice agent architecture. Works natively with LiveKit, Pipecat, and Twilio.

Best for

Teams already using LiveKit, Pipecat, or Vapi as their orchestration layer
Cascading architectures (STT → LLM → TTS) with EHR tool calls
High-scale deployments where margin and full control matter
Workflows with RAG over clinical notes or proprietary LLMs
BAA-eligible, SOC 2 Type 2 — bring your own compliance infrastructure

$0.45/hr — transcription only, unlimited concurrent streams

View integration docs

No concurrency caps · Autoscaling included
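To compare the two options on price, a back-of-the-envelope estimate can help. The sketch below uses the hourly rates quoted above; the call volume and average call length are made-up inputs, and the STT-only figure excludes whatever you pay separately for an LLM, TTS, and orchestration.

```python
from decimal import Decimal

VOICE_AGENT_RATE = Decimal("4.50")  # $/hr: speech, LLM, and voice included
STT_ONLY_RATE = Decimal("0.45")     # $/hr: transcription only


def monthly_cost(calls: int, avg_minutes, rate_per_hour: Decimal) -> Decimal:
    """Estimated monthly spend in dollars, rounded to cents."""
    hours = Decimal(calls) * Decimal(str(avg_minutes)) / Decimal(60)
    return (hours * rate_per_hour).quantize(Decimal("0.01"))


# Hypothetical workload: 10,000 calls a month at 6 minutes each = 1,000 hours.
managed = monthly_cost(10_000, 6, VOICE_AGENT_RATE)
stt_only = monthly_cost(10_000, 6, STT_ONLY_RATE)
```

`Decimal` is used instead of floats so that currency arithmetic does not pick up binary rounding noise.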

Your healthcare voice pipeline

Ingest patient audio

Voice Agent API: single WebSocket. Or Twilio Media Streams / SIP → Universal-3.5 Pro Realtime for a BYO stack.

Clinical-grade transcription

Universal-3.5 Pro Realtime with Medical Mode + keyterm prompting for medications, anatomy, and ICD codes.

PHI redaction in-stream

Names, DOBs, MRNs, addresses, and account IDs masked before transcripts leave the pipeline.

LLM reasoning + EHR tool calls

Triage rules, scheduling, refills, intake — managed (Voice Agent API) or BYO with LiveKit / Pipecat.

Voice response

TTS streamed back to patient. Full round-trip under 1 second with natural turn detection.

Quickstart

Get a working healthcare voice agent in minutes

Voice Agent API — recommended

# Voice Agent API: single WebSocket, full healthcare pipeline
import asyncio
import json

import websockets

API_KEY = "YOUR_API_KEY"

def handle(event):
    # Placeholder handler: replace with real logic per event type
    # (session.ready, transcript.user, reply.audio, tool.call, ...)
    print(event.get("type"))

async def run_agent():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Configure the session: triage instructions, greeting, and output voice
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": (
                    "You are a triage agent for Acme Health. "
                    "Collect symptoms, medications, and insurance."
                ),
                "greeting": "Hi, this is Acme Health. What brings you in today?",
                "output": {"voice": "ivy"},
            },
        }))
        # Each incoming message is a JSON event from the agent
        async for msg in ws:
            handle(json.loads(msg))

if __name__ == "__main__":
    asyncio.run(run_agent())

Universal-3.5 Pro Realtime + LiveKit — BYO stack

# LiveKit + AssemblyAI Universal-3.5 Pro Realtime — healthcare voice agent
from livekit.agents import Agent, AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, cartesia, openai, silero

class IntakeAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a patient intake agent for Acme Health. "
                "Collect symptoms, current medications, allergies, and insurance."
            ),
        )

async def entrypoint(ctx):
    session = AgentSession(
        stt=assemblyai.STT(
            model="u3-rt-pro",
            min_turn_silence=100,
            max_turn_silence=1000,   # starting value in ms; raise it (e.g. toward 3000) when patients pause to recall symptoms or dictate entities
            vad_threshold=0.3,
            keyterms_prompt=["amoxicillin", "metformin", "ICD-10", "PCP referral"],
        ),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(activation_threshold=0.3),  # match AssemblyAI's vad_threshold
        turn_handling=TurnHandlingOptions(
            turn_detection="stt",
            endpointing={"min_delay": 0},  # AssemblyAI handles endpointing; avoid additive delay
        ),
    )
    await session.start(room=ctx.room, agent=IntakeAgent())


Medical-grade accuracy

Universal-3.5 Pro Realtime with Medical Mode reduces medical entity errors by 87% — pharmaceutical names, anatomical terms, dosages, and ICD references come through clean on phone and far-field audio.

PHI redaction + BAA

PHI redaction runs in-stream and is configurable per session. BAA available on request, with SOC 2 Type 2, AES-256 encryption, and optional EU data residency.

Speaker diarization for multi-party visits

Accurately separate patient, provider, and staff turns even when speakers move in and out of frame — critical for telehealth visits, rounds, and ambient documentation.

Companies building healthcare voice agents with AssemblyAI

90% reduction in documentation time for clinicians

JotPsych uses AssemblyAI to power ambient clinical documentation for mental health providers — capturing nuanced patient conversations accurately and securely.

30% cost savings on real-time transcription

We require a leading edge speech-to-text provider that can meet our specialized needs: fast, accurate, targeted, and multilingual.


Build your healthcare voice agent today

Free tier, no credit card. Sign a BAA on request for healthcare workloads processing PHI.

Get started free