Insights & Use Cases
May 12, 2026

How to build an AI scribe for therapy sessions

A Python tutorial for building a therapy-grade AI scribe that captures sessions with Medical Mode, distinguishes therapist from client, and generates DAP notes by voice—no typing required.


Therapy sessions are some of the hardest audio to scribe. Two people talking, often quietly, sometimes overlapping. Long therapeutic pauses that aren't end-of-turn. Clinical terminology mixed with informal speech. Medication names that sound like other medication names. PHI that absolutely cannot leak. And a clinician who needs to be present with their client—not typing.

Most general-purpose transcription products fail at therapy because they were built for meetings or podcasts. Long pauses get treated as session ends. Mumbled mid-sentence corrections lose the original word. Drug names get mangled. And the post-session note-writing burden—DAP, SOAP, or BIRP format depending on the practice—still falls on the therapist.

This tutorial walks through building a therapy-grade AI scribe that does three things well:

  1. Ambient session capture with Universal-3 Pro Streaming and Medical Mode, tuned for therapy-specific speech patterns
  2. Speaker-attributed transcripts that distinguish therapist from client using speaker diarization
  3. Conversational note generation with AssemblyAI's Voice Agent API—the clinician can dictate additions, ask questions about the session, and generate the formatted note all by voice

We'll cover HIPAA considerations throughout. AssemblyAI offers a Business Associate Agreement (BAA) for customers processing PHI, and is SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certified. The architecture below is designed with PHI handling in mind—but a BAA is a contractual prerequisite for using this with real patient data.

The architecture

Two phases, two AssemblyAI products, one continuous pipeline:

Phase 1—During the session (ambient):

  • Streaming speech-to-text captures the conversation with domain="medical-v1" (Medical Mode) and speaker_labels=true
  • Tuned turn-detection settings prevent therapeutic pauses from triggering false end-of-turns
  • Transcripts are buffered locally—never written to disk in plaintext outside the clinician's environment

Phase 2—After the session (conversational):

  • The clinician interacts with a Voice Agent API session that has the transcript loaded as context
  • The agent can read back specific moments, generate DAP/SOAP/BIRP notes, identify themes, suggest follow-ups, and accept dictated additions
  • Notes are produced as structured output and pushed to the EHR

The clinician never writes a note. They speak to the scribe.

What you'll build

A Python prototype that:

  1. Captures a live therapy session via microphone with Medical Mode and speaker diarization
  2. Persists a structured transcript with therapist/client roles
  3. Spins up a post-session Voice Agent that can answer questions about the session and generate a DAP note via tool call

Stack:

  • AssemblyAI Universal-3 Pro Streaming with Medical Mode
  • AssemblyAI Voice Agent API (post-session note generation)
  • AssemblyAI LLM Gateway (structured note generation)
  • Python 3.9+

pip install assemblyai websockets pyaudio python-dotenv requests


# .env
ASSEMBLYAI_API_KEY=your_key_here

Before using with real PHI: request a BAA from AssemblyAI, run the pipeline in your own controlled environment, and ensure the transcript persistence layer meets your organization's PHI requirements.

Step 1: Capture the session with Medical Mode

The default streaming configuration is tuned for fast-paced voice agent dialogues. Therapy is the opposite—long therapeutic pauses are normal and meaningful. The Medical Mode docs recommend conservative turn detection settings for clinical audio: raise min_turn_silence to 800 ms and max_turn_silence to 3600 ms.

import os
from dotenv import load_dotenv
from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingParameters,
    StreamingEvents,
    BeginEvent,
    TurnEvent,
    TerminationEvent,
    StreamingError,
)

load_dotenv()
API_KEY = os.getenv("ASSEMBLYAI_API_KEY")

# In-memory transcript buffer for the live session.
# In production: write to encrypted storage with PHI safeguards.
session_transcript = []

def on_begin(client: StreamingClient, event: BeginEvent):
    print(f"Session started: {event.id}")

def on_turn(client: StreamingClient, event: TurnEvent):
    if not event.end_of_turn:
        return
    speaker = event.speaker_label or "unknown"
    role = "Therapist" if speaker == "A" else "Client"
    entry = {"role": role, "text": event.transcript}
    session_transcript.append(entry)
    print(f"[{role}] {event.transcript}")

def on_error(client: StreamingClient, error: StreamingError):
    print(f"STT error: {error}")

def on_terminated(client: StreamingClient, event: TerminationEvent):
    print(f"Session ended. Total turns: {len(session_transcript)}")

def start_session_capture():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=API_KEY,
            api_host="streaming.assemblyai.com",
        )
    )
    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Error, on_error)
    client.on(StreamingEvents.Termination, on_terminated)

    client.connect(
        StreamingParameters(
            speech_model="u3-rt-pro",
            sample_rate=16000,
            domain="medical-v1",
            speaker_labels=True,
            min_turn_silence=800,
            max_turn_silence=3600,
            keyterms_prompt=[
                "cognitive behavioral therapy", "CBT", "EMDR",
                "GAD-7", "PHQ-9", "DSM-5",
                "sertraline", "fluoxetine", "buspirone",
                "panic attack", "intrusive thought",
            ],
        )
    )
    return client
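The configuration above opens the connection but doesn't feed it audio. A minimal way to wire in the microphone—this mirrors the SDK's standard microphone-streaming pattern and uses the pyaudio dependency installed earlier; in production you'd swap in your telehealth platform's audio path instead:

import assemblyai as aai

def run_capture():
    client = start_session_capture()
    try:
        # Stream raw microphone audio into the live session until interrupted.
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        # Close the streaming session cleanly and flush any pending turns.
        client.disconnect(terminate=True)

if __name__ == "__main__":
    run_capture()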

A few details worth pointing out:

  • Medical Mode (domain="medical-v1") is purpose-built for clinical accuracy. It corrects medication names like lisprohumalog to Lispro (Humalog)—the standard medical convention—and improves accuracy on procedures, conditions, and dosages. In therapy it covers SSRIs, common assessments, and DSM-5 terminology.
  • speaker_labels=true tags each turn with a speaker ID. With two-person therapy sessions, that maps cleanly to therapist (Speaker A) and client (Speaker B). For couples or family therapy, see the speaker identification docs for additional speakers—a role-mapping sketch follows this list.
  • keyterms_prompt boosts recognition for terms specific to your modality. For trauma practices, add EMDR-specific vocabulary; for psychiatric practices, add the medication formulary you actually prescribe.
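For multi-party sessions, the hardcoded Speaker A/Speaker B ternary in on_turn stops being enough. A small role map generalizes it—the ROLE_MAPS structure and resolve_role helper below are illustrative, and the label-to-role assignments should be verified against your own session audio:

# Map raw diarization labels to clinical roles per session type.
# Illustrative only — verify label assignments against real audio.
ROLE_MAPS = {
    "individual": {"A": "Therapist", "B": "Client"},
    "couples": {"A": "Therapist", "B": "Partner 1", "C": "Partner 2"},
}

def resolve_role(speaker_label: str, session_type: str = "individual") -> str:
    """Translate a diarization label into a clinical role."""
    return ROLE_MAPS.get(session_type, {}).get(
        speaker_label, f"Speaker {speaker_label}"
    )

In on_turn, role = resolve_role(speaker, session_type) then replaces the hardcoded ternary.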

What you don't see here but matters in production: the transcript buffer should never persist outside an environment covered by your BAA. In-memory during the session, encrypted-at-rest after—and only if your storage layer meets your organization's PHI requirements.
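What "encrypted-at-rest" can look like in practice: a minimal sketch using the cryptography package's Fernet (authenticated symmetric encryption; pip install cryptography). Key management—where SCRIBE_KEY comes from, rotation, access control—is assumed to be handled by your KMS, and it's the part that actually determines whether this meets your PHI requirements:

import json
import os
from cryptography.fernet import Fernet

# Assumes SCRIBE_KEY is a Fernet key provisioned by your KMS — never hardcode it.
fernet = Fernet(os.environ["SCRIBE_KEY"])

def persist_encrypted(transcript: list, path: str) -> None:
    """Serialize the session transcript and write it encrypted at rest."""
    ciphertext = fernet.encrypt(json.dumps(transcript).encode("utf-8"))
    with open(path, "wb") as f:
        f.write(ciphertext)

def load_encrypted(path: str) -> list:
    """Decrypt and deserialize a persisted transcript."""
    with open(path, "rb") as f:
        return json.loads(fernet.decrypt(f.read()))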

Purpose-built for clinical audio

AssemblyAI's Medical Mode delivers clinical-grade accuracy on medication names, assessment instruments, and DSM-5 terminology—designed for therapy and behavioral health workflows.

Sign up free

Step 2: Add PII redaction for the persisted transcript

For research, audit, or training purposes you may want to keep a redacted version of the session. AssemblyAI's PII redaction handles this for streaming transcripts, and the options get richer when you re-process the recorded session audio post-session with the higher-accuracy pre-recorded models:

import requests

def redact_post_session(audio_url: str):
    """
    Re-process the recorded session audio with PII redaction
    and entity detection. Use this for the persisted record.
    """
    response = requests.post(
        "https://api.assemblyai.com/v2/transcript",
        headers={"authorization": API_KEY},
        json={
            "audio_url": audio_url,
            "speech_models": ["universal-3-pro", "universal-2"],
            "domain": "medical-v1",
            "speaker_labels": True,
            "speakers_expected": 2,
            "redact_pii": True,
            "redact_pii_policies": [
                "person_name",
                "date_of_birth",
                "phone_number",
                "email_address",
                "us_social_security_number",
                "credit_card_number",
            ],
            "redact_pii_sub": "hash",
            "redact_pii_audio": True,    # Returns redacted audio URL with PII silenced
            "entity_detection": True,
        },
    )
    return response.json()

The redact_pii_audio: True flag returns a redacted version of the audio file with PII silenced—essential if you ever need to share clips for supervision or QA.
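The create call returns immediately with a transcript ID while the job runs asynchronously. A standard polling loop retrieves the finished result:

import time
import requests

def wait_for_transcript(transcript_id: str, poll_seconds: int = 3) -> dict:
    """Poll the transcript endpoint until the job completes or errors."""
    while True:
        result = requests.get(
            f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
            headers={"authorization": API_KEY},
        ).json()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result["error"])
        time.sleep(poll_seconds)

# Usage: redacted = wait_for_transcript(redact_post_session(audio_url)["id"])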

Step 3: Generate the DAP note with LLM Gateway

DAP (Data, Assessment, Plan) is the dominant therapy note format in mental health practices, alongside SOAP and BIRP. LLM Gateway with structured outputs gives you predictable JSON output that maps directly into your EHR. The chat completions endpoint handles the request.

import requests
import json

DAP_SCHEMA = {
    "type": "object",
    "properties": {
        "data": {
            "type": "string",
            "description": "Objective observations from the session — what the client said, body language, presentation. No interpretation.",
        },
        "assessment": {
            "type": "string",
            "description": "Clinical assessment — themes, progress toward goals, risk factors, response to interventions.",
        },
        "plan": {
            "type": "string",
            "description": "Clinical plan — interventions used, homework assigned, next session focus.",
        },
        "risk_flags": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Any safety concerns, suicidal/homicidal ideation, or escalations noted.",
        },
        "themes": {
            "type": "array",
            "items": {"type": "string"},
            "description": "2-4 short clinical themes (e.g. 'workplace anxiety', 'sleep disruption').",
        },
    },
    "required": ["data", "assessment", "plan", "risk_flags", "themes"],
}

def generate_dap_note(transcript: list) -> dict:
    transcript_text = "\n".join(
        f"{entry['role']}: {entry['text']}" for entry in transcript
    )

    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": API_KEY},
        json={
            "model": "claude-sonnet-4-6",
            "messages": [
                {
                    "role": "system",
                    "content": (
                        "You are an experienced clinical documentation assistant. "
                        "You generate DAP-format therapy notes from session transcripts. "
                        "Be precise, professional, and clinically appropriate. "
                        "Never invent information not present in the transcript. "
                        "If the transcript indicates risk (SI/HI, abuse, etc.), "
                        "flag it explicitly in risk_flags."
                    ),
                },
                {"role": "user", "content": f"Generate a DAP note for this session:\n\n{transcript_text}"},
            ],
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": "dap_note", "schema": DAP_SCHEMA, "strict": True},
            },
            "fallbacks": [
                {"model": "gpt-5.1"},
                {"model": "gemini-2.5-pro"},
            ],
            "fallback_config": {"depth": 2},

        },
    )

    return json.loads(response.json()["choices"][0]["message"]["content"])

Three production-grade choices baked into this:

  • Structured output via JSON schema. Guarantees the response shape—your EHR mapping code never has to parse freeform clinical text.
  • fallbacks chain. If the primary Claude model is overloaded, the Gateway transparently retries with GPT-5.1, then Gemini 2.5 Pro. A clinician finishing their day shouldn't have to wait or retry because of an upstream provider hiccup.
  • Explicit "never invent" instruction. The single most important sentence in the system prompt. Without it, LLMs sometimes generate plausible-sounding but fabricated clinical content. With it, the model errs toward shorter, more conservative fields.

The note is generated. The clinician reviews. Nothing goes into the patient record without explicit approval—that's a non-negotiable design pattern for any AI medical scribe.
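Wiring that review gate into the pipeline takes a few lines. The input() prompt below is a stand-in for whatever approval UI your practice actually uses—the point is that the approval step is explicit and blocking:

from typing import Optional

def review_and_approve(transcript: list) -> Optional[dict]:
    """Draft the DAP note, surface risk flags, and gate on clinician approval."""
    note = generate_dap_note(transcript)
    print(json.dumps(note, indent=2))
    if note["risk_flags"]:
        print("RISK FLAGS: " + "; ".join(note["risk_flags"]))
    # Nothing reaches the EHR without explicit clinician sign-off.
    if input("Approve this note? [y/N] ").strip().lower() == "y":
        return note
    return None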

Try the Voice Agent API live

See real-time speech-to-text, LLM reasoning, and voice output working together—experience the same pipeline that powers clinical scribe workflows.

Try playground

Step 4: Conversational scribe interaction with the Voice Agent API

Now the part that makes this a true scribe rather than a transcription tool: after the session, the clinician opens a Voice Agent connection and just talks to the scribe.

The agent is configured with the session transcript loaded as system context, plus tools for the actions a therapist actually wants to take post-session: generate the note, search the transcript, add a dictated addendum, push to the EHR.

SCRIBE_TOOLS = [
    {
        "type": "function",
        "name": "generate_dap_note",
        "description": "Generate a DAP-format therapy note from the current session transcript.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
    {
        "type": "function",
        "name": "search_session",
        "description": "Search the session transcript for what the client said about a topic.",
        "parameters": {
            "type": "object",
            "properties": {"topic": {"type": "string"}},
            "required": ["topic"],
        },
    },
    {
        "type": "function",
        "name": "add_addendum",
        "description": "Add a dictated addendum to the session note (e.g., a clinical observation the therapist remembered after the fact).",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
    {
        "type": "function",
        "name": "save_to_ehr",
        "description": "Save the finalized note to the EHR. Only call after the clinician confirms.",
        "parameters": {
            "type": "object",
            "properties": {"confirmation": {"type": "string"}},
            "required": ["confirmation"],
        },
    },
]
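Declaring the tools is half the job—your application still executes them when the agent emits a tool call. A dispatch sketch follows; search_session here is a naive substring match (swap in embeddings if you need semantic recall), and push_note_to_ehr is a hypothetical stub standing in for your EHR integration:

addenda = []

def push_note_to_ehr(confirmation: str) -> str:
    """Hypothetical stub — replace with your EHR integration (FHIR, vendor API)."""
    return "Saved to EHR."

def handle_tool_call(name: str, args: dict) -> str:
    """Execute a scribe tool call and return a result string for the agent."""
    if name == "generate_dap_note":
        return json.dumps(generate_dap_note(session_transcript))
    if name == "search_session":
        topic = args["topic"].lower()
        hits = [e for e in session_transcript if topic in e["text"].lower()]
        return json.dumps(hits) if hits else "No mention of that topic found."
    if name == "add_addendum":
        addenda.append(args["text"])
        return "Addendum recorded."
    if name == "save_to_ehr":
        return push_note_to_ehr(args["confirmation"])
    return f"Unknown tool: {name}"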

The system prompt for the scribe agent:

def scribe_system_prompt(transcript: list) -> str:
    transcript_text = "\n".join(f"{e['role']}: {e['text']}" for e in transcript)
    return f"""
You are a clinical documentation assistant for a licensed therapist. The therapist
just finished a session. Below is the session transcript.

Your role:
- Help the therapist generate, refine, and finalize the session note.
- Answer questions about what was said in the session.
- Accept dictated addenda.
- Never invent content that wasn't in the session.
- Push to EHR ONLY when the therapist explicitly confirms.

Tone: professional, brief, never chatty. The therapist is between clients.

Session transcript:
---
{transcript_text}
---
"""

When the clinician finishes the session, they open the scribe Voice Agent connection. A natural interaction looks like:

Therapist: "Generate the DAP note for this session."
Scribe:    "One moment... I have a draft. Want me to read it, or
            should I push it to the EHR?"
Therapist: "Read me the assessment section."
Scribe:    "Assessment: client reported continued anxiety triggered by
            workplace conflict, with a notable improvement in sleep over
            the past two weeks following sleep hygiene homework. Engaged
            well with cognitive restructuring exercise around catastrophic
            thinking. No safety concerns noted."
Therapist: "Add an addendum: client mentioned a possible elevation in
            symptoms after the meeting on Friday — flag for next session."
Scribe:    "Added. Anything else?"
Therapist: "Push it to the EHR."
Scribe:    "Confirming — should I save this note to the EHR?"
Therapist: "Yes, save it."
Scribe:    "Saved."

The Voice Agent API connection is the same one we covered in the support and shopping tutorials—wss://agents.assemblyai.com/v1/ws, system prompt, tools, voice. The difference is that the system prompt carries the session transcript as context, and the tools are scribe-specific.
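A minimal connection sketch with the websockets package is below. The session-configuration and tool-call message shapes are assumptions based on the pattern from those tutorials—treat every field name here as illustrative and confirm against the Voice Agent API reference before relying on it:

import asyncio
import websockets

async def open_scribe_session(transcript: list):
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": API_KEY},  # extra_headers on older websockets releases
    ) as ws:
        # Illustrative config message — field names are assumptions, not the documented schema.
        await ws.send(json.dumps({
            "system_prompt": scribe_system_prompt(transcript),
            "tools": SCRIBE_TOOLS,
        }))
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "tool_call":  # assumed event shape
                result = handle_tool_call(event["name"], event.get("arguments", {}))
                await ws.send(json.dumps({"type": "tool_result", "result": result}))

asyncio.run(open_scribe_session(session_transcript))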

What this changes for therapists

Three things, in order of how often clinicians mention them:

  • Eye contact comes back. When the scribe is ambient and the note is generated by voice afterward, the therapist isn't typing during the session. Clients notice. Multiple ambient AI scribe deployments in adjacent specialties report patient engagement gains tied to therapists not staring at a screen.
  • "Pajama time" goes away. The hours of post-session note-writing—done at home after the kids are in bed—collapse into a five-minute review of an AI-drafted note. The Permanente Medical Group has reported thousands of physician hours saved at scale with ambient AI scribes; the same principle applies to mental health caseloads.
  • The note quality is consistent. Variability in note quality across a long day is real. An AI scribe writes the 11am note with the same care as the 5pm note.

What to harden before clinical use

  • BAA in place. Do not run with real PHI without a Business Associate Agreement signed with AssemblyAI.
  • Encrypted transcript storage. Session transcripts are PHI. Storage layer needs to match.
  • Clinician sign-off on every note. AI-drafted, human-approved. No exceptions.
  • Audit trail. Every note edit, addendum, and save-to-EHR action should be logged to a tamper-evident store—see the sketch after this list.

  • Retention policy. Document explicitly how long transcripts are kept, who can access them, and when they're deleted.
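For the audit-trail requirement, a hash-chained append-only log is a lightweight way to get tamper evidence before you stand up a proper WORM store—each entry commits to the previous entry's hash, so any retroactive edit breaks the chain. A sketch:

import hashlib
import json
import time

def append_audit_event(log_path: str, action: str, detail: str) -> None:
    """Append a hash-chained audit event; editing history breaks the chain."""
    try:
        with open(log_path) as f:
            prev_hash = json.loads(f.readlines()[-1])["hash"]
    except (FileNotFoundError, IndexError):
        prev_hash = "genesis"
    event = {"ts": time.time(), "action": action, "detail": detail, "prev": prev_hash}
    event["hash"] = hashlib.sha256(
        json.dumps(event, sort_keys=True).encode()
    ).hexdigest()
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

# Usage: append_audit_event("audit.log", "save_to_ehr", "note approved by clinician")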

Where to take it from there

The same pattern extends to other formats—SOAP notes for primary care therapy referrals, BIRP for behavioral health, group therapy notes with multi-speaker diarization. Swap the schema, swap the keyterms, keep the architecture.
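For example, switching from DAP to SOAP is just a schema swap—pass SOAP_SCHEMA (sketched below with illustrative field descriptions) into the same response_format block in generate_dap_note:

SOAP_SCHEMA = {
    "type": "object",
    "properties": {
        "subjective": {
            "type": "string",
            "description": "Client-reported symptoms, concerns, and history.",
        },
        "objective": {
            "type": "string",
            "description": "Observable presentation, affect, and behavior.",
        },
        "assessment": {
            "type": "string",
            "description": "Clinical interpretation and progress toward goals.",
        },
        "plan": {
            "type": "string",
            "description": "Interventions, homework, and next-session focus.",
        },
    },
    "required": ["subjective", "objective", "assessment", "plan"],
}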

For practices that want a single deployment:

  • Solo practitioners can run the whole pipeline locally and push to a single EHR
  • Group practices can host the scribe centrally with per-clinician auth and route to multiple EHRs via the save_to_ehr tool
  • Telehealth platforms can wire the streaming STT into the existing call audio path with no change to the client-facing UI—an approach that works well for medical speech-to-text in virtual care settings

The architecture is the same. The integrations differ.

The shape of clinical documentation is changing. The therapists who get the time back are the ones whose practices ship a scribe that actually works in their workflow—and the workflow is voice in, voice out, EHR-ready note in the middle.

Build your clinical AI scribe

Get started with $50 in free credits and build a therapy scribe with Medical Mode, speaker diarization, and the Voice Agent API.

Sign up free

Frequently asked questions

How do I build an AI scribe for therapy sessions?

Build it as a two-phase pipeline: ambient session capture with AssemblyAI's Universal-3 Pro Streaming model in Medical Mode (domain="medical-v1") with speaker diarization, then post-session note generation through LLM Gateway with a structured JSON schema for DAP, SOAP, or BIRP format. The clinician interacts with the scribe conversationally through the Voice Agent API after the session—dictating addenda, asking questions about the transcript, and pushing the finalized note to the EHR by voice.

Does AssemblyAI support HIPAA workloads for therapy and clinical documentation?

AssemblyAI offers a Business Associate Agreement (BAA) for customers processing Protected Health Information and is SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certified. A BAA is a contractual prerequisite for using AssemblyAI with real PHI, including therapy session recordings, transcripts, and clinical notes. Contact AssemblyAI sales to start a BAA conversation before deploying into a regulated environment.

What is Medical Mode and why does it matter for therapy transcription?

Medical Mode is an AssemblyAI add-on enabled with domain="medical-v1" that improves transcription accuracy for medical terminology—medication names, procedures, conditions, dosages, and clinical assessments. For therapy specifically, it correctly transcribes SSRIs (sertraline, fluoxetine), assessment instruments (GAD-7, PHQ-9), and DSM-5 terminology that general-purpose transcription models frequently mishear. It works with all of AssemblyAI's pre-recorded and streaming speech-to-text models.

How do I handle the long pauses common in therapy sessions without triggering false end-of-turns?

Raise the streaming turn-detection thresholds—set min_turn_silence to 800 ms and max_turn_silence to 3600 ms (the conservative settings recommended in the Medical Mode docs). The default streaming configuration is tuned for fast-paced voice agent dialogues, which would incorrectly fragment therapeutic pauses into separate turns. With the conservative settings, clients can pause for several seconds to think without the system breaking the turn.

Can an AI scribe distinguish between therapist and client in a session recording?

Yes—enable speaker diarization with speaker_labels=True and the streaming API will tag each turn with a speaker ID. For two-person therapy sessions, that maps cleanly to therapist (Speaker A) and client (Speaker B). For couples or family therapy with three or more speakers, AssemblyAI's Speaker Identification feature can identify speakers by their actual name or role.

What format should AI-generated therapy notes use—DAP, SOAP, or BIRP?

DAP (Data, Assessment, Plan) is the dominant format in mental health practices, but SOAP and BIRP are also widely used depending on the modality and payer requirements. With LLM Gateway's structured output support, you define a JSON schema for the format you want and the model returns exactly that shape—making it trivial to switch between DAP, SOAP, and BIRP by swapping schemas. Whichever format you choose, the AI generates a draft and the clinician reviews and approves before anything goes into the patient record.

Can I redact PII from a therapy session transcript?

Yes—enable PII redaction with redact_pii=True and specify the PII policies you want to detect (person names, dates of birth, phone numbers, email addresses, social security numbers, credit card numbers). For pre-recorded post-session re-processing, AssemblyAI also supports redact_pii_audio=True, which returns a redacted version of the audio file with PII silenced—useful if you ever need to share clips for clinical supervision or quality assurance.

How does an AI scribe interact with the clinician after the session?

The clinician opens a Voice Agent API connection and talks to the scribe directly—saying things like "generate the DAP note," "read me the assessment section," "add an addendum: client mentioned elevation in symptoms after Friday's meeting," or "push it to the EHR." The agent has the session transcript loaded as system prompt context and tools registered for note generation, search, addenda, and EHR push, so the entire post-session documentation workflow happens by voice without typing.

AI voice agents
ambient AI scribe
Healthcare