June 23, 2026

Building behavioral health documentation that clinicians trust

Clinicians decide whether to trust a behavioral health scribe on the words that carry clinical weight, not your overall accuracy score. Here are the three things that actually earn it.

Kelsey Foster

Growth

Medical

Trust is a strange thing to engineer into a transcript.

A clinician doesn't sit there grading your word error rate. They glance at the note, see that "sertraline" came through as "sertraline" and not "sir Tralee," and decide—usually in the first session—whether your product is something they can rely on or something they have to babysit. That decision is mostly subconscious, and it's almost always made on the words that carry clinical weight.

So if you're a product manager or founder building a behavioral health scribe, the real question isn't "how accurate is the transcription?" It's "is it accurate on the things a clinician would notice being wrong?" Those are very different bars.

We've written a step-by-step tutorial on building an AI scribe for therapy sessions—the actual code, the upload-and-transcribe loop, the configuration. This post is the companion to that. Less about how to wire it up, more about what earns the clinician's trust once it's wired up. Three things, in my experience, do most of the work.

Accuracy on the words that actually matter

Behavioral health vocabulary is unforgiving in a specific way. Psychiatric medication names are dense, similar-sounding, and dose-dependent. Sertraline, lamotrigine, quetiapine, bupropion—these aren't words a general speech-to-text model has heard a million times, and they sit next to numbers ("up to 200 milligrams," "split the 300 into two doses") where a single transposed entity changes the clinical meaning of the note.

Here's the trap a lot of teams fall into. They benchmark on overall word error rate, see a number that looks great, and ship. But overall WER averages across "the," "and," "um," and "lamotrigine." A model can post a lovely aggregate number while quietly fumbling the one word per minute that a prescriber will actually catch.

That's why we measure Medical Mode on Missed Entity Rate (MER) instead: how often the model drops or mangles a clinically meaningful entity like a drug, a dose, or a diagnosis. Medical Mode delivers a 3.2% MER, the lowest across the providers we benchmarked, and roughly 20% fewer missed medical entities than Universal-3 Pro running on its own. You can see the full breakdown on our benchmarks page.
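To make the gap between the two metrics concrete, here is a minimal sketch that scores one made-up sentence both ways. The entity check is a deliberately simple assumption (a verbatim, case-insensitive phrase match), not the methodology behind the benchmark numbers, and the sample sentence and the misrecognized token are invented for illustration.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level edit distance divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def missed_entity_rate(entities: Sequence[str], hypothesis_text: str) -> float:
    """Share of reference entities that do not appear verbatim in the hypothesis."""
    text = f" {hypothesis_text.lower()} "
    missed = sum(1 for e in entities if f" {e.lower()} " not in text)
    return missed / len(entities)


reference = ("we will slowly increase the lamotrigine up to 200 milligrams "
             "and keep the sertraline right where it is for now").split()
# Same sentence with a single mangled token (hypothetical recognizer output).
hypothesis = ("we will slowly increase the lamotrogene up to 200 milligrams "
              "and keep the sertraline right where it is for now").split()
entities = ["lamotrigine", "200 milligrams", "sertraline"]

wer = word_error_rate(reference, hypothesis)
mer = missed_entity_rate(entities, " ".join(hypothesis))
print(f"WER: {wer:.1%}  missed entities: {mer:.1%}")
# WER: 5.0%  missed entities: 33.3%
```

One wrong word out of twenty looks fine on a dashboard. One wrong drug out of three entities does not, and that is the number a prescriber is implicitly grading.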

Turning it on is one parameter.

import assemblyai as aai

# Replace with your own API key.
aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"],
    domain="medical-v1",  # turns on Medical Mode
)
transcript = transcriber.transcribe("clinical-audio.wav", config)

That domain="medical-v1" flag is the whole activation. It works on Universal-3 Pro for pre-recorded sessions and on Universal-3.5 Pro Realtime for live ones, so your documentation pipeline behaves the same whether you're processing a recorded intake overnight or transcribing a session as it happens.

And when a practice uses regional brand names or a niche formulary that even a medical model wouldn't expect, keyterms prompting lets you hand the model that vocabulary up front. Think of it as telling the model what to listen for before it ever hears the audio.
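In practice the vocabulary usually arrives as a messy export from the practice, so it helps to clean it before handing it over. The sketch below is one way to do that in plain Python. The helper name `build_keyterms`, the cap of 100 terms, and the sample formulary are all hypothetical, and the closing comment assumes the list is passed through a `keyterms_prompt` option, which you should confirm against the current SDK reference.

```python
def build_keyterms(formulary: list[str], max_terms: int = 100) -> list[str]:
    """Normalize, de-duplicate, and cap a practice's vocabulary list."""
    seen: set[str] = set()
    keyterms: list[str] = []
    for raw in formulary:
        term = " ".join(raw.split())   # collapse stray whitespace
        key = term.casefold()          # compare case-insensitively
        if term and key not in seen:
            seen.add(key)
            keyterms.append(term)
    return keyterms[:max_terms]


# Hypothetical export from a practice's formulary.
practice_formulary = ["Lamictal", "lamictal ", "Seroquel XR", "Wellbutrin  XL", ""]
keyterms = build_keyterms(practice_formulary)
print(keyterms)  # ['Lamictal', 'Seroquel XR', 'Wellbutrin XL']

# Assumption: the cleaned list rides along with the Medical Mode options, e.g.
#   aai.TranscriptionConfig(..., domain="medical-v1", keyterms_prompt=keyterms)
```

Keeping the first spelling seen means the practice's own capitalization survives, which matters when the term ends up in a signed note.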

Ready to test accuracy on your own clinical audio? Explore Voice AI for healthcare or grab an API key.

Knowing who said what

Now here's where behavioral health diverges from almost every other medical transcription use case.

A radiology dictation is one voice. A therapy session is at least two—and group, couples, and family sessions can be four, five, six people talking over each other, finishing each other's sentences, going quiet, coming back. If your transcript renders all of that as one undifferentiated wall of text, the clinical value collapses. A note that says "patient reported increased anxiety" is useless when there were three patients in the room and you can't tell which one said it.

This is exactly what speaker diarization solves: segmenting the audio by who's speaking so each utterance is attributed to the right person. In behavioral health it's not a nice-to-have feature sitting next to accuracy. It's part of the accuracy. And on Universal-3.5 Pro Realtime, live diarization gets a second pass: the model labels speakers during the session, then re-clusters every voice when the stream ends and sends a single revision correcting any labels it now knows were wrong. The result is async-grade speaker accuracy within about half a second of the session ending, for up to 10 speakers.
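On the product side, that end-of-stream revision means your note has to tolerate speaker labels changing after the fact. Here is a minimal sketch of the idea, assuming you keep turns keyed by an id and receive corrections as a mapping from turn id to speaker. The `Turn` class, the mapping shape, and the sample dialogue are hypothetical and do not mirror the actual message format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Turn:
    turn_id: int
    speaker: str
    text: str


def apply_revision(turns: list[Turn], corrected: dict[int, str]) -> list[Turn]:
    """Return turns with any corrected speaker labels applied."""
    return [replace(t, speaker=corrected.get(t.turn_id, t.speaker)) for t in turns]


def render(turns: list[Turn]) -> str:
    """Merge consecutive turns from the same speaker into one labeled line."""
    lines: list[str] = []
    last_speaker = None
    for t in turns:
        if t.speaker == last_speaker:
            lines[-1] += " " + t.text
        else:
            lines.append(f"Speaker {t.speaker}: {t.text}")
            last_speaker = t.speaker
    return "\n".join(lines)


live = [
    Turn(1, "A", "How has the week been for each of you?"),
    Turn(2, "B", "I've been more anxious since the move."),
    Turn(3, "B", "I don't think that's fair."),  # live guess; really a third voice
    Turn(4, "A", "Let's slow down for a second."),
]
final = apply_revision(live, {3: "C"})  # hypothetical end-of-stream correction
print(render(final))
```

Without the correction, turns 2 and 3 collapse into a single line from Speaker B, and the note attributes a disagreement to the person who was being disagreed with. Rendering only after the revision lands avoids showing the clinician a draft you already know is wrong.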

JotPsych, which builds documentation tooling for behavioral health, put it plainly. Jackson Bierfeldt, their Cofounder and CTO, told us: "In the medical context, accuracy is highly important…[and] there can be multiple people present. Separating them is key to accuracy. The biggest impact AssemblyAI has had has been in enabling our technical team to focus on workflow-specific features rather than a general speech-to-text pipeline."

Read that last sentence again, because it's the part product leaders should care most about. JotPsych didn't just want accurate diarization. They wanted to stop building and maintaining a speech pipeline at all, so their engineers could spend their time on the things that actually differentiate a behavioral health product. The transcription layer should be something you configure, not something you staff a team around.

NovoPsych, another team building in behavioral health, is solving the same shape of problem—turning sensitive, multi-speaker clinical conversation into structured documentation that a clinician will sign their name to. When the words and the speakers are both right, the clinician's review time drops, and that's where the trust compounds.

Handling sensitive sessions with care

Let's talk about privacy, because behavioral health is about the most sensitive category of data there is, and a clinician's trust evaporates fast if they suspect a session is being handled carelessly.

I'll be direct about what this is and isn't. This isn't the headline of your product—no clinician chose your scribe because of an infrastructure diagram. But it's the floor underneath everything else, and if the floor isn't there, none of the accuracy matters.

AssemblyAI enables covered entities and their business associates subject to HIPAA to use the AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA, and we offer a standard Business Associate Addendum (BAA) that is required under HIPAA to ensure that AssemblyAI appropriately safeguards PHI. The infrastructure is BAA-eligible, and the BAA is included.

On the engineering side, you've got PII redaction built in. With redact_pii, you can strip identifying information out of transcripts—names, contact details, the kinds of identifiers that turn an ordinary transcript into something you have to lock down. For behavioral health, where a single transcript might name a patient, their family members, and their employer in the first two minutes, redaction is a practical tool, not a checkbox.
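A rough sketch of what that choice looks like as configuration follows. Only `redact_pii` comes from the text above. The companion options `redact_pii_policies` and `redact_pii_sub`, and the policy names listed, are assumptions to verify against the API reference, and the helper itself is hypothetical.

```python
import json

# Assumption: these policy names target identifiers, not clinical content.
# Policies such as drug or medical_condition are left out on purpose so the
# note keeps the medication names and diagnoses a clinician needs to review.
IDENTIFIER_POLICIES = [
    "person_name",
    "phone_number",
    "email_address",
    "date_of_birth",
    "organization",
]


def redaction_options(policies: list[str], substitution: str = "entity_name") -> dict:
    """Build the redaction portion of a transcription request."""
    if substitution not in {"entity_name", "hash"}:
        raise ValueError(f"unsupported substitution: {substitution}")
    return {
        "redact_pii": True,
        "redact_pii_policies": list(policies),
        "redact_pii_sub": substitution,
    }


options = redaction_options(IDENTIFIER_POLICIES)
print(json.dumps(options, indent=2))
```

The design choice worth making deliberately is what not to redact. Stripping every health-related term would protect the patient and also erase the reason the note exists.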

The point is that data handling should be designed in from the first session, not retrofitted after your first enterprise customer asks the question in a security review.

Building for sensitive clinical settings? Start with a free API key and review the docs.

The trust isn't in the transcript

Here's the thing I'd leave you with, and it's the part that's easy to miss when you're heads-down on accuracy metrics.

Clinician trust isn't earned in the moment the transcript is correct. It's earned in the moment the clinician stops checking. The first few sessions, they read every line against their memory of what happened. Then one day they skim, sign, and move on—because the medication names have been right, the speakers have been attributed correctly, and nothing has surprised them. That shift from auditing to trusting is the entire ballgame, and it only happens if the foundation is boring and reliable across hundreds of sessions, not impressive in a demo.

Build for the hundredth session, not the first.

See what Medical Mode does on your audio—explore Voice AI for healthcare or get your API key.

Frequently asked questions

How accurate is Medical Mode on psychiatric medication names?

Medical Mode posts a 3.2% Missed Entity Rate—the lowest across the providers we benchmarked—and around 20% fewer missed medical entities than Universal-3 Pro on its own. Because MER measures clinically meaningful entities specifically, including drug names and dosages, it's a better proxy for behavioral health accuracy than overall word error rate. The full numbers live on our benchmarks page.

Can it separate speakers in group or family therapy sessions?

Yes. Speaker diarization segments the audio by speaker and attributes each utterance to the right person, which is what makes a multi-party session usable as documentation rather than a single block of text. On Universal-3.5 Pro Realtime, live labels are refined by an end-of-stream revision for async-grade speaker accuracy, up to 10 speakers. It's the capability JotPsych called out as key to accuracy in the medical context.

Does it work for live sessions, or only recorded ones?

Both. Medical Mode runs on Universal-3 Pro for pre-recorded audio and on Universal-3.5 Pro Realtime for live transcription, so you can build the same accuracy into a real-time scribe and an after-hours batch pipeline with the same one-parameter activation.

How do you handle PHI and HIPAA?

AssemblyAI is considered a business associate under HIPAA and offers a standard Business Associate Addendum, which is required under HIPAA, on BAA-eligible infrastructure. You can also use redact_pii to strip identifying information from transcripts as part of your data handling.

What does Medical Mode cost?

Medical Mode adds $0.15/hr on top of the base model. Universal-3 Pro with Medical Mode comes to $0.36/hr, and Universal-3.5 Pro Realtime with Medical Mode is $0.60/hr. Full details are on the pricing page.
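For a back-of-the-envelope budget, the arithmetic is just minutes times the hourly rate. The sketch below uses the two all-in rates quoted above. The practice size is made up, the dictionary keys are labels for this example only, and it assumes billing is proportional to audio duration, so check the pricing page for how usage is actually metered.

```python
from decimal import Decimal, ROUND_HALF_UP

# All-in hourly rates quoted above (base model plus Medical Mode).
RATES_PER_HOUR = {
    "universal-3-pro": Decimal("0.36"),
    "universal-3.5-pro-realtime": Decimal("0.60"),
}


def monthly_cost(model: str, sessions: int, minutes_per_session: int) -> Decimal:
    """Estimated spend for a month of sessions, rounded to the cent."""
    total_minutes = Decimal(sessions * minutes_per_session)
    cost = total_minutes * RATES_PER_HOUR[model] / Decimal(60)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Hypothetical practice: 400 sessions a month, 50 minutes each.
print(monthly_cost("universal-3-pro", 400, 50))             # 120.00
print(monthly_cost("universal-3.5-pro-realtime", 400, 50))  # 200.00
```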
