Voice AI Meetup recap: How Commure and Ona Health are building for healthcare
Thirty minutes. That's how long a single patient visit's documentation can take a provider—time spent typing, not caring. It's a problem that Cem Torun, VP of Engineering at Commure, knows well. And it's exactly the kind of problem that brought a room full of healthcare builders together in San Francisco for our Voice AI Meetup: Medical Mode.
The event brought together two companies tackling healthcare documentation from different angles—Commure (ambient AI scribing for hospital systems) and Ona Health (a combined CRM/EHR/RCM platform for small to midsize clinics)—alongside a live demo of AssemblyAI's Medical Mode, our purpose-built accuracy layer for clinical terminology, drug names, and the fast, overlapping speech patterns common in healthcare settings.
Ryan Seams, VP of Customer Solutions at AssemblyAI, moderated the panel discussion and Q&A that followed.
Here's what stood out.
Ambient scribing that actually handles the messy reality of care
Commure's ambient AI scribing platform records and transcribes patient visits automatically, removing the documentation burden from providers entirely. But the challenge isn't just transcription—it's the context around it.
Cem described scenarios where multiple speakers are in the room—provider, patient, family members, nurses moving in and out—and where patients switch between languages mid-sentence. Code-switching between English and Spanish (or other language pairs) is common, and getting that wrong means getting the clinical record wrong.
Their platform handles multi-speaker diarization and code-switching while maintaining clinical accuracy. And they recently launched a dictation product specifically for emergency department use cases, where speed and precision collide in ways that stress-test any Voice AI system.
The key insight from Cem: in healthcare, accuracy isn't a nice-to-have. It's the baseline. If the transcript misses a drug name or dosage, everything downstream—the clinical note, the billing code, the treatment plan—breaks.
Turning telehealth conversations into structured data
Tomasz Bachosz, CEO and co-founder at Ona Health, described a different set of challenges. Ona Health offers a combined CRM, EHR, and revenue cycle management platform for small to midsize clinics, and they're using Voice AI to turn telehealth conversations into actionable, structured data.
The focus isn't just on generating a transcript. It's on extracting specific details: prescription names and dosages, patient registration data, insurance member IDs. These aren't nice-to-know details; they're the data that drives patient profiles, insurance claims, and billing workflows.
Tomasz explained how Ona Health uses a post-call processing pipeline that pairs transcription with LLM-based context improvement. The transcription captures the raw conversation. Then an LLM layer refines and structures the output—correcting context, resolving ambiguities, and mapping extracted data to the right fields in the patient record.
It's a pattern we're seeing more healthcare builders adopt: speech-to-text as the foundation, with LLMs layered on top to bridge the gap between what was said and what the system needs to know.
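This pattern is simple to sketch in code. The field names and the stubbed LLM response below are illustrative assumptions, not Ona Health's actual implementation; in production, `llm_complete` would wrap a real model call.

```python
import json

def extract_structured_data(transcript: str, llm_complete) -> dict:
    """Post-call step: ask an LLM to map a raw transcript to structured fields.

    `llm_complete` is any callable that takes a prompt string and returns
    the model's text response (e.g. a wrapper around your LLM provider).
    """
    prompt = (
        "Extract the following fields from this telehealth transcript as JSON:\n"
        "prescriptions (name, dosage), insurance_member_id, patient_name.\n"
        "Use null for anything not mentioned.\n\n"
        f"Transcript:\n{transcript}"
    )
    return json.loads(llm_complete(prompt))

# Stubbed LLM response for illustration only.
def fake_llm(prompt: str) -> str:
    return json.dumps({
        "prescriptions": [{"name": "metformin", "dosage": "500 mg"}],
        "insurance_member_id": "ABC123",
        "patient_name": None,
    })

record = extract_structured_data(
    "Doctor: Let's start you on metformin, 500 milligrams daily...", fake_llm
)
print(record["prescriptions"][0]["name"])  # metformin
```

The key design choice is that the speech-to-text output stays raw and verbatim, while the LLM layer owns all interpretation, so a transcription error and an extraction error can be debugged separately.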
The technical challenges that keep healthcare builders up at night
The panel discussion surfaced several technical challenges that resonated across the room:
Speaker diarization in multi-person clinical environments. Distinguishing who said what is essential for accurate documentation and legal compliance. When a doctor says "increase the dosage to 20mg" versus when a patient says "I was told to take 20mg," the attribution changes the clinical meaning entirely.
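The attribution problem becomes concrete once you look at diarized output. A minimal sketch (the utterance format and speaker-to-role mapping are illustrative; real diarization output comes from your speech-to-text provider, and role assignment is a separate downstream step):

```python
# Diarized utterances: who said what, in order.
utterances = [
    {"speaker": "A", "text": "Increase the dosage to 20mg."},
    {"speaker": "B", "text": "I was told to take 20mg."},
]
roles = {"A": "provider", "B": "patient"}

# Only dosage statements attributed to the provider count as instructions;
# the same words from a patient are reported history, not an order.
orders = [
    u["text"] for u in utterances
    if roles[u["speaker"]] == "provider" and "mg" in u["text"].lower()
]
print(orders)  # ['Increase the dosage to 20mg.']
```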
Clinical hallucinations. Both Commure and Ona Health discussed implementing hallucination detection layers to catch clinically impossible outputs. If a generated note suggests a medication interaction that doesn't exist, or attributes a symptom to the wrong speaker, it needs to be flagged before it hits the EHR. This is one area where healthcare demands a higher bar than almost any other industry.
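One simple form such a detection layer can take is validating extracted entities against reference data before anything reaches the EHR. A sketch under assumed data (the formulary and dosage ranges below are made up for illustration; a real system would load them from a drug database):

```python
# Illustrative reference data: drug name -> plausible daily dose range in mg.
FORMULARY = {"metformin": (250, 2550), "lisinopril": (2.5, 80)}

def flag_note(entities):
    """Return flags for clinically implausible (drug, dose_mg) extractions."""
    flags = []
    for drug, dose_mg in entities:
        if drug not in FORMULARY:
            flags.append(f"unknown drug: {drug}")
        else:
            lo, hi = FORMULARY[drug]
            if not (lo <= dose_mg <= hi):
                flags.append(f"implausible dose for {drug}: {dose_mg} mg")
    return flags

flags = flag_note([("metformin", 500), ("metformin", 50000), ("met for men", 20)])
print(flags)
```

A misrecognition like "met for men" fails the formulary lookup, and an absurd dosage fails the range check, so both get held for human review instead of landing in the record.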
The accuracy-versus-latency tradeoff. The panel was clear: in medical applications, accuracy wins. A 200-millisecond latency improvement means nothing if it introduces errors in drug names or dosage amounts. Both teams prioritize getting the transcript right over getting it fast—though real-time performance is still the goal.
What teams are actually measuring in production
One of the most practical parts of the discussion was around metrics. When you're building clinical documentation tools, what does "good" actually look like?
Two metrics came up repeatedly. First, time spent editing notes. The goal of an ambient scribe isn't to generate a perfect note every time—it's to dramatically reduce the time a physician spends reviewing and correcting documentation after a visit. Both panelists track this as a primary success metric.
Second, F1 scores comparing generated notes versus submitted notes. This measures the overlap between what the system produces and what the provider ultimately signs off on. A high F1 score means the system is capturing the right information in the right structure. A low score means providers are still doing heavy lifting in the editing phase.
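A token-level version of this metric can be computed directly from the two note texts. A minimal sketch using bag-of-words overlap (production systems often compare section by section or entity by entity rather than over whole notes):

```python
from collections import Counter

def note_f1(generated: str, submitted: str) -> float:
    """Token-level F1: harmonic mean of precision and recall on shared tokens."""
    gen = Counter(generated.lower().split())
    sub = Counter(submitted.lower().split())
    overlap = sum((gen & sub).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(sub.values())
    return 2 * precision * recall / (precision + recall)

score = note_f1(
    "patient reports chest pain started yesterday",
    "patient reports chest pain beginning yesterday",
)
print(round(score, 2))  # 0.83
```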
An interesting tangent: both teams noted the importance of provider behavior coaching. The best Voice AI system in the world can't compensate for a provider who mumbles into their chest or walks to the far corner of the room with their back to the microphone. Training providers on basic audio hygiene—speaking clearly, staying within reasonable range, minimizing crosstalk—makes a measurable difference in transcription quality.
Medical Mode: purpose-built for the terminology that matters
During the event, the AssemblyAI team demoed Medical Mode—a capability layered on top of Universal-3 Pro Streaming that's specifically optimized for medical environments.
The problem Medical Mode solves is straightforward: general-purpose speech-to-text models struggle with medical terminology. Drug names like "metformin" get transcribed as "met for men." Dosage instructions get garbled. Anatomical terms and medical acronyms are misrecognized.
Medical Mode reduces missed medical entities by over 20% compared to Universal-3 Pro alone, with a 3.2% missed entity rate—the lowest across all providers we've benchmarked against, including Deepgram, Speechmatics Enhanced Medical, AWS Transcribe Medical, and Google Medical Conversation.
It's built for the realities of clinical environments: far-field audio capture from 20+ feet away, background noise from equipment, overlapping speakers, and the fast speech patterns that are common during patient encounters. And it works across both real-time streaming and pre-recorded audio workflows.
Looking ahead: voice as the interface for healthcare workflows
The conversation wrapped with a forward-looking discussion on where Voice AI in healthcare is heading.
Voice biomarkers came up as a particularly interesting frontier. The idea: using voice characteristics not just to identify speakers, but to surface diagnostic signals. Changes in speech patterns, cadence, or vocal quality could flag cognitive decline, respiratory issues, or mental health changes—turning every conversation into a potential diagnostic touchpoint.
Real-time clinical tools were another theme. Both panelists expressed interest in moving beyond post-visit documentation toward tools that assist during the encounter—surfacing relevant patient history, suggesting follow-up questions, or flagging potential drug interactions as they're discussed. The emerging voice agent paradigm could make these real-time clinical assistants practical at scale.
And the biggest vision: voice-driven workflows where a provider can dictate not just clinical notes, but trigger entire downstream processes. Imagine saying "file the prior authorization for the MRI we discussed" and having the system handle the rest—pulling the relevant clinical justification from the visit transcript, populating the insurance form, and routing it for submission.
We're not there yet. But the building blocks—accurate medical transcription, speaker diarization, structured data extraction, LLM-powered processing—are in place. The teams building on them are the ones shaping what healthcare documentation looks like next.
Frequently asked questions
What is Medical Mode and how does it improve medical transcription accuracy?
Medical Mode is an accuracy layer built on top of AssemblyAI's Universal-3 Pro Streaming model that's purpose-built for clinical terminology, drug names, and medical acronyms. It reduces missed medical entities by over 20% compared to general-purpose speech-to-text, achieving a 3.2% missed entity rate—the lowest across all benchmarked providers.
How does Voice AI handle multiple speakers in a clinical environment?
Speaker diarization technology identifies and separates different voices in a conversation—distinguishing between providers, patients, nurses, and family members even when speakers overlap or move in and out of the room. Accurate speaker attribution is critical in healthcare because it determines the clinical meaning of what was said and by whom.
Can speech-to-text handle code-switching between languages during patient visits?
Yes. Modern Voice AI models can detect and accurately transcribe when speakers switch between languages mid-conversation, such as alternating between English and Spanish. This is particularly important in healthcare settings where patients may be more comfortable describing symptoms or medical history in their primary language.
What is an ambient AI scribe and how does it reduce physician burnout?
An ambient AI scribe is software that passively listens to natural patient-provider conversations and generates structured clinical notes automatically—without the provider needing to type, dictate, or change their workflow. By handling documentation in the background, ambient scribes reduce after-hours charting time and let physicians focus entirely on patient care during visits.
How do healthcare teams measure the quality of AI-generated clinical notes?
Healthcare teams typically track two key metrics: the time physicians spend editing AI-generated notes (lower is better), and F1 scores that compare the AI-generated output against the final notes the provider signs off on. Together, these metrics indicate how much manual work the system eliminates and how closely it captures the right clinical information.
What's the difference between real-time and post-call medical transcription processing?
Real-time transcription generates text as the conversation happens, enabling live documentation and in-visit decision support. Post-call processing applies additional LLM-based refinement after the conversation ends—correcting context, resolving ambiguities, and structuring output into formats that map directly to EHR fields and billing workflows. Many healthcare products use both approaches together.