May 1, 2026

AssemblyAI April 2026 recap

It’s been a packed couple of months. From our most accurate streaming model yet to a full voice agent API, here’s a rundown of everything we launched—and why it matters for what you’re building.

Martin Schweiger
Senior Technical Product Marketing Manager

Universal-3 Pro Streaming

This is the big one. Universal-3 Pro Streaming is our most accurate real-time transcription model, purpose-built for voice agents and live audio workflows.

The numbers speak for themselves: 8.14% word error rate across English—the lowest of any streaming provider we’ve benchmarked against, including Deepgram Nova-3 (11.06%), OpenAI GPT-4o Transcribe (9.90%), and Microsoft Azure (9.11%).

But raw accuracy is only part of the story. The real differentiator is entity recognition. Emails, phone numbers, medical terms, URLs—the structured data that voice agents actually act on—are where most models fall apart. Universal-3 Pro Streaming delivers the lowest missed entity rates across every category we tested.

Key capabilities include:

  • Real-time prompting—guide transcription behavior with natural language instructions
  • Dynamic key term prompting—boost up to 1,000 domain-specific terms, updated turn-by-turn mid-conversation
  • Real-time speaker diarization—identify and separate speakers mid-conversation
  • Sub-200ms end-to-end latency with immutable transcripts
  • Six languages at launch: English, Spanish, French, German, Portuguese, and Italian

Native integrations with LiveKit, Pipecat, Twilio, and Daily mean you can go from sign-up to a production voice agent in under 15 minutes. Pricing starts at $0.45/hr.
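
Here's roughly what wiring that up looks like in Python. The endpoint path, the universal-3-pro query parameter, and the key-term update message below are assumptions made for illustration, not copied from the reference docs, so treat this as a sketch of the shape of a session rather than a drop-in integration.

```python
# Sketch of a streaming session. The endpoint, query params, and the key-term
# message shape are assumptions; check the streaming docs for the real protocol.
import asyncio
import json

import websockets  # pip install websockets

API_KEY = "YOUR_API_KEY"
# Hypothetical endpoint and parameters for illustration only.
URL = "wss://streaming.assemblyai.com/v3/ws?model=universal-3-pro&sample_rate=16000"


async def stream(audio_chunks):
    # On websockets < 14 the keyword is extra_headers instead of additional_headers.
    async with websockets.connect(URL, additional_headers={"Authorization": API_KEY}) as ws:
        # Hypothetical turn-by-turn key term prompting: boost domain terms mid-call.
        await ws.send(json.dumps({
            "type": "update_key_terms",
            "key_terms": ["metoprolol", "prior authorization", "telehealth"],
        }))

        async def send_audio():
            for chunk in audio_chunks:  # raw 16 kHz PCM bytes from your capture pipeline
                await ws.send(chunk)

        async def print_transcripts():
            async for message in ws:
                event = json.loads(message)
                if event.get("transcript"):  # assumed field name for finalized text
                    print(event["transcript"])

        await asyncio.gather(send_audio(), print_transcripts())
```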

Medical Mode

If you’re building anything in healthcare—ambient AI scribes, clinical documentation, telehealth tools—Medical Mode is a game changer.

It’s an add-on that reduces missed medical entities by over 20% compared to Universal-3 Pro alone. Drug names, dosages, diagnoses, anatomical terms—the terminology that directly affects patient outcomes gets transcribed correctly the first time.

Medical Mode works across both real-time streaming and pre-recorded (async) workflows, and it’s available on Universal-3 Pro Streaming. It beats every dedicated medical transcription competitor we benchmarked, including Deepgram, Speechmatics Enhanced Medical, AWS Transcribe Medical, and Google Medical Conversation.

Pricing is $0.15/hr as an add-on. AssemblyAI offers a Business Associate Addendum (BAA) for covered entities handling PHI, and your data is excluded from model training by default.
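
If you want a feel for how the add-on slots in, here's a minimal async request. The medical_mode field is a placeholder name for the toggle, and the universal-3-pro model identifier is assumed, so confirm the actual parameter names in the docs before relying on this.

```python
# Illustrative only: `medical_mode` is an assumed field name for the add-on
# toggle, and the `speech_model` value is assumed; verify both against the docs.
import requests  # pip install requests

headers = {"authorization": "YOUR_API_KEY"}
payload = {
    "audio_url": "https://example.com/clinic-visit.mp3",
    "speech_model": "universal-3-pro",  # assumed model identifier
    "medical_mode": True,               # hypothetical add-on flag
}

response = requests.post(
    "https://api.assemblyai.com/v2/transcript", json=payload, headers=headers
)
print(response.json()["id"])  # poll this transcript ID until processing completes
```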

Voice Agent API

Building a voice agent used to mean stitching together three separate providers—STT, LLM, and TTS—across three SDKs, three invoices, and three debugging surfaces. The Voice Agent API replaces all of that with a single WebSocket.

Stream audio in, get audio back. We handle speech understanding, LLM reasoning, voice generation, turn detection, and interruption handling. You write the system prompt and focus on your product.

What makes it different:

  • Speech-aware turn detection—knows when you’re pausing to think vs. done talking. No more getting cut off mid-sentence.
  • Tool calling—register any function with JSON Schema. The agent calls it when appropriate—look up an account, check an order, trigger a workflow.
  • Live configuration updates—change the system prompt, voice, tools, and VAD settings mid-conversation. No reconnection needed.
  • Session resumption—reconnect within 30 seconds if the WebSocket drops. Context preserved.
  • ~1 second end-to-end latency at $4.50/hr flat—covering STT, LLM, and TTS. That’s roughly 4x cheaper than OpenAI’s Realtime API.

It’s a standard JSON API—no SDK required. Most developers get a working agent running the same afternoon.
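
To make that concrete, here's a sketch of a session over the WebSocket. Every message shape below (session.configure, tool.call, tool.result) is an assumption for illustration; the point is simply that configuration, audio, and tool results all flow over one connection.

```python
# Sketch of a Voice Agent session. The endpoint and every message type below
# (session.configure, tool.call, tool.result) are assumptions for illustration.
import asyncio
import json

import websockets  # pip install websockets

API_KEY = "YOUR_API_KEY"
URL = "wss://voice-agent.assemblyai.com/v1/ws"  # hypothetical endpoint

SESSION_CONFIG = {
    "type": "session.configure",  # hypothetical message type
    "system_prompt": "You are a friendly support agent for Acme Internet.",
    "voice": "default",
    "tools": [{
        "name": "lookup_account",
        "description": "Fetch a customer account by phone number",
        "parameters": {  # standard JSON Schema for the function's arguments
            "type": "object",
            "properties": {"phone": {"type": "string"}},
            "required": ["phone"],
        },
    }],
}


def lookup_account(phone: str) -> dict:
    """Your business logic; the agent calls this when it needs account data (stub)."""
    return {"phone": phone, "plan": "gigabit", "status": "active"}


def play(audio_bytes: bytes) -> None:
    """Hand synthesized speech to your playback device (stub)."""


async def run_agent(mic_chunks):
    # On websockets < 14 the keyword is extra_headers instead of additional_headers.
    async with websockets.connect(URL, additional_headers={"Authorization": API_KEY}) as ws:
        await ws.send(json.dumps(SESSION_CONFIG))

        async def send_audio():
            for chunk in mic_chunks:  # raw PCM from your microphone
                await ws.send(chunk)

        async def handle_events():
            async for message in ws:
                if isinstance(message, bytes):  # synthesized speech comes back as binary frames
                    play(message)
                    continue
                event = json.loads(message)
                if event.get("type") == "tool.call":  # hypothetical event name
                    result = lookup_account(**event["arguments"])
                    await ws.send(json.dumps({
                        "type": "tool.result",
                        "id": event.get("id"),
                        "result": result,
                    }))

        await asyncio.gather(send_audio(), handle_events())
```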

Docs, MCP Server, and Claude Code Skill

We shipped a set of developer experience improvements that make it faster to build with AssemblyAI, especially from AI-native workflows:

  • Refreshed documentation—restructured for clarity, with better code examples and integration guides across all products.
  • MCP Server—a Model Context Protocol server that lets AI tools like Claude Code and other MCP-compatible agents interact directly with the AssemblyAI API. Transcribe audio, search transcripts, and run LLM Gateway queries without leaving your coding environment.
  • Claude Code Skill—copy our docs into Claude Code and build working integrations from a natural language description. The Voice Agent API was specifically designed to work well with this workflow.

The goal: if you can describe what you want to build, you should be able to ship it in an afternoon.

New models on the LLM Gateway

LLM Gateway is an OpenAI-compatible API that lets you apply 20+ LLMs to your transcripts through a single endpoint. This month we added four new models:

  • Claude Opus 4.7 (Anthropic)
  • Kimi 2.5 (Moonshot AI)
  • Qwen 3 (Alibaba)
  • GPT 5.5 (OpenAI)

Swap models by changing the model field in your API call—no other code changes required. One API key, unified billing, and no provider prefixes in the model ID.
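
Because the Gateway speaks the OpenAI API shape, the standard openai client works against it. The base URL and exact model IDs below are my best guess at the naming, so take the real values from the Gateway docs.

```python
# The Gateway is OpenAI-compatible, so the standard client works; the base URL
# and exact model IDs here are assumptions -- take the real values from the docs.
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://llm-gateway.assemblyai.com/v1",  # assumed Gateway URL
    api_key="YOUR_ASSEMBLYAI_API_KEY",
)

transcript_text = open("call_transcript.txt").read()

response = client.chat.completions.create(
    model="claude-opus-4.7",  # swap to "kimi-2.5", "qwen-3", or "gpt-5.5" -- nothing else changes
    messages=[
        {"role": "system", "content": "Summarize this call transcript in three bullet points."},
        {"role": "user", "content": transcript_text},
    ],
)
print(response.choices[0].message.content)
```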

The Gateway already supported Claude, GPT, and Gemini families. With Kimi and Qwen now in the mix, you have even more flexibility to pick the right model for your use case—whether that’s cost optimization, multilingual support, or reasoning quality.

Universal-2 improvements: Hebrew and Swedish

Universal-2 now supports Hebrew and Swedish, expanding language coverage for teams building multilingual products on the async transcription API.

These aren’t beta-quality additions—they’re production-ready, with the same accuracy and feature support (speaker diarization, entity detection, sentiment analysis) available for other Universal-2 languages.
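
Requesting one of the new languages should look the same as any other Universal-2 language in the Python SDK; the "he" and "sv" codes below assume the standard ISO identifiers, so double-check them against the supported-languages table.

```python
# Async transcription in one of the new languages, using the Python SDK.
# The "he" / "sv" codes assume standard ISO identifiers -- confirm in the docs.
import assemblyai as aai  # pip install assemblyai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(language_code="he")  # "sv" for Swedish
transcript = aai.Transcriber().transcribe(
    "https://example.com/hebrew-interview.mp3", config
)
print(transcript.text)
```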

PII Redaction updates

We made improvements to our PII Redaction capabilities for both text and audio. PII Text Redaction identifies and removes personally identifiable information—phone numbers, social security numbers, names, addresses—from the transcription text before it’s returned to you. PII Audio Redaction does the same directly in the audio file itself.

The updates improve detection coverage and accuracy, especially for edge cases around formatted entities like phone numbers with varied separators and multi-part names. If you’re building in healthcare, finance, or any regulated industry, these improvements mean fewer manual review passes.
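
Enabling both text and audio redaction on an async job looks like this in the Python SDK. The three policies listed are just a sample of the categories mentioned above; the full policy list lives in the docs.

```python
# Text + audio PII redaction on an async job via the Python SDK; the policies
# shown are a sample of the available categories -- see the docs for the full list.
import assemblyai as aai  # pip install assemblyai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_audio=True,  # also produce a redacted copy of the audio file
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.phone_number,
        aai.PIIRedactionPolicy.us_social_security_number,
    ],
)

transcript = aai.Transcriber().transcribe(
    "https://example.com/support-call.mp3", config
)
print(transcript.text)  # names, numbers, and other matched entities replaced in the returned text
```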

Start building

That’s the full list. Whether you’re adding real-time transcription to a voice agent, building clinical documentation tools, or just want to run your call recordings through an LLM—there’s something here for you.

Watch the full recap video to see each feature in action, or sign up for a free API key and start building today. You get $50 in free credits—no credit card required.
