The best way to build Voice AI apps

Today’s top Voice AI companies rely on AssemblyAI’s speech-to-text and speech understanding models to launch groundbreaking products fast and scale with ease.

Streaming Speech-to-Text
Speech-to-Text
Voice Agent

Try stating information like names, dates, and addresses, along with technical data like codes, commands, formulas, and special formatting to see how our model performs...

Universal-3 Pro Streaming
Clinical evaluation history:
"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes. Glicoside."

With context-aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"
Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

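Once tags like [beep] appear in a transcript, downstream code usually wants them separated from the spoken words. The following is a minimal sketch of that step, assuming tags are lowercase words in square brackets as in the sample output above. The function name `split_events` and the tag pattern are illustrative assumptions, not part of any AssemblyAI SDK.

```python
import re

# Matches bracketed, lowercase event tags such as "[beep]".
# Assumption: tags look like the sample output above; other tag
# names depend on what the prompt asks the model to include.
EVENT_TAG = re.compile(r"\[([a-z][a-z _-]*)\]")


def split_events(transcript: str) -> tuple[str, list[str]]:
    """Return the transcript with event tags removed, plus the tags found."""
    events = EVENT_TAG.findall(transcript)
    text = EVENT_TAG.sub("", transcript)
    # Collapse any whitespace left behind by the removed tags.
    text = " ".join(text.split())
    return text, events


sample = (
    "When you have finished recording, you may hang up "
    "or press 1 for more options. [beep]"
)
clean_text, events = split_events(sample)
print(events)      # the event tags, in order of appearance
print(clean_text)  # the spoken words only
```

A voicemail detector, for example, could treat a trailing beep event as the signal to start playing a message.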
Speech with disfluencies:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Proper noun spelling:
"keyterms_prompt": ["Kelly Byrne-Donoghue"]
Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Capturing speaker roles:
"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."
With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?
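If the raw transcript follows the [Speaker:ROLE] format requested in the prompt, it can be split into structured turns with a short parser. This is a sketch under that assumption. The helper name `parse_turns` is hypothetical, and the sample string simply re-tags the turns shown above in the prompt's format.

```python
import re

# One tag per turn, e.g. "[Speaker:NURSE]". The capturing group keeps
# the role name in the output of re.split.
TURN_TAG = re.compile(r"\[Speaker:([^\]]+)\]\s*")


def parse_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a role-tagged transcript into (role, text) pairs."""
    parts = TURN_TAG.split(transcript)
    # parts[0] is any text before the first tag; after that the list
    # alternates role, text, role, text, ...
    roles = parts[1::2]
    texts = parts[2::2]
    return [(role.strip().upper(), text.strip()) for role, text in zip(roles, texts)]


raw = (
    "[Speaker:NURSE] 5Mg. And do you take it regularly? "
    "[Speaker:PATIENT] Oh yeah, yeah. "
    "[Speaker:NURSE] Good."
)
for role, text in parse_turns(raw):
    print(f"{role}: {text}")
```

Normalizing roles to upper case is a design choice here so that "Nurse" and "NURSE" group together in later analysis.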

Spanish and English audio:
"language_detection": True
"prompt": "Preserve natural code-switching between English and Spanish. Retain spoken language as-is (correct 'I was hablando con mi manager')."
Without codeswitching

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With codeswitching

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.
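The demos above set three request fields: `prompt`, `keyterms_prompt`, and `language_detection`. The sketch below shows one way to assemble them into a JSON request body. The field names come from the demos; the `audio_url` field, the `build_request` helper, and the example URL are assumptions made for illustration, and this page does not say whether the fields can be combined in a single request, so check the developer docs before relying on that.

```python
import json
from typing import Optional


def build_request(
    audio_url: str,
    prompt: Optional[str] = None,
    keyterms_prompt: Optional[list[str]] = None,
    language_detection: bool = False,
) -> dict:
    """Assemble a transcription request body, including only the fields that are set."""
    body: dict = {"audio_url": audio_url}  # assumed field name for the audio source
    if prompt:
        body["prompt"] = prompt
    if keyterms_prompt:
        body["keyterms_prompt"] = list(keyterms_prompt)
    if language_detection:
        body["language_detection"] = True
    return body


# Proper noun spelling demo: pass the exact spelling as a key term.
keyterms_body = build_request(
    "https://example.com/call.mp3",  # placeholder URL
    keyterms_prompt=["Kelly Byrne-Donoghue"],
)

# Code-switching demo: language detection plus a prompt.
codeswitch_body = build_request(
    "https://example.com/interview.mp3",  # placeholder URL
    prompt="Preserve natural code-switching between English and Spanish.",
    language_detection=True,
)

# json.dumps writes Python's True as JSON true.
print(json.dumps(codeswitch_body, indent=2))
```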

Everything you need to build voice apps that outpace the competition

The accuracy and capabilities required to build products that stand out, and the flexibility to scale to millions of users without blinking an eye.

Industry-leading accuracy

Avoid garbage in, garbage out

Your product experience is only as good as the inputs it’s built on. AssemblyAI’s models lead the industry in accuracy and reliability.

  • Industry’s lowest Word Error Rate (WER)
  • Up to 30% fewer hallucinations than other providers
  • Preferred by 73% of end users in unbiased evaluations
Explore our latest model
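For readers who want to measure accuracy on their own audio, Word Error Rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a minimal version of that calculation. It only lowercases and splits on whitespace, whereas real benchmarks typically also normalize punctuation and number formats, and it is not the evaluation pipeline behind the figures above. The example treats the key term spelling from the proper noun demo as the reference, which is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping one row at a time.
    # prev[j] is the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "Hi, this is Kelly Byrne-Donoghue"
hypothesis = "Hi, this is Kelly Byrne Donahue"
print(word_error_rate(reference, hypothesis))  # 2 errors over 5 reference words
```

Because insertions count as errors, WER can exceed 1.0 when the transcript is much longer than the reference.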
CAPABILITIES

Go beyond transcription

Access a full suite of speech understanding capabilities to uncover insights, identify speakers, and build powerful product experiences.

  • Correctly identify speakers with advanced diarization capabilities
  • Automatically format text and alphanumerics for clearer outputs
  • Accurately capture multilingual speech with automatic language detection
Check out our products
Build-ready

Easy to start, even easier to scale

We built AssemblyAI to be the easiest platform on the market for developers to build, ship, and scale on.

  • Serving 600M+ inference calls and over 840M API calls per month
  • Over 40 terabytes of audio processed daily
  • Pay only for what you use and scale to millions of hours without contracts or throttles
Go to developer docs

We’re not playing around—but you can

Put our AI models to the test in our no-code playground.

The most loved AI apps are built on AssemblyAI

Learn why today’s most innovative companies choose us.

3x increase

in closed enterprise deals after launching Conversation Intelligence with AssemblyAI

15% higher

customer win rates after implementing AssemblyAI

2X

free-to-paid conversion rate after implementing AssemblyAI

23% improvement

in call transcription accuracy and 2X increase in customer conversion rate

90% reduction

in customer complaints and support tickets

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.