The best way to build Voice AI apps

Today’s top Voice AI companies rely on AssemblyAI’s speech-to-text and speech understanding models to launch groundbreaking products fast and scale with ease.

Streaming Speech-to-Text

Speech-to-Text

Voice Agent

Try stating information like names, dates, and addresses, along with technical data like codes, commands, formulas, and special formatting to see how our model performs...

Universal-3 Pro Streaming

Source

Clinical evaluation history:

"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes. Glicoside."

With context-aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Source

Non-speech audio event:

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"

Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Source

Speech with disfluencies:

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Source

Proper noun spelling:

"keyterms_prompt": ["Kelly Byrne-Donoghue"]

Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Source

Capturing speaker roles:

"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."

With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?
Speaker B: Oh yeah, yeah.
Speaker A: Good.
Speaker B: Every evening.
Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?
Speaker [Patient]: Oh yeah, yeah.
Speaker [Nurse]: Good.
Speaker [Patient]: Every evening.
Speaker [Nurse]: And no side effects with it?
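Once a transcript carries inline role tags in the [Speaker:ROLE] format requested by the prompt above, it can be split into structured turns with a few lines of code. The following is a minimal sketch, not part of any AssemblyAI SDK: the function name `split_turns` is hypothetical, and it assumes the model returns tags exactly in the example format from the prompt.

```python
import re

# Matches tags such as [Speaker:NURSE]; the format comes from the prompt's
# own example, and real output may differ (assumption).
TURN_TAG = re.compile(r"\[Speaker:([^\]]+)\]")


def split_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a role-tagged transcript into (role, text) pairs."""
    # With one capture group, re.split returns:
    # [text_before_first_tag, role1, text1, role2, text2, ...]
    parts = TURN_TAG.split(transcript)
    roles = parts[1::2]
    texts = parts[2::2]
    return [(role.strip(), text.strip()) for role, text in zip(roles, texts)]


example = (
    "[Speaker:NURSE] Hello there. How can I help you today? "
    "[Speaker:PATIENT] I'm feeling unwell. I have a headache."
)
turns = split_turns(example)
for role, text in turns:
    print(f"{role}: {text}")
```

Any text that appears before the first tag is ignored in this sketch; a production parser would likely want to keep it or flag it.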

Source

Spanish and English audio:

"language_detection": True
"prompt": "Preserve natural code-switching between English and Spanish. Retain spoken language as-is (correct \"I was hablando con mi manager\")."

Without code-switching

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With code-switching

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.
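The demos above each set one or more of three request fields: "prompt", "keyterms_prompt", and "language_detection". As a rough sketch of how those fields might be assembled into a single request body, the helper below builds a plain dictionary and serializes it to JSON. The function name `build_transcript_request` and the `audio_url` field are assumptions made for illustration; the example URL is a placeholder, no network call is made, and the developer docs remain the authority on the exact request schema.

```python
import json
from typing import Optional, Sequence


def build_transcript_request(
    audio_url: str,
    prompt: Optional[str] = None,
    keyterms_prompt: Optional[Sequence[str]] = None,
    language_detection: bool = False,
) -> dict:
    """Assemble a request body from the fields shown in the demos.

    Hypothetical helper: only `prompt`, `keyterms_prompt`, and
    `language_detection` appear on this page; `audio_url` is an assumption.
    """
    payload: dict = {"audio_url": audio_url}
    # Only include optional fields the caller actually set.
    if prompt:
        payload["prompt"] = prompt
    if keyterms_prompt:
        payload["keyterms_prompt"] = list(keyterms_prompt)
    if language_detection:
        payload["language_detection"] = True
    return payload


request_body = build_transcript_request(
    audio_url="https://example.com/interview.mp3",  # placeholder URL
    prompt="Preserve natural code-switching between English and Spanish.",
    language_detection=True,
)
# json.dumps turns Python's True into JSON's lowercase true.
print(json.dumps(request_body, indent=2))
```

Leaving unset fields out of the body, rather than sending nulls, keeps the request minimal and lets the service apply its own defaults.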

The industry’s best products need the industry’s best models

We build the most accurate, fully featured models on the market, so you can ship with confidence knowing that you’re building on the best.

Everything you need to build voice apps that outpace the competition

The accuracy and capabilities required to build products that stand out, and the flexibility to scale to millions of users without blinking an eye.

Industry-leading accuracy

Avoid garbage in, garbage out

Your product experience is only as good as the inputs it’s built on. AssemblyAI’s models lead the industry in accuracy and reliability.

Industry’s lowest Word Error Rate (WER)
Up to 30% fewer hallucinations than other providers
Preferred by 73% of end users in unbiased evaluations
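For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model's output, divided by the number of words in the reference. The sketch below illustrates that standard definition only; it is not a description of how the figures above were measured. The function name `word_error_rate` is hypothetical, and the normalization (lowercasing and whitespace splitting) is deliberately simplistic.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

Real evaluations usually normalize punctuation, casing, and number formatting before scoring, so results from this sketch are only comparable when both transcripts are prepared the same way.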

Explore our latest model

CAPABILITIES

Go beyond transcription

Access a full suite of speech understanding capabilities to uncover insights, identify speakers, and build powerful product experiences.

Correctly identify speakers with advanced diarization capabilities
Automatically format text and alphanumerics for clearer outputs
Accurately capture multilingual speech with automatic language detection

Check out our products

Build-ready

Easy to start, even easier to scale

We built AssemblyAI to be the easiest platform on the market for developers to build, ship, and scale on.

Serving 600M+ inference calls and over 840M API calls per month
Over 40 terabytes of audio processed daily
Pay only for what you use and scale to millions of hours without contracts or throttles

Go to developer docs