The best way to build Voice AI apps

Today’s top Voice AI companies rely on AssemblyAI’s speech-to-text and speech understanding models to launch groundbreaking products fast and scale with ease.

Try stating information like names, dates, and addresses, along with technical data like codes, commands, formulas, and special formatting, to see how our model performs...

Universal-3 Pro Streaming

Source

Clinical evaluation history:

"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes. Glicoside."

With context aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Source

Non-speech audio event:

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"

Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"
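
Downstream code may want to separate tagged audio events from the spoken words. The following is a minimal sketch, assuming events appear as square-bracketed tags such as `[beep]`, as in the transcript above. `extract_audio_tags` and `strip_audio_tags` are hypothetical helper names for illustration, not part of any SDK.

```python
import re

# Assumption: audio events are rendered as square-bracketed tags, e.g. "[beep]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_audio_tags(transcript: str) -> list[str]:
    """Return the tag names found in the transcript, in order of appearance."""
    return [match.group(1).strip() for match in TAG_PATTERN.finditer(transcript)]


def strip_audio_tags(transcript: str) -> str:
    """Remove the tags and collapse the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub(" ", transcript)
    return " ".join(without_tags.split())


transcript = "you may hang up or press 1 for more options. [beep]"
print(extract_audio_tags(transcript))
print(strip_audio_tags(transcript))
```

Keeping extraction and stripping separate lets one pipeline branch act on the event (for example, detecting voicemail) while another works on the clean text.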

Source

Speech with disfluencies:

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?
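
To see exactly what a disfluency prompt adds, the two transcripts can be aligned word by word. The sketch below uses Python's standard `difflib` for the alignment. The helper names `tokenize` and `added_tokens` are illustrative assumptions, and the tokenizer is deliberately crude (lowercase letters and apostrophes only).

```python
import difflib
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def added_tokens(baseline: str, verbatim: str) -> list[str]:
    """Return tokens the alignment marks as inserted or substituted in `verbatim`."""
    a, b = tokenize(baseline), tokenize(verbatim)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    added: list[str] = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" means new tokens; "replace" means the wording changed.
        if op in ("insert", "replace"):
            added.extend(b[j1:j2])
    return added


baseline = (
    "Do you and Quentin still socialize when you come to Los Angeles, "
    "or is it like he's so used to having you here? "
    "No, no, no, we're friends. What do you do with him?"
)
verbatim = (
    "Do you and Quentin still socialize, uh, when you come to Los Angeles, "
    "or is it like he's so used to having you here? "
    "No, no, no, we, we, we're friends. What do you do with him?"
)
print(added_tokens(baseline, verbatim))
```

On this pair the additions are the filler "uh" and the repeated "we", which is the kind of detail conversational analysis depends on.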

Source

Proper noun spelling:

"keyterms_prompt": ["Kelly Byrne-Donoghue"]

Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"
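
A quick way to check whether key terms came through with the intended spelling is a case-insensitive substring search over the transcript. This is only a sketch: `missing_keyterms` is a hypothetical helper, and exact substring matching is an assumption that will not account for punctuation or spacing variants.

```python
def missing_keyterms(transcript: str, keyterms: list[str]) -> list[str]:
    """Return the key terms that do not appear verbatim (ignoring case)."""
    haystack = transcript.casefold()
    return [term for term in keyterms if term.casefold() not in haystack]


keyterms = ["Kelly Byrne-Donoghue"]
transcript = "Hi, this is Kelly Byrne Donahue"

# A non-empty result flags terms whose spelling should be reviewed.
print(missing_keyterms(transcript, keyterms))
```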

Source

Capturing speaker roles:

"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."

With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?
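
If the raw transcript follows the `[Speaker:role]` format requested in the prompt, it can be split into structured turns with a regular expression. The sketch below assumes that exact tag format. `Turn` and `parse_turns` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class Turn(NamedTuple):
    role: str
    text: str


# Assumption: each turn starts with a tag like "[Speaker:NURSE]".
# The lazy match stops at the next speaker tag or at the end of the input,
# so other bracketed tags (e.g. "[beep]") stay inside the turn text.
TURN_PATTERN = re.compile(
    r"\[Speaker:([^\]]+)\]\s*(.*?)(?=\[Speaker:|\Z)", re.DOTALL
)


def parse_turns(transcript: str) -> list[Turn]:
    """Split a role-tagged transcript into (role, text) turns."""
    return [
        Turn(role.strip(), text.strip())
        for role, text in TURN_PATTERN.findall(transcript)
    ]


sample = (
    "[Speaker:NURSE] Hello there. How can I help you today? "
    "[Speaker:PATIENT] I'm feeling unwell. I have a headache."
)
for turn in parse_turns(sample):
    print(f"{turn.role}: {turn.text}")
```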

Source

Spanish and English audio:

"language_detection": True
"prompt": "Preserve natural code-switching between English and Spanish. Retain spoken language as-is (correct 'I was hablando con mi manager')."

Without code-switching

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With code-switching

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.
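
The option fragments shown in these examples (`prompt`, `keyterms_prompt`, `language_detection`) can be collected into a single options object before a request is made. The sketch below only assembles and serializes such an object. The field names come from the examples above, but `build_transcription_options` is a hypothetical helper, and how the options are actually submitted (endpoint, SDK call, authentication) is not covered here and should be taken from the official documentation.

```python
import json
from typing import Optional


def build_transcription_options(
    prompt: Optional[str] = None,
    keyterms_prompt: Optional[list[str]] = None,
    language_detection: Optional[bool] = None,
) -> dict:
    """Collect only the options that were explicitly set."""
    options: dict = {}
    if prompt is not None:
        options["prompt"] = prompt
    if keyterms_prompt:
        options["keyterms_prompt"] = list(keyterms_prompt)
    if language_detection is not None:
        options["language_detection"] = language_detection
    return options


options = build_transcription_options(
    prompt="Preserve natural code-switching between English and Spanish.",
    language_detection=True,
)
# Python's True is serialized as JSON true.
print(json.dumps(options, indent=2, ensure_ascii=False))
```

Leaving unset options out of the object, rather than sending explicit nulls, keeps the request aligned with whatever defaults the service applies.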