Speech Understanding API

Turn raw audio into structured, actionable intelligence. Purpose-built models extract meaning from speech in a single API call.

Models

Nine models, one API.

Enable any combination of features with a single request. Mix, match, and scale as your application grows.
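As a rough illustration of the single-request idea, the sketch below assembles one JSON body that switches on several features at once. The flag names (`speaker_labels`, `sentiment_analysis`, and so on) and the `build_request_body` helper are assumptions for illustration; the API reference is the place to confirm exact parameter names and any combinations that are not allowed.

```python
import json

# Assumed flag names, for illustration only; not confirmed by this page.
KNOWN_FLAGS = {
    "speaker_labels",
    "sentiment_analysis",
    "entity_detection",
    "auto_chapters",
    "iab_categories",
}


def build_request_body(audio_url, features):
    """Return one JSON-serializable body that enables every requested feature."""
    unknown = set(features) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"Unknown feature flags: {sorted(unknown)}")
    body = {"audio_url": audio_url}
    for flag in sorted(features):
        body[flag] = True
    return body


body = build_request_body(
    "https://example.com/call.mp3",
    ["sentiment_analysis", "speaker_labels"],
)
print(json.dumps(body, indent=2))
```

The resulting body would then be sent in a single authenticated POST; adding or removing a feature only changes which flags are set, not the shape of the call.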

Speaker identification

Label speakers by name using audio context. Supports custom profiles and multi-speaker conversations.

Voice Agents, AI Notetaker, Call Analytics, Agent Assist, AI Scribe, Conversation Intelligence

Sentiment analysis

Detect positive, neutral, or negative sentiment at the sentence level for granular conversation analytics.
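To show how sentence-level labels could feed conversation analytics, here is a minimal sketch that tallies sentiment per speaker. The result shape (a list of dicts with `speaker`, `text`, and `sentiment` keys) and the sample data are assumptions, not the documented response schema.

```python
from collections import Counter, defaultdict

# Hypothetical sentence-level results; field names are assumed for illustration.
sample_results = [
    {"speaker": "A", "text": "Thanks for calling.", "sentiment": "POSITIVE"},
    {"speaker": "B", "text": "My order never arrived.", "sentiment": "NEGATIVE"},
    {"speaker": "B", "text": "I ordered it last week.", "sentiment": "NEUTRAL"},
    {"speaker": "A", "text": "I'm sorry to hear that.", "sentiment": "NEGATIVE"},
]


def sentiment_by_speaker(results):
    """Count how many sentences of each sentiment every speaker produced."""
    counts = defaultdict(Counter)
    for item in results:
        counts[item["speaker"]][item["sentiment"]] += 1
    return {speaker: dict(counter) for speaker, counter in counts.items()}


summary = sentiment_by_speaker(sample_results)
print(summary)
```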

Voice Agents, AI Notetaker, Call Analytics, Agent Assist, Conversation Intelligence

Key phrases

Automatically extract the most significant concepts and phrases from every transcript.

Media Monitoring, AI Notetaker, Call Analytics, Agent Assist, Conversation Intelligence

Summarization

Condense every conversation into a concise summary, such as a one-paragraph brief.

AI Scribe, AI Notetaker, Call Analytics, Conversation Intelligence

Custom formatting

Normalize dates, phone numbers, and email addresses to machine-readable formats automatically.

Agent Assist, AI Notetaker, Call Analytics, Voice Agents, Conversation Intelligence

Translation

Translate transcripts across 99+ languages in the same API request.

Voice Agents, AI Notetaker, Agent Assist, AI Scribe, Conversation Intelligence

Auto chapters

Break long recordings into timestamped sections with auto-generated summaries for each chapter.
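A chapter list is easiest to consume as a timestamped outline. The sketch below assumes each chapter arrives as a dict with `start` and `end` offsets in milliseconds plus a `headline`; those field names and the sample values are hypothetical.

```python
def format_timestamp(ms):
    """Convert a millisecond offset to H:MM:SS."""
    total_seconds = ms // 1000
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Hypothetical chapter data; field names are assumed for illustration.
sample_chapters = [
    {"start": 0, "end": 95_000, "headline": "Introductions"},
    {"start": 95_000, "end": 3_725_000, "headline": "Pricing discussion"},
]


def chapter_outline(chapters):
    """Render one 'start-end  headline' line per chapter."""
    return [
        f"{format_timestamp(c['start'])}-{format_timestamp(c['end'])}  {c['headline']}"
        for c in chapters
    ]


outline = chapter_outline(sample_chapters)
print("\n".join(outline))
```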

Content Creation, AI Notetaker, Conversation Intelligence

Entity detection

Identify names, organizations, locations, dates, and other entities across every transcript.

Voice Agents, AI Notetaker, Call Analytics, Agent Assist, AI Scribe, Conversation Intelligence

Topic detection

Classify audio content against the IAB standard taxonomy — ideal for content moderation, routing, and analytics.

Media Monitoring, AI Notetaker, Call Analytics, Agent Assist, AI Scribe, Conversation Intelligence

The second pillar of your Voice AI pipeline

Results 2×

Conversion improvement for teams using structured conversation intelligence vs. raw transcripts alone.

Efficiency 90%

Reduction in manual QA review time after automating complaint and sentiment detection workflows.

Speed <60s

From audio file to fully structured intelligence — speaker labels, sentiment, entities, summary — in one request.

Intelligence 9+

Models for transcription, summarization, sentiment, entities, and more — all in one API.

Speaker labels

Know who said what

Label voices by name or role
Attribute summaries to real people
Flag compliance issues by speaker
Cut hours of manual call review

Sentiment + topics + entities

Surface meaning automatically

Detect sentiment at the sentence level
Extract named entities automatically
Pull key phrases that matter most
Classify content by IAB topic

Summaries + chapters + formatting

Structure every conversation

Turn audio into structured data
Break recordings into timestamped chapters
Condense calls into one-paragraph briefs
Format dates, numbers, and contacts

Translation + languages

Go global without switching vendors

Translate across 99+ languages with one API
Run sentiment and entities on multilingual audio
Skip separate translation vendors
Run analysis on multilingual audio in one pass

Playground

We’re not playing around, but you can

Put our Voice AI models to the test in our no-code playground.

Try it out

AI Speech-to-Text transcription in 99 languages

From Spanish to Korean, deliver accurate Voice AI in the languages your users speak.

🇪🇸 Spanish 🇵🇹 Portuguese 🇫🇷 French 🇩🇪 German 🇮🇳 Hindi 🇷🇺 Russian 🇳🇱 Dutch 🇯🇵 Japanese 🇮🇹 Italian 🇵🇱 Polish 🇺🇦 Ukrainian 🇮🇩 Indonesian 🇹🇷 Turkish 🇨🇳 Chinese 🇰🇷 Korean

Common questions

What is the Speech Understanding API?
AssemblyAI's Speech Understanding API is a single endpoint that turns audio into structured intelligence (speaker labels, sentiment, entities, topics, summaries, key phrases, translation, chapters, and custom formatting) in one request.

Which models are included?
Nine purpose-built models: Speaker Identification, Sentiment Analysis, Key Phrases, Summarization, Custom Formatting, Translation, Auto Chapters, Entity Detection, and Topic Detection (IAB taxonomy).

Can I combine multiple features in one request?
Yes. Enable any combination of Speech Understanding features in a single API call, with no extra plumbing. Mix, match, and scale as your application grows.

How is it priced?
Each model is priced per hour of audio processed. You only pay for the features you enable. See the pricing page for per-model rates and volume discounts.

Does it work with real-time audio?
Speech Understanding features are designed for pre-recorded audio. For real-time transcription, use AssemblyAI's Realtime Speech-to-Text API. You can pipe streaming transcripts into Speech Understanding post-call.

How accurate are the models?
AssemblyAI's Speech Understanding models are built on top of our industry-leading Universal speech-to-text foundation and benchmarked regularly against real-world audio.
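
Because each model is priced per hour of audio and only enabled features are billed, a cost estimate reduces to a sum of rate times hours. The sketch below shows that arithmetic only; the rates are placeholders invented for the example and are not real prices, so substitute the figures from the pricing page.

```python
from decimal import Decimal

# Placeholder rates in USD per audio hour. These are NOT real prices.
HYPOTHETICAL_RATES = {
    "sentiment_analysis": Decimal("0.02"),
    "entity_detection": Decimal("0.08"),
    "summarization": Decimal("0.03"),
}


def estimate_cost(audio_hours, enabled_features, rates=HYPOTHETICAL_RATES):
    """Sum the per-hour rate of each enabled feature over the audio duration."""
    hours = Decimal(str(audio_hours))
    return sum((rates[name] * hours for name in enabled_features), Decimal("0"))


cost = estimate_cost(10, ["sentiment_analysis", "summarization"])
print(f"Estimated feature cost: ${cost}")
```

`Decimal` is used instead of floats so that small per-hour rates add up without rounding drift.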