customers
All customer stories
Top Voice AI companies are building with Assembly.
resources
Latest Release
Voice Agent API
Voice agents that get it right, respond instantly, and ship the same day with our new Voice Agent API
resources
Build AI conversation partners that listen, understand, and respond in the target language — with real-time pronunciation feedback, word-level confidence scoring, and adaptive difficulty across 99 languages.
Conversation practice
B1 IntermediateAI tutor (Spanish)
"Cuéntame sobre tu fin de semana. ¿Qué hiciste?"
You
"El fin de semana pasado, yo fui al mercado y compré frutas."
Pronunciation feedback
Try "frutas" again — stress the first syllable: FRU-tas
AI tutor
"¡Muy bien! ¿Qué tipo de frutas compraste?"
Flashcard drills and grammar quizzes scale; live conversation practice doesn't. Hiring native-speaker tutors is expensive, scheduling is brutal, and consumer speech recognition can't separate a true mistake from an accented attempt. Learners plateau at B1 because they have nowhere to actually speak. AssemblyAI gives every app a real-time conversation partner — native-accent voice, per-word pronunciation scoring, and 99-language coverage.
Total languages supported with automatic language detection.
Core languages with native code-switching on Universal-3 Pro Streaming.
Median streaming latency — fast enough for real-time conversation practice.
Per-word confidence scores (0-1) for instant pronunciation feedback.
Two ways to build
Ship a working AI conversation tutor in an afternoon, or drop pronunciation-grade streaming STT into the learner app you already run.
Our proprietary voice stack via one WebSocket. Run an AI conversation tutor with a native-accent voice that adapts to the learner's level — zero infra to manage.
Best for
Free tier available · No credit card required
The pronunciation-grade transcription layer for your learner app. Per-word confidence scores, code-switching across 6 core languages, and automatic routing for 99 total.
Best for
No concurrency caps · Autoscaling included
Capture learner audio in the target language
Stream learner audio from a browser, mobile app, or VoIP into Universal-3 Pro Streaming. Native code-switching across 6 core languages, automatic model routing for 99 total.
Score pronunciation word-by-word
Each finalized turn returns a words array with per-word confidence scores (0-1) — surface low-confidence words to the learner and prompt a re-attempt before moving on.
Generate adaptive feedback and corrections
Pipe finalized turns into the LLM Gateway (25+ models) for grammar corrections, vocab suggestions, and adaptive difficulty — all in the learner's target language.
Respond with a native-accent voice
Voice Agent API ships with native-accent voices for Spanish (lucia, mateo, diego), French (pierre), German (lukas, lena), Italian (giulia, luca), Mandarin (mei, ethan), Japanese (ren, hana), Korean (mina, joon), Hindi (arjun), and Russian (dmitri).
Language learning pipeline
Capture learner audio in target language
Score pronunciation word-by-word
Generate adaptive feedback + corrections
Respond with native-accent voice
Voice Agent API — Spanish conversation tutor with native-accent voice
# Voice Agent API: Spanish conversation tutor with native-accent voice
import asyncio, json, websockets
API_KEY = "YOUR_API_KEY"
async def run_tutor():
async with websockets.connect(
"wss://agents.assemblyai.com/v1/ws",
additional_headers={"Authorization": f"Bearer {API_KEY}"},
) as ws:
await ws.send(json.dumps({
"type": "session.update",
"session": {
"system_prompt": (
"You are a friendly B1-intermediate Spanish tutor. "
"Speak Spanish, keep replies under 2 sentences, and ask "
"follow-up questions. When the learner mispronounces a "
"word, call pronunciation_feedback with the word and a "
"short tip — do not correct out loud."
),
"greeting": "¡Hola! Cuéntame sobre tu fin de semana. ¿Qué hiciste?",
"input": {"keyterms": ["mercado", "frutas", "fin de semana", "compré"]},
"output": {"voice": "lucia"}, # Spanish native-accent voice
"tools": [{
"type": "function",
"name": "pronunciation_feedback",
"description": "Send a pronunciation tip to the learner UI.",
"parameters": {
"type": "object",
"properties": {
"word": {"type": "string"},
"tip": {"type": "string"},
},
"required": ["word", "tip"],
},
}],
},
}))
async for msg in ws:
handle(json.loads(msg)) # transcript.user, reply.audio, tool.call, ...
Universal-3 Pro Streaming — word-level pronunciation scoring
# Universal-3 Pro Streaming: word-level pronunciation scoring
import asyncio, json, websockets
from urllib.parse import urlencode
API_KEY = "YOUR_API_KEY"
params = urlencode({
"sample_rate": 16000,
"speech_model": "u3-rt-pro",
"language_detection": "true", # tag each turn with detected language
"keyterms_prompt": json.dumps([
"mercado", "frutas", "fin de semana",
"ayer", "compré", "ir al",
]),
"format_turns": "true",
})
CONFIDENCE_THRESHOLD = 0.70 # tune per learner level
async def score_pronunciation(audio_iter, send_to_learner_ui):
url = f"wss://streaming.assemblyai.com/v3/ws?{params}"
async with websockets.connect(
url, additional_headers={"Authorization": API_KEY},
) as ws:
async def send_audio():
async for chunk in audio_iter:
await ws.send(chunk)
asyncio.create_task(send_audio())
async for raw in ws:
evt = json.loads(raw)
if evt.get("type") == "Turn" and evt.get("end_of_turn"):
words = evt.get("words", [])
low_conf = [
w for w in words
if w.get("confidence", 1.0) < CONFIDENCE_THRESHOLD
]
send_to_learner_ui({
"transcript": evt["transcript"],
"needs_practice": [{"word": w["text"], "score": w["confidence"]}
for w in low_conf],
})
Universal-3 Pro Streaming returns a confidence score (0-1) for every word in every turn. Flag words below your threshold (e.g. <0.7) and prompt the learner to re-attempt — exactly the loop pronunciation apps are built around.
Universal-3 Pro Streaming natively handles code-switching across 6 core languages (English, Spanish, French, German, Italian, Portuguese). Automatic model routing extends coverage to 99 total languages — one API call covers every market.
Voice Agent API gives you native-accent voices (lucia for Spanish, pierre for French, giulia for Italian, arjun for Hindi, and more) that code-switch naturally between their primary language and English — plus 18 American/British voices that also speak all 11 supported output languages with their English accent carrying over.
Ollang provides AI-enabled captioning, subtitling, and dubbing in 100+ languages for streaming platforms, broadcasters, and e-learning — cutting human-in-the-loop effort by 76% and lifting platform accuracy by 40% on AssemblyAI.
Ollang
Calabrio replaced its legacy on-premise transcription solution with AssemblyAI's API, gaining extensive language support to power its enterprise workforce and conversation intelligence platform across global markets.
Calabrio