Solutions

Voice agents for language learning and conversation practice

Build AI conversation partners that listen, understand, and respond in the target language — with real-time pronunciation feedback, word-level confidence scoring, and adaptive difficulty across 99 languages.

Conversation practice

B1 Intermediate

AI tutor (Spanish)

"Cuéntame sobre tu fin de semana. ¿Qué hiciste?"

You

"El fin de semana pasado, yo fui al mercado y compré frutas."

Pronunciation feedback

fui 0.97 mercado 0.94 compré 0.72 frutas 0.58

Try "frutas" again — stress the first syllable: FRU-tas

AI tutor

"¡Muy bien! ¿Qué tipo de frutas compraste?"

Runway
Dovetail
Granola
Supernormal
Ashby
Jiminny
Calabrio
JotPsych
EdgeTier
Genio
WhatConverts
Earmark
Grain
Loop
CallRail
Happy Scribe
Veed.io
Delphi
Runway
Dovetail
Granola
Supernormal
Ashby
Jiminny
Calabrio
JotPsych
EdgeTier
Genio
WhatConverts
Earmark
Grain
Loop
CallRail
Happy Scribe
Veed.io
Delphi
Runway
Dovetail
Granola
Supernormal
Ashby
Jiminny
Calabrio
JotPsych
EdgeTier
Genio
WhatConverts
Earmark
Grain
Loop
CallRail
Happy Scribe
Veed.io
Delphi
Runway
Dovetail
Granola
Supernormal
Ashby
Jiminny
Calabrio
JotPsych
EdgeTier
Genio
WhatConverts
Earmark
Grain
Loop
CallRail
Happy Scribe
Veed.io
Delphi
The problem

Language apps stall at the conversation barrier

Flashcard drills and grammar quizzes scale; live conversation practice doesn't. Hiring native-speaker tutors is expensive, scheduling is brutal, and consumer speech recognition can't separate a true mistake from an accented attempt. Learners plateau at B1 because they have nowhere to actually speak. AssemblyAI gives every app a real-time conversation partner — native-accent voice, per-word pronunciation scoring, and 99-language coverage.

Built for conversational fluency at every level

Languages 99

Total languages supported with automatic language detection.

Code-switching 6

Core languages with native code-switching on Universal-3 Pro Streaming.

Latency ~150ms

Median streaming latency — fast enough for real-time conversation practice.

Confidence Word-level

Per-word confidence scores (0-1) for instant pronunciation feedback.

Two ways to build

Pick the API that fits your language stack

Ship a working AI conversation tutor in an afternoon, or drop pronunciation-grade streaming STT into the learner app you already run.

Recommended

Voice Agent API

Our proprietary voice stack via one WebSocket. Run an AI conversation tutor with a native-accent voice that adapts to the learner's level — zero infra to manage.

Best for

  • AI conversation tutors with native-accent voices
  • Adaptive difficulty driven by the LLM Gateway (25+ models)
  • Tool calls for pronunciation feedback, lesson tracking, grading
  • Claude Code compatible — paste the docs and build anything
$4.50/hr — speech, LLM, and voice all included
Get started for free

Free tier available · No credit card required

Bring Your Own Stack

Universal-3 Pro Streaming STT API

The pronunciation-grade transcription layer for your learner app. Per-word confidence scores, code-switching across 6 core languages, and automatic routing for 99 total.

Best for

  • Per-word confidence scores (0-1) for pronunciation feedback
  • ~150ms P50 latency for instant feedback loops
  • Native code-switching across 6 core languages
  • 99-language coverage via automatic model routing
  • Keyterm prompting for lesson vocabulary
$0.45/hr — transcription only, unlimited streams
View integration docs

No concurrency caps · Autoscaling included

One pipeline turns every conversation into a lesson

Capture learner audio in the target language

Stream learner audio from a browser, mobile app, or VoIP into Universal-3 Pro Streaming. Native code-switching across 6 core languages, automatic model routing for 99 total.

Score pronunciation word-by-word

Each finalized turn returns a words array with per-word confidence scores (0-1) — surface low-confidence words to the learner and prompt a re-attempt before moving on.

Generate adaptive feedback and corrections

Pipe finalized turns into the LLM Gateway (25+ models) for grammar corrections, vocab suggestions, and adaptive difficulty — all in the learner's target language.

Respond with a native-accent voice

Voice Agent API ships with native-accent voices for Spanish (lucia, mateo, diego), French (pierre), German (lukas, lena), Italian (giulia, luca), Mandarin (mei, ethan), Japanese (ren, hana), Korean (mina, joon), Hindi (arjun), and Russian (dmitri).

school

Language learning pipeline

Capture learner audio in target language

Score pronunciation word-by-word

Generate adaptive feedback + corrections

Respond with native-accent voice

Quickstart

Build a language-learning voice agent in minutes

Voice Agent API — Spanish conversation tutor with native-accent voice

# Voice Agent API: Spanish conversation tutor with native-accent voice
import asyncio, json, websockets

API_KEY = "YOUR_API_KEY"

async def run_tutor():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": (
                    "You are a friendly B1-intermediate Spanish tutor. "
                    "Speak Spanish, keep replies under 2 sentences, and ask "
                    "follow-up questions. When the learner mispronounces a "
                    "word, call pronunciation_feedback with the word and a "
                    "short tip — do not correct out loud."
                ),
                "greeting": "¡Hola! Cuéntame sobre tu fin de semana. ¿Qué hiciste?",
                "input": {"keyterms": ["mercado", "frutas", "fin de semana", "compré"]},
                "output": {"voice": "lucia"},  # Spanish native-accent voice
                "tools": [{
                    "type": "function",
                    "name": "pronunciation_feedback",
                    "description": "Send a pronunciation tip to the learner UI.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "word": {"type": "string"},
                            "tip": {"type": "string"},
                        },
                        "required": ["word", "tip"],
                    },
                }],
            },
        }))
        async for msg in ws:
            handle(json.loads(msg))  # transcript.user, reply.audio, tool.call, ...

Universal-3 Pro Streaming — word-level pronunciation scoring

# Universal-3 Pro Streaming: word-level pronunciation scoring
import asyncio, json, websockets
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"

params = urlencode({
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "language_detection": "true",              # tag each turn with detected language
    "keyterms_prompt": json.dumps([
        "mercado", "frutas", "fin de semana",
        "ayer", "compré", "ir al",
    ]),
    "format_turns": "true",
})

CONFIDENCE_THRESHOLD = 0.70                    # tune per learner level

async def score_pronunciation(audio_iter, send_to_learner_ui):
    url = f"wss://streaming.assemblyai.com/v3/ws?{params}"
    async with websockets.connect(
        url, additional_headers={"Authorization": API_KEY},
    ) as ws:
        async def send_audio():
            async for chunk in audio_iter:
                await ws.send(chunk)
        asyncio.create_task(send_audio())
        async for raw in ws:
            evt = json.loads(raw)
            if evt.get("type") == "Turn" and evt.get("end_of_turn"):
                words = evt.get("words", [])
                low_conf = [
                    w for w in words
                    if w.get("confidence", 1.0) < CONFIDENCE_THRESHOLD
                ]
                send_to_learner_ui({
                    "transcript": evt["transcript"],
                    "needs_practice": [{"word": w["text"], "score": w["confidence"]}
                                       for w in low_conf],
                })

Word-level pronunciation confidence

Universal-3 Pro Streaming returns a confidence score (0-1) for every word in every turn. Flag words below your threshold (e.g. <0.7) and prompt the learner to re-attempt — exactly the loop pronunciation apps are built around.

Real code-switching across 99 languages

Universal-3 Pro Streaming natively handles code-switching across 6 core languages (English, Spanish, French, German, Italian, Portuguese). Automatic model routing extends coverage to 99 total languages — one API call covers every market.

Native-accent voices for every lesson

Voice Agent API gives you native-accent voices (lucia for Spanish, pierre for French, giulia for Italian, arjun for Hindi, and more) that code-switch naturally between their primary language and English — plus 18 American/British voices that also speak all 11 supported output languages with their English accent carrying over.

Frequently asked questions