What is the best speech-to-text API for language learning apps?

AssemblyAI's Universal-3.5 Pro Realtime is the leading speech-to-text API for language learning — ~150ms P50 median latency, native code-switching across 18 languages, and 99-language coverage via automatic model routing. The killer feature for pronunciation apps: each word in every Turn event returns a per-word confidence score (0-1) so you can flag mispronounced words and prompt learners to re-attempt. Universal-3.5 Pro Realtime runs at $0.45/hr with unlimited concurrency. For AI conversation tutors that need to talk back, the Voice Agent API ($4.50/hr) bundles STT, LLM, and native-accent TTS through a single WebSocket.

How do I build a pronunciation-feedback feature with speech-to-text?

Stream learner audio into Universal-3.5 Pro Realtime with format_turns=true. On each end_of_turn event, the response includes a `words` array — each word object has `start`, `end`, `text`, and `confidence` (a 0-1 score). Flag any word with confidence below your threshold (commonly 0.6-0.7) and surface it back to the learner with a re-attempt prompt. Add language_detection=true if learners code-switch between languages, and use keyterms_prompt to boost recognition of lesson-specific vocabulary (verb conjugations, themed nouns, idioms).

Does AssemblyAI support real-time conversation in Spanish, French, German, or other languages?

Yes. Universal-3.5 Pro Realtime natively handles real-time conversation in 18 languages with code-switching support — perfect for learners who mix the target language with English mid-sentence. For broader coverage (Korean, Polish, Russian, Ukrainian, and 80+ more), use whisper-rt for full 99-language support with automatic language detection. Pair Universal-3.5 Pro Realtime with the Voice Agent API for fully managed AI conversation tutors with native-accent voices in each language.

Can I build an AI tutor that adapts to the learner's level?

Yes. The Voice Agent API ($4.50/hr) bundles STT, LLM, and TTS into one WebSocket — set the system prompt to define the learner's CEFR level (A1-C2), the lesson focus (verb tenses, restaurant vocab, business conversation), and the tutor persona. Pipe each turn through the LLM Gateway (25+ models including Claude, GPT, and Gemini) for grammar corrections, vocab suggestions, and adaptive difficulty — the tutor can scale complexity up or down based on the learner's responses in real time.

What native-accent voices does AssemblyAI offer for language tutoring?

The Voice Agent API ships with language-specific voices that have a native accent in their primary language: lucia and mateo (Spanish), diego (Latin American / Colombian Spanish), pierre (French), lukas and lena (German), giulia and luca (Italian), mei and ethan (Mandarin), ren and hana (Japanese), mina and joon (Korean), arjun (Hindi/Hinglish), and dmitri (Russian). Every voice also speaks all the other supported output languages and code-switches naturally into English. For an American or British English accent that carries into other languages, choose any of the 17+ voices in the main catalog (ivy, james, sophie, oliver, and more).

How accurate is speech-to-text on learners with strong accents?

Universal-3.5 Pro Realtime is trained on a broad mix of real-world audio across multiple accents and dialects — non-native speakers, heavy regional accents, and code-switching all work. For learners at lower CEFR levels (A1-B1), where pronunciation may be less consistent, the per-word confidence scores are especially useful: rather than treating low-confidence words as transcription failures, surface them as pronunciation coaching opportunities. Combine with keyterm prompting to boost recognition of expected lesson vocabulary so the model isn't penalized for accented but correct attempts.

Solutions

Voice agents for language learning and conversation practice

Build AI conversation partners that listen, understand, and respond in the target language — with real-time pronunciation feedback, word-level confidence scoring, and adaptive difficulty across 99 languages.

Get started free Talk to sales

Conversation practice

B1 Intermediate

AI tutor (Spanish)

"Cuéntame sobre tu fin de semana. ¿Qué hiciste?"

You

"El fin de semana pasado, yo fui al mercado y compré frutas."

Pronunciation feedback

fui 0.97 mercado 0.94 compré 0.72 frutas 0.58

Try "frutas" again — stress the first syllable: FRU-tas

AI tutor

"¡Muy bien! ¿Qué tipo de frutas compraste?"

The problem

Language apps stall at the conversation barrier

Flashcard drills and grammar quizzes scale; live conversation practice doesn't. Hiring native-speaker tutors is expensive, scheduling is brutal, and consumer speech recognition can't separate a true mistake from an accented attempt. Learners plateau at B1 because they have nowhere to actually speak. AssemblyAI gives every app a real-time conversation partner — native-accent voice, per-word pronunciation scoring, and 99-language coverage.

Built for conversational fluency at every level

Languages 99

Total languages supported with automatic language detection.

Code-switching 18

Languages with native code-switching on Universal-3.5 Pro Realtime.

Latency ~150ms

Median streaming latency — fast enough for real-time conversation practice.

Confidence Word-level

Per-word confidence scores (0-1) for instant pronunciation feedback.

Two ways to build

Pick the API that fits your language stack

Ship a working AI conversation tutor in an afternoon, or drop pronunciation-grade streaming STT into the learner app you already run.

Recommended

Voice Agent API

Our proprietary voice stack via one WebSocket. Run an AI conversation tutor with a native-accent voice that adapts to the learner's level — zero infra to manage.

Best for

AI conversation tutors with native-accent voices
Adaptive difficulty driven by the LLM Gateway (25+ models)
Tool calls for pronunciation feedback, lesson tracking, grading
Claude Code compatible — paste the docs and build anything

$4.50/hr — speech, LLM, and voice all included

Get started for free

Free tier available · No credit card required

Bring Your Own Stack

Universal-3.5 Pro Realtime STT API

The pronunciation-grade transcription layer for your learner app. Per-word confidence scores, code-switching across 18 languages, and automatic routing for 99 total.

Best for

Per-word confidence scores (0-1) for pronunciation feedback
~150ms P50 latency for instant feedback loops
Native code-switching across 18 languages
99-language coverage via automatic model routing
Keyterm prompting for lesson vocabulary

$0.45/hr — transcription only, unlimited streams

View integration docs

No concurrency caps · Autoscaling included

One pipeline turns every conversation into a lesson

Capture learner audio in the target language

Stream learner audio from a browser, mobile app, or VoIP into Universal-3.5 Pro Realtime. Native code-switching across 18 languages, automatic model routing for 99 total.

Score pronunciation word-by-word

Each finalized turn returns a words array with per-word confidence scores (0-1) — surface low-confidence words to the learner and prompt a re-attempt before moving on.

Generate adaptive feedback and corrections

Pipe finalized turns into the LLM Gateway (25+ models) for grammar corrections, vocab suggestions, and adaptive difficulty — all in the learner's target language.

Respond with a native-accent voice

Voice Agent API ships with native-accent voices for Spanish (lucia, mateo, diego), French (pierre), German (lukas, lena), Italian (giulia, luca), Mandarin (mei, ethan), Japanese (ren, hana), Korean (mina, joon), Hindi (arjun), and Russian (dmitri).

school

Language learning pipeline

Capture learner audio in target language

↓

Score pronunciation word-by-word

↓

Generate adaptive feedback + corrections

↓

Respond with native-accent voice

Quickstart

Build a language-learning voice agent in minutes

Voice Agent API — Spanish conversation tutor with native-accent voice

# Voice Agent API: Spanish conversation tutor with native-accent voice
import asyncio, json, websockets

API_KEY = "YOUR_API_KEY"

async def run_tutor():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": (
                    "You are a friendly B1-intermediate Spanish tutor. "
                    "Speak Spanish, keep replies under 2 sentences, and ask "
                    "follow-up questions. When the learner mispronounces a "
                    "word, call pronunciation_feedback with the word and a "
                    "short tip — do not correct out loud."
                ),
                "greeting": "¡Hola! Cuéntame sobre tu fin de semana. ¿Qué hiciste?",
                "input": {"keyterms": ["mercado", "frutas", "fin de semana", "compré"]},
                "output": {"voice": "lucia"},  # Spanish native-accent voice
                "tools": [{
                    "type": "function",
                    "name": "pronunciation_feedback",
                    "description": "Send a pronunciation tip to the learner UI.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "word": {"type": "string"},
                            "tip": {"type": "string"},
                        },
                        "required": ["word", "tip"],
                    },
                }],
            },
        }))
        async for msg in ws:
            handle(json.loads(msg))  # transcript.user, reply.audio, tool.call, ...

Universal-3.5 Pro Realtime — word-level pronunciation scoring

# Universal-3.5 Pro Realtime: word-level pronunciation scoring
import asyncio, json, websockets
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"

params = urlencode({
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "language_detection": "true",              # tag each turn with detected language
    "keyterms_prompt": json.dumps([
        "mercado", "frutas", "fin de semana",
        "ayer", "compré", "ir al",
    ]),
    "format_turns": "true",
})

CONFIDENCE_THRESHOLD = 0.70                    # tune per learner level

async def score_pronunciation(audio_iter, send_to_learner_ui):
    url = f"wss://streaming.assemblyai.com/v3/ws?{params}"
    async with websockets.connect(
        url, additional_headers={"Authorization": API_KEY},
    ) as ws:
        async def send_audio():
            async for chunk in audio_iter:
                await ws.send(chunk)
        asyncio.create_task(send_audio())
        async for raw in ws:
            evt = json.loads(raw)
            if evt.get("type") == "Turn" and evt.get("end_of_turn"):
                words = evt.get("words", [])
                low_conf = [
                    w for w in words
                    if w.get("confidence", 1.0) < CONFIDENCE_THRESHOLD
                ]
                send_to_learner_ui({
                    "transcript": evt["transcript"],
                    "needs_practice": [{"word": w["text"], "score": w["confidence"]}
                                       for w in low_conf],
                })

Try in Playground View full docs

Word-level pronunciation confidence

Universal-3.5 Pro Realtime returns a confidence score (0-1) for every word in every turn. Flag words below your threshold (e.g. <0.7) and prompt the learner to re-attempt — exactly the loop pronunciation apps are built around.

Real code-switching across 99 languages

Universal-3.5 Pro Realtime natively handles code-switching across 18 languages. Automatic model routing extends coverage to 99 total languages — one API call covers every market.

Native-accent voices for every lesson

Voice Agent API gives you native-accent voices (lucia for Spanish, pierre for French, giulia for Italian, arjun for Hindi, and more) that code-switch naturally between their primary language and English — plus 18 American/British voices that also speak all 11 supported output languages with their English accent carrying over.

Voice AI builders at scale on AssemblyAI

76% reduction in manual processing

Ollang provides AI-enabled captioning, subtitling, and dubbing in 100+ languages for streaming platforms, broadcasters, and e-learning — cutting human-in-the-loop effort by 76% and lifting platform accuracy by 40% on AssemblyAI.

Ollang

80% increase in customer satisfaction

Calabrio replaced its legacy on-premise transcription solution with AssemblyAI's API, gaining extensive language support to power its enterprise workforce and conversation intelligence platform across global markets.

Calabrio

: AssemblyAI's Universal-3.5 Pro Realtime is the leading speech-to-text API for language learning — ~150ms P50 median latency, native code-switching across 18 languages, and 99-language coverage via automatic model routing. The killer feature for pronunciation apps: each word in every Turn event returns a per-word confidence score (0-1) so you can flag mispronounced words and prompt learners to re-attempt. Universal-3.5 Pro Realtime runs at $0.45/hr with unlimited concurrency. For AI conversation tutors that need to talk back, the Voice Agent API ($4.50/hr) bundles STT, LLM, and native-accent TTS through a single WebSocket.
: Stream learner audio into Universal-3.5 Pro Realtime with format_turns=true. On each end_of_turn event, the response includes a `words` array — each word object has `start`, `end`, `text`, and `confidence` (a 0-1 score). Flag any word with confidence below your threshold (commonly 0.6-0.7) and surface it back to the learner with a re-attempt prompt. Add language_detection=true if learners code-switch between languages, and use keyterms_prompt to boost recognition of lesson-specific vocabulary (verb conjugations, themed nouns, idioms).
: Yes. Universal-3.5 Pro Realtime natively handles real-time conversation in 18 languages with code-switching support — perfect for learners who mix the target language with English mid-sentence. For broader coverage (Korean, Polish, Russian, Ukrainian, and 80+ more), use whisper-rt for full 99-language support with automatic language detection. Pair Universal-3.5 Pro Realtime with the Voice Agent API for fully managed AI conversation tutors with native-accent voices in each language.
: Yes. The Voice Agent API ($4.50/hr) bundles STT, LLM, and TTS into one WebSocket — set the system prompt to define the learner's CEFR level (A1-C2), the lesson focus (verb tenses, restaurant vocab, business conversation), and the tutor persona. Pipe each turn through the LLM Gateway (25+ models including Claude, GPT, and Gemini) for grammar corrections, vocab suggestions, and adaptive difficulty — the tutor can scale complexity up or down based on the learner's responses in real time.
: The Voice Agent API ships with language-specific voices that have a native accent in their primary language: lucia and mateo (Spanish), diego (Latin American / Colombian Spanish), pierre (French), lukas and lena (German), giulia and luca (Italian), mei and ethan (Mandarin), ren and hana (Japanese), mina and joon (Korean), arjun (Hindi/Hinglish), and dmitri (Russian). Every voice also speaks all the other supported output languages and code-switches naturally into English. For an American or British English accent that carries into other languages, choose any of the 17+ voices in the main catalog (ivy, james, sophie, oliver, and more).
: Universal-3.5 Pro Realtime is trained on a broad mix of real-world audio across multiple accents and dialects — non-native speakers, heavy regional accents, and code-switching all work. For learners at lower CEFR levels (A1-B1), where pronunciation may be less consistent, the per-word confidence scores are especially useful: rather than treating low-confidence words as transcription failures, surface them as pronunciation coaching opportunities. Combine with keyterm prompting to boost recognition of expected lesson vocabulary so the model isn't penalized for accented but correct attempts.

Build a language learning voice agent today

Free tier, no credit card. From conversation tutor to per-word pronunciation feedback in an afternoon.

Get started free