Insights & Use Cases
June 23, 2026

Keyterm prompting for real-time accuracy: boosting names, jargon, and product terms

Real-time speech-to-text misses the rare, high-value words your app depends on. Keyterm prompting is the lever that fixes them — here's how it works and how to use it right.

Kelsey Foster
Growth

Your voice agent nails 98% of the conversation and then fumbles the one word that actually mattered — the caller's last name, your product SKU, the medication. "Metoprolol" comes back as "metoprolal." "Byrne-Donoghue" becomes "Byrne Donahue." The transcript looks great in the demo and falls apart on the exact tokens your downstream logic depends on.

If you're evaluating real-time speech-to-text, this is the accuracy problem that decides whether the model is usable in production. And it's the one generic benchmarks hide, because they average it away. The lever that fixes it is keyterm prompting. Here's how it actually works, and how to use it without making things worse.

Why the words that matter are the hardest to get right

It's not random which words a model misses. Domain-specific terms fail at roughly 3–5x the rate of general speech in off-the-shelf models, and the reason is training data. Common English words show up millions of times; your company name, your SKUs, a clinician's surname, an alphanumeric policy ID — those barely appear, if at all. The model has no strong prior for them, so the moment audio gets noisy or a speaker has an accent, it falls back to a more common word that sounds similar.

That's the trap: the rare, high-value terms are exactly the ones most exposed to error, and they're the ones a transcript can least afford to get wrong. A misheard filler word costs nothing. A misheard account number breaks the call.

What keyterm prompting does

Keyterm prompting is the most direct accuracy lever AssemblyAI gives you for this. You pass a list of the words and phrases that matter — names, brands, jargon, product terms — through the keyterms_prompt parameter, and the model biases toward recognizing them. It's an array of strings, and it works on both pre-recorded and streaming transcription. It's part of the broader promptable interface AssemblyAI introduced with Universal-3 Pro and carried into the Universal-3.5 Pro Realtime flagship.

keyterms_prompt=["Kelly Byrne-Donoghue", "metoprolol", "Universal-3.5 Pro"]

That's the whole interface. The interesting part is what happens behind it.

How it actually works: two boosting stages

This is the part that comes up constantly and rarely gets explained. Streaming keyterm prompting isn't one mechanism — it's two, working in sequence.

Word-level boosting happens live, during inference. The model is biased toward your keyterms as words are emitted, so recognition improves in real time as the audio streams in. This stage is on by default.

Turn-level boosting happens after each turn completes. A second pass re-examines the full turn against your keyterms list using metaphone-based matching — phonetic matching, not exact spelling. Metaphone encodes how a word sounds, so when the model hears "Byrne Donahue" and your keyterm is "Byrne-Donoghue," the phonetic codes line up and the term gets corrected to the spelling you specified. That's why keyterms fix names that are pronounced one way and spelled another. On Universal-3.5 Pro Realtime (universal-3-5-pro), turn-level boosting is always active; on Universal-Streaming English and Multilingual it kicks in when format_turns=true.

The two stages stack: word-level catches the term as it's spoken, turn-level cleans up anything that slipped through using sound-alike matching. Together they target the precise failure mode — a term the model heard almost right.
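
To make the turn-level pass concrete, here is a minimal sketch of sound-alike correction. It is not AssemblyAI's implementation and it does not use real Metaphone: phonetic_key is a deliberately crude, hypothetical stand-in (strip punctuation, treat "gh" as "h", drop non-initial vowels), and correct_turn is an illustrative helper name. The only point is the shape of the idea: encode both the heard words and the keyterm by sound, and when the codes match, emit the keyterm's spelling.

```python
import re


def phonetic_key(text: str) -> str:
    """Crude sound-alike key. A toy stand-in for Metaphone, not the real algorithm."""
    letters = re.sub(r"[^a-z]", "", text.lower())            # ignore case, spaces, hyphens
    letters = letters.replace("gh", "h").replace("ph", "f")  # two common spelling variants
    letters = re.sub(r"(.)\1+", r"\1", letters)              # collapse doubled letters
    # Keep the first letter, then drop vowels (and y) from the rest.
    return letters[:1] + re.sub(r"[aeiouy]", "", letters[1:])


def correct_turn(transcript: str, keyterms: list[str], max_words: int = 3) -> str:
    """Replace word windows that sound like a keyterm with the keyterm's spelling."""
    targets = {phonetic_key(term): term for term in keyterms}
    words = transcript.split()
    out, i = [], 0
    while i < len(words):
        # Try the longest window first so multi-word names win over single words.
        for n in range(min(max_words, len(words) - i), 0, -1):
            term = targets.get(phonetic_key(" ".join(words[i:i + n])))
            if term is not None:
                out.append(term)
                i += n
                break
        else:
            out.append(words[i])  # no keyterm sounds like this word; keep it
            i += 1
    return " ".join(out)


heard = "this is Kelly Byrne Donahue calling about metoprolal"
print(correct_turn(heard, ["Byrne-Donoghue", "metoprolol"]))
# this is Kelly Byrne-Donoghue calling about metoprolol
```

A key this coarse collides easily, which is a small-scale illustration of why a long or generic keyterm list invites false matches.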

Using it in a streaming session

On streaming, keyterms_prompt is set when you open the WebSocket and can be changed mid-stream. Here's the shape with the Python SDK:

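from assemblyai.streaming.v3 import StreamingClient, StreamingClientOptions, StreamingParameters

client = StreamingClient(StreamingClientOptions(api_key="<YOUR_API_KEY>"))
# Keyterms are supplied when the streaming session opens.
client.connect(
    StreamingParameters(
        sample_rate=16000,
        speech_model="universal-3-5-pro",
        keyterms_prompt=["Keanu Reeves", "AssemblyAI", "Universal-3.5 Pro"],
    )
)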

A few limits worth knowing before you load it up, because they're easy to trip over:

  • Streaming caps at 100 keyterms per session, and each term must be 50 characters or less. Exceed the 100-term cap and the request errors; terms longer than 50 characters are ignored. (Pre-recorded is far more generous: up to 1,000 words or phrases, max six words each.)
  • On Universal-3.5 Pro Realtime, keyterms are included at no extra cost, alongside the model's conversation context. On Universal-Streaming English and Multilingual, keyterm boosting is an add-on.
  • On streaming, you can combine keyterms_prompt with a prompt in the same request; keyterms get appended to your prompt automatically. (For more on shaping output with prompts, see our prompt engineering guide.)
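
Because the streaming limits are easy to trip over, it can help to check a list before opening a session. The sketch below is a hypothetical client-side helper (prepare_streaming_keyterms is not part of any SDK) that applies the two streaming limits quoted above: it sets aside terms over 50 characters, which would otherwise be ignored, and raises before you send more than 100.

```python
MAX_STREAMING_KEYTERMS = 100  # per session, from the streaming limits above
MAX_KEYTERM_CHARS = 50        # per term, from the streaming limits above


def prepare_streaming_keyterms(terms):
    """Return (usable, too_long); raise ValueError if usable exceeds the session cap."""
    # Trim whitespace, drop empty strings, and de-duplicate while keeping order.
    unique = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    usable = [t for t in unique if len(t) <= MAX_KEYTERM_CHARS]
    too_long = [t for t in unique if len(t) > MAX_KEYTERM_CHARS]
    if len(usable) > MAX_STREAMING_KEYTERMS:
        raise ValueError(
            f"{len(usable)} keyterms exceeds the streaming cap of {MAX_STREAMING_KEYTERMS}"
        )
    return usable, too_long


usable, too_long = prepare_streaming_keyterms(["metoprolol", "metoprolol", "Dr. Patel"])
print(usable)    # ['metoprolol', 'Dr. Patel']
print(too_long)  # []
```

Surfacing the over-length terms instead of silently dropping them makes it obvious when a term you care about never reached the model.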

The real unlock for voice agents is dynamic keyterms. You don't have to commit to one list for the whole call. Send an UpdateConfiguration message and swap the terms as the conversation moves:

{
  "type": "UpdateConfiguration",
  "keyterms_prompt": ["cardiology", "echocardiogram", "Dr. Patel", "metoprolol"]
}

So when your agent reaches the identity-verification step, you load names and date-of-birth terms; when it moves to a medical intake, you swap in clinical vocabulary. You're priming the model for exactly what it's about to hear, stage by stage — which is one of the most effective ways to lift mid-call accuracy. (Our real-time transcription guide shows a full working setup, and the Universal-3.5 Pro Realtime release post covers dynamic prompting in context.)
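
One way to organize stage-by-stage swapping is a lookup from call stage to keyterm list, serialized into the UpdateConfiguration message shown above. This is a sketch under assumptions: the stage names, the vocabulary, and the helper update_configuration_message are all hypothetical, and how you send the resulting string depends on your WebSocket client or SDK.

```python
import json

# Hypothetical stages and vocabulary for an illustrative medical intake agent.
STAGE_KEYTERMS = {
    "verification": ["Kelly Byrne-Donoghue", "Keanu Reeves"],
    "intake": ["cardiology", "echocardiogram", "Dr. Patel", "metoprolol"],
}


def update_configuration_message(stage: str) -> str:
    """Serialize the UpdateConfiguration message that swaps in a stage's keyterms."""
    return json.dumps(
        {"type": "UpdateConfiguration", "keyterms_prompt": STAGE_KEYTERMS[stage]}
    )


# When the agent moves from identity verification to intake, build the message
# and send it over the already-open streaming connection.
message = update_configuration_message("intake")
print(message)
```

Keeping the per-stage lists in one place also makes it easier to audit that each one stays short and specific.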

Keyterms aren't your only context lever anymore

Keyterm prompting tells the model which words to expect. Universal-3.5 Pro Realtime adds two more ways to give it context, and they stack with keyterms:

  • Conversation context (Context Carryover) is on by default. The model keeps a short, rolling memory of the call and uses it as context for the next turn, so it's no longer deciding each utterance cold. Across a 20,000-file voice agent benchmark, passing context cut word error rate by 10.2%, concentrated exactly where agents hurt: short utterances, names, and entities.
  • agent_context lets a voice agent pass in the question it just asked. Prime the model with "What's your email address?" and a mumbled reply resolves to user@assemblyai.com instead of "user at assembly a i dot com." Spelled-out account IDs, addresses, and one-word confirmations — the short utterances that wreck most realtime models — finally have the context to come out right.

Use keyterms for the specific high-value vocabulary your application can't lose, and use context to sharpen everything around it.

How to use keyterms without making accuracy worse

More keyterms is not better. Overloading the list — or stuffing it with common words — causes overcorrection and hallucinations, where the model forces a keyterm onto audio that didn't contain it. The guidance that holds up in production:

  • Start with none. Run your real audio first, find the terms the model consistently misses, and add only those. Don't pre-load a dictionary.
  • Match spelling and capitalization exactly to the output you want. Keyterms double as a spelling instruction — "Byrne-Donoghue," not "byrne donoghue."
  • Skip common words. "Information," "account," "today" — the model already handles these, and boosting them just invites false matches.
  • Keep the list tight. A focused set of genuinely hard, genuinely important terms beats a long list every time.

Think of it as a scalpel for the specific words your application can't afford to lose, not a blanket over the whole transcript.
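
The "start with none" workflow can be sketched as a small script. It assumes you already have pairs of human reference transcripts and model output from runs with no keyterms set; suggest_keyterms, the COMMON_WORDS stoplist, and the min_misses threshold are hypothetical choices for illustration, not recommended values.

```python
from collections import Counter

# Illustrative stoplist only; a real one would be much larger.
COMMON_WORDS = {"information", "account", "today"}


def suggest_keyterms(pairs, candidates, min_misses=2):
    """Keep candidates the model missed at least min_misses times.

    pairs: iterable of (reference, hypothesis) transcripts from real audio.
    candidates: terms spelled and capitalized exactly as you want them output.
    """
    misses = Counter()
    for reference, hypothesis in pairs:
        for term in candidates:
            # Exact, case-sensitive match: the spelling itself is part of the target.
            if term in reference and term not in hypothesis:
                misses[term] += 1
    return [
        term for term in candidates
        if misses[term] >= min_misses and term.lower() not in COMMON_WORDS
    ]


pairs = [
    ("call Dr. Patel about metoprolol today", "call Dr. Patel about metoprolal today"),
    ("refill metoprolol please", "refill metoprolal please"),
    ("check my account today", "check my count today"),
]
print(suggest_keyterms(pairs, ["metoprolol", "Dr. Patel", "account"]))
# ['metoprolol']
```

Here "Dr. Patel" is left out because the model already gets it right, and "account" is left out both for being missed only once and for being a common word.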

Where accents fit in

Keyterm prompting is also one of your best levers for accented speech — a question that comes up constantly, especially for heavier accents like Irish, Scottish, or strong regional English. Accents lower the model's confidence on individual words, and low confidence is exactly when a name or technical term slips toward a more common sound-alike. Because turn-level boosting matches phonetically, keyterms are well suited to catch those: the accented pronunciation still maps to the metaphone code of the term you specified.

That said, keyterms aren't the whole answer for accents. Pair them with the right model — Universal-3.5 Pro Realtime shows consistent improvements on accented speech, and its voice_focus option isolates the primary speaker so a noisy room or a second voice doesn't drag accuracy down further — and, critically, test with audio that matches your actual users. A model's accuracy on a clean American-English benchmark tells you almost nothing about how it handles an Irish caller on a noisy phone line. Boost the names and terms that matter, run your real accents through it, and measure what you actually get. Our guide on how to evaluate speech recognition models covers building that kind of test set, including Missed Entity Rate — the metric that actually captures whether your high-value terms survive.
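
To measure what you actually get across accents, you can compute a missed-entity figure per speaker group. The sketch below assumes a simple definition, the share of expected entities that do not appear with their exact spelling in the model output; the guide referenced above may define Missed Entity Rate differently, and the sample fields ("group", "entities", "hypothesis") are hypothetical.

```python
from collections import defaultdict


def missed_entity_rate_by_group(samples):
    """Return {group: missed / expected} over the expected entities in each group.

    samples: dicts with 'group' (e.g. an accent label), 'entities' (expected
    spellings), and 'hypothesis' (the model's transcript for that clip).
    """
    expected = defaultdict(int)
    missed = defaultdict(int)
    for sample in samples:
        for entity in sample["entities"]:
            expected[sample["group"]] += 1
            if entity not in sample["hypothesis"]:
                missed[sample["group"]] += 1
    return {group: missed[group] / expected[group] for group in expected}


samples = [
    {"group": "irish", "entities": ["Byrne-Donoghue", "metoprolol"],
     "hypothesis": "Byrne Donahue takes metoprolol"},
    {"group": "american", "entities": ["metoprolol"],
     "hypothesis": "she takes metoprolol"},
]
print(missed_entity_rate_by_group(samples))
# {'irish': 0.5, 'american': 0.0}
```

Running the same samples with and without keyterms, split by group, shows whether boosting helps the speakers you actually serve.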

Keyterms also aren't your only prompting lever. When the vocabulary you need to boost is unknown or varies call to call, open-field speech-to-text prompting gives you behavioral control — formatting, verbatim vs. clean, domain context — that a term list can't. On streaming you can run a prompt and keyterms_prompt together; on pre-recorded they're mutually exclusive, so you pick one per request.

The bottom line

Headline accuracy numbers don't decide whether a transcription model works for you — accuracy on the handful of words your application depends on does. Keyterm prompting is the lever that targets exactly those words, through real-time biasing plus a phonetic cleanup pass that fixes sound-alike errors on names and jargon. Use it surgically: start empty, add the terms the model actually misses, spell them the way you want them, and update them as the conversation moves. Layer in conversation context and agent_context on top, and the demo that fell apart on a customer's name becomes the system that gets it right on the first try — which, for anything past the prototype stage, is the only accuracy that counts.

Frequently asked questions

What is keyterm prompting in speech-to-text? Keyterm prompting is a feature that improves transcription accuracy for specific words and phrases you supply through the keyterms_prompt parameter — names, brands, jargon, product terms, and other domain vocabulary. The model biases toward recognizing those terms, so the high-value words your application depends on are far less likely to be mis-transcribed.

How do I improve recognition accuracy for names, jargon, and product terms? Pass them as keyterms. With AssemblyAI you provide an array of terms via keyterms_prompt (up to 100 per streaming session, up to 1,000 for pre-recorded audio), spelled exactly as you want them to appear. The model applies real-time word-level boosting plus a metaphone-based turn-level pass that corrects phonetically similar mistakes — so "Byrne Donahue" becomes "Byrne-Donoghue." On Universal-3.5 Pro Realtime you can stack conversation context and agent_context on top.

How can I transcribe audio with heavy accents, like an Irish accent? Use a model with strong accented-speech accuracy (Universal-3.5 Pro Realtime), turn on voice_focus to isolate the primary speaker, then add keyterm prompting for the names and terms accents most often distort — phonetic boosting maps the accented pronunciation back to the spelling you specified. Most importantly, test with audio from your actual user demographics rather than clean benchmark clips, since accent accuracy is highly specific to the speakers you serve.

Does keyterm prompting work in real time? Yes. On streaming, keyterms are applied live through word-level boosting as audio arrives, then refined by a turn-level boosting pass after each turn. You can also update the keyterms list mid-stream with an UpdateConfiguration message — useful for voice agents that move through stages (verification, intake, payment) with different vocabulary at each.

Can adding too many keyterms hurt accuracy? Yes. Overloading the list or including common words can cause overcorrection and hallucinations, where the model forces a keyterm onto audio that didn't contain it. Start with no keyterms, add only the terms the model consistently misses, keep the list focused, and avoid common words the model already handles well.

How many keyterms can I use, and do they cost extra? Streaming allows up to 100 keyterms per session (50 characters max each); pre-recorded allows up to 1,000 words or phrases (six words max per phrase). Keyterm prompting is included at no extra cost on Universal-3.5 Pro Realtime, alongside conversation context; on Universal-Streaming English and Multilingual it's an add-on.
