Introduction
The high-level objective of a pre-recorded STT model evaluation is to answer the question: which speech-to-text model is best for my product? This guide provides a step-by-step framework for evaluating and benchmarking pre-recorded speech-to-text models, with specific guidance for evaluating Universal-3 Pro and its prompting capabilities.
Need help with evaluations or prompt optimization? Contact our Sales team. We can help you design an evaluation, optimize prompts for your audio, and benchmark against your ground truth data.
Evaluation metrics
Traditional metrics
Word Error Rate (WER)
WER = (S + D + I) / N
This formula takes the number of Substitutions (S), Deletions (D), and Insertions (I), and divides their sum by the Total Number of Words in the ground truth transcript (N).
While WER calculation may seem simple, it requires a methodical, granular approach and reliable reference data. Word Error Rate tells you how “different” the automatic transcription is from the human transcription, and it is generally a reliable indicator of how “good” a transcription is. For more info on WER as a metric, read Dylan Fox’s blog post here.
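As a minimal sketch of the WER calculation described above, the following computes S, D, I, and WER with a word-level edit-distance table. The function name `wer_counts` is hypothetical, and the inputs are assumed to be already normalized, whitespace-separated text. When several alignments tie on total errors, the split between S, D, and I may differ from what another tool reports, although the WER itself is the same.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Return S, D, I, N and WER for two whitespace-tokenized transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # table[i][j] holds (total_errors, S, D, I) for ref[:i] versus hyp[:j].
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # hypothesis is empty: all deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # reference is empty: all insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match, no new error
                continue
            e, s, d, ins = table[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, ins)
            e, s, d, ins = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            # Tuples compare on total_errors first, so this keeps a minimum-error path.
            table[i][j] = min(substitution, deletion, insertion)

    errors, s, d, ins = table[-1][-1]
    return {"S": s, "D": d, "I": ins, "N": len(ref), "WER": errors / len(ref)}


result = wer_counts("the cat sat on the mat", "the cat sit on mat today")
print(result["WER"])  # 3 errors over 6 reference words
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.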
Concatenated minimum-Permutation Word Error Rate (cpWER)
cpWER is similar to WER, but it also counts words attributed to the wrong speaker as errors. The primary difference from standard WER is how S is calculated: it counts both word substitutions and correctly transcribed words that are assigned to the wrong speaker. A correct word with an incorrect speaker label counts as a substitution error, thereby penalizing both transcription and speaker diarization mistakes.
Formatted WER (F-WER)
F-WER is similar to WER, but it does not apply text normalization, so all formatting differences are counted in addition to word differences. Therefore, F-WER is always higher than or equal to WER.
Sentence Error Rate (SER)
The Sentence Error Rate (SER) is the ratio of the number of sentences with one or more errors to the total number of sentences.
Diarization Error Rate (DER)
DER = (false alarm + missed detection + confusion) / total speech duration
This formula takes the duration of non-speech incorrectly classified as speech (false alarm), the duration of speech incorrectly classified as non-speech (missed detection), and the duration of speaker confusion (confusion), and divides their sum by the total speech duration.
Missed Entity Rate (MER)
Fundamentally, MER is a negative recall rate computed for specified target entities. It is defined as the number of correctly transcribed entities relative to their total occurrence count. It accounts for multiple occurrences of the same entity and their positions within the hypothesis transcription. Our Research team proposes this as the best metric to measure the effectiveness of word boost.
For a simpler approach when evaluating high-stakes entities (such as credit card numbers, names, or dosages), consider a binary pass/fail metric per file: score each file as 1 if the model captured the target entity correctly, or 0 if it did not. Run this across 100 or more files for statistical reliability. This is especially useful when you need a quick signal on entity accuracy at scale without computing full MER breakdowns.
Metrics for Universal-3 Pro
Universal-3 Pro is significantly more capable than prior models, and traditional WER alone may not fully capture its performance. The following metrics provide a more complete picture.
Semantic WER
Traditional WER treats every difference between the model output and a reference transcript as an error, even when the difference is semantically equivalent. Semantic WER corrects this by normalizing equivalent words and phrases before calculating WER, so that differences like dr. vs doctor or 1300 vs thirteen hundred aren’t counted as errors.
Rule-based normalization
At its simplest, Semantic WER is a preprocessing step. Before running standard WER, apply find-and-replace rules to both the reference and hypothesis transcripts:
- Number formats: 1300 → thirteen hundred, $5 → five dollars
- Abbreviations and titles: dr. → doctor, mr. → mister, govt → government
- Contractions: gonna → going to, can't → cannot
- Variant spellings: grey → gray, cancelled → canceled
- Filler words: Remove um, uh, you know from both sides (or keep both; just be consistent)
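The find-and-replace rules above can be sketched as a small preprocessing function. This is an illustrative sketch, not AssemblyAI's normalizer: the names `REPLACEMENTS`, `FILLERS`, and `semantic_normalize` are hypothetical, the rule table contains only the examples listed above, and replacements match whole whitespace-delimited tokens, so a token with other punctuation attached (for example "grey,") is left unchanged.

```python
import re

# Illustrative rules taken from the list above; extend them for your own data.
REPLACEMENTS = {
    "1300": "thirteen hundred",
    "$5": "five dollars",
    "dr.": "doctor",
    "mr.": "mister",
    "govt": "government",
    "gonna": "going to",
    "can't": "cannot",
    "grey": "gray",
    "cancelled": "canceled",
}
# Longer fillers first so "you know" is removed as a unit.
FILLERS = ["you know", "um", "uh"]


def semantic_normalize(text: str) -> str:
    """Apply the same equivalence rules to a reference or hypothesis transcript."""
    text = text.lower().replace("’", "'")  # unify apostrophes before lookup
    tokens = [REPLACEMENTS.get(token, token) for token in text.split()]
    text = " ".join(tokens)
    for filler in FILLERS:
        text = re.sub(rf"\b{re.escape(filler)}\b", " ", text)
    return " ".join(text.split())  # collapse the whitespace left by removals


reference = semantic_normalize("uh doctor Smith is going to pay five dollars")
hypothesis = semantic_normalize("Um Dr. Smith is gonna pay $5")
print(reference == hypothesis)  # both sides reduce to the same string
```

Run the same function over both sides before computing WER; applying it to only one side would create differences instead of removing them.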
LLM-based scoring
For cases where simple rules can’t capture the nuance (was an omission meaningful? is a proper noun misspelling close enough?), an LLM can perform word-level alignment and classify each difference by severity:
- No penalty: Semantically equivalent forms (number formats, contractions, variant spellings)
- Minor penalty: Single-character misspellings, minor grammatical markers
- Major penalty: Incorrect substitutions, meaning-altering errors, significant omissions or additions of content words
LASER score (LLM-based ASR Evaluation Rubric)
LASER is a published LLM-based evaluation metric (Parulekar & Jyothi, EMNLP 2025) that uses an LLM prompt with detailed examples to classify ASR errors and compute a score: The LLM aligns each word in the ASR output against the reference transcription and assigns a penalty per word pair:
- No penalty (0): Acceptable variations including numerical format differences, abbreviations, compound word splits, transliterations, alternate spellings, proper noun variants, and colloquial terms
- Minor penalty (0.5): Small spelling errors (single character) or minor grammatical errors (gender, tense, number markers) that preserve sentence meaning
- Major penalty (1.0): Incorrect word substitutions, significant omissions or additions, and reordering that changes meaning
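Once an LLM has labeled each aligned word pair, the penalties above can be tallied in a few lines. The sketch below covers only the tallying step. The label names, the `penalty_rate` function, and the normalization by reference word count are assumptions made for illustration and are not the published LASER formula; consult the paper for the exact score definition.

```python
# Penalty weights from the rubric above; the label strings are hypothetical.
PENALTY = {"none": 0.0, "minor": 0.5, "major": 1.0}


def penalty_rate(labels: list[str], reference_word_count: int) -> float:
    """Sum per-pair penalties and divide by the reference length (an assumed normalization)."""
    if reference_word_count <= 0:
        raise ValueError("reference_word_count must be positive")
    total_penalty = sum(PENALTY[label] for label in labels)
    return total_penalty / reference_word_count


# Hypothetical labels an LLM judge might return for a five-word reference.
labels = ["none", "none", "minor", "major", "none"]
print(penalty_rate(labels, reference_word_count=5))  # (0.5 + 1.0) / 5
```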
Why new metrics matter
Traditional WER treats every difference between the model output and human transcription as an error. Universal-3 Pro’s contextual awareness means it will often transcribe words that human transcribers missed entirely. In traditional WER, these show up as insertions (penalized errors), even though the model is correct. This makes WER an unreliable metric when used alone: your evaluation is only as good as your ground truth labels.
This is why Artificial Analysis, an independent AI benchmarking organization, had to create proprietary evaluation datasets with manually corrected ground truths when building their Speech-to-Text leaderboard. Existing public datasets contain systematic human transcription errors that penalize models which are actually more accurate.
The evaluation process
This section provides a step-by-step guide on how to run an evaluation. The evaluation process should closely match your production environment, including the files you intend to transcribe, the model you intend to use, and the settings applied to those models.
Step 1: Prepare your evaluation dataset
Ensure that the files you use to benchmark are representative of the files you plan to use in production. For example, if you plan to transcribe meetings, gather a set of meeting recordings. If you plan to transcribe phone calls, focus on finding phone calls that match your customer base’s language and region. We recommend using at least 25 files that are representative of your use case. Length is less important than diversity of audio conditions: a good evaluation set covers the range of speakers, accents, audio quality, and vocabulary your model will encounter in production.
Then, gather human-labeled data to act as your source of ground truth. Ground truth is accurately transcribed audio data that will serve as the “correct answer” for your benchmark. Human-labeled data can be purchased from an external vendor or created manually.
Open-source audio corpora (for example, datasets on Hugging Face) can serve as a starting point for building ground truth, but they require review and correction before use in production evaluations. These datasets contain the same systematic human transcription errors described below (missing filler words, incorrect proper nouns, and simplified speech patterns) and should be audited against your actual audio before benchmarking.
Ground truth quality
The quality of your ground truth data directly affects the reliability of your evaluation. With Universal-3 Pro, this is more important than ever because the model frequently outperforms human transcribers. Common issues with ground truth data:
- Missing filler words: Human transcribers often omit um, uh, like, and other disfluencies
- Incorrect proper nouns: Rare names, technical terms, and domain vocabulary are often misspelled
- Simplified speech patterns: Human transcribers tend to “clean up” speech, missing repetitions, false starts, and self-corrections
- Code-switching errors: Multilingual segments are frequently translated to English rather than transcribed as spoken
See this article to learn more about why your word error rate (WER) benchmark might be lying to you.
Dataset diversity
A prompt that performs well overall may underperform on specific audio types. Include a diverse mix of audio in your evaluation set and track per-dataset breakdowns:
| Audio type | Characteristics | Typical WER range |
|---|---|---|
| Earnings calls | Clean English, formal vocabulary | Low |
| Meeting recordings | Multi-speaker, informal | Moderate |
| Code-switching audio | Mixed languages (e.g., English/Spanish) | Higher (normalization affects scoring) |
| Medical consultations | Clinical vocabulary, accented speech | Moderate |
| Phone calls | Compression artifacts, background noise | Moderate to high |
Step 2: Establish a baseline
Before optimizing prompts, measure your baseline performance by transcribing your evaluation set with Universal-3 Pro and no custom prompt. The built-in default is already applied when prompt is omitted, and it outperforms most custom prompts. Record both WER and Semantic WER so you can track improvements as you layer instructions on top.
Step 3: Transcribe and evaluate with prompts
Transcribe your files using AssemblyAI’s API with Universal-3 Pro and your candidate prompts. When crafting evaluation prompts, use the prompting guide as a reference. Key principles:
- Use authoritative language: The model responds better to Mandatory:, Required:, and Always: than to soft language like try to or please
- Be specific about speech patterns: Enumerate what you want preserved (disfluencies, filler words, hesitations, repetitions, stutters, false starts, colloquialisms)
- Give instructions, not just context: “This is a doctor-patient visit. Prioritize accurately transcribing medications and diseases wherever possible.” is far more effective than “This is a doctor-patient visit.”
- Start with fewer instructions, add one at a time: Every added instruction risks conflicting with another. Add a single instruction, evaluate it against your dataset, and only then add the next.
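The "add one at a time" principle above can be sketched as a greedy loop that keeps an instruction only when it lowers the error rate on your evaluation set. Everything here is hypothetical scaffolding: `build_prompt`, `add_one_at_a_time`, and the `score` callable (which would transcribe your dataset with the given prompt and return WER or Semantic WER) are names invented for illustration, and the stub scorer stands in for real transcription runs.

```python
from typing import Callable


def build_prompt(base: str, instructions: list[str]) -> str:
    """Join a base prompt and the instructions kept so far into one string."""
    return " ".join([base, *instructions]).strip()


def add_one_at_a_time(
    base: str, candidates: list[str], score: Callable[[str], float]
) -> tuple[str, float]:
    """Try each candidate instruction in order; keep it only if the score (lower is better) improves."""
    kept: list[str] = []
    best = score(build_prompt(base, kept))  # the empty prompt stands for the baseline run
    for instruction in candidates:
        trial = score(build_prompt(base, kept + [instruction]))
        if trial < best:
            kept.append(instruction)
            best = trial
    return build_prompt(base, kept), best


def stub_score(prompt: str) -> float:
    """Stand-in for a real evaluation run; the numbers are invented for the demo."""
    return 0.10 - 0.02 * ("Required:" in prompt) + 0.03 * ("please" in prompt)


prompt, wer = add_one_at_a_time(
    "",
    ["Required: Preserve disfluencies and false starts.", "please be careful with names"],
    stub_score,
)
print(prompt)  # only the instruction that lowered the stub score is kept
```

A greedy pass is order-dependent, so it is worth re-running with the candidates in a different order before settling on a prompt.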
Step 4: Text normalization
Before calculating WER metrics, both reference (ground truth) and hypothesis (model generated) texts need to be normalized to ensure a fair comparison. This accounts for differences in:
- Punctuation and capitalization
- Number formatting (e.g., “twenty-one” vs. “21”)
- Contractions and abbreviations
- Other stylistic variations that don’t affect meaning
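A minimal normalizer for the surface differences listed above might look like the following. It is a sketch under stated assumptions: the name `normalize_for_wer` is hypothetical, it handles only case, punctuation, whitespace, and bracketed tags such as [unclear] or [masked], and it does not convert number formats or expand contractions, which need dedicated rules.

```python
import re
import string

# Bracketed uncertainty tags that should not be scored as words.
TAG_PATTERN = re.compile(r"\[(?:unclear|masked)\]", re.IGNORECASE)

# Map punctuation to spaces (so "twenty-one" becomes "twenty one"),
# but keep apostrophes so contractions stay as single tokens.
_PUNCTUATION = string.punctuation.replace("'", "")
_PUNCT_TABLE = str.maketrans({char: " " for char in _PUNCTUATION})


def normalize_for_wer(text: str) -> str:
    """Lowercase, drop tags and punctuation, and collapse whitespace."""
    text = TAG_PATTERN.sub(" ", text)
    text = text.lower().replace("’", "'")
    text = text.translate(_PUNCT_TABLE)
    return " ".join(text.split())


print(normalize_for_wer("Hello, [unclear] World!"))  # hello world
```

Apply the same function to the reference and the hypothesis so that neither side is penalized for formatting the other side lacks.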
If you are prompting Universal-3 Pro to include [unclear] or [masked] tags for uncertain audio, ensure your normalizer strips these tags before computing WER. Otherwise, they will be counted as insertions.
Step 5: Compare and calculate
Calculate the error rates using the formulas above or consider using a library like jiwer. For Semantic WER, apply text normalization replacements before calculating WER. For LASER scoring, use an LLM-based evaluator (see Open-source tools below). When reviewing results:
- Check per-dataset breakdowns, not just aggregate WER
- Audit insertions manually by listening to the audio
- Compare both traditional WER and Semantic WER to get a full picture
- Track which prompt components improve which audio types
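To illustrate why per-dataset breakdowns matter, the sketch below pools error counts by audio type and compares them with the aggregate. The function name `wer_by_dataset`, the record layout, and every number in the example are invented for illustration; in practice the counts would come from your own WER runs.

```python
from collections import defaultdict


def wer_by_dataset(results: list[dict]) -> dict:
    """Pool errors and reference words per dataset, then add an 'overall' entry."""
    totals = defaultdict(lambda: [0, 0])  # dataset -> [errors, reference words]
    for record in results:
        totals[record["dataset"]][0] += record["errors"]
        totals[record["dataset"]][1] += record["ref_words"]

    breakdown = {name: errors / words for name, (errors, words) in totals.items()}
    all_errors = sum(errors for errors, _ in totals.values())
    all_words = sum(words for _, words in totals.values())
    breakdown["overall"] = all_errors / all_words
    return breakdown


# Made-up per-file counts: the aggregate hides a weak spot on phone calls.
results = [
    {"dataset": "earnings_calls", "errors": 12, "ref_words": 600},
    {"dataset": "earnings_calls", "errors": 8, "ref_words": 400},
    {"dataset": "phone_calls", "errors": 150, "ref_words": 1000},
]
for name, value in wer_by_dataset(results).items():
    print(f"{name}: {value:.3f}")
```

Pooling counts before dividing weights each file by its length; averaging per-file WER instead would give short files the same influence as long ones.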
Qualitative analysis
Quantitative metrics don’t capture everything. Qualitative analysis helps you identify differences between STT providers that metrics might miss. For example, how certain key terms are transcribed can make or break a transcript, even if the rest of the transcript has a lower overall error rate. Qualitative analysis is also useful for tie-breaking when benchmarking metrics don’t clearly favor one model over another. Since you’re comparing models against each other, ground truth files aren’t required.
Side-by-side comparison: Have users compare and pick their preferred transcript between two formatted outputs from different STT providers. Tools like Diffchecker or any side-by-side interface work well for this.
LLM as judge: An LLM can automatically identify differences between two transcriptions and pick a winner. However, be cautious: an LLM judge can be misled by outputs that look correct but contain subtle errors (such as translated code-switching segments that read well in English but don’t reflect what was actually spoken). Always pair LLM-based judgments with spot-checking against the actual audio.
A/B testing in production: Serve transcripts from different providers to users and collect feedback. You can ask users to score transcripts directly, or track indirect signals like the number of support ticket complaints about transcription quality.
Domain-specific evaluation considerations
WER is not always the right primary metric. Some domains prioritize output qualities that traditional accuracy metrics do not capture:
- Medical scribes: Customers often evaluate based on user preference rate and readability, that is, whether clinicians prefer the transcript output for generating clinical notes. Formatting quality, medical terminology accuracy, and structured output can matter more than raw WER. See the Medical Scribe guides for domain-specific evaluation guidance.
- Legal transcription: Verbatim accuracy including disfluencies and speaker attribution may be more important than clean, readable output.
- Media and entertainment: Proper noun accuracy for names, places, and brands can outweigh overall WER.
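Where specific terms matter more than overall WER, the per-file pass/fail entity check described under Missed Entity Rate is easy to sketch. This example is illustrative only: `entity_pass_rate` is a hypothetical helper, the transcripts are invented, and the case-insensitive substring match is a deliberately crude assumption that will not handle spacing or formatting variants of the same entity.

```python
def entity_pass_rate(files: list[tuple[str, str]]) -> tuple[float, list[int]]:
    """Score each (hypothesis, target_entity) pair as 1 or 0 and return the pass rate."""
    if not files:
        raise ValueError("at least one file is required")
    scores = [
        1 if entity.lower() in hypothesis.lower() else 0
        for hypothesis, entity in files
    ]
    return sum(scores) / len(scores), scores


# Invented transcripts: the second one garbles the medication name.
files = [
    ("Take 20 mg of Lisinopril daily", "lisinopril"),
    ("take 20 mg of listen a pill daily", "lisinopril"),
]
rate, scores = entity_pass_rate(files)
print(rate, scores)  # 0.5 [1, 0]
```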
Iterating on prompts
Finding the optimal prompt for your use case is an iterative process. There are two main approaches:
Manual iteration
- Start with the default system prompt or one of the reference prompts below
- Transcribe a representative sample of your audio
- Review the output against your ground truth, focusing on the types of errors that matter most for your use case
- Adjust the prompt to address specific error patterns
- Re-evaluate and compare
Automated optimization
For large-scale prompt optimization, consider using one of the open-source tools described below. These tools systematically test prompt component combinations and score them against your evaluation data, converging on the best prompt for your specific audio.
Reference prompts for evaluation
Use these prompts directly from the prompting guide as your evaluation prompts.
Evaluation prompt
Start with the built-in default (Best all around). Omit the prompt parameter to use it; you don’t need to set it explicitly:
Comparison prompt (for identifying model uncertainty)
This is the Handling unclear audio with [unclear] prompt. Run it alongside the evaluation prompt on the same audio and diff the outputs to find where the model is guessing:
- Evaluating how the model handles unclear or noisy audio
- Finding segments where the model’s guesses may be incorrect
- Prioritizing which audio segments to manually review
- Understanding whether WER differences are coming from genuine errors or uncertain segments
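The diff between the two outputs can be sketched with Python's standard `difflib`. The helper name `uncertain_spans` and the sample transcripts are hypothetical; the sketch assumes the comparison run marks uncertain audio with a literal [unclear] token and that both transcripts are otherwise similar enough to align word by word.

```python
import difflib


def uncertain_spans(default_text: str, unclear_text: str, tag: str = "[unclear]") -> list[str]:
    """Return the words from the default run that the comparison run replaced with the tag."""
    default_words = default_text.split()
    unclear_words = unclear_text.split()
    matcher = difflib.SequenceMatcher(a=default_words, b=unclear_words, autojunk=False)

    spans = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        # Keep only the differing regions where the comparison run emitted the tag.
        if op != "equal" and tag in unclear_words[b_start:b_end]:
            spans.append(" ".join(default_words[a_start:a_end]))
    return spans


default_run = "please send it to mister carrington by friday"
comparison_run = "please send it to mister [unclear] by friday"
print(uncertain_spans(default_run, comparison_run))  # ['carrington']
```

The returned words are the default run's guesses for audio the comparison run flagged, which makes them good candidates for manual review against the recording.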
What works and what doesn’t
The authoritative list lives in the prompting guide; see What works / what to avoid. The same rules apply when building evaluation prompts: lead with Transcribe…, use authoritative language (Required:, Mandatory:, Always:), describe the pattern to watch for, and add instructions one at a time.
Open-source tools
aai-cli
aai-cli is a command-line tool for evaluating and optimizing transcription prompts. It supports:
- Prompt evaluation: Score a prompt against datasets from Hugging Face or your own audio files using WER and LASER metrics
- Prompt optimization: Automatically iterate on prompts using DSPy GEPA with LASER feedback, where an LLM reflects on transcription errors and proposes improved prompts
- Dataset discovery: Search and load audio datasets from Hugging Face for benchmarking
prompt-seeker
prompt-seeker uses Bayesian optimization (Optuna TPE) to systematically find the best transcription prompt by testing component combinations across diverse audio datasets and scoring with Semantic WER. It supports:
- Component-based optimization: Modular prompt pieces (language, disfluency, punctuation, etc.) are tested in combinations
- Meta-optimization: An LLM designs new component spaces between optimization rounds based on accumulated findings
- Per-dataset analysis: Breakdown of what works for each audio type in your evaluation set