Evaluating Pre-recorded STT models

Introduction

The high-level objective of a pre-recorded STT model evaluation is to answer the question: which Speech-to-Text model is best for my product?

This guide provides a step-by-step framework for evaluating and benchmarking pre-recorded Speech-to-text models, with specific guidance for evaluating Universal-3-Pro and its prompting capabilities.

Need help with evaluations or prompt optimization? Contact our Sales team — we can help you design an evaluation, optimize prompts for your audio, and benchmark against your ground truth data.

Evaluation metrics

Traditional metrics

Word Error Rate (WER)

WER = \frac{S + D + I}{N}

This formula takes the number of Substitutions (S), Deletions (D), and Insertions (I), and divides their sum by the Total Number of Words in the ground truth transcript (N).

While WER calculation may seem simple, it requires a methodical granular approach and reliable reference data. Word Error Rate can tell you how “different” the automatic transcription was compared to the human transcription, and generally, this is a reliable metric to determine how “good” a transcription is. For more info on WER as a metric, read Dylan Fox’s blog post here.
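The formula above can be computed with a standard word-level edit distance. Here is a minimal, self-contained sketch; libraries such as jiwer implement this (plus alignment details and normalization hooks) for you:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance.

    The DP table finds the minimum number of substitutions,
    deletions, and insertions needed to turn the hypothesis
    into the reference, then divides by the reference length.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat mat")` yields 2/6: two deletions against six reference words.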

Concatenated minimum-Permutation Word Error Rate (cpWER)

\text{cpWER} = \frac{S_{\text{spk}} + D + I}{N}

cpWER is similar to WER, but it also penalizes words attributed to the wrong speaker. The primary difference from standard WER is how S is calculated: S_{\text{spk}} counts both word substitutions and correctly transcribed words that are assigned to the wrong speaker. A correct word with an incorrect speaker label counts as a substitution error, thereby penalizing both transcription and speaker diarization mistakes.

Formatted WER (F-WER)

F-WER is similar to WER, but it does not apply text normalization, so all formatting differences are counted in addition to word differences when computing the WER. Therefore, F-WER is always greater than or equal to WER.

Sentence Error Rate (SER)

\text{SER} = \frac{N_{\text{err}}}{N_{\text{sent}}}

The Sentence Error Rate (SER) is the ratio of the number of sentences with one or more errors to the total number of sentences.
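A minimal sketch of the ratio, assuming the reference and hypothesis have already been split into aligned sentence lists (real pipelines need a sentence-alignment step first, which is out of scope here):

```python
import string


def ser(ref_sentences: list[str], hyp_sentences: list[str]) -> float:
    """Sentence Error Rate: fraction of sentences with >= 1 error.

    A sentence counts as erroneous if its normalized text differs
    from the aligned reference sentence.
    """
    def norm(s: str) -> str:
        # Lowercase, strip punctuation, collapse whitespace.
        s = s.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(s.split())

    errors = sum(norm(r) != norm(h)
                 for r, h in zip(ref_sentences, hyp_sentences))
    return errors / len(ref_sentences)
```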

Diarization Error Rate (DER)

DER = \frac{\text{false alarm} + \text{missed detection} + \text{confusion}}{\text{total}}

This formula takes the duration of non-speech incorrectly classified as speech (false alarm), the duration of speech incorrectly classified as non-speech (missed detection), and the duration of speaker confusion (confusion), and divides their sum by the total speech duration.
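The duration-based terms can be approximated with a frame-level sketch. This version assumes speaker labels are already mapped between reference and hypothesis; a full implementation (such as pyannote.metrics) also searches for the optimal speaker mapping:

```python
def der(ref_segments, hyp_segments, frame: float = 0.01) -> float:
    """Frame-based Diarization Error Rate sketch.

    ref_segments / hyp_segments: lists of (start, end, speaker) tuples
    in seconds. Assumes non-overlapping speech and pre-mapped labels.
    """
    end = max(e for _, e, _ in ref_segments + hyp_segments)
    n_frames = int(round(end / frame))

    def speaker_at(segments, t):
        for s, e, spk in segments:
            if s <= t < e:
                return spk
        return None  # non-speech at time t

    false_alarm = missed = confusion = total = 0
    for i in range(n_frames):
        t = (i + 0.5) * frame  # evaluate at the frame midpoint
        r, h = speaker_at(ref_segments, t), speaker_at(hyp_segments, t)
        if r is not None:
            total += 1
            if h is None:
                missed += 1       # speech classified as non-speech
            elif h != r:
                confusion += 1    # speech assigned to wrong speaker
        elif h is not None:
            false_alarm += 1      # non-speech classified as speech
    return (false_alarm + missed + confusion) / total
```

For example, if the reference has speaker A for 0-1s and B for 1-2s, but the hypothesis labels 0-1.5s as A and 1.5-2s as silence, half the speech is either confused or missed, giving a DER of 0.5.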

Missed Entity Rate (MER)

\text{MER} = 1 - \frac{N_{\text{rec}}}{N_{\text{total}}}

Fundamentally, MER is the complement of a recall rate computed for specified target entities: one minus the number of correctly transcribed entities relative to their total occurrence count. It accounts for multiple occurrences of the same entity and their positions within the hypothesis transcription. Our Research team proposes this as the best metric for measuring the effectiveness of word boost.
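As a simplified illustration, the sketch below counts recovered entity occurrences; it assumes single-word entities and ignores their positions within the transcript, which the full metric accounts for:

```python
def mer(entities: list[str], reference: str, hypothesis: str) -> float:
    """Missed Entity Rate sketch for a list of target entities.

    Counts occurrences of each entity in the reference and how many
    are recovered in the hypothesis, capped at the reference count so
    extra hypothesis occurrences earn no credit.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    total = recognized = 0
    for entity in entities:
        e = entity.lower()
        n_ref = sum(1 for w in ref_words if w == e)
        n_hyp = sum(1 for w in hyp_words if w == e)
        total += n_ref
        recognized += min(n_ref, n_hyp)
    return 1 - recognized / total if total else 0.0
```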

Metrics for Universal-3-Pro

Universal-3-Pro is significantly more capable than prior models, and traditional WER alone may not fully capture its performance. The following metrics provide a more complete picture.

Semantic WER

Semantic WER uses an LLM (such as Claude) to perform word-level alignment and scoring, applying nuanced rules that traditional WER ignores. Instead of treating every difference as an error, Semantic WER classifies differences into categories:

  • No penalty: Variant spellings of the same name, number format differences (1300 vs thirteen hundred), contractions (going to vs gonna), singular/plural of the same word, filler words added or removed
  • Minor penalty (0.5): Single-character spelling errors, minor grammatical markers, small proper noun misspellings
  • Major penalty (1.0): Incorrect word substitutions, meaning-altering errors, significant omissions or additions of content words

This approach is particularly valuable for Universal-3-Pro because the model often transcribes audio more accurately than human transcribers, producing differences that are correct but would be penalized by traditional WER. For an implementation of Semantic WER using Bayesian optimization, see prompt-seeker.
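The penalty scheme above can be applied mechanically once an LLM has classified each difference. This sketch assumes the classification step (the LLM call itself, which is out of scope here) has already produced (reference, hypothesis, category) tuples; the category names are illustrative:

```python
# Penalty weights matching the Semantic WER categories above.
PENALTIES = {"no_penalty": 0.0, "minor": 0.5, "major": 1.0}


def semantic_wer(classified_diffs, ref_word_count: int) -> float:
    """Apply Semantic WER penalties to LLM-classified differences.

    classified_diffs: list of (ref_word, hyp_word, category) tuples,
    where category is one of "no_penalty", "minor", or "major"
    as returned by the LLM judge.
    """
    total = sum(PENALTIES[cat] for _, _, cat in classified_diffs)
    return total / ref_word_count
```

For example, a contraction (no penalty), a one-character misspelling (0.5), and a wrong word (1.0) over a ten-word reference yield a Semantic WER of 0.15, where traditional WER would report 0.3.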

LASER score (LLM-based ASR Evaluation Rubric)

LASER is a published LLM-based evaluation metric (Parulekar & Jyothi, EMNLP 2025) that uses an LLM prompt with detailed examples to classify ASR errors and compute a score:

\text{LASER} = 1 - \frac{\text{Total Penalty}}{\text{Reference Word Count}}

The LLM aligns each word in the ASR output against the reference transcription and assigns a penalty per word pair:

  • No penalty (0): Acceptable variations including numerical format differences, abbreviations, compound word splits, transliterations, alternate spellings, proper noun variants, and colloquial terms
  • Minor penalty (0.5): Small spelling errors (single character) or minor grammatical errors (gender, tense, number markers) that preserve sentence meaning
  • Major penalty (1.0): Incorrect word substitutions, significant omissions or additions, and reordering that changes meaning

Unlike Semantic WER which outputs a single number, LASER provides structured per-error feedback alongside the score. This makes it useful for prompt optimization workflows where you need to understand why a prompt performed poorly, not just how much error there was. For an implementation of LASER scoring, see aai-cli.
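A sketch of how the score and the structured feedback fit together, assuming the LLM judge has already emitted per-pair penalties; the dict field names here are hypothetical, not a fixed schema:

```python
def laser_report(judged_pairs: list[dict], ref_word_count: int) -> dict:
    """Compute a LASER score plus structured per-error feedback.

    judged_pairs: dicts like
        {"ref": "metformin", "hyp": "metforman",
         "penalty": 0.5, "reason": "single-character spelling error"}
    as produced by the LLM judge (the judging prompt is out of scope).
    """
    total = sum(p["penalty"] for p in judged_pairs)
    errors = [p for p in judged_pairs if p["penalty"] > 0]
    return {
        "score": 1 - total / ref_word_count,
        "errors": errors,  # structured feedback for prompt iteration
    }
```

Keeping the non-zero-penalty pairs alongside the score is what makes the metric usable in an optimization loop: the reasons can be fed back to an LLM that proposes the next prompt.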

Why new metrics matter

Traditional WER treats every difference between the model output and human transcription as an error. Universal-3-Pro’s contextual awareness means it will often transcribe words that human transcribers missed entirely. In traditional WER, these show up as insertions (penalized errors), even though the model is correct. This makes WER unreliable when used alone.

WER is only as good as your ground truth labels. Human transcriptions contain systematic errors — missed filler words, incorrect proper nouns, simplified speech patterns, and translated code-switching. When Universal-3-Pro transcribes audio more accurately than the human label, those improvements show up as WER errors.

Before reporting WER, manually audit at least 20 insertions to determine what percentage are true errors versus ground truth omissions. In our testing, the majority of insertions were cases where Universal-3-Pro correctly transcribed audio that the human transcriber missed.

This is why Artificial Analysis, an independent AI benchmarking organization, had to create proprietary evaluation datasets with manually corrected ground truths when building their Speech-to-Text leaderboard. Existing public datasets contain systematic human transcription errors that penalize models which are actually more accurate.

The evaluation process

This section provides a step-by-step guide on how to run an evaluation. The evaluation process should closely match your production environment, including the files you intend to transcribe, the model you intend to use, and the settings applied to those models.

Step 1: Prepare your evaluation dataset

Ensure that the files you use to benchmark are representative of the files you plan to use in production. For example, if you plan to transcribe meetings, gather a set of meeting recordings. If you plan to transcribe phone calls, focus on finding phone calls that match your customer base’s language and region.

We recommend using at least 25 files that are representative of your use case. Length is less important than diversity of audio conditions — a good evaluation set covers the range of speakers, accents, audio quality, and vocabulary your model will encounter in production.

Then, gather human-labeled data to act as your source of ground truth. Ground truth is accurately transcribed audio data that will serve as the “correct answer” for your benchmark. Human-labeled data can be purchased from an external vendor or created manually.

Ground truth quality

The quality of your ground truth data directly affects the reliability of your evaluation. With Universal-3-Pro, this is more important than ever because the model frequently outperforms human transcribers.

Common issues with ground truth data:

  • Missing filler words: Human transcribers often omit um, uh, like, and other disfluencies
  • Incorrect proper nouns: Rare names, technical terms, and domain vocabulary are often misspelled
  • Simplified speech patterns: Human transcribers tend to “clean up” speech, missing repetitions, false starts, and self-corrections
  • Code-switching errors: Multilingual segments are frequently translated to English rather than transcribed as spoken

Before running evaluations, audit a sample of your ground truth files by listening to the audio and comparing. If your ground truth contains systematic errors, your WER numbers will be misleading.

Dataset diversity

A prompt that performs well overall may underperform on specific audio types. Include a diverse mix of audio in your evaluation set and track per-dataset breakdowns:

| Audio type | Characteristics | Typical WER range |
| --- | --- | --- |
| Earnings calls | Clean English, formal vocabulary | Low |
| Meeting recordings | Multi-speaker, informal | Moderate |
| Code-switching audio | Mixed languages (e.g., English/Spanish) | Higher (normalization affects scoring) |
| Medical consultations | Clinical vocabulary, accented speech | Moderate |
| Phone calls | Compression artifacts, background noise | Moderate to high |

Step 2: Establish a baseline

Before optimizing prompts, measure your baseline performance:

  1. No prompt: Transcribe your evaluation set with Universal-3-Pro and no custom prompt. This gives you the model’s out-of-the-box performance.
  2. Default system prompt: Transcribe with the current system prompt (see the prompting guide) to understand the default behavior.

Record both WER and Semantic WER for each baseline so you can track improvements.

Step 3: Transcribe and evaluate with prompts

Transcribe your files using AssemblyAI’s API with Universal-3-Pro and your candidate prompts.

When crafting evaluation prompts, use the prompting guide as a reference. Key principles:

  • Use authoritative language: The model responds better to Mandatory:, Required:, and Always: than soft language like try to or please
  • Be specific about speech patterns: Enumerate what you want preserved (disfluencies, filler words, hesitations, repetitions, stutters, false starts, colloquialisms)
  • Give instructions, not just context: This is a doctor-patient visit. Prioritize accurately transcribing medications and diseases wherever possible. is far more effective than This is a doctor-patient visit.
  • Use 3-6 instructions: Fewer than 3 leaves important categories uncovered; more than 7 causes diminishing returns

Step 4: Text normalization

Before calculating WER metrics, both reference (ground truth) and hypothesis (model generated) texts need to be normalized to ensure a fair comparison.

This accounts for differences in:

  • Punctuation and capitalization
  • Number formatting (e.g., “twenty-one” vs. “21”)
  • Contractions and abbreviations
  • Other stylistic variations that don’t affect meaning

Normalization can be done with a library like Whisper Normalizer.

If you are prompting Universal-3-Pro to include [unclear] or [masked] tags for uncertain audio, ensure your normalizer strips these tags before computing WER. Otherwise, they will be counted as insertions.
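A minimal normalizer sketch that strips the tags before removing punctuation (otherwise the brackets would be stripped and a spurious word like unclear would survive into the WER computation); a full normalizer such as Whisper Normalizer also handles number formats, contractions, and abbreviations:

```python
import re
import string


def normalize(text: str) -> str:
    """Minimal normalization sketch for WER preprocessing.

    1. Remove [unclear]/[masked] tags so they aren't counted
       as insertions.
    2. Lowercase, strip punctuation, collapse whitespace.
    """
    text = re.sub(r"\[(?:unclear|masked)\]", " ", text)
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())
```

Apply the same function to both the reference and the hypothesis before scoring; normalizing only one side biases the comparison.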

Step 5: Compare and calculate

Calculate the error rates using the formulas above or consider using a library like jiwer. For Semantic WER or LASER scoring, use an LLM-based evaluator (see Open-source tools below).

When reviewing results:

  • Check per-dataset breakdowns, not just aggregate WER
  • Audit insertions manually by listening to the audio
  • Compare both traditional WER and Semantic WER to get a full picture
  • Track which prompt components improve which audio types

Qualitative analysis

Quantitative metrics don’t capture everything. Qualitative analysis helps you identify differences between STT providers that metrics might miss — for example, how certain key terms are transcribed can make or break a transcript, even if the rest of the transcript has a lower overall error rate.

Qualitative analysis is also useful for tie-breaking when benchmarking metrics don’t clearly favor one model over another. Since you’re comparing models against each other, ground truth files aren’t required.

Side-by-side comparison: Have users compare and pick their preferred transcript between two formatted outputs from different STT providers. Tools like Diffchecker or any side-by-side interface work well for this.

LLM as judge: An LLM can automatically identify differences between two transcriptions and pick a winner. However, be cautious: an LLM judge can be misled by outputs that look correct but contain subtle errors (such as translated code-switching segments that read well in English but don’t reflect what was actually spoken). Always pair LLM-based judgments with spot-checking against the actual audio.

A/B testing in production: Serve transcripts from different providers to users and collect feedback. You can ask users to score transcripts directly, or track indirect signals like the number of support ticket complaints about transcription quality.

Iterating on prompts

Finding the optimal prompt for your use case is an iterative process. There are two main approaches:

Manual iteration

  1. Start with the default system prompt or one of the reference prompts below
  2. Transcribe a representative sample of your audio
  3. Review the output against your ground truth, focusing on the types of errors that matter most for your use case
  4. Adjust the prompt to address specific error patterns
  5. Re-evaluate and compare

Automated optimization

For large-scale prompt optimization, consider using one of the open-source tools described below. These tools systematically test prompt component combinations and score them against your evaluation data, converging on the best prompt for your specific audio.

Reference prompts for evaluation

The following prompts are aligned with the prompting guide and represent tested configurations for evaluation workflows.

Evaluation prompt

This is the current system prompt and provides the best balance between verbatim accuracy, multilingual support, and handling challenging audio. Use this as your primary evaluation prompt:

Required: Preserve the original language(s) and script as spoken, including code-switching and mixed-language phrases.
Mandatory: Preserve linguistic speech patterns including disfluencies, filler words, hesitations, repetitions, stutters, false starts, and colloquialisms in the spoken language.
Always: Transcribe speech with your best guess based on context in all possible scenarios where speech is present in the audio.

Comparison prompt (for identifying model uncertainty)

This prompt replaces the “best guess” instruction with an [unclear] tag for uncertain audio. Run both prompts on the same audio and diff the outputs to find where the model is guessing:

Required: Preserve the original language(s) and script as spoken, including code-switching and mixed-language phrases.
Mandatory: Preserve linguistic speech patterns including disfluencies, filler words, hesitations, repetitions, stutters, false starts, and colloquialisms in the spoken language.
Always: Transcribe speech with your best guess when speech is heard, mark [unclear] when audio segments are unknown.

By comparing the two outputs, you can identify exactly which segments the model is least confident about. This is useful for:

  • Evaluating how the model handles unclear or noisy audio
  • Finding segments where the model’s guesses may be incorrect
  • Prioritizing which audio segments to manually review
  • Understanding whether WER differences are coming from genuine errors or uncertain segments
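One way to automate the diff is with Python’s difflib, sketched below. It assumes both transcripts tokenize cleanly on whitespace; on real transcripts the word-level alignment may split segments differently:

```python
import difflib


def uncertain_spans(best_guess: str, tagged: str) -> list[str]:
    """Find the words the model was guessing at.

    Diffs the best-guess transcript against the [unclear]-tagged one
    and returns the best-guess words that align with [unclear] tags.
    """
    a, b = best_guess.split(), tagged.split()
    spans = []
    matcher = difflib.SequenceMatcher(a=a, b=b)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" where the tagged side contains [unclear] marks
        # a region the model transcribed only under "best guess".
        if op == "replace" and "[unclear]" in b[j1:j2]:
            spans.append(" ".join(a[i1:i2]))
    return spans
```

The returned spans are natural candidates for manual review against the audio.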

What works and what doesn’t

The following guidance is aligned with the prompting guide’s best practices.

High-value prompt components

| Component | Description | When to use |
| --- | --- | --- |
| Authoritative language | Use Required:, Mandatory:, Always: prefixes | All prompts |
| Disfluency preservation | Enumerate specific patterns (filler words, hesitations, repetitions, stutters, false starts, colloquialisms) | Meeting, interview, and conversational audio |
| Language preservation | Preserve original languages as spoken, including code-switching | Multilingual audio datasets |
| Proper noun precision | Standard spelling for names, brands, medical terms, and entities | Technical, medical, or domain-specific content |
| Completeness / best guess | “Transcribe speech with your best guess based on context” | Audio with variable quality |
| Audio context | “Transcribe everything spoken despite background noise or audio quality issues” | Noisy or compressed audio |

What to avoid in evaluation prompts

| Anti-pattern | Impact |
| --- | --- |
| Domain labels without instructions (e.g., Context: Medical) | Helps on domain-specific data but hurts on mixed datasets |
| Review/correction instructions (e.g., Review and correct, Edit for accuracy) | Consistently hurts accuracy |
| Over-instruction (more than 7 rules) | Causes diminishing returns and confusion |
| Soft language (e.g., try to, please, if possible) | Model responds poorly compared to authoritative instructions |
| Negative language (e.g., Don’t, Avoid, Never, Not) | Model does not process negative instructions and gets confused |

Listing specific word examples in your prompt causes hallucinations. The model becomes over-eager to insert those exact words into the transcript, even when they weren’t spoken. For example, Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman) will cause the model to hallucinate those drug names. Instead, describe the pattern of entities to prioritize: Pharmaceutical accuracy required across all medications and drug names. See the prompting guide for more details.

Open-source tools

aai-cli

aai-cli is a command-line tool for evaluating and optimizing transcription prompts. It supports:

  • Prompt evaluation: Score a prompt against datasets from Hugging Face or your own audio files using WER and LASER metrics
  • Prompt optimization: Automatically iterate on prompts using DSPy GEPA with LASER feedback, where an LLM reflects on transcription errors and proposes improved prompts
  • Dataset discovery: Search and load audio datasets from Hugging Face for benchmarking
# Evaluate a prompt
aai eval --prompt "Transcribe verbatim." --max-samples 50

# Optimize a prompt
aai optimize --starting-prompt "Transcribe verbatim." --iterations 5 --samples 50

prompt-seeker

prompt-seeker uses Bayesian optimization (Optuna TPE) to systematically find the best transcription prompt by testing component combinations across diverse audio datasets and scoring with Semantic WER. It supports:

  • Component-based optimization: Modular prompt pieces (language, disfluency, punctuation, etc.) are tested in combinations
  • Meta-optimization: An LLM designs new component spaces between optimization rounds based on accumulated findings
  • Per-dataset analysis: Breakdown of what works for each audio type in your evaluation set
# Run optimization (50 trials across your data)
uv run python -m prompt_seeker.cli optimize \
  --datasets "my_calls:50" \
  --trials 50 -c 20

# Run the meta-optimizer (Claude designs rounds autonomously)
uv run python -m prompt_seeker.cli meta-optimize \
  --datasets "my_calls:50" \
  --rounds 3 --trials 50 -c 10

Both tools require ground truth transcriptions for scoring. If you don’t have ground truth yet, transcribe a sample of your audio manually and use that as your starting point.

Conclusion

Evaluating Universal-3-Pro requires going beyond traditional WER. The model’s contextual awareness and prompting capabilities mean that evaluation is as much about finding the right prompt as it is about measuring accuracy. Use Semantic WER or LASER alongside traditional WER, audit your ground truth data carefully, and iterate on prompts systematically to find the best configuration for your audio.