Evaluating Streaming STT models for Voice Agents
Introduction
The high-level objective of a streaming STT model evaluation is to answer the question: Which streaming Speech-to-text model is the best for my voice agent?
This guide will provide a step-by-step framework for evaluating and benchmarking streaming Speech-to-text models to help you select the best fit for your voice agent.
Need help evaluating our Speech-to-text products? Contact our Sales team to request an evaluation.
Common evaluation metrics
Time to First Token (TTFT) / Time to First Byte (TTFB)
This measures the time from when the audio stream begins (including model startup/initialization) to when the very first token is returned by the model. TTFT is critical for perceived initial responsiveness in real-time applications.
TTFT includes all initialization overhead: connection setup, model loading, and initial audio buffering. For voice agents and real-time transcription, sub-500ms TTFT is typically considered good, while sub-200ms is excellent.
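As a rough illustration, the sketch below times TTFT over a WebSocket streaming session. The endpoint URL, response format, and chunk pacing are placeholders rather than any specific provider’s API; adapt them to the client library you actually use.

```python
import asyncio
import json
import time

import websockets  # pip install websockets

# Hypothetical endpoint and response format, not any specific provider's API.
STT_WS_URL = "wss://stt.example.com/v1/stream"

def audio_chunks(path: str, chunk_ms: int = 100, sample_rate: int = 16000):
    """Yield raw 16-bit mono PCM in chunk_ms-sized chunks."""
    bytes_per_chunk = int(sample_rate * 2 * chunk_ms / 1000)
    with open(path, "rb") as f:
        while chunk := f.read(bytes_per_chunk):
            yield chunk

async def measure_ttft(audio_path: str) -> float:
    """Seconds from starting the stream (including connection setup) to the first token."""
    start = time.perf_counter()
    async with websockets.connect(STT_WS_URL) as ws:

        async def sender():
            for chunk in audio_chunks(audio_path):
                await ws.send(chunk)
                await asyncio.sleep(0.1)  # pace chunks like a live microphone

        send_task = asyncio.create_task(sender())
        try:
            async for message in ws:
                result = json.loads(message)
                if result.get("text"):  # hypothetical response field
                    return time.perf_counter() - start
        finally:
            send_task.cancel()
    raise RuntimeError("stream ended before any token was received")

print(f"TTFT: {asyncio.run(measure_ttft('utterance.raw')):.3f}s")
```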
Time to Complete Transcript (TTCT) / Transcription Delay
This measures the latency between when a user finishes speaking (end of speech detected) and when the complete transcription for that utterance is received from the STT model.
TTCT is measured on a per-utterance basis, and a single user turn may contain multiple utterances. This metric is crucial for minimizing overall voice agent latency, since it determines how soon you can send STT output downstream to the LLM.
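As a minimal sketch, assuming you log one wall-clock timestamp when your VAD (or ground-truth annotation) marks the end of speech and another when the final transcript for that utterance arrives, TTCT can be summarized per utterance like this (field names are illustrative):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class UtteranceTiming:
    # Wall-clock timestamps in seconds; field names are illustrative.
    speech_end: float        # when the user actually stopped speaking (VAD or labels)
    final_transcript: float  # when the final transcript for that utterance arrived

def ttct_summary(timings: list[UtteranceTiming]) -> dict[str, float]:
    """Per-utterance Time to Complete Transcript, summarized as mean and worst case.

    In practice you would collect hundreds of utterances and report p50/p95.
    """
    delays = [t.final_transcript - t.speech_end for t in timings]
    return {"mean_s": mean(delays), "max_s": max(delays)}

# Example: one user turn containing three utterances.
turn = [
    UtteranceTiming(speech_end=1.20, final_transcript=1.55),
    UtteranceTiming(speech_end=3.80, final_transcript=4.30),
    UtteranceTiming(speech_end=6.10, final_transcript=6.45),
]
print(ttct_summary(turn))  # roughly {'mean_s': 0.4, 'max_s': 0.5}
```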
End-of-Turn Finalization Latency / Endpointing Latency
This measures the time from when the user actually finishes speaking (end of speech detected) to when the system recognizes and signals the end of their conversational turn. This includes both speech detection latency and any additional processing to determine turn completion.
End-of-Turn Detection Accuracy
This measures how accurately the model detects when a user has finished speaking, considering true positives (correct detections), false positives (premature cutoffs), and false negatives (missed endpoints).
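One way to score this, as a sketch: collect hand-labeled end-of-turn times, collect the end-of-turn events the model signaled, and match them greedily within a tolerance window. The tolerance value and matching rule below are assumptions to tune for your application.

```python
def endpoint_detection_scores(
    true_ends: list[float],       # ground-truth end-of-turn times (seconds)
    predicted_ends: list[float],  # end-of-turn events signaled by the model
    tolerance: float = 0.5,       # match window in seconds (an assumption; tune it)
) -> dict[str, float]:
    """Precision/recall for end-of-turn detection with greedy matching."""
    unmatched_true = sorted(true_ends)
    tp = 0
    for pred in sorted(predicted_ends):
        match = next((t for t in unmatched_true if abs(t - pred) <= tolerance), None)
        if match is not None:
            unmatched_true.remove(match)
            tp += 1
    fp = len(predicted_ends) - tp   # premature / spurious cutoffs
    fn = len(true_ends) - tp        # missed endpoints
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall, "false_cutoffs": fp, "missed": fn}

print(endpoint_detection_scores([2.1, 5.8, 9.4], [2.3, 4.0, 9.5]))
# -> precision and recall of 0.67: one false cutoff (4.0), one missed endpoint (5.8)
```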
Word Error Rate (WER)
WER is calculated as (S + D + I) / N: the number of Substitutions (S), Deletions (D), and Insertions (I), divided by the total number of words in the ground-truth transcript (N).
For streaming models, it’s important to measure both partial WER (accuracy of interim results) and final WER (accuracy after all corrections); the delta between the two indicates the model’s self-correction capability. While the WER calculation may seem simple, it requires a methodical, granular approach and reliable reference data.
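The core calculation is a word-level edit-distance alignment. Open-source libraries such as jiwer handle text normalization and alignment for you, but a minimal sketch looks like this (in practice, normalize casing, punctuation, and numerals on both sides before scoring):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

reference = "set a timer for five minutes"
partial = "set a time for five"          # interim result
final = "set a timer for five minutes"   # final result after corrections
print(word_error_rate(reference, partial))  # 2/6 ≈ 0.33 partial WER
print(word_error_rate(reference, final))    # 0.0 final WER
```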
The evaluation process
This section provides a step-by-step guide on how to run an evaluation. The results are only meaningful if they predict how a model will behave in production.
For that reason, the evaluation process should closely match your production environment, including the streaming conditions you intend to transcribe, the models you intend to use, and the settings applied to those models.
Step 1: Set up your benchmarking environment
When benchmarking a voice agent, you can benchmark audio directly against the provider’s API, set up a live testing environment, or both.
Benchmarks with real files against the API are best for measuring overall accuracy and latency metrics, like WER and TTFB. These will give you a good idea of the model’s performance. See our section on Pre-recorded evaluation benchmarks for how to do this part.
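As a sketch of how the pieces fit together, the harness below loops over a test set and a few providers, collecting WER and TTFT. The provider functions are hypothetical stubs standing in for your own client wrappers, and word_error_rate refers to the helper sketched in the WER section above.

```python
import statistics

# Placeholder clients: in practice each wraps a provider's real streaming API and
# returns (final_transcript, ttft_seconds) for one audio file. The stubs below
# exist only so the harness runs end to end.
def provider_a(audio_path: str) -> tuple[str, float]:
    return "i will take two burgers and a large coke", 0.18

def provider_b(audio_path: str) -> tuple[str, float]:
    return "i will take two burgers and large coke", 0.31

PROVIDERS = {"provider_a": provider_a, "provider_b": provider_b}

# (audio file, ground-truth transcript) pairs from your test set.
TEST_SET = [("drive_thru_01.wav", "i will take two burgers and a large coke")]

for name, transcribe in PROVIDERS.items():
    wers, ttfts = [], []
    for path, reference in TEST_SET:
        hypothesis, ttft = transcribe(path)
        wers.append(word_error_rate(reference, hypothesis))  # helper from the WER sketch above
        ttfts.append(ttft)
    print(f"{name}: mean WER = {statistics.mean(wers):.3f}, mean TTFT = {statistics.mean(ttfts):.3f}s")
```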
Voice agents are complex, and other metrics like TTCT and End-of-Turn Finalization Latency often depend on additional factors like the end user’s audio device and environment. We highly recommend running live, side-by-side evals with your voice agent hooked up to multiple streaming STT providers, so you can feel for yourself what your users will experience.
Step 2: Run your test scenarios
If you already have test scenarios, you can run them through the API and capture the metrics above.
If you don’t have scenarios, create them based on expected customer behaviors and measure side-by-side results across different providers. For example, if you are building a drive-through ordering system, create simulated test scenarios that cover different user orders, pacing, tonality, background noise, accents, and so on.
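One lightweight way to make these scenarios concrete is to describe each one as structured data and then record or synthesize audio to match. Everything below (field names and values) is illustrative for the drive-through example:

```python
from dataclasses import dataclass

@dataclass
class TestScenario:
    # All fields are illustrative; extend with whatever varies for your users.
    name: str
    script: str              # what the simulated customer says
    accent: str
    speaking_rate: str       # e.g. "slow", "normal", "fast"
    background_noise: str    # e.g. "quiet", "engine idle", "busy kitchen"

SCENARIOS = [
    TestScenario("simple_order", "two cheeseburgers and a small fries",
                 accent="US Midwest", speaking_rate="normal", background_noise="engine idle"),
    TestScenario("hesitant_order", "uh can I get... actually make that a large coke",
                 accent="Indian English", speaking_rate="slow", background_noise="busy kitchen"),
    TestScenario("rapid_modification", "no onions no pickles and add extra cheese",
                 accent="US Southern", speaking_rate="fast", background_noise="rain on window"),
]

for s in SCENARIOS:
    print(f"{s.name}: '{s.script}' ({s.accent}, {s.speaking_rate}, {s.background_noise})")
```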
If you are unsure how to proceed here, it might be worth checking out BlueJay, Coval, or Hamming, all of which help with evaluating and measuring the performance of voice agents.
Step 3: Compare the results
It is highly unlikely you will find a single streaming STT model that wins on every metric outlined above. Ultimately, your goal should be to determine which of these metrics best helps your agent drive the right end-user outcome for your use case.
To guide that decision, you might consider the following (a simple weighted-scoring sketch follows the list):
- Are you replacing humans with your voice agent? TTCT and Endpointing Latency likely matter most, since these metrics best reflect natural human turn-taking.
- Are you working with domain-specific vocabulary, such as medical terminology? While WER is important, what matters most is that the LLM in your voice agent understands the user. This requires simulating full test scenarios rather than relying on metrics alone.
- Are you showing transcript text to your end users (like subtitles)? Then WER is probably most important to end-user perception of quality and accuracy.
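Once you have decided which metrics matter for your use case, a simple weighted score can make the comparison explicit. The weights and numbers below are purely illustrative; choosing them is a product decision, not a universal constant.

```python
# Per-provider benchmark results; the numbers are purely illustrative.
# Lower is better for all three metrics.
results = {
    "provider_a": {"wer": 0.08, "ttct_s": 0.40, "endpoint_latency_s": 0.30},
    "provider_b": {"wer": 0.11, "ttct_s": 0.25, "endpoint_latency_s": 0.20},
}

# Example weighting for an agent that replaces a human on the phone, where
# turn-taking latency matters more than raw WER. Pick weights for your use case.
weights = {"wer": 0.3, "ttct_s": 0.4, "endpoint_latency_s": 0.3}

# Normalize each metric against the best provider so units don't dominate.
best = {k: min(r[k] for r in results.values()) for k in weights}

def weighted_score(metrics: dict[str, float]) -> float:
    """Weighted sum of normalized metrics; lower is better."""
    return sum(weights[k] * metrics[k] / best[k] for k in weights)

for name, metrics in sorted(results.items(), key=lambda kv: weighted_score(kv[1])):
    print(f"{name}: score = {weighted_score(metrics):.2f}")
```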
Vibes vs metrics
While metrics provide a useful quantitative evaluation of a streaming Speech-to-text model, voice agents are complex and are not made up solely of an STT model. For this reason, we also recommend doing a “vibe-eval”.
Why do a vibe-eval?
Vibe-evals are useful for determining the qualitative differences between STT providers that affect the end outcome of your voice agent. For example, errors in the transcript may or may not trip up a voice agent, since the LLM can correct some of them.
Vibe-evals are also good for breaking ties when the benchmarking metrics don’t clearly favor one model over another.
Another benefit of vibe-evals is that ground-truth files don’t have to be sourced, since the Speech-to-text models are compared directly against each other in a real voice agent.
How to do a vibe-eval?
To do a vibe-eval, set up your voice agent with different STT providers. This can be done quickly with integrations like LiveKit, where you can swap out providers.
Another option is to do A/B testing with your current voice agent in production and ask users to give the agent a score. We’ve also seen customers compare the number of support-ticket complaints based on which voice agent variant was served to the user.
Vibe-evals are a great way to see how our models perform in a production setting while also letting your users determine their preferred streaming Speech-to-text provider.
Conclusion
We hope that this short guide was helpful in shaping your evaluation methodology.
Have more questions about evaluating our Speech-to-text models? Contact our sales team and we can help.