Build & Learn
July 14, 2025

How to build the lowest latency voice agent in Vapi: Achieving ~465ms end-to-end latency

In this comprehensive guide, we'll show you how to build a voice agent in Vapi that achieves an impressive ~465ms end-to-end latency—fast enough to feel truly conversational.

Daniel Ince
Applied API Engineer

Voice AI applications are revolutionizing how we interact with technology, but latency remains the biggest barrier to creating truly conversational experiences. When users have to wait seconds for a response, the magic of natural conversation is lost. 

Understanding the latency challenge

Before diving into the configuration, it's crucial to understand that voice agent latency comes from multiple components in the pipeline:

  • Speech-to-Text (STT): Converting audio to text
  • Large Language Model (LLM): Processing and generating responses
  • Text-to-Speech (TTS): Converting text back to audio
  • Turn Detection: Determining when the user has finished speaking
  • Network Overhead: Data transmission delays

The key to ultra-low latency is optimizing each component and minimizing unnecessary delays.

The optimal configuration stack

Our target configuration achieves the following breakdown:

  • STT: 90ms (AssemblyAI Universal-Streaming)
  • LLM: 200ms (Groq Llama 4 Maverick 17B)
  • TTS: 75ms (Eleven Labs Flash v2.5)
  • Pipeline Total: 365ms
  • Network Overhead: 100ms (Web) / 600ms+ (Telephony)
  • Final Latency: ~465ms (Web) / ~965ms+ (Telephony)
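As a quick sanity check, the budget above is just addition; a few lines confirm the totals (component figures come straight from the breakdown, and they are targets rather than guarantees):

```python
# Target latency budget for the voice pipeline, in milliseconds
# (component figures from the breakdown above)
components = {
    "stt": 90,   # AssemblyAI Universal-Streaming
    "llm": 200,  # Groq Llama 4 Maverick 17B
    "tts": 75,   # Eleven Labs Flash v2.5
}

pipeline_total = sum(components.values())  # 365 ms
web_total = pipeline_total + 100           # WebRTC network overhead
telephony_total = pipeline_total + 600     # telephony network overhead (lower bound)

print(f"Pipeline: {pipeline_total}ms | Web: ~{web_total}ms | Telephony: ~{telephony_total}ms+")
```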

Step 1: Configure speech-to-text with AssemblyAI

AssemblyAI's Universal-Streaming API is currently one of the fastest STT options available, delivering transcripts in just 90ms.

Key Configuration Settings:
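In API terms, this corresponds to the assistant's transcriber block, roughly as follows (a sketch: the provider slug and the formatTurns field follow Vapi's API conventions, but verify the exact names against the current API reference):

```json
{
  "transcriber": {
    "provider": "assembly-ai",
    "formatTurns": false
  }
}
```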

Critical optimization: Disable formatting

This is perhaps the most important STT optimization that many developers overlook. By setting Format Turns: false, you eliminate unnecessary processing time that adds latency. Modern LLMs are perfectly capable of understanding unformatted transcripts, and this single change can save precious milliseconds in your pipeline.

Why this matters: Formatting processes like punctuation insertion, capitalization, and number formatting require additional computation. When every millisecond counts, these "nice-to-have" features become latency bottlenecks.

Step 2: Choose the right LLM - Groq's Llama 4 Maverick 17B

The LLM is typically the highest latency component in your voice pipeline, making model selection critical. Groq's Llama 4 Maverick 17B 128e Instruct offers the perfect balance of speed and capability.

Configuration:
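A sketch of the corresponding model block (the Groq model ID shown here is an assumption based on Groq's published naming; confirm it, and the field names, against the Vapi and Groq references):

```json
{
  "model": {
    "provider": "groq",
    "model": "meta-llama/llama-4-maverick-17b-128e-instruct",
    "maxTokens": 150
  }
}
```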

Why Groq + Llama 4 Maverick?

  • Optimized Model: Llama 4 Maverick offers a best-in-class performance-to-cost ratio 
  • Consistent Performance: 200ms processing time with minimal variance
  • Open Source: Cost-effective compared to proprietary alternatives

Pro Tip: Keep your maxTokens relatively low (150-200) for voice applications. Users expect concise responses in conversation, and shorter responses generate faster.

Step 3: Implement lightning-fast TTS with Eleven Labs Flash v2.5

Eleven Labs Flash v2.5 is engineered specifically for low-latency applications, achieving an impressive 75ms time-to-first-byte.

Configuration:
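A sketch of the corresponding voice block (field names follow Vapi conventions and should be checked against the API reference; YOUR_VOICE_ID is a placeholder for whichever voice you select):

```json
{
  "voice": {
    "provider": "11labs",
    "model": "eleven_flash_v2_5",
    "voiceId": "YOUR_VOICE_ID",
    "optimizeStreamingLatency": 4,
    "style": 0
  }
}
```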

Key Settings Explained:

  • Optimize Streaming Latency: Set to 4 for maximum speed priority
  • Voice Selection: Choose simpler voices for faster processing
  • No Style Exaggeration: Higher values may increase latency slightly

Step 4: Optimize turn detection settings

This is where many developers unknowingly sabotage their latency optimization. Vapi's default turn detection settings include wait times that can add 1.5+ seconds to your response time—completely negating all your other optimizations.

Critical configuration in advanced settings:

Before:
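Roughly, the default startSpeakingPlan looks like this (a sketch using the default values discussed in this section; field names follow Vapi's API conventions and should be verified against the reference):

```json
{
  "startSpeakingPlan": {
    "waitSeconds": 0.4,
    "transcriptionEndpointingPlan": {
      "onPunctuationSeconds": 0.1,
      "onNoPunctuationSeconds": 1.5,
      "onNumberSeconds": 0.5
    }
  }
}
```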

After: 
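A minimized startSpeakingPlan might look like this (illustrative values, not Vapi-prescribed settings; very low timers risk interrupting users mid-sentence, so tune them against real conversations):

```json
{
  "startSpeakingPlan": {
    "waitSeconds": 0,
    "transcriptionEndpointingPlan": {
      "onPunctuationSeconds": 0,
      "onNoPunctuationSeconds": 0,
      "onNumberSeconds": 0
    }
  }
}
```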

Why this matters as much as model choice:

The default settings often include:

  • Wait Seconds: 0.4s (unnecessary delay)
  • On Punctuation Seconds: 0.1s (unnecessary delay)
  • On No Punctuation Seconds: 1.5s (waiting when no punctuation detected)
  • On Number Seconds: 0.5s (unnecessary delay)

Since our STT has formatting disabled and therefore emits no punctuation, the system would fall back to the 1.5s "no punctuation" delay—adding 1500ms to a pipeline that we've optimized to 365ms, more than four times the entire pipeline. This single setting can make or break your latency goals.

Network considerations and deployment

Web vs. telephony latency:

  • Web (WebRTC): ~100ms network overhead
  • Telephony (Twilio/Vonage): 600ms+ network overhead

Deployment tips:

  1. Choose regions wisely: Deploy close to your users
  2. Consider CDN: For global applications, use edge locations
  3. Monitor performance: Set up latency monitoring and alerts
  4. Test thoroughly: Network conditions vary significantly

Testing and monitoring your configuration

Key metrics to track:

  • End-to-end latency: Time from user stops speaking to agent starts responding
  • Component breakdown: Individual STT, LLM, TTS timings
  • Network overhead: Measure actual vs. expected network delays
  • User experience: Conduct user testing for perceived responsiveness
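For the end-to-end metric in particular, tail percentiles tell you more than averages, since occasional slow turns are what users notice. A generic monitoring sketch (not tied to any Vapi API; the sample values are made up):

```python
import statistics

def summarize_latency(samples_ms: list[float]) -> dict[str, float]:
    """Summarize end-to-end latency samples (in ms) with mean and tail percentiles."""
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points: qs[49] ~ p50, qs[94] ~ p95
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": qs[49],
        "p95": qs[94],
        "max": max(samples_ms),
    }

# Example: ten measured turn latencies in ms
samples = [430, 455, 460, 465, 470, 480, 490, 510, 620, 900]
print(summarize_latency(samples))
```

A pipeline that averages 465ms but spikes to 900ms on some turns will still feel sluggish, which is why the tail matters.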

Common pitfalls and troubleshooting

1. Forgetting turn detection settings

Problem: Great model configuration, but 1.5s delays remain 

Solution: Always check and optimize startSpeakingPlan settings

2. Over-engineering prompts

Problem: Long system prompts increase LLM processing time 

Solution: Keep prompts concise and specific

3. Ignoring network conditions

Problem: Perfect configuration, but poor real-world performance 

Solution: Test in various network conditions and locations

4. Choosing quality over speed

Problem: Using high-quality but slower models 

Solution: For voice, prioritize speed; users value responsiveness over perfection

Conclusion

Building a voice agent with ~465ms end-to-end latency is achievable with the right configuration and attention to detail. The key insights are:

  1. Every component matters: Optimize STT, LLM, and TTS individually
  2. Turn detection is critical: Default settings can destroy your latency goals
  3. Disable unnecessary features: Formatting and other "nice-to-haves" add latency
  4. Test in realistic conditions: Network overhead varies significantly by deployment

By following this configuration and understanding the principles behind each optimization, you'll create voice agents that feel truly conversational. Remember, in voice AI, perceived speed often matters more than absolute accuracy—users will forgive minor imperfections but won't tolerate slow responses.

The future of voice AI lies in these ultra-responsive interactions. With this guide, you're now equipped to build voice agents that meet users' expectations for natural, real-time conversation.

Have you implemented this configuration? We'd love to hear about your results and any additional optimizations you've discovered. The voice AI community thrives on shared knowledge and continuous improvement.
