How to build the lowest-latency voice agent in Vapi: achieving ~465ms end-to-end latency
Voice AI applications are revolutionizing how we interact with technology, but latency remains the biggest barrier to creating truly conversational experiences. When users have to wait seconds for a response, the magic of natural conversation is lost.
In this comprehensive guide, we'll show you how to build a voice agent in Vapi that achieves an impressive ~465ms end-to-end latency—fast enough to feel truly conversational.

Understanding the latency challenge
Before diving into the configuration, it's crucial to understand that voice agent latency comes from multiple components in the pipeline:
- Speech-to-Text (STT): Converting audio to text
- Large Language Model (LLM): Processing and generating responses
- Text-to-Speech (TTS): Converting text back to audio
- Turn Detection: Determining when the user has finished speaking
- Network Overhead: Data transmission delays
The key to ultra-low latency is optimizing each component and minimizing unnecessary delays.
The optimal configuration stack
Our target configuration achieves the following breakdown:
- STT: 90ms (AssemblyAI Universal-Streaming)
- LLM: 200ms (Groq Llama 4 Maverick 17B)
- TTS: 75ms (ElevenLabs Flash v2.5)
- Pipeline Total: 365ms
- Network Overhead: 100ms (Web) / 600ms+ (Telephony)
- Final Latency: ~465ms (Web) / ~965ms+ (Telephony)
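As a quick sanity check, here's the arithmetic behind that breakdown as a minimal sketch, using the vendor-quoted component figures above (real-world timings will vary):

```typescript
// Latency budget using the component figures quoted above.
const sttMs = 90;  // AssemblyAI Universal-Streaming
const llmMs = 200; // Groq Llama 4 Maverick 17B
const ttsMs = 75;  // ElevenLabs Flash v2.5 (time to first byte)

const pipelineMs = sttMs + llmMs + ttsMs; // 365ms
const webMs = pipelineMs + 100;           // ~465ms over WebRTC
const telephonyMs = pipelineMs + 600;     // ~965ms+ over PSTN

console.log({ pipelineMs, webMs, telephonyMs });
```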
Step 1: Configure speech-to-text with AssemblyAI
AssemblyAI's Universal-Streaming API is currently one of the fastest STT options available, delivering transcripts in just 90ms.
Key Configuration Settings:
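As a rough sketch, the transcriber section of your assistant config might look like the following. Field names follow Vapi's camelCase JSON conventions but are illustrative; verify them against the current API reference.

```typescript
// Sketch of the transcriber section of a Vapi assistant config.
// Field names are illustrative; check Vapi's API reference.
const transcriber = {
  provider: "assembly-ai",
  // The critical optimization: skip punctuation, casing, and
  // number formatting so transcripts stream back faster.
  formatTurns: false,
};
```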

Critical optimization: Disable formatting
This is perhaps the most important STT optimization that many developers overlook. By setting Format Turns to false, you eliminate unnecessary processing time that adds latency. Modern LLMs are perfectly capable of understanding unformatted transcripts, and this single change can save precious milliseconds in your pipeline.
Why this matters: Formatting processes like punctuation insertion, capitalization, and number formatting require additional computation. When every millisecond counts, these "nice-to-have" features become latency bottlenecks.
Step 2: Choose the right LLM - Groq's Llama 4 Maverick 17B
The LLM is typically the highest latency component in your voice pipeline, making model selection critical. Groq's Llama 4 Maverick 17B 128e Instruct offers the perfect balance of speed and capability.
Configuration:
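Here's a sketch of the model section. The model ID follows Groq's published naming, the system prompt is a placeholder, and the temperature value is illustrative:

```typescript
// Sketch of the model section of a Vapi assistant config (illustrative).
const model = {
  provider: "groq",
  model: "meta-llama/llama-4-maverick-17b-128e-instruct",
  temperature: 0.7, // illustrative; tune to taste
  maxTokens: 150,   // keep responses short (see the pro tip below)
  messages: [
    // Placeholder system prompt; keep it concise (see the pitfalls section).
    { role: "system", content: "You are a concise, helpful voice assistant." },
  ],
};
```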

Why Groq + Llama 4 Maverick?
- Optimized Model: Llama 4 Maverick offers a best-in-class performance-to-cost ratio
- Consistent Performance: 200ms processing time with minimal variance
- Open Source: Cost-effective compared to proprietary alternatives
Pro Tip: Keep your maxTokens relatively low (150-200) for voice applications. Users expect concise responses in conversation, and shorter responses generate faster.
Step 3: Implement lightning-fast TTS with ElevenLabs Flash v2.5
ElevenLabs Flash v2.5 is engineered specifically for low-latency applications, achieving an impressive 75ms time-to-first-byte.
Configuration:
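And a sketch of the voice section, wiring up the settings explained below. The voiceId is a placeholder, and "11labs" is Vapi's usual identifier for ElevenLabs; confirm both against the API reference.

```typescript
// Sketch of the voice section of a Vapi assistant config (illustrative).
const voice = {
  provider: "11labs",          // Vapi's identifier for ElevenLabs
  model: "eleven_flash_v2_5",  // Flash v2.5, ~75ms time-to-first-byte
  voiceId: "YOUR_VOICE_ID",    // placeholder; simpler voices process faster
  optimizeStreamingLatency: 4, // maximum speed priority
  style: 0,                    // no style exaggeration
};
```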


Key Settings Explained:
- Optimize Streaming Latency: Set to 4 for maximum speed priority
- Voice Selection: Choose simpler voices for faster processing
- Style Exaggeration: Keep at 0; higher values may slightly increase latency
Step 4: Optimize turn detection settings
This is where many developers unknowingly sabotage their latency optimization. Vapi's default turn detection settings include wait times that can add 1.5+ seconds to your response time—completely negating all your other optimizations.
Critical configuration in advanced settings:
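The sketches below show what this looks like in the startSpeakingPlan block of an assistant config. The field names and default values mirror the list further down; verify the exact schema against Vapi's current API reference.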
Before:
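```typescript
// Default turn detection (field names per Vapi's startSpeakingPlan schema;
// values match the defaults listed below).
const startSpeakingPlan = {
  waitSeconds: 0.4,
  transcriptionEndpointingPlan: {
    onPunctuationSeconds: 0.1,
    onNoPunctuationSeconds: 1.5, // the killer with formatting disabled
    onNumberSeconds: 0.5,
  },
};
```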

After:
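```typescript
// Optimized turn detection: cut every wait to zero. These values are
// aggressive; tune upward if the agent starts interrupting users.
const startSpeakingPlan = {
  waitSeconds: 0,
  transcriptionEndpointingPlan: {
    onPunctuationSeconds: 0,
    onNoPunctuationSeconds: 0,
    onNumberSeconds: 0,
  },
};
```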

Why this matters as much as model choice:
The default settings often include:
- Wait Seconds: 0.4s (unnecessary delay)
- On Punctuation Seconds: 0.1s (unnecessary delay)
- On No Punctuation Seconds: 1.5s (waiting when no punctuation detected)
- On Number Seconds: 0.5s (unnecessary delay)
Since our STT has formatting disabled, transcripts arrive without punctuation, so the system falls back to the 1.5s "no punctuation" delay, adding 1,500ms, roughly four times our entire 365ms pipeline budget. This single setting can make or break your latency goals.
Network considerations and deployment
Web vs. telephony latency:
- Web (WebRTC): ~100ms network overhead
- Telephony (Twilio/Vonage): 600ms+ network overhead
Deployment tips:
- Choose regions wisely: Deploy close to your users
- Consider CDN: For global applications, use edge locations
- Monitor performance: Set up latency monitoring and alerts
- Test thoroughly: Network conditions vary significantly
Testing and monitoring your configuration
Key metrics to track:
- End-to-end latency: Time from user stops speaking to agent starts responding
- Component breakdown: Individual STT, LLM, TTS timings
- Network overhead: Measure actual vs. expected network delays
- User experience: Conduct user testing for perceived responsiveness
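One lightweight way to approximate the first metric in the browser is a probe like the sketch below. It assumes the Vapi Web SDK's event-emitter style, where transcript messages arrive via "message" events and a "speech-start" event fires when the assistant begins speaking; treat the message shape as an assumption and verify it against the SDK docs.

```typescript
import Vapi from "@vapi-ai/web";

// Rough client-side end-to-end latency probe (illustrative).
const vapi = new Vapi("YOUR_PUBLIC_KEY"); // placeholder key

let userStoppedAt = 0;

vapi.on("message", (msg: any) => {
  // Treat the final user transcript as the "user stopped speaking" mark.
  // Message shape is an assumption; check the Web SDK docs.
  if (msg.type === "transcript" && msg.role === "user" && msg.transcriptType === "final") {
    userStoppedAt = performance.now();
  }
});

vapi.on("speech-start", () => {
  // Assistant audio has started; log the gap since the user finished.
  if (userStoppedAt > 0) {
    console.log(`end-to-end latency: ${Math.round(performance.now() - userStoppedAt)}ms`);
    userStoppedAt = 0;
  }
});

vapi.start("YOUR_ASSISTANT_ID"); // placeholder assistant ID
```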
Common pitfalls and troubleshooting
1. Forgetting turn detection settings
Problem: Great model configuration, but 1.5s delays remain
Solution: Always check and optimize startSpeakingPlan settings
2. Over-engineering prompts
Problem: Long system prompts increase LLM processing time
Solution: Keep prompts concise and specific
3. Ignoring network conditions
Problem: Perfect configuration, but poor real-world performance
Solution: Test in various network conditions and locations
4. Choosing quality over speed
Problem: Using high-quality but slower models
Solution: For voice, prioritize speed; users value responsiveness over perfection
Conclusion
Building a voice agent with ~465ms end-to-end latency is achievable with the right configuration and attention to detail. The key insights are:
- Every component matters: Optimize STT, LLM, and TTS individually
- Turn detection is critical: Default settings can destroy your latency goals
- Disable unnecessary features: Formatting and other "nice-to-haves" add latency
- Test in realistic conditions: Network overhead varies significantly by deployment
By following this configuration and understanding the principles behind each optimization, you'll create voice agents that feel truly conversational. Remember, in voice AI, perceived speed often matters more than absolute accuracy—users will forgive minor imperfections but won't tolerate slow responses.
The future of voice AI lies in these ultra-responsive interactions. With this guide, you're now equipped to build voice agents that meet users' expectations for natural, real-time conversation.
See also:
- Biggest challenges in building AI voice agents (and how AssemblyAI & Vapi are solving them)
- Build an AI Voice Agent with DeepSeek R1, AssemblyAI, and ElevenLabs
- 6 best orchestration tools to build AI voice agents in 2025
Have you implemented this configuration? We'd love to hear about your results and any additional optimizations you've discovered. The voice AI community thrives on shared knowledge and continuous improvement.