May 19, 2026

Building a voice agent: the full production timeline for both approaches

Building a voice agent has never been technically impossible. What made it hard was the coordination overhead — STT evaluation, LLM selection, TTS audition, integration debugging. We mapped both paths so you can choose with accurate information.

Devon Malloy

Staff Growth Manager


According to our Voice Agent Report, 87.5% of developers surveyed are actively building voice agents right now. That's not a future-tense number. These are people with tickets open, repos in progress, and deadlines looming.

Here's the question we don't ask often enough: how long did it take them to get to "working"?

Most builders can't answer that precisely—and the reason is interesting. The time doesn't feel like it was spent on the voice agent. It feels like time spent on everything around it. Three hours reading STT provider docs that don't quite address your latency requirements. A day and a half evaluating TTS voices before realizing you'd been testing the wrong pricing tier. An afternoon tracing a mysterious 400ms delay to the hop between your transcription output and your LLM input.

This is the invisible work. It doesn't show up in a changelog. It doesn't feel like building. But it is the build, or at least it's the gap between "I want to build a voice agent" and "I have a working voice agent."

This post maps both paths from start to working production system. Not to declare a winner—both paths get you there. But to make the invisible work visible so you can choose with accurate information.

Path 1: The full DIY route

If you're building your own stack, here's what actually has to happen. Not a glossy architecture diagram—a checklist.

1. Evaluate speech-to-text providers

At minimum, you'll benchmark three: Deepgram Nova, Whisper (self-hosted or OpenAI API), and AssemblyAI's streaming speech-to-text. Each has different latency profiles, word error rates across accents, and pricing models. The benchmarking itself takes time, but so does understanding what you're actually measuring. Real-time voice agents need streaming STT—not batch. That distinction shapes which models are even in scope, and not every provider's documentation makes this obvious upfront.
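Before comparing providers, it helps to pin down the metric. Here is a minimal sketch of word error rate, computed as word-level edit distance divided by reference length. The function name and the lowercase-and-split normalization are assumptions for illustration; a real benchmark would also normalize punctuation and numerals before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Run the same function over the same reference transcripts for every provider, on your own audio, so the numbers are comparable.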

2. Understand streaming-specific tradeoffs

Partial transcripts arrive mid-sentence. Your system needs to decide: when does a partial become final enough to pass to the LLM? Too eager and you're sending noise. Too conservative and you've added latency. This is turn detection at the transcription layer, and it's separate from turn detection at the conversation layer. You'll need a position on both.
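One simple position on "final enough" is transcript stability: treat a partial as ready once it has stopped changing for some window. The sketch below assumes that heuristic. `PartialStabilizer` and the 300 ms default are hypothetical, and many streaming providers also emit their own end-of-utterance signal that you may prefer to trust instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartialStabilizer:
    """Releases a partial transcript once it has been unchanged for stable_ms."""

    stable_ms: int = 300  # assumed threshold; tune against your own audio
    _text: str = ""
    _since_ms: Optional[int] = None

    def update(self, partial: str, now_ms: int) -> Optional[str]:
        """Feed the latest partial; returns the text once when it is stable."""
        if partial != self._text:
            # The transcript changed, so restart the stability timer.
            self._text = partial
            self._since_ms = now_ms
            return None
        if (
            self._text
            and self._since_ms is not None
            and now_ms - self._since_ms >= self.stable_ms
        ):
            self._since_ms = None  # emit only once per stable transcript
            return self._text
        return None
```

A shorter window sends text to the LLM sooner but risks sending a half-finished thought; a longer one does the opposite. That is the eager-versus-conservative tradeoff in one parameter.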

3. Select and test your LLM

GPT-4o, Claude, Gemini—each has different streaming characteristics, context window behavior, and latency under load. For voice, time-to-first-token matters more than it does for chat. You'll want to measure that under realistic conditions, which means you need a voice pipeline to test it in. This creates a dependency loop that slows evaluation.
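Time-to-first-token can be measured around any streaming response before the full pipeline exists. This is a rough sketch: `measure_stream` is a hypothetical helper that wraps any iterable of text chunks, so it makes no assumptions about a particular vendor SDK. The injectable `clock` is only there to make the measurement testable.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Return (time_to_first_token, total_time, full_text), times in seconds."""
    start = clock()
    first_token_at = None
    parts = []
    for token in tokens:
        if first_token_at is None:
            # Elapsed time until the first chunk arrived.
            first_token_at = clock() - start
        parts.append(token)
    total = clock() - start
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return first_token_at, total, "".join(parts)
```

Start the clock just before issuing the request, not after the stream object is returned, or the network round trip drops out of the number.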

4. Select and audition TTS

ElevenLabs, Cartesia, Rime, and others all offer meaningfully different voice quality, latency, and cost structures. "Audition" is the right word: you're making a qualitative judgment about which voice sounds right for your use case, on top of a quantitative judgment about latency and price. The qualitative part is genuinely hard to shortcut. And it involves playing audio files.

5. Evaluate orchestration frameworks—or decide to go raw

LiveKit Agents, Pipecat, and Vapi each handle different parts of the problem: session management, turn detection, tool use, audio transport. Understanding what each one does (and doesn't) handle takes real reading time. Some teams skip frameworks entirely and wire the WebSocket connections themselves. That's a legitimate choice, but it means owning the behavior each framework would have given you for free.

6. Write integration glue code

STT output doesn't arrive in the format your LLM prompt expects. LLM streaming output doesn't arrive in chunks your TTS system wants. Every boundary between vendors needs code—often more code than expected, because edge cases emerge that the documentation didn't model (partial words at chunk boundaries, silence handling, reconnection logic).
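As one example of this glue, LLM tokens usually need to be regrouped into speakable units before they reach TTS. The sketch below buffers streamed tokens and yields whole sentences. `sentences_for_tts` is a hypothetical name, and splitting on `.`, `!`, and `?` is a deliberately naive assumption: it will mis-split decimals and abbreviations, which is exactly the kind of edge case that grows this code.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def _first_terminator(text: str) -> int:
    """Index of the earliest sentence terminator in text, or -1 if none."""
    found = [pos for pos in (text.find(ch) for ch in SENTENCE_END) if pos != -1]
    return min(found) if found else -1


def sentences_for_tts(token_stream: Iterable[str]) -> Iterator[str]:
    """Regroup arbitrary LLM token chunks into sentence-sized TTS requests."""
    buffer = ""
    for token in token_stream:
        buffer += token
        cut = _first_terminator(buffer)
        while cut != -1:
            sentence, buffer = buffer[: cut + 1].strip(), buffer[cut + 1 :]
            if sentence:
                yield sentence
            cut = _first_terminator(buffer)
    # Flush whatever is left when the stream ends without a terminator.
    tail = buffer.strip()
    if tail:
        yield tail
```

Yielding sentence by sentence lets TTS start on the first sentence while the LLM is still generating the second.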

7. Implement turn detection

When does the user stop speaking? When should the agent start responding? This is harder than it sounds. Voice activity detection (VAD) has false positives—it will think the user stopped talking when they paused to think. Your turn detection logic needs to account for this, or your agent will interrupt mid-thought. If you're using an orchestration framework, this may be partially handled. If you're going raw, it's fully yours to build.
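A common baseline is to require a run of continuous silence before declaring the turn over, so a short thinking pause doesn't trigger a response. The sketch below assumes you already have per-frame speech flags from some VAD. The function name, the 20 ms frame size, and the 600 ms silence threshold are illustrative assumptions, not recommendations.

```python
from typing import Iterable, Optional


def end_of_turn_frame(
    vad_flags: Iterable[bool],
    frame_ms: int = 20,
    min_silence_ms: int = 600,
) -> Optional[int]:
    """Index of the frame where the turn is declared over, or None."""
    frames_needed = max(1, min_silence_ms // frame_ms)
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(vad_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            # Only count silence after the user has actually said something.
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None
```

With a 600 ms threshold, a 200 ms pause mid-sentence is ignored; with a 200 ms threshold, the same pause ends the turn. Silence alone is a blunt signal, which is why production systems tend to combine it with transcript and context cues.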

8. Handle interruptions

What happens when the user speaks while the agent is in the middle of a response? You need to detect the interruption, stop the TTS playback, cancel or drain the in-flight LLM generation, and restart the conversation turn. Each of those operations involves a different system, and they have to coordinate correctly.

This is where integration complexity compounds. One developer building on LiveKit ran into exactly this—they had to write a custom AgentActivity class patch to fix interruption handling that was cascading across three gates in their pipeline. The fix worked, but it took a full day to diagnose and another half-day to implement. That's not a criticism of LiveKit. It's an example of what integration complexity looks like in practice: real code, real debugging, time that doesn't show up anywhere in the timeline.
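The coordination itself can be sketched with plain `asyncio`: race the agent's response against a barge-in signal and cancel whichever loses. Everything here is a stand-in. `speak` simulates LLM generation plus TTS playback, and `run_turn` is a hypothetical name, not any framework's API. A real pipeline would also have to flush audio already buffered on the client.

```python
import asyncio
from typing import List


async def speak(chunks: List[str], played: List[str]) -> None:
    """Stand-in for LLM generation plus TTS playback, one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.05)  # simulated synthesis and playback time
        played.append(chunk)


async def run_turn(
    chunks: List[str], interrupt: asyncio.Event, played: List[str]
) -> bool:
    """Play a response unless the user barges in. Returns True if interrupted."""
    response = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(interrupt.wait())
    done, _ = await asyncio.wait(
        {response, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and response not in done
    # Cancel whichever task is still running, then wait for it to unwind.
    for task in (response, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(response, barge_in, return_exceptions=True)
    return interrupted
```

After an interruption, `played` tells you how much of the response the user actually heard, which matters when you decide what to keep in the conversation history.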


9. Set up billing accounts across providers

Three vendors means three accounts, three billing models, three rate limits to understand, and three invoices to reconcile. Operational overhead, but real overhead.

10. Debug cross-vendor latency

Your end-to-end latency is the sum of: STT streaming latency + network hop to LLM + LLM time-to-first-token + network hop to TTS + TTS synthesis latency + audio playback. You can optimize each, but only after you've measured it, which requires a working pipeline. Early in the build, you may not know whether the slowness lives in the STT, the LLM, the TTS, or the connection between them.
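The sum is simple enough to keep as a running budget. In the sketch below, the stage names mirror the terms above and `latency_budget` is a hypothetical helper. The numbers in the usage example are made-up placeholders to show the arithmetic, not measurements of any provider.

```python
from typing import Dict, Tuple


def latency_budget(stages_ms: Dict[str, int]) -> Tuple[int, str]:
    """Return (end-to-end latency in ms, name of the slowest stage)."""
    if not stages_ms:
        raise ValueError("at least one stage is required")
    total = sum(stages_ms.values())
    slowest = max(stages_ms, key=stages_ms.get)
    return total, slowest


# Placeholder values for illustration only; replace with your own measurements.
example = {
    "stt_streaming": 150,
    "hop_to_llm": 40,
    "llm_first_token": 350,
    "hop_to_tts": 40,
    "tts_synthesis": 120,
    "audio_playback": 60,
}
total_ms, bottleneck = latency_budget(example)
```

Once each stage is measured separately, the slowest one tells you where optimization pays off first.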

11. Add observability

How do you log a full conversation when it spans three vendors' systems? Conversation-level analytics—what was said, when, latency at each step, which tool calls fired—don't emerge automatically from three separate logging systems. You'll have to wire them together, or accept that you're flying blind in production.
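One way to wire them together is to stamp every vendor event with a shared conversation ID and turn number, then derive per-stage latency from the timestamps. This is a minimal sketch under that assumption. `TurnTrace` and the stage names are hypothetical, and in practice the record would be shipped to whatever logging backend you already use.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TurnTrace:
    """Collects timestamps from every vendor for one conversation turn."""

    conversation_id: str
    turn: int
    marks_ms: Dict[str, int] = field(default_factory=dict)

    def mark(self, stage: str, at_ms: int) -> None:
        """Record when a pipeline stage was reached."""
        self.marks_ms[stage] = at_ms

    def stage_latencies(self) -> Dict[str, int]:
        """Elapsed time between consecutive marks, ordered by timestamp."""
        ordered = sorted(self.marks_ms.items(), key=lambda item: item[1])
        return {
            f"{a}->{b}": later - earlier
            for (a, earlier), (b, later) in zip(ordered, ordered[1:])
        }
```

Because the marks are keyed by conversation and turn, records from three separate logging systems can be joined after the fact.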

That's the checklist. None of these items are hard in isolation. What makes them hard is that they're sequential in some places, parallel in others, and interdependent in ways you don't fully understand until you're already into the build. The total calendar time for most teams building this way: four to eight weeks from start to a production-ready system. That varies a lot depending on team experience and how many of the above decisions go cleanly.

Path 2: The single-WebSocket route

Here's the same destination on the Voice Agent API path.

Step	DIY stack	Voice Agent API
Provider selection	Benchmark STT, LLM, and TTS separately	One API, one signup
Turn detection	Build or configure inside a framework	Handled inside the API
Interruption handling	Coordinate STT, LLM, and TTS state	Handled inside the API
Integration glue	Custom code at every vendor boundary	Single WebSocket; audio in, audio out
Typical time to working	Four to eight weeks	Same afternoon for most developers

1. Read the docs

The core concepts section of the Voice Agent docs covers WebSocket connection, audio format, turn detection, and tool use. Most developers get through it in about ten minutes.

2. Get an API key

Standard signup flow.

3. Connect a WebSocket and stream audio

The API takes raw audio in, sends audio back. There's no separate STT integration, no separate LLM integration, no separate TTS integration. Those live inside the API. Your client streams audio bytes; the API streams audio bytes back.
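On the client side, "streaming audio bytes" mostly means slicing captured audio into small frames and sending each one as it becomes available. The sketch below shows only that slicing step. The 16 kHz, 16-bit mono PCM format and the 20 ms frame size are assumptions for illustration; the audio format section of the docs is the authority on what the API actually expects.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16000,   # assumed; check the docs for the real format
    frame_ms: int = 20,         # assumed frame duration
    bytes_per_sample: int = 2,  # 16-bit mono
) -> Iterator[bytes]:
    """Split raw PCM audio into fixed-duration frames for streaming."""
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    for offset in range(0, len(pcm), frame_bytes):
        # The last frame may be shorter than the rest.
        yield pcm[offset : offset + frame_bytes]
```

Each frame would then be sent as one binary WebSocket message, with audio from the API arriving as messages in the other direction.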

4. Add tools if your use case needs them

If your agent needs to look up a calendar, check a CRM, or call a backend—you define tools in your session config. The agent decides when to invoke them; you handle the function calls. This part is genuinely similar to tool use in any LLM API.
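The "you handle the function calls" half usually comes down to a small dispatcher on your side. In the sketch below, the registry, the `check_availability` stub, and the name-plus-JSON-arguments call shape are all hypothetical, modeled on how tool use commonly looks in LLM APIs. The real message schema is whatever the Voice Agent docs specify.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the agent's tool calls can be routed to it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def check_availability(date: str) -> Dict[str, Any]:
    # Stub standing in for a real calendar lookup.
    return {"date": date, "slots": ["09:00", "14:30"]}


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Run the requested tool and return a JSON string to send back."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = TOOLS[name](**json.loads(arguments_json))
    except (TypeError, ValueError) as exc:
        # Bad JSON or wrong arguments: report it instead of crashing the session.
        return json.dumps({"error": str(exc)})
    return json.dumps(result)
```

Returning errors as data rather than raising keeps the voice session alive, so the agent can apologize or retry instead of going silent.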

5. Deploy

Your infrastructure is a WebSocket client and whatever backend your tools connect to. No multi-vendor deployment to coordinate.

Early usage of the Voice Agent API showed a pattern worth noting: new signups—developers who had never used AssemblyAI before—came back after their first session. That's a signal about first-run success rate. If the time-to-working were measured in days rather than hours, you wouldn't see that return behavior. You'd see people who tried it, got stuck somewhere, and moved on.

That matches what the launch documentation says: most developers get a working agent running the same afternoon they start.


The honest tradeoff

The single-WebSocket path is faster. That's real, and it's worth saying plainly. But faster isn't the only variable.

What does the full DIY path give you that the single-API path doesn't? Genuine answers:

Full component control. You can use any LLM, any TTS voice, any speech-to-text model. You can swap components as the landscape changes. If a new TTS provider releases voices that sound dramatically better for your use case, you can integrate them without waiting for anyone else to do it.

Vendor-specific compliance infrastructure. Some teams have already built on Azure OpenAI or AWS Bedrock for data residency reasons. Some have SOC 2 relationships with specific providers they can't break, or operate as business associates under HIPAA with existing BAAs in place. The full DIY path lets you stay within existing compliance perimeters.

Deep customization at every layer. Some teams we work with describe their stacks as "highly custom"—specific orchestration choices made deliberately because their use case requires behavior that off-the-shelf options don't provide. For teams with genuine requirements at that level of specificity, owning every layer is worth the time. The control isn't incidental; it's the point.

If you're building a product where the voice infrastructure itself is a differentiator, the DIY path makes sense. The tradeoff is real time, measured in weeks, spent on coordination that doesn't directly serve your end users.

The pattern we're seeing in production

There's a category of team that the full DIY route wasn't really designed for—not because they're less technical, but because their expertise is vertical-specific rather than infrastructure-specific.

The teams shipping voice agents in week one of the Voice Agent API: small engineering teams, two to five people, deep domain expertise in their vertical, no particular interest in becoming voice AI infrastructure specialists. The use cases they're building: lead qualification, appointment booking, clinical intake, interview simulation, sales roleplay coaching. One developer on our solutions team put together a live translation demo in about an hour—functional enough to show to stakeholders. We're seeing a similar pattern in meeting intelligence and notetaker builds, where the team's edge is in the workflow, not the audio stack.

These teams don't have a strong opinion on which orchestration framework handles turn detection best. They've never had to. What they have is a strong opinion on what their agent should do—what it should know, how it should handle edge cases in their domain, what integrations it needs. The Voice Agent API was designed for that profile: teams where voice is the interface and the product is something else.

Closing

Both paths lead to the same place: a voice agent running in production. The question is what your team wants to spend time on.

If voice infrastructure is the product—if the ability to tune every component is a genuine competitive advantage—own every layer. The DIY path is real engineering work for a real reason.

If voice is the interface and the product is something else—if what you're actually building is a better way to qualify leads, book appointments, or conduct clinical intake—then infrastructure being invisible is a feature. The weeks you're not spending on STT evaluation and interruption handling and cross-vendor observability are weeks you're spending on the thing that makes your product different.

The invisible work is real either way. The question is which invisible work is worth doing.


FAQs

How long does it take to build a voice agent?

Building a voice agent on a full DIY stack typically takes four to eight weeks of calendar time, depending on team experience and how cleanly STT, LLM, TTS, and orchestration decisions go. On a single-API path like AssemblyAI's Voice Agent API, most developers get a working agent running the same afternoon they start, since STT, LLM, and TTS are handled behind one WebSocket.

What's the difference between a DIY voice agent stack and a Voice Agent API?

A DIY stack composes separate STT, LLM, and TTS providers, glued together with custom code for turn detection, interruption handling, and observability. A Voice Agent API exposes a single endpoint—you stream audio in and receive audio back—and handles the pipeline internally. The DIY route trades time for component-level control; the API route trades control for speed to production.

Do I need streaming speech-to-text for voice agents?

Yes. Real-time voice agents need streaming STT so partial transcripts can be passed to the LLM mid-utterance and the agent can respond within human-conversation latency budgets. Batch transcription waits for the full audio file, which adds seconds of delay and breaks the conversational feel. See our real-time speech recognition deep-dive for how the streaming model works.

How do you handle interruptions in a voice agent?

Interruption handling requires detecting that the user has started speaking, stopping TTS playback, canceling or draining the in-flight LLM generation, and restarting the conversation turn—all coordinated across whichever systems own each piece. On a DIY stack, you write this logic yourself across three vendors. On a single Voice Agent API, the interruption state is managed inside the API.

What's the role of turn detection in a voice agent pipeline?

Turn detection decides when a user has finished speaking and the agent should respond. It runs on signals from voice activity detection, transcript stability, and conversation context. Get it wrong on the eager side and the agent interrupts a thinking pause; get it wrong on the conservative side and the agent feels sluggish. Speaker diarization and turn detection often run together in multi-party scenarios.

How much does a voice agent cost to run?

Cost depends on usage volume and the path you take. A DIY stack means paying STT, LLM, and TTS providers separately, each with their own rate limits and minimums. The AssemblyAI Voice Agent API uses transparent per-minute pricing with no upfront commits or concurrency caps—see pricing for current rates and the docs for implementation details.
