Insights & Use Cases
April 29, 2026

Build a voice agent with a chained STT-LLM-TTS architecture

Voice agent architecture explained: learn how STT, LLM, TTS, and orchestration work together in a low-latency streaming pipeline for natural voice interactions.

Kelsey Foster
Growth

Voice agents are transforming how businesses interact with customers, automate workflows, and build conversational interfaces. At their core, most voice agents follow a chained architecture: speech-to-text (STT) converts audio to text, a large language model (LLM) generates a response, and text-to-speech (TTS) converts that response back to audio. This STT‑LLM‑TTS pipeline is the most common pattern for building real-time voice agents today.

In this guide, we'll walk through every component of a voice agent architecture, show you how to wire them together with working Python code, discuss performance requirements, and help you decide whether to build or buy your voice agent pipeline. Whether you're building a customer support bot, a virtual receptionist, or an AI-powered phone agent, this article gives you the foundation to get started.

What are the core components of voice agent architecture?

A voice agent pipeline has four core components. Each one plays a distinct role in converting spoken input into a spoken response.

| Component | Role | Example providers |
|---|---|---|
| Speech-to-text (STT) | Converts spoken audio into text in real time | AssemblyAI, Google Cloud STT, Whisper |
| Large language model (LLM) | Processes the transcribed text and generates an intelligent response | Anthropic Claude, OpenAI GPT, Google Gemini |
| Text-to-speech (TTS) | Converts the LLM's text response into natural-sounding audio | ElevenLabs, Amazon Polly, Google Cloud TTS |
| Orchestration layer | Manages data flow, turn detection, interruption handling, and error recovery | Custom code, LiveKit, AssemblyAI Voice Agent API |

Speech-to-text

Streaming speech-to-text is the entry point of any voice agent. It takes raw audio from a microphone or phone call and produces a text transcript in real time. For voice agents, you need a streaming STT provider that supports WebSocket connections and delivers transcripts with minimal latency.

AssemblyAI's streaming STT uses the Universal-3 Pro Streaming speech model, which delivers state-of-the-art accuracy for real-time transcription. Here's how to connect using the v3 streaming API. Note that the SDK invokes each handler with two arguments — the StreamingClient instance and the event — so every callback's first parameter is client:

from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)

def on_begin(client: StreamingClient, event: BeginEvent):
    """Called when the streaming session begins."""
    print(f"Session started — ID: {event.id}")

def on_turn(client: StreamingClient, event: TurnEvent):
    """Called on each turn update. Check end_of_turn for final transcripts."""
    if event.end_of_turn:
        print(f"Final transcript: {event.transcript}")
    else:
        print(f"Partial: {event.transcript}", end="\r")

def on_error(client: StreamingClient, error: StreamingError):
    print(f"Error: {error}")

def on_terminated(client: StreamingClient, event: TerminationEvent):
    print("Session terminated")

client = StreamingClient(
    StreamingClientOptions(
        api_key="YOUR_ASSEMBLYAI_KEY",
        api_host="streaming.assemblyai.com",
    )
)

client.on(StreamingEvents.Begin, on_begin)
client.on(StreamingEvents.Turn, on_turn)
client.on(StreamingEvents.Termination, on_terminated)
client.on(StreamingEvents.Error, on_error)

client.connect(
    StreamingParameters(
        speech_model="u3-rt-pro",
        sample_rate=16000,
    )
)

The v3 streaming API connects over wss://streaming.assemblyai.com/v3/ws and uses a turn-based event model. The TurnEvent provides both partial and final transcripts—check event.end_of_turn to determine when a complete utterance has been captured. This is critical for voice agents because you only want to send complete utterances to the LLM.
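
The example above only opens the session; to drive it, you also need to push audio into the client. Here's a minimal sketch, assuming the SDK's optional microphone helper (assemblyai.extras.MicrophoneStream, installed via the extras package and backed by PyAudio) is available; any generator of raw 16 kHz PCM chunks can stand in for it:

import assemblyai as aai

# Assumes the optional microphone helper from the SDK extras (requires PyAudio).
# Any iterable of raw 16 kHz PCM audio chunks can be passed to client.stream() instead.
mic_stream = aai.extras.MicrophoneStream(sample_rate=16000)

try:
    # Blocks while audio is sent; Turn events fire via the handlers registered above
    client.stream(mic_stream)
finally:
    client.disconnect(terminate=True)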

Get started with streaming STT

Sign up for a free AssemblyAI API key to start building with real-time streaming speech-to-text powered by Universal-3 Pro.

Get free API key

Large language models

The LLM is the "brain" of your voice agent. Once you have a transcript from the STT component, you send it to an LLM to generate an appropriate response. For voice agents, you want a model that balances speed with quality—a fast model that produces concise, conversational responses works better than a large model that generates lengthy, verbose output.

Here's an example using Anthropic's Claude as the LLM in a voice agent pipeline:

import anthropic

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
conversation_history = []

SYSTEM_PROMPT = """You are a helpful voice assistant. Keep your responses
concise and conversational—aim for 1-3 sentences. Avoid bullet points,
markdown formatting, or lengthy explanations. Respond as if you're
having a natural phone conversation."""

def get_llm_response(user_text: str) -> str:
    """Send transcribed text to Claude and get a response."""
    conversation_history.append({
        "role": "user",
        "content": user_text
    })

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=150,
        system=SYSTEM_PROMPT,
        messages=conversation_history
    )

    assistant_text = response.content[0].text
    conversation_history.append({
        "role": "assistant",
        "content": assistant_text
    })

    return assistant_text

A few best practices for using LLMs in voice agents:

  • Keep the system prompt voice-optimized. Instruct the model to be concise, avoid formatting, and respond conversationally.
  • Limit max tokens. For voice, shorter responses feel more natural. Aim for 50–150 tokens.
  • Use streaming responses. Stream the LLM output token by token so you can start TTS before the full response is generated.
  • Maintain conversation history. Pass the full conversation context so the LLM can handle multi-turn dialogues.

Text-to-speech

The final stage of the pipeline converts the LLM's text response into audio that the user hears. Modern TTS systems produce remarkably natural speech, and many offer streaming output so audio playback can begin before the full response is synthesized.

Here's a comparison of popular TTS providers for voice agents:

| Provider | Streaming support | Voice cloning | Latency |
|---|---|---|---|
| ElevenLabs | Yes | Yes | ~300 ms |
| Amazon Polly | Yes | No | ~200 ms |
| Google Cloud TTS | Yes | Limited | ~250 ms |
| OpenAI TTS | Yes | No | ~300 ms |

Here's a basic example of streaming TTS with ElevenLabs:

import requests

def text_to_speech_stream(text: str, voice_id: str = "21m00Tcm4TlvDq8ikWAM"):
    """Convert text to speech using ElevenLabs streaming API."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream"

    headers = {
        "xi-api-key": "YOUR_ELEVENLABS_KEY",
        "Content-Type": "application/json"
    }

    payload = {
        "text": text,
        "model_id": "eleven_turbo_v2",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75
        }
    }

    response = requests.post(url, json=payload, headers=headers, stream=True)

    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            yield chunk  # Stream audio chunks for playback
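
For a quick local test, you can collect the streamed chunks and write them to a file; by default the ElevenLabs stream endpoint returns MP3 audio. In a live agent you would hand each chunk to your playback layer as it arrives:

# Quick sanity check: save the streamed MP3 chunks to a file and play it manually.
# In a real agent, feed each chunk to your audio output instead of a file.
with open("response.mp3", "wb") as f:
    for chunk in text_to_speech_stream("Thanks for calling! How can I help you today?"):
        f.write(chunk)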

Orchestration

The orchestration layer is the glue that holds everything together. It handles:

  • Turn detection: Knowing when the user has finished speaking so you can trigger the LLM
  • Barge-in / interruption handling: Allowing the user to interrupt the agent mid-response
  • Error recovery: Handling dropped connections, timeouts, and API failures gracefully
  • Context management: Maintaining conversation state across turns
  • Audio routing: Managing audio input/output streams, mixing, and playback

Orchestration is often the hardest part to get right. Turn detection alone—deciding when a pause means "I'm done talking" versus "I'm thinking"—requires careful tuning. This is one of the key reasons many teams choose a managed solution rather than building from scratch.
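
To make the problem concrete, here is a deliberately naive sketch of silence-based end-of-turn detection. The 0.01 energy threshold and 700 ms pause window are arbitrary placeholder values; production systems layer trained VAD models and semantic cues from partial transcripts on top of heuristics like this:

import time
import numpy as np

SILENCE_THRESHOLD = 0.01   # RMS energy below this counts as silence (arbitrary)
END_OF_TURN_SECS = 0.7     # pause length treated as "done talking" (arbitrary)

class SimpleTurnDetector:
    """Naive energy-based turn detector for 16-bit PCM audio frames."""

    def __init__(self):
        self.last_speech_time = None

    def process_frame(self, frame: bytes) -> bool:
        """Return True when the current pause looks like an end of turn."""
        samples = np.frombuffer(frame, dtype=np.int16).astype(np.float32) / 32768.0
        rms = float(np.sqrt(np.mean(samples ** 2))) if len(samples) else 0.0

        now = time.monotonic()
        if rms >= SILENCE_THRESHOLD:
            self.last_speech_time = now
            return False

        # Silence: end the turn only if we heard speech and the pause is long enough
        return (
            self.last_speech_time is not None
            and now - self.last_speech_time >= END_OF_TURN_SECS
        )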

Architecture patterns for voice agents

There are two primary patterns for wiring STT, LLM, and TTS together in a voice agent: the cascading pipeline and the streaming architecture. The right choice depends on your latency requirements and how much implementation complexity you're willing to take on.

| Pattern | How it works | Latency | Complexity |
|---|---|---|---|
| Cascading pipeline | Each stage completes fully before the next begins | Higher (sum of all stages) | Low |
| Streaming architecture | Stages overlap: LLM streams tokens to TTS as they arrive | Lower (stages run in parallel) | High |

Cascading pipeline

In a cascading pipeline, each stage completes before the next one starts. The user speaks, the STT produces a full transcript, the LLM generates a complete response, and then TTS converts the entire response to audio.

This is the simplest pattern to implement, but it has a significant latency penalty. If STT takes 500 ms, the LLM takes 1 second, and TTS takes 400 ms, the user waits 1.9 seconds before hearing any audio. For casual applications, this may be acceptable—but for real-time voice agents, it feels sluggish.

The snippet below illustrates the shape of a cascading pipeline. speech_to_text, get_llm_response (defined earlier), and text_to_speech are placeholders — substitute the streaming-STT, Claude, and ElevenLabs implementations from the previous sections to make it runnable end-to-end:

import time

# Placeholder helpers — replace with real implementations:
# - speech_to_text: blocking call to your STT provider
# - get_llm_response: defined in the LLM section above
# - text_to_speech: blocking call to your TTS provider

def speech_to_text(audio_data: bytes) -> str: ...
def text_to_speech(text: str) -> bytes: ...

def cascading_pipeline(audio_data: bytes) -> bytes:
    """Simple cascading pipeline — each stage runs sequentially."""

    # Stage 1: STT — wait for full transcript
    start = time.time()
    transcript = speech_to_text(audio_data)
    stt_time = time.time() - start
    print(f"STT completed in {stt_time:.2f}s: {transcript}")

    # Stage 2: LLM — wait for full response
    start = time.time()
    response_text = get_llm_response(transcript)
    llm_time = time.time() - start
    print(f"LLM completed in {llm_time:.2f}s: {response_text}")

    # Stage 3: TTS — wait for full audio
    start = time.time()
    audio_output = text_to_speech(response_text)
    tts_time = time.time() - start
    print(f"TTS completed in {tts_time:.2f}s")

    total = stt_time + llm_time + tts_time
    print(f"Total pipeline latency: {total:.2f}s")

    return audio_output

Streaming architecture

The streaming architecture overlaps stages to minimize time-to-first-audio. As soon as the STT delivers a final transcript, the LLM begins streaming tokens. As tokens arrive, they're buffered into sentence fragments and sent to TTS immediately. The user starts hearing audio while the LLM is still generating the rest of the response.

Here's a complete streaming pipeline using AssemblyAI's v3 streaming API for STT, Claude for the LLM, and ElevenLabs for TTS. Note that play_audio is left as a stub — wire it to your platform's audio output (PyAudio, sounddevice, a phone-bridge SDK, etc.):

import anthropic
import requests
from assemblyai.streaming.v3 import (
    BeginEvent, StreamingClient, StreamingClientOptions,
    StreamingError, StreamingEvents, StreamingParameters,
    TurnEvent, TerminationEvent,
)

# --- Configuration ---
ASSEMBLYAI_API_KEY = "YOUR_ASSEMBLYAI_KEY"
ANTHROPIC_API_KEY  = "YOUR_ANTHROPIC_KEY"
ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_KEY"
ELEVENLABS_VOICE_ID = "21m00Tcm4TlvDq8ikWAM"

# --- LLM ---
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
conversation_history = []

SYSTEM_PROMPT = """You are a helpful voice assistant. Keep your responses
concise and conversational—aim for 1-3 sentences. Avoid bullet points,
markdown formatting, or lengthy explanations."""

def stream_llm_response(user_text: str):
    """Stream tokens from Claude for low-latency TTS handoff."""
    conversation_history.append({"role": "user", "content": user_text})

    with anthropic_client.messages.stream(
        model="claude-haiku-4-5-20251001",
        max_tokens=150,
        system=SYSTEM_PROMPT,
        messages=conversation_history
    ) as stream:
        full_response = ""
        sentence_buffer = ""

        for text in stream.text_stream:
            full_response += text
            sentence_buffer += text

            # Flush buffer at sentence boundaries for TTS
            if any(sentence_buffer.rstrip().endswith(p)
                   for p in [".", "!", "?"]):
                yield sentence_buffer
                sentence_buffer = ""

        # Flush remaining text
        if sentence_buffer.strip():
            yield sentence_buffer

    conversation_history.append({
        "role": "assistant",
        "content": full_response
    })

# --- TTS ---
def stream_tts(text_chunk: str):
    """Send a text chunk to ElevenLabs and stream back audio."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{ELEVENLABS_VOICE_ID}/stream"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json"
    }
    payload = {
        "text": text_chunk,
        "model_id": "eleven_turbo_v2",
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.75}
    }

    response = requests.post(url, json=payload, headers=headers, stream=True)
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            play_audio(chunk)

def play_audio(chunk: bytes):
    """Placeholder — replace with actual audio output."""
    pass

# --- Orchestration ---
def handle_final_transcript(transcript: str):
    """Called when STT delivers a final transcript. Streams LLM -> TTS."""
    print(f"User said: {transcript}")
    for sentence in stream_llm_response(transcript):
        print(f"Agent says: {sentence}")
        stream_tts(sentence)

# --- STT with v3 streaming API ---
def on_begin(client: StreamingClient, event: BeginEvent):
    print(f"Streaming session started — ID: {event.id}")

def on_turn(client: StreamingClient, event: TurnEvent):
    if event.end_of_turn:
        handle_final_transcript(event.transcript)
    else:
        print(f"Partial: {event.transcript}", end="\r")

def on_error(client: StreamingClient, error: StreamingError):
    print(f"STT error: {error}")

def on_terminated(client: StreamingClient, event: TerminationEvent):
    print("STT session terminated")

def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=ASSEMBLYAI_API_KEY,
            api_host="streaming.assemblyai.com",
        )
    )
    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)

    client.connect(
        StreamingParameters(
            speech_model="u3-rt-pro",
            sample_rate=16000,
        )
    )
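
    # From here, stream microphone or telephony audio into the client
    # (for example, client.stream(aai.extras.MicrophoneStream(sample_rate=16000)),
    # assuming the SDK's optional microphone helper is installed).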

if __name__ == "__main__":
    main()

In this streaming architecture, the key optimization is at the sentence boundary. Instead of waiting for the full LLM response, we flush each sentence to TTS as soon as it's complete. This means the user hears the first sentence of the response while the LLM is still generating the second sentence—dramatically reducing perceived latency.

Try the Voice Agent API playground

Skip the complexity of wiring three services together. AssemblyAI's Voice Agent API handles STT, LLM, and TTS in a single WebSocket connection.

Try playground

Performance requirements for real-time voice agents

Latency is the single most important performance metric for voice agents. Users expect conversational AI to feel like a natural conversation, and research shows that response delays beyond 500–700 ms start to feel unnatural. Here's a breakdown of typical latency budgets for each component:

| Component | Target latency | What impacts it |
|---|---|---|
| STT (time to final transcript) | 200–500 ms | Model size, network latency, audio chunk size |
| LLM (time to first token) | 150–400 ms | Model size, prompt length, provider infrastructure |
| TTS (time to first audio) | 200–400 ms | Voice model, text length, streaming support |
| Network overhead | 50–150 ms | Geographic distance, connection type, edge routing |

Latency budget

In a streaming architecture, the total time from "user stops speaking" to "user hears first audio" is approximately: STT finalization + LLM time-to-first-token + TTS time-to-first-audio + network overhead. With well-optimized providers, this can be as low as 600–900 ms—fast enough to feel conversational.
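
As a quick sanity check, summing mid-range values from the table above shows why each stage needs to sit near the low end of its range to reach a sub-second response:

# Rough time-to-first-audio estimate using mid-range values from the table above
budget_ms = {
    "stt_final_transcript": 400,
    "llm_first_token": 250,
    "tts_first_audio": 300,
    "network_overhead": 100,
}
print(f"Estimated time-to-first-audio: {sum(budget_ms.values())} ms")  # 1050 ms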

To stay within this budget, consider the following optimizations:

  • Use streaming everywhere. Stream STT transcripts, LLM tokens, and TTS audio. Never wait for a full response when you can start processing partial output.
  • Choose fast models. Smaller, faster LLMs like Claude Haiku outperform larger models for voice agents because the speed improvement matters more than marginal quality gains.
  • Minimize network hops. Keep your services in the same region when possible. AssemblyAI offers edge routing that automatically directs requests to the nearest compute region, reducing network latency for geographically distributed users.
  • Pre-warm connections. Establish WebSocket connections before the user starts speaking to eliminate connection setup latency.
  • Buffer intelligently. Send text to TTS in sentence-sized chunks rather than word-by-word (too many requests) or all-at-once (too much delay).

For entity-heavy use cases like a customer intake agent (e.g., "I need to reach Sarah Kowalczyk at Acme Corp"), STT accuracy is especially critical. A single transcription error in a name or company can break downstream processes. This is where investing in a high-accuracy STT provider like AssemblyAI pays dividends.

Build vs. buy

One of the biggest decisions you'll face is whether to build your own STT-LLM-TTS pipeline or use a managed Voice Agent API. Here's how the two approaches compare:

| Factor | Build your own pipeline | AssemblyAI Voice Agent API |
|---|---|---|
| Setup time | Days to weeks: integrate 3+ providers, handle auth, manage connections | Minutes: single WebSocket connection handles everything |
| Cost | Variable: pay each provider separately, costs scale unpredictably | $4.50/hr flat rate covering STT + LLM + TTS |
| Orchestration | You build turn detection, interruption handling, error recovery | Built-in turn detection, barge-in, and error recovery |
| Tool calling | Custom implementation with your LLM provider | Built-in function calling with tool definitions |
| Reliability | You manage failover across multiple providers | Session resumption: reconnect within 30 seconds without losing context |
| Flexibility | Full control over every component and model choice | Configurable via JSON: choose voice, system prompt, tools |

When to build your own: If you need full control over every component, want to use specific providers for each stage, or have unique latency requirements that demand custom optimization, building your own pipeline makes sense. The streaming architecture example earlier in this article gives you a solid starting point.

When to use the Voice Agent API: If you want to ship fast, avoid managing three separate provider integrations, and prefer predictable pricing, the Voice Agent API is the better choice. It uses a standard WebSocket + JSON API—no SDK required. Connect to wss://agents.assemblyai.com/v1/ws, send a session.update message with your configuration, and start streaming audio.

Here's how to connect to the Voice Agent API and configure a session:

import asyncio
import websockets
import json
import os

async def voice_agent():
    uri = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}

    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Configure session — STT, LLM, and TTS handled automatically
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a helpful customer support agent.",
                "greeting": "Hi! How can I help you today?",
                "output": {"voice": "ivy"},
                "tools": [{
                    "type": "function",
                    "name": "get_account_info",
                    "description": "Look up account status",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "account_id": {"type": "string"}
                        },
                        "required": ["account_id"]
                    }
                }]
            }
        }))

        async for message in ws:
            event = json.loads(message)
            if event["type"] == "session.ready":
                print(f"Ready — session {event['session_id']}")
                # Start streaming audio with input.audio events
                break

asyncio.run(voice_agent())

With the Voice Agent API, you define your system prompt, greeting, voice, and tools in a single JSON message. The API handles STT, LLM inference, TTS, turn detection, interruption handling, and tool execution—all through one WebSocket connection. Session resumption lets you reconnect within 30 seconds without losing conversation context, which is critical for production deployments over unreliable networks.

Final words

Building a voice agent with a chained STT-LLM-TTS architecture is a well-understood pattern, but getting it to production quality requires careful attention to latency, orchestration, and reliability. The streaming architecture—where STT, LLM, and TTS stages overlap—is the key to achieving conversational response times.

For teams who want the full pipeline handled, AssemblyAI's Voice Agent API provides a single WebSocket connection at $4.50/hr that replaces three separate providers, with built-in orchestration, turn detection, tool calling, and interruption handling included. No SDK required—connect to wss://agents.assemblyai.com/v1/ws, send JSON, and ship.

If you prefer full control, the code examples in this guide give you a working foundation to build your own pipeline with AssemblyAI's v3 streaming STT, Claude, and ElevenLabs. Either way, the core principles are the same: stream everything, minimize latency at every stage, and invest in a robust orchestration layer.

Start building voice agents today

Sign up for a free AssemblyAI account and build your first voice agent—whether you use the chained STT-LLM-TTS pipeline or the Voice Agent API.

Get free API key

Frequently asked questions

What is a chained STT-LLM-TTS architecture?

A chained STT-LLM-TTS architecture is a voice agent pipeline where speech-to-text converts audio to text, a large language model generates a response, and text-to-speech converts that response back to audio. These three stages run in sequence (or overlapping in a streaming architecture) to enable real-time voice conversations. It is the most common pattern for building voice agents today because it lets you choose best-in-class providers for each stage independently.

How do I reduce latency in a voice agent?

The most effective way to reduce latency is to use a streaming architecture where STT, LLM, and TTS stages overlap. Stream LLM tokens to TTS in sentence-sized chunks so the user hears audio while the model is still generating. Additionally, choose fast models (e.g., Claude Haiku for LLM), pre-warm WebSocket connections, and use providers with edge routing to minimize network latency. A well-optimized streaming pipeline can achieve 600–900 ms time-to-first-audio.

What is AssemblyAI's Voice Agent API?

The Voice Agent API is a managed service that handles the entire STT-LLM-TTS pipeline through a single WebSocket connection at wss://agents.assemblyai.com/v1/ws. It costs $4.50/hr flat and includes built-in turn detection, barge-in handling, tool calling, and session resumption. No SDK is required—you connect with standard WebSockets and send JSON messages to configure your agent.

Should I build my own voice agent pipeline or use a managed API?

Build your own if you need full control over every component, want to use specific providers for each stage, or have unique requirements that demand custom optimization. Use a managed API like AssemblyAI's Voice Agent API if you want to ship faster, avoid integrating three separate services, and prefer predictable pricing at $4.50/hr. Most teams start with the managed API for speed and only move to a custom pipeline when they hit specific limitations.

What speech-to-text model should I use for voice agents?

For real-time voice agents, you need a streaming STT model with low latency and high accuracy. AssemblyAI's Universal-3 Pro Streaming model (u3-rt-pro) is purpose-built for real-time applications and delivers state-of-the-art accuracy via the v3 streaming API at wss://streaming.assemblyai.com/v3/ws. Accuracy is especially important for entity-heavy use cases like names, addresses, and account numbers where a single transcription error can break downstream processes.

How does turn detection work in voice agents?

Turn detection determines when a user has finished speaking so the agent can begin responding. It typically combines voice activity detection (VAD) with silence thresholds—if the user stops speaking for a configured duration (usually 500–800 ms), the system treats it as a completed turn. Advanced implementations also use semantic cues from partial transcripts. AssemblyAI's v3 streaming API provides turn-level events via TurnEvent with an end_of_turn flag, simplifying this for developers.

AI voice agents