June 23, 2026

Build a voice agent without Pipecat or LiveKit

Do you really need Pipecat or LiveKit to build a voice agent? Often not. Here's the framework-free architecture — one WebSocket for STT, LLM, and TTS, telephony included.

Kelsey Foster

Growth

Voice Agent API

AI voice agents


If you've started building a voice agent in the last year, you've hit this question fast: do I need Pipecat or LiveKit?

The internet says yes. Every tutorial reaches for an orchestration framework before it writes a line of agent logic. And for good reason — those frameworks are genuinely excellent, and AssemblyAI ships drop-in plugins for both. If you're already running one, keep running it.

But "do I need a framework" is the wrong first question. The real question is: what moves the audio, and what wires the pipeline together? Answer those two and the framework question answers itself. A lot of the time, the honest answer is: you don't need one.
This post is an architecture overview of the framework-free path. We'll cover the four things people actually mean when they ask about going without Pipecat or LiveKit:

What a voice pipeline framework actually does for you (and what it costs).
The transport question everyone conflates — SIP vs. WebSocket.
Connecting a phone number with Twilio.
Whether the framework-free architecture holds up at enterprise scale.

Let's start with what you'd be removing.

What a framework actually does for you

A voice agent has to do a few things in a tight loop, every few hundred milliseconds: turn speech into text, decide what to say, turn that back into speech, and move audio in and out — while handling interruptions when the caller talks over the bot.

Historically, each of those was a different vendor. Speech-to-text from one provider, an LLM from another, text-to-speech from a third. Something has to sit in the middle and conduct: route partial transcripts to the LLM, stream the LLM's tokens to the TTS engine, manage turn-taking, and handle barge-in. That conductor is what Pipecat and LiveKit Agents give you, plus a transport layer to get audio to and from the user.

That's real work, and the frameworks do it well. The thing is, it's only necessary work if your pipeline is actually a multi-vendor pipeline.

Here's the cost side, because it's easy to miss when you're following a quickstart. A framework is another dependency in your stack — one more thing to deploy, version, monitor, and reason about when something goes wrong at 2am. You're still wiring three model vendors together, so you still own three sets of API keys, three billing relationships, three failure modes, and three latency budgets that stack. The framework hides that complexity. It doesn't remove it.
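To make "budgets that stack" concrete, here is a back-of-the-envelope sketch. The per-stage latencies and the 99.9% availability figures are invented for illustration and are not measurements of any vendor. The model also assumes the three stages run strictly in series and fail independently; real pipelines stream and overlap stages, so the actual numbers will differ.

```python
from math import prod

# Hypothetical per-stage numbers, for illustration only.
STAGE_LATENCY_MS = {"stt": 300, "llm": 500, "tts": 250}
STAGE_AVAILABILITY = {"stt": 0.999, "llm": 0.999, "tts": 0.999}

def stacked_latency_ms(stages: dict) -> int:
    """Serial stages: the caller waits for the sum of every stage."""
    return sum(stages.values())

def combined_availability(stages: dict) -> float:
    """Independent serial dependencies: every stage must be up at once."""
    return prod(stages.values())

print(stacked_latency_ms(STAGE_LATENCY_MS))
print(round(combined_availability(STAGE_AVAILABILITY), 6))
```

Under these assumptions, three stages that each look fine in isolation add up to a slower and slightly less available whole, which is the cost a framework hides rather than removes.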

So the question becomes: what if the pipeline weren't multi-vendor?

The architecture without a framework

Here's the shift. AssemblyAI's Voice Agent API collapses speech-to-text, the LLM, and text-to-speech into a single WebSocket connection. You stream audio in, you get audio out. Turn detection, interruption handling, and tool calling happen inside that one connection.

When the whole pipeline lives behind one API, there's nothing left to orchestrate. Your "conductor" becomes a single open socket.
Compare the two topologies.

With a framework (cascading, multi-vendor):

         ┌──────────── your orchestration framework ────────────┐
Caller ─▶│  transport ─▶ STT vendor ─▶ LLM vendor ─▶ TTS vendor │─▶ Caller
         └──────────────────────────────────────────────────────┘
         (you deploy, scale, and monitor this whole box)

Without a framework (single API):

Caller ─▶ thin transport relay ─▶ Voice Agent API ─▶ Caller
                                  (STT + LLM + TTS, one socket)

The Voice Agent API runs on Universal-3.5 Pro Realtime for the speech layer. That model leads Pipecat's agent-conversation benchmark on both word error rate (6.99%) and entity accuracy, and it nails short-utterance handling ("yes," "no," "mmhmm"). It takes the agent's question as context, keeps a rolling memory of the conversation, isolates the speaker from the room, and runs across 18 languages with native code-switching. It's framed as invisible infrastructure on purpose: one connection, one bill, one set of logs. End-to-end latency is around one second, and because there's no SDK to adopt, it works with coding agents like Claude Code out of the box.

What you build instead of an orchestration layer is a thin relay: take audio from wherever your user is, forward it to the WebSocket, and play back what comes off it. That's it. Let's wire it up — first in code, then over the phone.

Connect from your own server

Server-side and native clients connect directly with your API key in the Authorization header. Create an agent once with a single REST call, then connect to it by agent_id:

curl -X POST https://agents.assemblyai.com/v1/agents \
  -H "Authorization: $ASSEMBLYAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Support Assistant",
    "system_prompt": "You are a friendly support agent. Keep replies short and natural.",
    "greeting": "Hey there, what can I help with?",
    "voice": { "voice_id": "ivy" }
  }'

That returns an id. Now open the WebSocket and bind to it. The agent's stored prompt, voice, and tools load automatically, so you don't resend them:

import asyncio, json, base64, websockets

URL = "wss://agents.assemblyai.com/v1/ws"
AGENT_ID = "7ad24396-b822-4dca-871a-be9cc4781cf9"  # from the create call above

async def main():
    headers = {"Authorization": "YOUR_API_KEY"}
    # websockets < 14 accepts extra_headers; on 14+ the keyword is additional_headers.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Bind to the stored agent; no inline config needed.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"agent_id": AGENT_ID},
        }))

        async for raw in ws:
            event = json.loads(raw)
            t = event.get("type")
            if t == "session.ready":
                print("ready:", event.get("session_id"))
                # start streaming input.audio frames here
            elif t == "transcript.agent":
                print("agent:", event.get("text"))
            elif t == "reply.audio":
                pcm = base64.b64decode(event["data"])  # play this
            elif t in ("error", "session.error"):
                print("error:", event.get("message"))

asyncio.run(main())

Notice there's no pipeline graph, no service registry, no turn-detection plugin to configure. The events — session.ready, transcript.agent, reply.audio, input.speech.started for barge-in — are the whole protocol, and it's identical across every transport. For the browser, you'd mint a short-lived token server-side and open wss://agents.assemblyai.com/v1/ws?token=<token> so your key never ships to the client. Same session.update, same events. The full browser walkthrough is in the Voice Agent API quickstart.
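The "start streaming input.audio frames here" step can be sketched as a small helper that slices raw audio into base64-encoded `input.audio` events. The event shape (`type` plus a base64 `audio` field) follows the Twilio bridge shown later in this post. The helper name and the default chunk size are assumptions, since the post doesn't specify a frame size for this endpoint.

```python
import base64
import json

def input_audio_events(audio: bytes, chunk_bytes: int = 3200):
    """Yield JSON `input.audio` messages, one per chunk of raw audio.

    chunk_bytes is an assumed default; tune it to the encoding and
    sample rate your session actually uses.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for start in range(0, len(audio), chunk_bytes):
        chunk = audio[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Inside the `session.ready` branch you would loop over `input_audio_events(buffer)` and `await ws.send(message)` for each one, ideally from a separate task so the receive loop keeps draining events.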

The transport question: SIP vs. WebSocket

Here's where most of the confusion lives, so let's settle it.
Audio has to physically move between the user and your agent. There are three common ways to move it:

WebRTC — the browser-and-app real-time standard. It handles NAT traversal, jitter buffering, and echo cancellation. This is LiveKit's and Daily's home turf, and it's the right tool for multi-party rooms, video, and rich client SDKs.
SIP — the signaling protocol of the telephone network. If you're terminating raw PSTN calls or running your own telephony infrastructure, you're in SIP territory.
WebSocket — a plain, bidirectional socket. No media server, no SFU, no signaling dance. You send audio frames, you receive audio frames.

AssemblyAI takes audio over a WebSocket. That's the entire transport story for the STT layer and the Voice Agent API alike.

"But my users are on the phone — don't I need SIP?" This is the part people get wrong. You don't speak SIP yourself, and you don't stand up a WebRTC media server either. Your telephony provider does that for you. Twilio terminates the PSTN call, handles the SIP side, and hands you the call's audio over — you guessed it — a WebSocket. So the agent is WebSocket on the user side and WebSocket on the AssemblyAI side. It's sockets all the way through.

That's why you can skip the framework's transport layer for telephony: the carrier already converted the hard part into a WebSocket before it reaches you. Your job is just to bridge two sockets.

Connecting a phone number with Twilio

Let's make the phone case concrete, because it's the one that sounds like it should need the most machinery and actually needs the least.
The end-to-end path has four parties and three hops:

Caller ↔ Twilio Media Streams ↔ Your server ↔ Voice Agent API

Twilio handles the phone network. Your server is a thin bridge. The agent handles speech-to-speech. And here's the detail that removes a whole class of pain: Twilio's native G.711 μ-law audio is byte-compatible with the Voice Agent API's audio/pcmu encoding, so your bridge forwards audio as-is, with zero transcoding or resampling.
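Because nothing is transcoded, buffer math on the bridge stays simple. G.711 μ-law is 8,000 samples per second at one byte per sample, so a payload's duration follows directly from its length. The sketch below is illustrative; the helper names are hypothetical, and the 20 ms frame in the usage note is an example value rather than a documented frame size.

```python
import base64

MULAW_SAMPLE_RATE_HZ = 8000   # G.711 samples per second
MULAW_BYTES_PER_SAMPLE = 1    # 8-bit companded samples

def mulaw_duration_ms(payload_b64: str) -> float:
    """Duration of a base64-encoded μ-law payload, in milliseconds."""
    raw = base64.b64decode(payload_b64)
    return len(raw) * 1000 / (MULAW_SAMPLE_RATE_HZ * MULAW_BYTES_PER_SAMPLE)

def mulaw_bytes_for(duration_ms: int) -> int:
    """Number of μ-law bytes needed to hold duration_ms of audio."""
    return MULAW_SAMPLE_RATE_HZ * MULAW_BYTES_PER_SAMPLE * duration_ms // 1000
```

For example, `mulaw_bytes_for(20)` gives 160 bytes, and a 160-byte payload decodes to 20 ms of audio. That is handy when you want to log how much agent audio is still queued at the moment a barge-in arrives.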
When a call comes in, Twilio hits a webhook on your server. You return TwiML that opens a Media Streams WebSocket:

app.post("/twiml", (req, res) => {
  const callId = newCallId();
  const hostname = process.env.HOSTNAME.replace(/^https?:\/\//, "");
  const streamUrl = `wss://${hostname}/media-stream/${callId}`;

  res.type("text/xml").status(200).send(
    `<Response>
  <Connect>
    <Stream url="${streamUrl}" />
  </Connect>
</Response>`,
  );
});
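If your webhook runs in Python instead of Express, the same TwiML can be built with the standard library. This is a minimal sketch: the function name is hypothetical, the URL layout is copied from the handler above, and `quoteattr` is used so that an unusual call ID can't break the XML attribute. It needs Python 3.9+ for `str.removeprefix`.

```python
from xml.sax.saxutils import quoteattr

def connect_stream_twiml(hostname: str, call_id: str) -> str:
    """Return TwiML that connects the call to a Media Streams WebSocket."""
    # Strip a scheme if the configured hostname includes one.
    host = hostname.removeprefix("https://").removeprefix("http://")
    stream_url = f"wss://{host}/media-stream/{call_id}"
    # quoteattr adds the surrounding quotes and escapes &, <, and >.
    return (
        "<Response>"
        "<Connect>"
        f"<Stream url={quoteattr(stream_url)} />"
        "</Connect>"
        "</Response>"
    )
```

Return the string from your framework's route handler with a `text/xml` content type, just as the Express version does.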

When Twilio opens that Media Streams socket, your server opens a parallel connection to the Voice Agent API and configures the session inline — telling it to speak μ-law in both directions to match Twilio:

const aaiWs = new WebSocket("wss://agents.assemblyai.com/v1/ws", {
  headers: { Authorization: process.env.ASSEMBLYAI_API_KEY },
});

// Wait for the socket to open; sending while it is still connecting throws.
aaiWs.on("open", () => {
  aaiWs.send(JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: "You are a helpful voice assistant.",
      greeting: "Hi, thanks for calling. How can I help?",
      input: { type: "audio", format: { encoding: "audio/pcmu" } },
      output: { type: "audio", voice: "ivy", format: { encoding: "audio/pcmu" } },
      tools: [/* your tool definitions */],
    },
  }));
});

Once session.ready fires, the whole bridge is two forwarding rules plus one for interruptions:

// Caller → Agent: each Twilio media event becomes an input.audio event.
tw.on("media", (msg) => {
  if (msg.media.track !== "inbound") return;
  aaiWs.send(JSON.stringify({ type: "input.audio", audio: msg.media.payload }));
});

aaiWs.on("message", (data) => {
  const event = JSON.parse(data.toString());

  // Agent → Caller: each reply.audio event becomes a Twilio media action.
  if (event.type === "reply.audio" && event.data) {
    tw.send({ event: "media", streamSid: tw.streamSid, media: { payload: event.data } });
  }

  // Barge-in: caller talks over the agent → clear Twilio's buffer so it stops instantly.
  if (event.type === "input.speech.started") {
    tw.send({ event: "clear", streamSid: tw.streamSid });
  }
});

That's a production phone agent. No SIP stack, no media server, no orchestration framework: just a webhook and a socket bridge. AssemblyAI maintains a complete Twilio example repo with inbound and outbound calling and tool handling wired up, and the full walkthrough lives in Connect to Twilio. The same pattern applies to other carriers: Vonage and the rest hand you a media WebSocket, just as Twilio does.
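The three forwarding rules are easy to unit test if you pull them out as pure functions, separate from the sockets. Here is a Python sketch of that idea. The message shapes are copied from the JavaScript bridge above; the function names are hypothetical, and returning `None` for "nothing to forward" is a design choice of this sketch.

```python
import json

def twilio_media_to_agent(msg: dict):
    """Caller -> agent: forward inbound audio only; None for anything else."""
    if msg.get("event") != "media":
        return None
    media = msg.get("media", {})
    if media.get("track") != "inbound":
        return None
    # The μ-law payload is already base64, so it passes through untouched.
    return json.dumps({"type": "input.audio", "audio": media["payload"]})

def agent_event_to_twilio(event: dict, stream_sid: str):
    """Agent -> caller: audio becomes a media message, barge-in becomes a clear."""
    kind = event.get("type")
    if kind == "reply.audio" and event.get("data"):
        return json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": event["data"]},
        })
    if kind == "input.speech.started":
        return json.dumps({"event": "clear", "streamSid": stream_sid})
    return None
```

The socket handlers then shrink to "translate, and send if the result isn't `None`", which keeps the bridge logic testable without a live call.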

What if you only want to swap the STT layer?

Not everyone is building net-new. Maybe you already have an LLM and a TTS voice you love, and you only want better transcription — the rip-and-replace-one-piece path. You can do that without a framework too.
Connect straight to the streaming speech-to-text WebSocket and keep the rest of your stack exactly as it is:

import json
from urllib.parse import urlencode
import websocket  # pip install websocket-client

API_KEY = "YOUR_API_KEY"
# Streaming uses speech_model (singular). universal-3-5-pro is the flagship realtime model.
params = {"speech_model": "universal-3-5-pro", "sample_rate": 16000}
ENDPOINT = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(params)}"

def on_message(ws, message):
    data = json.loads(message)
    if data.get("type") == "Turn":
        if data.get("end_of_turn"):
            print("Final:", data.get("transcript"))   # → send to your own LLM
        else:
            print("Partial:", data.get("transcript"), end="\r")
    elif data.get("type") == "SpeechStarted":
        print("Barge-in: user started speaking")       # → interrupt your TTS

# Authenticate with the raw key in the header — no "Bearer" prefix.
ws = websocket.WebSocketApp(ENDPOINT, header={"Authorization": API_KEY},
                            on_message=on_message)
# run_forever() blocks, so send audio from an on_open callback or a background thread:
# 16 kHz PCM16 in ~50 ms chunks, as binary WebSocket frames.
ws.run_forever()

You hold the loop: end-of-turn transcripts to your LLM, tokens to your TTS, SpeechStarted to trigger barge-in. It's more wiring than the all-in-one Voice Agent API, but it's still a direct socket, and you're still not deploying a framework. If you want a longer build, our real-time transcription in Python walkthrough goes step by step.

A couple of details that trip people up: streaming uses speech_model (singular), unlike the pre-recorded API, which takes a speech_models array for fallback routing. And universal-3-5-pro uses punctuation-based turn detection, so end_of_turn_confidence_threshold is a no-op on it. If your session is monolingual, pass language_code to commit the model to one language; leave it off to keep native code-switching.
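Those details fit in two small helpers: one that builds the connection URL with an optional `language_code`, and one that works out how many bytes a chunk of PCM16 audio should contain. This is a sketch, not the official client. The helper names are hypothetical, and the `"en"` value in the usage note is an assumed example of a language code.

```python
from urllib.parse import urlencode

STREAMING_BASE = "wss://streaming.assemblyai.com/v3/ws"

def streaming_url(sample_rate: int = 16000, language_code=None) -> str:
    """Build the streaming STT URL; omit language_code to keep code-switching."""
    params = {"speech_model": "universal-3-5-pro", "sample_rate": sample_rate}
    if language_code is not None:
        params["language_code"] = language_code
    return f"{STREAMING_BASE}?{urlencode(params)}"

def pcm16_chunk_bytes(sample_rate: int = 16000, chunk_ms: int = 50) -> int:
    """Bytes in one chunk of mono PCM16: 2 bytes per sample."""
    return sample_rate * 2 * chunk_ms // 1000
```

With the defaults, `pcm16_chunk_bytes()` is 1,600 bytes per 50 ms frame, and `streaming_url(language_code="en")` appends the extra parameter for a monolingual session.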

Does this hold up at enterprise scale?

A fair pushback: "single API is fine for a demo, but does it survive production?" This is where collapsing the pipeline actually helps instead of hurts.

Fewer moving parts, fewer failure modes. A three-vendor cascade has three things that can rate-limit you, three that can have an incident, and three latency budgets that compound. One connection has one of each. When you're the one carrying the pager, that math matters more than any benchmark.

Concurrency that scales with you. Pay-as-you-go accounts start at 100 new streams per minute with no hard cap on concurrent sessions, and capacity scales up automatically: roughly a 10% bump every 60 seconds once you cross 70% utilization. AssemblyAI runs millions of hours of audio and 600M+ inference calls a month, with unlimited concurrency on the platform.
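For capacity planning, it helps to see how fast that auto-scaling could grow a limit. The sketch below assumes the 10% bump compounds on the current limit and that load stays above the 70% threshold the whole time. Both are assumptions about behavior the figures above don't spell out, so treat the result as a rough estimate.

```python
def minutes_to_reach(target_per_min: float,
                     start: float = 100.0,
                     bump: float = 0.10) -> int:
    """Minutes of sustained load until the new-stream limit reaches the target.

    Assumes one compounding bump per minute (an assumption, see above).
    """
    if bump <= 0:
        raise ValueError("bump must be positive")
    capacity = start
    minutes = 0
    while capacity < target_per_min:
        capacity *= 1 + bump
        minutes += 1
    return minutes

print(minutes_to_reach(1000))
```

Under these assumptions a tenfold increase, from 100 to 1,000 new streams per minute, takes 25 minutes of sustained load, which is worth knowing before a launch with a sharp traffic spike.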

Pricing you can forecast. The Voice Agent API is a flat $4.50/hr that bundles STT, the LLM, and TTS. There's no separate per-vendor metering to model out, which makes capacity planning a spreadsheet instead of a guessing game. (Prefer to bring your own LLM and TTS? The streaming STT layer is billed separately on its own pricing: $0.45/hr for Universal-3.5 Pro Realtime, with context and keyterm prompting included.)
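The spreadsheet version is short enough to write as code. This sketch uses the two hourly rates quoted above and assumes billing is proportional to session duration; the call volume and average call length are hypothetical inputs, and the STT-only figure excludes whatever you pay for your own LLM and TTS.

```python
VOICE_AGENT_RATE_PER_HR = 4.50    # bundled STT + LLM + TTS
STREAMING_STT_RATE_PER_HR = 0.45  # STT only; LLM and TTS billed elsewhere

def monthly_cost(calls_per_month: int,
                 avg_call_minutes: float,
                 rate_per_hr: float) -> float:
    """Estimated monthly spend in dollars, assuming duration-based billing."""
    hours = calls_per_month * avg_call_minutes / 60
    return round(hours * rate_per_hr, 2)

# Hypothetical workload: 10,000 calls a month averaging 3 minutes each.
print(monthly_cost(10_000, 3, VOICE_AGENT_RATE_PER_HR))
print(monthly_cost(10_000, 3, STREAMING_STT_RATE_PER_HR))
```

That workload is 500 audio hours, so the bundled path comes to $2,250 a month and the STT-only layer to $225 before the other vendors' charges.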

The compliance and residency boxes. SOC 2 Type 2, EU data residency via streaming.eu.assemblyai.com at the same price as the US, and a self-hosted option if the audio can't leave your VPC. For teams handling protected health information (PHI), AssemblyAI is a business associate under HIPAA and offers a Business Associate Addendum (BAA) you can sign in minutes, without a sales call.

None of this is a reason to avoid a framework — plenty of large deployments run Pipecat or LiveKit happily. It's a reason to know that the framework-free path isn't a toy. For an honest look at where any voice agent architecture starts to strain, see where voice agent stacks start showing their limits.

When you should use Pipecat or LiveKit

Skipping the framework is the right call for a large share of voice agents — especially phone agents, support bots, and anything where the job is a clean speech-to-speech loop. But not all of them. Reach for Pipecat or LiveKit when:

You need real WebRTC features — multi-party rooms, video alongside voice, screen share, or polished mobile and web client SDKs. This is LiveKit's core strength, and a WebSocket bridge won't replicate it.
You want best-of-breed per stage — a specific TTS voice, a particular LLM, or a custom turn-detection model, mixed and matched per use case. Frameworks are built for exactly this kind of swapping.
You already run one in production. If Pipecat or LiveKit is in your stack, don't rip it out. AssemblyAI ships first-class plugins for both, so you can run Universal-3.5 Pro Realtime as your STT inside the framework you already trust and get the same accuracy either way.
You need fine-grained control over the pipeline graph — custom processors, intricate branching, frame-level manipulation.

The decision is less "framework vs. no framework" and more "how much of the pipeline do I actually need to control."

Choosing your path

You want to…	Use	Why
Ship a phone or web voice agent fast	Voice Agent API (direct WebSocket)	One socket for STT + LLM + TTS; flat $4.50/hr; no framework to deploy
Keep your own LLM/TTS, upgrade only STT	Streaming STT (direct WebSocket)	Swap one piece; you keep the loop
Build multi-party rooms or add video	LiveKit Agents + AssemblyAI plugin	WebRTC features a socket bridge can't match
Mix-and-match vendors with fine control	Pipecat + AssemblyAI plugin	Framework-grade pipeline control
Add a phone number to any of the above	Your telephony provider (e.g. Twilio)	Carrier terminates SIP/PSTN, hands you a WebSocket

The takeaway

The voice agent ecosystem defaulted to orchestration frameworks for a sound reason: for years, building an agent meant gluing three vendors together, and that glue is real engineering. Frameworks made the glue manageable.

But the default quietly assumes the pipeline has to be multi-vendor. The moment one API owns speech-to-text, the LLM, and text-to-speech behind a single WebSocket — and your carrier hands you telephony audio over a WebSocket too — the orchestration layer doesn't get easier. It disappears. What's left is a socket on each side and a few lines to forward audio between them.

So before you add Pipecat or LiveKit to your next voice agent, ask the two questions that actually matter: what's moving my audio, and is my pipeline really multi-vendor? If the answers are "a WebSocket" and "no," you've already got your architecture.

Build your voice agent today

Get a free API key and $50 in credits to ship your first voice agent — one WebSocket for speech-to-text, the LLM, and text-to-speech, at a flat $4.50/hr, with no framework to deploy.

Frequently asked questions

Do you need Pipecat or LiveKit to build a voice agent?

No. Pipecat and LiveKit orchestrate a multi-vendor pipeline — they wire together separate speech-to-text, LLM, and text-to-speech services and manage turn-taking and transport between them. If a single API handles all three, like AssemblyAI's Voice Agent API does over one WebSocket, there's nothing left to orchestrate. Reach for a framework when you need WebRTC features like multi-party rooms or video, or fine-grained control over each pipeline stage.

Do voice agents use SIP or WebSocket for audio transport?

Voice agents can move audio over WebSocket, WebRTC, or SIP, and the three solve different problems. AssemblyAI takes audio over a plain WebSocket — no media server or SFU required. For phone calls you don't implement SIP yourself: your telephony provider (such as Twilio) terminates the PSTN and SIP side and hands you the call's audio over a WebSocket, so the whole agent ends up being WebSocket on both ends.

What's the difference between the AssemblyAI Voice Agent API and a framework like Pipecat or LiveKit?

An orchestration framework connects separate STT, LLM, and TTS vendors and coordinates turn-taking, interruptions, and transport across them. The Voice Agent API replaces those three providers with one WebSocket connection — one bill, one set of logs, and one latency budget instead of three. Frameworks give you per-stage swappability and deep pipeline control; the single API gives you fewer moving parts. You don't have to choose blindly, either: AssemblyAI ships plugins for both Pipecat and LiveKit, so you can run Universal-3.5 Pro Realtime inside whichever framework you already use.

Can you build a Twilio voice agent without a framework?

Yes. Twilio terminates the phone call and streams G.711 μ-law audio to your server over a Media Streams WebSocket, and your server bridges that audio to the Voice Agent API, which accepts μ-law (audio/pcmu) natively with zero transcoding. The whole bridge is a webhook that returns TwiML plus a few lines that forward audio in each direction — no SIP stack, no media server, and no orchestration framework. AssemblyAI maintains a complete Twilio example repo with inbound calling, outbound calling, and tool handling.

Does a voice agent without an orchestration framework scale to production?

Yes. AssemblyAI's streaming platform starts at 100 new streams per minute on pay-as-you-go with no hard cap on concurrent sessions, and capacity scales up automatically under load. Collapsing speech-to-text, the LLM, and text-to-speech into one connection actually reduces operational risk at scale — there are fewer vendors to rate-limit you, fewer services that can have an incident, and fewer latency budgets that stack. Pricing is a flat $4.50/hr, with EU data residency and a self-hosted deployment option for teams with strict data requirements.

How do I build a voice agent without Pipecat or LiveKit?

Create an agent with a single REST call to define its prompt, voice, and tools, then open a WebSocket to wss://agents.assemblyai.com/v1/ws and bind it by agent_id in your first session.update message. From there you stream PCM audio in, play the agent's audio back out, and handle barge-in when the input.speech.started event fires. The same connection and event protocol work from a browser (using a short-lived token), from your own server, or over the phone through Twilio.