Insights & Use Cases
April 2, 2026

Raw WebSocket voice agent with AssemblyAI Universal-3 Pro Streaming

The simplest possible voice agent — no frameworks, no abstraction layers. Just raw WebSockets, a microphone, and the AssemblyAI Universal-3 Pro Streaming model (u3-rt-pro).


This tutorial shows exactly what LiveKit Agents, Pipecat, and Vapi are doing underneath. If you want full control over every byte, or you're embedding a voice agent into a custom application, start here.

The pipeline

Microphone
    │ float32 audio (sounddevice)
    ▼ convert → int16 PCM
AssemblyAI WebSocket (wss://streaming.assemblyai.com/v3/ws)
    │ ?speech_model=u3-rt-pro&encoding=pcm_s16le&sample_rate=16000
    │ Turn message (end_of_turn=true) — neural turn detection
OpenAI GPT-4o → text response
ElevenLabs TTS → PCM audio → Speakers (sounddevice)

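The first hop in that diagram, float32 frames converted to int16 PCM, can be sketched without dependencies. The helper name and pure-Python loop here are illustrative; in practice sounddevice delivers numpy arrays and `(frames * 32767).astype("int16").tobytes()` does the same job:

```python
import struct

def float32_to_pcm16(samples) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian int16 PCM."""
    # Clamp to the int16 range before packing to avoid overflow on loud input
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Each sample becomes 2 bytes, so a 16 kHz mono stream produces 32,000 bytes per second.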


Prerequisites

  • Python 3.11+
  • A microphone and speakers
  • AssemblyAI API key
  • OpenAI API key
  • ElevenLabs API key

On macOS, install PortAudio for sounddevice:

brew install portaudio

Quick start

git clone https://github.com/kelseyefoster/voice-agent-websocket-universal-3-pro
cd voice-agent-websocket-universal-3-pro

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env with your API keys

python agent.py


Speak — partial transcripts appear in real time. When Universal-3 Pro detects an end-of-turn, the agent responds.

How it works

1. WebSocket connection

AAI_WS_URL = (
    "wss://streaming.assemblyai.com/v3/ws"
    "?speech_model=u3-rt-pro"
    "&encoding=pcm_s16le"
    "&sample_rate=16000"
    "&end_of_turn_confidence_threshold=0.4"
    f"&token={ASSEMBLYAI_API_KEY}"
)

2. Message types

AssemblyAI v3 sends three event types:

// Session started
{ "type": "Begin", "id": "session_abc123" }

// Rolling transcript
{ "type": "Turn", "transcript": "how do I", "end_of_turn": false }

// End-of-turn detected — respond now
{ "type": "Turn", "transcript": "how do I get started?", "end_of_turn": true }

// Session closed
{ "type": "Termination" }

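A receive loop typically dispatches on the `type` field. The `handle_message` helper and the action tuples below are illustrative, not part of any SDK:

```python
import json

def handle_message(raw: str):
    """Map an AssemblyAI v3 event to an action for the agent loop."""
    msg = json.loads(raw)
    t = msg.get("type")
    if t == "Begin":
        return ("session_started", msg.get("id"))
    if t == "Turn":
        if msg.get("end_of_turn"):
            return ("respond", msg["transcript"])  # hand off to the LLM
        return ("partial", msg["transcript"])      # update the live caption
    if t == "Termination":
        return ("closed", None)
    return ("ignored", t)
```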

3. Sending audio

# Raw PCM bytes — no wrapper, no base64
await ws.send(pcm_bytes)

# Terminate the session cleanly
await ws.send(json.dumps({"type": "Terminate"}))

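Audio is usually sent in small fixed-size frames rather than one large buffer; 50 ms is a common choice. A sketch of the frame math, assuming the 16 kHz mono int16 config above (the chunk size is a tuning choice, not a protocol requirement):

```python
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2  # pcm_s16le

def chunk_pcm(pcm: bytes, chunk_ms: int = 50) -> list[bytes]:
    """Split a PCM buffer into chunk_ms-sized frames for streaming."""
    chunk_bytes = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]
```

At these settings each 50 ms frame is 1,600 bytes; one second of audio yields 20 frames.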

Tuning turn detection

  • Lower end_of_turn_confidence_threshold (e.g. 0.3): faster response, more false triggers
  • Higher end_of_turn_confidence_threshold (e.g. 0.6): more patient, better for noisy environments
  • Lower min_turn_silence (e.g. 200 ms): snappier for fast-paced conversation
  • Higher max_turn_silence (e.g. 2000 ms): better for deliberate speech
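
All of these are query parameters on the same streaming URL. A hypothetical fragment (the threshold and silence values are illustrative, not recommendations):

```python
from urllib.parse import urlencode

turn_params = {
    "end_of_turn_confidence_threshold": 0.3,  # snappier end-of-turn
    "min_turn_silence": 200,   # ms
    "max_turn_silence": 2000,  # ms
}
query = urlencode(turn_params)  # append to the base wss:// URL with "&"
```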

Swapping components

# Anthropic Claude instead of GPT-4o
from anthropic import AsyncAnthropic
client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
response = await client.messages.create(model="claude-opus-4-6", max_tokens=150, ...)

# Cartesia instead of ElevenLabs (lower TTS latency)
import cartesia

