Raw WebSocket voice agent with AssemblyAI Universal-3 Pro Streaming
The simplest possible voice agent — no frameworks, no abstraction layers. Just raw WebSockets, a microphone, and the AssemblyAI Universal-3 Pro Streaming model (u3-rt-pro).
This tutorial shows exactly what LiveKit Agents, Pipecat, and Vapi are doing underneath. If you want full control over every byte, or you're embedding a voice agent into a custom application, start here.
The pipeline
Microphone
│ float32 audio (sounddevice)
▼ convert → int16 PCM
AssemblyAI WebSocket (wss://streaming.assemblyai.com/v3/ws)
│ ?speech_model=u3-rt-pro&encoding=pcm_s16le&sample_rate=16000
│ Turn message (end_of_turn=true) — neural turn detection
▼
OpenAI GPT-4o → text response
▼
ElevenLabs TTS → PCM audio → Speakers (sounddevice)
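The first conversion step in the diagram — float32 samples from sounddevice into int16 PCM — can be sketched as a small stdlib-only helper (the function name is illustrative; sounddevice can also be opened with dtype="int16" to skip this step entirely):

```python
# Convert float32 audio samples in [-1.0, 1.0] (sounddevice's default dtype)
# into little-endian int16 PCM bytes, the format AssemblyAI expects
# for encoding=pcm_s16le.
import struct


def float32_to_pcm16(samples: list[float]) -> bytes:
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clamp to the valid range first
        ints.append(int(s * 32767))  # scale to the int16 range
    return struct.pack(f"<{len(ints)}h", *ints)
```

In practice you would call this inside the sounddevice input callback and forward the returned bytes straight to the WebSocket.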
Prerequisites
- Python 3.11+
- A microphone and speakers
- AssemblyAI API key
- OpenAI API key
- ElevenLabs API key
On macOS, install PortAudio for sounddevice:
brew install portaudio
Quick start
git clone https://github.com/kelseyefoster/voice-agent-websocket-universal-3-pro
cd voice-agent-websocket-universal-3-pro
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your API keys
python agent.py
Speak — partial transcripts appear in real time. When Universal-3 Pro detects an end-of-turn, the agent responds.
How it works
1. WebSocket connection
AAI_WS_URL = (
"wss://streaming.assemblyai.com/v3/ws"
"?speech_model=u3-rt-pro"
"&encoding=pcm_s16le"
"&sample_rate=16000"
"&end_of_turn_confidence_threshold=0.4"
f"&token={ASSEMBLYAI_API_KEY}"
)
2. Message types
AssemblyAI v3 sends three event types:
// Session started
{ "type": "Begin", "id": "session_abc123" }
// Rolling transcript
{ "type": "Turn", "transcript": "how do I", "end_of_turn": false }
// End-of-turn detected — respond now
{ "type": "Turn", "transcript": "how do I get started?", "end_of_turn": true }
// Session closed
{ "type": "Termination" }
3. Sending audio
# Raw PCM bytes — no wrapper, no base64
await ws.send(pcm_bytes)
# Terminate the session cleanly
await ws.send(json.dumps({"type": "Terminate"}))
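Audio is typically sent in small fixed-duration chunks rather than one large buffer. Assuming ~50 ms chunks (a common choice for streaming STT — check AssemblyAI's current limits for the exact allowed range), at 16 kHz mono int16 that works out to 16000 × 0.05 × 2 = 1600 bytes per send:

```python
# Split a PCM byte stream into fixed-size chunks for streaming.
# 50 ms at 16 kHz mono int16 = 1600 bytes per chunk (illustrative value).
CHUNK_BYTES = 1600


def chunk_pcm(pcm: bytes, size: int = CHUNK_BYTES) -> list[bytes]:
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

Each element of the returned list would then be passed to `await ws.send(...)` as shown above.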
Tuning turn detection
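The connection URL above sets end_of_turn_confidence_threshold=0.4. As a rough, illustrative guide: a lower threshold ends turns sooner (snappier agent, more risk of interrupting slow speakers), while a higher threshold waits for more confidence (fewer false turn ends, slightly more latency before the agent replies). Tune against your own audio:

```python
# Illustrative values — substitute into the query string built above.
"&end_of_turn_confidence_threshold=0.2"  # eager: responds fast, may cut speakers off
"&end_of_turn_confidence_threshold=0.7"  # patient: waits longer before responding
```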
Swapping components
# Anthropic Claude instead of GPT-4o
from anthropic import AsyncAnthropic
client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
response = await client.messages.create(model="claude-opus-4-6", max_tokens=150, ...)
# Cartesia instead of ElevenLabs (lower TTS latency)
import cartesia
Resources

