Insights & Use Cases
April 2, 2026

Raw WebSocket voice agent with AssemblyAI Universal-3 Pro Streaming

The simplest possible voice agent — no frameworks, no abstraction layers. Just raw WebSockets, a microphone, and the AssemblyAI Universal-3 Pro Streaming model (u3-rt-pro).


This tutorial shows exactly what LiveKit Agents, Pipecat, and Vapi are doing underneath. If you want full control over every byte, or you're embedding a voice agent into a custom application, start here.

The pipeline

Microphone
    │ float32 audio (sounddevice)
    ▼ convert → int16 PCM
AssemblyAI WebSocket (wss://streaming.assemblyai.com/v3/ws)
    │ ?speech_model=u3-rt-pro&encoding=pcm_s16le&sample_rate=16000
    │ Turn message (end_of_turn=true) — neural turn detection
OpenAI GPT-4o → text response
ElevenLabs TTS → PCM audio → Speakers (sounddevice)

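The first hop in that diagram, float32 frames converted to int16 PCM, can be sketched without dependencies. The helper name and pure-Python loop here are illustrative; in practice sounddevice delivers numpy arrays and `(frames * 32767).astype("int16").tobytes()` does the same job:

```python
import struct

def float32_to_pcm16(samples) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian int16 PCM."""
    # Clamp to the int16 range before packing to avoid overflow on loud input
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Each sample becomes 2 bytes, so a 16 kHz mono stream produces 32,000 bytes per second.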


Prerequisites

  • Python 3.11+
  • A microphone and speakers
  • AssemblyAI API key
  • OpenAI API key
  • ElevenLabs API key

On macOS, install PortAudio for sounddevice:

brew install portaudio

Quick start

git clone https://github.com/kelseyefoster/voice-agent-websocket-universal-3-pro
cd voice-agent-websocket-universal-3-pro

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env with your API keys

python agent.py


Speak — partial transcripts appear in real time. When Universal-3 Pro detects an end-of-turn, the agent responds.

How it works

1. WebSocket connection

AAI_WS_URL = (
    "wss://streaming.assemblyai.com/v3/ws"
    "?speech_model=u3-rt-pro"
    "&encoding=pcm_s16le"
    "&sample_rate=16000"
    "&end_of_turn_confidence_threshold=0.4"
    f"&token={ASSEMBLYAI_API_KEY}"
)

2. Message types

AssemblyAI v3 sends three event types:

// Session started
{ "type": "Begin", "id": "session_abc123" }

// Rolling transcript
{ "type": "Turn", "transcript": "how do I", "end_of_turn": false }

// End-of-turn detected — respond now
{ "type": "Turn", "transcript": "how do I get started?", "end_of_turn": true }

// Session closed
{ "type": "Termination" }

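A receive loop typically dispatches on the `type` field. The `handle_message` helper and the action tuples below are illustrative, not part of any SDK:

```python
import json

def handle_message(raw: str):
    """Map an AssemblyAI v3 event to an action for the agent loop."""
    msg = json.loads(raw)
    t = msg.get("type")
    if t == "Begin":
        return ("session_started", msg.get("id"))
    if t == "Turn":
        if msg.get("end_of_turn"):
            return ("respond", msg["transcript"])  # hand off to the LLM
        return ("partial", msg["transcript"])      # update the live caption
    if t == "Termination":
        return ("closed", None)
    return ("ignored", t)
```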

3. Sending audio

# Raw PCM bytes — no wrapper, no base64
await ws.send(pcm_bytes)

# Terminate the session cleanly
await ws.send(json.dumps({"type": "Terminate"}))

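Audio is usually sent in small fixed-size frames rather than one large buffer; 50 ms is a common choice. A sketch of the frame math, assuming the 16 kHz mono int16 config above (the chunk size is a tuning choice, not a protocol requirement):

```python
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2  # pcm_s16le

def chunk_pcm(pcm: bytes, chunk_ms: int = 50) -> list[bytes]:
    """Split a PCM buffer into chunk_ms-sized frames for streaming."""
    chunk_bytes = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]
```

At these settings each 50 ms frame is 1,600 bytes; one second of audio yields 20 frames.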

Tuning turn detection

  • Lower end_of_turn_confidence_threshold (e.g. 0.3): faster response, more false triggers
  • Higher end_of_turn_confidence_threshold (e.g. 0.6): more patient, better for noisy environments
  • Lower min_turn_silence (e.g. 200 ms): snappier for fast-paced conversation
  • Higher max_turn_silence (e.g. 2000 ms): better for deliberate speech
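
All of these are query parameters on the same streaming URL. A hypothetical fragment (the threshold and silence values are illustrative, not recommendations):

```python
from urllib.parse import urlencode

turn_params = {
    "end_of_turn_confidence_threshold": 0.3,  # snappier end-of-turn
    "min_turn_silence": 200,   # ms
    "max_turn_silence": 2000,  # ms
}
query = urlencode(turn_params)  # append to the base wss:// URL with "&"
```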

Swapping components

# Anthropic Claude instead of GPT-4o
from anthropic import AsyncAnthropic
client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
response = await client.messages.create(model="claude-opus-4-6", max_tokens=150, ...)

# Cartesia instead of ElevenLabs (lower TTS latency)
import cartesia

