Migrating from OpenAI Realtime API to AssemblyAI Voice Agent API
OpenAI Realtime API migration guide: learn how to move to AssemblyAI Voice Agent API with less session code, simpler audio streaming, and tool support.



This guide walks you through migrating a voice agent from the OpenAI Realtime API to AssemblyAI's Voice Agent API. You'll replace manual session management, audio buffer handling, and ephemeral token generation with a cleaner WebSocket interface that manages the full voice pipeline through a single connection.
The migration covers four areas: authentication setup, session configuration, audio streaming, and tool migration. Each section shows the OpenAI implementation alongside the AssemblyAI equivalent so you can see exactly what changes and what stays the same. You'll need Python 3.8+, an AssemblyAI API key, and working familiarity with WebSockets and async Python. Your existing business logic and function definitions transfer directly.
What is the OpenAI Realtime API?
The OpenAI Realtime API is a WebSocket-based interface for building voice applications on OpenAI's audio models—specifically gpt-realtime and gpt-realtime-mini. Instead of chaining together separate speech-to-text, LLM, and text-to-speech services, it processes audio directly as a single multimodal input and output.
That word "multimodal" is important. It means the same underlying model handles audio, text, and images—voice is one capability among many, not the primary focus.
This distinction matters when you hit production. A model designed to do everything doesn't always do any one thing as well as a model built specifically for it. Speech accuracy is where you'll feel that most.
How the OpenAI Realtime API works: sessions and events
The Realtime API is event-driven. You and the server exchange JSON messages over a persistent WebSocket connection, and a "session" holds the state of your entire conversation.
Think of a session like a phone call—it stays open, remembers what was said, and closes when you hang up. Every action you take sends a specific event type over that connection.
Here are the core events you'll work with:
- session.update — configure instructions, audio settings, and tools for the session.
- input_audio_buffer.append — stream base64-encoded microphone audio into the input buffer.
- input_audio_buffer.commit — tell the server the current buffer is ready to process.
- response.create — ask the model to generate a response.
- response.output_audio.delta — receive chunks of the model's spoken reply.
- response.done — the model has finished its turn.
- conversation.item.create — add items to the conversation, such as function call results.
Two more concepts you'll need to understand before writing any code:
- Voice activity detection (VAD): This is how the model knows when you've stopped talking. You can use server_vad (silence-based) or semantic_vad (content-based, meaning it waits for a natural pause in meaning rather than just silence). A minimal sketch of both configs follows this list.
- Function calling: The model can invoke tools you've registered—like looking up a customer record—before generating its spoken response.
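For illustration, here's roughly how the two modes look inside the session's turn_detection field. The server_vad values mirror the full config example later in this guide; the semantic_vad entry is deliberately minimal and its optional tuning parameters are omitted here.

# Silence-based: end the user's turn after a fixed stretch of silence
server_vad = {"turn_detection": {"type": "server_vad", "silence_duration_ms": 200}}

# Content-based: wait for a natural pause in meaning before responding
semantic_vad = {"turn_detection": {"type": "semantic_vad"}}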
Here's the minimal Python code to open a connection and configure a session:
import asyncio
import websockets
import json
import os
from dotenv import load_dotenv

load_dotenv()

async def connect():
    uri = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "type": "realtime",
                "instructions": "You are a helpful assistant.",
                "audio": {
                    "input": {
                        "turn_detection": {"type": "server_vad"}
                    },
                    "output": {"voice": "alloy"}
                }
            }
        }))
        # Listen for the server to confirm
        async for message in ws:
            event = json.loads(message)
            print(f"Received: {event['type']}")
            if event["type"] == "session.created":
                print("Session ready.")
                break

asyncio.run(connect())
This is the foundation every OpenAI Realtime API integration starts with. Everything else—sending audio, handling responses, calling tools—builds on this session.
OpenAI Realtime API production challenges
The OpenAI Realtime API works well for prototypes. But when you move to production, three problems compound quickly.
Token-based pricing is hard to predict. Audio tokens are billed per million tokens for input and output separately. The count changes based on speech pace, pauses, and conversation length—so your monthly bill is a moving target, not a fixed cost.
Session management requires significant boilerplate. In production, you need to handle concurrent event streams, manage reconnections after network interruptions, enforce the 30-minute session timeout, and write retry logic. All of that exists before you write a single line of product logic.
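To make that concrete, here's a rough sketch of the kind of reconnect-and-retry wrapper you end up maintaining yourself. It's illustrative only: the handle_events callback and the backoff values are placeholders, not part of either API.

import asyncio
import websockets

async def run_with_reconnect(uri, headers, handle_events, max_retries=5):
    # Generic reconnect loop with exponential backoff — the kind of plumbing
    # you have to write around a long-lived WebSocket session yourself.
    attempt = 0
    while attempt <= max_retries:
        try:
            async with websockets.connect(uri, extra_headers=headers) as ws:
                attempt = 0  # reset after a successful connection
                await handle_events(ws)  # your session + audio logic goes here
        except (websockets.ConnectionClosed, OSError):
            attempt += 1
            await asyncio.sleep(min(2 ** attempt, 30))  # back off before retrying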
Concurrency limits require quota negotiations. Session limits are tied to account tiers. As your user base grows, you'll need to request higher limits from OpenAI and architect around those ceilings from the start.
AssemblyAI's Voice Agent API was built specifically for these production constraints. One WebSocket, all-inclusive flat-rate pricing at $4.50/hr for the complete voice agent service, and auto-scaling concurrency address each of these friction points directly.
What is AssemblyAI Voice Agent API?
AssemblyAI's Voice Agent API is a single WebSocket API that handles the full voice pipeline—speech understanding, LLM reasoning, and voice generation—in one connection. You don't manage separate STT, LLM, and TTS providers. You connect once, and the infrastructure handles the rest.
The key architectural difference from OpenAI Realtime API: where OpenAI's model is multimodal (voice as one of many capabilities), AssemblyAI's pipeline is purpose-built for speech. Universal-3 Pro Streaming—AssemblyAI's dedicated Voice AI model, ranked #1 on the Hugging Face Open ASR Leaderboard—sits at the foundation. Getting the input right is the whole point, because when speech-to-text misreads a customer's name or account number, every downstream step responds to the wrong thing.
Here's how the two APIs compare side by side:
- Architecture: OpenAI uses one multimodal model (gpt-realtime) where voice is one capability among many; AssemblyAI runs a purpose-built speech pipeline on Universal-3 Pro Streaming.
- Pricing: OpenAI bills per audio token for input and output separately; AssemblyAI charges a flat $4.50/hr for the complete pipeline.
- Session handling: OpenAI requires ephemeral tokens and enforces a 30-minute session timeout; AssemblyAI authenticates with your API key directly and preserves sessions for 30 seconds after a disconnect.
- Concurrency: OpenAI ties session limits to account tiers; AssemblyAI auto-scales.
- Transcripts: AssemblyAI streams transcript.user and transcript.agent events natively; OpenAI doesn't provide full transcripts of both sides.
The "invisible infrastructure" framing here is intentional. You're not building on a platform with its own opinions about your product—you control the conversation design, tool integrations, and agent behavior fully. The infrastructure just works.
Migrate from OpenAI Realtime API to AssemblyAI Voice Agent API
Most migrations take one to two days. You're not rewriting your product logic—you're replacing infrastructure code with simpler infrastructure code. The four steps are: authentication setup, session configuration, audio streaming, and tool migration.
Authentication and environment setup
Before you connect, install the required packages:
pip install websockets pyaudio python-dotenv

Add your AssemblyAI API key to your .env file:

ASSEMBLYAI_API_KEY=your_assemblyai_key_here

Here's where the first simplification happens. OpenAI requires you to generate an ephemeral token as a separate API call before you can open a WebSocket connection:
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Extra step before every connection
response = client.realtime.sessions.create(
    model="gpt-realtime",
    voice="alloy"
)
ephemeral_token = response.client_secret.value
# Token expires — you need to handle that too
With AssemblyAI, you authenticate directly by passing your API key as a Bearer token in the WebSocket connection header. No token generation, no expiration handling:
import asyncio
import websockets
import os
from dotenv import load_dotenv

load_dotenv()

async def connect():
    uri = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}
    async with websockets.connect(uri, additional_headers=headers) as ws:
        print("Connected — ready to configure session.")

asyncio.run(connect())

Removing the ephemeral token step eliminates an entire failure mode. Token expiration mid-session was a real edge case to handle with OpenAI; it doesn't exist here.
Note for browser-based applications: If you're building a client-side voice agent that runs in the browser, you'll need to use AssemblyAI's temporary token flow to avoid exposing your API key. Generate a temporary token server-side using the token endpoint and pass it as a query parameter: wss://agents.assemblyai.com/v1/ws?token=YOUR_TEMP_TOKEN.
Session configuration
With OpenAI, session configuration requires a deeply nested JSON structure. You manually specify audio format, sample rate, VAD parameters, and voice settings—all before anything works:
import json
import asyncio
import websockets
import os

async def configure_openai_session():
    uri = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:
        session_config = {
            "type": "session.update",
            "session": {
                "instructions": "You are a helpful customer support agent.",
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "turn_detection": {
                            "type": "server_vad",
                            "threshold": 0.5,
                            "prefix_padding_ms": 300,
                            "silence_duration_ms": 200
                        }
                    },
                    "output": {
                        "voice": "alloy",
                        "format": {"type": "audio/pcm", "rate": 24000}
                    }
                }
            }
        }
        await ws.send(json.dumps(session_config))
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "session.updated":
                print("Session configured.")
                break

asyncio.run(configure_openai_session())

With AssemblyAI, you send a session.update message after connecting. Audio format defaults to PCM at 24kHz, and turn detection is built in with sensible defaults—you only override what you actually need to change:
import json
import asyncio
import websockets
import os
from dotenv import load_dotenv

load_dotenv()

async def configure_assemblyai_session():
    uri = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}
    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Configure the session — defaults handle audio format and turn detection
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a helpful customer support agent.",
                "greeting": "Hi! How can I help you today?",
                "output": {"voice": "ivy"}
            }
        }))
        # Wait for session.ready before streaming audio
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "session.ready":
                print(f"Session ready — id: {event['session_id']}")
                break

asyncio.run(configure_assemblyai_session())
The difference is more than syntax. With OpenAI, misconfiguring a single audio parameter breaks everything silently. AssemblyAI's defaults are tuned for real-world conversations. Need to adjust turn detection sensitivity? Add turn_detection to the config:
# Optional: customize turn detection
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "input": {
            "turn_detection": {
                "vad_threshold": 0.5,
                "min_silence": 600,
                "max_silence": 1500,
                "interrupt_response": True
            },
            "keyterms": ["AssemblyAI", "Universal-3 Pro"]
        }
    }
}))

You can also send session.update mid-conversation—change the system prompt, add tools, or adjust turn detection without reconnecting.
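For example, here's a sketch of swapping the system prompt mid-call. It assumes ws is the same open connection from the configuration example above and that partial updates merge with the existing session config.

# Mid-conversation update — no reconnect needed
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "system_prompt": "You are now handling a billing escalation. Be concise."
    }
}))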
Audio streaming and event handling
This is where the complexity gap is most visible. OpenAI requires you to manually encode audio as base64, manage the input buffer, commit it at the right time, and decode the audio response back to bytes:
import base64
import pyaudio
import asyncio
import websockets
import json
import os

async def openai_audio_loop(ws):
    p = pyaudio.PyAudio()
    # Input stream
    input_stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=24000,
        input=True,
        frames_per_buffer=4800
    )
    # Output stream
    output_stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=24000,
        output=True
    )

    async def send_audio():
        while True:
            chunk = input_stream.read(4800, exception_on_overflow=False)
            # Must encode as base64 before sending
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("utf-8")
            }))
            await asyncio.sleep(0.1)

    async def receive_audio():
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.output_audio.delta":
                # Must decode from base64 before playing
                audio_bytes = base64.b64decode(event["delta"])
                output_stream.write(audio_bytes)
            elif event["type"] == "response.done":
                print("Response complete.")

    # Run both concurrently
    await asyncio.gather(send_audio(), receive_audio())

With AssemblyAI, you still work with WebSocket events and base64 audio—but the event types are cleaner and there's no buffer management step. You send input.audio events and receive reply.audio events, plus you get transcript events for both sides of the conversation:
import base64
import pyaudio
import asyncio
import websockets
import json
import os
from dotenv import load_dotenv

load_dotenv()

async def assemblyai_audio_loop():
    uri = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}

    p = pyaudio.PyAudio()
    input_stream = p.open(
        format=pyaudio.paInt16, channels=1, rate=24000,
        input=True, frames_per_buffer=4800
    )
    output_stream = p.open(
        format=pyaudio.paInt16, channels=1, rate=24000,
        output=True
    )

    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a helpful customer support agent.",
                "output": {"voice": "ivy"}
            }
        }))

        ready = False

        async def send_audio():
            nonlocal ready
            while True:
                if ready:
                    chunk = input_stream.read(4800, exception_on_overflow=False)
                    await ws.send(json.dumps({
                        "type": "input.audio",
                        "audio": base64.b64encode(chunk).decode("utf-8")
                    }))
                await asyncio.sleep(0.1)

        async def receive_events():
            nonlocal ready
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "session.ready":
                    ready = True
                    print("Session ready.")
                elif event["type"] == "reply.audio":
                    audio_bytes = base64.b64decode(event["data"])
                    output_stream.write(audio_bytes)
                elif event["type"] == "transcript.user":
                    print(f"User: {event['text']}")
                elif event["type"] == "transcript.agent":
                    print(f"Agent: {event['text']}")
                elif event["type"] == "reply.done":
                    if event.get("status") == "interrupted":
                        print("Agent interrupted by user.")

        await asyncio.gather(send_audio(), receive_events())

asyncio.run(assemblyai_audio_loop())

The event model is simpler: no buffer commit step, no separate response trigger. You stream audio in with input.audio, and the server handles turn detection, generates a response, and streams audio back with reply.audio. You also get transcript.user and transcript.agent events—full text transcripts of both sides of the conversation, which OpenAI doesn't provide natively.
Barge-in is built in too. When a user interrupts, you'll receive reply.done with status: "interrupted"—flush your audio playback buffer and the agent picks up the new turn automatically.
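If you buffer playback yourself, the interrupt handler can be as small as this sketch. Here playback_queue is a hypothetical application-side buffer feeding your output stream, not part of the API.

import queue

# Hypothetical application-side playback buffer feeding the output stream
playback_queue: "queue.Queue[bytes]" = queue.Queue()

def handle_reply_done(event: dict) -> None:
    # On barge-in, drop any queued agent audio so playback stops immediately;
    # the server starts processing the user's new turn on its own.
    if event.get("status") == "interrupted":
        while not playback_queue.empty():
            playback_queue.get_nowait()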
Function calling and tool migration
Both APIs support tool calling, and the JSON Schema definitions are similar enough that your existing tool definitions translate directly. The difference is in how you handle the results.
With OpenAI, you track a call_id for each function call, execute the function, then manually route the result back to the model via a conversation.item.create event:
import json
import asyncio
import websockets
import os

def get_account_info(account_id: str) -> dict:
    # Your business logic
    return {"status": "active", "balance": 1000, "id": account_id}

# Register tools in session config
tools = [{
    "type": "function",
    "name": "get_account_info",
    "description": "Look up account status by ID",
    "parameters": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string", "description": "The customer account ID"}
        },
        "required": ["account_id"]
    }
}]

async def openai_with_tools():
    uri = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Include tools in session config
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a support agent.",
                "tools": tools
            }
        }))
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.done":
                output = event["response"]["output"][0]
                if output.get("type") == "function_call":
                    # Parse arguments
                    args = json.loads(output["arguments"])
                    # Execute your function
                    result = get_account_info(args["account_id"])
                    # Manually route the result back to the model
                    await ws.send(json.dumps({
                        "type": "conversation.item.create",
                        "item": {
                            "type": "function_call_output",
                            "call_id": output["call_id"],  # Must track this
                            "output": json.dumps(result)
                        }
                    }))
                    # Trigger the next response
                    await ws.send(json.dumps({"type": "response.create"}))

asyncio.run(openai_with_tools())

With AssemblyAI, the tool definition format is nearly identical—your JSON Schema definitions transfer directly. The key difference: you accumulate tool results during tool.call events, then send them all back after reply.done. The agent speaks a natural transition phrase while waiting for your results:
import json
import asyncio
import websockets
import os
from dotenv import load_dotenv

load_dotenv()

def get_account_info(account_id: str) -> dict:
    # Your business logic — same function, no changes needed
    return {"status": "active", "balance": 1000, "id": account_id}

# Tool definitions — same JSON Schema format as OpenAI
tools = [{
    "type": "function",
    "name": "get_account_info",
    "description": "Look up account status by ID",
    "parameters": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string", "description": "The customer account ID"}
        },
        "required": ["account_id"]
    }
}]

async def assemblyai_with_tools():
    uri = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}
    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Include tools in session config
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a support agent. Use get_account_info to look up accounts.",
                "output": {"voice": "ivy"},
                "tools": tools
            }
        }))

        pending_tools = []
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "session.ready":
                print("Session ready with tools.")
            elif event["type"] == "tool.call":
                # Execute your function
                name = event["name"]
                args = event.get("arguments", {})
                if name == "get_account_info":
                    result = get_account_info(args["account_id"])
                else:
                    result = {"error": "Unknown tool"}
                # Accumulate — don't send yet
                pending_tools.append({
                    "call_id": event["call_id"],
                    "result": result,
                })
            elif event["type"] == "reply.done":
                if event.get("status") == "interrupted":
                    # User barged in — discard pending results
                    pending_tools.clear()
                elif pending_tools:
                    # Now send all tool results
                    for tool in pending_tools:
                        await ws.send(json.dumps({
                            "type": "tool.result",
                            "call_id": tool["call_id"],
                            "result": json.dumps(tool["result"]),
                        }))
                    pending_tools.clear()
            elif event["type"] == "transcript.user":
                print(f"User: {event['text']}")
            elif event["type"] == "transcript.agent":
                print(f"Agent: {event['text']}")

asyncio.run(assemblyai_with_tools())

Your actual business logic—get_account_info in this case—doesn't change at all. The tool definition JSON Schema is the same format. The difference is in the plumbing: AssemblyAI uses a cleaner accumulate-and-send pattern with tool.call / tool.result events, and the agent automatically speaks a transition phrase ("Let me check that for you") while waiting for results—no dead air.
What changes and what stays the same
Migrating from OpenAI Realtime API to AssemblyAI's Voice Agent API is, in practice, mostly a subtraction exercise. Here's a quick summary:
What you remove: Ephemeral token generation, manual audio buffer management and commit steps, base64 encoding/decoding boilerplate (though you still work with base64, the event model is simpler), the 30+ event types to handle, manual response triggering after tool calls.
What stays the same: Your business logic and function implementations, JSON Schema tool definitions (they transfer directly), the WebSocket + JSON communication model, base64 PCM audio format.
What you gain: Streaming transcripts of both sides of the conversation (transcript.user and transcript.agent), purpose-built speech accuracy from Universal-3 Pro Streaming, built-in turn detection with acoustic + contextual signals, natural barge-in handling, predictable $4.50/hr pricing, session resumption (reconnect within 30 seconds and pick up where you left off), and key terms prompting to boost accuracy on domain-specific vocabulary.
If you're building production voice agents and want speech accuracy, predictable pricing, and auto-scaling concurrency without managing three separate providers, AssemblyAI's Voice Agent API is worth evaluating. It's built on Universal-3 Pro Streaming—the same dedicated Voice AI model that powers AssemblyAI's real-time transcription—and is designed to be invisible infrastructure that stays out of the way while you build the product your users actually interact with.
Frequently asked questions
How does AssemblyAI Voice Agent API pricing compare to OpenAI Realtime API?
AssemblyAI charges a flat rate of $4.50/hour that covers speech understanding, LLM reasoning, and voice generation regardless of conversation density. OpenAI charges per audio token for input and output separately, meaning costs vary based on speech pace, pause frequency, and response length. The flat-rate model makes cost forecasting straightforward—one line of math to model what a 5-minute call costs.
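As a quick sanity check, the flat-rate math for that 5-minute call looks like this; the equivalent OpenAI figure is omitted because it depends on the audio token counts of the specific call.

# Flat-rate cost for a 5-minute call at $4.50/hour
cost = (5 / 60) * 4.50
print(f"${cost:.3f}")  # $0.375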
How long does migrating from OpenAI Realtime API to AssemblyAI Voice Agent API take?
Most migrations take one to two days. The changes are primarily structural—replacing OpenAI's event handling patterns with AssemblyAI's cleaner event model. Your business logic and JSON Schema tool definitions transfer directly without modification.
Can I reuse my existing OpenAI function definitions when migrating?
Yes. Both APIs use JSON Schema for tool definitions, so your existing schemas work as-is. The difference is in the execution flow: OpenAI uses conversation.item.create to return results, while AssemblyAI uses a tool.call / tool.result pattern where you accumulate results and send them after reply.done.
Does AssemblyAI Voice Agent API require an SDK?
No. The Voice Agent API is a standard WebSocket + JSON API. You connect to wss://agents.assemblyai.com/v1/ws, authenticate with a Bearer token, and exchange JSON messages. No SDK to install, no framework to learn. You can read the entire API reference in 10 minutes, and it works natively with coding agents like Claude Code.
What languages does AssemblyAI Voice Agent API support?
The Voice Agent API supports English, Spanish, French, German, Italian, and Portuguese through Universal-3 Pro, with regional dialect recognition across all six languages. Multilingual voices are available for additional languages including Hindi, Mandarin, Russian, Korean, and Japanese.
What happens if the WebSocket connection drops during a conversation?
AssemblyAI preserves sessions for 30 seconds after disconnection. You can reconnect using the session.resume event with the session_id from the original session.ready event, and the conversation picks up where it left off with full context preserved. This eliminates the need for custom reconnection logic.
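A rough sketch of the resume step, based on the description above: after reconnecting within the 30-second window, send a session.resume message carrying the original session_id. The exact field names are an assumption here; check the API reference for the authoritative message shape.

import json

async def resume_session(ws, session_id: str):
    # Ask the server to restore the dropped session's context.
    # session_id comes from the first session.ready event.
    await ws.send(json.dumps({
        "type": "session.resume",
        "session_id": session_id
    }))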