Migrating from OpenAI Realtime API to AssemblyAI Voice Agent API
OpenAI Realtime API migration guide: learn how to move to AssemblyAI Voice Agent API with less session code, simpler audio streaming, and tool support.



This guide walks you through migrating a voice agent from the OpenAI Realtime API to AssemblyAI's Voice Agent API. You'll replace manual session management, audio buffer handling, and ephemeral token generation with a cleaner WebSocket interface that manages the full voice pipeline through a single connection.
The migration covers four areas: authentication setup, session configuration, audio streaming, and tool migration. Each section shows the OpenAI implementation alongside the AssemblyAI equivalent so you can see exactly what changes and what stays the same. You'll need Python 3.8+, an AssemblyAI API key, and working familiarity with WebSockets and async Python. Your existing business logic and function definitions transfer directly.
What is the OpenAI Realtime API?
The OpenAI Realtime API is a WebSocket-based interface for building voice applications on OpenAI's audio models—specifically gpt-realtime and gpt-realtime-mini. Instead of chaining together separate speech-to-text, LLM, and text-to-speech services, it processes audio directly as a single multimodal input and output.
That word "multimodal" is important. It means the same underlying model handles audio, text, and images—voice is one capability among many, not the primary focus.
This distinction matters when you hit production. A model designed to do everything doesn't always do any one thing as well as a model built specifically for it. Speech accuracy is where you'll feel that most.
How the OpenAI Realtime API works: sessions and events
The Realtime API is event-driven. You and the server exchange JSON messages over a persistent WebSocket connection, and a "session" holds the state of your entire conversation.
Think of a session like a phone call—it stays open, remembers what was said, and closes when you hang up. Every action you take sends a specific event type over that connection.
Here are the core events you'll work with:
- session.update — configure instructions, audio settings, and tools for the session.
- input_audio_buffer.append — stream base64-encoded microphone audio into the input buffer.
- input_audio_buffer.commit — tell the server the current buffer is ready to process.
- response.create — ask the model to generate a response.
- response.output_audio.delta — receive chunks of the model's spoken reply.
- response.done — the model has finished its turn.
- conversation.item.create — add items to the conversation, such as function call results.
Two more concepts you'll need to understand before writing any code:
- Voice activity detection (VAD): This is how the model knows when you've stopped talking. You can use server_vad (silence-based) or semantic_vad (content-based, meaning it waits for a natural pause in meaning rather than just silence). A minimal sketch of both configs follows this list.
- Function calling: The model can invoke tools you've registered—like looking up a customer record—before generating its spoken response.
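For illustration, here's roughly how the two modes look inside the session's turn_detection field. The server_vad values mirror the full config example later in this guide; the semantic_vad entry is deliberately minimal and its optional tuning parameters are omitted here.

# Silence-based: end the user's turn after a fixed stretch of silence
server_vad = {"turn_detection": {"type": "server_vad", "silence_duration_ms": 200}}

# Content-based: wait for a natural pause in meaning before responding
semantic_vad = {"turn_detection": {"type": "semantic_vad"}}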
Here's the minimal Python code to open a connection and configure a session:
import asyncio
import websockets
import json
import os
from dotenv import load_dotenv

load_dotenv()

async def connect():
    uri = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "type": "realtime",
                "instructions": "You are a helpful assistant.",
                "audio": {
                    "input": {
                        "turn_detection": {"type": "server_vad"}
                    },
                    "output": {"voice": "alloy"}
                }
            }
        }))
        # Listen for the server to confirm
        async for message in ws:
            event = json.loads(message)
            print(f"Received: {event['type']}")
            if event["type"] == "session.created":
                print("Session ready.")
                break

asyncio.run(connect())
This is the foundation every OpenAI Realtime API integration starts with. Everything else—sending audio, handling responses, calling tools—builds on this session.
OpenAI Realtime API production challenges
The OpenAI Realtime API works well for prototypes. But when you move to production, three problems compound quickly.
Token-based pricing is hard to predict. Audio tokens are billed per million tokens for input and output separately. The count changes based on speech pace, pauses, and conversation length—so your monthly bill is a moving target, not a fixed cost.
Session management requires significant boilerplate. In production, you need to handle concurrent event streams, manage reconnections after network interruptions, enforce the 30-minute session timeout, and write retry logic. All of that exists before you write a single line of product logic.
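To make that concrete, here's a rough sketch of the kind of reconnect-and-retry wrapper you end up maintaining yourself. It's illustrative only: the handle_events callback and the backoff values are placeholders, not part of either API.

import asyncio
import websockets

async def run_with_reconnect(uri, headers, handle_events, max_retries=5):
    # Generic reconnect loop with exponential backoff — the kind of plumbing
    # you have to write around a long-lived WebSocket session yourself.
    attempt = 0
    while attempt <= max_retries:
        try:
            async with websockets.connect(uri, extra_headers=headers) as ws:
                attempt = 0  # reset after a successful connection
                await handle_events(ws)  # your session + audio logic goes here
        except (websockets.ConnectionClosed, OSError):
            attempt += 1
            await asyncio.sleep(min(2 ** attempt, 30))  # back off before retrying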
Concurrency limits require quota negotiations. Session limits are tied to account tiers. As your user base grows, you'll need to request higher limits from OpenAI and architect around those ceilings from the start.
AssemblyAI's Voice Agent API was built specifically for these production constraints. One WebSocket, all-inclusive flat-rate pricing at $4.50/hr for the complete voice agent service, and auto-scaling concurrency address each of these friction points directly.
What is AssemblyAI Voice Agent API?
AssemblyAI's Voice Agent API is a single WebSocket API that handles the full voice pipeline—speech understanding, LLM reasoning, and voice generation—in one connection. You don't manage separate STT, LLM, and TTS providers. You connect once, and the infrastructure handles the rest.
The key architectural difference from OpenAI Realtime API: where OpenAI's model is multimodal (voice as one of many capabilities), AssemblyAI's pipeline is purpose-built for speech. Universal-3 Pro Streaming—AssemblyAI's dedicated Voice AI model, ranked #1 on the Hugging Face Open ASR Leaderboard—sits at the foundation. Getting the input right is the whole point, because when speech-to-text misreads a customer's name or account number, every downstream step responds to the wrong thing.
Here's how the two APIs compare side by side:
- Architecture: OpenAI uses one multimodal model (gpt-realtime) where voice is one capability among many; AssemblyAI runs a purpose-built speech pipeline on Universal-3 Pro Streaming.
- Pricing: OpenAI bills per audio token for input and output separately; AssemblyAI charges a flat $4.50/hr for the complete pipeline.
- Session handling: OpenAI requires ephemeral tokens and enforces a 30-minute session timeout; AssemblyAI authenticates with your API key directly and preserves sessions for 30 seconds after a disconnect.
- Concurrency: OpenAI ties session limits to account tiers; AssemblyAI auto-scales.
- Transcripts: AssemblyAI streams transcript.user and transcript.agent events natively; OpenAI doesn't provide full transcripts of both sides.
The "invisible infrastructure" framing here is intentional. You're not building on a platform with its own opinions about your product—you control the conversation design, tool integrations, and agent behavior fully. The infrastructure just works.
Migrate from OpenAI Realtime API to AssemblyAI Voice Agent API
Most migrations take one to two days. You're not rewriting your product logic—you're replacing infrastructure code with simpler infrastructure code. The four steps are: authentication setup, session configuration, audio streaming, and tool migration.
Authentication and environment setup
Before you connect, install the required packages:
pip install websockets pyaudio python-dotenv

Add your AssemblyAI API key to your .env file:

ASSEMBLYAI_API_KEY=your_assemblyai_key_here

Here's where the first simplification happens. OpenAI requires you to generate an ephemeral token as a separate API call before you can open a WebSocket connection:
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Extra step before every connection
response = client.realtime.sessions.create(
    model="gpt-realtime",
    voice="alloy"
)
ephemeral_token = response.client_secret.value
# Token expires — you need to handle that too
With AssemblyAI, you authenticate directly by passing your API key as a Bearer token in the WebSocket connection header. No token generation, no expiration handling:
import asyncio
import websockets
import os
from dotenv import load_dotenv

load_dotenv()

async def connect():
    uri = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}
    async with websockets.connect(uri, additional_headers=headers) as ws:
        print("Connected — ready to configure session.")

asyncio.run(connect())

Removing the ephemeral token step eliminates an entire failure mode. Token expiration mid-session was a real edge case to handle with OpenAI; it doesn't exist here.
Note for browser-based applications: If you're building a client-side voice agent that runs in the browser, you'll need to use AssemblyAI's temporary token flow to avoid exposing your API key. Generate a temporary token server-side using the token endpoint and pass it as a query parameter: wss://agents.assemblyai.com/v1/ws?token=YOUR_TEMP_TOKEN.
Session configuration
With OpenAI, session configuration requires a deeply nested JSON structure. You manually specify audio format, sample rate, VAD parameters, and voice settings—all before anything works:
import json
import asyncio
import websockets
import os

async def configure_openai_session():
    uri = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:
        session_config = {
            "type": "session.update",
            "session": {
                "instructions": "You are a helpful customer support agent.",
                "audio": {
                    "input": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "turn_detection": {
                            "type": "server_vad",
                            "threshold": 0.5,
                            "prefix_padding_ms": 300,
                            "silence_duration_ms": 200
                        }
                    },
                    "output": {
                        "voice": "alloy",
                        "format": {"type": "audio/pcm", "rate": 24000}
                    }
                }
            }
        }
        await ws.send(json.dumps(session_config))
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "session.updated":
                print("Session configured.")
                break

asyncio.run(configure_openai_session())

With AssemblyAI, you send a session.update message after connecting. Audio format defaults to PCM at 24kHz, and turn detection is built in with sensible defaults—you only override what you actually need to change:
import json
import asyncio
import websockets
import os
from dotenv import load_dotenv

load_dotenv()

async def configure_assemblyai_session():
    uri = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}
    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Configure the session — defaults handle audio format and turn detection
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a helpful customer support agent.",
                "greeting": "Hi! How can I help you today?",
                "output": {"voice": "ivy"}
            }
        }))
        # Wait for session.ready before streaming audio
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "session.ready":
                print(f"Session ready — id: {event['session_id']}")
                break

asyncio.run(configure_assemblyai_session())
The difference is more than syntax. With OpenAI, misconfiguring a single audio parameter breaks everything silently. AssemblyAI's defaults are tuned for real-world conversations. Need to adjust turn detection sensitivity? Add turn_detection to the config:
# Optional: customize turn detection
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "input": {
            "turn_detection": {
                "vad_threshold": 0.5,
                "min_silence": 600,
                "max_silence": 1500,
                "interrupt_response": True
            },
            "keyterms": ["AssemblyAI", "Universal-3 Pro"]
        }
    }
}))

You can also send session.update mid-conversation—change the system prompt, add tools, or adjust turn detection without reconnecting.
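For example, here's a sketch of swapping the system prompt mid-call. It assumes ws is the same open connection from the configuration example above and that partial updates merge with the existing session config.

# Mid-conversation update — no reconnect needed
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "system_prompt": "You are now handling a billing escalation. Be concise."
    }
}))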
Audio streaming and event handling
This is where the complexity gap is most visible. OpenAI requires you to manually encode audio as base64, manage the input buffer, commit it at the right time, and decode the audio response back to bytes:
import base64
import pyaudio
import asyncio
import websockets
import json
import os

async def openai_audio_loop(ws):
    p = pyaudio.PyAudio()
    # Input stream
    input_stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=24000,
        input=True,
        frames_per_buffer=4800
    )
    # Output stream
    output_stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=24000,
        output=True
    )

    async def send_audio():
        while True:
            chunk = input_stream.read(4800, exception_on_overflow=False)
            # Must encode as base64 before sending
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("utf-8")
            }))
            await asyncio.sleep(0.1)

    async def receive_audio():
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.output_audio.delta":
                # Must decode from base64 before playing
                audio_bytes = base64.b64decode(event["delta"])
                output_stream.write(audio_bytes)
            elif event["type"] == "response.done":
                print("Response complete.")

    # Run both concurrently
    await asyncio.gather(send_audio(), receive_audio())

With AssemblyAI, you still work with WebSocket events and base64 audio—but the event types are cleaner and there's no buffer management step. You send input.audio events and receive reply.audio events, plus you get transcript events for both sides of the conversation:
import base64
import pyaudio
import asyncio
import websockets
import json
import os
from dotenv import load_dotenv

load_dotenv()

async def assemblyai_audio_loop():
    uri = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}

    p = pyaudio.PyAudio()
    input_stream = p.open(
        format=pyaudio.paInt16, channels=1, rate=24000,
        input=True, frames_per_buffer=4800
    )
    output_stream = p.open(
        format=pyaudio.paInt16, channels=1, rate=24000,
        output=True
    )

    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a helpful customer support agent.",
                "output": {"voice": "ivy"}
            }
        }))

        ready = False

        async def send_audio():
            nonlocal ready
            while True:
                if ready:
                    chunk = input_stream.read(4800, exception_on_overflow=False)
                    await ws.send(json.dumps({
                        "type": "input.audio",
                        "audio": base64.b64encode(chunk).decode("utf-8")
                    }))
                await asyncio.sleep(0.1)

        async def receive_events():
            nonlocal ready
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "session.ready":
                    ready = True
                    print("Session ready.")
                elif event["type"] == "reply.audio":
                    audio_bytes = base64.b64decode(event["data"])
                    output_stream.write(audio_bytes)
                elif event["type"] == "transcript.user":
                    print(f"User: {event['text']}")
                elif event["type"] == "transcript.agent":
                    print(f"Agent: {event['text']}")
                elif event["type"] == "reply.done":
                    if event.get("status") == "interrupted":
                        print("Agent interrupted by user.")

        await asyncio.gather(send_audio(), receive_events())

asyncio.run(assemblyai_audio_loop())

The event model is simpler: no buffer commit step, no separate response trigger. You stream audio in with input.audio, and the server handles turn detection, generates a response, and streams audio back with reply.audio. You also get transcript.user and transcript.agent events—full text transcripts of both sides of the conversation, which OpenAI doesn't provide natively.
Barge-in is built in too. When a user interrupts, you'll receive reply.done with status: "interrupted"—flush your audio playback buffer and the agent picks up the new turn automatically.
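If you buffer playback yourself, the interrupt handler can be as small as this sketch. Here playback_queue is a hypothetical application-side buffer feeding your output stream, not part of the API.

import queue

# Hypothetical application-side playback buffer feeding the output stream
playback_queue: "queue.Queue[bytes]" = queue.Queue()

def handle_reply_done(event: dict) -> None:
    # On barge-in, drop any queued agent audio so playback stops immediately;
    # the server starts processing the user's new turn on its own.
    if event.get("status") == "interrupted":
        while not playback_queue.empty():
            playback_queue.get_nowait()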
Function calling and tool migration
Both APIs support tool calling, and the JSON Schema definitions are similar enough that your existing tool definitions translate directly. The difference is in how you handle the results.
With OpenAI, you track a call_id for each function call, execute the function, then manually route the result back to the model via a conversation.item.create event:
import json
import asyncio
import websockets
import os

def get_account_info(account_id: str) -> dict:
    # Your business logic
    return {"status": "active", "balance": 1000, "id": account_id}

# Register tools in session config
tools = [{
    "type": "function",
    "name": "get_account_info",
    "description": "Look up account status by ID",
    "parameters": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string", "description": "The customer account ID"}
        },
        "required": ["account_id"]
    }
}]

async def openai_with_tools():
    uri = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Include tools in session config
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a support agent.",
                "tools": tools
            }
        }))
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.done":
                output = event["response"]["output"][0]
                if output.get("type") == "function_call":
                    # Parse arguments
                    args = json.loads(output["arguments"])
                    # Execute your function
                    result = get_account_info(args["account_id"])
                    # Manually route the result back to the model
                    await ws.send(json.dumps({
                        "type": "conversation.item.create",
                        "item": {
                            "type": "function_call_output",
                            "call_id": output["call_id"],  # Must track this
                            "output": json.dumps(result)
                        }
                    }))
                    # Trigger the next response
                    await ws.send(json.dumps({"type": "response.create"}))

asyncio.run(openai_with_tools())

With AssemblyAI, the tool definition format is nearly identical—your JSON Schema definitions transfer directly. The key difference: you accumulate tool results during tool.call events, then send them all back after reply.done. The agent speaks a natural transition phrase while waiting for your results:
import json
import asyncio
import websockets
import os
from dotenv import load_dotenv

load_dotenv()

def get_account_info(account_id: str) -> dict:
    # Your business logic — same function, no changes needed
    return {"status": "active", "balance": 1000, "id": account_id}

# Tool definitions — same JSON Schema format as OpenAI
tools = [{
    "type": "function",
    "name": "get_account_info",
    "description": "Look up account status by ID",
    "parameters": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string", "description": "The customer account ID"}
        },
        "required": ["account_id"]
    }
}]

async def assemblyai_with_tools():
    uri = "wss://agents.assemblyai.com/v1/ws"
    headers = {"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"}
    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Include tools in session config
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a support agent. Use get_account_info to look up accounts.",
                "output": {"voice": "ivy"},
                "tools": tools
            }
        }))

        pending_tools = []
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "session.ready":
                print("Session ready with tools.")
            elif event["type"] == "tool.call":
                # Execute your function
                name = event["name"]
                args = event.get("arguments", {})
                if name == "get_account_info":
                    result = get_account_info(args["account_id"])
                else:
                    result = {"error": "Unknown tool"}
                # Accumulate — don't send yet
                pending_tools.append({
                    "call_id": event["call_id"],
                    "result": result,
                })
            elif event["type"] == "reply.done":
                if event.get("status") == "interrupted":
                    # User barged in — discard pending results
                    pending_tools.clear()
                elif pending_tools:
                    # Now send all tool results
                    for tool in pending_tools:
                        await ws.send(json.dumps({
                            "type": "tool.result",
                            "call_id": tool["call_id"],
                            "result": json.dumps(tool["result"]),
                        }))
                    pending_tools.clear()
            elif event["type"] == "transcript.user":
                print(f"User: {event['text']}")
            elif event["type"] == "transcript.agent":
                print(f"Agent: {event['text']}")

asyncio.run(assemblyai_with_tools())

Your actual business logic—get_account_info in this case—doesn't change at all. The tool definition JSON Schema is the same format. The difference is in the plumbing: AssemblyAI uses a cleaner accumulate-and-send pattern with tool.call / tool.result events, and the agent automatically speaks a transition phrase ("Let me check that for you") while waiting for results—no dead air.
What changes and what stays the same
Migrating from OpenAI Realtime API to AssemblyAI's Voice Agent API is, in practice, mostly a subtraction exercise. Here's a quick summary:
What you remove: Ephemeral token generation, manual audio buffer management and commit steps, base64 encoding/decoding boilerplate (though you still work with base64, the event model is simpler), the 30+ event types to handle, manual response triggering after tool calls.
What stays the same: Your business logic and function implementations, JSON Schema tool definitions (they transfer directly), the WebSocket + JSON communication model, base64 PCM audio format.
What you gain: Streaming transcripts of both sides of the conversation (transcript.user and transcript.agent), purpose-built speech accuracy from Universal-3 Pro Streaming, built-in turn detection with acoustic + contextual signals, natural barge-in handling, predictable $4.50/hr pricing, session resumption (reconnect within 30 seconds and pick up where you left off), and key terms prompting to boost accuracy on domain-specific vocabulary.
If you're building production voice agents and want speech accuracy, predictable pricing, and auto-scaling concurrency without managing three separate providers, AssemblyAI's Voice Agent API is worth evaluating. It's built on Universal-3 Pro Streaming—the same dedicated Voice AI model that powers AssemblyAI's real-time transcription—and is designed to be invisible infrastructure that stays out of the way while you build the product your users actually interact with.
Frequently asked questions
How does AssemblyAI Voice Agent API pricing compare to OpenAI Realtime API?
AssemblyAI charges a flat rate of $4.50/hour that covers speech understanding, LLM reasoning, and voice generation regardless of conversation density. OpenAI charges per audio token for input and output separately, meaning costs vary based on speech pace, pause frequency, and response length. The flat-rate model makes cost forecasting straightforward—one line of math to model what a 5-minute call costs.
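As a quick sanity check, the flat-rate math for that 5-minute call looks like this; the equivalent OpenAI figure is omitted because it depends on the audio token counts of the specific call.

# Flat-rate cost for a 5-minute call at $4.50/hour
cost = (5 / 60) * 4.50
print(f"${cost:.3f}")  # $0.375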
How long does migrating from OpenAI Realtime API to AssemblyAI Voice Agent API take?
Most migrations take one to two days. The changes are primarily structural—replacing OpenAI's event handling patterns with AssemblyAI's cleaner event model. Your business logic and JSON Schema tool definitions transfer directly without modification.
Can I reuse my existing OpenAI function definitions when migrating?
Yes. Both APIs use JSON Schema for tool definitions, so your existing schemas work as-is. The difference is in the execution flow: OpenAI uses conversation.item.create to return results, while AssemblyAI uses a tool.call / tool.result pattern where you accumulate results and send them after reply.done.
Does AssemblyAI Voice Agent API require an SDK?
No. The Voice Agent API is a standard WebSocket + JSON API. You connect to wss://agents.assemblyai.com/v1/ws, authenticate with a Bearer token, and exchange JSON messages. No SDK to install, no framework to learn. You can read the entire API reference in 10 minutes, and it works natively with coding agents like Claude Code.
What languages does AssemblyAI Voice Agent API support?
The Voice Agent API supports English, Spanish, French, German, Italian, and Portuguese through Universal-3 Pro, with regional dialect recognition across all six languages. Multilingual voices are available for additional languages including Hindi, Mandarin, Russian, Korean, and Japanese.
What happens if the WebSocket connection drops during a conversation?
AssemblyAI preserves sessions for 30 seconds after disconnection. You can reconnect using the session.resume event with the session_id from the original session.ready event, and the conversation picks up where it left off with full context preserved. This eliminates the need for custom reconnection logic.
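A rough sketch of the resume step, based on the description above: after reconnecting within the 30-second window, send a session.resume message carrying the original session_id. The exact field names are an assumption here; check the API reference for the authoritative message shape.

import json

async def resume_session(ws, session_id: str):
    # Ask the server to restore the dropped session's context.
    # session_id comes from the first session.ready event.
    await ws.send(json.dumps({
        "type": "session.resume",
        "session_id": session_id
    }))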