What is the best speech-to-text API for real-time agent assist?

AssemblyAI's Universal-3.5 Pro Realtime is the leading speech-to-text API for real-time agent assist — ~150ms P50 median latency, native speaker diarization for agent/customer separation, 28% better consecutive number recognition for capturing account IDs and order numbers, and dynamic keyterm prompting for compliance triggers. Universal-3.5 Pro Realtime at $0.45/hr drops directly into existing agent desktops via WebSocket. For teams that want a fully managed coaching pipeline (STT + LLM + TTS in one), the Voice Agent API at $4.50/hr handles the full loop through a single endpoint.

How do I build a real-time agent assist tool that suggests responses during a call?

Stream live caller audio into Universal-3.5 Pro Realtime via WebSocket. For each end-of-turn event, pipe the finalized transcript into the LLM Gateway (25+ models across Claude, GPT, and Gemini) with a system prompt for response suggestions, next-best-action, or objection handling. Render the suggestion in the agent desktop. Speaker diarization separates agent and customer turns automatically so the LLM only generates suggestions on customer turns. Use dynamic keyterms (UpdateConfiguration) to boost product names and compliance phrases as the conversation evolves.

Can speech-to-text detect escalation or compliance keywords in real time?

Yes. Pass escalation triggers ("cancel", "supervisor", "lawyer") and compliance phrases as keyterms_prompt — Universal-3.5 Pro Realtime boosts recognition of those terms in real time. When the model emits a finalized turn containing a tracked keyterm, your agent assist UI can surface a coaching prompt, compliance script, or supervisor alert. Keyterms are capped at 100 per session with each term up to 50 characters; update the list mid-call with UpdateConfiguration as the conversation evolves.

How do I separate agent and customer audio in a single call?

Universal-3.5 Pro Realtime includes native speaker diarization — turn on speaker_labels and each finalized turn includes a speaker label so you can route agent turns and customer turns to different downstream logic. For contact centers using Twilio Media Streams or a SIP trunk, run separate WebSockets for the inbound (customer) and outbound (agent) legs — that gives you channel separation in addition to speaker labels, which is the cleanest pattern for agent assist UIs.

How do I prevent PII like credit card numbers from showing on the agent's screen?

Enable inline PII redaction on Universal-3.5 Pro Realtime — credit card numbers, SSNs, dates of birth, phone numbers, and other PII categories are masked in the transcript before it reaches your agent desktop. Combine with keyterm prompting to boost the words around the PII (e.g., "card number", "social") so the redactor catches every instance even on noisy audio. AssemblyAI is SOC 2 Type 2 certified and offers a Business Associate Addendum (BAA) for customers processing PHI.

What integrations work with existing contact center platforms for agent assist?

AssemblyAI's Universal-3.5 Pro Realtime integrates with standard WebSocket clients and works directly with Twilio Media Streams, Twilio SIP, LiveKit Agents, Pipecat, and Vapi — the same orchestrators contact center platforms already run. For LiveKit-based agent assist desktops, drop in livekit-plugins-assemblyai (under 10 lines of code). The Voice Agent API also exposes a single WebSocket compatible with any modern voice infrastructure.

Solutions

Voice agents for real-time agent assist

Give your human agents live transcription, real-time coaching prompts, and automatic escalation detection — all powered by the most accurate Realtime Speech-to-Text API in the market.

Get started free Talk to sales

Agent assist panel

Live

Customer

"I've been waiting two weeks for my refund and nobody can tell me where it is."

Suggested response

Acknowledge frustration. Pull up order #, check refund status in billing system.

Agent

"I completely understand. Let me pull up your account right now..."

Escalation detected

Keywords: "cancel", "supervisor" — compliance script recommended.

The problem

Agents fly blind on every call until it's too late

Contact center agents handle dozens of calls per shift with no real-time visibility. Supervisors spot-check a fraction; compliance violations and escalation opportunities are caught after the fact, if at all. Traditional QA is post-hoc — by the time you flag a churn signal or a missed cross-sell, the customer is gone. AssemblyAI gives every agent a live coach: real-time transcript, AI-generated suggestions, and automatic keyword detection on 100% of calls.

Built for real-time contact center operations

Latency ~150ms

Median streaming latency on Universal-3.5 Pro Realtime for real-time agent display.

Entity accuracy 28%

Better consecutive number recognition for order numbers, account IDs, and SKUs.

Coverage 100%

Of calls analyzed in real time — no more random spot-checks.

Keyterms 100

Domain-specific keyterms boosted dynamically per session.

Two ways to build

Pick the API that fits your agent assist stack

Ship a working coaching pipeline in an afternoon, or drop best-in-class streaming STT into the agent desktop you already run.

Recommended

Voice Agent API

Our proprietary voice stack via one WebSocket. Run an AI tier-1 voice agent that triages and warm-transfers complex calls to your human team with full context — zero infra to manage.

Best for

Tier-1 triage with warm escalation to human agents
Tool calls for CRM lookup, ticket creation, and handoff
Built-in keyterm prompting for compliance triggers
Claude Code compatible — paste the docs and build anything

$4.50/hr — speech, LLM, and voice all included

Get started for free

Free tier available · No credit card required

Bring Your Own Stack

Universal-3.5 Pro Realtime STT API

The live transcript layer for your agent desktop. Works natively with LiveKit, Pipecat, Vapi, and Twilio — speaker labels, dynamic keyterms, and PII redaction baked in.

Best for

Teams running their own LLM and agent desktop
~150ms P50 median latency for instant suggestions
Native speaker labels for agent/customer separation
Dynamic keyterms — update mid-call as topics shift
Inline PII redaction before data hits the agent screen

$0.45/hr — transcription only, unlimited streams

View integration docs

No concurrency caps · Autoscaling included

One pipeline turns every call into a coaching opportunity

Live transcription with speaker labels

Universal-3.5 Pro Realtime transcribes agent and customer turns in real time with built-in diarization, so coaching prompts are always anchored to the right speaker.

AI-generated suggestions during the call

Pipe finalized turns into the LLM Gateway (25+ models) for real-time response suggestions, objection handling, and next-best-action prompts.

Escalation and compliance keyword detection

Boost compliance triggers and escalation phrases through keyterm prompting. Update keyterms mid-call as the conversation evolves.

PII-safe agent surface

Stream PII redaction inline so the agent assist UI never displays raw card numbers, SSNs, or other sensitive data.

support_agent

Agent assist pipeline

Capture live call audio

↓

Transcribe + diarize in real time

↓

Generate coaching suggestions

↓

Flag escalations + compliance

Quickstart

Build a real-time agent assist tool in minutes

Voice Agent API — tier-1 caller agent with human escalation

# Voice Agent API: tier-1 caller agent that escalates to a human
import asyncio, json, websockets

API_KEY = "YOUR_API_KEY"

async def run_agent():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": (
                    "You are a tier-1 support agent. Triage the caller's issue. "
                    "If they ask for a supervisor, say 'cancel', or the issue is "
                    "complex, call escalate_to_human with a one-sentence summary."
                ),
                "greeting": "Hi, thanks for calling Acme — how can I help today?",
                "input": {"keyterms": ["cancel", "supervisor", "refund", "lawyer"]},
                "output": {"voice": "ivy"},
                "tools": [{
                    "type": "function",
                    "name": "escalate_to_human",
                    "description": "Warm-transfer to a human agent with full context.",
                    "parameters": {
                        "type": "object",
                        "properties": {"summary": {"type": "string"}},
                        "required": ["summary"],
                    },
                }],
            },
        }))
        async for msg in ws:
            handle(json.loads(msg))  # transcript.user, reply.audio, tool.call, ...

Universal-3.5 Pro Realtime — live transcript for agent desktop

# Universal-3.5 Pro Realtime: live transcript for an agent assist desktop
import asyncio, json, websockets
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"

params = urlencode({
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "keyterms_prompt": json.dumps([
        "cancel", "supervisor", "refund",
        "order #", "account number",
    ]),
    "format_turns": "true",
    "speaker_labels": "true",                  # diarize agent vs. caller
    "redact_pii": "true",                      # mask PII before it hits the UI
    "redact_pii_policies": json.dumps([
        "credit_card_number", "us_social_security_number",
        "date_of_birth", "phone_number", "email_address",
    ]),
    "redact_pii_sub": "entity_name",           # e.g. [CREDIT_CARD_NUMBER]
})

async def stream_agent_assist(audio_iter):
    url = f"wss://streaming.assemblyai.com/v3/ws?{params}"
    async with websockets.connect(
        url, additional_headers={"Authorization": API_KEY},
    ) as ws:
        async def send_audio():
            async for chunk in audio_iter:
                await ws.send(chunk)
        asyncio.create_task(send_audio())
        async for raw in ws:
            evt = json.loads(raw)
            if evt.get("type") == "Turn" and evt.get("end_of_turn"):
                # finalized, PII-redacted turn — pipe to LLM Gateway for coaching
                push_to_coach(evt["transcript"], evt.get("speaker_label"))

Try in Playground View full docs

Sub-300ms streaming for instant suggestions

Universal-3.5 Pro Realtime delivers ~150ms P50 median latency, so coaching prompts appear before the agent has to think — not after the moment has passed.

Dynamic keyterms for evolving calls

Push a new keyterm list with UpdateConfiguration mid-call when the conversation shifts from menu items to payment terms — context biasing adapts in real time.

Inline PII redaction on every turn

Redact card numbers, SSNs, dates of birth, and other sensitive data before they ever hit the agent's screen. SOC 2 Type 2 certified; BAA available for regulated workloads.

Teams shipping agent assist on AssemblyAI

100% coverage of agent messages

Calls are and will remain a pertinent part of the customer service journey. Customer service isn't moving entirely to chatbots or chat interactions.

Dr. Shane Lynn, CEO — EdgeTier

Frequently asked questions

: AssemblyAI's Universal-3.5 Pro Realtime is the leading speech-to-text API for real-time agent assist — ~150ms P50 median latency, native speaker diarization for agent/customer separation, 28% better consecutive number recognition for capturing account IDs and order numbers, and dynamic keyterm prompting for compliance triggers. Universal-3.5 Pro Realtime at $0.45/hr drops directly into existing agent desktops via WebSocket. For teams that want a fully managed coaching pipeline (STT + LLM + TTS in one), the Voice Agent API at $4.50/hr handles the full loop through a single endpoint.
: Stream live caller audio into Universal-3.5 Pro Realtime via WebSocket. For each end-of-turn event, pipe the finalized transcript into the LLM Gateway (25+ models across Claude, GPT, and Gemini) with a system prompt for response suggestions, next-best-action, or objection handling. Render the suggestion in the agent desktop. Speaker diarization separates agent and customer turns automatically so the LLM only generates suggestions on customer turns. Use dynamic keyterms (UpdateConfiguration) to boost product names and compliance phrases as the conversation evolves.
: Yes. Pass escalation triggers ("cancel", "supervisor", "lawyer") and compliance phrases as keyterms_prompt — Universal-3.5 Pro Realtime boosts recognition of those terms in real time. When the model emits a finalized turn containing a tracked keyterm, your agent assist UI can surface a coaching prompt, compliance script, or supervisor alert. Keyterms are capped at 100 per session with each term up to 50 characters; update the list mid-call with UpdateConfiguration as the conversation evolves.
: Universal-3.5 Pro Realtime includes native speaker diarization — turn on speaker_labels and each finalized turn includes a speaker label so you can route agent turns and customer turns to different downstream logic. For contact centers using Twilio Media Streams or a SIP trunk, run separate WebSockets for the inbound (customer) and outbound (agent) legs — that gives you channel separation in addition to speaker labels, which is the cleanest pattern for agent assist UIs.
: Enable inline PII redaction on Universal-3.5 Pro Realtime — credit card numbers, SSNs, dates of birth, phone numbers, and other PII categories are masked in the transcript before it reaches your agent desktop. Combine with keyterm prompting to boost the words around the PII (e.g., "card number", "social") so the redactor catches every instance even on noisy audio. AssemblyAI is SOC 2 Type 2 certified and offers a Business Associate Addendum (BAA) for customers processing PHI.
: AssemblyAI's Universal-3.5 Pro Realtime integrates with standard WebSocket clients and works directly with Twilio Media Streams, Twilio SIP, LiveKit Agents, Pipecat, and Vapi — the same orchestrators contact center platforms already run. For LiveKit-based agent assist desktops, drop in livekit-plugins-assemblyai (under 10 lines of code). The Voice Agent API also exposes a single WebSocket compatible with any modern voice infrastructure.

Build real-time agent assist today

Free tier, no credit card. From live transcript to AI coaching prompts in an afternoon.

Get started free