June 29, 2026

Analyzing and scoring voice agent calls with the LLM Gateway

You shipped a voice agent and it's taking calls—now comes the harder question that decides whether it stays in production: how do you know if it's any good? Measuring voice-agent quality at scale means turning every call into structured, scored data. Here's the pattern that works: transcribe the call accurately, then use an LLM to summarize and score it against the criteria you actually care about—built on AssemblyAI's LLM Gateway.

Kelsey Foster

Growth

LLM Gateway

Reviewed by

Table of contents

[Visible on live site]

You shipped a voice agent. It's taking calls. Now comes the harder question, the one that decides whether it stays in production: how do you know if it's any good? "It sounds fine on the demo" doesn't survive contact with a few thousand real conversations, where the agent occasionally misunderstands an account number, talks over a frustrated customer, or confidently resolves the wrong issue.

Measuring voice-agent quality at scale means turning every call into structured, scored data you can track. The pattern that works: transcribe the call accurately, then use an LLM to summarize and score it against the criteria you actually care about. Here's how to build that with AssemblyAI's LLM Gateway on top of an accurate transcript.

Why you can't measure what you can't read accurately

Every quality metric you derive from a call is downstream of the transcript. If the transcription is wrong, your scores are wrong — and worse, confidently wrong. An agent that correctly said "your balance is $1,040" gets marked as an error if the transcript heard "$1,014." Garbage in, garbage scored.

So step one isn't the scoring model — it's accuracy at the source. Run the call recording through Universal-3 Pro (async) with speaker diarization on, so you have a clean, speaker-attributed transcript that distinguishes what the agent said from what the customer said. That distinction is essential: most of what you want to score — did the agent resolve it, was the customer satisfied — depends on knowing who spoke each line. Universal-3 Pro's entity accuracy matters here too, because account numbers, names, and dollar figures are exactly what QA rubrics check.

Two jobs: summarize, then score

Once you have a reliable transcript, the LLM Gateway does two related jobs. It's a single API to OpenAI's GPT, Anthropic's Claude, and Google's Gemini, billed at pass-through token rates — so you can pick the model that fits each job without managing three vendor integrations.

Job 1: Summarize the call

A summary is the human-readable record — what happened, what was resolved, what's outstanding. The key is that summaries shouldn't be one-size-fits-all. A support lead wants a different shape than a sales manager. With the LLM Gateway you define the format: a terse three-bullet recap, a structured "issue / resolution / follow-up" block, or a narrative paragraph. (This is also why "does AssemblyAI support customizable summary types" is a common question — yes, and the LLM Gateway is how you control the shape.)

A prompt as simple as this gets you a structured summary from the transcript:

Summarize this customer support call as JSON with keys:
"reason_for_call", "resolution", "follow_up_needed" (boolean),
and "follow_up_notes". Transcript:
{transcript}

Transcribe and Summarize in a Few Lines

Transcribe with Universal-3 Pro and summarize through the LLM Gateway in a few lines of code. Start free with clear docs and pass-through LLM pricing.

Job 2: Score the call

Scoring is where measurement gets real. Instead of asking the LLM to describe the call, you ask it to evaluate the call against an explicit rubric and return numbers you can aggregate. Decide what "good" means for your agent first — that's the part teams skip — then encode it. Common dimensions for a voice agent:

Task success: did the agent actually accomplish what the caller needed? Resolution without escalation: handled end to end, or punted to a human? Sentiment trajectory: did the customer end calmer or more frustrated than they started? Adherence: did the agent follow required disclosures and process? Accuracy of information: did the agent state correct facts (cross-checkable against your systems)?

A scoring prompt makes the rubric explicit and forces structured output:

Score this support call 1–5 on each dimension and return JSON:
"task_success", "resolution_no_escalation", "customer_sentiment_end",
"policy_adherence". For each, add a one-sentence "evidence" field
quoting the transcript. Be strict. Transcript:
{transcript}

Run that across every call and you have a dataset: average task success this week, the percentage of calls escalated, sentiment trends by topic, adherence rates by call type. That's what "measuring success" actually looks like — not a vibe, a tracked number with quotable evidence behind each score.

Layer in Speech Understanding for the structured signals

Some signals don't need an LLM at all, and it's cheaper and more consistent to get them deterministically. AssemblyAI's Speech Understanding features run on the same transcript: sentiment analysis per utterance, entity detection for the names and numbers mentioned, topic detection to bucket calls by subject, and PII redaction so your QA dataset doesn't store raw customer data. Use these for the objective, repeatable signals and reserve the LLM for the judgment calls — task success, adherence — that genuinely need reasoning.

This hybrid is the architecture behind real conversation intelligence: deterministic features for the facts, an LLM for the evaluation, all on one accurate transcript.

See Scoring Run on a Real Recording

Watch transcription, sentiment, and LLM-based scoring run on a real recording in the playground—no code required.

Try playground

Closing the loop

The point of scoring isn't a dashboard — it's improvement. Once every call carries scores and evidence, you can find the agent's failure patterns (it always escalates billing disputes; sentiment drops whenever it asks for an account number a second time), feed those back into the agent's system prompt or tools, and watch the scores move. That's the difference between a voice agent you hope is working and one you can prove is improving.

And it's the same accurate transcript powering all of it — transcription, summarization, structured understanding, and LLM scoring on one platform, one bill. If you're building the agent itself, our guide on building with the Voice Agent API covers the real-time side; this is how you measure what it does once it's live. Start by scoring a handful of yesterday's calls — the failure patterns show up faster than you'd expect.

Transcribe, Summarize, and Score Every Call

Run transcription, summarization, and LLM scoring on one platform and one bill. Free key, Universal-3 Pro accuracy, pass-through LLM pricing, no commitments.

Frequently asked questions

How do you measure whether a voice agent is successful?

Turn every call into scored data: transcribe it accurately, then use an LLM to evaluate it against an explicit rubric — task success, resolution without escalation, customer sentiment trajectory, and policy adherence — returning structured numbers you can aggregate over time. Define what "good" means before you score, and require the model to quote transcript evidence for each score so the metrics are auditable.

What is the LLM Gateway and how does it help analyze calls?

The LLM Gateway is a single AssemblyAI API to OpenAI's GPT, Anthropic's Claude, and Google's Gemini, billed at pass-through token rates. For call analysis, you pass a transcript and a prompt to summarize or score the conversation, choosing the best model per job without managing three separate vendor integrations.

Can I generate custom summary formats for calls?

Yes. The LLM Gateway lets you define the exact summary shape in your prompt — a three-bullet recap, a structured issue/resolution/follow-up block, or a narrative paragraph — so different teams get the format they need from the same call. This customizability is the main reason teams use the Gateway over a fixed summarization feature.

Why does transcription accuracy matter for call scoring?

Every score is derived from the transcript, so transcription errors produce confidently wrong scores — an agent that stated the correct balance gets flagged as an error if the number was misheard. Use Universal-3 Pro with speaker diarization for accurate, speaker-attributed text, since most scoring depends on knowing whether the agent or the customer spoke each line.

Should I use an LLM or Speech Understanding features to score calls?

Use both. Speech Understanding features (sentiment, entity detection, topic detection, PII redaction) give you objective, repeatable signals deterministically and cheaply, while the LLM handles judgment calls like task success and policy adherence that need reasoning. This hybrid keeps costs down and scores consistent.

How much does it cost to score calls this way?

Transcription is $0.21/hr on Universal-3 Pro, and LLM Gateway usage is billed at pass-through token rates — typically a cent or two per call for a summary plus a scoring pass. Speech Understanding features are priced separately per feature. Everything runs on one platform with pay-as-you-go billing and no minimums.