Insights & Use Cases
May 5, 2026

What is LLM Gateway?

LLM gateway explained: compare features like routing, failover, security, and cost control so you can choose the right option for production AI apps today.

Kelsey Foster
Growth

Building a Voice AI application means combining speech-to-text with language model processing—and that second step is where complexity accumulates fast. Which model handles summarization best? What happens when a provider rate-limits you mid-call? How do you A/B test GPT-5 against Claude without rewriting your integration? The growing LLM gateway market exists precisely because these problems are universal—and when your primary model goes down at 2am, you need infrastructure that handles it automatically.

AssemblyAI's LLM gateway solves all of this from inside the same platform you already use for transcription. One OpenAI-compatible API endpoint. Twenty-plus models across four providers. Zero markup on pricing. And because it runs inside the same infrastructure as your speech-to-text pipeline, there's no extra network hop eating into your latency budget.

This guide explains what AssemblyAI's LLM gateway is, how it compares to alternatives, and why it was built specifically for Voice AI—not as a generic multi-model proxy.

What is AssemblyAI's LLM gateway?

AssemblyAI's LLM gateway is an OpenAI-compatible API that routes your transcripts through leading LLMs from Anthropic, OpenAI, Google, and Meta—without managing separate API keys, accounts, or billing relationships. Change one base URL, keep your existing code, and you're running on a different model.

Unlike general-purpose LLM proxies, LLM gateway is designed around spoken data. It understands the structure of a transcript—speaker labels, word-level timestamps, overlapping turns—and preserves that context when routing to downstream models. You're not sending raw text; you're sending speech-aware context.

But here's where it gets interesting. Because LLM gateway runs inside the same infrastructure as AssemblyAI's transcription pipeline, there's no additional network hop between your speech-to-text and your LLM call. Every utterance in a voice agent flows through a single system: speech in, LLM processing, action out. That architecture difference matters when you're optimizing for sub-second response times.

Here's what changes when you add AssemblyAI's LLM gateway to your Voice AI stack:

| Without LLM gateway | With LLM gateway |
|---|---|
| Separate API keys and billing per LLM provider | One AssemblyAI API key for all 20+ models across 4 providers |
| Manual prompt formatting per provider | Automatic prompt construction from transcript context |
| Brittle model-switching code | Swap providers with a single parameter change (OpenAI-compatible API) |
| Rebuild integrations for every new model | New models added automatically as they release |
| 5-5.5% markup on model costs (OpenRouter, llmgateway.io) | Zero markup—pay exactly what providers charge |
| Client-side retry logic when models fail or timeout | Automatic cross-provider fallbacks with same response schema |
| Third-party proxy libraries with massive dependency trees | Minimal-dependency architecture built by hand |
| Extra network hop between STT and LLM | Runs inside same infrastructure as transcription pipeline |

Try LLM gateway with your existing API key

Access 20+ models from OpenAI, Anthropic, Google, and Meta through one endpoint—with zero markup on provider costs.

Get started free

When do you need AssemblyAI's LLM gateway?

LLM gateway makes sense when your application involves spoken data and you want to apply language models to it without managing the complexity yourself. Here are the scenarios where it delivers the most value:

  • You're already using AssemblyAI for transcription. You don't need a new account, a new API key, or a new billing relationship. Your existing credentials work. Your existing dashboard shows usage across both transcription and LLM calls.
  • You want to compare models without rewriting code. Testing GPT-5 against Claude 4.5 Sonnet? Change one parameter. The API contract stays identical, so your integration code doesn't need to know which model is handling the request.
  • You need automatic fallbacks. If your primary model errors or exceeds a latency threshold, LLM gateway can automatically reroute to a backup model. Same schema, same response format. No client-side retry logic required.
  • Pricing markup matters to you. OpenRouter charges 5-5.5% on top of provider costs. llmgateway.io charges 5%. AssemblyAI charges zero markup—you pay exactly what the model providers charge. At scale, that difference adds up.
  • You're building voice agents. LLM gateway powers the LLM step in AssemblyAI's Voice Agent API pipeline. One WebSocket connection handles STT, LLM, and TTS at a flat $4.50/hr rate. If you're building conversational voice interfaces, this is the fastest path from speech to action.
  • You're concerned about supply chain security. In March 2026, LiteLLM—a popular open-source LLM proxy library—was hit by a supply chain attack that stole API credentials. AssemblyAI was unaffected because LLM gateway doesn't depend on third-party proxy libraries. Every provider integration is built by hand, with a minimal dependency tree.

Supported models

LLM gateway provides access to 20+ models across four providers:

| Provider | Models | Use cases |
|---|---|---|
| Anthropic | Claude 4.5 Sonnet, Claude 4.5 Haiku, Claude 4 Sonnet, Claude 4 Opus | Long-context summarization, nuanced analysis, complex reasoning |
| OpenAI | GPT-5, GPT-5 mini, GPT-5 nano, GPT-4.1, ChatGPT-4o, gpt-oss-20b, gpt-oss-120b | General-purpose processing, function calling, structured outputs |
| Google | Gemini 3 Pro, Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite | Very long context windows, fast inference, cost-effective processing |
| Meta | Llama 4 Scout, Llama 4 Maverick | Open-weight flexibility, fine-tuning compatibility, cost optimization |

This is a curated catalog of tested, reliable models—not 300+ options where you're left wondering which ones actually work. Every model in LLM gateway has been tested against voice data and maintains consistent behavior through the unified API.

Note on model availability: Models are added as providers release them and deprecated when providers sunset them. Check the API documentation for the current list.

Key features

Unified interface with one API key

LLM gateway exposes an OpenAI-compatible chat completions endpoint. If you've written code against OpenAI's API, you already know the interface. Change the base URL to AssemblyAI's endpoint, swap in your AssemblyAI API key, and you're routing through LLM gateway.

The practical impact: you can switch between GPT-5 and Claude 4.5 Sonnet by changing one parameter. No SDK changes. No authentication rework. No schema translation.

For teams already using AssemblyAI for transcription, this means one API key for your entire Voice AI stack. One set of rate limits to manage. One dashboard for monitoring. One invoice at the end of the month.
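As a concrete sketch of what "one parameter change" means in practice: the request body below follows OpenAI's chat completions schema, but the model identifiers are hypothetical placeholders, so check AssemblyAI's documentation for the real base URL and model names before copying this.

```python
# Sketch: an OpenAI-compatible payload where swapping providers is one string.
# The model ids below are placeholders -- consult AssemblyAI's docs for the
# actual catalog and the gateway's base URL.

def build_chat_request(model: str, transcript: str) -> dict:
    """Build an OpenAI-compatible /chat/completions payload.

    The `model` field is the only thing that changes when you swap
    providers; the payload shape and your parsing code stay identical.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize this call transcript."},
            {"role": "user", "content": transcript},
        ],
    }

transcript = "Speaker A: Hi, I'm calling about my invoice..."
gpt_payload = build_chat_request("gpt-5", transcript)                 # hypothetical id
claude_payload = build_chat_request("claude-4-5-sonnet", transcript)  # hypothetical id

# Everything except the model field is identical between the two requests.
assert gpt_payload["messages"] == claude_payload["messages"]
```

Because the contract is identical on both sides, A/B testing two providers reduces to parameterizing that one string in your config.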

Speech-native context preservation

This is the core differentiator. When you route a transcript through LLM gateway, you're not just sending text—you're sending structured speech data. Speaker diarization, word-level timestamps, turn boundaries, and confidence scores can all flow through to your LLM prompts.

Why does this matter? Because voice applications require different context than text applications. Knowing who said what (and when) changes how you summarize a meeting, analyze a call, or generate a response in a voice agent.

More importantly, LLM gateway runs inside the same infrastructure as AssemblyAI's transcription pipeline. There's no extra network hop between your speech-to-text output and your LLM input. For voice agents where every millisecond counts, eliminating that round-trip matters. Speech flows in, gets transcribed, hits the LLM, and returns an action—all within a single system.

Teams building voice-powered products adopt this architecture because latency directly shapes user experience in real-time conversation.

Full chat completion capabilities

LLM gateway supports the full OpenAI chat completions feature set:

  • Streaming responses: Get tokens as they're generated for real-time UIs
  • Function calling: Define tools and let models invoke them with structured arguments
  • JSON mode: Force structured JSON outputs for downstream processing
  • System prompts: Set context and behavior across conversation turns
  • Multi-turn conversations: Maintain conversation history for contextual responses

Whatever you're doing with OpenAI's API today, LLM gateway supports it—while giving you access to Claude, Gemini, and Llama through the same interface.
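To make the feature list concrete, here is a hedged sketch of how streaming, function calling, and JSON mode appear together in a request body. The field names follow OpenAI's chat completions schema; the model id and the `create_ticket` tool are invented for illustration.

```python
# Sketch: composing streaming, tool use, and JSON mode in one
# OpenAI-compatible payload. The model id and tool name are hypothetical.

def build_feature_request(model: str, user_text: str) -> dict:
    """Compose common chat-completions options in a single payload."""
    return {
        "model": model,
        "stream": True,  # tokens arrive incrementally for real-time UIs
        "messages": [
            {"role": "system", "content": "You are a call analyst. Reply in JSON."},
            {"role": "user", "content": user_text},
        ],
        # JSON mode: force structured output for downstream processing.
        "response_format": {"type": "json_object"},
        # Function calling: declare a tool the model may invoke with
        # structured arguments. `create_ticket` is a hypothetical example.
        "tools": [{
            "type": "function",
            "function": {
                "name": "create_ticket",
                "parameters": {
                    "type": "object",
                    "properties": {"summary": {"type": "string"}},
                    "required": ["summary"],
                },
            },
        }],
    }

payload = build_feature_request("gpt-5", "Customer reports a billing error.")
assert payload["stream"] is True
assert payload["tools"][0]["function"]["name"] == "create_ticket"
```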

Zero-markup pricing

AssemblyAI charges exactly what model providers charge. No percentage fee on top. No hidden costs.

To put this in perspective:

  • OpenRouter: 5-5.5% markup on provider costs
  • llmgateway.io: 5% markup on provider costs
  • AssemblyAI LLM gateway: 0% markup

At low volume, the difference is negligible. At scale—processing thousands of transcripts per day through GPT-5 or Claude—those percentages translate to real dollars. If you're already paying for transcription infrastructure, there's no reason to pay an additional tax on your LLM calls.
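To put rough numbers on "adds up": the monthly spend figure below is illustrative, not a quoted price, but the markup percentages are the ones cited above.

```python
# Illustrative arithmetic: what a 5-5.5% gateway markup costs at scale.
# The spend figure is hypothetical; the percentages come from the
# comparison above.
monthly_provider_spend = 20_000.00  # hypothetical raw LLM spend in USD

openrouter_markup = monthly_provider_spend * 0.055   # upper bound, 5.5%
llmgatewayio_markup = monthly_provider_spend * 0.05  # 5%
assemblyai_markup = monthly_provider_spend * 0.0     # zero markup

print(f"OpenRouter markup:    ${openrouter_markup:,.2f}/month")    # $1,100.00
print(f"llmgateway.io markup: ${llmgatewayio_markup:,.2f}/month")  # $1,000.00
print(f"AssemblyAI markup:    ${assemblyai_markup:,.2f}/month")    # $0.00
```

At that hypothetical volume, a 5% markup is roughly $12,000 a year spent on routing alone.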

Billing is unified with your existing AssemblyAI account. One invoice shows transcription usage and LLM usage together. No separate payment methods. No reconciling bills from multiple vendors.

Automatic cross-provider fallbacks

Here's a scenario: your application uses GPT-5 as the primary model. OpenAI experiences an outage, or your request hits rate limits, or latency spikes above your threshold. What happens?

With direct API calls or basic proxies, your request fails. You catch the error, implement retry logic, maybe try a different provider, handle the different response format, and hope your fallback is actually available.

With LLM gateway, you configure fallback models in advance. If the primary model fails or exceeds latency thresholds, the request automatically reroutes to your backup—Claude, Gemini, Llama, whatever you've specified. The response comes back in the same schema. Your client code doesn't know anything changed.

This isn't just convenience. For voice agents, a failed LLM call means dead air. For real-time applications, a 10-second timeout means a broken experience. Automatic fallbacks keep your application responsive when individual providers have problems.
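The gateway performs this rerouting server-side, but the pattern it replaces looks roughly like the client-side sketch below. The model ids are hypothetical, and the simulated outage stands in for a real rate limit or timeout.

```python
# Sketch of the client-side fallback loop that gateway-managed failover
# makes unnecessary. Model names are hypothetical.

def call_with_fallbacks(models, send):
    """Try each model in order until one succeeds.

    `send` is any callable that takes a model id and either returns a
    response or raises. LLM gateway does the equivalent rerouting
    server-side, so client code never needs this loop.
    """
    last_error = None
    for model in models:
        try:
            return model, send(model)
        except Exception as exc:  # rate limit, outage, timeout, ...
            last_error = exc
    raise RuntimeError("all models failed") from last_error

# Simulate a primary-provider outage: the first model raises, the backup answers.
def fake_send(model):
    if model == "gpt-5":
        raise TimeoutError("primary provider timed out")
    return {"content": f"answer from {model}"}

used, response = call_with_fallbacks(["gpt-5", "claude-4-5-sonnet"], fake_send)
assert used == "claude-4-5-sonnet"
```

Note what the client code has to get right here: per-provider error types, retry ordering, and response normalization. Moving that loop server-side is what lets the response come back in the same schema regardless of which model answered.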

Minimal-dependency security

In March 2026, LiteLLM—one of the most widely used open-source LLM proxy libraries—was compromised in a supply chain attack. Malicious code in a dependency update exfiltrated API keys to external servers. Teams using LiteLLM had their OpenAI, Anthropic, and other credentials exposed.

AssemblyAI was unaffected. LLM gateway doesn't use LiteLLM or similar third-party proxy libraries. Every provider integration is built by hand, with intentionally minimal dependencies. There's no massive dependency tree where a single compromised package can steal your credentials.

This architecture decision predates the LiteLLM incident—it's a deliberate choice to control security-critical code paths. But the incident validates the approach. If you're routing production traffic and API keys through infrastructure, you want that infrastructure to have the smallest possible attack surface.

Always up-to-date model support

When Anthropic releases Claude 5 or OpenAI launches its next GPT model, you don't rebuild your integration. AssemblyAI adds new models to LLM gateway as providers release them. Your code stays the same; you just change the model parameter to access new capabilities.

Similarly, when providers deprecate older models, AssemblyAI provides migration guidance and ensures fallback configurations continue working. The model landscape changes constantly—LLM gateway absorbs that churn so your application code doesn't have to.

How AssemblyAI's LLM gateway compares to alternatives

There are multiple ways to access LLMs: direct provider APIs, multi-model proxies like OpenRouter, or platform-integrated solutions like AssemblyAI's LLM gateway. Here's how they compare:

| Criteria | AssemblyAI LLM gateway | OpenRouter | llmgateway.io | Direct provider APIs |
|---|---|---|---|---|
| Pricing markup | 0% | 5-5.5% | 5% | 0% |
| Model count | 20+ curated models | 300+ models | Varies | Provider-specific |
| Automatic fallbacks | Yes, cross-provider with same schema | Limited | Limited | No—manual implementation |
| Voice AI integration | Native—same infrastructure as STT | No | No | Separate integration |
| Dependency security | Minimal deps, built by hand | Third-party libraries | Third-party libraries | Provider SDKs only |
| Setup for existing users | Same API key, no new account | New account required | New account required | Account per provider |
| Speech context preservation | Yes—diarization, timestamps, turns | No | No | Manual formatting |

When to choose OpenRouter: If you need access to 300+ models including niche fine-tuned variants, and you're not building voice applications, OpenRouter's breadth makes sense. The markup is the cost of that breadth.

When to choose direct APIs: If you're only using one provider and don't need fallbacks, billing consolidation, or speech-to-text integration, direct APIs are simpler. You avoid any middleware.

When to choose AssemblyAI LLM gateway: If you're building Voice AI, already using AssemblyAI, want zero markup, need automatic fallbacks, or care about supply chain security, LLM gateway is purpose-built for your use case.

See LLM gateway in action

Transcribe audio and route it through GPT-5, Claude, or Gemini—all from the same API. Try it in the Playground.

Open playground

Using LLM gateway with streaming speech-to-text

For real-time applications—live captioning, voice agents, call analytics—you're often processing audio as it streams rather than waiting for a complete recording. LLM gateway integrates directly with AssemblyAI's Streaming Speech-to-Text API.

The architecture looks like this:

  1. Audio streams in via WebSocket to AssemblyAI's Streaming STT
  2. Partial transcripts arrive with sub-second latency as speech is recognized
  3. Final transcripts trigger LLM gateway calls for processing
  4. Results return through the same connection for immediate action

Because LLM gateway runs inside the same infrastructure as the streaming transcription service, there's no extra network hop. The transcript doesn't leave AssemblyAI's system to reach an LLM—it's processed internally and the result comes back faster than if you were making a separate API call to a third-party gateway.
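A minimal simulation of that loop is sketched below. The event shapes are invented for illustration; the real streaming API defines its own message types, so treat this as the control flow, not the wire format.

```python
# Sketch of the streaming STT -> LLM control flow. The (kind, text) event
# tuples are invented for illustration; the actual streaming API has its
# own message schema.

def run_turn(events, llm):
    """Consume streaming STT events and trigger the LLM on final transcripts.

    `events` is an iterable of (kind, text) pairs where kind is "partial"
    or "final"; `llm` is called once per finalized utterance.
    """
    actions = []
    for kind, text in events:
        if kind == "partial":
            continue  # partials update the UI but don't trigger the LLM
        actions.append(llm(text))
    return actions

stt_events = [
    ("partial", "what's my"),
    ("partial", "what's my account"),
    ("final", "What's my account balance?"),
]
actions = run_turn(stt_events, llm=lambda text: f"LLM handled: {text}")
assert actions == ["LLM handled: What's my account balance?"]
```

When steps 3 and 4 happen inside the same infrastructure as step 2, the `llm(...)` call above is an internal handoff rather than a second public-internet round-trip.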

This matters most for voice agents. When a user finishes speaking, you want the LLM response to begin as quickly as possible. Every 100ms of added latency feels like dead air. By eliminating the network round-trip between STT and LLM, you're shaving time off every turn in the conversation.

For teams building with AssemblyAI's Voice Agent API, LLM gateway handles the LLM step automatically. You don't configure it separately—it's part of the pipeline. One WebSocket connection, flat $4.50/hr pricing for the full STT+LLM+TTS stack, and you're focused on your application logic rather than orchestrating multiple services.

Voice AI use cases for LLM gateway

LLM gateway shines when you're applying language models to spoken data. Here are the patterns we see most often:

| Use case | How LLM gateway helps | Typical models |
|---|---|---|
| Meeting summarization | Process transcripts with speaker labels to generate summaries, action items, and follow-ups | Claude 4.5 Sonnet, GPT-5, Gemini 2.5 Pro |
| Call center analytics | Analyze agent/customer conversations for sentiment, compliance, and coaching opportunities | GPT-5 mini, Claude 4.5 Haiku |
| Clinical documentation | Convert patient-provider conversations into structured clinical notes | Claude 4.5 Sonnet, GPT-5 |
| Customer support agents | Power voice agents that understand queries, retrieve information, and respond naturally | GPT-5, Claude 4.5 Sonnet |
| Sales coaching | Analyze sales calls for talk time, objection handling, and messaging adherence | GPT-5 mini, Gemini 2.5 Flash |
| Content repurposing | Transform podcasts, webinars, or interviews into blog posts, social content, and newsletters | Claude 4.5 Sonnet, GPT-5 |
| Appointment scheduling agents | Voice agents that handle inbound scheduling calls, check availability, and confirm bookings | GPT-5 mini, Claude 4.5 Haiku |
| Outbound reminder calls | Voice agents that call patients or customers with appointment reminders and handle rescheduling | GPT-5 mini, Llama 4 Scout |

Voice agent applications

Voice agents represent the fastest-growing category of LLM gateway usage. These are conversational AI applications where users speak naturally and receive spoken responses—think customer support lines that don't require "press 1 for billing," or clinical intake systems that gather patient information through natural conversation.

LLM gateway is particularly well-suited here because:

  • Latency is critical. Running inside the same infrastructure as STT eliminates network hops.
  • Fallbacks prevent dead air. If GPT-5 is slow or unavailable, automatic failover to Claude keeps the conversation flowing.
  • Speech context matters. Understanding who's speaking, conversational turns, and timing affects response quality.

For teams building voice agents, the Voice Agent API packages LLM gateway with Universal-3 Pro Streaming (STT) and text-to-speech into a single $4.50/hr bundle. One WebSocket, one bill, one set of logs. The LLM step is handled internally—you just focus on your agent's logic and conversation design.

Build voice agents with one API

Combine streaming STT, LLM gateway, and text-to-speech in a single WebSocket connection at $4.50/hr flat.

Explore Voice Agent API

Getting started

If you're already an AssemblyAI customer, you can start using LLM gateway today with your existing API key. No new signup, no separate credentials.

The fastest way to see LLM gateway in action is to transcribe an audio file with AssemblyAI and then pass that transcript to an LLM through the same API key. The documentation includes copy-paste examples in Python, JavaScript, and curl.
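As a sketch of that two-step flow: the code below only builds the two requests to show that a single credential covers both halves of the pipeline. The gateway path, model id, and header name are placeholders, so copy the real snippets from the documentation rather than these.

```python
# Sketch: one API key for both transcription and LLM calls. Endpoint
# paths, header names, and the model id are placeholders -- use the
# copy-paste examples in AssemblyAI's docs for real requests.
API_KEY = "YOUR_ASSEMBLYAI_API_KEY"  # one key for both steps

def transcription_request(audio_url: str) -> dict:
    """Step 1: submit audio for transcription (placeholder path)."""
    return {
        "url": "https://api.assemblyai.com/v2/transcript",
        "headers": {"authorization": API_KEY},
        "json": {"audio_url": audio_url},
    }

def llm_request(transcript_text: str, model: str) -> dict:
    """Step 2: route the finished transcript through the gateway
    (placeholder path and model id)."""
    return {
        "url": "https://api.assemblyai.com/llm-gateway/chat/completions",
        "headers": {"authorization": API_KEY},
        "json": {
            "model": model,
            "messages": [
                {"role": "user", "content": f"Summarize: {transcript_text}"},
            ],
        },
    }

step1 = transcription_request("https://example.com/call.mp3")
step2 = llm_request("Speaker A: Hello...", "claude-4-5-sonnet")
# The same credential authenticates both halves of the pipeline.
assert step1["headers"] == step2["headers"]
```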

Frequently asked questions

What is the difference between an LLM gateway and an AI gateway?

The terms are often used interchangeably, but there's a distinction. An "AI gateway" is a broader category that might include routing to any AI model—image generation, embedding models, speech recognition, and LLMs. An "LLM gateway" specifically routes requests to large language models.

In practice, most products marketed as "AI gateways" are primarily LLM gateways with some additional model types supported. AssemblyAI's LLM gateway focuses specifically on language models while integrating natively with AssemblyAI's speech-to-text infrastructure—giving you both capabilities through one platform.

What is the difference between an LLM gateway and an MCP gateway?

MCP (Model Context Protocol) is an open standard from Anthropic for connecting AI models to external tools and data sources. An MCP gateway manages these tool connections and context handoffs.

An LLM gateway handles model routing and API abstraction—choosing which model processes a request, managing fallbacks, and normalizing response formats. The two can work together: an LLM gateway routes your request to the right model, while MCP-compliant integrations give that model access to tools and external context.

AssemblyAI's LLM gateway focuses on the routing and abstraction layer, with speech context preservation as a first-class feature.

Can I use LLM gateway without AssemblyAI's transcription?

Yes. LLM gateway is a standalone service that works with any text input. If you're using a different transcription provider (or working with text that isn't from speech at all), you can still route your LLM calls through AssemblyAI's gateway to benefit from unified billing, automatic fallbacks, and zero markup pricing.

That said, the full value of LLM gateway comes from integration with AssemblyAI's transcription pipeline. The speech-native context preservation and elimination of network hops between STT and LLM are specific advantages for voice applications.

How does billing work?

LLM usage is billed at exactly the rates model providers charge—no markup. Charges appear on your existing AssemblyAI invoice alongside transcription usage. You don't need separate accounts or payment methods for each model provider.

If you're using the Voice Agent API, the pricing model is different: a flat $4.50/hr covers STT, LLM, and TTS together. The LLM costs are bundled into that rate rather than charged separately per token.

Usage metrics—tokens consumed, requests by model, latency percentiles—are available in your AssemblyAI dashboard. This makes it easy to track costs and optimize model selection.

What happens when new models are released?

AssemblyAI adds new models to LLM gateway as providers release them. You'll see announcements in release notes and documentation updates. Using a new model typically requires changing just the model parameter in your API call—no SDK updates, no schema changes.

When providers deprecate models, AssemblyAI provides migration timelines and guidance. Fallback configurations continue working even as individual models are sunset, so your application stays resilient.

The goal is to absorb the churn of the rapidly evolving model landscape so your integration code can stay stable. You focus on what your application does with LLM outputs; LLM gateway handles keeping up with the providers.
