What is LLM Gateway?
LLM gateway explained: compare features like routing, failover, security, and cost control so you can choose the right option for production AI apps today.



Building a Voice AI application means combining speech-to-text with language model processing—and that second step is where complexity accumulates fast. Which model handles summarization best? What happens when a provider rate-limits you mid-call? How do you A/B test GPT-5 against Claude without rewriting your integration? The growing LLM gateway market exists precisely because these problems are universal—and when your primary model goes down at 2am, you need infrastructure that handles it automatically.
AssemblyAI's LLM gateway solves all of this from inside the same platform you already use for transcription. One OpenAI-compatible API endpoint. Twenty-plus models across four providers. Zero markup on pricing. And because it runs inside the same infrastructure as your speech-to-text pipeline, there's no extra network hop eating into your latency budget.
This guide explains what AssemblyAI's LLM gateway is, how it compares to alternatives, and why it was built specifically for Voice AI—not as a generic multi-model proxy.
What is AssemblyAI's LLM gateway?
AssemblyAI's LLM gateway is an OpenAI-compatible API that routes your transcripts through leading LLMs from Anthropic, OpenAI, Google, and Meta—without managing separate API keys, accounts, or billing relationships. Change one base URL, keep your existing code, and you're running on a different model.
Unlike general-purpose LLM proxies, LLM gateway is designed around spoken data. It understands the structure of a transcript—speaker labels, word-level timestamps, overlapping turns—and preserves that context when routing to downstream models. You're not sending raw text; you're sending speech-aware context.
But here's where it gets interesting. Because LLM gateway runs inside the same infrastructure as AssemblyAI's transcription pipeline, there's no additional network hop between your speech-to-text and your LLM call. Every utterance in a voice agent flows through a single system: speech in, LLM processing, action out. That architecture difference matters when you're optimizing for sub-second response times.
The sections that follow break down what changes when you add AssemblyAI's LLM gateway to your Voice AI stack.
When do you need AssemblyAI's LLM gateway?
LLM gateway makes sense when your application involves spoken data and you want to apply language models to it without managing the complexity yourself. Here are the scenarios where it delivers the most value:
- You're already using AssemblyAI for transcription. You don't need a new account, a new API key, or a new billing relationship. Your existing credentials work. Your existing dashboard shows usage across both transcription and LLM calls.
- You want to compare models without rewriting code. Testing GPT-5 against Claude 4.5 Sonnet? Change one parameter (see the sketch after this list). The API contract stays identical, so your integration code doesn't need to know which model is handling the request.
- You need automatic fallbacks. If your primary model errors or exceeds a latency threshold, LLM gateway can automatically reroute to a backup model. Same schema, same response format. No client-side retry logic required.
- Pricing markup matters to you. OpenRouter charges 5-5.5% on top of provider costs. llmgateway.io charges 5%. AssemblyAI charges zero markup—you pay exactly what the model providers charge. At scale, that difference adds up.
- You're building voice agents. LLM gateway powers the LLM step in AssemblyAI's Voice Agent API pipeline. One WebSocket connection handles STT, LLM, and TTS at a flat $4.50/hr rate. If you're building conversational voice interfaces, this is the fastest path from speech to action.
- You're concerned about supply chain security. In March 2026, LiteLLM—a popular open-source LLM proxy library—was hit by a supply chain attack that stole API credentials. AssemblyAI was unaffected because LLM gateway doesn't depend on third-party proxy libraries. Every provider integration is built by hand, with a minimal dependency tree.
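To make the model-comparison workflow concrete, here's a minimal Python sketch. The base URL and model identifiers are assumptions, not documented values; check AssemblyAI's API documentation for the exact endpoint and model names.

```python
# Illustrative sketch: A/B-test two models with one parameter change.
# The base URL and model identifiers are assumptions; check AssemblyAI's docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ASSEMBLYAI_API_KEY",
    base_url="https://llm.assemblyai.com/v1",  # hypothetical gateway endpoint
)

prompt = [{"role": "user", "content": "Summarize this support call: ..."}]
for model in ("gpt-5", "claude-4.5-sonnet"):  # illustrative model names
    result = client.chat.completions.create(model=model, messages=prompt)
    print(f"{model}: {result.choices[0].message.content[:120]}")
```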
Supported models
LLM gateway provides access to 20+ models across four providers: Anthropic, OpenAI, Google, and Meta.
This is a curated catalog of tested, reliable models—not 300+ options where you're left wondering which ones actually work. Every model in LLM gateway has been tested against voice data and maintains consistent behavior through the unified API.
Note on model availability: Models are added as providers release them and deprecated when providers sunset them. Check the API documentation for the current list.
Key features
Unified interface with one API key
LLM gateway exposes an OpenAI-compatible chat completions endpoint. If you've written code against OpenAI's API, you already know the interface. Change the base URL to AssemblyAI's endpoint, swap in your AssemblyAI API key, and you're routing through LLM gateway.
The practical impact: you can switch between GPT-5 and Claude 4.5 Sonnet by changing one parameter. No SDK changes. No authentication rework. No schema translation.
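Here's a minimal sketch of that setup in Python, assuming a hypothetical base URL (the real endpoint is in the API documentation):

```python
# Minimal sketch, assuming a hypothetical gateway base URL -- the real
# endpoint is in AssemblyAI's API documentation.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ASSEMBLYAI_API_KEY",         # your existing AssemblyAI key
    base_url="https://llm.assemblyai.com/v1",  # hypothetical endpoint
)

response = client.chat.completions.create(
    model="gpt-5",  # switch models by changing only this string
    messages=[{"role": "user", "content": "Extract action items from: ..."}],
)
print(response.choices[0].message.content)
```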
For teams already using AssemblyAI for transcription, this means one API key for your entire Voice AI stack. One set of rate limits to manage. One dashboard for monitoring. One invoice at the end of the month.
Speech-native context preservation
This is the core differentiator. When you route a transcript through LLM gateway, you're not just sending text—you're sending structured speech data. Speaker diarization, word-level timestamps, turn boundaries, and confidence scores can all flow through to your LLM prompts.
Why does this matter? Because voice applications require different context than text applications. Knowing who said what (and when) changes how you summarize a meeting, analyze a call, or generate a response in a voice agent.
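As a sketch of what that looks like in practice, using the assemblyai Python SDK with speaker labels enabled (the audio URL is a placeholder):

```python
# Sketch: carrying speaker labels and timestamps into an LLM prompt.
# Uses the assemblyai Python SDK; the audio URL is a placeholder.
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"
config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe("https://example.com/call.mp3", config)

# Each utterance keeps its speaker and start time (in milliseconds),
# so the downstream model knows who said what, and when.
lines = [f"[{u.start} ms] Speaker {u.speaker}: {u.text}"
         for u in transcript.utterances]
prompt = "Summarize this call, noting each speaker's commitments:\n" + "\n".join(lines)
```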
More importantly, LLM gateway runs inside the same infrastructure as AssemblyAI's transcription pipeline, so there's no extra network hop between your speech-to-text output and your LLM input. Speech flows in, gets transcribed, hits the LLM, and returns an action, all within a single system. For voice agents, where every added millisecond between transcription and response reads as dead air, eliminating that round-trip directly improves the user experience.
Full chat completion capabilities
LLM gateway supports the full OpenAI chat completions feature set:
- Streaming responses: Get tokens as they're generated for real-time UIs
- Function calling: Define tools and let models invoke them with structured arguments
- JSON mode: Force structured JSON outputs for downstream processing
- System prompts: Set context and behavior across conversation turns
- Multi-turn conversations: Maintain conversation history for contextual responses
Whatever you're doing with OpenAI's API today, LLM gateway supports it—while giving you access to Claude, Gemini, and Llama through the same interface.
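For example, streaming through the gateway looks exactly like streaming against OpenAI directly; in this sketch, the base URL and model name are assumptions:

```python
# Streaming sketch -- identical to streaming against OpenAI directly.
# The base URL and model name are assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_ASSEMBLYAI_API_KEY",
                base_url="https://llm.assemblyai.com/v1")  # hypothetical

stream = client.chat.completions.create(
    model="claude-4.5-sonnet",  # illustrative
    messages=[{"role": "user", "content": "Draft a follow-up email for this call: ..."}],
    stream=True,  # tokens arrive as they're generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```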
Zero-markup pricing
AssemblyAI charges exactly what model providers charge. No percentage fee on top. No hidden costs.
To put this in perspective:
- OpenRouter: 5-5.5% markup on provider costs
- llmgateway.io: 5% markup on provider costs
- AssemblyAI LLM gateway: 0% markup
At low volume, the difference is negligible. At scale—processing thousands of transcripts per day through GPT-5 or Claude—those percentages translate to real dollars. If you're already paying for transcription infrastructure, there's no reason to pay an additional tax on your LLM calls.
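For a concrete sense of scale: on $20,000 per month of provider spend, a 5% markup adds $1,000 a month, or $12,000 a year, purely for request routing.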
Billing is unified with your existing AssemblyAI account. One invoice shows transcription usage and LLM usage together. No separate payment methods. No reconciling bills from multiple vendors.
Automatic cross-provider fallbacks
Here's a scenario: your application uses GPT-5 as the primary model. OpenAI experiences an outage, or your request hits rate limits, or latency spikes above your threshold. What happens?
With direct API calls or basic proxies, your request fails. You catch the error, implement retry logic, maybe try a different provider, handle the different response format, and hope your fallback is actually available.
With LLM gateway, you configure fallback models in advance. If the primary model fails or exceeds latency thresholds, the request automatically reroutes to your backup—Claude, Gemini, Llama, whatever you've specified. The response comes back in the same schema. Your client code doesn't know anything changed.
This isn't just convenience. For voice agents, a failed LLM call means dead air. For real-time applications, a 10-second timeout means a broken experience. Automatic fallbacks keep your application responsive when individual providers have problems.
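The exact configuration mechanism lives in AssemblyAI's API reference; as a purely hypothetical sketch of the pattern (the fallback_models field below is an illustrative placeholder, not a documented parameter), the shape is roughly:

```python
# Hypothetical sketch only: "fallback_models" is an illustrative placeholder,
# not a documented parameter. Consult the API reference for the real mechanism.
from openai import OpenAI

client = OpenAI(api_key="YOUR_ASSEMBLYAI_API_KEY",
                base_url="https://llm.assemblyai.com/v1")  # hypothetical

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Classify this caller's intent: ..."}],
    extra_body={"fallback_models": ["claude-4.5-sonnet", "gemini-2.5-pro"]},  # illustrative
)
# Whichever model ends up serving the request, the response schema is the same.
print(response.choices[0].message.content)
```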
Minimal-dependency security
In March 2026, LiteLLM—one of the most widely used open-source LLM proxy libraries—was compromised in a supply chain attack. Malicious code in a dependency update exfiltrated API keys to external servers. Teams using LiteLLM had their OpenAI, Anthropic, and other credentials exposed.
AssemblyAI was unaffected. LLM gateway doesn't use LiteLLM or similar third-party proxy libraries. Every provider integration is built by hand, with intentionally minimal dependencies. There's no massive dependency tree where a single compromised package can steal your credentials.
This architecture decision predates the LiteLLM incident—it's a deliberate choice to control security-critical code paths. But the incident validates the approach. If you're routing production traffic and API keys through infrastructure, you want that infrastructure to have the smallest possible attack surface.
Always up-to-date model support
When Anthropic releases Claude 5 or OpenAI ships its next flagship model, you don't rebuild your integration. AssemblyAI adds new models to LLM gateway as providers release them. Your code stays the same; you just change the model parameter to access new capabilities.
Similarly, when providers deprecate older models, AssemblyAI provides migration guidance and ensures fallback configurations continue working. The model landscape changes constantly—LLM gateway absorbs that churn so your application code doesn't have to.
How AssemblyAI's LLM gateway compares to alternatives
There are multiple ways to access LLMs: direct provider APIs, multi-model proxies like OpenRouter, or platform-integrated solutions like AssemblyAI's LLM gateway. Here's how they compare:
When to choose OpenRouter: If you need access to 300+ models including niche fine-tuned variants, and you're not building voice applications, OpenRouter's breadth makes sense. The markup is the cost of that breadth.
When to choose direct APIs: If you're only using one provider and don't need fallbacks, billing consolidation, or speech-to-text integration, direct APIs are simpler. You avoid any middleware.
When to choose AssemblyAI LLM gateway: If you're building Voice AI, already using AssemblyAI, want zero markup, need automatic fallbacks, or care about supply chain security, LLM gateway is purpose-built for your use case.
Using LLM gateway with streaming speech-to-text
For real-time applications—live captioning, voice agents, call analytics—you're often processing audio as it streams rather than waiting for a complete recording. LLM gateway integrates directly with AssemblyAI's Streaming Speech-to-Text API.
The architecture looks like this:
- Audio streams in via WebSocket to AssemblyAI's Streaming STT
- Partial transcripts arrive with sub-second latency as speech is recognized
- Final transcripts trigger LLM gateway calls for processing
- Results return through the same connection for immediate action
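As a compressed sketch of that loop (streaming interfaces evolve quickly, so treat the class and callback names here as assumptions and follow the current SDK docs; the gateway base URL is also hypothetical):

```python
# Compressed sketch of the streaming loop. Class/callback names and the
# gateway base URL are assumptions; follow the current SDK documentation.
import assemblyai as aai
from openai import OpenAI

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"
llm = OpenAI(api_key="YOUR_ASSEMBLYAI_API_KEY",
             base_url="https://llm.assemblyai.com/v1")  # hypothetical

def on_data(transcript: aai.RealtimeTranscript):
    # Step 3: only final transcripts trigger an LLM gateway call.
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        reply = llm.chat.completions.create(
            model="gpt-5",  # illustrative
            messages=[{"role": "user", "content": transcript.text}],
        )
        print(reply.choices[0].message.content)  # Step 4: act on the result

# Steps 1 and 2: stream microphone audio and receive partial/final transcripts.
transcriber = aai.RealtimeTranscriber(sample_rate=16_000, on_data=on_data)
transcriber.connect()
transcriber.stream(aai.extras.MicrophoneStream(sample_rate=16_000))
```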
Because LLM gateway runs inside the same infrastructure as the streaming transcription service, there's no extra network hop. The transcript doesn't leave AssemblyAI's system to reach an LLM—it's processed internally and the result comes back faster than if you were making a separate API call to a third-party gateway.
This matters most for voice agents. When a user finishes speaking, you want the LLM response to begin as quickly as possible. Every 100ms of added latency feels like dead air. By eliminating the network round-trip between STT and LLM, you're shaving time off every turn in the conversation.
For teams building with AssemblyAI's Voice Agent API, LLM gateway handles the LLM step automatically. You don't configure it separately—it's part of the pipeline. One WebSocket connection, flat $4.50/hr pricing for the full STT+LLM+TTS stack, and you're focused on your application logic rather than orchestrating multiple services.
Voice AI use cases for LLM gateway
LLM gateway shines when you're applying language models to spoken data. Here are the patterns we see most often:
Voice agent applications
Voice agents represent the fastest-growing category of LLM gateway usage. These are conversational AI applications where users speak naturally and receive spoken responses—think customer support lines that don't require "press 1 for billing," or clinical intake systems that gather patient information through natural conversation.
LLM gateway is particularly well-suited here because:
- Latency is critical. Running inside the same infrastructure as STT eliminates network hops.
- Fallbacks prevent dead air. If GPT-5 is slow or unavailable, automatic failover to Claude keeps the conversation flowing.
- Speech context matters. Understanding who's speaking, conversational turns, and timing affects response quality.
For teams building voice agents, the Voice Agent API packages LLM gateway with Universal-3 Pro Streaming (STT) and text-to-speech into a single $4.50/hr bundle. One WebSocket, one bill, one set of logs. The LLM step is handled internally—you just focus on your agent's logic and conversation design.
Getting started
If you're already an AssemblyAI customer, you can start using LLM gateway today with your existing API key. No new signup, no separate credentials.
The fastest way to see LLM gateway in action is to transcribe an audio file with AssemblyAI and then pass that transcript to an LLM through the same API key. The documentation includes copy-paste examples in Python, JavaScript, and curl.
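As a sketch of that flow (the gateway base URL and model name below are assumptions; your real values come from the documentation):

```python
# End-to-end sketch: transcribe with your existing key, then route the
# transcript through the gateway. Base URL and model name are illustrative.
import assemblyai as aai
from openai import OpenAI

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"
aai.settings.api_key = API_KEY

transcript = aai.Transcriber().transcribe("https://example.com/meeting.mp3")

llm = OpenAI(api_key=API_KEY, base_url="https://llm.assemblyai.com/v1")  # hypothetical
summary = llm.chat.completions.create(
    model="claude-4.5-sonnet",  # illustrative
    messages=[{"role": "user",
               "content": f"List the action items from this meeting:\n{transcript.text}"}],
)
print(summary.choices[0].message.content)
```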
Frequently asked questions
What is the difference between an LLM gateway and an AI gateway?
The terms are often used interchangeably, but there's a distinction. An "AI gateway" is a broader category that might include routing to any AI model—image generation, embedding models, speech recognition, and LLMs. An "LLM gateway" specifically routes requests to large language models.
In practice, most products marketed as "AI gateways" are primarily LLM gateways with some additional model types supported. AssemblyAI's LLM gateway focuses specifically on language models while integrating natively with AssemblyAI's speech-to-text infrastructure—giving you both capabilities through one platform.
What is the difference between an LLM gateway and an MCP gateway?
MCP (Model Context Protocol) is an open standard from Anthropic for connecting AI models to external tools and data sources. An MCP gateway manages these tool connections and context handoffs.
An LLM gateway handles model routing and API abstraction—choosing which model processes a request, managing fallbacks, and normalizing response formats. The two can work together: an LLM gateway routes your request to the right model, while MCP-compliant integrations give that model access to tools and external context.
AssemblyAI's LLM gateway focuses on the routing and abstraction layer, with speech context preservation as a first-class feature.
Can I use LLM gateway without AssemblyAI's transcription?
Yes. LLM gateway is a standalone service that works with any text input. If you're using a different transcription provider (or working with text that isn't from speech at all), you can still route your LLM calls through AssemblyAI's gateway to benefit from unified billing, automatic fallbacks, and zero markup pricing.
That said, the full value of LLM gateway comes from integration with AssemblyAI's transcription pipeline. The speech-native context preservation and elimination of network hops between STT and LLM are specific advantages for voice applications.
How does billing work?
LLM usage is billed at exactly the rates model providers charge—no markup. Charges appear on your existing AssemblyAI invoice alongside transcription usage. You don't need separate accounts or payment methods for each model provider.
If you're using the Voice Agent API, the pricing model is different: a flat $4.50/hr covers STT, LLM, and TTS together. The LLM costs are bundled into that rate rather than charged separately per token.
Usage metrics—tokens consumed, requests by model, latency percentiles—are available in your AssemblyAI dashboard. This makes it easy to track costs and optimize model selection.
What happens when new models are released?
AssemblyAI adds new models to LLM gateway as providers release them. You'll see announcements in release notes and documentation updates. Using a new model typically requires changing just the model parameter in your API call—no SDK updates, no schema changes.
When providers deprecate models, AssemblyAI provides migration timelines and guidance. Fallback configurations continue working even as individual models are sunset, so your application stays resilient.
The goal is to absorb the churn of the rapidly evolving model landscape so your integration code can stay stable. You focus on what your application does with LLM outputs; LLM gateway handles keeping up with the providers.