Insights & Use Cases
May 5, 2026

What is LLM Gateway?

LLM gateway explained: compare features like routing, failover, security, and cost control so you can choose the right option for production AI apps today.

Kelsey Foster
Growth

Building a Voice AI application means combining speech-to-text with language model processing—and that second step is where complexity accumulates fast. Which model handles summarization best? What happens when a provider rate-limits you mid-call? How do you A/B test GPT-5 against Claude without rewriting your integration? The growing LLM gateway market exists precisely because these problems are universal—and when your primary model goes down at 2am, you need infrastructure that handles it automatically.

AssemblyAI's LLM gateway solves all of this from inside the same platform you already use for transcription. One OpenAI-compatible API endpoint. Twenty-plus models across four providers. Zero markup on pricing. And because it runs inside the same infrastructure as your speech-to-text pipeline, there's no extra network hop eating into your latency budget.

This guide explains what AssemblyAI's LLM gateway is, how it compares to alternatives, and why it was built specifically for Voice AI—not as a generic multi-model proxy.

What is AssemblyAI's LLM gateway?

AssemblyAI's LLM gateway is an OpenAI-compatible API that routes your transcripts through leading LLMs from Anthropic, OpenAI, Google, and Meta—without managing separate API keys, accounts, or billing relationships. Change one base URL, keep your existing code, and you're running on a different model.

Unlike general-purpose LLM proxies, LLM gateway is designed around spoken data. It understands the structure of a transcript—speaker labels, word-level timestamps, overlapping turns—and preserves that context when routing to downstream models. You're not sending raw text; you're sending speech-aware context.

But here's where it gets interesting. Because LLM gateway runs inside the same infrastructure as AssemblyAI's transcription pipeline, there's no additional network hop between your speech-to-text and your LLM call. Every utterance in a voice agent flows through a single system: speech in, LLM processing, action out. That architecture difference matters when you're optimizing for sub-second response times.

Here's what changes when you add AssemblyAI's LLM gateway to your Voice AI stack:

| Without LLM gateway | With LLM gateway |
|---|---|
| Separate API keys and billing per LLM provider | One AssemblyAI API key for all 20+ models across 4 providers |
| Manual prompt formatting per provider | Automatic prompt construction from transcript context |
| Brittle model-switching code | Swap providers with a single parameter change (OpenAI-compatible API) |
| Rebuild integrations for every new model | New models added automatically as they release |
| 5-5.5% markup on model costs (OpenRouter, llmgateway.io) | Zero markup—pay exactly what providers charge |
| Client-side retry logic when models fail or timeout | Automatic cross-provider fallbacks with same response schema |
| Third-party proxy libraries with massive dependency trees | Minimal-dependency architecture built by hand |
| Extra network hop between STT and LLM | Runs inside same infrastructure as transcription pipeline |

Try LLM gateway with your existing API key

Access 20+ models from OpenAI, Anthropic, Google, and Meta through one endpoint—with zero markup on provider costs.

Get started free

When do you need AssemblyAI's LLM gateway?

LLM gateway makes sense when your application involves spoken data and you want to apply language models to it without managing the complexity yourself. Here are the scenarios where it delivers the most value:

  • You're already using AssemblyAI for transcription. You don't need a new account, a new API key, or a new billing relationship. Your existing credentials work. Your existing dashboard shows usage across both transcription and LLM calls.
  • You want to compare models without rewriting code. Testing GPT-5 against Claude 4.5 Sonnet? Change one parameter. The API contract stays identical, so your integration code doesn't need to know which model is handling the request.
  • You need automatic fallbacks. If your primary model errors or exceeds a latency threshold, LLM gateway can automatically reroute to a backup model. Same schema, same response format. No client-side retry logic required.
  • Pricing markup matters to you. OpenRouter charges 5-5.5% on top of provider costs. llmgateway.io charges 5%. AssemblyAI charges zero markup—you pay exactly what the model providers charge. At scale, that difference adds up.
  • You're building voice agents. LLM gateway powers the LLM step in AssemblyAI's Voice Agent API pipeline. One WebSocket connection handles STT, LLM, and TTS at a flat $4.50/hr rate. If you're building conversational voice interfaces, this is the fastest path from speech to action.
  • You're concerned about supply chain security. In March 2026, LiteLLM—a popular open-source LLM proxy library—was hit by a supply chain attack that stole API credentials. AssemblyAI was unaffected because LLM gateway doesn't depend on third-party proxy libraries. Every provider integration is built by hand, with a minimal dependency tree.

Supported models

LLM gateway provides access to 20+ models across four providers:

| Provider | Models | Use cases |
|---|---|---|
| Anthropic | Claude 4.5 Sonnet, Claude 4.5 Haiku, Claude 4 Sonnet, Claude 4 Opus | Long-context summarization, nuanced analysis, complex reasoning |
| OpenAI | GPT-5, GPT-5 mini, GPT-5 nano, GPT-4.1, ChatGPT-4o, gpt-oss-20b, gpt-oss-120b | General-purpose processing, function calling, structured outputs |
| Google | Gemini 3 Pro, Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite | Very long context windows, fast inference, cost-effective processing |
| Meta | Llama 4 Scout, Llama 4 Maverick | Open-weight flexibility, fine-tuning compatibility, cost optimization |

This is a curated catalog of tested, reliable models—not 300+ options where you're left wondering which ones actually work. Every model in LLM gateway has been tested against voice data and maintains consistent behavior through the unified API.

Note on model availability: Models are added as providers release them and deprecated when providers sunset them. Check the API documentation for the current list.

Key features

Unified interface with one API key

LLM gateway exposes an OpenAI-compatible chat completions endpoint. If you've written code against OpenAI's API, you already know the interface. Change the base URL to AssemblyAI's endpoint, swap in your AssemblyAI API key, and you're routing through LLM gateway.

The practical impact: you can switch between GPT-5 and Claude 4.5 Sonnet by changing one parameter. No SDK changes. No authentication rework. No schema translation.

For teams already using AssemblyAI for transcription, this means one API key for your entire Voice AI stack. One set of rate limits to manage. One dashboard for monitoring. One invoice at the end of the month.
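As a concrete sketch of what "one parameter change" means in practice: the request body below follows OpenAI's chat completions schema, but the model identifiers are hypothetical placeholders, so check AssemblyAI's documentation for the real base URL and model names before copying this.

```python
# Sketch: an OpenAI-compatible payload where swapping providers is one string.
# The model ids below are placeholders -- consult AssemblyAI's docs for the
# actual catalog and the gateway's base URL.

def build_chat_request(model: str, transcript: str) -> dict:
    """Build an OpenAI-compatible /chat/completions payload.

    The `model` field is the only thing that changes when you swap
    providers; the payload shape and your parsing code stay identical.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize this call transcript."},
            {"role": "user", "content": transcript},
        ],
    }

transcript = "Speaker A: Hi, I'm calling about my invoice..."
gpt_payload = build_chat_request("gpt-5", transcript)                 # hypothetical id
claude_payload = build_chat_request("claude-4-5-sonnet", transcript)  # hypothetical id

# Everything except the model field is identical between the two requests.
assert gpt_payload["messages"] == claude_payload["messages"]
```

Because the contract is identical on both sides, A/B testing two providers reduces to parameterizing that one string in your config.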

Speech-native context preservation

This is the core differentiator. When you route a transcript through LLM gateway, you're not just sending text—you're sending structured speech data. Speaker diarization, word-level timestamps, turn boundaries, and confidence scores can all flow through to your LLM prompts.

Why does this matter? Because voice applications require different context than text applications. Knowing who said what (and when) changes how you summarize a meeting, analyze a call, or generate a response in a voice agent.

More importantly, LLM gateway runs inside the same infrastructure as AssemblyAI's transcription pipeline. There's no extra network hop between your speech-to-text output and your LLM input. For voice agents where every millisecond counts, eliminating that round-trip matters. Speech flows in, gets transcribed, hits the LLM, and returns an action—all within a single system.

Teams building voice-powered products adopt this architecture because latency directly shapes user experience in real-time conversation.

Full chat completion capabilities

LLM gateway supports the full OpenAI chat completions feature set:

  • Streaming responses: Get tokens as they're generated for real-time UIs
  • Function calling: Define tools and let models invoke them with structured arguments
  • JSON mode: Force structured JSON outputs for downstream processing
  • System prompts: Set context and behavior across conversation turns
  • Multi-turn conversations: Maintain conversation history for contextual responses

Whatever you're doing with OpenAI's API today, LLM gateway supports it—while giving you access to Claude, Gemini, and Llama through the same interface.
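To make the feature list concrete, here is a hedged sketch of how streaming, function calling, and JSON mode appear together in a request body. The field names follow OpenAI's chat completions schema; the model id and the `create_ticket` tool are invented for illustration.

```python
# Sketch: composing streaming, tool use, and JSON mode in one
# OpenAI-compatible payload. The model id and tool name are hypothetical.

def build_feature_request(model: str, user_text: str) -> dict:
    """Compose common chat-completions options in a single payload."""
    return {
        "model": model,
        "stream": True,  # tokens arrive incrementally for real-time UIs
        "messages": [
            {"role": "system", "content": "You are a call analyst. Reply in JSON."},
            {"role": "user", "content": user_text},
        ],
        # JSON mode: force structured output for downstream processing.
        "response_format": {"type": "json_object"},
        # Function calling: declare a tool the model may invoke with
        # structured arguments. `create_ticket` is a hypothetical example.
        "tools": [{
            "type": "function",
            "function": {
                "name": "create_ticket",
                "parameters": {
                    "type": "object",
                    "properties": {"summary": {"type": "string"}},
                    "required": ["summary"],
                },
            },
        }],
    }

payload = build_feature_request("gpt-5", "Customer reports a billing error.")
assert payload["stream"] is True
assert payload["tools"][0]["function"]["name"] == "create_ticket"
```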

Zero-markup pricing

AssemblyAI charges exactly what model providers charge. No percentage fee on top. No hidden costs.

To put this in perspective:

  • OpenRouter: 5-5.5% markup on provider costs
  • llmgateway.io: 5% markup on provider costs
  • AssemblyAI LLM gateway: 0% markup

At low volume, the difference is negligible. At scale—processing thousands of transcripts per day through GPT-5 or Claude—those percentages translate to real dollars. If you're already paying for transcription infrastructure, there's no reason to pay an additional tax on your LLM calls.
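To put rough numbers on "adds up": the monthly spend figure below is illustrative, not a quoted price, but the markup percentages are the ones cited above.

```python
# Illustrative arithmetic: what a 5-5.5% gateway markup costs at scale.
# The spend figure is hypothetical; the percentages come from the
# comparison above.
monthly_provider_spend = 20_000.00  # hypothetical raw LLM spend in USD

openrouter_markup = monthly_provider_spend * 0.055   # upper bound, 5.5%
llmgatewayio_markup = monthly_provider_spend * 0.05  # 5%
assemblyai_markup = monthly_provider_spend * 0.0     # zero markup

print(f"OpenRouter markup:    ${openrouter_markup:,.2f}/month")    # $1,100.00
print(f"llmgateway.io markup: ${llmgatewayio_markup:,.2f}/month")  # $1,000.00
print(f"AssemblyAI markup:    ${assemblyai_markup:,.2f}/month")    # $0.00
```

At that hypothetical volume, a 5% markup is roughly $12,000 a year spent on routing alone.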

Billing is unified with your existing AssemblyAI account. One invoice shows transcription usage and LLM usage together. No separate payment methods. No reconciling bills from multiple vendors.

Automatic cross-provider fallbacks

Here's a scenario: your application uses GPT-5 as the primary model. OpenAI experiences an outage, or your request hits rate limits, or latency spikes above your threshold. What happens?

With direct API calls or basic proxies, your request fails. You catch the error, implement retry logic, maybe try a different provider, handle the different response format, and hope your fallback is actually available.

With LLM gateway, you configure fallback models in advance. If the primary model fails or exceeds latency thresholds, the request automatically reroutes to your backup—Claude, Gemini, Llama, whatever you've specified. The response comes back in the same schema. Your client code doesn't know anything changed.

This isn't just convenience. For voice agents, a failed LLM call means dead air. For real-time applications, a 10-second timeout means a broken experience. Automatic fallbacks keep your application responsive when individual providers have problems.
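The gateway performs this rerouting server-side, but the pattern it replaces looks roughly like the client-side sketch below. The model ids are hypothetical, and the simulated outage stands in for a real rate limit or timeout.

```python
# Sketch of the client-side fallback loop that gateway-managed failover
# makes unnecessary. Model names are hypothetical.

def call_with_fallbacks(models, send):
    """Try each model in order until one succeeds.

    `send` is any callable that takes a model id and either returns a
    response or raises. LLM gateway does the equivalent rerouting
    server-side, so client code never needs this loop.
    """
    last_error = None
    for model in models:
        try:
            return model, send(model)
        except Exception as exc:  # rate limit, outage, timeout, ...
            last_error = exc
    raise RuntimeError("all models failed") from last_error

# Simulate a primary-provider outage: the first model raises, the backup answers.
def fake_send(model):
    if model == "gpt-5":
        raise TimeoutError("primary provider timed out")
    return {"content": f"answer from {model}"}

used, response = call_with_fallbacks(["gpt-5", "claude-4-5-sonnet"], fake_send)
assert used == "claude-4-5-sonnet"
```

Note what the client code has to get right here: per-provider error types, retry ordering, and response normalization. Moving that loop server-side is what lets the response come back in the same schema regardless of which model answered.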

Minimal-dependency security

In March 2026, LiteLLM—one of the most widely used open-source LLM proxy libraries—was compromised in a supply chain attack. Malicious code in a dependency update exfiltrated API keys to external servers. Teams using LiteLLM had their OpenAI, Anthropic, and other credentials exposed.

AssemblyAI was unaffected. LLM gateway doesn't use LiteLLM or similar third-party proxy libraries. Every provider integration is built by hand, with intentionally minimal dependencies. There's no massive dependency tree where a single compromised package can steal your credentials.

This architecture decision predates the LiteLLM incident—it's a deliberate choice to control security-critical code paths. But the incident validates the approach. If you're routing production traffic and API keys through infrastructure, you want that infrastructure to have the smallest possible attack surface.

Always up-to-date model support

When Anthropic releases Claude 5 or OpenAI launches its next GPT model, you don't rebuild your integration. AssemblyAI adds new models to LLM gateway as providers release them. Your code stays the same; you just change the model parameter to access new capabilities.

Similarly, when providers deprecate older models, AssemblyAI provides migration guidance and ensures fallback configurations continue working. The model landscape changes constantly—LLM gateway absorbs that churn so your application code doesn't have to.

How AssemblyAI's LLM gateway compares to alternatives

There are multiple ways to access LLMs: direct provider APIs, multi-model proxies like OpenRouter, or platform-integrated solutions like AssemblyAI's LLM gateway. Here's how they compare:

| Criteria | AssemblyAI LLM gateway | OpenRouter | llmgateway.io | Direct provider APIs |
|---|---|---|---|---|
| Pricing markup | 0% | 5-5.5% | 5% | 0% |
| Model count | 20+ curated models | 300+ models | Varies | Provider-specific |
| Automatic fallbacks | Yes, cross-provider with same schema | Limited | Limited | No—manual implementation |
| Voice AI integration | Native—same infrastructure as STT | No | No | Separate integration |
| Dependency security | Minimal deps, built by hand | Third-party libraries | Third-party libraries | Provider SDKs only |
| Setup for existing users | Same API key, no new account | New account required | New account required | Account per provider |
| Speech context preservation | Yes—diarization, timestamps, turns | No | No | Manual formatting |

When to choose OpenRouter: If you need access to 300+ models including niche fine-tuned variants, and you're not building voice applications, OpenRouter's breadth makes sense. The markup is the cost of that breadth.

When to choose direct APIs: If you're only using one provider and don't need fallbacks, billing consolidation, or speech-to-text integration, direct APIs are simpler. You avoid any middleware.

When to choose AssemblyAI LLM gateway: If you're building Voice AI, already using AssemblyAI, want zero markup, need automatic fallbacks, or care about supply chain security, LLM gateway is purpose-built for your use case.

See LLM gateway in action

Transcribe audio and route it through GPT-5, Claude, or Gemini—all from the same API. Try it in the Playground.

Open playground

Using LLM gateway with streaming speech-to-text

For real-time applications—live captioning, voice agents, call analytics—you're often processing audio as it streams rather than waiting for a complete recording. LLM gateway integrates directly with AssemblyAI's Streaming Speech-to-Text API.

The architecture looks like this:

  1. Audio streams in via WebSocket to AssemblyAI's Streaming STT
  2. Partial transcripts arrive with sub-second latency as speech is recognized
  3. Final transcripts trigger LLM gateway calls for processing
  4. Results return through the same connection for immediate action

Because LLM gateway runs inside the same infrastructure as the streaming transcription service, there's no extra network hop. The transcript doesn't leave AssemblyAI's system to reach an LLM—it's processed internally and the result comes back faster than if you were making a separate API call to a third-party gateway.
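A minimal simulation of that loop is sketched below. The event shapes are invented for illustration; the real streaming API defines its own message types, so treat this as the control flow, not the wire format.

```python
# Sketch of the streaming STT -> LLM control flow. The (kind, text) event
# tuples are invented for illustration; the actual streaming API has its
# own message schema.

def run_turn(events, llm):
    """Consume streaming STT events and trigger the LLM on final transcripts.

    `events` is an iterable of (kind, text) pairs where kind is "partial"
    or "final"; `llm` is called once per finalized utterance.
    """
    actions = []
    for kind, text in events:
        if kind == "partial":
            continue  # partials update the UI but don't trigger the LLM
        actions.append(llm(text))
    return actions

stt_events = [
    ("partial", "what's my"),
    ("partial", "what's my account"),
    ("final", "What's my account balance?"),
]
actions = run_turn(stt_events, llm=lambda text: f"LLM handled: {text}")
assert actions == ["LLM handled: What's my account balance?"]
```

When steps 3 and 4 happen inside the same infrastructure as step 2, the `llm(...)` call above is an internal handoff rather than a second public-internet round-trip.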

This matters most for voice agents. When a user finishes speaking, you want the LLM response to begin as quickly as possible. Every 100ms of added latency feels like dead air. By eliminating the network round-trip between STT and LLM, you're shaving time off every turn in the conversation.

For teams building with AssemblyAI's Voice Agent API, LLM gateway handles the LLM step automatically. You don't configure it separately—it's part of the pipeline. One WebSocket connection, flat $4.50/hr pricing for the full STT+LLM+TTS stack, and you're focused on your application logic rather than orchestrating multiple services.

Voice AI use cases for LLM gateway

LLM gateway shines when you're applying language models to spoken data. Here are the patterns we see most often:

| Use case | How LLM gateway helps | Typical models |
|---|---|---|
| Meeting summarization | Process transcripts with speaker labels to generate summaries, action items, and follow-ups | Claude 4.5 Sonnet, GPT-5, Gemini 2.5 Pro |
| Call center analytics | Analyze agent/customer conversations for sentiment, compliance, and coaching opportunities | GPT-5 mini, Claude 4.5 Haiku |
| Clinical documentation | Convert patient-provider conversations into structured clinical notes | Claude 4.5 Sonnet, GPT-5 |
| Customer support agents | Power voice agents that understand queries, retrieve information, and respond naturally | GPT-5, Claude 4.5 Sonnet |
| Sales coaching | Analyze sales calls for talk time, objection handling, and messaging adherence | GPT-5 mini, Gemini 2.5 Flash |
| Content repurposing | Transform podcasts, webinars, or interviews into blog posts, social content, and newsletters | Claude 4.5 Sonnet, GPT-5 |
| Appointment scheduling agents | Voice agents that handle inbound scheduling calls, check availability, and confirm bookings | GPT-5 mini, Claude 4.5 Haiku |
| Outbound reminder calls | Voice agents that call patients or customers with appointment reminders and handle rescheduling | GPT-5 mini, Llama 4 Scout |

Voice agent applications

Voice agents represent the fastest-growing category of LLM gateway usage. These are conversational AI applications where users speak naturally and receive spoken responses—think customer support lines that don't require "press 1 for billing," or clinical intake systems that gather patient information through natural conversation.

LLM gateway is particularly well-suited here because:

  • Latency is critical. Running inside the same infrastructure as STT eliminates network hops.
  • Fallbacks prevent dead air. If GPT-5 is slow or unavailable, automatic failover to Claude keeps the conversation flowing.
  • Speech context matters. Understanding who's speaking, conversational turns, and timing affects response quality.

For teams building voice agents, the Voice Agent API packages LLM gateway with Universal-3 Pro Streaming (STT) and text-to-speech into a single $4.50/hr bundle. One WebSocket, one bill, one set of logs. The LLM step is handled internally—you just focus on your agent's logic and conversation design.

Build voice agents with one API

Combine streaming STT, LLM gateway, and text-to-speech in a single WebSocket connection at $4.50/hr flat.

Explore Voice Agent API

Getting started

If you're already an AssemblyAI customer, you can start using LLM gateway today with your existing API key. No new signup, no separate credentials.

The fastest way to see LLM gateway in action is to transcribe an audio file with AssemblyAI and then pass that transcript to an LLM through the same API key. The documentation includes copy-paste examples in Python, JavaScript, and curl.
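As a sketch of that two-step flow: the code below only builds the two requests to show that a single credential covers both halves of the pipeline. The gateway path, model id, and header name are placeholders, so copy the real snippets from the documentation rather than these.

```python
# Sketch: one API key for both transcription and LLM calls. Endpoint
# paths, header names, and the model id are placeholders -- use the
# copy-paste examples in AssemblyAI's docs for real requests.
API_KEY = "YOUR_ASSEMBLYAI_API_KEY"  # one key for both steps

def transcription_request(audio_url: str) -> dict:
    """Step 1: submit audio for transcription (placeholder path)."""
    return {
        "url": "https://api.assemblyai.com/v2/transcript",
        "headers": {"authorization": API_KEY},
        "json": {"audio_url": audio_url},
    }

def llm_request(transcript_text: str, model: str) -> dict:
    """Step 2: route the finished transcript through the gateway
    (placeholder path and model id)."""
    return {
        "url": "https://api.assemblyai.com/llm-gateway/chat/completions",
        "headers": {"authorization": API_KEY},
        "json": {
            "model": model,
            "messages": [
                {"role": "user", "content": f"Summarize: {transcript_text}"},
            ],
        },
    }

step1 = transcription_request("https://example.com/call.mp3")
step2 = llm_request("Speaker A: Hello...", "claude-4-5-sonnet")
# The same credential authenticates both halves of the pipeline.
assert step1["headers"] == step2["headers"]
```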

Frequently asked questions

What is the difference between an LLM gateway and an AI gateway?

The terms are often used interchangeably, but there's a distinction. An "AI gateway" is a broader category that might include routing to any AI model—image generation, embedding models, speech recognition, and LLMs. An "LLM gateway" specifically routes requests to large language models.

In practice, most products marketed as "AI gateways" are primarily LLM gateways with some additional model types supported. AssemblyAI's LLM gateway focuses specifically on language models while integrating natively with AssemblyAI's speech-to-text infrastructure—giving you both capabilities through one platform.

What is the difference between an LLM gateway and an MCP gateway?

MCP (Model Context Protocol) is an open standard from Anthropic for connecting AI models to external tools and data sources. An MCP gateway manages these tool connections and context handoffs.

An LLM gateway handles model routing and API abstraction—choosing which model processes a request, managing fallbacks, and normalizing response formats. The two can work together: an LLM gateway routes your request to the right model, while MCP-compliant integrations give that model access to tools and external context.

AssemblyAI's LLM gateway focuses on the routing and abstraction layer, with speech context preservation as a first-class feature.

Can I use LLM gateway without AssemblyAI's transcription?

Yes. LLM gateway is a standalone service that works with any text input. If you're using a different transcription provider (or working with text that isn't from speech at all), you can still route your LLM calls through AssemblyAI's gateway to benefit from unified billing, automatic fallbacks, and zero markup pricing.

That said, the full value of LLM gateway comes from integration with AssemblyAI's transcription pipeline. The speech-native context preservation and elimination of network hops between STT and LLM are specific advantages for voice applications.

How does billing work?

LLM usage is billed at exactly the rates model providers charge—no markup. Charges appear on your existing AssemblyAI invoice alongside transcription usage. You don't need separate accounts or payment methods for each model provider.

If you're using the Voice Agent API, the pricing model is different: a flat $4.50/hr covers STT, LLM, and TTS together. The LLM costs are bundled into that rate rather than charged separately per token.

Usage metrics—tokens consumed, requests by model, latency percentiles—are available in your AssemblyAI dashboard. This makes it easy to track costs and optimize model selection.

What happens when new models are released?

AssemblyAI adds new models to LLM gateway as providers release them. You'll see announcements in release notes and documentation updates. Using a new model typically requires changing just the model parameter in your API call—no SDK updates, no schema changes.

When providers deprecate models, AssemblyAI provides migration timelines and guidance. Fallback configurations continue working even as individual models are sunset, so your application stays resilient.

The goal is to absorb the churn of the rapidly evolving model landscape so your integration code can stay stable. You focus on what your application does with LLM outputs; LLM gateway handles keeping up with the providers.
