Prompt Caching

Overview

Prompt caching lets you avoid reprocessing the same prompt content on every request. When you send a long system prompt, tool definitions, or conversation history repeatedly, the LLM provider can cache that content and reuse it on subsequent requests — reducing both latency and cost.

The LLM Gateway supports prompt caching across all major providers, with each provider using its own caching mechanism:

Provider | Caching behavior | Configuration required?
Claude | Explicit opt-in | Yes — add cache_control to messages
OpenAI | Automatic | No — caching happens implicitly
Gemini | Automatic | No — caching happens implicitly

Cached input tokens are billed at a discounted rate compared to regular input tokens. The exact discount depends on the model and provider.

Prompt caching only activates when the cacheable portion of your prompt meets a minimum token threshold: both Claude and OpenAI require at least 1,024 tokens (2,048 for certain Claude models). Shorter prompts won’t benefit from caching.

Claude models

Claude models require you to explicitly mark which content blocks to cache using the cache_control field. Add cache_control with type set to "ephemeral" on any message you want cached.

import os
import requests

headers = {
    "authorization": os.environ["ASSEMBLYAI_API_KEY"]
}

system_prompt = (
    "You are a customer support agent for Acme Corp. "
    "You have access to our full product catalog, pricing, "
    "and policy documentation. Always be helpful and concise."
)

response = requests.post(
    "https://llm-gateway.assemblyai.com/v1/chat/completions",
    headers=headers,
    json={
        "model": "claude-sonnet-4-6",
        "messages": [
            {
                "role": "system",
                "content": system_prompt,
                "cache_control": {"type": "ephemeral"}
            },
            {
                "role": "user",
                "content": "What is your return policy?"
            }
        ],
        "max_tokens": 1000
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])

# Check cache usage in the response
usage = result["usage"]
cache_details = usage.get("prompt_tokens_details", {})
print(f"Cached tokens: {cache_details.get('cached_tokens', 0)}")

You can also set cache_control on tool result messages to cache tool interaction history in multi-turn agentic conversations.
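
For example, here’s a minimal sketch of a cache breakpoint on a tool result, assuming OpenAI-style tool_calls and tool messages; the tool name, call ID, and arguments are illustrative. This list would replace the messages array in the request shown above.

# Sketch: cache the tool interaction history in a multi-turn conversation.
# The lookup_price tool, call ID, and arguments are illustrative.
messages = [
    {
        "role": "system",
        "content": "You are a customer support agent for Acme Corp...",
        "cache_control": {"type": "ephemeral"}
    },
    {"role": "user", "content": "How much does the Acme Widget cost?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "lookup_price", "arguments": "{\"sku\": \"WIDGET-01\"}"}
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": "{\"sku\": \"WIDGET-01\", \"price\": 19.99}",
        # Cache breakpoint: everything up to and including this tool result
        "cache_control": {"type": "ephemeral"}
    },
    {"role": "user", "content": "Is that cheaper than the Deluxe Widget?"}
]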

Cache control with TTL

The LLM Gateway extends Anthropic’s native cache_control with an optional ttl field for specifying cache duration. This is a Gateway-specific parameter — Anthropic’s native API does not support it.

{
  "cache_control": {
    "type": "ephemeral",
    "ttl": "5m"
  }
}

If ttl is omitted, Anthropic’s default cache duration applies.
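
A minimal sketch of attaching a TTL to a cached system message follows; the "1h" value is an assumption that mirrors the ephemeral_1h_input_tokens usage field shown later, so confirm the accepted durations against the Gateway’s current behavior.

# Sketch: a system message cached with an explicit TTL (Claude models).
# "1h" is an assumed value mirroring the ephemeral_1h_input_tokens usage
# field; "5m" is the value shown above.
system_message = {
    "role": "system",
    "content": "You are a customer support agent for Acme Corp...",
    "cache_control": {"type": "ephemeral", "ttl": "1h"}
}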

Where to place cache_control

The cache_control field can be placed on:

  • System messages — Cache long system prompts that don’t change between requests
  • User and assistant messages — Cache conversation history in multi-turn flows
  • Tool result messages — Cache tool call outputs in agentic workflows

For best results with Claude, place cache_control on the content that stays the same across requests — typically the system prompt and any static context. Content after the last cache breakpoint is not cached.
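
As a sketch, a multi-turn conversation might carry two breakpoints: one on the static system prompt and one on the last settled turn of history, so only the newest user message is processed uncached. The message contents here are illustrative.

# Sketch: breakpoints on the static system prompt and on the most recent
# settled turn; only the final user message falls outside the cache.
messages = [
    {
        "role": "system",
        "content": "You are a customer support agent for Acme Corp...",
        "cache_control": {"type": "ephemeral"}  # breakpoint 1: static context
    },
    {"role": "user", "content": "What is your return policy?"},
    {
        "role": "assistant",
        "content": "Items can be returned within 30 days of delivery...",
        "cache_control": {"type": "ephemeral"}  # breakpoint 2: prior turns
    },
    {"role": "user", "content": "Does that apply to sale items?"}  # not cached
]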

OpenAI models

OpenAI models cache prompts automatically. No configuration is needed — the gateway passes your requests through and caching happens on OpenAI’s infrastructure.

import os
import requests

headers = {
    "authorization": os.environ["ASSEMBLYAI_API_KEY"]
}

# OpenAI models cache automatically — no cache_control needed
response = requests.post(
    "https://llm-gateway.assemblyai.com/v1/chat/completions",
    headers=headers,
    json={
        "model": "gpt-4.1",
        "messages": [
            {
                "role": "system",
                "content": "You are a customer support agent..."
            },
            {
                "role": "user",
                "content": "What is your return policy?"
            }
        ],
        "max_tokens": 1000
    }
)

result = response.json()

# Cached tokens still appear in usage
cache_details = result["usage"].get("prompt_tokens_details", {})
print(f"Cached tokens: {cache_details.get('cached_tokens', 0)}")

You can optionally configure cache behavior with two additional request-level fields:

Field | Type | Description
prompt_cache_retention | string | Controls how long cached content is retained on OpenAI’s infrastructure. Values are passed through to OpenAI’s API — refer to OpenAI’s documentation for current allowed values.
prompt_cache_key | string | A custom key to group related requests for caching. Requests with the same key are more likely to share cached content.
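
A sketch of setting both fields on a request follows; the retention value and cache key are placeholders, so check OpenAI’s documentation for the retention values it currently accepts.

import os
import requests

headers = {
    "authorization": os.environ["ASSEMBLYAI_API_KEY"]
}

# Sketch: request-level cache fields for OpenAI models. The retention
# value and cache key below are illustrative placeholders.
response = requests.post(
    "https://llm-gateway.assemblyai.com/v1/chat/completions",
    headers=headers,
    json={
        "model": "gpt-4.1",
        "messages": [
            {"role": "system", "content": "You are a customer support agent..."},
            {"role": "user", "content": "What is your return policy?"}
        ],
        "max_tokens": 1000,
        "prompt_cache_retention": "24h",
        "prompt_cache_key": "support-agent-v1"
    }
)

print(response.json()["usage"].get("prompt_tokens_details", {}))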

Gemini models

Gemini models also cache automatically — no configuration needed.

import os
import requests

headers = {
    "authorization": os.environ["ASSEMBLYAI_API_KEY"]
}

# Gemini models cache automatically
response = requests.post(
    "https://llm-gateway.assemblyai.com/v1/chat/completions",
    headers=headers,
    json={
        "model": "gemini-2.5-flash",
        "messages": [
            {
                "role": "system",
                "content": "You are a customer support agent..."
            },
            {
                "role": "user",
                "content": "What is your return policy?"
            }
        ],
        "max_tokens": 1000
    }
)

result = response.json()

# Cached tokens appear in usage
cache_details = result["usage"].get("prompt_tokens_details", {})
print(f"Cached tokens: {cache_details.get('cached_tokens', 0)}")

Reading cache metrics from the response

All providers return cache usage data in the usage.prompt_tokens_details field of the response:

{
  "usage": {
    "input_tokens": 500,
    "output_tokens": 150,
    "total_tokens": 650,
    "prompt_tokens_details": {
      "cached_tokens": 450,
      "cache_creation": {
        "ephemeral_5m_input_tokens": 0,
        "ephemeral_1h_input_tokens": 0
      }
    }
  }
}

Field | Description
cached_tokens | Number of input tokens read from cache (cost savings).
cache_creation.ephemeral_5m_input_tokens | Tokens written to a 5-minute ephemeral cache (Claude only).
cache_creation.ephemeral_1h_input_tokens | Tokens written to a 1-hour ephemeral cache (Claude only).

When cached_tokens is greater than zero, those tokens were served from cache and billed at the discounted cached input rate rather than the standard input rate.
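
If you want to log these metrics consistently, a small helper along the following lines reads the fields described above; it assumes nothing beyond the structure of the example response.

def summarize_cache_usage(result: dict) -> None:
    """Print cache read/write activity from a Gateway chat completion response."""
    usage = result.get("usage", {})
    details = usage.get("prompt_tokens_details", {})
    creation = details.get("cache_creation", {})

    print(f"Input tokens:              {usage.get('input_tokens', 0)}")
    print(f"Read from cache:           {details.get('cached_tokens', 0)}")
    print(f"Written to 5-minute cache: {creation.get('ephemeral_5m_input_tokens', 0)}")
    print(f"Written to 1-hour cache:   {creation.get('ephemeral_1h_input_tokens', 0)}")

# Example: summarize_cache_usage(response.json())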

Best practices

  • Cache your system prompt — System prompts are the best candidates for caching since they stay the same across requests. Place cache_control on the system message for Claude, or rely on automatic caching for OpenAI and Gemini.
  • Cache tool definitions — If you use the same tools across multiple requests, the tool definitions in your prompt are automatically eligible for caching.
  • Order messages for maximum cache hits — Put static content (system prompt, tool definitions) at the beginning of the message array. Content before the cache breakpoint is more likely to match across requests (see the sketch after this list).
  • Monitor cache metrics — Check prompt_tokens_details.cached_tokens in responses to verify caching is working and estimate your cost savings.
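
One way to apply the ordering advice for Claude is to build every request from the same static prefix, as in this sketch; for OpenAI and Gemini the same structure applies without the cache_control field, since their caching is automatic.

# Sketch: keep the static prefix byte-for-byte identical across requests so
# cached content can be reused; only the final user message varies.
STATIC_PREFIX = [
    {
        "role": "system",
        "content": "You are a customer support agent for Acme Corp...",
        "cache_control": {"type": "ephemeral"}  # Claude cache breakpoint
    }
]

def build_messages(user_question: str) -> list:
    return STATIC_PREFIX + [{"role": "user", "content": user_question}]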

API reference

Request

Top-level request parameters

These fields are set at the top level of the request body:

Key | Type | Required? | Description
cache_control | object | No | Default cache control applied to the entire request (Claude models). When set at the request level, it acts as a default for all messages. Contains type and optional ttl.
prompt_cache_retention | string | No | Controls cache retention duration (OpenAI models). Passed through to OpenAI’s API.
prompt_cache_key | string | No | Custom cache key for grouping requests (OpenAI models).
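
A sketch of the request-level default for a Claude model, assuming the same endpoint and headers as the earlier examples; per the table above, it acts as a default for all messages.

# Sketch: request-level cache_control acting as a default for all messages
# (Claude models). Individual messages can still set their own cache_control.
payload = {
    "model": "claude-sonnet-4-6",
    "cache_control": {"type": "ephemeral"},
    "messages": [
        {"role": "system", "content": "You are a customer support agent..."},
        {"role": "user", "content": "What is your return policy?"}
    ],
    "max_tokens": 1000
}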

Message-level cache_control

The cache_control field can also be set on individual messages. Message-level cache_control lets you mark specific cache breakpoints — the provider caches all content up to and including the marked message. This is the recommended approach for Claude models.

Key | Type | Required? | Description
cache_control | object | No | Cache control for this specific message. Marks a cache breakpoint at this position in the conversation.

Cache control object

The cache_control object has the same structure whether used at the request level or message level:

Key | Type | Required? | Description
type | string | Yes | The cache type. Use "ephemeral" for standard caching.
ttl | string | No | Time-to-live for the cached content (Gateway extension — not part of Anthropic’s native API).

Response

Usage fields for cache metrics

Key | Type | Description
usage.prompt_tokens_details.cached_tokens | number | Input tokens served from cache.
usage.prompt_tokens_details.cache_creation.ephemeral_5m_input_tokens | number | Tokens written to 5-minute cache.
usage.prompt_tokens_details.cache_creation.ephemeral_1h_input_tokens | number | Tokens written to 1-hour cache.