# Prompt Caching

<Warning>
  **Public Beta**

  Prompt caching is available in Public Beta.
</Warning>

## Overview

Prompt caching lets you avoid reprocessing the same prompt content on every request. When you send a long system prompt, tool definitions, or conversation history repeatedly, the LLM provider can cache that content and reuse it on subsequent requests — reducing both latency and cost.

The LLM Gateway supports prompt caching for Claude, OpenAI, Gemini, and Kimi models, with each provider using its own caching mechanism:

| Provider | Caching behavior | Configuration required?               |
| -------- | ---------------- | ------------------------------------- |
| Claude   | Explicit opt-in  | Yes — add `cache_control` to messages |
| OpenAI   | Automatic        | No — caching happens implicitly       |
| Gemini   | Automatic        | No — caching happens implicitly       |
| Kimi     | Automatic        | No — caching happens implicitly       |

<Note>
  Cached input tokens are billed at a discounted rate compared to regular input tokens. The exact discount depends on the model and provider.
</Note>

<Warning>
  Prompt caching only activates when the cacheable portion of your prompt meets a minimum token threshold. Claude's minimum varies by model — see [Minimum cacheable prompt length](#minimum-cacheable-prompt-length) for the per-model limits. OpenAI requires 1,024 tokens. Shorter prompts won't benefit from caching.
</Warning>

## Claude models

Claude models require you to explicitly mark which content blocks to cache using the `cache_control` field. Add `cache_control` with `type` set to `"ephemeral"` on any message you want cached.

<Tabs>
  <Tab title="Python" language="python">
    ```python expandable theme={null}
    import os
    import requests

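    headers = {
        "authorization": os.environ["ASSEMBLYAI_API_KEY"]
    }

    # Placeholder prompt. For caching to activate, the marked content must meet
    # the model's minimum cacheable length (2,048 tokens for claude-sonnet-4-6).
    system_prompt = (
        "You are a customer support agent for Acme Corp. "
        "You have access to our full product catalog, pricing, "
        "and policy documentation. Always be helpful and concise."
    )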

    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers=headers,
        json={
            "model": "claude-sonnet-4-6",
            "messages": [
                {
                    "role": "system",
                    "content": system_prompt,
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "role": "user",
                    "content": "What is your return policy?"
                }
            ],
            "max_tokens": 1000
        }
    )

    result = response.json()
    print(result["choices"][0]["message"]["content"])

    # Check cache usage in the response
    usage = result["usage"]
    cache_details = usage.get("prompt_tokens_details", {})
    print(f"Cached tokens: {cache_details.get('cached_tokens', 0)}")
    ```
  </Tab>

  <Tab title="JavaScript" language="javascript">
    ```javascript expandable theme={null}
    const response = await fetch(
      "https://llm-gateway.assemblyai.com/v1/chat/completions",
      {
        method: "POST",
        headers: {
          authorization: process.env.ASSEMBLYAI_API_KEY,
          "content-type": "application/json",
        },
        body: JSON.stringify({
          model: "claude-sonnet-4-6",
          messages: [
            {
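              role: "system",
              // Placeholder prompt. For caching to activate, the marked content must meet
              // the model's minimum cacheable length (2,048 tokens for claude-sonnet-4-6).
              content: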
                "You are a customer support agent for Acme Corp. " +
                "You have access to our full product catalog, pricing, " +
                "and policy documentation. Always be helpful and concise.",
              cache_control: { type: "ephemeral" },
            },
            {
              role: "user",
              content: "What is your return policy?",
            },
          ],
          max_tokens: 1000,
        }),
      }
    );

    const result = await response.json();
    console.log(result.choices[0].message.content);

    // Check cache usage in the response
    const cacheDetails = result.usage?.prompt_tokens_details;
    console.log(`Cached tokens: ${cacheDetails?.cached_tokens ?? 0}`);
    ```
  </Tab>
</Tabs>

You can also set `cache_control` on tool result messages to cache tool interaction history in multi-turn agentic conversations.

### Minimum cacheable prompt length

Claude only caches prompts that meet a minimum token threshold, and the threshold depends on the [model](https://platform.claude.com/docs/en/build-with-claude/prompt-caching#cache-limitations). If the cacheable portion of your prompt falls below this threshold, the request is processed without caching and no error is returned.

| Model                                            | Minimum cacheable prompt length |
| ------------------------------------------------ | ------------------------------- |
| Claude Opus 4.7 (`claude-opus-4-7`)              | 4,096 tokens                    |
| Claude Opus 4.6 (`claude-opus-4-6`)              | 4,096 tokens                    |
| Claude Opus 4.5 (`claude-opus-4-5-20251101`)     | 4,096 tokens                    |
| Claude Haiku 4.5 (`claude-haiku-4-5-20251001`)   | 4,096 tokens                    |
| Claude Sonnet 4.6 (`claude-sonnet-4-6`)          | 2,048 tokens                    |
| Claude Sonnet 4.5 (`claude-sonnet-4-5-20250929`) | 1,024 tokens                    |

### Cache control with TTL

The LLM Gateway extends Anthropic's native `cache_control` with an optional `ttl` field for specifying cache duration. This is a Gateway-specific parameter — Anthropic's native API does not support it.

```json theme={null}
{
  "cache_control": {
    "type": "ephemeral",
    "ttl": "5m"
  }
}
```

<Note>
  If `ttl` is omitted, Anthropic's default cache duration applies.
</Note>
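
As an illustration, a small helper can attach `cache_control` to a message with or without a `ttl`. This is a sketch under stated assumptions: `with_cache_control` is a hypothetical name, and `"5m"` is the only `ttl` value shown on this page, so other values are not assumed here.

```python
import copy
from typing import Optional


def with_cache_control(message: dict, ttl: Optional[str] = None) -> dict:
    """Return a copy of `message` marked with an ephemeral cache_control."""
    marked = copy.deepcopy(message)
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        # Optional cache duration; omitted entirely when not provided.
        cache_control["ttl"] = ttl
    marked["cache_control"] = cache_control
    return marked


system_message = {
    "role": "system",
    "content": "You are a customer support agent for Acme Corp.",
}

print(with_cache_control(system_message, ttl="5m"))
print(with_cache_control(system_message))
```

Returning a copy keeps the original message untouched, which is convenient when the same message list is reused for models that do not take `cache_control`.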

### Where to place cache\_control

The `cache_control` field can be placed on:

* **System messages** — Cache long system prompts that don't change between requests
* **User and assistant messages** — Cache conversation history in multi-turn flows
* **Tool result messages** — Cache tool call outputs in agentic workflows

<Note>
  For best results with Claude, place `cache_control` on the content that stays the same across requests — typically the system prompt and any static context. Content after the last cache breakpoint is not cached.
</Note>
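
The placement guidance above could be applied to a multi-turn conversation roughly as follows. This is a sketch, not a prescribed pattern: `mark_breakpoints` is a hypothetical helper that marks the system prompt and the last message of the prior history, and leaves the newest user turn unmarked because it changes on every request.

```python
def mark_breakpoints(messages: list[dict]) -> list[dict]:
    """Return copies of `messages` with cache_control on the stable parts."""
    marked = [dict(m) for m in messages]
    for m in marked:
        if m["role"] == "system":
            m["cache_control"] = {"type": "ephemeral"}
    if len(marked) >= 3:
        # The second-to-last message closes the reusable history;
        # the final message is the new user turn and stays unmarked.
        marked[-2]["cache_control"] = {"type": "ephemeral"}
    return marked


conversation = [
    {"role": "system", "content": "You are a customer support agent for Acme Corp."},
    {"role": "user", "content": "What is your return policy?"},
    {"role": "assistant", "content": "You can return items within 30 days."},
    {"role": "user", "content": "Does that include sale items?"},
]

for message in mark_breakpoints(conversation):
    print(message["role"], "cache_control" in message)
```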

## OpenAI models

OpenAI models cache prompts automatically. No configuration is needed — the gateway passes your requests through and caching happens on OpenAI's infrastructure.

<Tabs>
  <Tab title="Python" language="python">
    ```python expandable theme={null}
    import os
    import requests

    headers = {
        "authorization": os.environ["ASSEMBLYAI_API_KEY"]
    }

    # OpenAI models cache automatically — no cache_control needed
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers=headers,
        json={
            "model": "gpt-4.1",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a customer support agent..."
                },
                {
                    "role": "user",
                    "content": "What is your return policy?"
                }
            ],
            "max_tokens": 1000
        }
    )

    result = response.json()

    # Cached tokens still appear in usage
    cache_details = result["usage"].get("prompt_tokens_details", {})
    print(f"Cached tokens: {cache_details.get('cached_tokens', 0)}")
    ```
  </Tab>

  <Tab title="JavaScript" language="javascript">
    ```javascript expandable theme={null}
    // OpenAI models cache automatically — no cache_control needed
    const response = await fetch(
      "https://llm-gateway.assemblyai.com/v1/chat/completions",
      {
        method: "POST",
        headers: {
          authorization: process.env.ASSEMBLYAI_API_KEY,
          "content-type": "application/json",
        },
        body: JSON.stringify({
          model: "gpt-4.1",
          messages: [
            {
              role: "system",
              content: "You are a customer support agent...",
            },
            {
              role: "user",
              content: "What is your return policy?",
            },
          ],
          max_tokens: 1000,
        }),
      }
    );

    const result = await response.json();
    const cacheDetails = result.usage?.prompt_tokens_details;
    console.log(`Cached tokens: ${cacheDetails?.cached_tokens ?? 0}`);
    ```
  </Tab>
</Tabs>

You can optionally configure cache behavior with two additional request-level fields:

| Field                    | Type   | Description                                                                                                                                                                                                                                        |
| ------------------------ | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `prompt_cache_retention` | string | Controls how long cached content is retained on OpenAI's infrastructure. These values are passed through to OpenAI's API — refer to [OpenAI's documentation](https://developers.openai.com/docs/guides/prompt-caching) for current allowed values. |
| `prompt_cache_key`       | string | A custom key to group related requests for caching. Requests with the same key are more likely to share cached content.                                                                                                                            |

## Gemini models

Gemini models also cache automatically — no configuration needed.

<Tabs>
  <Tab title="Python" language="python">
    ```python expandable theme={null}
    import os
    import requests

    headers = {
        "authorization": os.environ["ASSEMBLYAI_API_KEY"]
    }

    # Gemini models cache automatically
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers=headers,
        json={
            "model": "gemini-2.5-flash",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a customer support agent..."
                },
                {
                    "role": "user",
                    "content": "What is your return policy?"
                }
            ],
            "max_tokens": 1000
        }
    )

    result = response.json()

    # Cached tokens appear in usage
    cache_details = result["usage"].get("prompt_tokens_details", {})
    print(f"Cached tokens: {cache_details.get('cached_tokens', 0)}")
    ```
  </Tab>

  <Tab title="JavaScript" language="javascript">
    ```javascript expandable theme={null}
    // Gemini models cache automatically
    const response = await fetch(
      "https://llm-gateway.assemblyai.com/v1/chat/completions",
      {
        method: "POST",
        headers: {
          authorization: process.env.ASSEMBLYAI_API_KEY,
          "content-type": "application/json",
        },
        body: JSON.stringify({
          model: "gemini-2.5-flash",
          messages: [
            {
              role: "system",
              content: "You are a customer support agent...",
            },
            {
              role: "user",
              content: "What is your return policy?",
            },
          ],
          max_tokens: 1000,
        }),
      }
    );

    const result = await response.json();
    const cacheDetails = result.usage?.prompt_tokens_details;
    console.log(`Cached tokens: ${cacheDetails?.cached_tokens ?? 0}`);
    ```
  </Tab>
</Tabs>

## Kimi models

Kimi models also cache automatically — no configuration needed.

<Tabs>
  <Tab title="Python" language="python">
    ```python expandable theme={null}
    import os
    import requests

    headers = {
        "authorization": os.environ["ASSEMBLYAI_API_KEY"]
    }

    # Kimi models cache automatically
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers=headers,
        json={
            "model": "kimi-k2.5",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a customer support agent..."
                },
                {
                    "role": "user",
                    "content": "What is your return policy?"
                }
            ],
            "max_tokens": 1000
        }
    )

    result = response.json()

    # Cached tokens still appear in usage
    cache_details = result["usage"].get("prompt_tokens_details", {})
    print(f"Cached tokens: {cache_details.get('cached_tokens', 0)}")
    ```
  </Tab>

  <Tab title="JavaScript" language="javascript">
    ```javascript expandable theme={null}
    // Kimi models cache automatically
    const response = await fetch(
      "https://llm-gateway.assemblyai.com/v1/chat/completions",
      {
        method: "POST",
        headers: {
          authorization: process.env.ASSEMBLYAI_API_KEY,
          "content-type": "application/json",
        },
        body: JSON.stringify({
          model: "kimi-k2.5",
          messages: [
            {
              role: "system",
              content: "You are a customer support agent...",
            },
            {
              role: "user",
              content: "What is your return policy?",
            },
          ],
          max_tokens: 1000,
        }),
      }
    );

    const result = await response.json();
    const cacheDetails = result.usage?.prompt_tokens_details;
    console.log(`Cached tokens: ${cacheDetails?.cached_tokens ?? 0}`);
    ```
  </Tab>
</Tabs>

You can optionally configure cache behavior with the same request-level fields supported for OpenAI models:

| Field                    | Type   | Description                                                                                                                                                   |
| ------------------------ | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `prompt_cache_retention` | string | Controls how long cached content is retained. Refer to [OpenAI's documentation](https://developers.openai.com/docs/guides/prompt-caching) for allowed values. |
| `prompt_cache_key`       | string | A custom key to group related requests for caching. Requests with the same key are more likely to share cached content.                                       |

## Reading cache metrics from the response

All providers return cache usage data in the `usage.prompt_tokens_details` field of the response:

```json theme={null}
{
  "usage": {
    "input_tokens": 500,
    "output_tokens": 150,
    "total_tokens": 650,
    "prompt_tokens_details": {
      "cached_tokens": 450,
      "cache_creation": {
        "ephemeral_5m_input_tokens": 0,
        "ephemeral_1h_input_tokens": 0
      }
    }
  }
}
```

| Field                                      | Description                                                 |
| ------------------------------------------ | ----------------------------------------------------------- |
| `cached_tokens`                            | Number of input tokens read from cache (cost savings).      |
| `cache_creation.ephemeral_5m_input_tokens` | Tokens written to a 5-minute ephemeral cache (Claude only). |
| `cache_creation.ephemeral_1h_input_tokens` | Tokens written to a 1-hour ephemeral cache (Claude only).   |

When `cached_tokens` is greater than zero, those tokens were served from cache and billed at the discounted cached input rate rather than the standard input rate.
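
To track how much of a prompt was served from cache, the fields above can be turned into a simple ratio. This is a sketch: `cache_hit_ratio` is a hypothetical helper, and it assumes that `input_tokens` counts cached tokens as part of the total, which this page does not state explicitly.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Return the fraction of input tokens that were read from cache."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    total_input = usage.get("input_tokens", 0)
    # Guard against a missing or zero input count.
    return cached / total_input if total_input else 0.0


# The sample usage object from the response shown above.
usage = {
    "input_tokens": 500,
    "output_tokens": 150,
    "total_tokens": 650,
    "prompt_tokens_details": {
        "cached_tokens": 450,
        "cache_creation": {
            "ephemeral_5m_input_tokens": 0,
            "ephemeral_1h_input_tokens": 0,
        },
    },
}

print(f"Cache hit ratio: {cache_hit_ratio(usage):.0%}")  # Cache hit ratio: 90%
```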

## Best practices

* **Cache your system prompt** — System prompts are the best candidates for caching since they stay the same across requests. Place `cache_control` on the system message for Claude, or rely on automatic caching for OpenAI, Gemini, and Kimi.
* **Cache tool definitions** — If you use the same tools across multiple requests, the tool definitions in your prompt are automatically eligible for caching.
* **Order messages for maximum cache hits** — Put static content (system prompt, tool definitions) at the beginning of the message array. Content before the cache breakpoint is more likely to match across requests.
* **Monitor cache metrics** — Check `prompt_tokens_details.cached_tokens` in responses to verify caching is working and estimate your cost savings.
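
The ordering advice can be checked locally by comparing two request bodies. The sketch below assumes that cache reuse depends on a shared leading portion of the message array, which is how the guidance above is phrased; `shared_prefix_length` is a hypothetical helper, not a Gateway feature.

```python
def shared_prefix_length(previous: list[dict], current: list[dict]) -> int:
    """Count the leading messages that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


system = {"role": "system", "content": "You are a customer support agent for Acme Corp."}

# Static content first: the system message is shared between requests.
static_first_a = [system, {"role": "user", "content": "What is your return policy?"}]
static_first_b = [system, {"role": "user", "content": "Do you ship overseas?"}]

# Dynamic content first: the requests diverge at the very first message.
dynamic_first_a = [static_first_a[1], system]
dynamic_first_b = [static_first_b[1], system]

print(shared_prefix_length(static_first_a, static_first_b))    # 1
print(shared_prefix_length(dynamic_first_a, dynamic_first_b))  # 0
```

A shared prefix of zero suggests that nothing at the start of the prompt can be reused, even though both requests contain the same system message.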

## API reference

### Request

#### Top-level request parameters

These fields are set at the top level of the request body:

| Key                      | Type   | Required? | Description                                                                                                                                                                    |
| ------------------------ | ------ | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `cache_control`          | object | No        | Default cache control applied to the entire request (Claude models). When set at the request level, it acts as a default for all messages. Contains `type` and optional `ttl`. |
| `prompt_cache_retention` | string | No        | Controls cache retention duration (OpenAI and Kimi models). For OpenAI models, the value is passed through to OpenAI's API.                                                    |
| `prompt_cache_key`       | string | No        | Custom cache key for grouping requests (OpenAI and Kimi models).                                                                                                               |

#### Message-level cache\_control

The `cache_control` field can also be set on individual messages. Message-level `cache_control` lets you mark specific cache breakpoints — the provider caches all content up to and including the marked message. This is the recommended approach for Claude models.

| Key             | Type   | Required? | Description                                                                                             |
| --------------- | ------ | --------- | ------------------------------------------------------------------------------------------------------- |
| `cache_control` | object | No        | Cache control for this specific message. Marks a cache breakpoint at this position in the conversation. |
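
To make the two placements concrete, the sketch below builds one request body of each kind. Only the payload shapes described in this reference are shown; how the Gateway resolves a request that sets both levels is not covered here, so the example keeps them separate.

```python
import json

messages = [
    {"role": "system", "content": "You are a customer support agent for Acme Corp."},
    {"role": "user", "content": "What is your return policy?"},
]

# Request-level: one default cache_control for the whole request.
request_level = {
    "model": "claude-sonnet-4-6",
    "messages": messages,
    "cache_control": {"type": "ephemeral"},
    "max_tokens": 1000,
}

# Message-level: an explicit breakpoint on the system message only.
message_level = {
    "model": "claude-sonnet-4-6",
    "messages": [
        {**messages[0], "cache_control": {"type": "ephemeral"}},
        messages[1],
    ],
    "max_tokens": 1000,
}

print(json.dumps(request_level, indent=2))
print(json.dumps(message_level, indent=2))
```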

#### Cache control object

The `cache_control` object has the same structure whether used at the request level or message level:

| Key    | Type   | Required? | Description                                                                                   |
| ------ | ------ | --------- | --------------------------------------------------------------------------------------------- |
| `type` | string | Yes       | The cache type. Use `"ephemeral"` for standard caching.                                       |
| `ttl`  | string | No        | Time-to-live for the cached content (Gateway extension — not part of Anthropic's native API). |

### Response

#### Usage fields for cache metrics

| Key                                                                    | Type   | Description                                         |
| ---------------------------------------------------------------------- | ------ | --------------------------------------------------- |
| `usage.prompt_tokens_details.cached_tokens`                            | number | Input tokens served from cache.                     |
| `usage.prompt_tokens_details.cache_creation.ephemeral_5m_input_tokens` | number | Tokens written to the 5-minute cache (Claude only). |
| `usage.prompt_tokens_details.cache_creation.ephemeral_1h_input_tokens` | number | Tokens written to the 1-hour cache (Claude only).   |
