# Prompt Caching
Reduce costs and latency with Anthropic prompt caching for stable system prompts and tool definitions.
Supyagent supports Anthropic's prompt caching feature, which can significantly reduce costs and latency for repeated conversations. When enabled, Anthropic caches the stable portions of your prompt (system prompt, tool definitions) so they do not need to be re-processed on every request.
## How It Works
Anthropic's prompt caching works by identifying stable prefix content in your messages. When consecutive API calls share the same prefix (system prompt, tool schemas, summary), the cached portion is processed at a fraction of the cost and with lower latency.
For typical supyagent usage, the cacheable prefix includes:
- The system prompt (often hundreds of tokens, especially with tool-creation instructions and thinking guidelines)
- Tool definitions (can be thousands of tokens with 30+ tools)
- The context summary (relatively stable between turns)
This means the per-turn cost drops significantly after the first call in a session.
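Conceptually, the stable prefix is flagged with Anthropic's `cache_control` breakpoints. The sketch below is illustrative only (supyagent constructs requests internally and this is not its actual code); it shows the shape of a Messages API payload whose system prompt is marked cacheable:

```python
def build_request(system_prompt, tools, user_message):
    """Illustrative Anthropic Messages API payload whose stable prefix
    (system prompt and tool schemas) is marked cacheable."""
    return {
        "model": "claude-sonnet-4-5-20250929",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Cache breakpoint: everything up to and including this
                # block is eligible for prefix caching on later calls.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": tools,  # identical across turns, so part of the stable prefix
        "messages": [{"role": "user", "content": user_message}],
    }

payload = build_request("You are a helpful agent.", [], "Hello")
```

Because the prefix must match byte-for-byte between calls, anything that changes every turn (the latest user message) goes after the breakpoint.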
## Configuration
Prompt caching is enabled by default. Configure it in the model section of your agent YAML:
```yaml
name: myagent
model:
  provider: anthropic/claude-sonnet-4-5-20250929
  temperature: 0.7
  cache: true  # Enable prompt caching (default: true)
```

Configuration Field
| Field | Type | Default | Description |
|---|---|---|---|
| cache | bool | true | Enable prompt caching when supported by the provider |
## The Beta Header Mechanism
When cache: true is set and the model identifier contains anthropic, the LLM client automatically adds the required beta header to API requests:
```
anthropic-beta: prompt-caching-2024-07-31
```

This header activates the caching behavior on Anthropic's API. No other configuration is needed.
The header is only added for Anthropic models. For other providers, the cache setting is safely ignored.
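The decision boils down to a simple check. This is a minimal sketch of the logic described above; `build_headers` is a hypothetical name, not supyagent's actual function:

```python
def build_headers(model: str, cache: bool) -> dict:
    """Add the Anthropic prompt-caching beta header only when caching is
    enabled AND the model identifier is an Anthropic one."""
    headers = {}
    if cache and "anthropic" in model:
        headers["anthropic-beta"] = "prompt-caching-2024-07-31"
    return headers

build_headers("anthropic/claude-sonnet-4-5-20250929", True)
# → {"anthropic-beta": "prompt-caching-2024-07-31"}
build_headers("openai/gpt-4o", True)
# → {} (cache setting safely ignored for non-Anthropic providers)
```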
## Provider Support
| Provider | Prompt Caching | Notes |
|---|---|---|
| Anthropic | Supported | Automatic with cache: true |
| OpenAI | Not applicable | Uses different caching mechanisms internally |
| Google/Gemini | Not applicable | Context caching available separately |
| OpenRouter | Depends | Passed through when routing to Anthropic |
| Ollama | Not applicable | Local models |
| Other LiteLLM providers | Not applicable | Cache header is only sent for Anthropic |
## Cost Savings
Anthropic bills cached prompt tokens at a fraction of the regular input-token rate. The exact savings depend on the model and current pricing, but in general:
- Cached input tokens are charged at a reduced rate compared to regular input tokens
- The system prompt and tool definitions are the most impactful portions to cache since they are identical across all turns
- Context summaries also benefit from caching since they change infrequently (only when summarization triggers)
For an agent with a large system prompt (1000+ tokens) and 30+ tools (10,000+ tokens), caching can reduce per-turn input costs substantially.
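A back-of-envelope calculation makes the scale concrete. The rates below are placeholder examples, not quoted prices; check Anthropic's current pricing page for real numbers:

```python
# Illustrative per-turn input cost with and without prefix caching.
base_rate = 3.00 / 1_000_000    # $ per regular input token (example rate)
cached_rate = 0.30 / 1_000_000  # $ per cached-read token (example: 10% of base)

prefix_tokens = 11_000  # system prompt (~1,000) + 30+ tool schemas (~10,000)
turn_tokens = 500       # fresh conversation tokens sent each turn

uncached = (prefix_tokens + turn_tokens) * base_rate
cached = prefix_tokens * cached_rate + turn_tokens * base_rate

print(f"uncached per turn: ${uncached:.4f}")  # uncached per turn: $0.0345
print(f"cached per turn:   ${cached:.4f}")    # cached per turn:   $0.0048
```

Under these assumed rates the stable prefix dominates the bill, so caching it cuts per-turn input cost by well over 80%.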
## When Caching Helps Most
Prompt caching provides the most benefit when:
- Your system prompt is large (multi-paragraph instructions, tool creation guides)
- You have many tools registered (each tool schema adds hundreds of tokens)
- Conversations are multi-turn (the prefix is re-sent with every message)
- You use context summaries (the summary is part of the stable prefix)
Prompt caching provides less benefit when:
- Conversations are single-turn (execution mode with `supyagent run`)
- The system prompt is very short
- The agent has few or no tools
## Disabling Caching
If you want to disable caching (for example, to get exact token usage numbers for benchmarking), set cache: false:
```yaml
model:
  provider: anthropic/claude-sonnet-4-5-20250929
  cache: false
```

## Related
- Context Management -- How summaries interact with caching
- Configuration -- Model configuration options
- Telemetry -- Track token usage and costs