# Prompt Caching
Reduce costs and latency with Anthropic prompt caching for stable system prompts and tool definitions.
Supyagent supports Anthropic's prompt caching feature, which can significantly reduce costs and latency for repeated conversations. When enabled, Anthropic caches the stable portions of your prompt (system prompt, tool definitions) so they do not need to be re-processed on every request.
## How It Works
Anthropic's prompt caching works by identifying stable prefix content in your messages. When consecutive API calls share the same prefix (system prompt, tool schemas, summary), the cached portion is processed at a fraction of the cost and with lower latency.
For typical supyagent usage, the cacheable prefix includes:
- The system prompt (often hundreds of tokens, especially with tool-creation instructions and thinking guidelines)
- Tool definitions (can be thousands of tokens with 30+ tools)
- The context summary (relatively stable between turns)
This means the per-turn cost drops significantly after the first call in a session.
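Conceptually, the stable prefix is flagged with Anthropic's `cache_control` breakpoints. The sketch below is illustrative only (supyagent constructs requests internally and this is not its actual code); it shows the shape of a Messages API payload whose system prompt is marked cacheable:

```python
def build_request(system_prompt, tools, user_message):
    """Illustrative Anthropic Messages API payload whose stable prefix
    (system prompt and tool schemas) is marked cacheable."""
    return {
        "model": "claude-sonnet-4-5-20250929",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Cache breakpoint: everything up to and including this
                # block is eligible for prefix caching on later calls.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": tools,  # identical across turns, so part of the stable prefix
        "messages": [{"role": "user", "content": user_message}],
    }

payload = build_request("You are a helpful agent.", [], "Hello")
```

Because the prefix must match byte-for-byte between calls, anything that changes every turn (the latest user message) goes after the breakpoint.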
## Configuration
Prompt caching is enabled by default. Configure it in the model section of your agent YAML:
```yaml
name: myagent
model:
  provider: anthropic/claude-sonnet-4-5-20250929
  temperature: 0.7
  cache: true  # Enable prompt caching (default: true)
```

Configuration Field
| Field | Type | Default | Description |
|---|---|---|---|
| cache | bool | true | Enable prompt caching when supported by the provider |
## The Beta Header Mechanism
When cache: true is set and the model identifier contains anthropic, the LLM client automatically adds the required beta header to API requests:
```
anthropic-beta: prompt-caching-2024-07-31
```

This header activates the caching behavior on Anthropic's API. No other configuration is needed.
The header is only added for Anthropic models. For other providers, the cache setting is safely ignored.
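The decision boils down to a simple check. This is a minimal sketch of the logic described above; `build_headers` is a hypothetical name, not supyagent's actual function:

```python
def build_headers(model: str, cache: bool) -> dict:
    """Add the Anthropic prompt-caching beta header only when caching is
    enabled AND the model identifier is an Anthropic one."""
    headers = {}
    if cache and "anthropic" in model:
        headers["anthropic-beta"] = "prompt-caching-2024-07-31"
    return headers

build_headers("anthropic/claude-sonnet-4-5-20250929", True)
# → {"anthropic-beta": "prompt-caching-2024-07-31"}
build_headers("openai/gpt-4o", True)
# → {} (cache setting safely ignored for non-Anthropic providers)
```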
## Provider Support
| Provider | Prompt Caching | Notes |
|---|---|---|
| Anthropic | Supported | Automatic with cache: true |
| OpenAI | Not applicable | Uses different caching mechanisms internally |
| Google/Gemini | Not applicable | Context caching available separately |
| OpenRouter | Depends | Passed through when routing to Anthropic |
| Ollama | Not applicable | Local models |
| Other LiteLLM providers | Not applicable | Cache header is only sent for Anthropic |
## Cost Savings
Anthropic bills cached prompt tokens at a fraction of the regular input-token rate. The exact savings depend on the model and current pricing, but in general:
- Cached input tokens are charged at a reduced rate compared to regular input tokens
- The system prompt and tool definitions are the most impactful portions to cache since they are identical across all turns
- Context summaries also benefit from caching since they change infrequently (only when summarization triggers)
For an agent with a large system prompt (1000+ tokens) and 30+ tools (10,000+ tokens), caching can reduce per-turn input costs substantially.
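A back-of-envelope calculation makes the scale concrete. The rates below are placeholder examples, not quoted prices; check Anthropic's current pricing page for real numbers:

```python
# Illustrative per-turn input cost with and without prefix caching.
base_rate = 3.00 / 1_000_000    # $ per regular input token (example rate)
cached_rate = 0.30 / 1_000_000  # $ per cached-read token (example: 10% of base)

prefix_tokens = 11_000  # system prompt (~1,000) + 30+ tool schemas (~10,000)
turn_tokens = 500       # fresh conversation tokens sent each turn

uncached = (prefix_tokens + turn_tokens) * base_rate
cached = prefix_tokens * cached_rate + turn_tokens * base_rate

print(f"uncached per turn: ${uncached:.4f}")  # uncached per turn: $0.0345
print(f"cached per turn:   ${cached:.4f}")    # cached per turn:   $0.0048
```

Under these assumed rates the stable prefix dominates the bill, so caching it cuts per-turn input cost by well over 80%.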
## When Caching Helps Most
Prompt caching provides the most benefit when:
- Your system prompt is large (multi-paragraph instructions, tool creation guides)
- You have many tools registered (each tool schema adds hundreds of tokens)
- Conversations are multi-turn (the prefix is re-sent with every message)
- You use context summaries (the summary is part of the stable prefix)
Prompt caching provides less benefit when:
- Conversations are single-turn (execution mode with `supyagent run`)
- The system prompt is very short
- The agent has few or no tools
## Disabling Caching
If you want to disable caching (for example, to get exact token usage numbers for benchmarking), set cache: false:
```yaml
model:
  provider: anthropic/claude-sonnet-4-5-20250929
  cache: false
```

## Related
- Context Management -- How summaries interact with caching
- Configuration -- Model configuration options
- Telemetry -- Track token usage and costs