Every message your agent handles follows the same pattern: load the system prompt, append the conversation, call the API. If your agent has a personality file, a memory document, and skill instructions, that system prompt can easily be 3,000-5,000 tokens. Multiply by hundreds of messages a day and you’re burning money on content that hasn’t changed since the last deploy.
Anthropic’s prompt caching fixes this — but only if you structure your prompts correctly. The key insight: split your system prompt into stable and dynamic blocks, and only cache the stable part.
The Problem
A multi-skill agent has a system prompt that looks something like this:
[Agent personality — who you are, how you talk] ~1,500 tokens
[Long-term memory — key facts, preferences] ~1,000 tokens
[Skill instructions — what to do with this message] ~500 tokens
[Channel context — where this message came from] ~200 tokens
The personality and memory are the same on every call. The skill instructions change depending on what the user asked for. The channel context changes per message.
If you send this as one big string, your only options are to cache the whole thing or nothing — and because caching matches on exact content, any change anywhere in the string (a different skill, a new channel) invalidates the whole block.
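To make the failure mode concrete, here's a minimal sketch (the variable names and placeholder contents are ours) contrasting the two shapes:

```javascript
// Anti-pattern: one concatenated string. The per-message suffix makes the
// whole prompt differ on every call, so nothing is ever served from cache.
const personality = '[~1,500 tokens of personality]';
const memory = '[~1,000 tokens of memory]';
const channelContext = `[channel metadata, changes per message: ${Date.now()}]`;

const monolithic = [personality, memory, channelContext].join('\n\n');

// Structured version: the stable prefix is its own block, byte-identical
// on every call, and only that block is marked for caching.
const structured = [
  {
    type: 'text',
    text: [personality, memory].join('\n\n'),
    cache_control: { type: 'ephemeral' },
  },
  { type: 'text', text: channelContext },
];
```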
The Fix
Send the system prompt as an array of content blocks instead of a single string. Mark the stable block for caching. Leave the dynamic block unmarked.
buildSystemParam(stablePrompt, dynamicPrompt) {
  const parts = [
    {
      type: 'text',
      text: stablePrompt,
      cache_control: { type: 'ephemeral' }
    }
  ];

  if (dynamicPrompt) {
    parts.push({
      type: 'text',
      text: dynamicPrompt
    });
  }

  return parts;
}
Then in your API call:
const response = await this.client.messages.create({
  model: this.model,
  max_tokens: 4096,
  system: this.buildSystemParam(stablePrompt, dynamicPrompt),
  messages: conversationHistory,
});
The system parameter accepts either a string or an array of content blocks. When you pass an array, each block is evaluated for caching independently. The block with cache_control: { type: 'ephemeral' } is cached with a ~5-minute TTL, and the clock resets every time the cache is read — so an active conversation keeps the cache warm. Subsequent calls with byte-identical stable content hit the cache. One caveat: blocks below the model's minimum cacheable length (1,024 tokens for Sonnet-class models) are processed normally without caching.
What Goes Where
Stable block (cached):
- Agent personality file (SOUL.md, identity doc)
- Long-term memory (MEMORY.md, institutional knowledge)
- Base system instructions that don’t change
- Any reference material that’s loaded at startup
Dynamic block (not cached):
- Skill-specific instructions (changes based on which skill handles the message)
- Channel or context metadata (which Discord channel, what time, etc.)
- Current task details
- Anything that varies per request
// Assemble at message handling time
const stablePrompt = `${soulContent}\n\n${memoryContent}`;
const dynamicPrompt = skill
  ? `## Current Skill: ${skill.name}\n${skill.instructions}`
  : null;

const systemParam = claude.buildSystemParam(stablePrompt, dynamicPrompt);
Verifying It Works
The API response includes cache usage in the usage object:
const response = await client.messages.create({ ... });

console.log({
  input_tokens: response.usage.input_tokens,
  cache_read: response.usage.cache_read_input_tokens,
  cache_created: response.usage.cache_creation_input_tokens,
  output_tokens: response.usage.output_tokens,
});
On the first call, you’ll see cache_creation_input_tokens equal to your stable block size — the cache is being populated. On subsequent calls within the TTL, cache_read_input_tokens should be roughly that same number, with those tokens billed at 10% of the base input price. Note that input_tokens counts only tokens that were neither read from nor written to the cache, so the total processed is input_tokens + cache_read + cache_created:
// First call — 1,200 dynamic/conversation tokens plus a 4,000-token cache write
{ input_tokens: 1200, cache_read: 0, cache_created: 4000, output_tokens: 350 }
// Second call (within cache TTL) — the same 4,000 stable tokens come from cache
{ input_tokens: 1200, cache_read: 4000, cache_created: 0, output_tokens: 280 }
That second call just saved 90% on 4,000 tokens of input.
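To put a number on a single call, here's a small helper (our own sketch — the function name is ours, the usage field names match the API, and the multipliers assume the documented cache rates: reads at 10% of the base input price, writes at 125%):

```javascript
// Estimate the input cost of one call in "base-priced token" units,
// given the usage object from the response.
function inputCostUnits(usage) {
  const fresh = usage.input_tokens ?? 0;                          // full price
  const writes = (usage.cache_creation_input_tokens ?? 0) * 1.25; // 25% surcharge
  const reads = (usage.cache_read_input_tokens ?? 0) * 0.10;      // 90% discount
  return fresh + writes + reads;
}

// A call that pays full price on 1,200 tokens and reads 4,000 from cache:
const units = inputCostUnits({ input_tokens: 1200, cache_read_input_tokens: 4000 });
// 1,200 + 400 = 1,600 units, versus 5,200 with no caching at all
```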
The Math
For an agent using Claude Sonnet handling 200 messages/day with a 4,000-token stable prompt:
Without caching:
- 200 calls × 4,000 stable tokens = 800,000 tokens/day at the full input price
With caching:
- Cache reads bill at 10% of the input price; cache writes bill at 125%
- Each call either reads the cache or recreates it, and a write happens only when the previous call was more than ~5 minutes earlier — so the hit rate depends on how bursty your traffic is
- For conversational traffic where, say, 90% of calls land within the TTL: (180 × 4,000 × 0.10) + (20 × 4,000 × 1.25) = 172,000 token-equivalents, roughly a 78% saving on the stable portion
- Evenly spaced calls that always miss the TTL would pay the 25% write surcharge every time and cost more than not caching at all — the win comes from messages clustering
The savings scale linearly with prompt size. If your stable block is 8,000 tokens (detailed personality plus extensive memory), the absolute numbers double.
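The same arithmetic as a sketch — hitRate is the fraction of calls that land within the cache TTL, and the multipliers assume the documented cache rates (reads at 10%, writes at 125% of the base input price):

```javascript
// Daily input cost of the stable block, in base-priced token units.
function dailyCostUnits(calls, stableTokens, hitRate) {
  const reads = calls * hitRate * stableTokens * 0.10;        // cache hits
  const writes = calls * (1 - hitRate) * stableTokens * 1.25; // cache misses
  return reads + writes;
}

const uncached = 200 * 4000;                   // 800,000 units without caching
const cached = dailyCostUnits(200, 4000, 0.9); // ≈ 172,000 units at a 90% hit rate
const savings = 1 - cached / uncached;         // ≈ 78%
```

Plug in your own hit rate: at 50% the saving shrinks to ~32%, and below ~10% caching starts losing money.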
Backwards Compatibility
If you’re retrofitting this into an existing agent, you’ll have callers that pass a single string as the system prompt. Handle both signatures:
async chat(systemPromptOrBlocks, messages) {
  // Support both: string (legacy) and array (structured)
  const system = typeof systemPromptOrBlocks === 'string'
    ? [{ type: 'text', text: systemPromptOrBlocks }]
    : systemPromptOrBlocks;

  return this.client.messages.create({
    model: this.model,
    max_tokens: 4096,
    system,
    messages,
  });
}
Old callers keep working. New callers pass the structured blocks and get caching. No migration needed.
Key Takeaway
Your agent’s identity is the most expensive thing you’re not caching. Split the system prompt into stable and dynamic blocks. Mark the stable block with cache_control: { type: 'ephemeral' }. Send them as an array of content blocks in the system parameter. The API handles the rest — same content hits cache, different content gets recomputed. The structure of your prompt becomes the boundary between what you pay for once and what you pay for every time.