Every message your agent handles follows the same pattern: load the system prompt, append the conversation, call the API. If your agent has a personality file, a memory document, and skill instructions, that system prompt can easily be 3,000-5,000 tokens. Multiply by hundreds of messages a day and you’re burning money on content that hasn’t changed since the last deploy.
Anthropic’s prompt caching fixes this — but only if you structure your prompts correctly. The key insight: split your system prompt into stable and dynamic blocks, and only cache the stable part.
The Problem
A multi-skill agent has a system prompt that looks something like this:
[Agent personality — who you are, how you talk] ~1,500 tokens
[Long-term memory — key facts, preferences] ~1,000 tokens
[Skill instructions — what to do with this message] ~500 tokens
[Channel context — where this message came from] ~200 tokens
The personality and memory are the same on every call. The skill instructions change depending on what the user asked for. The channel context changes per message.
If you send this as one big string, your only options are to cache the whole thing or nothing — and because caching matches on exact content, any change anywhere in the string (a different skill, a new channel) invalidates the whole block.
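To make the failure mode concrete, here's a minimal sketch (the variable names and placeholder contents are ours) contrasting the two shapes:

```javascript
// Anti-pattern: one concatenated string. The per-message suffix makes the
// whole prompt differ on every call, so nothing is ever served from cache.
const personality = '[~1,500 tokens of personality]';
const memory = '[~1,000 tokens of memory]';
const channelContext = `[channel metadata, changes per message: ${Date.now()}]`;

const monolithic = [personality, memory, channelContext].join('\n\n');

// Structured version: the stable prefix is its own block, byte-identical
// on every call, and only that block is marked for caching.
const structured = [
  {
    type: 'text',
    text: [personality, memory].join('\n\n'),
    cache_control: { type: 'ephemeral' },
  },
  { type: 'text', text: channelContext },
];
```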
The Fix
Send the system prompt as an array of content blocks instead of a single string. Mark the stable block for caching. Leave the dynamic block unmarked.
buildSystemParam(stablePrompt, dynamicPrompt) {
  const parts = [
    {
      type: 'text',
      text: stablePrompt,
      cache_control: { type: 'ephemeral' }
    }
  ];

  if (dynamicPrompt) {
    parts.push({
      type: 'text',
      text: dynamicPrompt
    });
  }

  return parts;
}
Then in your API call:
const response = await this.client.messages.create({
  model: this.model,
  max_tokens: 4096,
  system: this.buildSystemParam(stablePrompt, dynamicPrompt),
  messages: conversationHistory,
});
The system parameter accepts either a string or an array of content blocks. When you pass an array, each block is evaluated for caching independently. The block with cache_control: { type: 'ephemeral' } is cached with a ~5-minute TTL, and the clock resets every time the cache is read — so an active conversation keeps the cache warm. Subsequent calls with byte-identical stable content hit the cache. One caveat: blocks below the model's minimum cacheable length (1,024 tokens for Sonnet-class models) are processed normally without caching.
What Goes Where
Stable block (cached):
- Agent personality file (SOUL.md, identity doc)
- Long-term memory (MEMORY.md, institutional knowledge)
- Base system instructions that don’t change
- Any reference material that’s loaded at startup
Dynamic block (not cached):
- Skill-specific instructions (changes based on which skill handles the message)
- Channel or context metadata (which Discord channel, what time, etc.)
- Current task details
- Anything that varies per request
// Assemble at message handling time
const stablePrompt = `${soulContent}\n\n${memoryContent}`;
const dynamicPrompt = skill
  ? `## Current Skill: ${skill.name}\n${skill.instructions}`
  : null;

const systemParam = claude.buildSystemParam(stablePrompt, dynamicPrompt);
Verifying It Works
The API response includes cache usage in the usage object:
const response = await client.messages.create({ ... });

console.log({
  input_tokens: response.usage.input_tokens,
  cache_read: response.usage.cache_read_input_tokens,
  cache_created: response.usage.cache_creation_input_tokens,
  output_tokens: response.usage.output_tokens,
});
On the first call, you’ll see cache_creation_input_tokens equal to your stable block size — the cache is being populated. On subsequent calls within the TTL, cache_read_input_tokens should be roughly that same number, with those tokens billed at 10% of the base input price. Note that input_tokens counts only tokens that were neither read from nor written to the cache, so the total processed is input_tokens + cache_read + cache_created:
// First call — 1,200 dynamic/conversation tokens plus a 4,000-token cache write
{ input_tokens: 1200, cache_read: 0, cache_created: 4000, output_tokens: 350 }
// Second call (within cache TTL) — the same 4,000 stable tokens come from cache
{ input_tokens: 1200, cache_read: 4000, cache_created: 0, output_tokens: 280 }
That second call just saved 90% on 4,000 tokens of input.
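To put a number on a single call, here's a small helper (our own sketch — the function name is ours, the usage field names match the API, and the multipliers assume the documented cache rates: reads at 10% of the base input price, writes at 125%):

```javascript
// Estimate the input cost of one call in "base-priced token" units,
// given the usage object from the response.
function inputCostUnits(usage) {
  const fresh = usage.input_tokens ?? 0;                          // full price
  const writes = (usage.cache_creation_input_tokens ?? 0) * 1.25; // 25% surcharge
  const reads = (usage.cache_read_input_tokens ?? 0) * 0.10;      // 90% discount
  return fresh + writes + reads;
}

// A call that pays full price on 1,200 tokens and reads 4,000 from cache:
const units = inputCostUnits({ input_tokens: 1200, cache_read_input_tokens: 4000 });
// 1,200 + 400 = 1,600 units, versus 5,200 with no caching at all
```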
The Math
For an agent using Claude Sonnet handling 200 messages/day with a 4,000-token stable prompt:
Without caching:
- 200 calls × 4,000 stable tokens = 800,000 tokens/day at the full input price
With caching:
- Cache reads bill at 10% of the input price; cache writes bill at 125%
- Each call either reads the cache or recreates it, and a write happens only when the previous call was more than ~5 minutes earlier — so the hit rate depends on how bursty your traffic is
- For conversational traffic where, say, 90% of calls land within the TTL: (180 × 4,000 × 0.10) + (20 × 4,000 × 1.25) = 172,000 token-equivalents, roughly a 78% saving on the stable portion
- Evenly spaced calls that always miss the TTL would pay the 25% write surcharge every time and cost more than not caching at all — the win comes from messages clustering
The savings scale linearly with prompt size. If your stable block is 8,000 tokens (detailed personality plus extensive memory), the absolute numbers double.
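The same arithmetic as a sketch — hitRate is the fraction of calls that land within the cache TTL, and the multipliers assume the documented cache rates (reads at 10%, writes at 125% of the base input price):

```javascript
// Daily input cost of the stable block, in base-priced token units.
function dailyCostUnits(calls, stableTokens, hitRate) {
  const reads = calls * hitRate * stableTokens * 0.10;        // cache hits
  const writes = calls * (1 - hitRate) * stableTokens * 1.25; // cache misses
  return reads + writes;
}

const uncached = 200 * 4000;                   // 800,000 units without caching
const cached = dailyCostUnits(200, 4000, 0.9); // ≈ 172,000 units at a 90% hit rate
const savings = 1 - cached / uncached;         // ≈ 78%
```

Plug in your own hit rate: at 50% the saving shrinks to ~32%, and below ~10% caching starts losing money.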
Backwards Compatibility
If you’re retrofitting this into an existing agent, you’ll have callers that pass a single string as the system prompt. Handle both signatures:
async chat(systemPromptOrBlocks, messages) {
  // Support both: string (legacy) and array (structured)
  const system = typeof systemPromptOrBlocks === 'string'
    ? [{ type: 'text', text: systemPromptOrBlocks }]
    : systemPromptOrBlocks;

  return this.client.messages.create({
    model: this.model,
    max_tokens: 4096,
    system,
    messages,
  });
}
Old callers keep working. New callers pass the structured blocks and get caching. No migration needed.
Key Takeaway
Your agent’s identity is the most expensive thing you’re not caching. Split the system prompt into stable and dynamic blocks. Mark the stable block with cache_control: { type: 'ephemeral' }. Send them as an array of content blocks in the system parameter. The API handles the rest — same content hits cache, different content gets recomputed. The structure of your prompt becomes the boundary between what you pay for once and what you pay for every time.