The first implementation of our agent’s memory was a simple append-only log. Every message the agent received, every response it generated — saved to a JSON file, loaded into context on every request. The idea was that the agent should remember everything, like a perfect assistant with a perfect notebook.
After two weeks and several hundred messages, the context payload was enormous. The responses got slower. Worse, they got less relevant — the model was spending attention on a conversation from last Tuesday about grocery lists while trying to answer a question about today’s task priorities.
The Problem
Infinite conversation history creates three cascading problems:
Context dilution. LLMs have finite attention. When you pack 500 historical messages into the context window alongside a new question, the model divides its attention across all of it. The signal-to-noise ratio drops with every old message that isn’t relevant to the current conversation.
Cost scaling. Every token in the context window costs money. A conversation history that grows unbounded means API costs grow unbounded. An agent that was cheap to run in week one becomes expensive by month two.
Stale context. A conversation from two weeks ago is almost never relevant today. But the model doesn’t know that — it treats recent and old messages with similar attention weight. Stale context doesn’t just waste tokens; it actively misleads.
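The cost problem is easy to see with back-of-envelope arithmetic. The sketch below models a conversation where every turn replays all prior exchanges; the token count and price are illustrative assumptions, not real figures.

```javascript
// Back-of-envelope: input cost when every turn replays full history.
// Both constants are illustrative assumptions, not real pricing.
const TOKENS_PER_EXCHANGE = 150; // assumed average user+assistant pair
const PRICE_PER_MTOK = 3.0;      // assumed $ per million input tokens

function costOfTurn(priorExchanges) {
  // Turn N re-sends all prior exchanges as context.
  const contextTokens = priorExchanges * TOKENS_PER_EXCHANGE;
  return (contextTokens / 1_000_000) * PRICE_PER_MTOK;
}

function totalCost(turns) {
  let total = 0;
  for (let n = 0; n < turns; n++) total += costOfTurn(n);
  return total;
}
```

The point of the sketch: cumulative cost is quadratic in conversation length, so doubling the number of turns roughly quadruples the total input-token bill. That is the "cheap in week one, expensive by month two" curve.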
Why This Happens
“Save everything” is the instinct when you’re not sure what will be important later. It’s the safe choice — you can always filter later. Except “later” never comes, because the system works well enough in the first few days when history is short. By the time the problems appear, the append-only pattern is baked in.
The deeper issue is treating conversation memory like a database when it should be treated like working memory. You don’t remember every conversation you’ve had this month. You remember the current conversation and a few relevant recent ones. That’s the model that works for agents too.
The Sawtooth Problem
Kenneth Jiang documented what happens when you let a framework handle this for you. His OpenClaw deployment hit 177 million tokens in 48 hours — only 1.7% was real work. The rest was context replay.
The pattern he found is a sawtooth: every message replays the entire conversation history plus all injected skills and system prompts. Context climbs steadily toward the model’s limit, triggers a “defensive compaction” (forced summarization to avoid crashing), drops, and starts climbing again. A two-word reply like “Looks good” dragged 95,000-112,000 cache read tokens along with it.
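The sawtooth is easy to reproduce in a toy model. The constants below (context limit, compaction trigger, injected-prompt size, per-turn growth) are assumptions chosen to illustrate the shape, not measurements from OpenClaw.

```javascript
// Toy model of the sawtooth: replayed context climbs toward the limit
// until a forced compaction fires, then the cycle restarts.
// All constants are illustrative assumptions.
const CONTEXT_LIMIT = 200_000;   // model context window (tokens)
const COMPACTION_TRIGGER = 0.9;  // compact at 90% of the limit
const BASE_PROMPT = 40_000;      // injected skills + system prompt
const PER_TURN_GROWTH = 4_000;   // history added per exchange

function simulate(turns) {
  const sizes = [];
  let history = 0;
  for (let t = 0; t < turns; t++) {
    history += PER_TURN_GROWTH;
    let context = BASE_PROMPT + history;
    if (context >= CONTEXT_LIMIT * COMPACTION_TRIGGER) {
      // "Defensive compaction": summarize, keeping ~20% of history.
      history = Math.floor(history * 0.2);
      context = BASE_PROMPT + history;
    }
    sizes.push(context);
  }
  return sizes;
}
```

Plot the returned sizes and you get the sawtooth: a steady climb, a sharp drop at each compaction, and a floor that never goes below the injected base prompt.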
Three things compound the problem in framework-managed memory:
Workspace injection tax. Frameworks inject every enabled skill file into the system prompt on every turn — 35-40k tokens before the user even says hello. Whether or not a skill is relevant to the current message, its full text rides along.
Tool output accumulation. When the agent calls a tool, the full request and response get appended to history and replayed on every subsequent turn. A single verbose curl response or DOM snapshot compounds across every future message.
Thinking token tax. Extended thinking tokens get logged into session history too. Multi-hour sessions accumulate massive blocks of internal reasoning that push you toward the context ceiling twice as fast.
Prompt caching softens the financial blow — cache reads cost roughly 10-25% of what uncached input tokens cost — but it doesn’t fix the architecture. You’re still sending 150k tokens per turn, still hovering near the context ceiling, still losing nuance every time a compaction fires.
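The caching arithmetic, with assumed prices, makes the limit concrete: the bill shrinks, but the payload the model must attend to does not.

```javascript
// Why caching doesn't fix replay. Prices are illustrative assumptions
// ($ per million tokens), not any provider's real rates.
const FULL_PRICE = 3.0;
const CACHE_READ_PRICE = 0.3;   // assumed ~10% of full price
const REPLAYED_TOKENS = 150_000; // history + injected prompt, replayed
const FRESH_TOKENS = 2_000;      // genuinely new tokens this turn

function turnCost(cached) {
  const replayRate = cached ? CACHE_READ_PRICE : FULL_PRICE;
  return (REPLAYED_TOKENS / 1e6) * replayRate
       + (FRESH_TOKENS / 1e6) * FULL_PRICE;
}
```

Under these numbers caching cuts the per-turn cost roughly ninefold — but the model still reads 150k tokens of replay either way, and the context ceiling is just as close.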
The fix isn’t better compaction. It’s not sending the history in the first place.
The Fix
Replace append-only history with a sliding window. Keep the last N exchanges per user. Let old messages fall off naturally.
The implementation:
import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'fs';
import { join } from 'path';

class Memory {
  constructor(config) {
    this.maxHistory = config.maxConversationHistory || 20;
    this.dir = config.conversationDir;
    mkdirSync(this.dir, { recursive: true }); // ensure the directory exists
  }

  getHistory(userId) {
    const file = join(this.dir, `${userId}.json`);
    if (!existsSync(file)) return [];
    return JSON.parse(readFileSync(file, 'utf-8'));
  }

  addExchange(userId, userMessage, assistantResponse) {
    const history = this.getHistory(userId);
    history.push(
      { role: 'user', content: userMessage, timestamp: Date.now() },
      { role: 'assistant', content: assistantResponse, timestamp: Date.now() }
    );
    // Sliding window — keep only the last N exchanges (2 messages each)
    const trimmed = history.slice(-(this.maxHistory * 2));
    writeFileSync(
      join(this.dir, `${userId}.json`),
      JSON.stringify(trimmed, null, 2)
    );
    return trimmed;
  }
}
Why 20 exchanges:
Twenty exchanges (user + assistant pairs) covers most multi-turn conversations with room to spare. A typical task management flow — “add a task,” “what’s on my list,” “mark that one done,” “what’s next” — is four exchanges. Twenty gives you five of those conversations stacked, which is more continuity than most interactions need.
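The trimming rule itself is worth seeing in isolation: a window of N exchanges keeps the last 2·N messages, and everything older falls off with no bookkeeping. The helper below is a pure-function restatement of the `slice` call in `addExchange`.

```javascript
// The sliding-window rule in isolation: keep the last 2*N messages
// (each exchange is a user message plus an assistant message).
function trimWindow(history, maxExchanges) {
  return history.slice(-(maxExchanges * 2));
}

// 30 exchanges (60 messages) of synthetic history...
const messages = [];
for (let i = 1; i <= 60; i++) {
  messages.push({ role: i % 2 ? 'user' : 'assistant', content: `msg ${i}` });
}

// ...trimmed to a 20-exchange window: the 10 oldest exchanges fall off.
const window = trimWindow(messages, 20);
// window.length === 40, window[0].content === 'msg 21'
```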
Per-user isolation:
Each user gets their own JSON file: data/conversations/{userId}.json. User A’s context never leaks into User B’s responses. The file name is the user’s platform ID (Discord ID, phone number, etc.), so it’s unique by construction.
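One caveat when a platform ID goes straight into a file name: IDs from most platforms are safe by construction, but if any ID can contain slashes or dots, it could escape the conversations directory. A hypothetical hardening helper (not part of the Memory class above) might look like this:

```javascript
// Hypothetical hardening helper: platform IDs become file names, so
// strip anything that could traverse out of the conversations directory.
function sanitizeUserId(userId) {
  // Keep alphanumerics, dashes, underscores, and plus signs (phone
  // numbers); replace everything else so '../../etc' can't escape.
  return String(userId).replace(/[^a-zA-Z0-9+_-]/g, '_');
}
```

Discord IDs and E.164 phone numbers pass through unchanged; anything hostile degrades into a harmless flat file name.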
What about long-term memory?
The sliding window handles conversation context — the “what did we just talk about?” problem. Long-term memory — institutional knowledge, key decisions, recurring preferences — belongs in the agent’s profile files (MEMORY.md, TASKS.md), not in conversation history. These are loaded into the system prompt at startup, always present, and edited intentionally rather than accumulated automatically.
System prompt:
[SOUL.md — personality] ← always present
[MEMORY.md — institutional knowledge] ← always present
[conversation window — last 20 exchanges] ← sliding, per-user
The agent always knows its personality and institutional knowledge. It only “remembers” recent conversations. That’s the right split — stable context in the profile, volatile context in the window.
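Assembling that split into a request payload can be sketched in a few lines. The function and the file contents below are stand-ins, assuming SOUL.md and MEMORY.md are read once at startup and the window comes from `getHistory`:

```javascript
// Sketch of the bounded payload: stable profile text plus the
// per-user window. The strings stand in for SOUL.md / MEMORY.md.
function buildMessages(soul, institutional, window, userMessage) {
  return [
    { role: 'system', content: `${soul}\n\n${institutional}` },
    ...window, // last N exchanges for this user — the only part that varies
    { role: 'user', content: userMessage },
  ];
}

const msgs = buildMessages(
  'You are a concise task assistant.', // SOUL.md (stand-in)
  'The team ships on Fridays.',        // MEMORY.md (stand-in)
  [
    { role: 'user', content: 'hi' },
    { role: 'assistant', content: 'hey' },
  ],
  'What is on my list?'
);
// msgs.length === 4 here, and in general it can never exceed
// 2*N + 2 no matter how old the conversation is.
```

The payload size is bounded by construction: system prompt plus at most 2·N window messages plus the new message, every turn.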
Compare this to the framework approach: SOUL.md + MEMORY.md + 20 exchanges is a predictable, bounded payload on every turn. No sawtooth. No compaction. No turns where a two-word message drags 100k tokens of context replay behind it. The total context size is flat — it never grows beyond what you configured.
Why not a database:
For a personal agent with one to five users, a JSON file per user is the simplest thing that works. No database setup. No connection pooling. No schema migrations. The file is human-readable — you can open it in a text editor and see exactly what the agent remembers. You can delete it to reset a user’s context. You can copy it to debug a conversation.
Move to a database when you have enough users that file I/O becomes a bottleneck. For most personal agents, that’s a problem you’ll never have.
Key Takeaway
Conversation memory is context for the current interaction, not an archive of every interaction. A sliding window of 20 exchanges gives your agent enough continuity to be useful without the cost, latency, and attention problems of infinite history. You get a flat, predictable context size on every turn — no sawtooth, no compaction, no six-figure token bills from a bot that mostly says “done.” Save everything if you want analytics. Feed the agent only what it needs right now.