Our agent was running every message through the Claude API. Morning briefing? Claude. “Mark that task done”? Claude. “What time is it in Tokyo”? Claude. The responses were excellent. The bill was not.
More importantly, when Anthropic’s API had a hiccup — and every API has hiccups — the agent went completely silent. No fallback. No degraded mode. Just nothing.
The Problem
Single-model agents have two failure modes: they’re too expensive for simple tasks, and they’re completely down when the API is down. Both problems have the same root cause — every message takes the same path through the same model.
A task confirmation doesn’t need the same model that writes a 500-word content draft. But if both tasks go through the same API call, they cost the same and fail together.
Why This Happens
When you prototype an agent, you wire it to one LLM endpoint. It works. The quality is good. Shipping means shipping, so you ship with one model. Adding a second model feels like premature optimization — until the API bill arrives or the first outage hits.
The real cost isn’t the money. It’s the architectural assumption that the agent only has one brain. Once that assumption is baked in, every message competes for the same resource.
The Fix
Run two models. Route between them based on task complexity.
The setup:
Claude API (remote, primary): Complex reasoning, content generation, email extraction, multi-step planning, tool use. This is the expensive brain that earns its cost.
Ollama (local, fallback): Simple confirmations, quick lookups, status summaries, message classification. Free, fast, always available as long as the machine is on.
The routing logic:
Skills declare which model they need in their YAML frontmatter. The orchestrator reads the model field and routes accordingly. No model declared? Default to Ollama with Claude as the safety net.
Note: model is not part of the Agent Skills spec — it’s a custom field we added to our harness. The spec defines name, description, user-invocable, and tools, but model routing is runtime-specific. We added model because the harness needs to know which brain to use before it starts processing. If you’re building your own skill runner, this is the kind of field you extend the spec with for your own needs.
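A skill's frontmatter with the custom field might look like this (the skill name and description are illustrative, not from a real skill):

```yaml
---
name: content-draft
description: Draft long-form content from a brief
user-invocable: true
model: claude   # custom field, not part of the Agent Skills spec
---
```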
async function chat(systemPrompt, messages, tools, route = 'default') {
  // Skill declared model: claude — go straight to the API
  if (route === 'claude') {
    return await claude.chat(systemPrompt, messages, tools);
  }

  // Everything else tries Ollama first (free, local)
  const ollamaUp = await ollama.isAvailable();
  if (ollamaUp) {
    try {
      return await ollama.chat(systemPrompt, messages);
    } catch (err) {
      logger.warn('Ollama failed, falling back to Claude', { error: err.message });
    }
  }

  // Ollama down or errored — Claude picks it up
  return await claude.chat(systemPrompt, messages, tools);
}
What goes where:
Skills that declare model: claude:
- Content drafting (articles, emails, social posts)
- Email extraction (parsing forwarded emails into structured data)
- Multi-step task planning
- Any skill that needs tool use (file operations, web searches)
- Anything where nuance matters
Everything else hits Ollama by default:
- General conversation
- Task confirmations (“Done. Marked ‘update webhook’ as complete.”)
- Time/date lookups
- Simple status queries from existing context
- Message classification (is this urgent or routine?)
Claude is always there as the fallback if Ollama chokes.
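The frontmatter routing described above can be sketched as follows. The helper and skill names here are ours, not from the spec, and the naive regex stands in for a real YAML parser:

```javascript
// Read a skill's frontmatter, pull the custom `model` field, and pick a route.
function routeForSkill(skillSource) {
  // Naive frontmatter scan; a real harness would use a proper YAML parser.
  const match = skillSource.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return 'default';
  const modelLine = match[1].split('\n').find((l) => l.startsWith('model:'));
  return modelLine ? modelLine.split(':')[1].trim() : 'default';
}

const draftSkill = `---\nname: content-draft\nmodel: claude\n---\nDraft long-form content.`;
const ackSkill = `---\nname: task-ack\n---\nConfirm completed tasks.`;

console.log(routeForSkill(draftSkill)); // 'claude' — goes straight to the API
console.log(routeForSkill(ackSkill));   // 'default' — Ollama first, Claude fallback
```

A skill with no `model` field falls through to `'default'`, which is exactly the Ollama-first path in `chat()`.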
The fallback chain:
User message
  → Skill declares model? Use that model directly
  → No preference? Default route:
    → Is Ollama up? Try Ollama first (free, fast, local)
      → Success? Return response
    → Ollama down or errored? Fall back to Claude
      → Success? Return response (user doesn't know)
      → Claude also down? Return error message
The user sees a response in every case except total system failure. Most messages hit Ollama and never touch the API. When a skill needs the big brain — content drafting, email extraction, anything with tool use — its frontmatter declares model: claude and the router sends it straight there.
Tool use stays with Claude:
Local models through Ollama don’t reliably support tool use. That’s fine — structure your tool executor so it only wires into the Claude path. Ollama handles the tasks that don’t need tools.
claude.setToolExecutor(executeAnyTool);
// Ollama gets no tool executor — it handles text-only tasks
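The `ollama.isAvailable()` check used by the router can be a simple liveness probe. This is a minimal sketch assuming Ollama's default local endpoint; `fetchFn` is injectable so the probe can be tested without a running server:

```javascript
// Probe the local Ollama server. Anything other than a fast, OK response
// counts as "down" so the router falls back to Claude immediately.
async function isAvailable(fetchFn = fetch, timeoutMs = 500) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchFn('http://localhost:11434/api/tags', {
      signal: controller.signal,
    });
    return res.ok;
  } catch {
    return false; // connection refused, timeout, abort — all mean "down"
  } finally {
    clearTimeout(timer);
  }
}
```

Keep the timeout short: if the probe hangs, the user waits that long before the Claude fallback kicks in.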
The cost math:
A typical agent handles 50-100 messages per day. Maybe 10-15 of those genuinely need Claude’s reasoning. The rest are confirmations, status checks, and simple replies. With Ollama as the default, those 85+ simple messages cost exactly zero — they never leave the machine. Only the skills that declare model: claude hit the API. That’s 80-90% of your message volume handled for free.
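The back-of-envelope split, using the volumes above (dollar cost per Claude call is deliberately left out — plug in your own figure):

```javascript
// 100 messages/day, of which 15 genuinely need Claude's reasoning.
const total = 100;
const claudeCalls = 15;
const localCalls = total - claudeCalls; // 85 handled by Ollama, at zero API cost
const freeShare = localCalls / total;   // fraction of volume that never leaves the machine

console.log(`${localCalls} local (free), ${claudeCalls} via API, ${(freeShare * 100).toFixed(0)}% free`);
// prints "85 local (free), 15 via API, 85% free"
```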
Key Takeaway
Your agent has two kinds of work: thinking and doing. Ollama handles the bulk — cheap, local, always on. Claude steps in when a skill says it needs the big brain. The orchestrator’s job isn’t picking the right model per message — it’s letting each skill declare what it needs and defaulting to the free option for everything else. You get lower costs, offline resilience, and Claude-quality reasoning exactly where it matters.