
Claude for Thinking, Ollama for Doing

Running every agent message through Claude is like hiring a senior architect to answer the phone. Some tasks need the big model. Most don't. The trick is knowing which wire to pull.

Our agent was running every message through Claude API. Morning briefing? Claude. “Mark that task done”? Claude. “What time is it in Tokyo”? Claude. The responses were excellent. The bill was not.

More importantly, when Anthropic’s API had a hiccup — and every API has hiccups — the agent went completely silent. No fallback. No degraded mode. Just nothing.

The Problem

Single-model agents have two failure modes: they’re too expensive for simple tasks, and they’re completely down when the API is down. Both problems have the same root cause — every message takes the same path through the same model.

A task confirmation doesn’t need the same model that writes a 500-word content draft. But if both tasks go through the same API call, they cost the same and fail together.

Why This Happens

When you prototype an agent, you wire it to one LLM endpoint. It works. The quality is good. Shipping means shipping, so you ship with one model. Adding a second model feels like premature optimization — until the API bill arrives or the first outage hits.

The real cost isn’t the money. It’s the architectural assumption that the agent only has one brain. Once that assumption is baked in, every message competes for the same resource.

The Fix

Run two models. Route between them based on task complexity.

The setup:

Claude API (remote, primary): Complex reasoning, content generation, email extraction, multi-step planning, tool use. This is the expensive brain that earns its cost.

Ollama (local, fallback): Simple confirmations, quick lookups, status summaries, message classification. Free, fast, always available as long as the machine is on.

The routing logic:

Skills declare which model they need in their YAML frontmatter. The orchestrator reads the model field and routes accordingly. No model declared? Default to Ollama with Claude as the safety net.

Note: model is not part of the Agent Skills spec — it’s a custom field we added to our harness. The spec defines name, description, user-invocable, and tools, but model routing is runtime-specific. We added model because the harness needs to know which brain to use before it starts processing. If you’re building your own skill runner, this is the kind of field you extend the spec with for your own needs.
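As a concrete sketch, a skill's frontmatter with the custom field might look like this (the skill name, description, and tool name are invented for illustration):

```yaml
---
name: draft-article
description: Draft long-form content from an outline
user-invocable: true
tools:
  - file_write
model: claude  # custom harness field, not part of the Agent Skills spec
---
```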

async function chat(systemPrompt, messages, tools, route = 'default') {
  // Skill declared model: claude — go straight to the API
  if (route === 'claude') {
    return await claude.chat(systemPrompt, messages, tools);
  }

  // Everything else tries Ollama first (free, local)
  const ollamaUp = await ollama.isAvailable();
  if (ollamaUp) {
    try {
      return await ollama.chat(systemPrompt, messages);
    } catch (err) {
      logger.warn('Ollama failed, falling back to Claude', { error: err.message });
    }
  }

  // Ollama down or errored — Claude picks it up
  return await claude.chat(systemPrompt, messages, tools);
}
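Bridging the frontmatter to the router, here is a minimal sketch of how the orchestrator might derive the route argument from a parsed skill (resolveRoute and the skill object shape are assumptions, not part of the spec):

```javascript
// Hypothetical helper: map a parsed skill's frontmatter to a chat() route.
// Skills that declare tools are forced onto Claude, since the local model
// has no tool executor wired in.
function resolveRoute(skill) {
  if (!skill) return 'default';                // no skill matched: free path
  if (skill.model === 'claude') return 'claude';
  if (skill.tools && skill.tools.length > 0) return 'claude'; // tool use needs Claude
  return 'default';                            // everything else tries Ollama first
}
```

A skill with no model field and no tools falls through to the free Ollama-first path.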

What goes where:

Skills that declare model: claude:

  • Content drafting (articles, emails, social posts)
  • Email extraction (parsing forwarded emails into structured data)
  • Multi-step task planning
  • Any skill that needs tool use (file operations, web searches)
  • Anything where nuance matters

Everything else hits Ollama by default:

  • General conversation
  • Task confirmations (“Done. Marked ‘update webhook’ as complete.”)
  • Time/date lookups
  • Simple status queries from existing context
  • Message classification (is this urgent or routine?)

Claude is always there as the fallback if Ollama chokes.

The fallback chain:

User message
  → Skill declares model? Use that model directly
  → No preference? Default route:
    → Is Ollama up? Try Ollama first (free, fast, local)
      → Success? Return response
      → Ollama down or errored? Fall back to Claude
        → Success? Return response (user doesn't know)
        → Claude also down? Return error message
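The "Is Ollama up?" check can be a cheap HTTP ping. Ollama's daemon listens on localhost:11434 by default, and GET /api/tags is a lightweight endpoint that lists installed models. This is a sketch with the fetch function injected so it can be stubbed in tests; it is not Ollama's official client library:

```javascript
// Hypothetical availability probe for a local Ollama daemon.
// Any successful response from /api/tags means the daemon is up.
async function isOllamaUp(fetchFn = fetch, baseUrl = 'http://localhost:11434') {
  try {
    const res = await fetchFn(`${baseUrl}/api/tags`, {
      signal: AbortSignal.timeout(500), // a slow probe would stall every message
    });
    return res.ok;
  } catch {
    return false; // connection refused or timed out: treat as down
  }
}
```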

The user sees a response in every case except total system failure. Most messages hit Ollama and never touch the API. When a skill needs the big brain — content drafting, email extraction, anything with tool use — its frontmatter declares model: claude and the router sends it straight there.

Tool use stays with Claude:

Local models through Ollama don’t reliably support tool use. That’s fine — structure your tool executor so it only wires into the Claude path. Ollama handles the tasks that don’t need tools.

claude.setToolExecutor(executeAnyTool);
// Ollama gets no tool executor — it handles text-only tasks

The cost math:

A typical agent handles 50-100 messages per day. Maybe 10-15 of those genuinely need Claude’s reasoning. The rest are confirmations, status checks, and simple replies. With Ollama as the default, those 85+ simple messages cost exactly zero — they never leave the machine. Only the skills that declare model: claude hit the API. That’s 80-90% of your message volume handled for free.
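As a back-of-envelope check on those numbers (the message counts are the estimates above, not measurements):

```javascript
// What share of daily traffic never touches the paid API?
function freeShare(totalMessages, claudeMessages) {
  const local = totalMessages - claudeMessages; // handled by Ollama at zero cost
  return local / totalMessages;
}

freeShare(100, 15); // 0.85: 85% of messages handled for free
freeShare(50, 10);  // 0.8: the low end of the 80-90% range
```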

Key Takeaway

Your agent has two kinds of work: thinking and doing. Ollama handles the bulk — cheap, local, always on. Claude steps in when a skill says it needs the big brain. The orchestrator’s job isn’t picking the right model per message — it’s letting each skill declare what it needs and defaulting to the free option for everything else. You get lower costs, offline resilience, and Claude-quality reasoning exactly where it matters.

FAQ

How do I route agent tasks between Claude API and local Ollama?

Skills that declare model: claude in their frontmatter go straight to the API: content drafting, email extraction, multi-step planning, anything that needs tool use. Everything else defaults to Ollama (running locally on the same machine), with Claude as the fallback when Ollama is down or errors. The routing logic lives in your orchestrator: read each skill's model field, default to Ollama, and fall back to Claude on local failures.

What tasks can a local Ollama model handle for an agent?

Quick confirmations (“Done, task marked complete.”), simple formatting, status summaries from existing data, and basic message classification. Anything where the response pattern is predictable and doesn’t require nuanced reasoning or tool use. The bar is: if a template could almost handle it, Ollama can handle it.

How do I set up Ollama as a fallback for Claude API failures?

Wrap your Claude call in a try/catch. On 5xx errors or timeouts, pass the same prompt to Ollama with a simplified system prompt (local models handle shorter context better). Log the fallback so you know it happened. The user sees a response either way — they don't need to know which model generated it.
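A sketch of that wrapper, with the two model clients passed in as plain async functions (claudeChat, ollamaChat, and the one-line prompt simplification are illustrative assumptions, not the article's actual harness):

```javascript
// Hypothetical reverse fallback: Claude primary, Ollama as the safety net.
async function chatWithFallback(claudeChat, ollamaChat, systemPrompt, messages) {
  try {
    return await claudeChat(systemPrompt, messages);
  } catch (err) {
    // API error or timeout: retry locally. Crude simplification here is
    // keeping only the first line of the system prompt, since small local
    // models cope better with less context.
    const simplified = systemPrompt.split('\n')[0];
    console.warn('Claude failed, falling back to Ollama:', err.message);
    return await ollamaChat(simplified, messages);
  }
}
```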

Is Ollama on Apple Silicon fast enough for real-time agent responses?

For small models like llama3.2’s 3B variant, yes. Response times on Apple Silicon are typically 2-5 seconds for short completions — fast enough for a chat interface. You won’t get Claude-quality reasoning, but for the tasks Ollama should be handling, the quality gap doesn’t matter.