
Use AI Like Seasoning, Not Like Flour

An LLM call is a 500ms-minimum, non-deterministic, token-priced operation that sometimes hallucinates. That's a seasoning, not a base ingredient. The systems that taste right are mostly deterministic; the AI shows up where interpretation is actually required — and only there.

The first version of Obi’s message router asked Claude what to do with every Discord message. Twenty messages a day, one Sonnet call per message, ~1,500 tokens per call, ~700ms latency per call. Roughly:

// First version — every message goes through the model
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function routeMessage(message) {
  const decision = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 16,
    system: "You are a router. Classify the message as one of: " +
            "clear, help, status, model, note, brief, weekly, costs, log, chat. " +
            "Reply with the single word.",
    messages: [{ role: "user", content: message.text }],
  });
  const intent = decision.content[0].text.trim();
  // handlers maps each intent name to its handler function;
  // hallucinated intents fall through to chat
  return (handlers[intent] || handlers.chat)(message);
}

The router worked. It was also the most expensive, slowest, and least deterministic part of the system — and it was solving a problem a switch statement would have solved faster, cheaper, and more honestly. Every /help was a Sonnet call. Every /costs was a Sonnet call. Every typo was a Sonnet call that occasionally hallucinated chat and dropped the user into a conversation they didn’t ask for.

I rewrote it as a slash-command lookup with a conversational fallback:

// Second version — deterministic dispatch, model only for the residual
const HARNESS_COMMANDS = {
  '/clear':   { handler: handleClear },
  '/help':    { handler: handleHelp },
  '/status':  { handler: handleStatus },
  '/note':    { handler: handleNote },
  '/brief':   { handler: handleBrief },
  '/costs':   { handler: handleCosts },
  // ... ~10 entries total
};

function routeMessage(message) {
  const [verb, ...args] = message.text.trim().split(/\s+/);
  // Object.hasOwn avoids false matches on inherited keys like "toString"
  if (Object.hasOwn(HARNESS_COMMANDS, verb)) {
    return HARNESS_COMMANDS[verb].handler(message, args);
  }
  // Mentions and free-form messages flow to the conversational path
  return handleConversation(message);
}

Same accuracy on the messages I’d tested. Three orders of magnitude cheaper. Instant.

That was the moment the principle clicked: deterministic code is the dish. AI is the seasoning that finishes it. Most of what people reach for an LLM to do, a regex would do faster, cheaper, and more honestly. The model belongs at the boundary where interpretation is actually required — not at the dispatch table, not in the routing layer, not in the parts of your system that need to be reproducible.

This isn’t “AI is bad.” Obi calls Claude dozens of times a day on the conversational path that the switch hands it. The principle is about proportions. Most of Obi’s message paths never touch the model now; the few that do are the seasoning. A dish that’s 80% flour and 20% seasoning tastes right. A dish that’s 80% seasoning tastes like a marketing demo.

When You’re Tempted

The pattern shows up everywhere once you’ve named it. Concrete signs you’re about to over-season:

  • You’re using an LLM to route a request. → Use a switch.
  • You’re using an LLM to extract a field from a known-shape input. → Use a regex or a parser (sketched just after this list).
  • You’re using an LLM to decide which tool to call from a small enumerated set. → Use a classifier with confidence tiers.
  • You’re using an LLM to transform structured data into other structured data with a known schema. → Use code.
  • You’re using an LLM because the deterministic version “would be hard to write.” → Write it once. The LLM cost compounds; the regex cost is one-time.

In each of these the LLM is doing work code is better at, and you’re paying the seasoning’s cost for what should be the dish.
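
Here is the promised sketch for the field-extraction case. The input shape, field names, and escalation policy are all hypothetical; the point is the shape of the solution:

// Hypothetical known-shape input: "order ORD-48213 shipped to 94110"
const ORDER_LINE = /^order (ORD-\d+) shipped to (\d{5})$/;

function extractOrderFields(line) {
  const match = ORDER_LINE.exec(line.trim());
  if (!match) return null; // unknown shape; the caller decides whether to escalate
  const [, orderId, zip] = match;
  return { orderId, zip };
}

The regex is exact, testable with fixtures, and free per call. If a line genuinely defies the known shape, that residual is the only place a model call might earn its cost.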

The Test: Does the Question Actually Require Interpretation?

The honest filter for “should this part of my system be deterministic or AI?” is one question:

Does the answer require interpretation that no regex can capture?

Three cases where the answer is yes:

  1. The question genuinely requires judgment. Is this prose hostile? What’s the user’s emotional tone? Does this image contain a person who looks tired? These aren’t pattern-matching questions — they require synthesis a deterministic system can’t fake.
  2. The input space is too varied to enumerate. Free-text user requests, novel error messages from third-party systems, messy data scraped from the open web. When the input is unbounded, an LLM’s flexibility earns its cost.
  3. The output is for a human, not for code. Drafting copy, explaining a result, summarizing a long document into something readable. The output is read, not parsed. Non-determinism is acceptable because the consumer is a human who tolerates phrasing variance.

Everything else — everything — defaults to deterministic. Routing? Switch statement. Classification? Heuristic with confidence tiers. Field extraction from structured sources? Regex. Parsing JSON? JSON.parse. Dispatching to a tool based on what the user asked? Pattern match on the request shape.

The mistake is reaching for the LLM because it can answer the question, when a deterministic layer would answer the same question faster and more reliably.

Two More Examples From Real Builds

The dispatch rewrite above is the cleanest version of this pattern, but it shows up in everything I’ve built. Two more from systems I’ve shipped this year:

Heuristic page classification. An agent crawling docs sites needs to know what platform each page is on. The first instinct is to ask Claude. The right answer is ~30 lines of Python that score strong-vs-weak signals into a confidence tier — microseconds per page, deterministic, free. The classification is wrong sometimes, but the confidence tier tells the agent how much to trust it. The model never gets called for this question. (How the classifier is structured.)
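
The original classifier is Python; here is the same shape sketched in JavaScript to match the rest of this post. The signals, weights, and cutoffs are illustrative, not the real ones:

// Strong-vs-weak signal scoring into a confidence tier (made-up signals)
const STRONG_SIGNALS = [/wp-content/i, /\/wp-json\//i];
const WEAK_SIGNALS = [/powered by wordpress/i, /\/themes\//i];

function classifyPage(html) {
  let score = 0;
  for (const s of STRONG_SIGNALS) if (s.test(html)) score += 3;
  for (const w of WEAK_SIGNALS) if (w.test(html)) score += 1;
  if (score >= 6) return { platform: 'wordpress', confidence: 'high' };
  if (score >= 3) return { platform: 'wordpress', confidence: 'medium' };
  if (score >= 1) return { platform: 'wordpress', confidence: 'low' };
  return { platform: 'unknown', confidence: 'none' };
}

The tier is the output that matters: a low-confidence answer tells the agent to hedge, not to call the model.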

The MCP two-tool pattern. A scan tool produces a result and persists it under an 8-character ID. A lookup tool addresses the existing artifact by ID. The lookup is a single SQLite read — no LLM involvement, no cache logic, no ambiguity. Splitting one tool into two was a deterministic-first move: the cheap operation got to stay cheap because we refused to fold it into the expensive one. (Why two tools beats one.)
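
A sketch of that split, assuming better-sqlite3 for storage; runExpensiveScan is a hypothetical stand-in for whatever the real scan does, and the real tools are MCP handlers, but the shape is the same:

import Database from 'better-sqlite3';
import { randomBytes } from 'node:crypto';

const db = new Database('scans.db');
db.exec('CREATE TABLE IF NOT EXISTS scans (id TEXT PRIMARY KEY, result TEXT)');

// The scan tool does the expensive work and persists it under a short ID
async function scanTool(target) {
  const result = await runExpensiveScan(target); // hypothetical stand-in
  const id = randomBytes(4).toString('hex'); // 8-character hex ID
  db.prepare('INSERT INTO scans (id, result) VALUES (?, ?)').run(id, JSON.stringify(result));
  return { id, result };
}

// The lookup tool is a single SQLite read: no model, no cache logic
function lookupTool(id) {
  const row = db.prepare('SELECT result FROM scans WHERE id = ?').get(id);
  return row ? JSON.parse(row.result) : null;
}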

The shape is the same in all three. The deterministic core does the bulk of the work cheaply and predictably. The LLM only shows up where the input is unbounded or the answer requires judgment — and where it does, the cost is justified because the alternative isn’t a regex, it’s nothing.

Deterministic Core, AI at the Boundary

The shape these examples share is the design principle:

  • The core of the system is deterministic. Routing, dispatch, field extraction, classification, schema validation, retrieval — all the parts that a thousand pages of textbook software engineering taught us how to build cheaply.
  • AI shows up at the boundary where interpretation is required: parsing free-text user input, summarizing prose for a human, judging tone, drafting copy. The boundary is small relative to the core.

This is the inverse of the “wrap an LLM around it” reflex. When the LLM call is the boundary instead of the spine, system failures get easier to localize (“the heuristic returned the wrong tier” beats “the model hallucinated”), costs stay bounded (“the boundary handles ~10% of requests” beats “every request pays tokens”), and the deterministic core gets to be tested the way deterministic code is supposed to be tested — with fixtures and assertions, not with vibes.
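
That last clause is concrete. With dispatch out of the model's hands, the router above gets ordinary unit tests; a minimal sketch using Node's built-in test runner, with a spy swapped in for one handler:

// Assumes HARNESS_COMMANDS and routeMessage are imported from the router module
import test from 'node:test';
import assert from 'node:assert/strict';

test('/status dispatches to its handler, never the model', () => {
  const calls = [];
  HARNESS_COMMANDS['/status'].handler = (message, args) => calls.push(args); // spy
  routeMessage({ text: '/status verbose' });
  assert.deepEqual(calls, [['verbose']]); // same input, same dispatch, every run
});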

The taxes argument from earlier this month is the same shape from a different angle: don’t ask Claude to calculate your taxes — that’s deterministic — but do ask Claude to help prepare them, because that’s interpretation. (The two-layer breakdown.)

The Closing Test

When you reach for an LLM, ask: would a regex, a switch, or a hundred lines of deterministic Python answer this — even imperfectly? If yes, write that first. Layer the LLM on top only if the deterministic version’s accuracy is genuinely insufficient for the use case.

The 2026 default is to skip the dish and serve a bowl of seasoning. The systems that actually taste right are mostly the dish, finished with seasoning. Build accordingly.

FAQ

When does an LLM call actually earn its cost?

Three cases. (1) The question genuinely requires judgment a regex can't fake — judging tone, summarizing intent, deciding if prose is hostile. (2) The input space is too varied to enumerate — free-text user requests, novel error messages, messy data scraped from the open web. (3) The output is for a human, not for code — drafting copy, explaining a result. Everything else defaults to deterministic. Routing, classification, field extraction, dispatch — these are textbook software-engineering problems with cheap, deterministic solutions. Reaching for the model because it *can* answer the question, when a switch statement *would* answer the same question, is the failure mode.

How is this different from saying 'don't use AI'?

It's not a rejection of AI — it's a discipline about proportions. The systems I build call Claude dozens of times a day. The point is that the deterministic core handles most of the work, and the model shows up at the boundary where interpretation is actually required. A dish that's 80% flour and 20% seasoning tastes right. A dish that's 80% seasoning tastes like a marketing demo. The architectures that ship cheaply and stay reliable get the proportions right.

What are the warning signs that I'm over-seasoning?

If you're using an LLM to route a request, you should be using a switch. If you're using one to extract a field from a known-shape input, you should be using a regex or a parser. If you're using one to decide which tool to call from a small enumerated set, you should be using a classifier with confidence tiers. If you're using one to transform structured data into other structured data with a known schema, you should be using code. The general rule: the LLM is doing work code is better at, and you're paying the seasoning's cost for what should be the dish.

What about agent frameworks that route everything through the LLM?

Most of them get the proportions wrong by default. The tutorials show LLM-routed dispatch because it's simpler to demo, not because it's the right architecture for production. When you build an agent yourself, you'll quickly discover that the parts that need to be reproducible — routing, validation, persistence, scoring — are the parts that should never touch an LLM. The model belongs at the boundary where input is genuinely unbounded; the rest of the system is regular code.