
Don't Have Claude Do Your Taxes

The hard part of building AI systems isn't making the AI work. It's finding the seam between what code should handle and what needs a brain.

Taxes look like a lookup table. Given the same inputs, the same tax code produces the same output every time. There’s a correct answer. It’s math.

So don’t have Claude calculate your taxes. That’s what code is for.

But here’s where it gets interesting: preparing your taxes is a completely different problem. Which of your home office expenses qualify as deductions? Is that side project income or hobby income? Does the mileage for the client meeting count when you also stopped for groceries? Those are judgment calls. And judgment calls are exactly what LLMs are good at.

Taxes aren’t a one-layer problem. They’re a two-layer problem. And most AI systems are too.

The Two Layers

Every real system has a seam running through it. On one side: rules, math, lookups — things with correct answers. On the other side: interpretation, categorization, judgment — things that depend on context.

The reflex right now is to throw Claude at the whole thing. Parse the receipts, categorize the expenses, calculate the totals, fill in the forms. The API can technically do all of it. But “can” and “should” are doing a lot of heavy lifting in that sentence.

The calculation layer — applying rates, summing deductions, computing what you owe — is deterministic. Code does it faster, cheaper, and without rounding errors. That’s Layer 1.
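As a sketch of what Layer 1 looks like in the tax case — note that the bracket thresholds and rates below are made-up placeholders, not real tax tables:

```python
# Layer 1 sketch: a progressive tax calculation as pure code.
# NOTE: bracket thresholds and rates are illustrative placeholders,
# not real tax tables -- swap in the current ones for actual use.
BRACKETS = [
    (0, 0.10),       # first dollars taxed at 10%
    (10_000, 0.15),  # dollars above 10k taxed at 15%
    (40_000, 0.25),  # dollars above 40k taxed at 25%
]

def tax_owed(taxable_income):
    owed = 0.0
    for i, (lower, rate) in enumerate(BRACKETS):
        # Each bracket taxes only the slice of income inside it.
        upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else float("inf")
        if taxable_income > lower:
            owed += (min(taxable_income, upper) - lower) * rate
    return round(owed, 2)
```

Same input, same output, every time — and a unit test can pin each bracket boundary.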

The preparation layer — reading a receipt and deciding whether that dinner was a business expense, interpreting whether your home office qualifies under the simplified method or actual-expense method, flagging that you probably qualify for a credit you didn’t know about — that’s where Claude earns its keep. That’s Layer 2.

A Real Example: Support Ticket Priority Scoring

The same pattern shows up everywhere. Say you’re building a customer support system. Tickets come in, and each one needs a priority score and a routing recommendation.

The scoring engine is Layer 1. Zero LLM calls. Pure Python:

```python
# Layer 1: Deterministic priority scoring
def score_ticket(ticket):
    score = 0
    if ticket["is_paying_customer"]:
        score += 30
    if ticket["plan_tier"] == "enterprise":
        score += 20
    if ticket["hours_since_submission"] > 24:
        score += 15
    if ticket["category"] == "billing":
        score += 10
    if ticket["has_prior_tickets"] and ticket["prior_unresolved"] > 0:
        score += 10
    return min(score, 100)  # cap at 100
```

No ambiguity. A customer is either on the enterprise plan or they aren’t. The ticket is either older than 24 hours or it isn’t. Binary checks, point values, deterministic output. An LLM would give you the same answer — eventually, probably — but at 1000x the cost without the guarantee.
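Concretely — repeating the rubric so the snippet runs on its own — an overdue billing ticket from an enterprise customer with an unresolved prior ticket adds up point by point:

```python
# The scoring rubric from above, repeated so this snippet runs standalone.
def score_ticket(ticket):
    score = 0
    if ticket["is_paying_customer"]:
        score += 30
    if ticket["plan_tier"] == "enterprise":
        score += 20
    if ticket["hours_since_submission"] > 24:
        score += 15
    if ticket["category"] == "billing":
        score += 10
    if ticket["has_prior_tickets"] and ticket["prior_unresolved"] > 0:
        score += 10
    return min(score, 100)

ticket = {
    "is_paying_customer": True,
    "plan_tier": "enterprise",
    "hours_since_submission": 30,
    "category": "billing",
    "has_prior_tickets": True,
    "prior_unresolved": 1,
}
print(score_ticket(ticket))  # 30 + 20 + 15 + 10 + 10 = 85
```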

The routing recommendation is Layer 2. After the score is calculated, Claude reads the actual ticket body:

“This is a billing dispute from an enterprise customer, but the actual issue is a misconfigured SSO integration that’s causing duplicate charges. Route to engineering, not billing — the invoicing fix won’t stick until the SSO loop is resolved.”

That requires reading between the lines, connecting dots across systems, and making a judgment call. That’s what LLMs are for. The priority score itself? That’s what a function is for.
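One way to wire the two layers together — a sketch, not a recipe: the threshold, field names, and prompt wording are assumptions, and the actual API call is elided to keep the snippet dependency-free:

```python
# Sketch: Layer 1's score decides *whether* to spend an LLM call;
# Layer 2 then gets the judgment question plus Layer 1's facts.
# The threshold and prompt wording are illustrative assumptions.
ESCALATION_THRESHOLD = 60

def needs_llm_routing(score, threshold=ESCALATION_THRESHOLD):
    # Cheap deterministic gate: low-priority tickets skip the API entirely.
    return score >= threshold

def build_routing_prompt(ticket, score):
    # Hand the model only the judgment call, with computed facts attached.
    return (
        f"Priority score (computed deterministically): {score}\n"
        f"Customer-selected category: {ticket['category']}\n\n"
        f"Ticket body:\n{ticket['body']}\n\n"
        "Recommend a routing target and explain why. Look past the stated "
        "category if the body suggests a different root cause."
    )

# A real system would send build_routing_prompt(...) to the Claude API
# here (e.g. via the anthropic SDK's messages.create) -- omitted so this
# sketch stays runnable without credentials.
```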

The Boundary Test

When you’re building a system that uses AI, ask this about every component:

“Could I write a test that asserts the correct output given the input?”

If yes — use code. Write a function. Write a lookup table. Write a state machine. It’ll be faster, cheaper, more reliable, and auditable.

If no — if the “right” answer depends on judgment, interpretation, or context — use the LLM.

For taxes: “What’s 22% of $85,000?” is a test you can write. “Does this $3,200 home renovation qualify as a deductible home office improvement?” is not. One is arithmetic. The other is interpretation. You need both, and you need them handled by different tools.
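The first question can literally be written as a test (`tax_at_rate` is a hypothetical helper for illustration):

```python
# The boundary test, taken literally: the arithmetic side gets a unit test.
def tax_at_rate(amount, rate):
    return round(amount * rate, 2)

def test_flat_rate():
    # "What's 22% of $85,000?" has exactly one right answer.
    assert tax_at_rate(85_000, 0.22) == 18_700.00

# There is no equivalent test for "does this renovation qualify as a
# deductible home office improvement?" -- the expected value depends on
# interpretation, which is the signal to hand it to the LLM layer.
```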

The Cost Math

This isn’t theoretical:

| Approach | Cost per operation | Latency | Reliability |
| --- | --- | --- | --- |
| Python function | ~$0.000001 | <1 ms | 100% deterministic |
| Claude Haiku API call | ~$0.001 | 500–2000 ms | ~99% (occasional edge cases) |
| Claude Sonnet API call | ~$0.01 | 1000–5000 ms | ~99.5% |

For a scoring engine that runs thousands of times a day, the difference between a function call and an API call is the difference between pennies a month and hundreds of dollars a month. For the same output. With worse reliability.
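The monthly bills fall straight out of the table's per-operation costs (volume assumed at 10,000 operations a day for illustration):

```python
# Back-of-envelope monthly cost at the table's approximate per-op prices.
OPS_PER_DAY = 10_000  # assumed volume for illustration
DAYS = 30

def monthly_cost(cost_per_op):
    return cost_per_op * OPS_PER_DAY * DAYS

function_bill = monthly_cost(0.000001)  # ~ $0.30/month
haiku_bill = monthly_cost(0.001)        # ~ $300/month
print(f"function: ${function_bill:.2f}, haiku: ${haiku_bill:.2f}")
```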

But for the judgment layer — the part that reads a ticket and figures out the real problem, or reads your receipts and flags a deduction you missed — that’s where the API cost is justified. You’re paying for interpretation, not arithmetic.

Key Takeaway

AI is the most powerful tool most of us have ever had access to. The skill is knowing which layer you’re working in. Build the deterministic layer first — the calculations, the validation, the scoring. Then bring in the LLM for the parts that actually need a brain: interpretation, categorization, recommendation, explanation.

Don’t have Claude calculate your taxes. But absolutely have Claude help you prepare them. The line between those two sentences is the line you’re drawing in every AI system you build.

Resources

  • Anthropic API Pricing — Current token costs for Claude models. Do the math before choosing AI over code for high-volume deterministic tasks.

FAQ

When should I use an LLM instead of regular code?

Use an LLM when the task involves ambiguity, judgment, natural language understanding, or creative synthesis — things like categorizing expenses, interpreting whether something qualifies as a deduction, or summarizing complex documents. Use regular code when the task has fixed rules, known inputs, and a deterministic correct answer — applying the tax rate, summing line items, validating form fields.

Why not just use Claude for everything since it can do math and follow rules too?

Three reasons: cost (a tax calculation via Claude API costs 1000x more than a function call), reliability (LLMs occasionally get arithmetic wrong or hallucinate edge cases in rule systems), and auditability (you can't step through an LLM's reasoning the way you can step through code). For anything where the answer must be exactly right every time, code wins.

How do I decide where the boundary is in a system that needs both?

Build the deterministic layer first — calculations, validation rules, data transforms, scoring rubrics. Then add the LLM layer on top for the parts that genuinely need judgment: interpreting ambiguous inputs, categorizing messy data, generating natural language explanations, making recommendations. The deterministic layer gives you the facts. The LLM layer gives you the interpretation.