Taxes look like a lookup table. Given the same inputs, the same tax code produces the same output every time. There’s a correct answer. It’s math.
So don’t have Claude calculate your taxes. That’s what code is for.
But here’s where it gets interesting: preparing your taxes is a completely different problem. Which of your home office expenses qualify as deductions? Is that side project income or hobby income? Does the mileage for the client meeting count when you also stopped for groceries? Those are judgment calls. And judgment calls are exactly what LLMs are good at.
Taxes aren’t a one-layer problem. They’re a two-layer problem. And most AI systems are too.
## The Two Layers
Every real system has a seam running through it. On one side: rules, math, lookups — things with correct answers. On the other side: interpretation, categorization, judgment — things that depend on context.
The reflex right now is to throw Claude at the whole thing. Parse the receipts, categorize the expenses, calculate the totals, fill in the forms. The API can technically do all of it. But “can” and “should” are doing a lot of heavy lifting in that sentence.
The calculation layer — applying rates, summing deductions, computing what you owe — is deterministic. Code does it faster, cheaper, and without rounding errors. That’s Layer 1.
The preparation layer — reading a receipt and deciding whether that dinner was a business expense, interpreting whether your home office qualifies under the simplified method or actual-expense method, flagging that you probably qualify for a credit you didn’t know about — that’s where Claude earns its keep. That’s Layer 2.
## A Real Example: Support Ticket Priority Scoring
The same pattern shows up everywhere. Say you’re building a customer support system. Tickets come in, and each one needs a priority score and a routing recommendation.
The scoring engine is Layer 1. Zero LLM calls. Pure Python:
```python
# Layer 1: Deterministic priority scoring
def score_ticket(ticket):
    score = 0
    if ticket["is_paying_customer"]:
        score += 30
    if ticket["plan_tier"] == "enterprise":
        score += 20
    if ticket["hours_since_submission"] > 24:
        score += 15
    if ticket["category"] == "billing":
        score += 10
    if ticket["has_prior_tickets"] and ticket["prior_unresolved"] > 0:
        score += 10
    return min(score, 100)  # cap at 100
```
No ambiguity. A customer is either on the enterprise plan or they aren’t. The ticket is either older than 24 hours or it isn’t. Binary checks, point values, deterministic output. An LLM would give you the same answer — eventually, probably — but at 1000x the cost without the guarantee.
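To make the determinism claim concrete, here is a quick spot check (the function is restated so the snippet runs on its own; the two sample tickets are made up for illustration):

```python
# Restating score_ticket so this snippet is self-contained.
def score_ticket(ticket):
    score = 0
    if ticket["is_paying_customer"]:
        score += 30
    if ticket["plan_tier"] == "enterprise":
        score += 20
    if ticket["hours_since_submission"] > 24:
        score += 15
    if ticket["category"] == "billing":
        score += 10
    if ticket["has_prior_tickets"] and ticket["prior_unresolved"] > 0:
        score += 10
    return min(score, 100)  # cap at 100

# A stale enterprise billing ticket with unresolved history: 30+20+15+10+10.
urgent = {
    "is_paying_customer": True, "plan_tier": "enterprise",
    "hours_since_submission": 30, "category": "billing",
    "has_prior_tickets": True, "prior_unresolved": 2,
}
# A fresh how-to question from a free-tier user: nothing fires.
fresh = {
    "is_paying_customer": False, "plan_tier": "free",
    "hours_since_submission": 1, "category": "how-to",
    "has_prior_tickets": False, "prior_unresolved": 0,
}

# Same inputs, same outputs, every single run.
assert score_ticket(urgent) == 85
assert score_ticket(fresh) == 0
assert all(score_ticket(urgent) == 85 for _ in range(1000))
```

Run it a thousand times and the answers never drift. That is the whole point of keeping Layer 1 out of the LLM.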
The routing recommendation is Layer 2. After the score is calculated, Claude reads the actual ticket body:
> “This is a billing dispute from an enterprise customer, but the actual issue is a misconfigured SSO integration that’s causing duplicate charges. Route to engineering, not billing — the invoicing fix won’t stick until the SSO loop is resolved.”
That requires reading between the lines, connecting dots across systems, and making a judgment call. That’s what LLMs are for. The priority score itself? That’s what a function is for.
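A minimal sketch of how the two layers might compose, using the Anthropic Python SDK. The `build_routing_prompt` helper, the prompt wording, and the model id are illustrative assumptions, not a prescribed interface — check the current API docs before copying:

```python
# Layer 2 sketch: Claude reads the ticket body and recommends routing.
# Assumes score_ticket() from Layer 1 has already produced `score`.

def build_routing_prompt(ticket: dict, score: int) -> str:
    """Compose the judgment-layer prompt from Layer 1's deterministic output."""
    return (
        f"Priority score (computed deterministically): {score}/100\n"
        f"Stated category: {ticket['category']}\n"
        f"Ticket body:\n{ticket['body']}\n\n"
        "Recommend which team should own this ticket and explain why, "
        "noting any mismatch between the stated category and the real issue."
    )

def recommend_routing(client, ticket: dict, score: int) -> str:
    """One Claude call for the judgment layer.

    `client` is an anthropic.Anthropic() instance; the model id below is
    an assumption -- substitute whatever model fits your latency budget.
    """
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed model id
        max_tokens=500,
        messages=[{"role": "user", "content": build_routing_prompt(ticket, score)}],
    )
    return response.content[0].text
```

Note the direction of data flow: the deterministic score is computed first and handed to Claude as context. The LLM never recomputes it — it only interprets.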
## The Boundary Test
When you’re building a system that uses AI, ask this about every component:
“Could I write a test that asserts the correct output given the input?”
If yes — use code. Write a function. Write a lookup table. Write a state machine. It’ll be faster, cheaper, more reliable, and auditable.
If no — if the “right” answer depends on judgment, interpretation, or context — use the LLM.
For taxes: “What’s 22% of $85,000?” is a test you can write. “Does this $3,200 home renovation qualify as a deductible home office improvement?” is not. One is arithmetic. The other is interpretation. You need both, and you need them handled by different tools.
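The distinction can be written down literally as tests. The arithmetic side admits an exact assertion; the judgment side does not. (The flat-rate function below is a deliberately simplified illustration, not real tax logic — actual brackets are more involved but equally deterministic.)

```python
# Layer 1: an assertable calculation. Flat rate for illustration only.
def tax_at_rate(income: float, rate: float) -> float:
    return round(income * rate, 2)

# This test can exist, and it will pass every time, forever.
assert tax_at_rate(85_000, 0.22) == 18_700.00

# Layer 2: no equivalent assertion is possible.
# assert is_deductible("$3,200 home renovation") == True
#   ^ hypothetical -- the "right" answer depends on facts, context, and
#     interpretation of the tax code. This is where the LLM belongs.
```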
## The Cost Math
This isn’t theoretical:
| Approach | Cost per operation | Latency | Reliability |
|---|---|---|---|
| Python function | ~$0.000001 | <1ms | 100% deterministic |
| Claude Haiku API call | ~$0.001 | 500-2000ms | ~99% (occasional edge cases) |
| Claude Sonnet API call | ~$0.01 | 1000-5000ms | ~99.5% |
For a scoring engine that runs thousands of times a day, the difference between a function call and an API call is the difference between an effectively free operation and a roughly $300/month API bill. For the same output. With worse reliability.
But for the judgment layer — the part that reads a ticket and figures out the real problem, or reads your receipts and flags a deduction you missed — that’s where the API cost is justified. You’re paying for interpretation, not arithmetic.
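The monthly figure follows from simple arithmetic. Assuming 10,000 calls a day and the rough per-operation prices from the table above:

```python
# Back-of-envelope monthly cost, using the rough per-operation prices above.
CALLS_PER_DAY = 10_000  # assumed volume for illustration
DAYS = 30

def monthly_cost(cost_per_call: float) -> float:
    return CALLS_PER_DAY * DAYS * cost_per_call

python_function = monthly_cost(0.000001)  # pennies per month
haiku_api = monthly_cost(0.001)           # ~$300/month for the same answers

print(f"function: ${python_function:.2f}/mo, haiku: ${haiku_api:.2f}/mo")
```

The 1000x price gap between the rows in the table carries straight through to the monthly bill.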
## Key Takeaway
AI is the most powerful tool most of us have ever had access to. The skill is knowing which layer you’re working in. Build the deterministic layer first — the calculations, the validation, the scoring. Then bring in the LLM for the parts that actually need a brain: interpretation, categorization, recommendation, explanation.
Don’t have Claude calculate your taxes. But absolutely have Claude help you prepare them. The line between those two sentences is the line you’re drawing in every AI system you build.
## Resources
- Anthropic API Pricing — Current token costs for Claude models. Do the math before choosing AI over code for high-volume deterministic tasks.