An agent crawling a docs site needs to know what platform the docs are built on. Mintlify uses a different sidebar pattern than Docusaurus. VitePress emits different asset paths than GitBook. ReadMe lives at *.readme.io and renders API references differently than anyone else. If your agent is going to extract structured content, follow links, or apply platform-specific scrapers, it has to classify the page first.
The first instinct is to ask Claude. Pass the HTML, ask “what platform is this?”, get back a string. That works. It also costs ~500ms of latency, ~2,000 tokens per page, and produces non-deterministic output that varies between runs. For a million-page crawl, the bill is real. For an agent that needs to make this decision dozens of times per session, the latency stacks.
The second instinct is to write `if "mintlify" in html: return "mintlify"`. That works too — until the first Mintlify-themed Tailwind site that mentions Mintlify in its blog post fires a false positive, or the first heavily-customized Mintlify install with renamed CSS classes fires a false negative. Substring matching is the wrong tool because it has no notion of confidence.
The pattern that works for both: score-based heuristic classification with strong and weak signals, composed into a confidence tier. ~30 lines of Python per platform, runs in microseconds, returns a (name, confidence) tuple your agent can act on. Here’s the model and the footguns.
The Signal Model
For each platform you want to classify, list the signals that suggest the page is built on it. Each signal is one of two kinds:
- Strong signal — uniquely identifies the platform. If the signal fires, the platform is almost certainly this one. Examples:
  - `<meta name="generator" content="VitePress">` (only VitePress sets this)
  - `window.__DOCUSAURUS__` global in an inline `<script>` (only Docusaurus emits this)
  - `*.readme.io` subdomain (only ReadMe-hosted sites use this)
- Weak signal — consistent with the platform but not exclusive. Multiple platforms could plausibly emit it. Examples:
  - `<div class="sidebar">` (every docs platform has a sidebar)
  - Tailwind-flavored class patterns (used by many docs themes, not just Mintlify)
  - Vue runtime alongside Vite-shaped asset paths (could be VitePress, Nuxt, or hand-rolled Vite)
The strong/weak split is the discipline that does the work. If you interpret “uniquely identifies” loosely, you’ll mark something as strong when it isn’t, and your classifier will start asserting Mintlify on a Mintlify-themed Tailwind site. A signal is strong only if no other platform can plausibly emit it. When in doubt, classify as weak.
Substring matching is fine when the substring is unique. Earlier I called substring matching “the wrong tool” — that was about substring-matching a generic platform name (if "mintlify" in html) which fires on any site that happens to mention Mintlify. Substring-matching a known-unique marker (if "mintlify-app-shell" in html) is exactly right. The technique isn’t the problem; the discipline of choosing what to match against is.
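To make the distinction concrete, here are both checks side by side (a sketch; `bad_check` and `good_check` are hypothetical names, and `mintlify-app-shell` is the unique marker from the example above):

```python
def bad_check(html: str) -> bool:
    # Matches the platform *name*, so it fires on any page that merely
    # mentions Mintlify: a blog post, a changelog, a comparison table.
    return "mintlify" in html.lower()

def good_check(html: str) -> bool:
    # Matches a marker the platform itself emits -- something ordinary
    # prose about the platform never contains.
    return "mintlify-app-shell" in html
```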
Composition Rules
Each platform-detector function inspects its signal list against the page and returns a confidence level using these rules:
```python
def compose_confidence(strong_count: int, weak_count: int) -> str | None:
    if strong_count >= 1:
        return "high"    # one unique signal is enough
    if weak_count >= 3:
        return "high"    # corroboration at scale
    if weak_count == 2:
        return "medium"  # some corroboration, no unique ID
    if weak_count == 1:
        return "low"     # one ambiguous signal
    return None          # no match
```
Rules are mutually exclusive — a strong signal always escalates to high, never to medium. The “strong = unique” definition does the work; the composition rule is mechanical once the per-signal classification is honest.
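A few spot checks make the tiers concrete:

```python
compose_confidence(strong_count=1, weak_count=2)  # "high": one unique marker decides
compose_confidence(strong_count=0, weak_count=3)  # "high": corroboration at scale
compose_confidence(strong_count=0, weak_count=2)  # "medium"
compose_confidence(strong_count=0, weak_count=1)  # "low"
compose_confidence(strong_count=0, weak_count=0)  # None
```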
A typical detector function:
```python
def detect_mintlify(html: str, url: str, meta: dict) -> str | None:
    strong = 0
    weak = 0
    # Strong signals — only Mintlify emits these
    if meta.get("generator", "").lower().startswith("mintlify"):
        strong += 1
    if "mintlify-app-shell" in html:
        strong += 1
    if "/_mintlify/" in html:  # asset path prefix unique to Mintlify
        strong += 1
    # Weak signals — Mintlify-leaning, but other platforms emit them too
    if 'class="prose' in html:  # Tailwind Typography; Mintlify default, broadly used
        weak += 1
    if "tabler-icon" in html:  # Mintlify's default icon set, but Tabler is widely used
        weak += 1
    return compose_confidence(strong, weak)
```
Note the line between strong and weak: every “strong” entry is a marker no other platform sets. Every “weak” entry is something Mintlify sites typically emit but other sites could too. If you can’t articulate that distinction for a candidate signal, it goes in the weak bucket.
Three to five signals per platform is the working range. Below three and you’ll under-detect on customized installations; above five and you’re spending effort on diminishing returns.
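For contrast, here is the same shape for VitePress. The generator meta is the strong signal listed earlier; the two weak signals are illustrative assumptions about what a default-theme VitePress page tends to emit, not verified fingerprints:

```python
def detect_vitepress(html: str, url: str, meta: dict) -> str | None:
    strong = 0
    weak = 0
    # Strong signal -- only VitePress sets this generator meta
    if meta.get("generator", "").lower().startswith("vitepress"):
        strong += 1
    # Weak signals -- assumptions for illustration; these could also be
    # Nuxt or a hand-rolled Vite + Vue site
    if "data-v-" in html and "/assets/" in html:  # Vue scoped styles + Vite-shaped asset paths
        weak += 1
    if "VPSidebar" in html:  # default-theme class prefix; custom themes rename it
        weak += 1
    return compose_confidence(strong, weak)
```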
The Orchestrator: First-Match Wins Is a Trap
The obvious orchestrator runs detectors in a fixed order and returns the first one that fires high or medium:
```python
DETECTORS = [
    ("mintlify", detect_mintlify),
    ("docusaurus", detect_docusaurus),
    ("vitepress", detect_vitepress),
    ("gitbook", detect_gitbook),
    ("readme", detect_readme),
]

def classify_platform_naive(html, url, meta):
    """First-match wins. Has a bug — see below."""
    for name, detector in DETECTORS:
        confidence = detector(html, url, meta)
        if confidence in ("high", "medium"):
            return (name, confidence)
    return None
```
That looks innocent. It contains a footgun.
Imagine a page that emits weak Docusaurus signals (it borrowed a Docusaurus-flavored sidebar pattern, scoring two weak Docusaurus markers) AND strong Mintlify signals (the Mintlify generator meta tag). With Docusaurus iterated first, the orchestrator runs detect_docusaurus → returns medium (two weak signals) → the early return fires immediately. detect_mintlify never runs. The page, which is actually a Mintlify site with one borrowed pattern, gets classified as Docusaurus-medium.
Reverse the iteration order and the answer flips: Mintlify-first returns Mintlify-high, the correct answer. Same input, two different outputs depending on list ordering.
The reproducibility property your agent depends on — same page in, same answer out — survives, since the iteration order is fixed at module load. The correctness property — the highest-confidence match always wins — does not. Alphabetizing the detector list looks like a cleanup commit and is actually a behavior change.
The fix isn’t “order detectors by best-guess match probability.” That helps minimize the failure rate, but doesn’t eliminate it — there’s always a site somewhere whose signals favor a less-likely platform. The fix is to evaluate every detector and pick the highest tier:
```python
TIER_RANK = {"high": 3, "medium": 2, "low": 1}

def classify_platform(html, url, meta):
    """Iterate-all-pick-highest. Iteration order is tie-breaker only."""
    results = []
    for name, detector in DETECTORS:
        confidence = detector(html, url, meta)
        if confidence is not None:
            results.append((name, confidence))
    if not results:
        return None
    # Highest tier wins; ties broken by DETECTORS order (stable sort)
    results.sort(key=lambda r: TIER_RANK[r[1]], reverse=True)
    return results[0]
```
Now strong signals always beat weak ones regardless of where in the list a detector sits. Iteration order still matters for ties — if both Mintlify and Docusaurus return high, the first-listed wins — but cross-platform high ties are rare in practice (a page emitting both a Mintlify generator meta AND a Docusaurus generator meta is malformed) and a deterministic tie-breaker is enough.
Pin the property with a parametrized test that runs across iteration orders:
```python
import pytest

from classifier import classify_platform, detect_docusaurus, detect_mintlify

@pytest.mark.parametrize("order", [
    [("mintlify", detect_mintlify), ("docusaurus", detect_docusaurus)],
    [("docusaurus", detect_docusaurus), ("mintlify", detect_mintlify)],
])
def test_classifier_picks_highest_tier_regardless_of_order(monkeypatch, order):
    monkeypatch.setattr("classifier.DETECTORS", order)
    # Strong Mintlify signal (high) + two weak Docusaurus signals (medium).
    # Highest tier (high -> mintlify) must win in either iteration order.
    html = (
        '<meta name="generator" content="Mintlify">'  # strong -> high Mintlify
        '<nav class="navbar"><a class="navbar__item" href="/">Docs</a></nav>'  # weak Docusaurus: BEM convention, copyable
        '<aside class="theme-doc-sidebar-container"></aside>'  # weak Docusaurus: theme class pattern, copyable
    )
    name, confidence = classify_platform(html, "https://example.com", {})
    assert name == "mintlify"
    assert confidence == "high"
```
The test fails loudly the moment someone reverts to the first-match-wins orchestrator, because the [docusaurus, mintlify] parametrize variant would return Docusaurus-medium under that logic. The test doesn’t care about iteration order — it cares about the property that the highest-confidence answer wins. Pinning that property is what keeps the classifier safe under refactor.
One last note: document the iteration-order rationale in the orchestrator’s docstring even though correctness no longer depends on it. The next contributor adding a sixth platform needs to know where to place it (we still order by best-guess match probability so ties break sensibly).
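One way that docstring might read (a sketch, not canonical wording):

```python
def classify_platform(html, url, meta):
    """Evaluate every detector; the highest confidence tier wins.

    Correctness does NOT depend on DETECTORS order -- every detector
    runs on every page. The order only breaks ties between detectors
    that return the same tier, so keep the list sorted by best-guess
    match probability and slot new platforms in accordingly.
    """
    ...
```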
Confidence-Graded, Not Boolean
The classifier returns `("mintlify", "high")` — not `is_mintlify: True`. The confidence tier is the difference between an agent that hedges intelligently and one that asserts wrongly.
How the calling code consumes the tiers:
- `high` → act as if the answer is correct. Apply Mintlify-specific scrapers, render “your Mintlify docs” in copy, route to the Mintlify-aware code path.
- `medium` → soften phrasing in customer-facing surfaces (“we detected likely-Mintlify markers”); apply platform-specific behavior with a fallback path if it fails.
- `low` → don’t surface to the user. Keep the result in your data layer as a supplementary hint for downstream code that wants the signal without acting on it as truth.
- `None` → no detection. Fall back to platform-agnostic behavior.
The mistake to avoid: collapsing the tiers to a boolean at the API boundary. If your `classify_platform()` function returns just `"mintlify" | None`, every caller is forced to treat medium and high identically. Surface the confidence and let each caller decide its own threshold.
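A sketch of the boundary done both ways; `meets` is a hypothetical helper built on the `TIER_RANK` table from earlier:

```python
# Anti-pattern: the boolean boundary. Every caller is now stuck with
# this function's hard-coded idea of "good enough".
def is_mintlify(html: str, url: str, meta: dict) -> bool:
    return detect_mintlify(html, url, meta) in ("high", "medium")

# Graded boundary: surface the tier, let each caller pick its own bar.
def meets(confidence: str | None, threshold: str) -> bool:
    return confidence is not None and TIER_RANK[confidence] >= TIER_RANK[threshold]

result = classify_platform(html, url, meta)
if result and meets(result[1], "high"):    # a cautious caller
    ...
if result and meets(result[1], "medium"):  # a caller with a fallback path
    ...
```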
Detection Is Metadata, Not Scoring
This last decision interacts with how your agent uses the classification.
Detecting Mintlify shouldn’t change a quality score, ranking, or ordering. The score should be invariant under “did we recognize the platform.” If your scoring function awards points for “this page is on Mintlify,” you’ve created a perverse incentive — sites that defeat your fingerprints (renamed CSS classes, custom themes) get penalized for being unrecognizable rather than for being lower-quality. A custom-built docs site that’s actually excellent (Stripe Docs is the canonical example) shouldn’t lose points just because your classifier didn’t recognize the stack.
The fix is structural: keep detection and scoring as separate code paths. Detection enriches metadata (report copy, agent routing, downstream feature gating). Scoring is purely about content quality. The two never trade points. This is the same shape as the related-article advice for validating AI-parsed output before your code touches it — both push toward keeping interpretive layers out of code paths whose outputs other systems trust.
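A minimal sketch of that separation, with hypothetical names (`analyze_page`, `score_content`); the point is the signatures, not the rubric:

```python
def score_content(html: str) -> float:
    # Quality is computed from content alone. There is no platform
    # parameter, so "we didn't recognize the stack" cannot cost points.
    ...

def analyze_page(html: str, url: str, meta: dict) -> dict:
    return {
        "platform": classify_platform(html, url, meta),  # metadata: routing, report copy
        "quality": score_content(html),                  # scoring: platform-blind
    }
```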
Why Not Just Ask Claude
Comparing the two approaches on the dimensions that matter for an agent:
| Dimension | LLM classification | Heuristic classification |
|---|---|---|
| Latency | 500ms – several seconds | microseconds |
| Cost per page | tokens × pages | zero after implementation |
| Determinism | varies between runs | same input → same output |
| Confidence signal | hidden in prose | explicit tier |
| Failure mode | hallucinated platform | None (honest no-detect) |
| Maintenance | model drift between versions | regenerate fingerprints when platforms change |
The LLM-based version is right when the question genuinely requires interpretation that no regex can capture — judging tone, summarizing intent, deciding whether prose is hostile. Page classification is not that question. The answer is in the HTML.
A Concrete Aside: Picking Which Platforms to Cover
One operational note worth pinning, because it’s the input the rest of the article assumes you’ve gotten right.
You can only classify what you’ve written a detector for. The list of platforms you cover is itself a decision — pick wrong and your classifier works perfectly at finding nothing your agent will actually encounter. The two filters that matter:
- Real deployment data, not folk knowledge. Pull the npm download counts; check Wappalyzer or wmtips detection numbers; look at the platform’s own customer pages. “The big five” is a phrase developers say confidently and inaccurately.
- Structural detectability, not just popularity. A platform with no signals unique to it (e.g., a “theme on top of Next.js” that emits the same markers as raw Next.js) cannot be classified cleanly — adding it to your list guarantees false positives. Drop it from the list even if it’s popular.
The intersection is the working filter: a platform earns a slot in your detector list when it’s measurably popular AND has at least one strong (unique) signal. If neither condition holds, the platform doesn’t belong in your classifier — accepting a non-detect for those sites is better than asserting wrongly.
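As a sketch, the filter is an AND over two columns; the entries and flags below are placeholders you’d fill from real deployment data:

```python
# Hypothetical survey data -- populate from npm counts, Wappalyzer, etc.
CANDIDATES = {
    "mintlify":          {"measurably_popular": True, "has_unique_signal": True},
    "docusaurus":        {"measurably_popular": True, "has_unique_signal": True},
    "some-nextjs-theme": {"measurably_popular": True, "has_unique_signal": False},
}

COVERED = [
    name for name, c in CANDIDATES.items()
    if c["measurably_popular"] and c["has_unique_signal"]
]
# "some-nextjs-theme" drops out: popular, but it emits the same markers
# as raw Next.js, so any detector for it guarantees false positives.
```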
The Skeleton Your Agent Plugs Into
Putting it together — what an agent’s call site looks like:
```python
result = classify_platform(html, url, meta)

if result is None:
    use_generic_scraper(html)
elif result[1] == "high":
    use_platform_scraper(result[0], html)
elif result[1] == "medium":
    try:
        use_platform_scraper(result[0], html)
    except PlatformScrapingError:
        use_generic_scraper(html)
else:  # low
    use_generic_scraper(html)
    log_low_confidence_signal(result[0])  # supplementary hint, not a decision
```
The agent never asks Claude what platform it’s looking at. It asks a function that runs in microseconds and returns a tier the agent can branch on. The classification is wrong sometimes — that’s fine, because the confidence tier tells the agent how much to trust it. Heuristic classifiers don’t need to be perfect to beat LLM calls on this kind of question. They need to be honest.