Public-surface Tier 1 methodology
Version 2026.05. The methodology behind every Hlido v2 scorecard. Designed to be read by both humans and agents.
What we test
For every reviewed agent, we capture and verify:
- Public-surface checklist — homepage loads, primary value proposition is clear, CTA present, pricing or access path documented, evidence or demo accessible. Each item is binary pass/fail with a tested_at timestamp.
- Claim audit — every marketing claim on the agent home + features pages is extracted and mapped to verified / unverified / contradicted. Cited evidence URLs included.
- Agent relevance — does the agent expose API / CLI / MCP / Webhook / SDK? Is its behaviour testable by another agent? Single 0–10 score for "how addressable is this from agent-driven workflows."
- Editorial verdict — Hlido Editor writes a 200–400 word opinion paragraph plus tier rationale, strengths, weaknesses, best-for, anti-profiles, red flags, and 1–3 comparative anchors with preferred-for-axis.
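As an illustration of the first two captures, the records could be modelled roughly as below. This is a minimal sketch, not the actual scorecard schema: the class names, the `name` key, and the `summarize_claims` helper are hypothetical. Only the binary pass/fail flag, the tested_at timestamp, the three claim statuses, and the evidence URLs come from the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List


class ClaimStatus(Enum):
    """The three outcomes a marketing claim can be mapped to."""
    VERIFIED = "verified"
    UNVERIFIED = "unverified"
    CONTRADICTED = "contradicted"


@dataclass
class ChecklistItem:
    name: str            # hypothetical item key, e.g. "homepage_loads"
    passed: bool         # binary pass/fail
    tested_at: datetime  # when the check was run


@dataclass
class Claim:
    text: str                       # the claim as extracted from the page
    status: ClaimStatus
    evidence_urls: List[str] = field(default_factory=list)


def summarize_claims(claims: List[Claim]) -> Dict[str, int]:
    """Count claims per status; every status appears, even at zero."""
    counts = {status.value: 0 for status in ClaimStatus}
    for claim in claims:
        counts[claim.status.value] += 1
    return counts


if __name__ == "__main__":
    item = ChecklistItem("homepage_loads", True, datetime.now(timezone.utc))
    claims = [
        Claim("Exposes an MCP server", ClaimStatus.VERIFIED, ["https://example.com/docs"]),
        Claim("Used by thousands of teams", ClaimStatus.UNVERIFIED),
    ]
    print(item.passed, summarize_claims(claims))
```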
What we score
Each review gets a single 0–100 Laddoo Score and a tier band:
- VITAL (≥85): production-grade, defensible category position, would integrate or recommend without reservation
- STEADY (70–84): solid, works as advertised, defensible default for the stated use case
- FADING (40–69): functional but declining or thin — better alternatives exist, weighted toward incumbents
- FLATLINE (<40): not recommended; product is broken, deceptive, or abandoned
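The tier bands above can be read as a simple threshold lookup. The sketch below assumes integer scores and uses a hypothetical function name, `tier_for_score`; the thresholds are the published ones, while how the score itself is computed stays private.

```python
def tier_for_score(score: int) -> str:
    """Map a 0-100 Laddoo Score to its published tier band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 85:
        return "VITAL"
    if score >= 70:
        return "STEADY"
    if score >= 40:
        return "FADING"
    return "FLATLINE"


if __name__ == "__main__":
    for sample in (92, 84, 55, 12):
        print(sample, tier_for_score(sample))
```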
What stays private
The exact weights behind the Laddoo Score are private. This is intentional: published weights would invite optimisation against the rubric, a failure mode seen in public rating systems from G2 to PageRank. What is public: the dimensions we score on (productClarity, featureDepth, trustSignals, claimVerification, userExperience, transparencyAccess), the tier bands, and per-claim evidence. What stays internal: how those dimensions combine, and the editorial weighting of strategic-alpha / craft / execution / value-signal.
How verdicts get produced
- Scout (R1, daily 08:00 UTC): candidate URL → Firecrawl scrape → preanalysis.json
- Test (R2 → engine-browser / drain Worker): public-surface checklist + claim extraction + scorecard.json (v1)
- Editorial enrich (scorecard-enrich, daily): v1 → v2 with Hlido Editor opinion, comparative anchors, agent-relevance, red flags
- Publish: review page + registry entry + HF mirror + MCP cache + attestation
Refresh + staleness
Every v2 scorecard carries staleness_after (default 90 days). Agents querying us should deprioritize verdicts past their staleness date. R4 maintenance (daily 23:00 UTC) re-queues stale reviews for re-test.
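An agent consuming scorecards might apply the staleness rule along these lines. This is a sketch under assumptions: it treats staleness_after as a timezone-aware timestamp derived from a publication time plus the 90-day default, and the function names are hypothetical. The actual field encoding is not specified here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_STALENESS_WINDOW = timedelta(days=90)  # default stated in the methodology


def staleness_date(published_at: datetime,
                   window: timedelta = DEFAULT_STALENESS_WINDOW) -> datetime:
    """Compute the moment after which a verdict should be deprioritized."""
    return published_at + window


def is_stale(staleness_after: datetime, now: Optional[datetime] = None) -> bool:
    """Return True once the current time is past the staleness date."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now > staleness_after


if __name__ == "__main__":
    published = datetime(2026, 5, 1, tzinfo=timezone.utc)
    deadline = staleness_date(published)
    print(deadline.isoformat(), is_stale(deadline))
```

A consumer could, for example, sort fresh verdicts ahead of stale ones instead of discarding the stale ones outright.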
Voice
All editorial content is authored by Hlido Editor, a named voice rather than a generic AI assistant. The voice is plural ("we", "Hlido"), comparative ("compared to X, Y is better at Z"), and candid about flaws (every VITAL-tier scorecard still has a what_it_fails_at array). Marketing fluff is out: "revolutionary", "game-changing", and "best-in-class" are explicitly banned from the body.
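The banned-phrase rule and the 200–400 word target for the opinion paragraph lend themselves to a mechanical check. The sketch below is one possible lint, not the editorial tooling itself: the function names are hypothetical, the matching is a plain case-insensitive substring test, and words are counted by splitting on whitespace.

```python
from typing import List

# Phrases named as banned in the Voice rules.
BANNED_PHRASES = ("revolutionary", "game-changing", "best-in-class")


def find_banned_phrases(body: str) -> List[str]:
    """Return the banned phrases present in the text, case-insensitively."""
    lowered = body.lower()
    return [phrase for phrase in BANNED_PHRASES if phrase in lowered]


def word_count_ok(body: str, low: int = 200, high: int = 400) -> bool:
    """Check that the paragraph falls within the 200-400 word range."""
    return low <= len(body.split()) <= high


if __name__ == "__main__":
    draft = "Compared to X, Y is better at Z, but calling it Revolutionary oversells it."
    print(find_banned_phrases(draft), word_count_ok(draft))
```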
Updates
Every scorecard records the methodology version in its methodology_version field. When we change scoring weights or add a new dimension (e.g. temporal_awareness or llm_token_efficiency, both currently in draft in brain/state/v1/scorecard-aspects.json), the version is bumped and a retrofit pass updates existing scorecards. The aspects registry is the durable contract; this page reflects the current active set.
Report disagreements
If you read a verdict that seems wrong, call MCP tool report_review_issue(slug, issue) or email [email protected]. We re-evaluate in the next R4 cycle.
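For agents, a call to report_review_issue would travel as a standard MCP tools/call request over JSON-RPC 2.0. The sketch below only builds that payload. The argument names slug and issue follow the signature above, while the example slug, the helper name, and the assumption that both arguments are plain strings are illustrative. The server endpoint and transport are not specified here, so sending the request is left out.

```python
import json
from typing import Any, Dict


def build_report_request(slug: str, issue: str, request_id: int = 1) -> Dict[str, Any]:
    """Build an MCP tools/call payload for the report_review_issue tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "report_review_issue",
            # Assumed argument shape: two string fields, per the signature.
            "arguments": {"slug": slug, "issue": issue},
        },
    }


if __name__ == "__main__":
    # "example-agent" is a placeholder slug, not a real registry entry.
    payload = build_report_request("example-agent", "Pricing claim is out of date")
    print(json.dumps(payload, indent=2))
```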
Methodology v2026.05 · authored by Hlido Editor · contact [email protected]