Methodology · public overview

How Hlido reviews an AI agent.

Independent, reproducible, human-accountable. We publish our method, the evidence behind every claim, and the test harness that produces our scores. The one part of the scoring method we keep private, the exact scoring weights and internal rubrics, is a time-limited exception. Here is everything public about how a verdict is produced.

A named human stands behind every verdict

Hlido is AI-operated but human-accountable. Ankit Kapur (Founder & Editor) is the named, accountable signatory for the trust layer and its methodology — reachable at /contact. "AI-run, human-accountable, fully reproducible" is the standard we hold ourselves to. Disputes, corrections, and right-of-reply requests reach a real person, not a void — see our editorial policy and conflict-of-interest disclosure.

Three stages

Intake. We build a profile of the candidate from its public claims, pricing, docs, and, when published, its MCP server-card. Every review starts from what the agent says it does. If the agent cannot be accessed (closed registration, enterprise-only, broken flow), we say so and assign a tier accordingly rather than pretending we tested it.
Live test. An isolated browser or headless runtime attempts the real product flow. Meaningful interactions are recorded. Blockers — login walls, trial gates, setup dead-ends — are captured as part of the evidence, not written around.
Signed verdict. We publish a Laddoo Score (0–100), a tier label, a plain-English verdict, a signed PNG evidence grid, and a claim-vs-evidence table. CLI casts are published when the test uses a terminal.

Zero per-agent customization

Every reviewed agent moves through the same pipeline. An agent is reviewed against the standard contract or it is marked not_testable with a public reason. We never quietly hand-tune the process for one vendor, and we never hide an exclusion — every gate where an agent drops out is published with the specific reason (login wall, B2B-demo gate, no public surface, and so on).

What goes into a Laddoo Score

The score is computed from the seven public dimensions listed below. We publish the dimensions. We do not publish the weights or internal rubric details; that is the line we hold so scores can't be reverse-engineered or gamed. It is a timed exception, not a permanent moat: we will publish the weights once gaming-resistance is proven at scale. Everything else that lets you check our work (the inputs, the evidence, the pass/fail logic) is already open.

Proof depth — how much of the product's surface area we were actually able to test.
Claim coverage — fraction of public claims we could verify, contradict, or honestly mark unverifiable.
Setup coverage — whether a new user could get from landing page to useful output without hidden friction.
Edge coverage — how the agent behaves on the boundaries (empty input, long input, contradictory instructions).
Evidence count — number of recorded, signed interactions backing the verdict.
Assertions passed / failed — binary checks against the test checklist.
Momentum — trend since last test when the agent has been reviewed before.

Per-claim evidence — outcomes, not opinions

Our sharpest differentiator is that we cite reproducible task outcomes, not opinions. For every agent, each marketing or capability claim is recorded as one of:

PASS — we reproduced the claimed behavior; evidence attached.
FAIL — we attempted it and it did not hold; evidence attached.
UNVERIFIED — we could not test it from available surfaces; we say why.

Evidence is concrete and inspectable: a signed screenshot, a terminal recording, a task transcript, or a behavioral run record. A reader — human or agent — can follow the evidence back to the underlying artifact rather than taking our word for it. Benchmarks measure models, eval tools grade their own customers' agents, and review sites collect opinions — none ship per-claim, reproducible, independent failure evidence. We do.
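
The three claim outcomes above can be pictured as a small record type. This is a minimal sketch, not Hlido's actual schema: the names `ClaimRecord`, `evidence`, `reason`, and `tally` are assumptions made for illustration. It encodes only the two rules stated above, that PASS and FAIL carry evidence and that UNVERIFIED carries a stated reason.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ClaimVerdict(Enum):
    PASS = "PASS"              # claimed behavior reproduced
    FAIL = "FAIL"              # attempted, did not hold
    UNVERIFIED = "UNVERIFIED"  # could not be tested from available surfaces


@dataclass(frozen=True)
class ClaimRecord:
    # Hypothetical field names; the published scorecard format may differ.
    claim: str
    verdict: ClaimVerdict
    evidence: tuple = ()   # references to artifacts (screenshot, cast, transcript)
    reason: str = ""       # why the claim could not be tested

    def __post_init__(self):
        if self.verdict is ClaimVerdict.UNVERIFIED:
            if not self.reason:
                raise ValueError("UNVERIFIED claims need a stated reason")
        elif not self.evidence:
            raise ValueError(f"{self.verdict.value} claims need evidence attached")


def tally(records):
    """Count claims per outcome, always reporting all three outcomes."""
    counts = Counter(r.verdict for r in records)
    return {v.value: counts.get(v, 0) for v in ClaimVerdict}
```

A record with a PASS or FAIL verdict and no evidence is rejected at construction time, which mirrors the "evidence attached" requirement rather than leaving it to convention.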

The behavioral test harness

For agents that expose a testable interface, Hlido runs a behavioral test: the same fixed task suite, in an isolated container, with deterministic graders. The harness is category-agnostic — the same infrastructure serves coding, extraction/scraping, research, browsing, and beyond; each category is a set of fixed tasks plus graders, not a bespoke rebuild. Three gates, a public reason at each, no hidden filtering:

Classifier gate — is the agent behaviorally testable at all? Documentary-only agents are marked as such.
Adapter gate — can we drive it through a standard adapter (CLI, MCP/JSON-RPC, or a vendor self-submission endpoint)? If not → not_testable: <public reason>.
Task gate — the agent runs the fixed task suite. Even a 0-of-N result is published (FLATLINE).

Behavioral results carry a consistency signal, not just a one-shot number: an agent that scores the same across repeated runs is reported as STEADY; volatility is surfaced rather than averaged away. Results re-run on meaningful triggers — a new release, a spec change, a newly eligible agent — rather than on an arbitrary calendar. Sales-gated agents with no public test surface can self-submit against the same published task contract via a documented endpoint; self-submission changes who runs the harness, never what the harness checks.
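
The gate sequence described above can be sketched as a short function. This is an illustration under stated assumptions, not the published harness: the agent dictionary keys, the adapter labels, the `documentary_only` status string, and the `run_suite` callback are all hypothetical. What it does take from the text is the order of the gates, the public reason at each exit, and the fact that a 0-of-N result is still returned as a result.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumed labels for the three adapter kinds named in the text.
SUPPORTED_ADAPTERS = {"cli", "mcp", "self_submission"}


@dataclass(frozen=True)
class GateResult:
    status: str                   # "documentary_only", "not_testable", or "tested"
    gate: str                     # gate at which the agent stopped
    reason: Optional[str] = None  # public reason when the agent drops out
    passed: Optional[int] = None
    total: Optional[int] = None


def run_gates(agent: dict, run_suite: Callable[[dict], Tuple[int, int]]) -> GateResult:
    # Classifier gate: is the agent behaviorally testable at all?
    if not agent.get("behaviorally_testable", False):
        return GateResult("documentary_only", "classifier", "no behavioral test surface")
    # Adapter gate: can it be driven through a standard adapter?
    adapter = agent.get("adapter")
    if adapter not in SUPPORTED_ADAPTERS:
        return GateResult("not_testable", "adapter", f"no standard adapter (got {adapter!r})")
    # Task gate: run the fixed suite; a 0-of-N outcome is still published.
    passed, total = run_suite(agent)
    return GateResult("tested", "task", None, passed, total)


def run_spread(scores):
    """Difference between best and worst repeated run; 0 means identical runs."""
    return max(scores) - min(scores)
```

Returning a value at every exit, instead of raising or silently skipping, is the design choice that matches "no hidden filtering": each dropout has a gate name and a reason that can be published.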

Tiers

VITAL — 90+, strong public proof, most claims verified, few or no blockers.
STEADY — 70+, works, with some friction or unverifiable claims.
FADING — 40+, partial verification or meaningful blockers.
FLATLINE — under 40, core claims contradicted or the product couldn't be used at all.

A FLATLINE is published as readily as a VITAL. A 0-of-N behavioral result is a result, not a reason to hide an agent.
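
Read as numeric bands, the thresholds above map a score to a tier as follows. This is a minimal sketch covering only the score cut-offs; the function name `tier_for` is an assumption, and the qualitative parts of each tier description (proof strength, blockers) are not modeled.

```python
def tier_for(score: int) -> str:
    """Map a 0-100 Laddoo Score to its tier label using the published cut-offs."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "VITAL"
    if score >= 70:
        return "STEADY"
    if score >= 40:
        return "FADING"
    return "FLATLINE"
```

Checking the highest band first means each "N+" threshold only needs a lower bound, so 89 falls to STEADY and 39 falls to FLATLINE.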

Independence & conflict-of-interest policy

Independence is the whole product, so we engineer against the incentives that corrupt other raters:

We never charge an agent for its score. Issuer-pays rating is poison for an independent verdict. Reviewed agents cannot buy, influence, or pre-see their verdict.
We monetize the audience, not the vendors. Revenue comes from the consuming side — data/API access, recurring re-certification, referrals — not from the agents being judged.
The Verified Badge is firewalled. Any paid badge is separated from the verdict function by a published firewall and anchored to passing an objective, published pass/fail test — never to a better score. It is a small hygiene line, not a lever over verdicts.
No regulatory capture. Hlido is an independent assurance/rating brand. We align our taxonomy to emerging public standards as they land; we do not sell a license or a government stamp.

Right of reply

Every reviewed vendor has a right of reply. If you believe a verdict, a claim verdict, or a piece of evidence is wrong, you can respond — and your response is attached to the scorecard. Factual corrections backed by evidence are incorporated and logged in the change history. This is both a fairness commitment and how we keep verdicts about named products accurate and defensible. See our dispute process for how to file.

What we publish per review

Laddoo Score, tier, and plain-English verdict
C2PA-signed PNG screenshot grid
Claim-vs-evidence table with per-claim verdicts
Blocker log and CLI cast when applicable
Machine-readable record at hlido-eu/agent-benchmark (sanitized) and via MCP

What we keep private

Exact Laddoo Score weights and internal dimension rubrics — the one exception, and it's on a timer
Pre-test analysis notes
Non-public communications with agent teams

See the full editorial policy at /legal/editorial-policy.html and our conflict-of-interest disclosure at /legal/conflict-of-interest.html.

How to reproduce a verdict

Read the scorecard JSON at /data/scorecards/{slug}.json.
Inspect the per-claim evidence artifacts linked from it.
Re-run the published task suite against the agent (or read the attached run records / transcripts).
Compare your outcome to ours. If it disagrees, that is exactly what the right-of-reply route is for.
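
The comparison step can be sketched as a small diff over per-claim verdicts. The scorecard schema is not described here, so the field names `claims`, `id`, and `verdict`, along with the inline sample, are assumptions for illustration only; check the real JSON before relying on them.

```python
import json


def claim_verdicts(scorecard: dict) -> dict:
    """Extract {claim_id: verdict} from a scorecard (hypothetical field names)."""
    return {c["id"]: c["verdict"] for c in scorecard.get("claims", [])}


def disagreements(published: dict, reproduced: dict) -> list:
    """List (claim_id, published_verdict, your_verdict) for every mismatch.

    `reproduced` maps claim ids to the verdicts you reached on your own run.
    A claim missing from the published scorecard shows up with None.
    """
    theirs = claim_verdicts(published)
    return [
        (claim_id, theirs.get(claim_id), verdict)
        for claim_id, verdict in sorted(reproduced.items())
        if theirs.get(claim_id) != verdict
    ]


# Inline stand-in for a file read from /data/scorecards/{slug}.json.
sample = json.loads(
    '{"slug": "example-agent", "claims": ['
    '{"id": "c1", "verdict": "PASS"}, {"id": "c2", "verdict": "FAIL"}]}'
)
```

Calling `disagreements(sample, {"c1": "PASS", "c2": "PASS"})` returns one tuple for `c2`, which is the kind of specific, per-claim difference a right-of-reply submission can point to.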

Standards alignment (in progress)

Hlido's methodology is independently designed to measure agent behavior reproducibly. As public AI-assurance standards mature, including prEN 18286 (AI quality management) and ETSI TS 104 008 (continuous AI auditing), we are aligning our taxonomy and process vocabulary with them. This strengthens transparency without pursuing notified-body status: Hlido remains an independent rating brand, not a certifier. See the full standards crosswalk.

Re-testing

Reviews get re-tested. The cadence depends on tier and momentum: VITAL reviews are re-verified monthly, STEADY and FADING reviews are re-tested on material product changes, and FLATLINE reviews are re-tested when the team tells us the blocker is resolved. A re-test can move a score in either direction.
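
The cadence rule can be written as a single decision function. This is a sketch under assumptions: "monthly" is approximated as 30 days, and the function name and the two boolean flags are hypothetical inputs, not part of any published interface.

```python
from datetime import date, timedelta

# Assumption: "monthly" is treated as a fixed 30-day interval.
MONTHLY = timedelta(days=30)


def retest_due(tier: str, last_tested: date, today: date, *,
               material_change: bool = False,
               blocker_resolved: bool = False) -> bool:
    """Decide whether a review is due for re-testing under the stated cadence."""
    if tier == "VITAL":
        return today - last_tested >= MONTHLY
    if tier in ("STEADY", "FADING"):
        return material_change
    if tier == "FLATLINE":
        return blocker_resolved
    raise ValueError(f"unknown tier: {tier!r}")
```

Only the VITAL branch depends on elapsed time; the other tiers are event-driven, so the dates are ignored for them.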

Derived surfaces

Some published measures are derived from the signals above rather than measured anew. Agentic-Commerce Readiness (ACR) is an independent, evidence-based 0–100 measure of whether a reviewed agent can be reached, trusted, and transacted with by orchestrator agents over MCP / ACP / AP2 / A2W — the third-party answer to vendor self-generated "AX scores." It reuses existing review signals under a commerce-readiness lens; see its methodology page for the four axes and bands.
