Hlido Reliability Report — 2026-06-12
Independent reliability signals across 642 reviewed AI agents. Published incidents: 19 (6 critical · 4 high · 2 medium · 7 low).
Dead agents (6)
Agents whose primary product surface no longer exists — domain dead or unreachable, verified on two independent networks:
- stanley-for-x — Primary site unreachable: domain no longer resolves (first observed 2026-05-08)
- playht — Primary site unreachable: domain no longer resolves (first observed 2026-05-08)
- oraza — Primary site unreachable: domain no longer resolves (first observed 2026-05-08)
- hapax — Primary site unreachable: domain no longer resolves (first observed 2026-05-08)
- fleece-ai — Primary site unreachable: domain no longer resolves (first observed 2026-05-08)
- adaptive — Primary site unreachable: domain no longer resolves (first observed 2026-05-08)
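The report does not describe how the two-network check is implemented. As a minimal sketch, assuming each network is represented by a callable that reports whether a hostname resolves, a domain could be treated as dead only when every vantage point fails. The names `system_resolves`, `looks_dead`, and `vantage_points` are hypothetical and not part of Hlido's tooling.

```python
import socket
from typing import Callable, Iterable

# A vantage point is any callable that says whether a hostname resolves.
# In practice each one would run on a different network (assumption).
Resolver = Callable[[str], bool]


def system_resolves(hostname: str) -> bool:
    """Return True if the local system resolver finds an address for hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised when name resolution fails.
        return False


def looks_dead(hostname: str, vantage_points: Iterable[Resolver]) -> bool:
    """Flag a domain only if at least two vantage points all fail to resolve it."""
    results = [resolve(hostname) for resolve in vantage_points]
    return len(results) >= 2 and not any(results)


# Stand-in resolvers so the example runs without network access.
always_fails = lambda hostname: False
always_resolves = lambda hostname: True

dead_on_both = looks_dead("example.invalid", [always_fails, always_fails])
dead_on_one = looks_dead("example.invalid", [always_fails, always_resolves])
```

Requiring agreement from every vantage point keeps a single network's outage or DNS filtering from marking a live product as dead.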
Findings from the review corpus
- AI Agent is the largest category (230 agents) but averages 63.9, 5.5 points below the corpus average
Among reviewed categories with more than 10 agents, 'AI Agent' (230 agents, 36% of corpus) has one of the lowest average scores: 63.9 against a corpus average of 69.4. 83 of its agents (36%) score below 60. Red flags in this category dominate the whole corpus: 83 unverified-claim flags and 32 auth-opacity flags. In contrast, the top categories (Voice 79.3, Eval 79.1, Frameworks & Eval 79.0) score 15+ points higher. The most-reviewed category is also one of the weakest quality signals.
- Chat & Companion (28 agents, avg 54.8) and Companion (3 agents, avg 53) are the bottom performers
Chat & Companion agents average 54.8, the lowest of any category with 5+ agents. Companion agents average 53. Combined, the 31 agents in the consumer chat space average below 55. Evidence coverage is also lowest here (11% for Chat & Companion). These products frequently fail on claim verification and auth transparency. This is a credibility risk for Hlido if these categories are over-visible in the discovery surface.
- Unverified claims is the #1 red flag across 161 scorecards, concentrated in the AI Agent category
Across 635 reviewed agents, a lack of verified or verifiable claims appears as a red flag in 161 scorecards (25% of all reviews). The AI Agent category alone accounts for 83 of these (52%). The three-flag cluster of unverified claims (161), auth opacity (104), and sparse docs (97) totals 362 flags and affects an estimated 35-40% of the corpus. Together these represent the single biggest trust gap in the AI agent ecosystem.
- Only 21% of reviews achieve high confidence; 28% are low confidence
Confidence breakdown: high=136 (21%), medium-high=94 (15%), medium=221 (35%), medium-low=5 (1%), low=178 (28%). Over half the corpus (63%) is medium or lower confidence, driven primarily by login walls, sparse public surfaces, and limited testability of enterprise and API products. Low-confidence reviews disproportionately affect the AI Agent category. High-confidence reviews correlate with open-source tools, CLI agents, and API-first products with public docs.
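Several of the percentages in the findings above can be re-derived from the counts quoted alongside them. The sketch below does that arithmetic; the variable names and dictionary layout are illustrative and are not taken from Hlido's machine-readable edition. Confidence shares are computed against the sum of the five listed counts, which is an assumption about the denominator.

```python
# Counts copied from the findings above; names are illustrative only.
flags = {"unverified claims": 161, "auth opacity": 104, "sparse docs": 97}
confidence = {"high": 136, "medium-high": 94, "medium": 221, "medium-low": 5, "low": 178}


def pct(part, whole):
    """Whole-number percentage of part within whole."""
    return round(100 * part / whole)


# Total size of the three-flag cluster.
cluster_total = sum(flags.values())

# Share of the 635 reviews carrying the unverified-claims flag,
# and the AI Agent category's share of those flags (83 of 161).
unverified_share = pct(flags["unverified claims"], 635)
ai_agent_share = pct(83, flags["unverified claims"])

# Confidence shares, using the sum of the listed counts as the denominator.
confidence_total = sum(confidence.values())
confidence_shares = {level: pct(n, confidence_total) for level, n in confidence.items()}

# Agent-weighted average score for the combined consumer chat space
# (28 agents at 54.8 and 3 agents at 53).
consumer_chat_avg = (28 * 54.8 + 3 * 53) / (28 + 3)

print(cluster_total, unverified_share, ai_agent_share)
print(confidence_shares)
print(round(consumer_chat_avg, 1))
```

Weighting the combined average by agent count matters here: the 3 Companion agents barely move the 28-agent Chat & Companion figure.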
Registry state
Tier distribution across 642 scored agents: FADING 297 · STEADY 193 · VITAL 152.
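As a quick consistency check on the headline figures, the tier counts should sum to the number of scored agents and the severity counts to the number of published incidents. The sketch below performs that check and derives each tier's share; the dictionary layout is an assumption for illustration, not the structure of report.json.

```python
# Figures copied from this report; the layout is illustrative only.
scored_agents = 642
published_incidents = 19
tiers = {"FADING": 297, "STEADY": 193, "VITAL": 152}
severities = {"critical": 6, "high": 4, "medium": 2, "low": 7}

# Each breakdown should account for every item in its headline total.
tiers_consistent = sum(tiers.values()) == scored_agents
incidents_consistent = sum(severities.values()) == published_incidents

# Share of scored agents in each tier, to one decimal place.
tier_shares = {name: round(100 * n / scored_agents, 1) for name, n in tiers.items()}

print(tiers_consistent, incidents_consistent)
print(tier_shares)
```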
Independent, evidence-backed. Machine-readable edition: report.json · incidents API · MCP. Hlido never exposes scoring weights.