The State of AI Agents, 2026 — what 664 hands-on reviews reveal
We independently tested 664 AI agents. Nearly half are "fading" — the gap between the launch claim and the working product is now the defining feature of the category.
By the Hlido Editor · 2026-06-14
Everyone is shipping an "AI agent." Almost half of them don't deliver.
Over the past year, Hlido independently tested 664 AI agents — hands-on, through the CLI, API, or live product, claim by claim — and published one evidence-backed verdict for each: a Laddoo Score from 0 to 100. No vendor surveys, no self-reported benchmarks. Just what happened when we actually ran the thing.
Here is what the corpus says about the state of the agent economy.
1. Nearly half of all agents are fading
The median agent scores 73 / 100; the average is 69. But the distribution is the real story:
- VITAL (90+): 23% — genuinely deliver, with evidence to back the claims.
- STEADY (70–89): 31% — solid, with caveats.
- FADING (40–69): 46% — the largest group. The product exists, the marketing is confident, but the core promise wobbles under a real test.
That is about 305 of the 664 agents we tested. The gap between the launch tweet and the working product is now the defining feature of the category.
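To make the bands concrete, here is a minimal Python sketch of how a score could be mapped to a band and how the published shares translate into approximate agent counts. The function names `laddoo_band` and `approx_counts` are hypothetical, not part of any Hlido tooling. The sketch assumes each band starts at its lower bound (so 89.5 would count as STEADY), and because this report gives no band below 40, it returns `None` there rather than guessing.

```python
from typing import Optional

# Band floors as listed in this report. No band is given for scores
# below 40, so this sketch leaves that range unlabelled (assumption).
BANDS = [
    (90, "VITAL"),
    (70, "STEADY"),
    (40, "FADING"),
]


def laddoo_band(score: float) -> Optional[str]:
    """Return the band for a 0-100 score, or None below 40 (unspecified)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in [0, 100], got {score}")
    for floor, name in BANDS:
        if score >= floor:
            return name
    return None


def approx_counts(total: int, shares: dict[str, int]) -> dict[str, int]:
    """Turn rounded percentage shares into approximate agent counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


# Shares and corpus size as published above; counts are approximate
# because the percentages are already rounded.
SHARES = {"VITAL": 23, "STEADY": 31, "FADING": 46}
COUNTS = approx_counts(664, SHARES)

print(laddoo_band(73))  # the median score quoted above
print(COUNTS)
```

Since the published percentages are rounded, the counts are estimates rather than exact tallies from the review database.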
2. The most crowded category is one of the weakest
"AI Agent" — general-purpose autonomous agents — is the single largest category (154 products) and one of the lowest-scoring (avg 59.6). "Chat & Companion" scores lower still (56.3). The land-grab is real, and most entrants haven't earned the name yet.
For buyers: the more generic the pitch ("an autonomous agent for everything"), the more you should demand evidence before you commit.
3. Where agents actually work
The categories that hold up under testing are the specific, bounded ones:
| Category | Avg score |
|---|---|
| Voice | 77.5 |
| Frameworks & Eval | 77.4 |
| Infrastructure | 75.4 |
| Coding | 75.2 |
The pattern: agents that do one well-scoped job — route a call, write code, run a workflow — outperform the ones promising general autonomy. Narrow and real beats broad and aspirational.
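The split between bounded and general-purpose categories can be checked with a few lines of Python. This is only a sketch over the six category averages quoted in this report, not the full category list. Comparing a category average against the STEADY floor of 70 is an assumption made for illustration, since the bands are defined for individual agents. The name `split_by_floor` is hypothetical.

```python
# Average scores quoted in sections 2 and 3 of this report.
CATEGORY_AVG = {
    "Voice": 77.5,
    "Frameworks & Eval": 77.4,
    "Infrastructure": 75.4,
    "Coding": 75.2,
    "AI Agent": 59.6,
    "Chat & Companion": 56.3,
}

# Lower edge of the STEADY band from section 1. Applying it to a
# category average is an illustrative assumption.
STEADY_FLOOR = 70


def split_by_floor(averages, floor=STEADY_FLOOR):
    """Return (at_or_above, below) category names, highest average first."""
    ranked = sorted(averages, key=averages.get, reverse=True)
    at_or_above = [name for name in ranked if averages[name] >= floor]
    below = [name for name in ranked if averages[name] < floor]
    return at_or_above, below


HOLDING_UP, LAGGING = split_by_floor(CATEGORY_AVG)

# Gap between the highest and lowest of the six quoted averages.
SPREAD = round(max(CATEGORY_AVG.values()) - min(CATEGORY_AVG.values()), 1)

print(HOLDING_UP)
print(LAGGING)
print(SPREAD)
```

On these figures, the four bounded categories sit above the floor and the two general-purpose ones sit below it, with about 21 points between the highest and lowest average.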
4. Reliability is the next frontier — and it's fragile
A good score on launch day isn't the whole story. Hlido's incident registry has already logged 19 independently verified availability failures and self-reported retractions across reviewed agents. An agent you depend on can degrade, break, or quietly change behaviour, and almost nobody is tracking that in the open.
5. The takeaway
The agent economy has cleared the "can it demo?" bar. The bar now is "can it deliver, repeatedly, under a real workload?", and on that bar 46% are fading and only 23% are vital. Independent, evidence-backed testing isn't a nice-to-have anymore; it's the only way a buyer, or another agent, can tell the 23% from the 46%.
Every agent in the corpus is testable, scored, and evidence-backed. Browse all 664 verdicts at hlido.eu/reviews, or query them over MCP at hlido.eu/mcp. Methodology: outcomes, claim audits, and signed evidence are public; scoring weights are not.