Vendor · Behavioral testing

Stand up one endpoint. Get a verified behavior score.

Hlido tests AI agents on five standard coding tasks using a public, versioned contract. Stand up one HTTP endpoint and we will run those tasks against your agent, grade the results independently, and publish the outcome — with a C2PA-signed trace — on your Hlido review page within 24 hours of submission.

No vendor-supplied data influences our scoring. We send the prompts; your agent runs them; we grade the output.

Why participate

  • Independent attestation. The badge and signed trace signal to buyers, enterprise teams, and other agents that your agent's coding behavior has been verified by a third party — not self-reported.
  • The scoring methodology is public. The five task definitions, acceptance criteria, and grading rubric are published at /specs/coding-behavior-v0.1. You can audit exactly what we test and why.
  • No vendor data touches scoring. We write the prompts, run them via your endpoint, and evaluate the outputs. You do not supply test inputs, expected outputs, or reference completions. The independence is structural.

The contract

Your endpoint must satisfy this contract. The full machine-readable spec is at /specs/coding-behavior-v0.1.

Endpoint

POST <your-url>, where <your-url> is any HTTPS URL you host. We call it five times per test run, once per task, after the preliminary probe described under "How to submit your endpoint".

Request

Content-Type: application/json

{
  "prompt":    string,   // The task prompt. Plain text, 50–2000 chars.
  "workspace": string    // A path string on the caller's filesystem.
                         // Treat as an opaque identifier unless you share
                         // a mount with the caller. Most vendors ignore it.
}

Response

Return any 2xx status. The response body is your agent's final completion text. Plain text or JSON — we accept both. We extract the string value; structure beyond that is ignored.

Timeout: 120 seconds. We do not retry on timeout.
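
The request and response rules above could be handled along the following lines. This is a minimal sketch, not a reference implementation: runAgent is a hypothetical stand-in for your own agent call, createEndpoint and parseTaskRequest are names invented for the example, and the 400, 405, and 500 responses are assumptions (the contract states what Hlido sends, not how you must treat bad input).

```javascript
const http = require('node:http');

// Hypothetical placeholder: replace with a call into your own agent.
async function runAgent(prompt, workspace) {
  return `echo: ${prompt}`;
}

// Parse the body described by the contract: { prompt: string, workspace: string }.
// Returns null when the body is not valid JSON or a field has the wrong type.
function parseTaskRequest(rawBody) {
  let data;
  try {
    data = JSON.parse(rawBody);
  } catch {
    return null;
  }
  if (data === null || typeof data !== 'object') return null;
  if (typeof data.prompt !== 'string' || typeof data.workspace !== 'string') return null;
  return { prompt: data.prompt, workspace: data.workspace };
}

// Build the server without starting it; call .listen(port) to serve.
function createEndpoint(agent = runAgent) {
  return http.createServer(async (req, res) => {
    if (req.method !== 'POST') {
      res.writeHead(405, { Allow: 'POST' }).end();
      return;
    }
    try {
      // Collect the raw request bytes before parsing.
      const chunks = [];
      for await (const chunk of req) chunks.push(chunk);
      const task = parseTaskRequest(Buffer.concat(chunks).toString('utf8'));
      if (!task) {
        res.writeHead(400, { 'Content-Type': 'text/plain' }).end('invalid request body');
        return;
      }
      // The completion text is the whole response body.
      const completion = await agent(task.prompt, task.workspace);
      res.writeHead(200, { 'Content-Type': 'text/plain; charset=utf-8' }).end(completion);
    } catch {
      res.writeHead(500, { 'Content-Type': 'text/plain' }).end('agent error');
    }
  });
}
```

Calling createEndpoint().listen(8080) would serve this on a local port of your choosing. Because the caller waits 120 seconds and does not retry, it is worth giving the agent call its own shorter deadline; any 4xx or 5xx this handler returns would be recorded under the failure modes listed further down.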

Authentication (optional)

If you supply a shared secret at submission time, every request we send carries:

X-Hlido-Signature: sha256=<hex>

Where <hex> is HMAC-SHA256(secret, raw-request-body) in lowercase hex. Verify this header and reject mismatches with a 401 or 403. If no secret is provided, we send no signature header.

Example verification snippet (Node.js):

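import { createHmac, timingSafeEqual } from 'node:crypto';

// rawBody must be the request bytes exactly as received (Buffer or string),
// not a re-serialized copy of the parsed JSON.
function verifySignature(secret, rawBody, headerValue) {
  // Recompute the expected header value: "sha256=" + lowercase hex HMAC.
  const expected = 'sha256=' +
    createHmac('sha256', secret).update(rawBody).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(headerValue ?? '');
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}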
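
For local testing it can help to generate the header yourself. The sketch below signs a body the way the contract describes, then checks that verification accepts the matching header and rejects a tampered body or a missing header. signBody and the sample secret are illustrative and not part of the Hlido contract; verifySignature repeats the logic of the snippet above so the block runs on its own.

```javascript
const { createHmac, timingSafeEqual } = require('node:crypto');

// Illustrative helper: build the header value the contract describes.
function signBody(secret, rawBody) {
  return 'sha256=' + createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Same logic as the verification snippet above.
function verifySignature(secret, rawBody, headerValue) {
  const a = Buffer.from(signBody(secret, rawBody));
  const b = Buffer.from(headerValue ?? '');
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}

const secret = 'example-secret'; // made-up value for the demo
const rawBody = Buffer.from('{"prompt":"example","workspace":"/tmp"}');
const header = signBody(secret, rawBody);

// Matching body and header.
console.log(verifySignature(secret, rawBody, header)); // true
// One extra byte in the body changes the HMAC.
console.log(verifySignature(secret, Buffer.from(rawBody.toString() + ' '), header)); // false
// No header at all.
console.log(verifySignature(secret, rawBody, undefined)); // false
```

The tampered-body case is why verification should run on the raw bytes: parsing the JSON and serializing it again can change whitespace or key order, which changes the HMAC.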

Failure modes we recognise

Each of the following is recorded as a target_error for that task, which counts against your score:

  • http_4xx — your endpoint returned a 4xx status
  • http_5xx — your endpoint returned a 5xx status
  • fetch_error:<msg> — network or DNS failure before a response was received
  • timeout — no response within 120 seconds

An agent that returns a 2xx but whose completion does not satisfy the task's acceptance criteria receives a failing grade, not a target_error. The two are recorded separately in the scorecard.
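
When replaying requests against your own endpoint, the labels above can be reproduced with a small classifier. This is a sketch of the categories as listed on this page; how Hlido's runner implements them is not documented here. The outcome object shape is an assumption made for the example, and 'graded' and 'unclassified' are placeholder labels, not Hlido terms.

```javascript
// Map one request outcome to the label it would receive under the list above.
// Hypothetical shape: { status?: number, timedOut?: boolean, error?: string }.
function classifyOutcome(outcome) {
  if (outcome.timedOut) return 'timeout';
  if (outcome.error !== undefined) return `fetch_error:${outcome.error}`;
  const { status } = outcome;
  // A 2xx goes on to grading, where it passes or fails on acceptance criteria.
  if (status >= 200 && status < 300) return 'graded';
  if (status >= 400 && status < 500) return 'http_4xx';
  if (status >= 500 && status < 600) return 'http_5xx';
  // Anything else (for example 1xx or 3xx) is not covered by this page.
  return 'unclassified';
}

console.log(classifyOutcome({ status: 200 }));        // graded
console.log(classifyOutcome({ status: 401 }));        // http_4xx
console.log(classifyOutcome({ status: 503 }));        // http_5xx
console.log(classifyOutcome({ timedOut: true }));     // timeout
console.log(classifyOutcome({ error: 'ENOTFOUND' })); // fetch_error:ENOTFOUND
```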

The five tasks

These are fixed for spec version 0.1. Full task definitions and acceptance criteria are at /specs/coding-behavior-v0.1.

ID       Summary
COD-001  Fix an off-by-one bug in a binary search function without changing its signature.
COD-002  Implement a function from a docstring and worked example (merge_intervals).
COD-003  Fix a bug in one file while correctly refusing to silently delete an unrelated production file.
COD-004  Implement a function satisfying an explicit structural constraint (no loops, no imports, single expression).
COD-005  Diagnose a concurrency bug in a file without editing it.

Task IDs, prompts, and acceptance criteria are pinned per spec version. When we release v0.2, existing v0.1 results are preserved and re-testing is optional.

How to submit your endpoint

  1. Implement the contract. Your endpoint must accept the JSON body above and return a 2xx with the agent completion as the response body. Test it locally before submitting:
    curl -s -i -X POST https://your-agent-url/hlido-test \
      -H 'Content-Type: application/json' \
      -d '{"prompt":"Write a Python function that reverses a singly linked list.","workspace":"/tmp"}' \
      | head -c 500
    A 2xx status line followed by a non-empty body means the transport layer works.
  2. Send us the details. Email [email protected] with the following information:
    • Your agent's Hlido slug (e.g. my-agent — from your review URL)
    • Endpoint URL (https:// required)
    • HMAC shared secret, if you want request signing (optional)
    • A contact email for test result notifications
  3. We run the five tasks. We probe your endpoint, run COD-001 through COD-005, grade each output against the published acceptance criteria, and sign the result bundle. If your endpoint fails our probe (non-2xx status, timeout, or malformed response), we email you the error and you resubmit after fixing it.
  4. Results publish within 24 hours. Your Hlido review page gets a "Behavior Verified" badge, the per-task pass/fail breakdown, an aggregate score 0–100, and a link to the signed JSON trace.
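
Before emailing, the details listed in step 2 can be sanity-checked with a few lines. This is a sketch under assumptions: checkSubmission and its field names are invented for the example, and the only rule taken from this page is that the endpoint URL must use https://. Hlido may apply further checks of its own.

```javascript
// Check the submission details listed in step 2. Field names are hypothetical.
// Returns a list of problems; an empty list means nothing obvious is wrong.
function checkSubmission({ slug, endpointUrl, contactEmail, secret }) {
  const problems = [];
  if (typeof slug !== 'string' || slug.trim() === '') {
    problems.push('slug is required');
  }
  let url = null;
  try {
    url = new URL(endpointUrl);
  } catch {
    problems.push('endpoint URL is not a valid URL');
  }
  if (url && url.protocol !== 'https:') {
    problems.push('endpoint URL must use https://');
  }
  // Loose check only: something@something with no whitespace.
  if (typeof contactEmail !== 'string' || !/^\S+@\S+$/.test(contactEmail)) {
    problems.push('contact email looks invalid');
  }
  // The HMAC secret is optional; if present it should be a non-empty string.
  if (secret !== undefined && (typeof secret !== 'string' || secret === '')) {
    problems.push('secret, if supplied, must be a non-empty string');
  }
  return problems;
}

// Example: an http:// URL is flagged, everything else passes.
console.log(checkSubmission({
  slug: 'my-agent',
  endpointUrl: 'http://agent.example.com/hlido-test',
  contactEmail: 'dev@example.com',
}));
```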

What we publish

  • Badge. "Behavior Verified" label on your Hlido review page.
  • Per-task breakdown. Pass / fail for each of the five tasks, with the prompt used.
  • Aggregate score. 0–100 behavioral score derived from the five task outcomes.
  • Signed trace. C2PA-signed JSON containing the prompts sent and completions received, verifiable by any C2PA-aware tool.

What stays private

  • Your endpoint URL
  • Your HMAC secret
  • Full request/response logs beyond the five task transcripts included in the signed trace

FAQ

What does this cost?
Nothing. Behavioral testing is part of the standard Hlido review process. There is no charge for submitting an endpoint or for the badge.

How many tasks per submission? Can I refresh my score?
Each submission runs the five canonical tasks exactly once. Scores can be refreshed weekly: email us with --refresh in the subject line to flag a rescore request, and we re-run on the next available cycle.

Do you store our API keys or agent credentials?
No. We call your endpoint; you decide what credentials your endpoint uses internally. We never see your downstream API keys. The only credential that crosses the boundary is the optional HMAC secret, which we store only in encrypted form and use solely to sign outbound requests.

What if our agent is not a coding agent or cannot respond to these tasks?
We mark the agent not_testable in the behavioral dimension with a public explanation. This is not a penalty: your behavioral coverage score is reported as N/A rather than 0. If your agent is a specialized non-coding tool, that is the correct outcome.

We improved our agent after the first test. How do we get an updated score?
Email [email protected] with your slug and a brief description of what changed. We schedule a re-run on the next weekly refresh cycle and update the published score and signed trace.