Every model holds up in a clean room.
Then the environment fights back.
We give AI agents a real task, then make the world messier — a thing that's missing, a name that's ambiguous, a door that's locked, a situation that changes. The leading models all start strong. We check what they actually did, not what they claimed to do. Most fall apart.
Reliability drops as conditions get messier
[Chart legend: solid line = measured; dashed line = projected; dotted line = illustrative]
Does it admit failure, or fake success?
When a model can't do the task, it should say so. Instead it sometimes claims it finished — a confident answer with nothing behind it. This is how often that happened.
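One way to compute that rate, as a minimal sketch. It assumes each saved run record carries two hypothetical boolean fields, `claimed_done` (what the agent reported) and `state_ok` (what the grader found in the environment). Neither name comes from the actual harness. Using failed runs as the denominator is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    claimed_done: bool  # hypothetical field: the agent said it finished
    state_ok: bool      # hypothetical field: the grader confirmed the end state


def fake_success_rate(runs):
    """Share of failed runs in which the agent still claimed success.

    Returns None when no run failed, since the rate is undefined there.
    """
    failed = [r for r in runs if not r.state_ok]
    if not failed:
        return None
    return sum(r.claimed_done for r in failed) / len(failed)


# Four runs: one real success, one faked success, two admitted failures.
runs = [Run(True, True), Run(True, False), Run(False, False), Run(False, False)]
rate = fake_success_rate(runs)
```

Runs that really succeeded are left out of the denominator, so a model that rarely fails does not look more honest just for being capable.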
Where models break
Watch it fail
Pick a recorded run to replay its real tool-call trace. Combos without a captured trace show “no recorded trace yet” — nothing is fabricated.
No recorded trace yet for this run. Run a model through the matrix with EMIT_SITE=1 to capture one.
Methodology & trust
- Environment is the variable. One base task, the same prompt, against clones perturbed along one entropy axis at a graduated level (L0 clean → L5 adversarial).
- State-based grading. Graders read the clone's Postgres (scoped to writes after the run started); the agent never touches the database. It can't pass by claiming it did the work.
- Repeated trials, not anecdotes. Each point is several independent runs; the shaded band is a 95% confidence interval and n= shows how many runs back it. Every single run is saved as a raw record (prompt, tool calls, grade), so any number here is auditable and re-gradable.
- Pre-registered levels. The level rubric is fixed before runs and fingerprinted (—) so the curve isn't a dial we tuned after seeing results.
- Honest about what's measured. The naive baseline is a real run; model curves here are illustrative until a model harness lands. Voided trials (infra failures) are disclosed, never scored as model failures.
- Reproducible.
—
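
The repeated-trials, voided-trials and pre-registration points above can be sketched as follows. This is an illustration under assumptions, not the harness's code. The Wilson score interval is one common choice for a 95% interval on a pass rate; the page does not say which method is used. SHA-256 over canonical JSON is one plausible way to fingerprint a rubric. The names `Trial`, `summarize` and `rubric_fingerprint` are hypothetical.

```python
import hashlib
import json
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    passed: bool
    voided: bool = False  # hypothetical flag: infra failure, not a model failure


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one scored trial")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(trials):
    """Pass rate and interval over scored trials; voided ones are counted, not scored."""
    scored = [t for t in trials if not t.voided]
    n = len(scored)
    k = sum(t.passed for t in scored)
    return {
        "n": n,
        "voided": len(trials) - n,
        "pass_rate": k / n,
        "ci95": wilson_interval(k, n),
    }


def rubric_fingerprint(rubric):
    """SHA-256 of the rubric serialized as key-sorted, whitespace-free JSON."""
    canonical = json.dumps(rubric, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Seven passes, three failures, one voided trial that is disclosed but not scored.
trials = [Trial(True)] * 7 + [Trial(False)] * 3 + [Trial(False, voided=True)]
summary = summarize(trials)
fingerprint = rubric_fingerprint({"L0": "clean", "L5": "adversarial"})
```

Sorting keys before hashing means two copies of the same rubric give the same fingerprint regardless of key order, while any change to a level definition gives a different one.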