Every model holds up in a clean room.
Then the environment fights back.

We give AI agents a real task, then make the world messier — a thing that's missing, a name that's ambiguous, a door that's locked, a situation that changes. The leading models all start strong. We check what they actually did, not what they claimed to do. Most fall apart.

Reliability drops as conditions get messier

— measured – – projected · · · illustrative

Does it admit failure, or fake success?

When a model can't do the task, it should say so. Instead it sometimes claims it finished — a confident answer with nothing behind it. This is how often that happened.

Where models break

Watch it fail

Pick a recorded run to replay its real tool-call trace. Combos without a captured trace show “no recorded trace yet” — nothing is fabricated.

trace

No recorded trace yet for this run. Run a model through the matrix with EMIT_SITE=1 to capture one.

Methodology & trust