The environment is the variable nobody controls.
Every model looks good in a clean room. Then it meets the real world: a record that's missing, a name that matches three people, a channel it isn't allowed to post in, a situation that changed while it was working. That is where agents quietly fall apart — and it is the part almost no one measures.
The reason is simple. The thing that decides whether an agent works is the environment it acts in, and the environment is the one variable nobody can hold still. Real SaaS tools are non-deterministic, stateful, rate-limited, and impossible to reset. You can't run the same agent against the same world twice, so you can't say what actually broke. The industry has enormous data on models and almost none on how those models degrade as the world around them gets messier.
We think that gap is the whole game.
To close it you need environments you can control: spin a real, seeded clone of the tool, perturb it along one axis of difficulty at a time — a missing referent, an ambiguous one, a permission wall, a flood of distractors — at graduated levels, run the agent, and grade what it actually did. Not what it claimed. We grade the final state by reading the environment's own database, which the agent never touches. An agent can't pass by saying it posted the message. The message is either in the channel or it isn't.
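One way to picture "one axis at a time, at graduated levels" is as a grid of cells, each raising a single axis while the others stay clean. The sketch below is illustrative only: the axis identifiers, the `Cell` class, and the 0 to 5 level range are assumptions made for the example, not the benchmark's actual code.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical identifiers for the four perturbations named in the text.
AXES = ("missing_referent", "ambiguous_referent", "permission_wall", "distractor_load")
# Assumed level range: 0 is the clean room, 5 is the hardest setting.
LEVELS = range(6)


@dataclass(frozen=True)
class Cell:
    """One environment configuration: a single axis raised to a single level."""
    axis: str
    level: int

    def perturbation(self) -> dict[str, int]:
        # Only this cell's axis is raised; every other axis stays at 0 (clean).
        return {a: (self.level if a == self.axis else 0) for a in AXES}


def build_matrix() -> list[Cell]:
    """Enumerate every (axis, level) pair once."""
    return [Cell(axis, level) for axis, level in product(AXES, LEVELS)]


matrix = build_matrix()
example = Cell("permission_wall", 3).perturbation()
```

Because each cell differs from the clean room along exactly one axis, a drop in success rate can be attributed to that axis rather than to an interaction between several.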
The early picture is stark
A perfect-play policy passes every level, so the difficulty is real but solvable. A naive agent that follows the surface instruction aces the clean room and then falls off a cliff as the environment gets messier. The space between those two curves is the reliability problem, and for the first time it's a number you can move.
Reliability drops as the environment gets messier
Success rate by entropy level, measured against a live asymmetric Slack clone.
Perfect play passes every level, which shows the levels are solvable rather than impossible. The naive (surface-following) baseline is measured at L0 (100%) and L1 (33%) against a live clone; L2–L5 are projected and shown dashed. Model runs are being measured now and will be published on asym-bench as they land; they are never illustrated here as if they were measured.
How it stays honest
State-based grading
The grader reads the clone's Postgres database directly and counts only writes made after the run started. The agent never touches the database, so it can't pass by claiming it did the work.
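A minimal sketch of the idea, under stated assumptions: it uses an in-memory SQLite database as a stand-in for the clone's Postgres so the example runs anywhere, and the `messages` table, its columns, and the `grade_message_posted` function are hypothetical names invented for illustration.

```python
import sqlite3


def grade_message_posted(conn, channel, text, run_started_at):
    """Pass only if the expected message is in the environment's own state
    and was written after the run started. The agent's report is never read."""
    row = conn.execute(
        "SELECT COUNT(*) FROM messages "
        "WHERE channel = ? AND text = ? AND created_at > ?",
        (channel, text, run_started_at),
    ).fetchone()
    return row[0] > 0


# SQLite stands in for Postgres here; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (channel TEXT, text TEXT, created_at REAL)")

# A seeded row that predates the run must not count as the agent's work.
conn.execute("INSERT INTO messages VALUES ('general', 'standup at 10', 50.0)")
run_started_at = 100.0

agent_claim = "I posted the message."  # never consulted by the grader
before = grade_message_posted(conn, "general", "standup at 10", run_started_at)

# Simulate the agent actually posting during the run.
conn.execute("INSERT INTO messages VALUES ('general', 'standup at 10', 120.0)")
after = grade_message_posted(conn, "general", "standup at 10", run_started_at)
```

Here `before` is `False` even though the agent claims success and an identical seeded message already exists, and `after` becomes `True` only once a row written during the run is present.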
Pre-registered levels
The difficulty rubric is frozen and fingerprinted before each run (63a8ea4c), so the curve isn't a dial we turned after seeing results.
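How the fingerprint is computed is not specified here. One plausible approach, sketched below as an assumption, is to hash a canonical serialization of the rubric and refuse to run if the hash no longer matches the frozen value. The rubric fields, the choice of SHA-256, and the 8-character truncation are all hypothetical.

```python
import hashlib
import json


def fingerprint(rubric: dict, length: int = 8) -> str:
    """Short hex digest of a canonical JSON form of the rubric."""
    # sort_keys and fixed separators make the output independent of key order.
    canonical = json.dumps(rubric, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def assert_frozen(rubric: dict, expected: str) -> None:
    """Raise if the rubric has changed since it was fingerprinted."""
    actual = fingerprint(rubric)
    if actual != expected:
        raise ValueError(f"rubric changed: expected {expected}, got {actual}")


# Hypothetical rubric content, for illustration only.
rubric = {"axis": "distractor_load", "levels": {"L0": 0, "L1": 5, "L2": 20}}
frozen = fingerprint(rubric)  # recorded before any run

assert_frozen(rubric, frozen)  # passes: nothing has changed

tampered = {"axis": "distractor_load", "levels": {"L0": 0, "L1": 2, "L2": 20}}
try:
    assert_frozen(tampered, frozen)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

Recording the fingerprint before results exist is what makes it useful: any later edit to the rubric produces a different value, so the published curve can be tied to the rubric that was actually frozen.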
A solvability ceiling
A perfect-play reference passes every level. A model failing a solvable level is a real capability gap, not an impossible task.
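The logic can be sketched as a small classification step: a model failure only counts as a capability gap when the reference passed the same level. The `classify` function, the verdict labels, and the outcomes below are hypothetical and purely illustrative; none of them are measured results.

```python
def classify(reference_passed: dict[str, bool], model_passed: dict[str, bool]) -> dict[str, str]:
    """Label each level using the perfect-play reference as the ceiling."""
    verdicts = {}
    for level, ref_ok in reference_passed.items():
        if not ref_ok:
            # The reference itself failed, so the level says nothing about the model.
            verdicts[level] = "unsolvable_level"
        elif model_passed.get(level, False):
            verdicts[level] = "pass"
        else:
            # Solvable level, model failed: a real capability gap.
            verdicts[level] = "capability_gap"
    return verdicts


# Illustrative outcomes only, not measured data.
reference = {"L0": True, "L1": True, "L2": True}
model = {"L0": True, "L1": False, "L2": False}
verdicts = classify(reference, model)

# If the reference ever failed a level, the level is flagged instead of the model.
broken = classify({"L0": False}, {"L0": False})
```

With these inputs, `verdicts` marks L0 as a pass and L1 and L2 as capability gaps, while `broken` flags L0 as an unsolvable level rather than blaming the model.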
Reproducible & auditable
Every run is saved raw (prompt, tool calls, grade) and re-gradable. Reproduce the matrix with python examples/run_matrix.py.
What we vary, and what we ask
One base task, the same prompt, against clones perturbed along four axes of entropy — a missing referent, an ambiguous referent, a permission wall, and distractor load — each from L0 (clean) to L5 (adversarial). The agent acts as a real seeded user with real channel membership. The five Slack tasks:
It's early, and we say so
There is one bench so far (Slack), the model harness is landing now, and we deliberately publish measured, projected, and illustrative results as different things. The leaderboard isn't the point yet. The method is, and so is the bet: environments are the primitive, and whoever can reproducibly generate and measure high-entropy environments owns the data that makes agents reliable, for testing today and for training tomorrow.
Spin one and see for yourself.
The same clones the benchmark runs on are one command away.