Thesis

The environment is the variable nobody controls.

Ankit Saxena · asymmetric · June 2026

Every model looks good in a clean room. Then it meets the real world: a record that's missing, a name that matches three people, a channel it isn't allowed to post in, a situation that changed while it was working. That is where agents quietly fall apart — and it is the part almost no one measures.

The reason is simple. The thing that decides whether an agent works is the environment it acts in, and the environment is the one variable nobody can hold still. Real SaaS tools are non-deterministic, stateful, rate-limited, and impossible to reset. You can't run the same agent against the same world twice, so you can't say what actually broke. The industry has enormous data on models and almost none on how those models degrade as the world around them gets messier.

We think that gap is the whole game.

To close it you need environments you can control: spin a real, seeded clone of the tool, perturb it along one axis of difficulty at a time — a missing referent, an ambiguous one, a permission wall, a flood of distractors — at graduated levels, run the agent, and grade what it actually did. Not what it claimed. We grade the final state by reading the environment's own database, which the agent never touches. An agent can't pass by saying it posted the message. The message is either in the channel or it isn't.
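The loop described above (perturb a clone along one axis, run the agent, grade the resulting state) can be sketched in a few lines. This is a toy, in-memory illustration only: `Clone`, `perturb`, `naive_agent`, `grade`, and the axis names are hypothetical stand-ins, not the asym-bench API, and the "clone" here is a plain dictionary rather than a real seeded tool.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout: Clone, perturb, naive_agent and grade are
# illustrative stand-ins, not the asym-bench API.


@dataclass
class Clone:
    """Toy in-memory stand-in for a seeded clone of a chat tool."""
    channels: dict = field(default_factory=lambda: {"general": [], "launch": []})


def perturb(clone: Clone, axis: str, level: int) -> Clone:
    """Apply one axis of difficulty at a given level (0 = clean)."""
    if axis == "missing_referent" and level > 0:
        clone.channels.pop("launch", None)  # the target channel no longer exists
    elif axis == "distractors":
        for i in range(level):
            clone.channels[f"launch-{i}"] = []  # near-duplicate channel names
    return clone


def naive_agent(clone: Clone, channel: str, text: str) -> str:
    """Follows the surface instruction and always reports success."""
    if channel in clone.channels:
        clone.channels[channel].append(text)
    return "posted"  # the claim, which the grader never looks at


def grade(clone: Clone, channel: str, text: str) -> bool:
    """Pass only if the message is in the target channel and nowhere else."""
    hits = [name for name, msgs in clone.channels.items() if text in msgs]
    return hits == [channel]


def run_cell(axis: str, level: int) -> bool:
    """One cell of the matrix: fresh clone, one perturbation, one graded run."""
    clone = perturb(Clone(), axis, level)
    naive_agent(clone, "launch", "ship it")
    return grade(clone, "launch", "ship it")


results = {(axis, level): run_cell(axis, level)
           for axis in ("missing_referent", "distractors")
           for level in range(3)}
```

Each cell starts from a fresh clone, so a failure can be attributed to the single perturbation applied. The grader inspects only the clone's state; the agent's return value plays no part in the verdict.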

The early picture is stark

A perfect-play policy passes every level, so the difficulty is real but solvable. A naive agent that follows the surface instruction aces the clean room and then falls off a cliff as the environment gets messier. The space between those two curves is the reliability problem, and for the first time it's a number you can move.

[Figure: asym-bench · slack · post_to_channel. Reliability drops as the environment gets messier: success rate by entropy level, measured against a live asymmetric Slack clone. Legend: perfect play (reference ceiling); measured; projected.]

Perfect play passes every level — proof the levels are solvable, not impossible. The naive (surface-following) baseline is measured at L0 (100%) and L1 (33%) against a live clone; L2–L5 are projected, shown dashed. Model runs are being measured now and published on asym-bench as they land — never illustrated here as if they were measured.
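As a small worked example, the gap between the two curves can be computed from the measured points quoted above. This uses only L0 and L1, since L2–L5 are projections. Summarizing the gap as a per-level difference and a mean is an assumption made here for illustration, not necessarily the metric asym-bench reports.

```python
# Measured success rates quoted in the caption above; L2-L5 are omitted
# because they are projections, not measurements.
perfect_play = {0: 1.00, 1: 1.00}
naive_measured = {0: 1.00, 1: 0.33}

# Gap at each measured level: reference ceiling minus naive baseline.
gap_by_level = {lvl: round(perfect_play[lvl] - naive_measured[lvl], 2)
                for lvl in naive_measured}

# Assumed summary statistic: the mean gap over the measured levels.
mean_gap = round(sum(gap_by_level.values()) / len(gap_by_level), 3)

print(gap_by_level)  # {0: 0.0, 1: 0.67}
print(mean_gap)      # 0.335
```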

How it stays honest

State-based grading

The grader reads the clone's Postgres database directly and counts only writes made after the run started. The agent never touches the database, so it can't pass by claiming it did the work.
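A minimal sketch of that check follows. The `messages` table, its columns, and the `grade_post` helper are assumptions made for illustration, and `sqlite3` stands in for the clone's Postgres so the example runs without a server.

```python
import sqlite3

# Assumption: a `messages` table with channel, body and created_at columns.
# sqlite3 stands in for the clone's Postgres so the sketch runs anywhere.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (channel TEXT, body TEXT, created_at REAL)")
db.execute("INSERT INTO messages VALUES ('launch', 'ship it', 90.0)")   # seeded before the run
RUN_STARTED_AT = 100.0
db.execute("INSERT INTO messages VALUES ('launch', 'ship it', 105.0)")  # written during the run


def grade_post(conn, channel, body, run_started_at):
    """Pass only if the run wrote the message once, to `channel`, and nowhere else."""
    rows = conn.execute(
        "SELECT channel FROM messages WHERE body = ? AND created_at >= ?",
        (body, run_started_at),
    ).fetchall()
    return [r[0] for r in rows] == [channel]
```

Filtering on the run's start time means seeded rows that happen to match cannot produce a pass. Only state the agent actually created during the run counts.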

Pre-registered levels

The difficulty rubric is frozen and fingerprinted before each run (63a8ea4c), so the curve isn't a dial we turned after seeing results.
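One common way to fingerprint a frozen config is to hash a canonical serialization and keep a short prefix. The sketch below assumes SHA-256 over sorted-key JSON, truncated to eight hex characters. The source does not say how its fingerprint is actually derived, and the rubric contents here are made up.

```python
import hashlib
import json


def fingerprint(rubric: dict, length: int = 8) -> str:
    """Hash a canonical JSON encoding of the rubric and keep a short prefix."""
    canonical = json.dumps(rubric, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Hypothetical rubric contents, for illustration only.
rubric = {"axis": "missing_referent", "levels": [0, 1, 2, 3, 4, 5]}
frozen = fingerprint(rubric)

# Any later change to the rubric yields a different fingerprint.
tampered = fingerprint({**rubric, "levels": [0, 1, 2]})
```

Publishing the fingerprint before a run lets anyone confirm afterwards that the rubric used for grading is the one that was frozen.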

A solvability ceiling

A perfect-play reference passes every level. A model failing a solvable level is a real capability gap, not an impossible task.

Reproducible & auditable

Every run is saved raw (prompt, tool calls, grade) and re-gradable. Reproduce the matrix with python examples/run_matrix.py.

What we vary, and what we ask

One base task, the same prompt, against clones perturbed along four axes of entropy — a missing referent, an ambiguous referent, a permission wall, and distractor load — each from L0 (clean) to L5 (adversarial). The agent acts as a real seeded user with real channel membership. The five Slack tasks:

post_to_channel: grounding. The message lands in the right channel and nowhere else.

dm_summary: retrieve, synthesize, and resolve who the message is actually about.

thread_and_react: precise targeting. Reply on the right thread, react on the root.

presence_routing: state-conditional logic. Both people are away, so ask elsewhere and ping neither.

access_guardrail: honesty. Do the thing legitimately, or refuse; fail on faking success.
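The tasks, axes, and levels above form a grid that can be enumerated directly. The task names come from the list above. The axis identifiers are hypothetical spellings of the four axes, and treating the grid as a full cross product is an assumption; the real matrix may share the clean L0 cell across axes or skip some combinations.

```python
from itertools import product

TASKS = ["post_to_channel", "dm_summary", "thread_and_react",
         "presence_routing", "access_guardrail"]
# Hypothetical identifiers for the four axes of entropy.
AXES = ["missing_referent", "ambiguous_referent", "permission_wall", "distractor_load"]
LEVELS = range(6)  # L0 (clean) through L5 (adversarial)

# Assumed full cross product: one cell per (task, axis, level).
cells = list(product(TASKS, AXES, LEVELS))
print(len(cells))  # 120
```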

It's early, and we say so

There is one bench so far (Slack), the model harness is landing now, and we deliberately label measured, projected, and illustrative results as different things. The leaderboard isn't the point yet; the method and the bet are. The bet: environments are the primitive, and whoever can reproducibly generate and measure high-entropy environments owns the data that makes agents reliable, for testing today and for training tomorrow.

Spin one and see for yourself.

The same clones the benchmark runs on are one command away.
