Benchmark

The AI coding agent code-quality benchmark.

The same tasks across every major agent, scored deterministically by aislop. Because the scoring has no LLM in the loop, the same output always produces the same score, so the leaderboard is reproducible and non-circular — unlike benchmarks where one model grades another.

Why this is the authority asset

An LLM-judged benchmark is not guaranteed to re-run to the same answer. A deterministic one is. That is the difference between a marketing chart and a measurement anyone can audit.

Methodology preview

There is no benchmark data here yet. The first public run publishes soon. Everything in the table below is an illustrative sample of the format, not measured results — every value is a placeholder. Do not read the rows as scores.

What it measures

Code quality, not whether it runs.

Plenty of benchmarks measure whether an agent produces working code. This one measures whether the code is clean enough to inherit: the patterns a human maintainer would flag in review, scored consistently across every agent.

AI-slop density

Named anti-patterns per task output: narrative comments, swallowed exceptions, unsafe casts, todo stubs, generic naming.
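As a rough illustration of how two of these anti-patterns could be counted without a model, here is a minimal sketch using only the Python standard library. It is not aislop's implementation: the function names, the sample input, and the definition of "swallowed" (an except handler whose body is only `pass` or `...`) are assumptions made for this example.

```python
import ast
import io
import tokenize


def _is_noop(stmt: ast.stmt) -> bool:
    """True for a `pass` statement or a bare `...` expression."""
    if isinstance(stmt, ast.Pass):
        return True
    return (
        isinstance(stmt, ast.Expr)
        and isinstance(stmt.value, ast.Constant)
        and stmt.value.value is Ellipsis
    )


def count_swallowed_exceptions(source: str) -> int:
    """Count except handlers whose body does nothing (assumed definition)."""
    tree = ast.parse(source)
    return sum(
        1
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler)
        and all(_is_noop(stmt) for stmt in node.body)
    )


def count_todo_comments(source: str) -> int:
    """Count comment tokens that mention TODO, ignoring string literals."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(
        1
        for tok in tokens
        if tok.type == tokenize.COMMENT and "TODO" in tok.string
    )


# Hypothetical agent output used only to exercise the counters.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        pass  # TODO: handle this
'''

findings = {
    "swallowed_exception": count_swallowed_exceptions(SAMPLE),
    "todo_stub": count_todo_comments(SAMPLE),
}
print(findings)
```

Both counters read the syntax tree or token stream directly, so the same input text always yields the same counts.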

Structural quality

Function size, complexity, unreachable code, and dead code in the generated solution.
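A sketch of what two structural checks might look like, again with the standard `ast` module. The helper names and the narrow definition of unreachable code (statements that follow a `return`, `raise`, `break`, or `continue` in the same block) are assumptions for illustration, not a description of the engines aislop runs.

```python
import ast

# Statement types after which nothing else in the same block can execute.
_TERMINATORS = (ast.Return, ast.Raise, ast.Break, ast.Continue)


def function_lengths(source: str) -> dict:
    """Map each function name to its length in source lines."""
    tree = ast.parse(source)
    return {
        node.name: node.end_lineno - node.lineno + 1
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def count_unreachable(source: str) -> int:
    """Count statements placed after a terminator in the same block."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if not isinstance(block, list):
                continue  # e.g. a lambda or conditional-expression body
            for index, stmt in enumerate(block):
                if isinstance(stmt, _TERMINATORS):
                    count += len(block) - index - 1
                    break
    return count


# Hypothetical agent output used only to exercise the checks.
SAMPLE = """\
def f(x):
    if x:
        return 1
        print("never")
    return 0
"""

print(function_lengths(SAMPLE), count_unreachable(SAMPLE))
```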

Security constructs

Risky primitives and unsafe patterns introduced while satisfying the prompt.
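For a sense of what flagging a risky primitive could involve, the sketch below reports calls to `eval` or `exec` and any call that passes `shell=True`. The rule set is a small hypothetical one chosen for this example; the rules aislop actually applies are not stated here.

```python
import ast

# Hypothetical rule set: built-ins that execute arbitrary code.
RISKY_BUILTINS = {"eval", "exec"}


def risky_calls(source: str) -> list:
    """Return sorted (line, rule) pairs for each risky call found."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        if isinstance(node.func, ast.Name) and node.func.id in RISKY_BUILTINS:
            findings.append((node.lineno, node.func.id))
        for kw in node.keywords:
            is_true = isinstance(kw.value, ast.Constant) and kw.value.value is True
            if kw.arg == "shell" and is_true:
                findings.append((node.lineno, "shell=True"))
    return sorted(findings)  # sorted so the report order is stable


# Hypothetical agent output; it is parsed, never executed.
SAMPLE = """\
import subprocess
value = eval(user_input)
subprocess.run(cmd, shell=True)
"""

print(risky_calls(SAMPLE))
```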

Composite score

A single 0–100 aislop score per agent, per task, computed by the same deterministic engines a team would run locally.
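One way a composite could be derived from per-category finding counts is a weighted penalty subtracted from 100 and clamped at zero. The weights and the formula below are placeholders invented for this sketch; the actual aislop scoring formula is not given on this page.

```python
# Hypothetical per-finding penalties; not aislop's real weights.
WEIGHTS = {"ai_slop": 5, "structural": 3, "security": 10}


def composite_score(findings: dict) -> int:
    """Return a 0-100 score: 100 minus weighted findings, floored at 0.

    Integer arithmetic only, so the result never depends on
    floating-point rounding.
    """
    penalty = sum(weight * findings.get(name, 0) for name, weight in WEIGHTS.items())
    return max(0, 100 - penalty)


print(composite_score({"ai_slop": 2, "structural": 1, "security": 1}))
```

With these placeholder weights the example prints 77 (100 minus 10, 3, and 10).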

Methodology

Pinned, clean-room, reproducible.

The method matters more than the headline number. Every variable that could change a score is pinned and published, so the run can be reproduced from scratch.

01 Pin the tasks. A fixed set of coding tasks, published in full, so every agent solves identical work.
02 Pin the agent versions. Each agent is recorded at an exact version and model, so a run is reproducible against the same software.
03 Clean-room runs. Each task is run in an isolated environment with no shared state between agents or tasks.
04 Deterministic aislop scoring. The same engines, the same config, the same version score every output — no LLM judge in the loop.
05 Publish the raw output. Generated code and raw JSON findings ship alongside the table, so anyone can re-score and check the numbers.
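The pinning steps above could be captured in a single run manifest whose fingerprint is published with the results. The sketch below is an assumption about how that might be done: the manifest fields, task IDs, and version strings are all placeholders, and the fingerprint is simply a SHA-256 hash of canonical JSON.

```python
import hashlib
import json


def fingerprint(manifest: dict) -> str:
    """Hash a manifest in canonical form (sorted keys, fixed separators)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest; every identifier and version is a placeholder.
MANIFEST = {
    "tasks": ["task-001", "task-002"],
    "agents": {"agent-a": {"version": "0.0.0-example", "model": "model-x"}},
    "scorer": {"name": "aislop", "version": "0.0.0-example", "config": "default"},
}

print(fingerprint(MANIFEST))
```

Because keys are sorted before hashing, two manifests with the same content produce the same fingerprint regardless of key order, while any change to a pinned version changes it.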

Agents covered

Six agents at launch.

The first run covers the agents teams ship with today. The set grows as new agents reach production use.

Claude Code, Cursor, Codex, Gemini, Aider, OpenCode

Leaderboard format

This is the shape, not the result.

Illustrative · sample format

Agent (illustrative)	Score	AI-slop	Structural	Security
Agent A	—	—	—	—
Agent B	—	—	—	—
Agent C	—	—	—	—

Rows show placeholder agents and dashes, not measured scores. Real numbers publish with the first run.

Why it is non-circular

No model grades another model.

When an LLM judges another LLM, the result depends on the judge’s sampling randomness, prompt, and version, so a re-run is not guaranteed to land on the same number. aislop scores by parsing the code with deterministic engines: no model in the loop, no opinion, no drift. Re-running the same output reproduces the same score exactly, which is what makes the leaderboard auditable rather than persuasive.
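The audit this enables can be sketched in a few lines: re-score each published output and list any task whose recomputed score differs from the published one. The scorer here, `toy_score`, is a stand-in invented for the example (it only penalizes lines containing TODO); a real audit would run the pinned aislop version instead.

```python
def toy_score(code: str) -> int:
    """Stand-in scorer: a pure function of the text, so it is repeatable."""
    todo_lines = sum(1 for line in code.splitlines() if "TODO" in line)
    return max(0, 100 - 10 * todo_lines)


def audit(published: dict, outputs: dict) -> list:
    """Return the task IDs whose recomputed score differs from the published one."""
    return sorted(
        task
        for task, code in outputs.items()
        if toy_score(code) != published.get(task)
    )


# Hypothetical raw outputs and published rows.
OUTPUTS = {"t1": "x = 1\n", "t2": "# TODO\n# TODO\n"}
PUBLISHED = {"t1": 100, "t2": 80}

print(audit(PUBLISHED, OUTPUTS))                 # no mismatches
print(audit({"t1": 100, "t2": 95}, OUTPUTS))     # t2 does not reproduce
```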

The contrast

LLM-judged — opaque, non-reproducible, grader-dependent.
aislop — deterministic, reproducible, published raw output.

Nominate a task or get notified.

The task set is public and open to nominations. Suggest a task that exposes how agents really write code, or follow the research program to be notified when the first run publishes.

Nominate a task → Follow the research