Benchmark Results

29 models. 100/100 quality. Reproducible.

Every model below produces correct, tested TypeScript on scoped coding tasks. The execution quality engine is the equalizer: cheap models match expensive ones because it is the runtime, not the model, that prevents the failure modes.

Methodology

How we test

The quality benchmark runs 7 scoped TypeScript tasks against each model. Every task is deterministic and reproducible.

Task 1: Create a README with specific content structure and sections.
Task 2: Create a typed add function with a summary documentation file.
Task 3: Create utils with 3 functions plus tests, then run the tests to verify.
Task 4: Fix a divide-by-zero bug, update the tests, then run to confirm the fix.
Task 5: Refactor: extract a function to its own file, update imports, verify the build.
Task 6: Fix a wrong test assertion, then run tests to verify correctness.
Task 7: Create a calculator module importing math functions, with full test coverage.

Scoring

Each task is checked for correctness (the expected files exist with the expected content), a clean tsc --strict run, passing tests, and no regressions. A score of 100 means all 7 tasks pass.
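
To make the scoring concrete, here is a minimal sketch of how a run could be tallied; the type and function names are illustrative, not the engine's actual API.

// Illustrative scoring sketch; field names are hypothetical, not the engine's real types.
interface TaskResult {
  filesCorrect: boolean;    // expected files exist with the expected content
  typecheckPassed: boolean; // tsc --strict exits 0
  testsPassed: boolean;     // the test runner exits 0
  noRegressions: boolean;   // previously passing tests still pass
}

// A task counts only if every check passes; 7 of 7 passing tasks scores 100.
function scoreRun(results: TaskResult[]): number {
  const passed = results.filter(
    (r) => r.filesCorrect && r.typecheckPassed && r.testsPassed && r.noRegressions
  ).length;
  return Math.round((passed / results.length) * 100);
}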

These are scoped L3 execution tasks — the kind that run in parallel after the planner breaks work down. They test the engine's ability to make any model produce correct, verified code.

L3 Execution

29 models at 100/100

Every model below scores perfect on the quality benchmark: correct TypeScript, all tests passing, typecheck clean, no regressions. The engine makes $0.0003/task models produce the same quality as $0.03/task models.

Model                   Provider      Input $/1M  Output $/1M  Speed  ~Cost/Task
GPT-OSS 20B             Groq          $0.08       $0.30        12s    $0.0003
Gemini 2.5 Flash Lite   Google        $0.10       $0.40        11s    $0.0004
DeepSeek Chat           DeepSeek      $0.28       $0.42        26s    $0.001
Grok 4-1 Fast           xAI           $0.20       $0.50        75s    $0.001
MiniMax M2.1            OpenRouter    $0.27       $0.95        12s    $0.001
Gemini 2.5 Flash        Google        $0.30       $2.50        11s    $0.002
Kimi K2.5               OpenRouter    $0.38       $1.72        16s    $0.002
Groq Llama 3.3 70B      Groq          $0.59       $0.79        12s    $0.002
Cerebras Qwen3-235B     Cerebras      $0.60       $1.20        2s     $0.002
GLM-5                   OpenCode/Zen  $0.72       $2.30        15s    $0.003
Claude Haiku 4.5        Anthropic     $1.00       $5.00        25s    $0.007
GPT-5.4                 OpenAI        $2.50       $15.00       14s    $0.02
Claude Sonnet 4.6       Anthropic     $3.00       $15.00       69s    $0.02
Claude Opus 4.6         Anthropic     $5.00       $25.00       29s    $0.03

+ 13 more models at 100/100: GPT-5.4 Mini/Nano, GPT-5/5.2, GLM-4.6/4.7, MiniMax M2.5, Kimi K2, Grok 3 Mini, Grok Code Fast, Big Pickle, DeepSeek Reasoner, Qwen3-32B

L2 Planner

14 models at 90/100

Task decomposition quality — correct dependency chains, diverse persona assignments, acceptance criteria, verification checks. Cheap models match expensive ones because the prompt engineering does the work.

Model                    Score  Units  ~$/Plan
Claude Sonnet 4.6        90     9      $0.02
GPT-5.4                  90     12     $0.02
GPT-OSS 20B (Groq)       90     15     $0.003
Gemini 2.5 Flash Lite    90     ~10    $0.004
DeepSeek Reasoner        90     ~10    $0.004
Grok 3 Mini / Qwen3-32B  90     ~10    $0.005
GLM-5 (Zen)              90     16     $0.02
Kimi K2.5 (Zen)          90     ~10    $0.015
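
For illustration, one decomposed unit with the properties scored above (dependency chain, persona, acceptance criteria, verification checks) might look like the sketch below. The shape and persona names are hypothetical, not the planner's actual schema.

// Hypothetical shape of one decomposed plan unit; not the planner's actual schema.
interface PlanUnit {
  id: string;
  persona: "fixer" | "builder" | "tester" | "writer"; // assumed persona names
  dependsOn: string[];          // forms the dependency chain
  acceptanceCriteria: string[]; // what "done" means for this unit
  verification: string[];       // checks that prove the criteria are met
}

const examplePlan: PlanUnit[] = [
  {
    id: "u1",
    persona: "builder",
    dependsOn: [],
    acceptanceCriteria: ["src/math.ts exports add and divide"],
    verification: ["npx tsc --strict --noEmit"],
  },
  {
    id: "u2",
    persona: "tester",
    dependsOn: ["u1"],
    acceptanceCriteria: ["tests cover the divide-by-zero case"],
    verification: ["npm test"],
  },
];
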
Under the Hood

Why cheap models match expensive ones

The execution quality engine wraps every task in 8 runtime modules that prevent common failure modes. These are not prompting tricks — they are structural constraints that force correct behavior regardless of model intelligence.

Failure Memory

When a tool call fails, the engine remembers. Same command with same params gets blocked after one failure for shell, two for others. Models are forced to try different approaches instead of looping.
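
A minimal sketch of that idea, with hypothetical names (the thresholds match the ones described above):

// Sketch of a failure memory that blocks repeated failing calls. Names are hypothetical.
const failureCounts = new Map<string, number>();

// Shell commands are blocked after one failure; other tools after two.
function blockThreshold(tool: string): number {
  return tool === "shell" ? 1 : 2;
}

function keyFor(tool: string, params: unknown): string {
  return `${tool}:${JSON.stringify(params)}`;
}

function recordFailure(tool: string, params: unknown): void {
  const key = keyFor(tool, params);
  failureCounts.set(key, (failureCounts.get(key) ?? 0) + 1);
}

function isBlocked(tool: string, params: unknown): boolean {
  return (failureCounts.get(keyFor(tool, params)) ?? 0) >= blockThreshold(tool);
}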

Verification Gate

The engine extracts verification goals from the task and tracks them as first-class state. When the model says "done" with unproven goals, it gets forced back with up to 3 extra turns demanding proof.
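
Sketched in a few lines, with hypothetical names (the real gate lives inside the engine's turn loop):

// Sketch: do not accept "done" while verification goals remain unproven.
interface Goal {
  description: string;
  proven: boolean;
}

const MAX_EXTRA_TURNS = 3; // up to 3 extra turns demanding proof

function shouldForceExtraTurn(goals: Goal[], extraTurnsUsed: number): boolean {
  const unproven = goals.filter((g) => !g.proven);
  return unproven.length > 0 && extraTurnsUsed < MAX_EXTRA_TURNS;
}

// Example: the model claims completion, but "all tests pass" was never demonstrated.
const goals: Goal[] = [
  { description: "tsc --strict passes", proven: true },
  { description: "all tests pass", proven: false },
];
console.log(shouldForceExtraTurn(goals, 0)); // true: send the model back for proof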

Action Ranking

Each turn scores 7 action types based on execution state. Edited without testing? Test scores highest. Three reads in a row? Penalized. The model sees ranked priorities, not just a list of tools.
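
A toy version of that ranking, with invented action names and weights (the real engine scores 7 action types against richer state):

// Toy action ranking; weights and state fields are invented for illustration.
type Action = "read" | "edit" | "test" | "typecheck";

interface ExecState {
  editedSinceLastTest: boolean;
  consecutiveReads: number;
}

function rankActions(state: ExecState): Array<{ action: Action; score: number }> {
  const scores: Record<Action, number> = { read: 1, edit: 1, test: 1, typecheck: 1 };
  if (state.editedSinceLastTest) scores.test += 3;   // untested edits: testing wins
  if (state.consecutiveReads >= 3) scores.read -= 2; // read loops get penalized
  return (Object.keys(scores) as Action[])
    .map((action) => ({ action, score: scores[action] }))
    .sort((a, b) => b.score - a.score);
}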

Task Mode Strategies

The engine auto-detects 5 modes — bugfix, feature, refactor, test repair, analysis — and applies mode-specific strategies. Bugfixes reproduce first. Refactors typecheck before declaring done.
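
One way to picture the detection step; the keyword heuristics and opening moves below are invented for illustration, not the engine's actual rules.

// Illustrative mode detection; the engine's real heuristics are not shown here.
type Mode = "bugfix" | "feature" | "refactor" | "test-repair" | "analysis";

function detectMode(task: string): Mode {
  const t = task.toLowerCase();
  if (/refactor|extract|rename/.test(t)) return "refactor";
  if (/wrong test|failing test|assert/.test(t)) return "test-repair";
  if (/fix|bug|crash/.test(t)) return "bugfix";
  if (/analy|explain|investigate/.test(t)) return "analysis";
  return "feature";
}

// Mode-specific opening moves, e.g. bugfixes reproduce the failure first.
const firstStep: Record<Mode, string> = {
  bugfix: "reproduce the failure before editing",
  feature: "locate where the new code belongs",
  refactor: "typecheck to establish a baseline before moving code",
  "test-repair": "run the failing test and read its output",
  analysis: "read the relevant files; no edits",
};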

Patch Critic

Every file edit is evaluated in real time: was the file read before writing? Is the same file being churned? Are edits in scope? The critic injects quality guidance with zero extra LLM calls.
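
A stripped-down version of those checks, with hypothetical types; note that this is plain bookkeeping, no extra LLM calls.

// Sketch of edit-time critique: pure bookkeeping, zero extra LLM calls.
interface EditEvent {
  file: string;
  inScope: boolean; // whether the file belongs to the task's scope
}

interface CritiqueState {
  filesRead: Set<string>;
  editCounts: Map<string, number>;
}

function critique(edit: EditEvent, state: CritiqueState): string[] {
  const warnings: string[] = [];
  if (!state.filesRead.has(edit.file)) {
    warnings.push(`editing ${edit.file} without reading it first`);
  }
  const edits = (state.editCounts.get(edit.file) ?? 0) + 1;
  state.editCounts.set(edit.file, edits);
  if (edits >= 3) {
    warnings.push(`${edit.file} has been edited ${edits} times (churn)`);
  }
  if (!edit.inScope) {
    warnings.push(`${edit.file} is outside the task scope`);
  }
  return warnings; // injected into the next turn as quality guidance
}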

Adaptive Weights

The action ranking system learns from past runs. High-scoring trajectories boost their action patterns. Low-scoring runs penalize theirs. The engine calibrates to your codebase over time.
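
In spirit, this is a running weight update per action pattern; the learning rate and score threshold below are invented for illustration.

// Sketch: nudge pattern weights toward what high-scoring runs actually did.
const patternWeights = new Map<string, number>(); // pattern -> weight

function updateWeights(patternsUsed: string[], runScore: number): void {
  const LEARNING_RATE = 0.1;                 // invented constant
  const direction = runScore >= 90 ? 1 : -1; // boost good runs, penalize bad ones
  for (const pattern of patternsUsed) {
    const current = patternWeights.get(pattern) ?? 1.0;
    patternWeights.set(pattern, current + direction * LEARNING_RATE);
  }
}

updateWeights(["edit-then-test", "typecheck-before-done"], 100); // boosted
updateWeights(["repeat-failing-shell"], 40);                     // penalized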

Structured History

Instead of flattening tool results into text, the engine preserves rich state: files read vs written, active goals, failures and reasons. This survives context compaction intact.
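
Roughly, the difference is between a flattened transcript and a preserved object like the hypothetical one below.

// Hypothetical structured history entry; it survives context compaction as data.
interface StructuredHistory {
  filesRead: string[];
  filesWritten: string[];
  activeGoals: Array<{ description: string; proven: boolean }>;
  failures: Array<{ tool: string; reason: string }>;
}

const history: StructuredHistory = {
  filesRead: ["src/math.ts", "src/math.test.ts"],
  filesWritten: ["src/math.ts"],
  activeGoals: [{ description: "all tests pass", proven: false }],
  failures: [{ tool: "shell", reason: "npm test exited 1: divide(1, 0) threw" }],
};
// Compaction can drop verbose tool output while keeping this object intact.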

Smart Delegation

When work splits into parallel units, the engine scores each specialist against the task: language match, complexity, historical success rate. Bug fixes route to the fixer. Docs route to the writer.
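
A toy scoring function for routing one unit to a specialist; the fields and weights are invented, not the engine's actual formula.

// Toy specialist scoring; field names and weights are invented for illustration.
interface Specialist {
  name: string;
  languages: string[];
  maxComplexity: number; // 1 to 5
  successRate: number;   // historical success rate, 0 to 1
}

interface WorkUnit {
  language: string;
  complexity: number;
}

function scoreSpecialist(s: Specialist, unit: WorkUnit): number {
  let score = s.successRate;                            // start from history
  if (s.languages.includes(unit.language)) score += 1;  // language match
  if (unit.complexity <= s.maxComplexity) score += 0.5; // can handle the complexity
  return score;
}

// Assumes a non-empty candidate list; the highest-scoring specialist wins.
function pickSpecialist(candidates: Specialist[], unit: WorkUnit): Specialist {
  return candidates.reduce((best, s) =>
    scoreSpecialist(s, unit) > scoreSpecialist(best, unit) ? s : best
  );
}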

Reproducibility

Run the benchmarks yourself

# Clone and build
git clone https://github.com/crewswarm/crewswarm
cd crewswarm/crew-cli
npm install && npm run build

# Run quality benchmark against a specific model
CREW_PROVIDER=groq node scripts/benchmark-quality.mjs --model llama-3.3-70b

# Run L2 planner benchmark
node scripts/benchmark-l2-planner.mjs

# Run full preset sweep
node scripts/benchmark-presets.mjs

Full benchmark code is in crew-cli/benchmarks/ and crew-cli/scripts/.

FAQ

Common questions

Why only TypeScript tasks?

TypeScript's strict type checking is a harder bar than plain JavaScript. tsc --strict catches errors that would silently pass in other languages: implicit any, missing null and undefined checks, unsafe function and property assignments. If a model produces correct strict TypeScript, it can produce correct code in less strict languages too.
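
For example, both snippets below compile without strict mode but are rejected by tsc --strict (a small illustration, not one of the benchmark tasks):

// Rejected by tsc --strict, accepted by loose configs or plain JavaScript.

// noImplicitAny: parameter "n" has no annotation and no inferable type.
function double(n) {
  return n * 2;
}

// strictNullChecks: find() can return undefined, so .length is unsafe here.
const names = ["ada", "grace"];
const hit = names.find((x) => x.startsWith("z"));
console.log(hit.length); // error: 'hit' is possibly 'undefined'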

Why scoped tasks instead of full features?

These test L3 execution quality in isolation. Full features involve L2 planning + L3 execution combined — the L2 planner benchmark tests that decomposition step separately. Scoped tasks give a clean signal on execution quality without confounding it with planning quality.

Can I add my own model?

Yes. Set the API key for your provider, then run the benchmark script with the --model flag. Any OpenAI-compatible endpoint works. If the model supports tool calling and can produce TypeScript, it can run the benchmark.