Every model below produces correct, tested TypeScript on scoped coding tasks. The execution quality engine is the equalizer: cheap models match expensive ones because it is the runtime, not the model, that prevents the failure modes.
The quality benchmark runs 7 scoped TypeScript tasks against each model. Every task is deterministic and reproducible.
1. Create a README with specific content structure and sections.
2. Create a typed `add` function with a summary documentation file.
3. Create utils with 3 functions plus tests, then run the tests to verify.
4. Fix a divide-by-zero bug, update the tests, then run to confirm the fix.
5. Refactor: extract a function to its own file, update imports, verify the build.
6. Fix a wrong test assertion, then run tests to verify correctness.
7. Create a calculator module importing math functions, with full test coverage.
Each task is checked for correctness (file exists, content correct), `tsc --strict` passes, all tests pass, and there are no regressions. A score of 100 means all 7 tasks pass.
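A per-task scorer under those criteria can be sketched as follows. The type and function names are hypothetical, not the benchmark's real API; only the four checks and the 0–100 scale come from the docs.

```typescript
// Result of one scoped task run (field names are assumptions).
interface TaskResult {
  filesCorrect: boolean;    // expected files exist with correct content
  typecheckPassed: boolean; // `tsc --strict` exited 0
  testsPassed: boolean;     // full test suite green
  regressions: number;      // previously passing tests that now fail
}

// A task passes only if every check passes and nothing regressed.
const taskPasses = (r: TaskResult): boolean =>
  r.filesCorrect && r.typecheckPassed && r.testsPassed && r.regressions === 0;

// Benchmark score: percentage of tasks passing; 100 = all 7 pass.
const score = (results: TaskResult[]): number =>
  Math.round((results.filter(taskPasses).length / results.length) * 100);
```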
These are scoped L3 execution tasks — the kind that run in parallel after the planner breaks work down. They test the engine's ability to make any model produce correct, verified code.
Every model below scores perfect on the quality benchmark: correct TypeScript, all tests passing, typecheck clean, no regressions. The engine makes $0.0003/task models produce the same quality as $0.03/task models.
| Model | Provider | Input $/1M | Output $/1M | Speed | ~Cost/Task |
|---|---|---|---|---|---|
| GPT-OSS 20B | Groq | $0.08 | $0.30 | 12s | $0.0003 |
| Gemini 2.5 Flash Lite | Google | $0.10 | $0.40 | 11s | $0.0004 |
| DeepSeek Chat | DeepSeek | $0.28 | $0.42 | 26s | $0.001 |
| Grok 4-1 Fast | xAI | $0.20 | $0.50 | 75s | $0.001 |
| MiniMax M2.1 | OpenRouter | $0.27 | $0.95 | 12s | $0.001 |
| Gemini 2.5 Flash | Google | $0.30 | $2.50 | 11s | $0.002 |
| Kimi K2.5 | OpenRouter | $0.38 | $1.72 | 16s | $0.002 |
| Groq Llama 3.3 70B | Groq | $0.59 | $0.79 | 12s | $0.002 |
| Cerebras Qwen3-235B | Cerebras | $0.60 | $1.20 | 2s | $0.002 |
| GLM-5 | OpenCode/Zen | $0.72 | $2.30 | 15s | $0.003 |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 25s | $0.007 |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 14s | $0.02 |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 69s | $0.02 |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | 29s | $0.03 |
+ 13 more models at 100/100: GPT-5.4 Mini/Nano, GPT-5/5.2, GLM-4.6/4.7, MiniMax M2.5, Kimi K2, Grok 3 Mini, Grok Code Fast, Big Pickle, DeepSeek Reasoner, Qwen3-32B
The planner benchmark measures task decomposition quality: correct dependency chains, diverse persona assignments, acceptance criteria, and verification checks. Cheap models match expensive ones because the prompt engineering does the work.
| Model | Score | Units | ~$/Plan |
|---|---|---|---|
| Claude Sonnet 4.6 | 90 | 9 | $0.02 |
| GPT-5.4 | 90 | 12 | $0.02 |
| GPT-OSS 20B (Groq) | 90 | 15 | $0.003 |
| Gemini 2.5 Flash Lite | 90 | ~10 | $0.004 |
| DeepSeek Reasoner | 90 | ~10 | $0.004 |
| Grok 3 Mini / Qwen3-32B | 90 | ~10 | $0.005 |
| GLM-5 (Zen) | 90 | 16 | $0.02 |
| Kimi K2.5 (Zen) | 90 | ~10 | $0.015 |
The execution quality engine wraps every task in 8 runtime modules that prevent common failure modes. These are not prompting tricks — they are structural constraints that force correct behavior regardless of model intelligence.
When a tool call fails, the engine remembers. A repeat of the same command with the same params is blocked after one failure for shell commands, and after two for other tools. Models are forced to try a different approach instead of looping.
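A minimal sketch of that failure memory, assuming calls are keyed by tool name plus serialized params (the class and method names are illustrative, not the engine's real API):

```typescript
type ToolKind = "shell" | "other";

class FailureMemory {
  private counts = new Map<string, number>();

  private key(tool: string, params: unknown): string {
    return `${tool}:${JSON.stringify(params)}`;
  }

  // Record one failed call for this exact tool + params combination.
  recordFailure(tool: string, params: unknown): void {
    const k = this.key(tool, params);
    this.counts.set(k, (this.counts.get(k) ?? 0) + 1);
  }

  // Shell commands are blocked after 1 failure; other tools after 2.
  isBlocked(tool: string, params: unknown, kind: ToolKind = "other"): boolean {
    const limit = kind === "shell" ? 1 : 2;
    return (this.counts.get(this.key(tool, params)) ?? 0) >= limit;
  }
}
```

Because the key includes the params, the model can still run the same tool with different arguments; only exact repeats of a failing call are cut off.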
The engine extracts verification goals from the task and tracks them as first-class state. When the model says "done" with unproven goals, it gets forced back with up to 3 extra turns demanding proof.
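The done-gating logic might look like the sketch below, assuming goals are tracked as proven/unproven flags and a budget of 3 extra turns (all names are hypothetical):

```typescript
interface Goal {
  description: string;
  proven: boolean; // flipped true only when verification evidence is seen
}

class GoalTracker {
  private extraTurns = 0;
  constructor(private goals: Goal[], private maxExtraTurns = 3) {}

  markProven(description: string): void {
    const g = this.goals.find((g) => g.description === description);
    if (g) g.proven = true;
  }

  // Called when the model declares completion. Returns false (forcing
  // another turn demanding proof) while any goal is unproven and the
  // extra-turn budget remains.
  acceptDone(): boolean {
    if (this.goals.every((g) => g.proven)) return true;
    if (this.extraTurns >= this.maxExtraTurns) return true; // budget exhausted
    this.extraTurns++;
    return false;
  }
}
```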
Each turn scores 7 action types based on execution state. Edited without testing? Test scores highest. Three reads in a row? Penalized. The model sees ranked priorities, not just a list of tools.
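State-aware ranking can be sketched like this. The action names, base scores, and adjustment weights are illustrative, not the engine's actual values; only the "untested edits push testing to the top, read loops get penalized" behavior comes from the docs.

```typescript
type Action = "read" | "edit" | "run_tests" | "typecheck" | "search" | "plan" | "done";

interface ExecState {
  editedSinceLastTest: boolean;
  consecutiveReads: number;
}

// Rank the 7 action types for the current turn (weights are made up).
function rankActions(state: ExecState): Action[] {
  const score: Record<Action, number> = {
    read: 3, edit: 5, run_tests: 4, typecheck: 4, search: 2, plan: 1, done: 0,
  };
  if (state.editedSinceLastTest) score.run_tests += 10; // untested edits: test first
  if (state.consecutiveReads >= 3) score.read -= 10;    // penalize read loops
  return (Object.keys(score) as Action[]).sort((a, b) => score[b] - score[a]);
}
```

The model is then shown this ranked list instead of a flat tool menu, so the highest-value next step is explicit.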
The engine auto-detects 5 modes — bugfix, feature, refactor, test repair, analysis — and applies mode-specific strategies. Bugfixes reproduce first. Refactors typecheck before declaring done.
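A keyword-based detector is one plausible way to implement the mode split; the patterns below are assumptions for illustration, not the engine's real heuristics:

```typescript
type Mode = "bugfix" | "feature" | "refactor" | "test-repair" | "analysis";

// Classify a task description into one of the 5 modes (regexes are made up).
function detectMode(task: string): Mode {
  const t = task.toLowerCase();
  if (/\bfix\b.*\btest\b|\btest\b.*\bassert/.test(t)) return "test-repair";
  if (/\bfix\b|\bbug\b/.test(t)) return "bugfix";
  if (/\brefactor\b|\bextract\b/.test(t)) return "refactor";
  if (/\banaly[sz]|\bexplain\b/.test(t)) return "analysis";
  return "feature";
}
```

Each detected mode would then select its strategy: bugfix reproduces the failure before editing, refactor runs the typechecker before declaring done, and so on.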
Every file edit is evaluated in real time: was the file read before writing? Is the same file being churned? Are edits in scope? The critic injects quality guidance with zero extra LLM calls.
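Because the critic runs over tracked state, it can be pure checks with no LLM calls at all. A sketch, with hypothetical field names:

```typescript
interface EditContext {
  filesRead: Set<string>;          // files the model has actually read
  editCounts: Map<string, number>; // edits so far per file
  scopedPaths: string[];           // paths the task is allowed to touch
}

// Evaluate one proposed edit; returns guidance strings to inject.
function critiqueEdit(file: string, ctx: EditContext): string[] {
  const warnings: string[] = [];
  if (!ctx.filesRead.has(file))
    warnings.push(`editing ${file} without reading it first`);
  if ((ctx.editCounts.get(file) ?? 0) >= 3)
    warnings.push(`churning ${file}: repeated rewrites of the same file`);
  if (!ctx.scopedPaths.some((p) => file.startsWith(p)))
    warnings.push(`${file} is outside the task's scope`);
  return warnings;
}
```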
The action ranking system learns from past runs. High-scoring trajectories boost their action patterns. Low-scoring runs penalize theirs. The engine calibrates to your codebase over time.
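Calibration could be as simple as per-pattern weights nudged by run outcome. The thresholds, deltas, and floor below are illustrative assumptions:

```typescript
class ActionPriors {
  private weights = new Map<string, number>();

  // Unknown patterns start at a neutral weight of 1.
  weight(pattern: string): number {
    return this.weights.get(pattern) ?? 1;
  }

  // After a run, boost the action patterns of high-scoring trajectories
  // and penalize those of low-scoring ones (values are made up).
  update(patterns: string[], runScore: number): void {
    const delta = runScore >= 0.8 ? 0.1 : runScore <= 0.4 ? -0.1 : 0;
    for (const p of patterns) {
      this.weights.set(p, Math.max(0.1, this.weight(p) + delta));
    }
  }
}
```

Over many runs the weights drift toward the patterns that succeed in this particular codebase, which is the calibration effect described above.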
Instead of flattening tool results into text, the engine preserves rich state: files read vs written, active goals, failures and reasons. This survives context compaction intact.
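The preserved state might be shaped like this (field names are assumptions); because it is structured data rather than prose, a snapshot/restore round-trip keeps it intact through compaction:

```typescript
interface ExecutionState {
  filesRead: string[];
  filesWritten: string[];
  activeGoals: { description: string; proven: boolean }[];
  failures: { tool: string; reason: string }[];
}

// Structured state serializes losslessly, unlike flattened tool-result text.
const snapshot = (s: ExecutionState): string => JSON.stringify(s);
const restore = (json: string): ExecutionState =>
  JSON.parse(json) as ExecutionState;
```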
When work splits into parallel units, the engine scores each specialist against the task: language match, complexity, historical success rate. Bug fixes route to the fixer. Docs route to the writer.
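Routing reduces to scoring each specialist against the unit and taking the max. The weights and the two-specialist roster below are illustrative, not the engine's real configuration:

```typescript
type SpecialistId = "fixer" | "writer";

interface Specialist {
  id: SpecialistId;
  languages: string[];
  successRate: number; // historical success rate, 0..1
}

interface Unit {
  language: string;
  kind: "bugfix" | "docs" | "feature";
}

// Weighted match between a work unit and a specialist (weights made up).
function scoreSpecialist(unit: Unit, s: Specialist): number {
  const langMatch = s.languages.includes(unit.language) ? 1 : 0;
  const kindMatch =
    (unit.kind === "bugfix" && s.id === "fixer") ||
    (unit.kind === "docs" && s.id === "writer")
      ? 1
      : 0;
  return 0.4 * langMatch + 0.3 * kindMatch + 0.3 * s.successRate;
}

// Route the unit to the highest-scoring specialist.
function route(unit: Unit, specialists: Specialist[]): SpecialistId {
  return specialists.reduce((best, s) =>
    scoreSpecialist(unit, s) > scoreSpecialist(unit, best) ? s : best
  ).id;
}
```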
Full benchmark code is in `crew-cli/benchmarks/` and `crew-cli/scripts/`.
TypeScript with strict type checking is a harder bar than plain JavaScript. `tsc --strict` catches errors that would pass silently elsewhere: implicit `any`, unchecked null access, mismatched types. If a model produces correct strict TypeScript, it can produce correct code in less strict languages too.
These test L3 execution quality in isolation. Full features involve L2 planning + L3 execution combined — the L2 planner benchmark tests that decomposition step separately. Scoped tasks give a clean signal on execution quality without confounding it with planning quality.
Yes. Set the API key for your provider, then run the benchmark script with the `--model` flag. Any OpenAI-compatible endpoint works. If the model supports tool calling and can produce TypeScript, it can run the benchmark.