crew-cli — AI Coding Engine | 16+ Providers, Free Models Match Opus

Commands

Eight ways to work

Each command uses the same 3-tier pipeline underneath. Simple tasks skip planning. Complex tasks get full artifact generation.

crew chat

One-shot task execution. Describe what you want, get files on disk. Routes automatically — simple tasks execute directly, complex ones get planned first.

crew chat "Add error handling to server.mjs" --apply

crew plan

Generate 7 planning artifacts before writing code: PDD, ROADMAP, ARCH, SCAFFOLD, CONTRACT-TESTS, DOD, GOLDEN-BENCHMARKS. Dual-model validation with risk assessment.

crew plan "Build user auth with OAuth"

crew test-first

TDD pipeline: generates tests first, implements to pass them, then validates. Catches its own bugs. Three LLM calls, ~$0.0002.

crew test-first "write add(a,b) with full test coverage"

crew validate

Blind code review. Scores correctness, security, performance, readability, test coverage. Returns a PASS/FIX verdict with actionable items.

crew validate src/auth.mjs

crew auto

Autonomous mode. Iterates until the task is complete — reads files, writes code, runs commands, verifies output. No human in the loop.

crew auto "create a greet function with tests"

crew repl

Interactive session with colored diff preview before applying changes, session history, memory, and mode switching (manual/assist/autopilot). Use /preview to review, then /apply or /rollback. Use /sessions to list past sessions and /resume [id] to continue one.

crew repl --mode autopilot

crew apply

Apply sandbox changes with safety gates. Blast-radius analysis blocks risky diffs. --check runs your test suite and parses diagnostics — TSC, ESLint, Go, Rust, pytest errors get fed back to crew-fixer for targeted retry.

crew apply --check "npm test" --retries 3

crew doctor

Health check in 4 seconds: Node.js, Git, API keys, gateway connectivity, MCP servers, CLI updates. Suggests cheapest providers when no keys are configured.

crew doctor

Resilience

No single-provider trap

crew-cli is built for the real constraint in AI terminals: one vendor hits a wall, but the job still needs to finish.

Bring Your Own Keys

Use the providers you already pay for: OpenAI, Anthropic, Google, Groq, DeepSeek, xAI, OpenRouter, local models, and more. No middleman markup, no platform lock-in.

Fail Over, Don’t Stall

If one model path is rate-limited, out of quota, or temporarily unavailable, crew-cli can fall through to another configured provider instead of dying on a single stack.
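
The fall-through behavior can be pictured with a minimal sketch. This is not crew-cli's actual implementation: the Provider shape and the completeWithFailover name are hypothetical, and the calls are synchronous only to keep the example short (real provider calls would be async).

```typescript
// Hypothetical shape: a provider is a name plus a function that either
// returns a completion or throws (rate limit, quota, outage).
interface Provider {
  name: string;
  complete: (prompt: string) => string;
}

interface FailoverResult {
  provider: string;  // the provider that finally answered
  output: string;
  skipped: string[]; // providers that failed before one succeeded
}

// Try providers in configured order and return the first success.
function completeWithFailover(providers: Provider[], prompt: string): FailoverResult {
  const skipped: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, output: p.complete(prompt), skipped };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      skipped.push(`${p.name}: ${reason}`);
    }
  }
  throw new Error(`all providers failed: ${skipped.join("; ")}`);
}
```

The order of the providers array is the fallback order, so the job only dies when every configured lane has failed.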

CLI Limits Aren’t Terminal

Claude Code, Codex CLI, Cursor, Gemini, and direct APIs each have different strengths and failure modes. crew-cli is the portable execution lane when one runtime hits a wall mid-task.

Local Lanes Stay Cheap

Mix local models into the same workflow for effectively free parallel throughput. Keep premium APIs for the hard parts and offload glue work, summaries, and background execution to local lanes.

Pay For The Brain, Not The Glue

Use fast cheap lanes for L1 routing and L3 worker churn. Spend premium tokens on L2 planning, validation, and hard reasoning only when the task actually justifies it.

Architecture

The 3-tier pipeline

Every task flows through three stages. Simple tasks skip straight to execution. Complex tasks get full planning with risk validation.

🚦 Tier 1: Router

Fast, cheap lane decides how to handle the task: direct answer, single execution, or decomposition into more work. This is where cheap hosted models or local router lanes shine.

Groq / Grok / Gemini Flash / local router lanes

🗺️ Tier 2: Planner

This is the expensive brain layer: planning, validation, risk checks, and hard reasoning. Use your premium context-heavy models here when the task actually needs them.

GPT / Claude / Gemini Pro / premium reasoning models

⚡ Tier 3: Executor

Tool-using worker lane with 45+ built-in tools: file I/O, shell, git, LSP, browser, and web search. Use strong cheap models, premium coders, or local workers depending on cost and quality needs.

Gemini / Codex / Claude / Groq / local worker lanes
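
As a rough mental model, the three routing outcomes map to tiers like this. The type names and the TIERS_BY_ROUTE table are hypothetical, and the sketch assumes a direct answer needs no tier beyond the router, which the description above implies but does not state.

```typescript
// The three routing outcomes named in the Tier 1 description.
type Route = "direct-answer" | "single-execution" | "decompose";
type Tier = "L1-router" | "L2-planner" | "L3-executor";

// Assumption: which tiers run for each route. Simple tasks skip planning,
// complex (decomposed) tasks go through all three.
const TIERS_BY_ROUTE: Record<Route, Tier[]> = {
  "direct-answer": ["L1-router"],
  "single-execution": ["L1-router", "L3-executor"],
  "decompose": ["L1-router", "L2-planner", "L3-executor"],
};

// True when the premium planner lane is needed for this route.
function needsPlanner(route: Route): boolean {
  return TIERS_BY_ROUTE[route].includes("L2-planner");
}
```

Only the decompose route touches the expensive planner lane, which is the cost argument behind the tiering.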

Technical Manual

Engineering-First Terminals

🖼️ Multimodal Vision

Attach images via the --image flag or the /image REPL command. Native support for Gemini, GPT-4o, Claude 3, and Grok Vision, with no base64 text dumps.

🔧 45+ Built-in Tools

File I/O, shell, git, LSP diagnostics, Jupyter notebooks, web search, Docker sandbox, memory, multi-turn sub-agents, worktree isolation — more tools than any competitor CLI.

ATAT Protocol

10x more token-efficient than JSON-RPC. Stream-parseable, no escaping, and supports graceful partial execution if a model times out.

Parallel Worker Pool

Execute independent tasks concurrently. Achieve a 2.96x wall-clock speedup over sequential implementation cycles.

Git Worktree Isolation

Multi-agent waves get automatic git worktree isolation — each agent works on its own branch so parallel file edits never conflict. Merges back after the wave completes.

LSP Self-Healing

Language Server Protocol integration identifies syntax/type errors in the sandbox. Agents fix their own bugs before human review.

🧠 Agent Memory

Persistent cross-session memory with MemoryBroker. Agents recall facts, decisions, and prior task results across conversations.

Diagnostic Lint-Loop

Not just pass/fail — --check parses error output into structured diagnostics (TSC, ESLint, GCC, Go, Rust, pytest). Feeds specific file:line:col errors to crew-fixer. Stops early when no progress is detected.
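
To make the idea concrete, here is a minimal sketch of turning raw checker output into file:line:col diagnostics. It is an illustration, not crew-cli's parser: it covers only the non-pretty tsc format and the colon-separated file:line:col style used by compilers such as GCC and Go, and the "progress" rule (fewer diagnostics than last attempt) is an assumption.

```typescript
interface Diagnostic {
  file: string;
  line: number;
  col: number;
  message: string;
}

// tsc (non-pretty): src/a.ts(10,5): error TS2322: message
const TSC_LINE = /^(.+?)\((\d+),(\d+)\): error (TS\d+: .*)$/;
// GCC/Go style: main.c:3:7: error: message
const COLON_LINE = /^(.+?):(\d+):(\d+): (?:error: )?(.*)$/;

// Extract structured diagnostics from raw checker output, one per matching line.
function parseDiagnostics(output: string): Diagnostic[] {
  const found: Diagnostic[] = [];
  for (const raw of output.split("\n")) {
    const line = raw.trim();
    const m = TSC_LINE.exec(line) ?? COLON_LINE.exec(line);
    if (m) {
      found.push({ file: m[1], line: Number(m[2]), col: Number(m[3]), message: m[4] });
    }
  }
  return found;
}

// Assumed progress rule: retry only while errors remain and their count is shrinking.
function shouldRetry(prev: Diagnostic[], next: Diagnostic[], attemptsLeft: number): boolean {
  return attemptsLeft > 0 && next.length > 0 && next.length < prev.length;
}
```

Handing the fixer a list like this, rather than a wall of log text, is what makes a retry targeted.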

Checkpoint-at-Interval

Git stash snapshots every 60 seconds during long pipeline runs. Find any snapshot with git stash list and restore it to roll back. Configure the interval with CREW_CHECKPOINT_INTERVAL_MS. Zero overhead when idle.

Speculative Explore

Run competing implementation strategies in parallel branches. Compare architectural diffs side-by-side and merge the winner.

Browser Debugging

Headless Chrome integration for visual QA. Agents can inspect live DOM state and fix CSS/UX issues autonomously.

⚡ Real-time Streaming

Token-by-token output as the LLM generates. All providers stream — Gemini, OpenAI, Anthropic, Grok, DeepSeek, Groq, and OpenRouter. No buffered waits.

🔄 Session Continuity

Full conversation history persists across REPL sessions via SessionManager. Resume where you left off with /history, /status, /clear, /sessions, and /resume.

🩺 crew doctor

Built-in diagnostics: checks Node.js version, Git, API keys, gateway, MCP, and CLI updates in under 4 seconds. Suggests cheapest providers when no keys are set.

🔄 Session Resume

/sessions lists past sessions, /resume [id] picks up where you left off. JSONL crash-safe transcripts survive mid-write crashes — no lost context, ever.
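
Why JSONL survives a crash: every record is one complete line, so a write that dies halfway damages at most the final line. A minimal reader sketch under that assumption (the readTranscript name and return shape are hypothetical, not crew-cli's API):

```typescript
// Parse a JSONL transcript, keeping every complete record and counting
// lines that fail to parse (typically a single truncated final line).
function readTranscript(jsonl: string): { entries: unknown[]; dropped: number } {
  const entries: unknown[] = [];
  let dropped = 0;
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines and the trailing newline
    try {
      entries.push(JSON.parse(line));
    } catch {
      dropped += 1; // a partial write: skip it, keep everything else
    }
  }
  return { entries, dropped };
}
```

A single JSON array, by contrast, becomes unparseable as a whole if the closing bracket never gets written.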

🪝 Tool Hooks

.crew/hooks.json lets you intercept any tool call. PreToolUse can block dangerous commands, PostToolUse can log everything. Hooks are shell commands that receive JSON on stdin.

🌳 Git Worktree Isolation

Agents work in isolated git worktrees on separate branches. No file conflicts during parallel work. Auto-cleanup if no changes, squash merge if changes made.

📊 Token-Aware Compaction

Context compression adapts to how full your context window is. Light compression at 50%, aggressive at 75%+. Per-model context window awareness keeps agents sharp.
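
The thresholds above reduce to a small decision function. This sketch uses the stated 50% and 75% cut-offs; the function name, the "none" level below 50%, and passing the context window as a plain number are illustrative assumptions.

```typescript
type Compaction = "none" | "light" | "aggressive";

// Pick a compaction level from how full this model's context window is.
// contextWindow is per model, so the same token count can compress
// differently on a small model than on a large one.
function compactionLevel(usedTokens: number, contextWindow: number): Compaction {
  const fill = usedTokens / contextWindow;
  if (fill >= 0.75) return "aggressive";
  if (fill >= 0.5) return "light";
  return "none";
}
```

For example, 60,000 used tokens is light compression on a 100,000-token window but no compression at all on a 200,000-token one.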

🖥️ tmux Session Handoff

Multi-wave pipelines use labeled tmux panes for cross-agent context sharing. Agent A's output, cwd, and env vars are handed off to Agent B via the session manager. Zero cold starts between pipeline waves.

📡 PreToolUse / PostToolUse Hooks

Intercept any tool call with .crew/hooks.json. Block dangerous shell commands, log every file write, or transform tool input before execution. JSON piped on stdin to your shell scripts.

💰 Cost Tracking

Real-time token spend per model with prompt cache savings. Tracks Anthropic 90% cache discount, Groq 50%, Google free tier. Dashboard shows per-agent, per-model cost breakdown.
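
The cache arithmetic is simple enough to sketch. The 90% and 50% discounts come from the paragraph above; the Usage shape, the function name, and any price you plug in are hypothetical, and only input tokens are modeled.

```typescript
// Cache discounts as stated above: fraction taken off cached input tokens.
const CACHE_DISCOUNT: Record<string, number> = { anthropic: 0.9, groq: 0.5 };

interface Usage {
  provider: string;
  freshInputTokens: number;   // billed at full price
  cachedInputTokens: number;  // billed at the discounted price
  pricePerMTokInput: number;  // USD per million input tokens (caller-supplied)
}

// Return what the input tokens cost and how much the prompt cache saved.
function inputCost(u: Usage): { cost: number; saved: number } {
  const discount = CACHE_DISCOUNT[u.provider] ?? 0; // unknown provider: no discount
  const perToken = u.pricePerMTokInput / 1_000_000;
  const full = (u.freshInputTokens + u.cachedInputTokens) * perToken;
  const saved = u.cachedInputTokens * perToken * discount;
  return { cost: full - saved, saved };
}
```

Summing these per model and per agent gives the kind of breakdown the dashboard describes.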

🔄 Intelligent Retry

Detects stuck agents: questions instead of work, plans instead of code, incomplete bail-outs. Auto-corrects with targeted prompts. Not just backoff — adaptive recovery.

🔌 64 MCP Tools

Built-in MCP server exposes the full swarm via JSON-RPC. dispatch_agent, run_pipeline, chat_send, crewswarm_status — any MCP client can orchestrate the fleet.

🧠 Tool Auto-Filter

Instead of sending every built-in tool to every LLM call, crew-cli detects what the task needs (coding, git, web, tests, docs) and sends only the relevant tools. This reduces context, improves model accuracy, and fixes degradation on smaller models.
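
A minimal sketch of the filtering idea, using the five categories named above. The keyword patterns, tool names, and the fall-back-to-everything rule are all assumptions for illustration; the real detection logic is not documented here.

```typescript
type Category = "coding" | "git" | "web" | "tests" | "docs";

interface Tool {
  name: string;
  category: Category;
}

// Hypothetical keyword heuristics, one pattern per category.
const KEYWORDS: Record<Category, RegExp> = {
  coding: /\b(implement|refactor|function|fix|bug)\b/i,
  git: /\b(commit|branch|merge|rebase|git)\b/i,
  web: /\b(search|fetch|url|browse)\b/i,
  tests: /\b(tests?|coverage|assert)\b/i,
  docs: /\b(docs?|readme|document)\b/i,
};

// Keep only the tools whose category the task text appears to need.
function relevantTools(task: string, tools: Tool[]): Tool[] {
  const wanted = (Object.keys(KEYWORDS) as Category[]).filter((c) => KEYWORDS[c].test(task));
  if (wanted.length === 0) return tools; // nothing detected: send the full set
  return tools.filter((t) => wanted.includes(t.category));
}
```

A task like "write tests for the login function" would keep coding and test tools and drop git, web, and docs tools from the call.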

📌 Top of Mind

Drop a .crew/instructions.md in your project with persistent rules: "always use single quotes", "never modify package-lock.json", team conventions. Injected into every LLM turn automatically.

🔍 /recall

Search across all past conversations: /recall auth middleware. Finds relevant commands, outputs, and routing decisions from previous sessions. Never re-explain context the agent already saw.

⚡ /summon

Switch personas mid-task: /summon crew-qa write tests for auth.ts. Six specialists (QA, backend, frontend, security, docs, fixer) with persona-specific prompts and filtered tool sets. No new session — stays in context.

🎨 DESIGN.md

For frontend tasks, the L2 planner auto-generates a full design system: color tokens, typography, spacing, component patterns, accessibility rules, dark/light themes. Frontend workers reference it for consistent UI.

Under the Hood

Execution Quality Engine

Every other AI coding CLI runs a blind loop: prompt the model, execute its tool call, repeat until it says "done." There's no memory of what failed, no proof that it worked, no feedback for next time. crew-cli is different. It wraps every task in a quality-aware engine that prevents common failure modes, demands verification, and improves over time.

Why this matters

Without execution quality controls, AI coding agents fail in predictable ways: they retry the same broken command in a loop, they declare "done" without running tests, they edit files they never read, they waste turns exploring instead of acting. These aren't model problems — they happen with GPT-5, Claude Opus, and Gemini alike. The engine fixes them at the runtime level so every model performs better.

Failure Memory

When a tool call fails, the engine remembers. Same command with same params? Blocked after one failure for shell commands, two for others. The model gets "don't repeat this" context every turn, forcing different approaches. In benchmarks this eliminated the #1 waste pattern: models retrying npm install 4x in a row.
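
The blocking rule (one failure for shell, two for everything else) fits in a few lines. This is a sketch of the rule as described, not the engine's code: the class name, the "shell" tool identifier, and keying on JSON.stringify of the params are assumptions.

```typescript
// Remembers failed tool calls and blocks exact repeats.
class FailureMemory {
  private failures = new Map<string, number>();

  // Identical tool + params produce the same key. JSON.stringify is
  // key-order sensitive, which is acceptable for a sketch.
  private key(tool: string, params: unknown): string {
    return `${tool}:${JSON.stringify(params)}`;
  }

  recordFailure(tool: string, params: unknown): void {
    const k = this.key(tool, params);
    this.failures.set(k, (this.failures.get(k) ?? 0) + 1);
  }

  // Shell commands are blocked after one failure, other tools after two.
  isBlocked(tool: string, params: unknown): boolean {
    const limit = tool === "shell" ? 1 : 2;
    return (this.failures.get(this.key(tool, params)) ?? 0) >= limit;
  }
}
```

Because the key includes the params, the same tool with a different command stays available, which is what pushes the model toward a different approach instead of a verbatim retry.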

Verification Gate

The engine extracts verification goals from your task — "tests pass", "build succeeds", "lint clean" — and tracks them as first-class state. When the model says "done" with unproven goals, it doesn't stop. It gets forced back with up to 3 extra turns demanding proof. No more "I think I fixed it" without evidence.
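
In code, the gate is a check that runs whenever the model claims it is finished. A minimal sketch, assuming goals are tracked as a name-to-proven map; the GateState shape, function name, and prompt wording are hypothetical, while the cap of 3 extra turns comes from the description above.

```typescript
interface GateState {
  goals: Record<string, boolean>; // goal name -> proven yet?
  extraTurnsUsed: number;
}

const MAX_EXTRA_TURNS = 3;

// Called when the model says "done". Either lets the run stop or sends
// the model back with a demand for proof of the unproven goals.
function onModelSaysDone(state: GateState): { stop: boolean; prompt?: string } {
  const unproven = Object.keys(state.goals).filter((g) => !state.goals[g]);
  if (unproven.length === 0 || state.extraTurnsUsed >= MAX_EXTRA_TURNS) {
    return { stop: true };
  }
  state.extraTurnsUsed += 1;
  return { stop: false, prompt: `Not done yet. Show proof for: ${unproven.join(", ")}` };
}
```

The cap matters: without it, a goal that can never be proven would keep the run alive forever.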

Action Ranking

Each turn, the engine scores 7 action types (read, search, edit, test, build, verify, delegate) based on execution state. Edited without testing? Test scores highest. Three reads in a row? Penalized. Recent failure on shell commands? "DO NOT retry" warning injected. The model sees ranked priorities, not just a flat list of tools.
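
Here is one way those rules could look as a scoring function. The seven action types and the three rules come from the description above; the ExecState fields, the numeric weights, and the warning text are invented for illustration.

```typescript
type Action = "read" | "search" | "edit" | "test" | "build" | "verify" | "delegate";

const ACTIONS: Action[] = ["read", "search", "edit", "test", "build", "verify", "delegate"];

// Hypothetical summary of what has happened so far in the run.
interface ExecState {
  editedSinceLastTest: boolean;
  consecutiveReads: number;
  recentShellFailure: boolean;
}

// Score every action from the current state and return them best-first.
function rankActions(s: ExecState): { ranked: Action[]; warnings: string[] } {
  // Every action starts equal; the weights below are illustrative.
  const score: Record<Action, number> = {
    read: 1, search: 1, edit: 1, test: 1, build: 1, verify: 1, delegate: 1,
  };
  const warnings: string[] = [];
  if (s.editedSinceLastTest) score.test += 5;   // untested edits: test first
  if (s.consecutiveReads >= 3) score.read -= 2; // stop exploring, start acting
  if (s.recentShellFailure) warnings.push("DO NOT retry the failed shell command unchanged");
  const ranked = ACTIONS.slice().sort((a, b) => score[b] - score[a]);
  return { ranked, warnings };
}
```

The ranked list and warnings would then be injected into the next turn's context alongside the tool definitions.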

Task Mode Strategies

Not all tasks should be approached the same way. The engine auto-detects 5 modes — bugfix, feature, refactor, test repair, analysis — and applies mode-specific strategies. Bugfixes reproduce first then fix. Refactors run typecheck before declaring done. Analysis tasks don't make speculative edits.

Patch Critic

Every file edit is evaluated in real time: was the file read before writing? Is the same file being churned? Are edits staying in scope? Is verification missing after changes? The critic injects quality guidance into the next turn — zero extra LLM calls, zero latency cost.

Adaptive Weights

The action ranking system learns from past runs. High-scoring trajectories boost the weights of their dominant action patterns. Low-scoring runs penalize theirs. Over time, the engine calibrates to your codebase and the models you use. The feedback loop is automatic — every completed task improves the next one.

Structured History

Instead of flattening tool results into text (losing context each turn), the engine preserves rich state: which files were read vs written, what goals are active, what failed and why. This survives context compaction — the model always knows what matters even when older turns are summarized.

Smart Delegation

When work splits into parallel units, the engine scores each specialist against the task: language match, complexity, historical success rate, recent failures. Bug fixes route to the fixer. Docs route to the writer. Performance data improves routing over time.

5 models pass all tests

These models pass our full quality benchmark solo: correct TypeScript, all tests passing, typecheck clean, no regressions. Includes 2 free models served through Ollama cloud.

Model	Provider	~Cost/Task	Result
Claude Opus 4.6	Anthropic	~$0.07	All tests pass
Qwen 3.5 (397B)	Ollama cloud	FREE	All tests pass
GPT-5.4	OpenAI	~$0.13	All tests pass
Claude Sonnet 4.6	Anthropic	~$0.06	All tests pass
GLM-5.1	Ollama cloud	FREE	All tests pass

L2 Planner: 14 models at 90/100

Task decomposition quality — correct dependency chains, diverse persona assignments, acceptance criteria, verification checks. Cheap models match expensive ones because the prompt engineering does the work.

Model	Score	Units	~$/Plan
Claude / GPT-5.4 (OAuth)	90	9-12	$0
GPT-OSS 20B (Groq)	90	15	$0.003
Gemini 2.5 Flash Lite	90	~10	$0.004
DeepSeek Reasoner	90	~10	$0.004
Grok 3 Mini / Qwen3-32B	90	~10	$0.005
GLM-5 (Zen)	90	16	$0.02
Kimi K2.5 (Zen)	90	~10	$0.015

Feature	crew-cli	Claude Code	Codex CLI	Gemini CLI	Cursor
Execution Quality Engine	✅ 8 Modules	❌ Simple Loop	❌ Simple Loop	❌ Simple Loop	❌ Simple Loop
Multi-model routing	✅ 10+ Providers	❌ Anthropic Only	❌ OpenAI Only	❌ Google Only	✅ Native
Multimodal (Images)	✅ All Providers	✅ Claude Vision	❌ Text Only	✅ Gemini Vision	✅ Native
Built-in Tools	✅ 45+ Tools	✅ ~15 Tools	✅ ~10 Tools	✅ ~12 Tools	✅ ~20 Tools
Sandbox + Branching	✅ Professional	❌ Direct Write	✅ Sandbox	❌ Direct Write	❌ Passive
Parallel Dispatch	✅ 21 Specialists	✅ Subagents	❌ Single Agent	❌ Single Agent	✅ Subagents
Agent Memory	✅ Cross-Session	❌ Per-Session	❌ Per-Session	✅ Gems	❌ Per-Session
Diagnostic Lint-Loop	✅ Parsed Errors	❌ Manual	❌ Manual	❌ Manual	✅ loop_on_lints
Browser Debugging	✅ Headless Chrome	❌ No UI Vision	❌ No	❌ No	❌ Passive
Cost Tracking	✅ Per-Session	✅ Integrated	❌ No	❌ No	❌ No Granularity
Streaming Output	✅ All Providers	✅ Native	✅ Native	✅ Native	✅ Native
Diagnostics CLI	✅ crew doctor	❌ No	❌ No	❌ No	❌ No
Session Memory	✅ Persistent	✅ Per-Conversation	❌ Stateless	✅ Gems	❌ Per-Session

One agent is too sequential. crew-cli keeps the work moving.

Built for the real bottleneck

One Agent Is Too Sequential

One Vendor Is Too Fragile

One Price Tier Is Wasteful