crew-cli is the portable execution engine inside crewswarm. It runs real agentic coding loops: route the task, plan when needed, use tools, write files, run commands, validate the result, and keep going until the work is done.
The point is not just "another CLI." The point is escaping the one-agent, one-vendor bottleneck. Bring whatever keys you have, mix in local lanes, and keep building when Anthropic, OpenAI, or another provider or runtime hits limits.
Use cheap or local models for routing and worker churn. Reserve premium models for planning, validation, and hard reasoning. Pay for intelligence where it matters, not on every token of every step.
npm install -g crewswarm-cli
Want the practical tradeoffs? Compare crew-cli with Claude, Codex, Cursor, Gemini, and OpenCode.
Most AI coding tools still assume one person driving one model in one lane. That breaks down fast when the work needs retries, tools, parallelism, or another provider.
A single chat thread is fine for small fixes. Real engineering work needs planning, execution, validation, retries, and sometimes multiple lanes moving at once.
Rate limits, quota, outages, and model drift should not stop the job. crew-cli is built to keep moving across providers, local models, and external runtimes.
Not every step needs premium reasoning. crew-cli lets you spend on the planning brains and keep cheap or local lanes doing the glue work.
Each command uses the same 3-tier pipeline underneath. Simple tasks skip planning. Complex tasks get full artifact generation.
One-shot task execution. Describe what you want, get files on disk. Routes automatically — simple tasks execute directly, complex ones get planned first.
crew chat "Add error handling to server.mjs" --apply
Generate 7 planning artifacts before writing code: PDD, ROADMAP, ARCH, SCAFFOLD, CONTRACT-TESTS, DOD, GOLDEN-BENCHMARKS. Dual-model validation with risk assessment.
crew plan "Build user auth with OAuth"
TDD pipeline: generates tests first, implements to pass them, then validates. Catches its own bugs. Three LLM calls, ~$0.0002.
crew test-first "write add(a,b) with full test coverage"
Blind code review. Scores correctness, security, performance, readability, test coverage. Returns a PASS/FIX verdict with actionable items.
crew validate src/auth.mjs
Autonomous mode. Iterates until the task is complete — reads files, writes code, runs commands, verifies output. No human in the loop.
crew auto "create a greet function with tests"
Interactive session with colored diff preview before applying changes, session history, memory, and mode switching (manual/assist/autopilot). Use /preview to review pending changes, then /apply or /rollback. Use /sessions to list past sessions and /resume [id] to continue one.
crew repl --mode autopilot
Apply sandbox changes with safety gates. Blast-radius analysis blocks risky diffs. --check runs your test suite and parses diagnostics — TSC, ESLint, Go, Rust, pytest errors get fed back to crew-fixer for targeted retry.
crew apply --check "npm test" --retries 3
Health check in under 4 seconds: Node.js, Git, API keys, gateway connectivity, MCP servers, CLI updates. Suggests cheapest providers when no keys are configured.
crew doctor
crew-cli is built for the real constraint in AI terminals: one vendor hits a wall, but the job still needs to finish.
Use the providers you already pay for: OpenAI, Anthropic, Google, Groq, DeepSeek, xAI, OpenRouter, local models, and more. No middleman markup, no platform lock-in.
If one model path is rate-limited, out of quota, or temporarily unavailable, crew-cli can fall through to another configured provider instead of dying on a single stack.
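The fall-through idea can be pictured roughly like this. It is a minimal sketch, not crew-cli's actual code: the `Lane` type, its `status` values, and `pickLane` are hypothetical names invented for illustration.

```typescript
// Hypothetical types for illustration; not crew-cli's real API.
type LaneStatus = "ok" | "rate_limited" | "out_of_quota" | "unavailable";

interface Lane {
  provider: string;   // e.g. "anthropic", "openai", "ollama"
  status: LaneStatus; // last known health of this lane
}

// Return the first configured lane that is currently usable,
// walking the list in the user's preferred order.
function pickLane(lanes: Lane[]): Lane | undefined {
  return lanes.find((lane) => lane.status === "ok");
}
```

With a list of `anthropic` (rate limited), `openai` (out of quota), and `ollama` (ok), `pickLane` returns the `ollama` lane, so the task keeps moving instead of failing on the first provider.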
Claude Code, Codex CLI, Cursor, Gemini, and direct APIs each have different strengths and failure modes. crew-cli is the portable execution lane when one runtime hits a wall mid-task.
Mix local models into the same workflow for effectively free parallel throughput. Keep premium APIs for the hard parts and offload glue work, summaries, and background execution to local lanes.
Use fast cheap lanes for L1 routing and L3 worker churn. Spend premium tokens on L2 planning, validation, and hard reasoning only when the task actually justifies it.
Every task flows through three stages. Simple tasks skip straight to execution. Complex tasks get full planning with risk validation.
Fast, cheap lane decides how to handle the task: direct answer, single execution, or decomposition into more work. This is where cheap hosted models or local router lanes shine.
This is the expensive brain layer: planning, validation, risk checks, and hard reasoning. Use your premium context-heavy models here when the task actually needs them.
Tool-using worker lane with 45+ built-in tools: file I/O, shell, git, LSP, browser, and web search. Use strong cheap models, premium coders, or local workers depending on cost and quality needs.
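One way to picture the three stages is as a routing decision that determines which tiers a task touches. The sketch below is illustrative only: the `Route` shape and `tiersFor` are hypothetical, and treating a direct answer as L1-only is an assumption, since the router's real output format is not shown here.

```typescript
// Hypothetical shapes; the real router's output format is not documented here.
type Route =
  | { kind: "direct"; answer: string }      // L1 answers on its own
  | { kind: "execute"; task: string }       // single execution, no planning
  | { kind: "decompose"; units: string[] }; // complex work, planned first

// Which tiers a route touches: simple work skips L2 planning.
function tiersFor(route: Route): string[] {
  switch (route.kind) {
    case "direct":
      return ["L1"];
    case "execute":
      return ["L1", "L3"];
    case "decompose":
      return ["L1", "L2", "L3"];
  }
}
```

The design point is that the expensive L2 tier only appears on the decomposition path, which is where premium spend is easiest to justify.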
Attach images via --image flag or /image REPL command. Native support for Gemini, GPT-4o, Claude 3, Grok Vision — no base64 text dumps.
File I/O, shell, git, LSP diagnostics, Jupyter notebooks, web search, Docker sandbox, memory, multi-turn sub-agents, worktree isolation — more tools than any competitor CLI.
10x more token-efficient than JSON-RPC. Stream-parseable, no escaping, and supports graceful partial execution if a model times out.
Execute independent tasks concurrently. Achieve a 2.96x wall-clock speedup over sequential implementation cycles.
Multi-agent waves get automatic git worktree isolation — each agent works on its own branch so parallel file edits never conflict. Merges back after the wave completes.
Language Server Protocol integration identifies syntax/type errors in the sandbox. Agents fix their own bugs before human review.
Persistent cross-session memory with MemoryBroker. Agents recall facts, decisions, and prior task results across conversations.
Not just pass/fail — --check parses error output into structured diagnostics (TSC, ESLint, GCC, Go, Rust, pytest). Feeds specific file:line:col errors to crew-fixer. Stops early when no progress is detected.
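To make "structured diagnostics" concrete, here is a small sketch that parses the TypeScript compiler's plain (non-pretty) output format, `path(line,col): error TS1234: message`. The `Diagnostic` shape, `parseTsc`, and `madeProgress` are hypothetical names, and defining "no progress" as "the error count did not drop" is an assumption, not crew-cli's documented rule.

```typescript
interface Diagnostic {
  file: string;
  line: number;
  col: number;
  code: string;
  message: string;
}

// Matches tsc's plain output: path(line,col): error TS1234: message
const TSC_LINE = /^(.+?)\((\d+),(\d+)\): error (TS\d+): (.*)$/;

function parseTsc(output: string): Diagnostic[] {
  const diagnostics: Diagnostic[] = [];
  for (const raw of output.split(/\r?\n/)) {
    const m = TSC_LINE.exec(raw.trim());
    if (!m) continue; // skip summary lines and anything unrecognised
    diagnostics.push({
      file: m[1],
      line: Number(m[2]),
      col: Number(m[3]),
      code: m[4],
      message: m[5],
    });
  }
  return diagnostics;
}

// Assumption: a retry made progress only if it left fewer errors behind.
function madeProgress(before: Diagnostic[], after: Diagnostic[]): boolean {
  return after.length < before.length;
}
```

A retry loop built on this would hand each `file:line:col` entry to the fixer and stop as soon as `madeProgress` returns false.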
Git stash snapshots every 60 seconds during long pipeline runs. Roll back to any point via git stash list. Configure with CREW_CHECKPOINT_INTERVAL_MS. Zero overhead when idle.
Run competing implementation strategies in parallel branches. Compare architectural diffs side-by-side and merge the winner.
Headless Chrome integration for visual QA. Agents can inspect live DOM state and fix CSS/UX issues autonomously.
Token-by-token output as the LLM generates. All providers stream — Gemini, OpenAI, Anthropic, Grok, DeepSeek, Groq, and OpenRouter. No buffered waits.
Full conversation history persists across REPL sessions via SessionManager. Resume where you left off with /history, /status, /clear, /sessions, and /resume.
Built-in diagnostics: checks Node.js version, Git, API keys, gateway, MCP, and CLI updates in under 4 seconds. Suggests cheapest providers when no keys are set.
/sessions lists past sessions, /resume [id] picks up where you left off. JSONL crash-safe transcripts survive mid-write crashes — no lost context, ever.
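JSONL is crash-tolerant because each record is one appended line, so an interrupted write can damage at most the final line. The sketch below shows a reader that takes advantage of that. `readTranscript` is a hypothetical helper, and the record contents are left as `unknown` because the transcript schema is not described here.

```typescript
// Parse a JSONL transcript, skipping lines that are not valid JSON
// (typically a truncated final record left by a crash mid-write).
function readTranscript(text: string): unknown[] {
  const entries: unknown[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      entries.push(JSON.parse(line));
    } catch {
      // Partial record: drop it and keep everything written before it.
    }
  }
  return entries;
}
```

Given two complete records followed by a half-written third, the reader returns the two complete ones, which is enough to resume a session.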
.crew/hooks.json lets you intercept any tool call. PreToolUse can block dangerous commands, PostToolUse can log everything. Shell commands with JSON on stdin.
Agents work in isolated git worktrees on separate branches. No file conflicts during parallel work. Auto-cleanup if no changes, squash merge if changes made.
Context compression adapts to how full your context window is. Light compression at 50%, aggressive at 75%+. Per-model context window awareness keeps agents sharp.
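The thresholds above translate into a very small decision function. This is a sketch using the 50% and 75% figures from the text; the function name and the `"none"` level below 50% are assumptions.

```typescript
type Compression = "none" | "light" | "aggressive";

// Pick a compression level from how full the model's context window is.
function compressionFor(usedTokens: number, contextWindow: number): Compression {
  const fill = usedTokens / contextWindow;
  if (fill >= 0.75) return "aggressive";
  if (fill >= 0.5) return "light";
  return "none";
}
```

Because the window size is a parameter, the same 64,000 tokens count as half full on a 128,000-token model but barely register on a model with a much larger window.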
Multi-wave pipelines use labeled tmux panes for cross-agent context sharing. Agent A's output, cwd, and env vars are handed off to Agent B via the session manager. Zero cold starts between pipeline waves.
Intercept any tool call with .crew/hooks.json. Block dangerous shell commands, log every file write, or transform tool input before execution. JSON piped on stdin to your shell scripts.
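As an illustration of what a blocking hook might decide, here is the core check such a script could run after parsing the JSON it receives. Everything about the payload is an assumption: the `ToolCall` fields, the `"shell"` tool name, and how a hook signals "block" back to crew-cli are not documented in this page, so check the actual hook contract before relying on any of it.

```typescript
// Hypothetical payload shape; the real fields piped on stdin may differ.
interface ToolCall {
  tool: string;
  input: { command?: string };
}

// Example patterns only; extend to match your own risk tolerance.
const DANGEROUS: RegExp[] = [
  /\brm\s+-rf\s+\//,              // recursive delete from the filesystem root
  /\bgit\s+push\s+.*--force\b/,   // force push
];

// True when a shell tool call matches one of the dangerous patterns.
function shouldBlock(call: ToolCall): boolean {
  if (call.tool !== "shell") return false;
  const command = call.input.command ?? "";
  return DANGEROUS.some((pattern) => pattern.test(command));
}
```

Keeping the decision in a pure function like this makes the hook easy to unit test separately from the stdin plumbing.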
Real-time token spend per model with prompt cache savings. Tracks Anthropic 90% cache discount, Groq 50%, Google free tier. Dashboard shows per-agent, per-model cost breakdown.
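The effect of a cache discount on prompt cost is simple arithmetic. The sketch below covers prompt tokens only and ignores output tokens; `callCost` and its parameters are hypothetical, and the price used in the example is a made-up figure, not a quoted rate.

```typescript
// Prompt-side cost in USD for one call.
// cacheDiscount is the fraction taken off cached prompt tokens (0.9 = 90% off).
function callCost(
  promptTokens: number,
  cachedTokens: number,
  pricePerMTok: number,
  cacheDiscount: number,
): number {
  const fresh = promptTokens - cachedTokens;             // billed at full price
  const cached = cachedTokens * (1 - cacheDiscount);     // billed at the discounted rate
  return ((fresh + cached) * pricePerMTok) / 1_000_000;
}
```

For example, at an assumed $3 per million tokens, a 100,000-token prompt with 80,000 cached tokens and a 90% discount costs about $0.084, against $0.30 with no cache hits.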
Detects stuck agents: questions instead of work, plans instead of code, incomplete bail-outs. Auto-corrects with targeted prompts. Not just backoff — adaptive recovery.
Built-in MCP server exposes the full swarm via JSON-RPC. dispatch_agent, run_pipeline, chat_send, crewswarm_status — any MCP client can orchestrate the fleet.
Instead of sending every built-in tool to every LLM call, crew-cli detects what the task needs (coding, git, web, tests, docs) and sends only the relevant tools. This reduces context, improves model accuracy, and limits degradation on smaller models.
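A rough sketch of category-based tool filtering follows. The five categories come from the text above; the tool names and their groupings are hypothetical and exist only to show the shape of the idea.

```typescript
type Category = "coding" | "git" | "web" | "tests" | "docs";

// Hypothetical tool names and groupings, for illustration only.
const TOOLS_BY_CATEGORY: Record<Category, string[]> = {
  coding: ["read_file", "write_file", "shell"],
  git: ["git_diff", "git_commit"],
  web: ["web_search", "fetch_url"],
  tests: ["run_tests", "shell"],
  docs: ["read_file", "write_file"],
};

// Union of the tools for the detected categories, without duplicates.
function toolsFor(categories: Category[]): string[] {
  const selected = new Set<string>();
  for (const category of categories) {
    for (const tool of TOOLS_BY_CATEGORY[category]) selected.add(tool);
  }
  return Array.from(selected);
}
```

A task detected as coding plus tests would get four tools here instead of the whole catalogue, which is the context saving the paragraph describes.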
Drop a .crew/instructions.md in your project with persistent rules: "always use single quotes", "never modify package-lock.json", team conventions. Injected into every LLM turn automatically.
Search across all past conversations: /recall auth middleware. Finds relevant commands, outputs, and routing decisions from previous sessions. Never re-explain context the agent already saw.
Switch personas mid-task: /summon crew-qa write tests for auth.ts. Six specialists (QA, backend, frontend, security, docs, fixer) with persona-specific prompts and filtered tool sets. No new session — stays in context.
For frontend tasks, the L2 planner auto-generates a full design system: color tokens, typography, spacing, component patterns, accessibility rules, dark/light themes. Frontend workers reference it for consistent UI.
Every other AI coding CLI runs a blind loop: prompt the model, execute its tool call, repeat until it says "done." There's no memory of what failed, no proof that it worked, no feedback for next time. crew-cli is different. It wraps every task in a quality-aware engine that prevents common failure modes, demands verification, and improves over time.
Without execution quality controls, AI coding agents fail in predictable ways: they retry the same broken command in a loop, they declare "done" without running tests, they edit files they never read, they waste turns exploring instead of acting. These aren't model problems — they happen with GPT-5, Claude Opus, and Gemini alike. The engine fixes them at the runtime level so every model performs better.
When a tool call fails, the engine remembers. Same command with same params? Blocked after one failure for shell commands, two for others. The model gets "don't repeat this" context every turn, forcing different approaches. In benchmarks this eliminated the #1 waste pattern: models retrying npm install 4x in a row.
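The blocking rule can be sketched as a counter keyed by tool and parameters. The limits (one failure for shell, two for other tools) come from the description above; the class, method names, and the `"shell"` tool identifier are hypothetical.

```typescript
// Limits from the description: one strike for shell, two for other tools.
const MAX_FAILURES = { shell: 1, other: 2 };

class FailureMemory {
  private failures = new Map<string, number>();

  // Identical tool + params produce the same key.
  // Note: JSON.stringify is sensitive to property order.
  private key(tool: string, params: unknown): string {
    return `${tool}:${JSON.stringify(params)}`;
  }

  recordFailure(tool: string, params: unknown): void {
    const k = this.key(tool, params);
    this.failures.set(k, (this.failures.get(k) ?? 0) + 1);
  }

  isBlocked(tool: string, params: unknown): boolean {
    const limit = tool === "shell" ? MAX_FAILURES.shell : MAX_FAILURES.other;
    return (this.failures.get(this.key(tool, params)) ?? 0) >= limit;
  }
}
```

After one failed `npm install`, the identical shell call is blocked, while a different command or different parameters still goes through.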
The engine extracts verification goals from your task — "tests pass", "build succeeds", "lint clean" — and tracks them as first-class state. When the model says "done" with unproven goals, it doesn't stop. It gets forced back with up to 3 extra turns demanding proof. No more "I think I fixed it" without evidence.
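As a minimal sketch of goal tracking: the three goal names and the cap of 3 extra turns come from the text above, while the keyword matching and the function names are assumptions. Real goal extraction is presumably more sophisticated than a keyword match.

```typescript
// Goal names quoted in the text; the matching rules are an assumption.
const GOAL_PATTERNS: Array<[string, RegExp]> = [
  ["tests pass", /\btests?\b/i],
  ["build succeeds", /\bbuild\b/i],
  ["lint clean", /\blint\b/i],
];

// Derive verification goals from the wording of the task.
function extractGoals(task: string): string[] {
  return GOAL_PATTERNS.filter(([, pattern]) => pattern.test(task)).map(([goal]) => goal);
}

const MAX_EXTRA_TURNS = 3;

// Decide what happens when the model claims it is finished.
function onDone(unproven: string[], extraTurnsUsed: number): "stop" | "demand_proof" {
  if (unproven.length === 0) return "stop";
  return extraTurnsUsed < MAX_EXTRA_TURNS ? "demand_proof" : "stop";
}
```

The key property is that "done" is only accepted when the unproven list is empty or the extra-turn budget is spent.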
Each turn, the engine scores 7 action types (read, search, edit, test, build, verify, delegate) based on execution state. Edited without testing? Test scores highest. Three reads in a row? Penalized. Recent failure on shell commands? "DO NOT retry" warning injected. The model sees ranked priorities, not just a flat list of every tool.
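Here is a small sketch of state-based ranking over the 7 action types. Only the two rules (untested edits boost `test`, three consecutive reads penalize `read`) come from the text; the `ExecState` fields and the numeric weights are invented for illustration.

```typescript
type Action = "read" | "search" | "edit" | "test" | "build" | "verify" | "delegate";

interface ExecState {
  editedSinceLastTest: boolean;
  consecutiveReads: number;
}

// Hypothetical weights; only the two rules themselves come from the text.
function rankActions(state: ExecState): Action[] {
  const scores: Record<Action, number> = {
    read: 1, search: 1, edit: 1, test: 1, build: 1, verify: 1, delegate: 1,
  };
  if (state.editedSinceLastTest) scores.test += 5;   // untested edits push testing to the top
  if (state.consecutiveReads >= 3) scores.read -= 2; // discourage endless exploration
  const actions = Object.keys(scores) as Action[];
  return actions.sort((a, b) => scores[b] - scores[a]);
}
```

With untested edits and three reads in a row, `test` ranks first and `read` ranks last.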
Not all tasks should be approached the same way. The engine auto-detects 5 modes — bugfix, feature, refactor, test repair, analysis — and applies mode-specific strategies. Bugfixes reproduce first then fix. Refactors run typecheck before declaring done. Analysis tasks don't make speculative edits.
Every file edit is evaluated in real time: was the file read before writing? Is the same file being churned? Are edits staying in scope? Is verification missing after changes? The critic injects quality guidance into the next turn — zero extra LLM calls, zero latency cost.
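Two of those checks, read-before-write and file churn, need nothing more than a pass over the event log, which is why they cost no extra LLM calls. The sketch below is illustrative: `FileEvent`, `critique`, and the churn threshold of 3 are assumptions, and it does not special-case newly created files.

```typescript
interface FileEvent {
  kind: "read" | "write";
  path: string;
}

// Return guidance strings for the next turn. The churn threshold is an assumption.
function critique(events: FileEvent[], churnLimit = 3): string[] {
  const read = new Set<string>();
  const writes = new Map<string, number>();
  const notes: string[] = [];
  for (const event of events) {
    if (event.kind === "read") {
      read.add(event.path);
      continue;
    }
    if (!read.has(event.path)) notes.push(`${event.path} was edited before being read`);
    const count = (writes.get(event.path) ?? 0) + 1;
    writes.set(event.path, count);
    if (count === churnLimit) notes.push(`${event.path} has been rewritten ${count} times`);
  }
  return notes;
}
```

The returned notes would be appended to the next prompt as guidance rather than triggering a separate review call.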
The action ranking system learns from past runs. High-scoring trajectories boost the weights of their dominant action patterns. Low-scoring runs penalize theirs. Over time, the engine calibrates to your codebase and the models you use. The feedback loop is automatic — every completed task improves the next one.
Instead of flattening tool results into text (losing context each turn), the engine preserves rich state: which files were read vs written, what goals are active, what failed and why. This survives context compaction — the model always knows what matters even when older turns are summarized.
When work splits into parallel units, the engine scores each specialist against the task: language match, complexity, historical success rate, recent failures. Bug fixes route to the fixer. Docs route to the writer. Performance data improves routing over time.
These models pass our full quality benchmark solo: correct TypeScript, all tests passing, typecheck clean, no regressions. Includes 2 free models via Ollama cloud.
| Model | Provider | ~Cost/Task | Result |
|---|---|---|---|
| Claude Opus 4.6 | Anthropic | ~$0.07 | All tests pass |
| Qwen 3.5 (397B) | Ollama cloud | FREE | All tests pass |
| GPT-5.4 | OpenAI | ~$0.13 | All tests pass |
| Claude Sonnet 4.6 | Anthropic | ~$0.06 | All tests pass |
| GLM-5.1 | Ollama cloud | FREE | All tests pass |
Task decomposition quality — correct dependency chains, diverse persona assignments, acceptance criteria, verification checks. Cheap models match expensive ones because the prompt engineering does the work.
| Model | Score | Units | ~$/Plan |
|---|---|---|---|
| Claude / GPT-5.4 (OAuth) | 90 | 9-12 | $0 |
| GPT-OSS 20B (Groq) | 90 | 15 | $0.003 |
| Gemini 2.5 Flash Lite | 90 | ~10 | $0.004 |
| DeepSeek Reasoner | 90 | ~10 | $0.004 |
| Grok 3 Mini / Qwen3-32B | 90 | ~10 | $0.005 |
| GLM-5 (Zen) | 90 | 16 | $0.02 |
| Kimi K2.5 (Zen) | 90 | ~10 | $0.015 |
crew-cli is built for the complexity of professional software development, not just small demos.
| Feature | crew-cli | Claude Code | Codex CLI | Gemini CLI | Cursor |
|---|---|---|---|---|---|
| Execution Quality Engine | ✅ 8 Modules | ❌ Simple Loop | ❌ Simple Loop | ❌ Simple Loop | ❌ Simple Loop |
| Multi-model routing | ✅ 10+ Providers | ❌ Anthropic Only | ❌ OpenAI Only | ❌ Google Only | ✅ Native |
| Multimodal (Images) | ✅ All Providers | ✅ Claude Vision | ❌ Text Only | ✅ Gemini Vision | ✅ Native |
| Built-in Tools | ✅ 45+ Tools | ✅ ~15 Tools | ✅ ~10 Tools | ✅ ~12 Tools | ✅ ~20 Tools |
| Sandbox + Branching | ✅ Professional | ❌ Direct Write | ✅ Sandbox | ❌ Direct Write | ❌ Passive |
| Parallel Dispatch | ✅ 21 Specialists | ✅ Subagents | ❌ Single Agent | ❌ Single Agent | ✅ Subagents |
| Agent Memory | ✅ Cross-Session | ❌ Per-Session | ❌ Per-Session | ✅ Gems | ❌ Per-Session |
| Diagnostic Lint-Loop | ✅ Parsed Errors | ❌ Manual | ❌ Manual | ❌ Manual | ✅ loop_on_lints |
| Browser Debugging | ✅ Headless Chrome | ❌ No UI Vision | ❌ No | ❌ No | ❌ Passive |
| Cost Tracking | ✅ Per-Session | ✅ Integrated | ❌ No | ❌ No | ❌ No Granularity |
| Streaming Output | ✅ All Providers | ✅ Native | ✅ Native | ✅ Native | ✅ Native |
| Diagnostics CLI | ✅ crew doctor | ❌ No | ❌ No | ❌ No | ❌ No |
| Session Memory | ✅ Persistent | ✅ Per-Conversation | ❌ Stateless | ✅ Gems | ❌ Per-Session |