How these coding agents are actually built
Under the surface, these repos cluster into a few distinct shapes: bespoke terminal runtimes, provider multiplexers, protocol bridges, and orchestration frameworks. Their tool design choices tell you which camp they belong to.
The four architecture families
Bespoke terminal kernels
Claude Code, Crush, and Codex feel like software products first. Tool schemas, permissions, terminal UX, and recovery behavior are baked into the runtime rather than bolted on through a generic orchestration layer. Crush is the only one to ship native LSP integration (diagnostics and symbol references) and Sourcegraph code search as first-class built-in tools. Codex stands out as a Rust-native binary built around a Ratatui TUI, organized as a 70+ crate Cargo workspace, with platform-specific sandboxes (Seatbelt on macOS, bubblewrap on Linux, and Windows tokens).
Provider-matrix CLIs
Mux, Neovate, and Qwen Code invest heavily in model registries, provider catalogs, and config resolution. The runtime exists partly to normalize many providers into one interface.
Protocol-heavy adapters
Pochi and Kimi CLI are notable for bridge code: MCP adapters, ACP translation, vendor packages, and importers from adjacent ecosystems.
Sandbox and orchestration platforms
DeerFlow and OpenHands focus on multi-step execution environments. They care as much about app servers, middleware, sandboxes, and delegated work as about a single interactive CLI session.
Self-improving multi-platform agents
Hermes (Nous Research) is in a category of its own: persistent skill learning, multi-platform messaging gateways, RL training pipelines, and a MoA synthesis tool. Not a coding-only agent, but the most feature-rich architecture in the set.
How tool calls are represented
| Repo | Tool representation | What stands out | Editing style |
|---|---|---|---|
| Claude Code | Typed internal tool modules with schemas, metadata, UI, and permission hooks | Tools are product features, not just JSON functions. There are dedicated plan, task, worktree, slash-command, and MCP surfaces. | Bespoke file tools and patch flows under a large central runtime |
| Crush | Go implementations paired with self-describing tool docs | Feels handcrafted. The tooling layer is readable, cohesive, and consistent with the terminal product. | Custom file, shell, and LSP-oriented operations |
| Qwen Code | Declarative tools translated into provider-facing function declarations | The tool system is cleanly separated from model config, confirmation rules, and MCP discovery. | Structured edit tools plus guarded shell execution |
| Neovate Code | AI SDK-style tools with strong typing and bash guards | Its bash tool is unusually opinionated about banned commands, substitutions, and risky patterns. | CLI tools with explicit safety checks and MCP conversion |
| Pochi | Built-in tool map first, MCP tools second | Mixes direct tools, background jobs, diff application, and vendor-specific agent packages. | Search/replace and diff-centric editing with worktree workflows |
| Kimi CLI | Python tool modules and ACP conversion blocks | Notable bridge layer that converts internal diffs and tool results into ACP-friendly structures. | Terminal and edit tools shaped for protocol export |
| DeerFlow | LangGraph/LangChain tool assembly from config, MCP, and subagent sources | Tools are part of a harness. Middleware and role configuration matter as much as the tools themselves. | Sandboxed file tools plus delegated subagent execution |
| OpenHands | Legacy function-calling tools mapped into actions | The local repo still shows classic CodeAct function tooling, but it is not the full story for the newer system. | Sandbox actions such as bash execution and string-replace editing |
| Mux | Broader workspace/runtime orchestration around provider-aware agents | Less obviously tool-schema-centric in the docs I read; more focused on workspaces, execution environments, and provider routing. | Workspace-driven execution with desktop/browser product framing |
| Hermes | Python tool modules with COMMAND_REGISTRY autodiscovery pattern | A single CommandDef list auto-derives CLI autocomplete, Telegram menu, Slack subcommand map, and gateway help simultaneously. MoA and Delegate tools are unique in this set. | Standard read/write/shell tools plus skill_manager, MoA synthesis, sub-agent delegation |
| Pi Mono | TypeBox JSON schemas with AJV validation, 7 core tools | Multi-disjoint edits per call, file mutation queue, fuzzy matching (Unicode normalization), uniqueness validation. Extensions can register custom tools with full TypeBox schemas. | Precise text replacement (not diffs) with reverse-order application, line ending preservation, BOM handling |
| Codex | codex-tools crate with ToolSpec/ToolDefinition, JSON Schema via schemars, 20+ built-in tools | Rust-native tool definitions derive JSON Schema automatically. The tool registry is tightly integrated with the Ratatui TUI and permission model. | Structured tool calls with Rust type safety, integrated with shell and file operations |
The big divider is whether tools are treated as a neutral transport format or as a first-class product surface. Claude Code and Crush are on the product side of that line.
Shell and CLI execution
| Repo | Shell model | Background support | Guardrails |
|---|---|---|---|
| Claude Code | Rich shell and process tooling inside the main runtime | Yes | Permission modes, explicit tool policies, and runtime-level orchestration |
| Neovate Code | Bash tool with timeout, truncation, and risk checks | Some long-running cases | Banned commands, command-substitution checks, and high-risk detection |
| Qwen Code | Shell tool with command parsing and read-only detection | Yes | Permission decisions and shell classification before execution |
| Pochi | Command execution plus explicit background-job support | Yes | Separate tools for foreground commands and long-running jobs |
| Kimi CLI | Fresh shell-oriented tool calls plus shell mode UX | Yes, via task management | Task-oriented UX and protocol-aware terminal capability handling |
| Crush | Native Go shell service integrated with permissions and TUI | Yes | Product-level permissions, service boundaries, and custom runtime control |
| DeerFlow | Sandboxed bash-like operations inside a harness | Yes | Middleware plus sandbox abstractions keep execution constrained |
| OpenHands | Sandbox bash actions exposed to the agent | Yes | Isolation comes from the runtime sandbox more than the tool schema itself |
| Mux | Workspace and runtime execution, often closer to a persistent environment | Yes | Provider routing is central; shell policy is less front-and-center in the docs than in Neovate or Claude |
| Hermes | Shell tool plus 6 remote execution backends: local, Docker, SSH, Daytona, Modal, Singularity | Yes | Tirith binary verifies execution environment authenticity via SHA-256 + cosign provenance before runs |
| Pi Mono | Pluggable BashOperations interface with streaming output, detached process trees | Yes (via detached processes + killProcessTree) | Extension-based (BashSpawnHook), commandPrefix option, no built-in bans; security is the user's responsibility (container or extension) |
| Codex | zsh-fork backend (macOS), Unix escalation (others), execpolicy rule engine | Yes | Platform-specific sandboxes: Seatbelt (macOS), bubblewrap (Linux), Windows tokens; execpolicy rule engine for fine-grained command authorization |
Shell security depth comparison
The agents vary enormously in how much engineering goes into preventing the bash tool from doing damage. Here is the full spectrum from most to least defensive:
Claude Code: Tree-sitter AST analysis and Zsh attack catalog
Claude Code imports a tree-sitter shell grammar to parse command ASTs before execution. The `bashSecurity.ts` file contains a catalog of Zsh-specific attack vectors:
| Attack pattern | What it does | How detected |
|---|---|---|
| `zmodload` | Loads Zsh modules, including network, file-descriptor, and cryptographic modules | Command name prefix match |
| `emulate -c` | Evaluates code in a sub-shell with an emulated environment (effectively `eval`) | Flag pattern match |
| `sysopen` / `sysread` / `syswrite` | Low-level file-descriptor operations from the `zsh/system` module | Command name match |
| `=cmd` (EQUALS expansion) | Resolves to the full path of `cmd`, bypassing binary-name blocklists | AST token shape |
| `<()` / `>()` | Zsh process substitution; creates anonymous FIFOs | AST subtree match |
| `$()` / backtick substitution | Classic command substitution, caught via tree-sitter rather than regex | AST node type |
| `<#` | A PowerShell-style comment, unexpected in a bash context; flags context confusion | Token match |
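The catalog can be illustrated with a simplified checker. Note the heavy caveat: Claude Code matches AST node shapes via tree-sitter, while this sketch only does naive string and regex matching, so it captures the catalog's contents rather than the real detection quality. All function and constant names here are hypothetical.

```python
import re

# Simplified catalog of the Zsh attack vectors from the table above.
# The real implementation matches tree-sitter AST shapes, not substrings.
CHECKS = [
    ("zmodload", lambda c: c.lstrip().startswith("zmodload")),
    ("emulate -c", lambda c: bool(re.search(r"\bemulate\b.*\s-c\b", c))),
    ("zsh/system ops", lambda c: bool(re.search(r"\b(sysopen|sysread|syswrite)\b", c))),
    ("EQUALS expansion", lambda c: bool(re.search(r"(^|\s)=\w", c))),
    ("process substitution", lambda c: "<(" in c or ">(" in c),
    ("command substitution", lambda c: "$(" in c or "`" in c),
    ("PowerShell comment", lambda c: "<#" in c),
]

def flag_zsh_attacks(command: str) -> list[str]:
    """Return the names of catalog entries a command trips."""
    return [name for name, check in CHECKS if check(command)]
```

The interesting design point is that several of these (EQUALS expansion, process substitution) only exist in Zsh, so a checker built purely for POSIX sh semantics would miss them.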
Neovate Code: Quote-aware pipeline parser
Neovate's bash tool uses a character-level state machine to handle quoting correctly before any security check:
```ts
// State machine tracks: inSingleQuote, inDoubleQuote, escaping
// splitPipelineSegments() respects quoting so 'echo "a|b"' is ONE segment
// hasCommandSubstitution() tracks the same states to find $() and backticks
```
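The splitter's shape is easy to reproduce. Here is a Python sketch of the same state machine (Neovate's real implementation is TypeScript; the function name is borrowed from the comment above, the rest is illustrative):

```python
def split_pipeline_segments(command: str) -> list[str]:
    """Split a shell command on unquoted '|' characters.

    Tracks single quotes, double quotes, and backslash escapes so that
    'echo "a|b"' stays a single segment, as described above.
    """
    segments, current = [], []
    in_single = in_double = escaping = False
    for ch in command:
        if escaping:                          # char after a backslash
            current.append(ch)
            escaping = False
        elif ch == "\\" and not in_single:    # backslash escapes outside ''
            current.append(ch)
            escaping = True
        elif ch == "'" and not in_double:     # toggle single-quote state
            current.append(ch)
            in_single = not in_single
        elif ch == '"' and not in_single:     # toggle double-quote state
            current.append(ch)
            in_double = not in_double
        elif ch == "|" and not in_single and not in_double:
            segments.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    segments.append("".join(current).strip())
    return [s for s in segments if s]
```

Running the security checks per segment, after this split, is what lets the banned-command check see `grep` and `curl` as separate commands in `grep x | curl …`.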
The hard-coded banned commands include some surprises beyond the obvious: `aria2c`, `axel`, `curlie`, `http-prompt`, `httpie`, `links`, `lynx`, `w3m`, and `xh` (all web fetchers), plus the shell alternatives `bash`, `sh`, `fish`, and `zsh`, and the dangerous utility trio `nc`, `telnet`, and `eval`.
Hermes: Binary provenance verification
Hermes takes a different approach: rather than blocking specific commands, it verifies the execution environment. The Tirith security binary is downloaded from GitHub and its SHA-256 hash is verified against a known-good value. On supported platforms, cosign provenance attestation is also checked. This is supply-chain security, not just command blocking.
Crush: Product-level permission gates
Crush integrates a permission system directly into its TUI. Before executing dangerous operations, the user sees a permission prompt. This is a UX-level defense rather than a parse-level one, which is appropriate for an interactive tool where the user is present.
Loop and stuck detection
What happens when an agent is spinning its wheels? Most agents don't have explicit detection. Two in this set make it a dedicated subsystem (DeerFlow's LoopDetectionMiddleware, covered later, is a close third):
Crush: SHA-256 tool signature hashing
For each step, Crush computes a signature by SHA-256 hashing the concatenation of `tool_name + "\x00" + tool_input + "\x00" + tool_output` for every tool call in that step. It slides a window over the last 10 steps. If any signature appears more than 5 times in that window, the agent is halted as stuck.
This is robust: calling the same tool with different arguments gets a different hash. Calling it with the same arguments but getting different output (e.g., due to a flaky command) also gets a different hash. Only genuine repetition triggers the halt.
```go
// internal/agent/loop_detection.go
windowSize = 10 // last N steps to check
maxRepeats = 5  // halt if any signature appears this many times
```
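The mechanism fits in a few lines. A Python sketch of the same idea (Crush's real code is Go; the byte layout and the exact repeat comparison may differ slightly from this approximation):

```python
import hashlib
from collections import Counter

WINDOW_SIZE = 10  # last N steps to check
MAX_REPEATS = 5   # halt if any signature appears this many times

def step_signature(tool_calls: list[tuple[str, str, str]]) -> str:
    """Hash name\\x00input\\x00output for every tool call in a step."""
    h = hashlib.sha256()
    for name, tool_input, tool_output in tool_calls:
        h.update(f"{name}\x00{tool_input}\x00{tool_output}".encode())
    return h.hexdigest()

def is_stuck(step_signatures: list[str]) -> bool:
    """Slide a window over the last N step signatures and halt on repeats."""
    window = step_signatures[-WINDOW_SIZE:]
    return any(n >= MAX_REPEATS for n in Counter(window).values())
```

Because the hash covers input *and* output, a flaky command that returns different output each time never looks like a loop; only byte-identical repetition trips the halt.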
Hermes: Per-session trajectory tracking
Hermes exports full trajectory data as `<tool_call>` XML tags wrapping JSON, a format that matches Nous Research's Hermes model fine-tuning data. This trajectory data can be used post-hoc in RL training to reward or penalize specific action sequences.
The context compressor's iterative re-compression also detects when the same summary is being generated repeatedly (diminishing new information), allowing the RL training signal to identify "summary convergence" as a proxy for being stuck on a task.
Why most agents don't have this
Most agents rely on the model to notice it is repeating itself. Crush's explicit detection is a sign of production experience: models sometimes don't notice, and users definitely don't want to watch 50 identical tool calls scroll by.
Context compression strategies
When context windows fill up, agents need to compress history. The strategies here range from no-op to sophisticated:
| Agent | Strategy | Key detail |
|---|---|---|
| Hermes | 5-step algorithmic compression with structured summaries | Prune old tool results β protect head β protect tail by token budget β LLM summarize middle β iterative update on re-compression. Summaries include Goal, Progress, Decisions, Files, Next Steps sections. Summary ratio: 20% of compressed content, max 12K tokens. |
| Neovate Code | Trigger-ratio compaction + separate pruning phase | Compaction fires at configurable triggerRatio (% of model context limit). Separate pruning config with protectedTools, protectTurns, minimumPrune. autoContinue mode resumes automatically after compaction. |
| Claude Code | Rolling context with auto-compaction | Built into the runtime; integrates with plan mode and memory retrieval. Less configurable than Neovate but tightly integrated with the product experience. |
| Qwen Code | Config-driven rolling/summarization | Configurable via user settings. Less implementation detail visible in public snapshot than Neovate or Hermes. |
| DeerFlow | LangGraph checkpointer-based state management | State is checkpointed per run in the LangGraph store. Thread resumption reloads only what the graph needs, rather than replaying the full conversation history. |
| Others | Provider-default truncation | Most CLIs rely on the model provider to handle context limits, either by truncating oldest messages or raising a context-length error. |
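The "protect head, protect tail by token budget, summarize the middle" shape that Hermes and Neovate share can be sketched compactly. Everything below is illustrative: `summarize` stands in for the LLM call, and the parameter names and defaults are assumptions, not either project's real configuration.

```python
def compress_history(messages, count_tokens, summarize,
                     head_keep=2, tail_budget=4000):
    """Keep the first head_keep messages, keep as many trailing messages
    as fit tail_budget tokens, and replace the middle with a summary."""
    head = messages[:head_keep]
    # Walk backwards from the end, protecting the tail by token budget.
    tail, used = [], 0
    for msg in reversed(messages[head_keep:]):
        cost = count_tokens(msg)
        if used + cost > tail_budget:
            break
        tail.insert(0, msg)
        used += cost
    middle = messages[head_keep:len(messages) - len(tail)]
    if not middle:
        return messages  # nothing to compress yet
    return head + [summarize(middle)] + tail
```

Hermes layers more on top of this skeleton (structured Goal/Progress/Decisions sections, iterative re-summarization, a 20% / 12K-token summary cap), but the head/tail protection is the load-bearing part: the system prompt and the most recent turns survive verbatim.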
Agent state machines and retry strategies
How agents handle the "what to do next" decision after each turn reveals their architectural maturity. Four agents in this set have notable, explicitly engineered transition and recovery logic:
Claude Code: 10 terminal states, 8 continue reasons
The query engine (`src/query/transitions.ts`) is a named state machine with explicit exit reasons. Terminal exits: `completed`, `blocking_limit`, `image_error`, `model_error`, `aborted_streaming`, `aborted_tools`, `prompt_too_long`, `stop_hook_prevented`, `hook_stopped`, `max_turns`.
Continue reasons (agent loops back): `tool_use`, `reactive_compact_retry`, `max_output_tokens_recovery`, `max_output_tokens_escalate`, `collapse_drain_retry`, `stop_hook_blocking`, `token_budget_continuation`, `queued_command`.
Each is a distinct named transition, not a generic "keep going" flag.
The token budget fires at `COMPLETION_THRESHOLD = 0.9`; if the per-check delta falls below `DIMINISHING_THRESHOLD = 500` tokens three times, the agent is considered complete.
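The diminishing-returns check is worth a sketch. The constants come from the text above; the class shape, the method name, and the assumption that the three small deltas must be consecutive are all mine, not Claude Code's actual implementation:

```python
COMPLETION_THRESHOLD = 0.9    # budget fraction at which checking begins
DIMINISHING_THRESHOLD = 500   # tokens; deltas below this count as stalls
STALL_LIMIT = 3               # assumed: consecutive stalls => complete

class TokenBudgetMonitor:
    """Flag completion when output growth stalls near the token budget."""

    def __init__(self, budget: int):
        self.budget = budget
        self.last_total = 0
        self.stalls = 0

    def record(self, total_tokens: int) -> bool:
        """Return True when the agent should be considered complete."""
        if total_tokens < self.budget * COMPLETION_THRESHOLD:
            self.last_total = total_tokens
            return False  # still well under budget; no checking yet
        delta = total_tokens - self.last_total
        self.last_total = total_tokens
        self.stalls = self.stalls + 1 if delta < DIMINISHING_THRESHOLD else 0
        return self.stalls >= STALL_LIMIT
```

The intuition: near the budget ceiling, an agent producing fewer than 500 new tokens per check is coasting, and three such checks in a row is treated as "done" rather than letting it drift into a hard truncation.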
OpenHands: Temperature bumping on dead LLM responses
OpenHands' retry mixin (`openhands/llm/retry_mixin.py`) uses the tenacity library with a documented, intentional quirk: on `LLMNoResponseError` when temperature is 0, it automatically sets `temperature = 1.0` on the next attempt.
The reasoning: a fully deterministic model (temp=0) that returns nothing is stuck in a degenerate fixed point and will return nothing again. Adding randomness breaks the loop. This is one of the more thoughtful LLM retry patterns in the set: it adapts the request rather than just retrying identically.
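The pattern itself is small. A sketch of the idea without the tenacity machinery (OpenHands' real version is a tenacity-based mixin; the function and exception names here only mirror what the text describes):

```python
class LLMNoResponseError(Exception):
    """Raised when the model returns an empty completion."""

def call_with_retry(call_llm, temperature=0.0, max_attempts=3):
    """Retry an LLM call, bumping temperature 0 -> 1.0 on empty responses.

    A deterministic request that produced nothing will produce nothing
    again, so the retry changes the request instead of repeating it.
    """
    for attempt in range(max_attempts):
        try:
            return call_llm(temperature=temperature)
        except LLMNoResponseError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            if temperature == 0.0:
                temperature = 1.0  # break the degenerate fixed point
```

The general lesson transfers to any LLM retry path: if the failure is deterministic, an identical retry is wasted spend.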
DeerFlow: 200-line buckets to prevent false loop detection
DeerFlow's `LoopDetectionMiddleware` hashes tool name + input + output to detect repetition. But for `read_file`, line numbers are bucketed into 200-line groups before hashing, because paginated reads of the same file would otherwise trip the naive algorithm despite being legitimate progress.
On warn (3 repeats): inject a `HumanMessage` telling the model it is repeating itself and should wrap up. On hard limit (5 repeats): strip `tool_calls` entirely from the response, forcing a plain-text answer and ending the loop definitively.
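A sketch of the bucketed signature (the argument names `start_line`/`end_line` and the payload layout are illustrative; only the tool-name+input+output hashing and the 200-line bucketing come from the description above):

```python
import hashlib

BUCKET = 200  # line numbers grouped so paginated reads hash coarsely

def loop_signature(tool: str, args: dict, output: str) -> str:
    """Hash a tool call for loop detection, bucketing read_file lines."""
    if tool == "read_file":
        args = dict(args)  # don't mutate the caller's args
        for key in ("start_line", "end_line"):
            if key in args:
                args[key] = args[key] // BUCKET
    payload = f"{tool}|{sorted(args.items())}|{output}"
    return hashlib.sha256(payload.encode()).hexdigest()
```

The trade-off is explicit: reads within one 200-line bucket are treated as repetition, while reads of genuinely new regions still get distinct signatures.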
Qwen Code: Truncation recovery with Levenshtein validation
When the model's response is truncated mid-tool-call, Qwen Code (`coreToolScheduler.ts`) injects `TRUNCATION_PARAM_GUIDANCE` ("your previous response was truncated due to max_tokens…") to ask the model to retry, and returns `TRUNCATION_EDIT_REJECTION` ("tool call has been rejected to prevent writing truncated content") for any edit tool whose output is incomplete.
The scheduler imports both `diff` and `fast-levenshtein` to verify that proposed file edits are not corrupted: if a diff-patch looks syntactically valid but the Levenshtein distance between the "before" and "after" is implausible, the edit is rejected. `modifiable-tool.ts` additionally lets tool calls be edited in-flight by the user before execution.
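To make the plausibility idea concrete, here is a sketch in Python: a plain dynamic-programming edit distance (standing in for the `fast-levenshtein` npm package) plus a hypothetical rejection rule. The 0.8 ratio threshold and the rule itself are my assumptions; Qwen Code's actual threshold logic is not described in the source.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_is_plausible(before: str, after: str, max_ratio: float = 0.8) -> bool:
    """Hypothetical rule: reject edits that rewrite nearly the whole file,
    a proxy for an edit assembled from truncated model output."""
    if not before:
        return True  # new file; nothing to compare against
    return levenshtein(before, after) / max(len(before), len(after)) <= max_ratio
```

A truncated write that replaces a 2,000-line file with a 40-line stub has a near-1.0 distance ratio, which is exactly the shape this kind of guard is meant to catch.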
Hooks and lifecycle events
Tool hooks, code that runs before and after tool calls, let external systems observe or modify agent behavior without patching the core. Three agents in this set have them as first-class concepts:
Kimi CLI: Three-event hooks system
The `hooks/events.py` module defines three events per tool call:
- `pre_tool_use(tool_name, input)` - runs before the tool; can modify or block the call
- `post_tool_use(tool_name, input, result)` - runs after success; can observe output
- `post_tool_use_failure(tool_name, input, error)` - runs on failure; allows custom error handling
This hooks API enables telemetry, authorization checks, result caching, and test mocking without touching tool implementation code.
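The dispatch shape is simple enough to sketch. Registration and the exact hook signatures below are illustrative, not Kimi CLI's actual API; only the three events and their ordering come from the description above:

```python
class ToolHooks:
    """Minimal pre/post/post-failure hook dispatcher around a tool call."""

    def __init__(self):
        self.pre, self.post, self.post_failure = [], [], []

    def run_tool(self, tool_name, tool_input, tool_fn):
        # pre_tool_use: each hook may rewrite (or validate) the input.
        for hook in self.pre:
            tool_input = hook(tool_name, tool_input)
        try:
            result = tool_fn(tool_input)
        except Exception as err:
            # post_tool_use_failure: observe the error, then re-raise.
            for hook in self.post_failure:
                hook(tool_name, tool_input, err)
            raise
        # post_tool_use: observe the successful result.
        for hook in self.post:
            hook(tool_name, tool_input, result)
        return result
```

With this shape, telemetry is a `post` hook, authorization is a `pre` hook that raises, and test mocking replaces `tool_fn`, all without touching the tool implementation.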
Hermes: Security scan hooks on skill saves
Hermes does not have a general hooks API, but it has a specific security-scan hook on the skill learning path: before any `SKILL.md` file is written to `~/.hermes/skills/`, the content is scanned for prompt-injection patterns and invisible Unicode. Similarly, `MEMORY.md` and `USER.md` are scanned on every load.
This is a security-specific hooks pattern rather than a general lifecycle API.
Pi Mono: 20+ lifecycle events via extensions
Pi's extension system exposes 20+ events via the `ExtensionAPI`: `agent_start`, `agent_end`, `tool_call`, `tool_result`, `beforeToolCall`, `afterToolCall`, `message_start`/`end`, `turn_start`/`end`, `session_start`/`compact`/`tree`, `model_select`, `before_provider_request`, and more.
Extensions can block, modify, or augment any tool execution. Event handlers return promises that are awaited in order, enabling synchronous interception. Custom tools, providers, renderers, widgets, and status lines are all registered through the same event-driven API.
MCP, ACP, and bridge strategy
MCP leaders
Claude Code, Qwen Code, Neovate, Pochi, and DeerFlow all show serious MCP handling. The difference is emphasis:
- Claude treats MCP as part of a larger integrated runtime.
- Qwen and Neovate handle discovery, connection state, and health more explicitly.
- Pochi blends MCP with vendor and agent ecosystem imports.
- DeerFlow folds MCP into a composable extensions system.
ACP specialist
Kimi CLI is the clearest protocol-bridge project in this set. Its ACP conversion layer is not decorative; it maps internal edits, terminal calls, and result blocks into transportable protocol content. The docs even call out current feature gaps, such as the missing `session/set_mode` and `session/set_model`.
Bidirectional MCP
Codex supports MCP in both directions: as a client via the `codex-mcp` crate (connecting to external MCP servers), and as a server via `codex-mcp-server`, exposed through the `codex mcp-server` CLI command. This lets Codex both consume external MCP tools and expose its own tools to other MCP clients.
OpenHands is a qualified case
The local repo contains architecture, runtime, sandbox, and legacy tool/action patterns, but the project itself says the main V1 agentic core moved elsewhere. Any MCP or tool comparison for OpenHands should be read as a snapshot of the local repo, not the whole current product story.
Error handling and recovery patterns
Claude Code
Recovery is systemic: permissions, retries, tool-specific controls, command systems, and a dedicated runtime all participate. It feels designed around failure as a normal operating condition.
Neovate Code
Strong at pre-emptive failure avoidance. Its shell code tries hard to stop dangerous or malformed commands before they ever run, and its MCP manager tracks retries and connection state explicitly.
Qwen Code
Particularly good at configuration and state management. Model resolution has source precedence, runtime snapshots, and rollback-ish handling that makes the system easier to reason about.
DeerFlow
Uses middleware to absorb failure modes: clarification interrupts, dangling tool-call handling, summarization, and subagent limits. That is framework-style robustness rather than CLI-style robustness.
Crush
Calm, explicit Go-style error boundaries. Provider metadata caching and service organization make failure paths easier to trace than in sprawling dynamic runtimes.
Kimi CLI
More honest than flashy: the ACP docs document current limitations, which is often a sign of a healthy protocol mindset rather than a polished marketing wrapper.
Pi Mono
Error handling is multi-layered: tool-level errors return `isError: true` to the LLM for self-retry, abort signals trigger graceful cleanup (temp files closed, partial results returned), compaction failures auto-retry, and extension errors are caught and emitted via `emitError()` without crashing the agent. The file mutation queue prevents concurrent write races entirely.
Codex
Strict clippy configuration denies `unwrap_used` and `expect_used` across the workspace, forcing explicit error handling at compile time. Runtime recovery includes a 300-second idle timeout and up to 5 stream retries before giving up.
What architecture tells you about product intent
The repos that feel the strongest are the ones whose architecture matches their promise. Claude Code promises an elite terminal coding partner, and the repo looks like a full custom runtime. Crush promises a serious CLI product, and the repo looks handcrafted for that. Qwen promises a configurable multi-provider CLI, and its model-resolution machinery backs that up. Hermes promises a self-improving multi-platform agent, and its skills system, RL infrastructure, and gateway adapters all back that up. Pi Mono promises a minimalist extensible kernel, and its 438-file codebase with differential TUI, tree sessions, and Pi Packages delivers exactly that.
The weaker-feeling designs are not necessarily bad; they are often just trying to do a broader or less opinionated job. Pochi is ambitious and eclectic. DeerFlow is powerful but framework-heavy. OpenHands is split across repo boundaries. Kimi prioritizes bridge fidelity over terminal theatrics. Those are different goals, and the code reflects them.