How these coding agents are actually built
Under the surface, these repos cluster into a few distinct shapes: bespoke terminal runtimes, provider multiplexers, protocol bridges, and orchestration frameworks. Their tool design choices tell you which camp they belong to.
The four architecture families
Bespoke terminal kernels
Claude Code, Crush, and Codex feel like software products first. Tool schemas, permissions, terminal UX, and recovery behavior are baked into the runtime rather than bolted on through a generic orchestration layer. Crush is the only one to ship native LSP integration (diagnostics and symbol references) and Sourcegraph code search as first-class built-in tools. Codex stands out as a Rust-native binary built around a Ratatui TUI, organized as a 70+ crate Cargo workspace, with platform-specific sandboxes (Seatbelt on macOS, bubblewrap on Linux, and Windows tokens).
Provider-matrix CLIs
Mux, Neovate, and Qwen Code invest heavily in model registries, provider catalogs, and config resolution. The runtime exists partly to normalize many providers into one interface.
Protocol-heavy adapters
Pochi and Kimi CLI are notable for bridge code: MCP adapters, ACP translation, vendor packages, and importers from adjacent ecosystems.
Sandbox and orchestration platforms
DeerFlow and OpenHands focus on multi-step execution environments. They care as much about app servers, middleware, sandboxes, and delegated work as about a single interactive CLI session.
Self-improving multi-platform agents
Hermes (Nous Research) is in a category of its own: persistent skill learning, multi-platform messaging gateways, RL training pipelines, and a MoA synthesis tool. Not a coding-only agent, but the most feature-rich architecture in the set.
How tool calls are represented
| Repo | Tool representation | What stands out | Editing style |
|---|---|---|---|
| Claude Code | Typed internal tool modules with schemas, metadata, UI, and permission hooks | Tools are product features, not just JSON functions. There are dedicated plan, task, worktree, slash-command, and MCP surfaces. | Bespoke file tools and patch flows under a large central runtime |
| Crush | Go implementations paired with self-describing tool docs | Feels handcrafted. The tooling layer is readable, cohesive, and consistent with the terminal product. | Custom file, shell, and LSP-oriented operations |
| Qwen Code | Declarative tools translated into provider-facing function declarations | The tool system is cleanly separated from model config, confirmation rules, and MCP discovery. | Structured edit tools plus guarded shell execution |
| Neovate Code | AI SDK-style tools with strong typing and bash guards | Its bash tool is unusually opinionated about banned commands, substitutions, and risky patterns. | CLI tools with explicit safety checks and MCP conversion |
| Pochi | Built-in tool map first, MCP tools second | Mixes direct tools, background jobs, diff application, and vendor-specific agent packages. | Search/replace and diff-centric editing with worktree workflows |
| Kimi CLI | Python tool modules and ACP conversion blocks | Notable bridge layer that converts internal diffs and tool results into ACP-friendly structures. | Terminal and edit tools shaped for protocol export |
| DeerFlow | LangGraph/LangChain tool assembly from config, MCP, and subagent sources | Tools are part of a harness. Middleware and role configuration matter as much as the tools themselves. | Sandboxed file tools plus delegated subagent execution |
| OpenHands | Legacy function-calling tools mapped into actions | The local repo still shows classic CodeAct function tooling, but it is not the full story for the newer system. | Sandbox actions such as bash execution and string-replace editing |
| Mux | Broader workspace/runtime orchestration around provider-aware agents | Less obviously tool-schema-centric in the docs I read; more focused on workspaces, execution environments, and provider routing. | Workspace-driven execution with desktop/browser product framing |
| Hermes | Python tool modules with COMMAND_REGISTRY autodiscovery pattern | A single CommandDef list auto-derives CLI autocomplete, Telegram menu, Slack subcommand map, and gateway help simultaneously. MoA and Delegate tools are unique in this set. | Standard read/write/shell tools plus skill_manager, MoA synthesis, sub-agent delegation |
| Pi Mono | TypeBox JSON schemas with AJV validation, 7 core tools | Multi-disjoint edits per call, file mutation queue, fuzzy matching (Unicode normalization), uniqueness validation. Extensions can register custom tools with full TypeBox schemas. | Precise text replacement (not diffs) with reverse-order application, line ending preservation, BOM handling |
| Codex | codex-tools crate with ToolSpec/ToolDefinition, JSON Schema via schemars, 20+ built-in tools | Rust-native tool definitions derive JSON Schema automatically. The tool registry is tightly integrated with the Ratatui TUI and permission model. | Structured tool calls with Rust type safety, integrated with shell and file operations |
The big divider is whether tools are treated as a neutral transport format or as a first-class product surface. Claude Code and Crush are on the product side of that line.
Shell and CLI execution
| Repo | Shell model | Background support | Guardrails |
|---|---|---|---|
| Claude Code | Rich shell and process tooling inside the main runtime | Yes | Permission modes, explicit tool policies, and runtime-level orchestration |
| Neovate Code | Bash tool with timeout, truncation, and risk checks | Some long-running cases | Banned commands, command-substitution checks, and high-risk detection |
| Qwen Code | Shell tool with command parsing and read-only detection | Yes | Permission decisions and shell classification before execution |
| Pochi | Command execution plus explicit background-job support | Yes | Separate tools for foreground commands and long-running jobs |
| Kimi CLI | Fresh shell-oriented tool calls plus shell mode UX | Yes, via task management | Task-oriented UX and protocol-aware terminal capability handling |
| Crush | Native Go shell service integrated with permissions and TUI | Yes | Product-level permissions, service boundaries, and custom runtime control |
| DeerFlow | Sandboxed bash-like operations inside a harness | Yes | Middleware plus sandbox abstractions keep execution constrained |
| OpenHands | Sandbox bash actions exposed to the agent | Yes | Isolation comes from the runtime sandbox more than the tool schema itself |
| Mux | Workspace and runtime execution, often closer to a persistent environment | Yes | Provider routing is central; shell policy is less front-and-center in the docs than in Neovate or Claude |
| Hermes | Shell tool plus 6 remote execution backends: local, Docker, SSH, Daytona, Modal, Singularity | Yes | Tirith binary verifies execution environment authenticity via SHA-256 + cosign provenance before runs |
| Pi Mono | Pluggable BashOperations interface with streaming output, detached process trees | Yes (via detached processes + killProcessTree) | Extension-based (BashSpawnHook), commandPrefix option, no built-in bans; security is the user's responsibility (container or extension) |
| Codex | zsh-fork backend (macOS), Unix escalation (others), execpolicy rule engine | Yes | Platform-specific sandboxes: Seatbelt (macOS), bubblewrap (Linux), Windows tokens; execpolicy rule engine for fine-grained command authorization |
Shell security depth comparison
The agents vary enormously in how much engineering goes into preventing the bash tool from doing damage. Here is the full spectrum from most to least defensive:
Claude Code: Tree-sitter AST analysis and Zsh attack catalog
Claude Code imports a tree-sitter shell grammar to parse command ASTs before execution. The `bashSecurity.ts` file contains a catalog of Zsh-specific attack vectors:
| Attack pattern | What it does | How detected |
|---|---|---|
| `zmodload` | Loads Zsh modules, including network, file-descriptor, and cryptographic modules | Command name prefix match |
| `emulate -c` | Evaluates code in a sub-shell with an emulated environment (effectively `eval`) | Flag pattern match |
| `sysopen` / `sysread` / `syswrite` | Low-level file-descriptor operations from the `zsh/system` module | Command name match |
| `=cmd` (EQUALS expansion) | Resolves to the full path of `cmd`, bypassing binary-name blocklists | AST token shape |
| `<()` / `>()` | Zsh process substitution; creates anonymous FIFOs | AST subtree match |
| `$()` / backtick substitution | Classic command substitution, caught via tree-sitter rather than regex | AST node type |
| `<#` | A PowerShell-style comment, unexpected in a bash context; flags context confusion | Token match |
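The catalog can be illustrated with a simplified checker. Note the heavy caveat: Claude Code matches AST node shapes via tree-sitter, while this sketch only does naive string and regex matching, so it captures the catalog's contents rather than the real detection quality. All function and constant names here are hypothetical.

```python
import re

# Simplified catalog of the Zsh attack vectors from the table above.
# The real implementation matches tree-sitter AST shapes, not substrings.
CHECKS = [
    ("zmodload", lambda c: c.lstrip().startswith("zmodload")),
    ("emulate -c", lambda c: bool(re.search(r"\bemulate\b.*\s-c\b", c))),
    ("zsh/system ops", lambda c: bool(re.search(r"\b(sysopen|sysread|syswrite)\b", c))),
    ("EQUALS expansion", lambda c: bool(re.search(r"(^|\s)=\w", c))),
    ("process substitution", lambda c: "<(" in c or ">(" in c),
    ("command substitution", lambda c: "$(" in c or "`" in c),
    ("PowerShell comment", lambda c: "<#" in c),
]

def flag_zsh_attacks(command: str) -> list[str]:
    """Return the names of catalog entries a command trips."""
    return [name for name, check in CHECKS if check(command)]
```

The interesting design point is that several of these (EQUALS expansion, process substitution) only exist in Zsh, so a checker built purely for POSIX sh semantics would miss them.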
Neovate Code: Quote-aware pipeline parser
Neovate's bash tool uses a character-level state machine to handle quoting correctly before any security check:
```ts
// State machine tracks: inSingleQuote, inDoubleQuote, escaping
// splitPipelineSegments() respects quoting so 'echo "a|b"' is ONE segment
// hasCommandSubstitution() tracks the same states to find $() and backticks
```
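The splitter's shape is easy to reproduce. Here is a Python sketch of the same state machine (Neovate's real implementation is TypeScript; the function name is borrowed from the comment above, the rest is illustrative):

```python
def split_pipeline_segments(command: str) -> list[str]:
    """Split a shell command on unquoted '|' characters.

    Tracks single quotes, double quotes, and backslash escapes so that
    'echo "a|b"' stays a single segment, as described above.
    """
    segments, current = [], []
    in_single = in_double = escaping = False
    for ch in command:
        if escaping:                          # char after a backslash
            current.append(ch)
            escaping = False
        elif ch == "\\" and not in_single:    # backslash escapes outside ''
            current.append(ch)
            escaping = True
        elif ch == "'" and not in_double:     # toggle single-quote state
            current.append(ch)
            in_single = not in_single
        elif ch == '"' and not in_single:     # toggle double-quote state
            current.append(ch)
            in_double = not in_double
        elif ch == "|" and not in_single and not in_double:
            segments.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    segments.append("".join(current).strip())
    return [s for s in segments if s]
```

Running the security checks per segment, after this split, is what lets the banned-command check see `grep` and `curl` as separate commands in `grep x | curl …`.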
The hard-coded banned commands include some surprises beyond the obvious: `aria2c`, `axel`, `curlie`, `http-prompt`, `httpie`, `links`, `lynx`, `w3m`, and `xh` (all web fetchers), plus the shell alternatives `bash`, `sh`, `fish`, and `zsh`, and the dangerous utility trio `nc`, `telnet`, and `eval`.
Hermes: Binary provenance verification
Hermes takes a different approach: rather than blocking specific commands, it verifies the execution environment. The Tirith security binary is downloaded from GitHub and its SHA-256 hash is verified against a known-good value. On supported platforms, cosign provenance attestation is also checked. This is supply-chain security, not just command blocking.
Crush: Product-level permission gates
Crush integrates a permission system directly into its TUI. Before executing dangerous operations, the user sees a permission prompt. This is a UX-level defense rather than a parse-level one, which is appropriate for an interactive tool where the user is present.
Loop and stuck detection
What happens when an agent is spinning its wheels? Most agents don't have explicit detection. Two in this set make it a dedicated subsystem (DeerFlow's LoopDetectionMiddleware, covered later, is a close third):
Crush: SHA-256 tool signature hashing
For each step, Crush computes a signature by SHA-256 hashing the concatenation of `tool_name + "\x00" + tool_input + "\x00" + tool_output` for every tool call in that step. It slides a window over the last 10 steps. If any signature appears more than 5 times in that window, the agent is halted as stuck.
This is robust: calling the same tool with different arguments gets a different hash. Calling it with the same arguments but getting different output (e.g., due to a flaky command) also gets a different hash. Only genuine repetition triggers the halt.
```go
// internal/agent/loop_detection.go
windowSize = 10 // last N steps to check
maxRepeats = 5  // halt if any signature appears this many times
```
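The mechanism fits in a few lines. A Python sketch of the same idea (Crush's real code is Go; the byte layout and the exact repeat comparison may differ slightly from this approximation):

```python
import hashlib
from collections import Counter

WINDOW_SIZE = 10  # last N steps to check
MAX_REPEATS = 5   # halt if any signature appears this many times

def step_signature(tool_calls: list[tuple[str, str, str]]) -> str:
    """Hash name\\x00input\\x00output for every tool call in a step."""
    h = hashlib.sha256()
    for name, tool_input, tool_output in tool_calls:
        h.update(f"{name}\x00{tool_input}\x00{tool_output}".encode())
    return h.hexdigest()

def is_stuck(step_signatures: list[str]) -> bool:
    """Slide a window over the last N step signatures and halt on repeats."""
    window = step_signatures[-WINDOW_SIZE:]
    return any(n >= MAX_REPEATS for n in Counter(window).values())
```

Because the hash covers input *and* output, a flaky command that returns different output each time never looks like a loop; only byte-identical repetition trips the halt.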
Hermes: Per-session trajectory tracking
Hermes exports full trajectory data as `<tool_call>` XML tags wrapping JSON, a format that matches Nous Research's Hermes model fine-tuning data. This trajectory data can be used post-hoc in RL training to reward or penalize specific action sequences.
The context compressor's iterative re-compression also detects when the same summary is being generated repeatedly (diminishing new information), allowing the RL training signal to identify "summary convergence" as a proxy for being stuck on a task.
Why most agents don't have this
Most agents rely on the model to notice it is repeating itself. Crush's explicit detection is a sign of production experience: models sometimes don't notice, and users definitely don't want to watch 50 identical tool calls scroll by.
Context compression strategies
When context windows fill up, agents need to compress history. The strategies here range from no-op to sophisticated:
| Agent | Strategy | Key detail |
|---|---|---|
| Hermes | 5-step algorithmic compression with structured summaries | Prune old tool results β protect head β protect tail by token budget β LLM summarize middle β iterative update on re-compression. Summaries include Goal, Progress, Decisions, Files, Next Steps sections. Summary ratio: 20% of compressed content, max 12K tokens. |
| Neovate Code | Trigger-ratio compaction + separate pruning phase | Compaction fires at configurable triggerRatio (% of model context limit). Separate pruning config with protectedTools, protectTurns, minimumPrune. autoContinue mode resumes automatically after compaction. |
| Claude Code | Rolling context with auto-compaction | Built into the runtime; integrates with plan mode and memory retrieval. Less configurable than Neovate but tightly integrated with the product experience. |
| Qwen Code | Config-driven rolling/summarization | Configurable via user settings. Less implementation detail visible in public snapshot than Neovate or Hermes. |
| DeerFlow | LangGraph checkpointer-based state management | State is checkpointed per run in the LangGraph store. Thread resumption reloads only what the graph needs, rather than replaying the full conversation history. |
| Others | Provider-default truncation | Most CLIs rely on the model provider to handle context limits, either by truncating oldest messages or raising a context-length error. |
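The "protect head, protect tail by token budget, summarize the middle" shape that Hermes and Neovate share can be sketched compactly. Everything below is illustrative: `summarize` stands in for the LLM call, and the parameter names and defaults are assumptions, not either project's real configuration.

```python
def compress_history(messages, count_tokens, summarize,
                     head_keep=2, tail_budget=4000):
    """Keep the first head_keep messages, keep as many trailing messages
    as fit tail_budget tokens, and replace the middle with a summary."""
    head = messages[:head_keep]
    # Walk backwards from the end, protecting the tail by token budget.
    tail, used = [], 0
    for msg in reversed(messages[head_keep:]):
        cost = count_tokens(msg)
        if used + cost > tail_budget:
            break
        tail.insert(0, msg)
        used += cost
    middle = messages[head_keep:len(messages) - len(tail)]
    if not middle:
        return messages  # nothing to compress yet
    return head + [summarize(middle)] + tail
```

Hermes layers more on top of this skeleton (structured Goal/Progress/Decisions sections, iterative re-summarization, a 20% / 12K-token summary cap), but the head/tail protection is the load-bearing part: the system prompt and the most recent turns survive verbatim.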
Agent state machines and retry strategies
How agents handle the "what to do next" decision after each turn reveals their architectural maturity. Four agents in this set have notable, explicitly engineered transition and recovery logic:
Claude Code: 10 terminal states, 8 continue reasons
The query engine (`src/query/transitions.ts`) is a named state machine with explicit exit reasons. Terminal exits: `completed`, `blocking_limit`, `image_error`, `model_error`, `aborted_streaming`, `aborted_tools`, `prompt_too_long`, `stop_hook_prevented`, `hook_stopped`, `max_turns`.
Continue reasons (agent loops back): `tool_use`, `reactive_compact_retry`, `max_output_tokens_recovery`, `max_output_tokens_escalate`, `collapse_drain_retry`, `stop_hook_blocking`, `token_budget_continuation`, `queued_command`.
Each is a distinct named transition, not a generic "keep going" flag.
The token budget fires at `COMPLETION_THRESHOLD = 0.9`; if the per-check delta falls below `DIMINISHING_THRESHOLD = 500` tokens three times, the agent is considered complete.
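The diminishing-returns check is worth a sketch. The constants come from the text above; the class shape, the method name, and the assumption that the three small deltas must be consecutive are all mine, not Claude Code's actual implementation:

```python
COMPLETION_THRESHOLD = 0.9    # budget fraction at which checking begins
DIMINISHING_THRESHOLD = 500   # tokens; deltas below this count as stalls
STALL_LIMIT = 3               # assumed: consecutive stalls => complete

class TokenBudgetMonitor:
    """Flag completion when output growth stalls near the token budget."""

    def __init__(self, budget: int):
        self.budget = budget
        self.last_total = 0
        self.stalls = 0

    def record(self, total_tokens: int) -> bool:
        """Return True when the agent should be considered complete."""
        if total_tokens < self.budget * COMPLETION_THRESHOLD:
            self.last_total = total_tokens
            return False  # still well under budget; no checking yet
        delta = total_tokens - self.last_total
        self.last_total = total_tokens
        self.stalls = self.stalls + 1 if delta < DIMINISHING_THRESHOLD else 0
        return self.stalls >= STALL_LIMIT
```

The intuition: near the budget ceiling, an agent producing fewer than 500 new tokens per check is coasting, and three such checks in a row is treated as "done" rather than letting it drift into a hard truncation.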
OpenHands: Temperature bumping on dead LLM responses
OpenHands' retry mixin (`openhands/llm/retry_mixin.py`) uses the tenacity library with a documented, intentional quirk: on `LLMNoResponseError` when temperature is 0, it automatically sets `temperature = 1.0` on the next attempt.
The reasoning: a fully deterministic model (temp=0) that returns nothing is stuck in a degenerate fixed point and will return nothing again. Adding randomness breaks the loop. This is one of the more thoughtful LLM retry patterns in the set: it adapts the request rather than just retrying identically.
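The pattern itself is small. A sketch of the idea without the tenacity machinery (OpenHands' real version is a tenacity-based mixin; the function and exception names here only mirror what the text describes):

```python
class LLMNoResponseError(Exception):
    """Raised when the model returns an empty completion."""

def call_with_retry(call_llm, temperature=0.0, max_attempts=3):
    """Retry an LLM call, bumping temperature 0 -> 1.0 on empty responses.

    A deterministic request that produced nothing will produce nothing
    again, so the retry changes the request instead of repeating it.
    """
    for attempt in range(max_attempts):
        try:
            return call_llm(temperature=temperature)
        except LLMNoResponseError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            if temperature == 0.0:
                temperature = 1.0  # break the degenerate fixed point
```

The general lesson transfers to any LLM retry path: if the failure is deterministic, an identical retry is wasted spend.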
DeerFlow: 200-line buckets to prevent false loop detection
DeerFlow's `LoopDetectionMiddleware` hashes tool name + input + output to detect repetition. But for `read_file`, line numbers are bucketed into 200-line groups before hashing, because paginated reads of the same file would otherwise trip the naive algorithm despite being legitimate progress.
On warn (3 repeats): inject a `HumanMessage` telling the model it is repeating itself and should wrap up. On hard limit (5 repeats): strip `tool_calls` entirely from the response, forcing a plain-text answer and ending the loop definitively.
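A sketch of the bucketed signature (the argument names `start_line`/`end_line` and the payload layout are illustrative; only the tool-name+input+output hashing and the 200-line bucketing come from the description above):

```python
import hashlib

BUCKET = 200  # line numbers grouped so paginated reads hash coarsely

def loop_signature(tool: str, args: dict, output: str) -> str:
    """Hash a tool call for loop detection, bucketing read_file lines."""
    if tool == "read_file":
        args = dict(args)  # don't mutate the caller's args
        for key in ("start_line", "end_line"):
            if key in args:
                args[key] = args[key] // BUCKET
    payload = f"{tool}|{sorted(args.items())}|{output}"
    return hashlib.sha256(payload.encode()).hexdigest()
```

The trade-off is explicit: reads within one 200-line bucket are treated as repetition, while reads of genuinely new regions still get distinct signatures.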
Qwen Code: Truncation recovery with Levenshtein validation
When the model's response is truncated mid-tool-call, Qwen Code (`coreToolScheduler.ts`) injects `TRUNCATION_PARAM_GUIDANCE` ("your previous response was truncated due to max_tokens…") to ask the model to retry, and returns `TRUNCATION_EDIT_REJECTION` ("tool call has been rejected to prevent writing truncated content") for any edit tool whose output is incomplete.
The scheduler imports both `diff` and `fast-levenshtein` to verify that proposed file edits are not corrupted: if a diff-patch looks syntactically valid but the Levenshtein distance between the "before" and "after" is implausible, the edit is rejected. `modifiable-tool.ts` additionally lets tool calls be edited in-flight by the user before execution.
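To make the plausibility idea concrete, here is a sketch in Python: a plain dynamic-programming edit distance (standing in for the `fast-levenshtein` npm package) plus a hypothetical rejection rule. The 0.8 ratio threshold and the rule itself are my assumptions; Qwen Code's actual threshold logic is not described in the source.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_is_plausible(before: str, after: str, max_ratio: float = 0.8) -> bool:
    """Hypothetical rule: reject edits that rewrite nearly the whole file,
    a proxy for an edit assembled from truncated model output."""
    if not before:
        return True  # new file; nothing to compare against
    return levenshtein(before, after) / max(len(before), len(after)) <= max_ratio
```

A truncated write that replaces a 2,000-line file with a 40-line stub has a near-1.0 distance ratio, which is exactly the shape this kind of guard is meant to catch.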
Hooks and lifecycle events
Tool hooks, code that runs before and after tool calls, let external systems observe or modify agent behavior without patching the core. Three agents in this set have them as first-class concepts:
Kimi CLI: Three-event hooks system
The `hooks/events.py` module defines three events per tool call:
- `pre_tool_use(tool_name, input)` - runs before the tool; can modify or block the call
- `post_tool_use(tool_name, input, result)` - runs after success; can observe output
- `post_tool_use_failure(tool_name, input, error)` - runs on failure; allows custom error handling
This hooks API enables telemetry, authorization checks, result caching, and test mocking without touching tool implementation code.
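The dispatch shape is simple enough to sketch. Registration and the exact hook signatures below are illustrative, not Kimi CLI's actual API; only the three events and their ordering come from the description above:

```python
class ToolHooks:
    """Minimal pre/post/post-failure hook dispatcher around a tool call."""

    def __init__(self):
        self.pre, self.post, self.post_failure = [], [], []

    def run_tool(self, tool_name, tool_input, tool_fn):
        # pre_tool_use: each hook may rewrite (or validate) the input.
        for hook in self.pre:
            tool_input = hook(tool_name, tool_input)
        try:
            result = tool_fn(tool_input)
        except Exception as err:
            # post_tool_use_failure: observe the error, then re-raise.
            for hook in self.post_failure:
                hook(tool_name, tool_input, err)
            raise
        # post_tool_use: observe the successful result.
        for hook in self.post:
            hook(tool_name, tool_input, result)
        return result
```

With this shape, telemetry is a `post` hook, authorization is a `pre` hook that raises, and test mocking replaces `tool_fn`, all without touching the tool implementation.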
Hermes: Security scan hooks on skill saves
Hermes does not have a general hooks API, but it has a specific security-scan hook on the skill learning path: before any `SKILL.md` file is written to `~/.hermes/skills/`, the content is scanned for prompt-injection patterns and invisible Unicode. Similarly, `MEMORY.md` and `USER.md` are scanned on every load.
This is a security-specific hooks pattern rather than a general lifecycle API.
Pi Mono: 20+ lifecycle events via extensions
Pi's extension system exposes 20+ events via the `ExtensionAPI`: `agent_start`, `agent_end`, `tool_call`, `tool_result`, `beforeToolCall`, `afterToolCall`, `message_start`/`end`, `turn_start`/`end`, `session_start`/`compact`/`tree`, `model_select`, `before_provider_request`, and more.
Extensions can block, modify, or augment any tool execution. Event handlers return promises that are awaited in order, enabling synchronous interception. Custom tools, providers, renderers, widgets, and status lines are all registered through the same event-driven API.
MCP, ACP, and bridge strategy
MCP leaders
Claude Code, Qwen Code, Neovate, Pochi, and DeerFlow all show serious MCP handling. The difference is emphasis:
- Claude treats MCP as part of a larger integrated runtime.
- Qwen and Neovate handle discovery, connection state, and health more explicitly.
- Pochi blends MCP with vendor and agent ecosystem imports.
- DeerFlow folds MCP into a composable extensions system.
ACP specialist
Kimi CLI is the clearest protocol-bridge project in this set. Its ACP conversion layer is not decorative; it maps internal edits, terminal calls, and result blocks into transportable protocol content. The docs even call out current feature gaps, such as the missing `session/set_mode` and `session/set_model`.
Bidirectional MCP
Codex supports MCP in both directions: as a client via the `codex-mcp` crate (connecting to external MCP servers), and as a server via `codex-mcp-server`, exposed through the `codex mcp-server` CLI command. This lets Codex both consume external MCP tools and expose its own tools to other MCP clients.
OpenHands is a qualified case
The local repo contains architecture, runtime, sandbox, and legacy tool/action patterns, but the project itself says the main V1 agentic core moved elsewhere. Any MCP or tool comparison for OpenHands should be read as a snapshot of the local repo, not the whole current product story.
Error handling and recovery patterns
Claude Code
Recovery is systemic: permissions, retries, tool-specific controls, command systems, and a dedicated runtime all participate. It feels designed around failure as a normal operating condition.
Neovate Code
Strong at pre-emptive failure avoidance. Its shell code tries hard to stop dangerous or malformed commands before they ever run, and its MCP manager tracks retries and connection state explicitly.
Qwen Code
Particularly good at configuration and state management. Model resolution has source precedence, runtime snapshots, and rollback-ish handling that makes the system easier to reason about.
DeerFlow
Uses middleware to absorb failure modes: clarification interrupts, dangling tool-call handling, summarization, and subagent limits. That is framework-style robustness rather than CLI-style robustness.
Crush
Calm, explicit Go-style error boundaries. Provider metadata caching and service organization make failure paths easier to trace than in sprawling dynamic runtimes.
Kimi CLI
More honest than flashy: the ACP docs document current limitations, which is often a sign of a healthy protocol mindset rather than a polished marketing wrapper.
Pi Mono
Error handling is multi-layered: tool-level errors return `isError: true` to the LLM for self-retry, abort signals trigger graceful cleanup (temp files closed, partial results returned), compaction failures auto-retry, and extension errors are caught and emitted via `emitError()` without crashing the agent. The file mutation queue prevents concurrent write races entirely.
Codex
Strict clippy configuration denies `unwrap_used` and `expect_used` across the workspace, forcing explicit error handling at compile time. Runtime recovery includes a 300-second idle timeout and up to 5 stream retries before giving up.
What architecture tells you about product intent
The repos that feel the strongest are the ones whose architecture matches their promise. Claude Code promises an elite terminal coding partner, and the repo looks like a full custom runtime. Crush promises a serious CLI product, and the repo looks handcrafted for that. Qwen promises a configurable multi-provider CLI, and its model-resolution machinery backs that up. Hermes promises a self-improving multi-platform agent, and its skills system, RL infrastructure, and gateway adapters all back that up. Pi Mono promises a minimalist extensible kernel, and its 438-file codebase with differential TUI, tree sessions, and Pi Packages delivers exactly that.
The weaker-feeling designs are not necessarily bad; they are often just trying to do a broader or less opinionated job. Pochi is ambitious and eclectic. DeerFlow is powerful but framework-heavy. OpenHands is split across repo boundaries. Kimi prioritizes bridge fidelity over terminal theatrics. Those are different goals, and the code reflects them.