docs(analysis): map IDE coding-agent harness techniques onto Hermes

Study of the five harness subsystems that make in-editor coding agents outperform the raw model (same models underneath), with a grounded map of each onto the Hermes codebase: indexed semantic retrieval, retrieval-as-accuracy-driver, decoupled apply model, ambient context, per-task routing. Findings reference real files (tools/fuzzy_match.py, agent/auxiliary_client.py, agent/prompt_builder.py, hermes_state.py). Identifies semantic codebase retrieval as the one structural gap; apply-reliability and model-routing as existing strengths.
2026-06-10 12:18:44 +08:00 · 2026-06-04 01:54:49 -05:00
1 changed files with 179 additions and 0 deletions
--- a/docs/analysis/coding-agent-harness.md
+++ b/docs/analysis/coding-agent-harness.md
@@ -0,0 +1,179 @@
+# Why IDE-Embedded Coding Agents Feel Better — and Where Hermes Stands
+
+A study of the *harness* techniques that make in-editor coding agents punch above
+the raw model, and an honest map of which ones Hermes already implements, which it
+approximates, and which it lacks.
+
+## TL;DR
+
+The leading IDE-embedded coding products are **not better models** — they call the
+same frontier models (Claude, GPT) that power terminal agents like this one. Their
+edge comes entirely from the **harness**: how context is retrieved and assembled,
+how mechanical edits are applied reliably, and how cheap specialized models are
+routed in for sub-tasks. The model is a commodity; the meal it's fed is not.
+
+This document breaks the advantage into five concrete subsystems, each backed by
+published engineering from the vendors, and maps each onto the Hermes codebase.
+
+| # | Technique | What it buys | Hermes status |
+|---|-----------|--------------|---------------|
+| 1 | Indexed semantic retrieval (Merkle delta-sync + content-addressed embedding cache) | Knows the repo in ms; feeds the *right* snippets | ❌ **Gap** (grep/FTS5 only, no vector index) |
+| 2 | Retrieval as the accuracy driver | Roughly +12.5% answer accuracy (vendor eval) | ⚠️ **Approximated** (lexical search, not semantic) |
+| 3 | Decoupled "apply" model + line-number-free search/replace | Frontier model only *reasons*; mechanical patching never botches the file | ✅ **Have an analog** (`tools/fuzzy_match.py`) |
+| 4 | Ambient IDE context (cursor pos, selection, live diagnostics) | More intent-signal per token | ⚠️ **Partial** (context files + LSP, no live cursor/selection) |
+| 5 | Per-task model routing (tiny model for autocomplete, frontier for reasoning) | Right tool per job | ✅ **Have** (`agent/auxiliary_client.py`) |
+
+---
+
+## 1. Indexed semantic retrieval — the actual moat
+
+The headline trick is *not* the system prompt. It's a vector index over the repo
+that is kept fresh cheaply:
+
+- **Merkle tree for change detection.** Every file is SHA-256 hashed; folder hashes
+  derive from children; the root summarizes the repo. An edit changes only that
+  file's hash plus the path to the root, so the indexer walks **only the branches
+  that differ** instead of rescanning. (This is git's own content-addressing trick
+  repurposed for indexing.)
+- **Syntax-aware chunking.** Changed files are split on function/class boundaries,
+  not arbitrary token windows, then embedded.
+- **Content-addressed embedding cache.** Embeddings are keyed by the hash of the
+  chunk content. Re-indexing unchanged code is a cache hit → zero embedding cost.
+  Embedding is the expensive step, so this is the whole ballgame for speed.
+- **Cross-clone index reuse.** Vendors observe that clones of one repo are ~92%
+  identical across an org; a "simhash" lets a new clone reuse a teammate's index,
+  collapsing time-to-first-query from hours (99th pct) to seconds. Access is gated
+  cryptographically: you can only compute a Merkle node's hash if you actually hold
+  the file, so results you can't *prove* you possess are dropped.
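+
+The change-detection bullet above can be sketched in a few lines. This is a minimal
+illustration, not any vendor's implementation: the nested-dict repo layout and the
+names `build_node` and `changed_files` are assumptions made for the example, and
+deleted files are ignored.
+
```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_node(entry):
    """Build a Merkle node from bytes (a file) or a dict of name -> entry (a folder)."""
    if isinstance(entry, bytes):
        return {"hash": sha256_hex(entry), "children": None}
    children = {name: build_node(child) for name, child in sorted(entry.items())}
    # A folder's hash is derived from its children's names and hashes.
    combined = "".join(f"{name}:{node['hash']};" for name, node in children.items())
    return {"hash": sha256_hex(combined.encode("utf-8")), "children": children}


def changed_files(old, new, path=""):
    """Return paths of files in `new` that are added or modified relative to `old`."""
    if old is not None and old["hash"] == new["hash"]:
        return []  # identical subtree: nothing below here needs re-indexing
    if new["children"] is None:
        return [path]  # a file whose hash differs, or a file that is new
    old_children = (old or {}).get("children") or {}
    changed = []
    for name, node in new["children"].items():
        child_path = f"{path}/{name}" if path else name
        changed += changed_files(old_children.get(name), node, child_path)
    return changed
```
+
+Comparing two snapshots only descends into folders whose hashes differ, which is
+why an edit to one file costs a walk along a single root-to-leaf path.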
+
+**Hermes status: this is the real gap.** Core Hermes has no vector index, no
+embedding store, no Merkle delta-sync. Codebase awareness is achieved at task time
+via lexical tools (`search_files` → ripgrep) and session recall via SQLite FTS5
+(`hermes_state.py`, `tools/session_search_tool.py`). The only embedding-flavored
+retrieval lives in an optional plugin (`plugins/memory/holographic/`), and even
+that is FTS5-backed, not dense-vector.
+
+This is a *defensible* design choice for a terminal-first agent — ripgrep over a
+known working tree is fast, dependency-free, and always current — but it means
+Hermes "discovers" a codebase cold each task rather than walking in pre-indexed.
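+
+To make the missing piece concrete, here is a rough sketch of syntax-aware chunking
+feeding a content-addressed embedding cache. Everything in it is an assumption for
+illustration: it only handles Python sources (via the standard `ast` module), it
+skips top-level statements that are not functions or classes, and `fake_embed` is a
+stand-in for a real embedding model. None of these names exist in Hermes.
+
```python
import ast
import hashlib


def chunk_python(source: str) -> list[str]:
    """Split a Python module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return [ast.get_source_segment(source, node)
            for node in tree.body if isinstance(node, wanted)]


class EmbeddingCache:
    """Cache embeddings under the SHA-256 of the chunk text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # how many times the (expensive) embed step actually ran

    def get(self, chunk: str):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(chunk)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    # Placeholder only: a real indexer would call an embedding model here.
    return [float(len(text))]


def index_source(source: str, cache: EmbeddingCache) -> list[list[float]]:
    return [cache.get(chunk) for chunk in chunk_python(source)]
```
+
+Re-indexing a file after a one-function edit re-embeds only that function; every
+unchanged chunk hashes to the same key and is served from the cache.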
+
+## 2. Retrieval is the accuracy driver (empirically)
+
+Vendor evals attribute roughly **+12.5% answer accuracy** and higher edit retention to
+semantic search alone (same model, better-retrieved context). This is empirical
+support, though not proof (the comparison holds the model fixed), for the thesis that
+*a worse model with better context beats a better model with worse context.*
+
+**Hermes status: approximated, lexically.** Hermes gets the *shape* of this through
+aggressive context assembly — `agent/prompt_builder.py` and `agent/system_prompt.py`
+inject project context files (AGENTS.md / CLAUDE.md / .cursorrules), and
+`agent/subdirectory_hints.py` surfaces local structure. What's missing is *ranked
+semantic* retrieval: Hermes finds text by pattern, not by meaning. For
+"where is the thing that does X" questions, lexical search is usually weaker than
+embeddings, although it remains the better tool for exact identifiers.
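+
+The difference between pattern matching and ranked retrieval can be shown
+mechanically. In this toy sketch the vectors are hand-assigned for illustration and
+do not come from any embedding model; the file paths and function names are likewise
+hypothetical.
+
```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lexical_hits(query, snippets):
    """Return names of snippets whose text contains the query as a substring."""
    return [name for name, text in snippets.items() if query.lower() in text.lower()]


def semantic_rank(query_vec, vectors):
    """Return snippet names ordered by cosine similarity to the query vector."""
    return sorted(vectors, key=lambda name: cosine(query_vec, vectors[name]), reverse=True)


SNIPPETS = {
    "net/resend.py": "def resend_with_backoff(request): ...",
    "ui/theme.py": "def load_theme(name): ...",
}
# Hand-assigned toy vectors (assumption): dimension 0 ~ "networking", 1 ~ "UI".
TOY_VECTORS = {"net/resend.py": [0.9, 0.1], "ui/theme.py": [0.1, 0.9]}
TOY_QUERY_VECTOR = [0.8, 0.2]  # stands in for the query "retry"
```
+
+A substring search for "retry" finds nothing because the relevant function never
+uses that word, while the vector ranking still puts it first. With real embeddings
+the ranking quality depends entirely on the model, which this sketch does not
+address.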
+
+## 3. Decoupled apply model — the most underrated trick
+
+Leading products split edits into two stages:
+
+1. **Plan** — the frontier model emits a *terse* edit, often with
+   `// ... existing code ...` placeholders.
+2. **Apply** — a separate, cheap, often self-hosted model turns that sketch into the
+   final file.
+
+Why bother? Frontier models are *lazy and inaccurate* at large rewrites: they drop
+code, emit `...`, "helpfully" reformat unrelated lines, miscount line numbers, and
+can trap the agent in retry loops. Three published findings drive the design:
+
+- **Whole-file rewrites beat diffs** for the model, for three reasons: diffs force
+  fewer output tokens (less room to "think"); diffs are out-of-distribution (models
+  saw far more whole files in training); and line numbers are tokenizer poison (a
+  number is often a single token, forcing a one-shot commit, and models can't
+  reliably count lines).
+- So edits use **search/replace blocks with no line numbers**, with redundant
+  context lines so the parser tolerates model slips.
+- **Speculative decoding** makes apply fast (~1000 tok/s) because the unchanged file
+  *is* the draft — and because that can't be built into hosted Anthropic/OpenAI
+  models, vendors train and self-host their own apply model.
+
+**Hermes status: it has a deterministic analog, and it's good.**
+`tools/fuzzy_match.py` implements an **8-strategy search/replace matcher** (exact →
+line-trimmed → whitespace-normalized → indentation-flexible → escape-normalized →
+trimmed-boundary → block-anchor → context-aware-similarity) that is *precisely* the
+"tolerate model slips in line-number-free search/replace" idea — just solved with
+`difflib.SequenceMatcher` instead of a trained model. `tools/patch_parser.py` and
+`tools/file_tools.py` wire it into the `patch` tool. Hermes also already adopts the
+correct *interface*: the model emits `old_string`/`new_string`, never line numbers.
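+
+The cascade idea can be illustrated with a much smaller matcher. The sketch below is
+not the code in `tools/fuzzy_match.py`: it implements only three strategies (exact,
+line-trimmed, similarity), and the function names and the 0.8 threshold are
+assumptions chosen for the example.
+
```python
import difflib


def find_span(content_lines, old_lines, threshold=0.8):
    """Return (start_index, strategy) for the first strategy that matches, else None."""
    n = len(old_lines)
    starts = range(len(content_lines) - n + 1)
    # Strategy 1: exact line-for-line match.
    for i in starts:
        if content_lines[i:i + n] == old_lines:
            return i, "exact"
    # Strategy 2: ignore leading/trailing whitespace on each line.
    trimmed_old = [line.strip() for line in old_lines]
    for i in starts:
        if [line.strip() for line in content_lines[i:i + n]] == trimmed_old:
            return i, "line-trimmed"
    # Strategy 3: best window by character similarity, if it clears the threshold.
    target = "\n".join(trimmed_old)
    best_i, best_ratio = None, 0.0
    for i in starts:
        window = "\n".join(line.strip() for line in content_lines[i:i + n])
        ratio = difflib.SequenceMatcher(None, window, target).ratio()
        if ratio > best_ratio:
            best_i, best_ratio = i, ratio
    if best_i is not None and best_ratio >= threshold:
        return best_i, "similarity"
    return None


def apply_edit(content, old_string, new_string):
    """Replace the span matching old_string; returns (new_content, strategy)."""
    old_lines = old_string.splitlines()
    if not old_lines:
        raise ValueError("old_string must not be empty")
    lines = content.splitlines()
    match = find_span(lines, old_lines)
    if match is None:
        raise ValueError("old_string not found")
    start, strategy = match
    # The replacement is inserted exactly as written, including its indentation.
    result = lines[:start] + new_string.splitlines() + lines[start + len(old_lines):]
    return "\n".join(result) + ("\n" if content.endswith("\n") else ""), strategy
```
+
+Ordering strategies from strictest to loosest means a tolerant match is only used
+when a stricter one fails, which keeps the common case exact and predictable.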
+
+Where the vendors go further: a *trained* apply model can reconstruct intent from a
+sketch (resolve `// ... existing ...` placeholders against the real file), whereas a
+fuzzy matcher can only locate-and-substitute text the model actually wrote. Hermes
+trades that capability for zero latency, zero cost, and full determinism — a sound
+trade for a local agent, but worth naming.
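+
+For a sense of what a deterministic placeholder resolver might look like, here is a
+deliberately naive sketch. It is hypothetical and not part of Hermes: it treats the
+sketch as the whole file with elisions, and it assumes the line directly before and
+directly after each placeholder appears unchanged in the original so it can serve as
+an exact-match anchor.
+
```python
PLACEHOLDER = "// ... existing code ..."


def resolve_sketch(original: str, sketch: str) -> str:
    """Expand placeholder lines in `sketch` with the elided lines from `original`."""
    orig = original.splitlines()
    lines = sketch.splitlines()
    out = []
    cursor = 0  # index of the next original line not yet accounted for

    def find(line, start):
        for i in range(start, len(orig)):
            if orig[i] == line:
                return i
        raise ValueError(f"anchor line not found in original: {line!r}")

    for idx, line in enumerate(lines):
        if line.strip() == PLACEHOLDER:
            if idx == len(lines) - 1:
                out.extend(orig[cursor:])  # trailing placeholder keeps the rest
            continue
        after_gap = idx > 0 and lines[idx - 1].strip() == PLACEHOLDER
        before_gap = idx + 1 < len(lines) and lines[idx + 1].strip() == PLACEHOLDER
        if after_gap:
            pos = find(line, cursor)
            out.extend(orig[cursor:pos])  # copy the elided original lines
            cursor = pos + 1
        elif before_gap:
            cursor = find(line, cursor) + 1  # skip the lines this segment replaced
        out.append(line)
    return "\n".join(out) + "\n"
```
+
+The anchors are matched exactly and greedily, so an ambiguous anchor such as a lone
+`}` resolves to its first occurrence after the cursor. A production version would
+need the kind of tolerance the fuzzy matcher already provides.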
+
+## 4. Ambient context — the editor's free advantage
+
+Because the product *is* the editor, it injects for free: the open file, **cursor
+position**, current selection, **live LSP/linter diagnostics**, and recent diffs.
+Terminal agents must spend tool calls reconstructing all of this. More intent-signal
+per token → better output from the same model.
+
+**Hermes status: partial.** Hermes injects project context files and has LSP
+plumbing (`agent/lsp/`), and the ACP adapter (`acp_adapter/`) gives editors a way to
+feed edits/approvals back. What it lacks is the *passive* signal: it doesn't know
+where your cursor is or what you've selected, because in a terminal there is no
+cursor to read. The ACP integration narrows this gap when Hermes runs inside an
+editor, but the default terminal surface is signal-poorer by construction.
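+
+As a rough picture of what the passive signal could look like once an editor
+supplies it, the sketch below renders a snapshot into a prompt section. The
+`EditorSnapshot` fields and the output layout are assumptions for illustration; they
+do not reflect the ACP schema or Hermes's prompt assembly.
+
```python
from dataclasses import dataclass, field


@dataclass
class EditorSnapshot:
    # Hypothetical fields: what an embedding editor might report.
    path: str
    cursor_line: int
    cursor_col: int
    selection: str = ""
    diagnostics: list[str] = field(default_factory=list)


def render_ambient_context(snap: EditorSnapshot) -> str:
    """Format the snapshot as a compact text block for inclusion in a prompt."""
    parts = [
        f"Active file: {snap.path}",
        f"Cursor: line {snap.cursor_line}, column {snap.cursor_col}",
    ]
    if snap.selection:
        parts.append("Selection:\n" + snap.selection)
    for message in snap.diagnostics:
        parts.append(f"Diagnostic: {message}")
    return "\n".join(parts)
```
+
+Empty fields are omitted so that a terminal session, which has no selection or
+diagnostics to report, adds almost nothing to the prompt.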
+
+## 5. Per-task model routing
+
+Autocomplete uses a tiny fast model; chat uses a frontier model; apply uses the
+custom fast model; the agent loop uses a frontier model plus tools. Nothing is
+forced through one monolith.
+
+**Hermes status: have it.** `agent/auxiliary_client.py` provides a routed auxiliary
+model used for cheaper sub-tasks — title generation (`agent/title_generator.py`),
+vision routing (`tools/computer_use/vision_routing.py`), background review
+(`agent/background_review.py`), and conversation compression
+(`agent/context_compressor.py`, `trajectory_compressor.py`). The pattern — reserve
+the expensive model for reasoning, route mechanical sub-tasks to a cheap one — is
+already core to Hermes.
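+
+Reduced to its essentials, the routing pattern is a lookup from task kind to model
+tier. The task names, tier names, and functions below are hypothetical and are not
+the interface of `agent/auxiliary_client.py`.
+
```python
# Hypothetical task kinds that are mechanical enough for a cheaper model.
AUXILIARY_TASKS = {
    "title_generation",
    "vision_routing",
    "background_review",
    "compression",
}


def pick_tier(task_kind: str) -> str:
    """Send known mechanical sub-tasks to the cheap tier; default to the frontier tier."""
    return "auxiliary" if task_kind in AUXILIARY_TASKS else "frontier"


def route(task_kind: str, models: dict[str, str]) -> str:
    """Return the configured model name for the tier this task belongs to."""
    return models[pick_tier(task_kind)]
```
+
+Defaulting unknown tasks to the frontier tier is the conservative choice: a
+misrouted task costs more rather than producing a worse answer.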
+
+---
+
+## Synthesis: where Hermes wins, ties, and trails
+
+**Ties or wins:**
+- **Apply reliability** — the 8-strategy fuzzy matcher is a genuinely strong,
+  zero-cost analog to a trained apply model, and uses the same line-number-free
+  search/replace interface the research converged on.
+- **Model routing** — auxiliary-client routing already reserves the frontier model
+  for reasoning.
+- **Context-file injection & session memory** — robust, and FTS5 session search is a
+  real recall capability for a terminal-first agent.
+
+**Trails:**
+- **Semantic codebase retrieval** is the one structural gap. Hermes is lexical
+  (ripgrep + FTS5) where the leaders are dense-vector with a cheaply-maintained
+  index. This is the highest-leverage area if Hermes ever wants to close the
+  "feels like it already knows my repo" gap.
+- **Ambient passive context** (cursor/selection/live diagnostics) is inherently
+  weaker outside an editor; the ACP path is the right place to invest if that
+  matters.
+
+**The single most transferable insight:** the research independently concluded that
+it is *better to edit via fuzzy-tolerant, line-number-free search/replace than to
+trust the smart model to emit a precise diff* — and Hermes already lands on the same
+answer in `tools/fuzzy_match.py`. That convergence is a good sign the harness
+fundamentals here are sound; the missing piece is retrieval, not editing.
+
+## Suggested follow-ups (not implemented here — analysis only)
+
+1. **Optional semantic index plugin.** A content-addressed embedding cache keyed by
+   chunk-content hash, behind the existing plugin interface, would give ranked semantic
+   retrieval without bloating the core terminal path. Merkle delta-sync keeps it
+   cheap to refresh.
+2. **Apply-from-sketch mode.** Let the model emit `// ... existing ...` placeholders
+   and resolve them against the real file before handing to the fuzzy matcher —
+   captures most of a trained apply model's benefit deterministically.
+3. **Richer ambient context over ACP.** Pipe editor cursor/selection/diagnostics
+   into prompt assembly when running embedded, closing the passive-signal gap.