docs(analysis): map IDE coding-agent harness techniques onto Hermes

Study of the five harness subsystems that make in-editor coding agents outperform the raw model (same models underneath), with a grounded map of each onto the Hermes codebase: indexed semantic retrieval, retrieval-as-accuracy-driver, decoupled apply model, ambient context, per-task routing. Findings reference real files (tools/fuzzy_match.py, agent/auxiliary_client.py, agent/prompt_builder.py, hermes_state.py). Identifies semantic codebase retrieval as the one structural gap; apply-reliability and model-routing as existing strengths.
2026-07-05 01:27:52 +08:00 · 2026-06-04 01:54:49 -05:00
1 changed files with 179 additions and 0 deletions
--- a/docs/analysis/coding-agent-harness.md
+++ b/docs/analysis/coding-agent-harness.md
@@ -0,0 +1,179 @@
 # Why IDE-Embedded Coding Agents Feel Better — and Where Hermes Stands
 A study of the *harness* techniques that make in-editor coding agents punch above
 the raw model, and an honest map of which ones Hermes already implements, which it
 approximates, and which it lacks.
 ## TL;DR
 The leading IDE-embedded coding products are **not better models** — they call the
 same frontier models (Claude, GPT) that power terminal agents like this one. Their
 edge comes entirely from the **harness**: how context is retrieved and assembled,
 how mechanical edits are applied reliably, and how cheap specialized models are
 routed in for sub-tasks. The model is a commodity; the meal it's fed is not.
 This document breaks the advantage into five concrete subsystems, each backed by
 published engineering from the vendors, and maps each onto the Hermes codebase.
 | # | Technique | What it buys | Hermes status |
 |---|-----------|--------------|---------------|
 | 1 | Indexed semantic retrieval (Merkle delta-sync + content-addressed embedding cache) | Knows the repo in ms; feeds the *right* snippets | ❌ **Gap** (grep/FTS5 only, no vector index) |
 | 2 | Retrieval as the accuracy driver | ~+12.5% answer accuracy (vendor eval) | ⚠️ **Approximated** (lexical search, not semantic) |
 | 3 | Decoupled "apply" model + line-number-free search/replace | Frontier model only *reasons*; mechanical patching never botches the file | ✅ **Have an analog** (`tools/fuzzy_match.py`) |
 | 4 | Ambient IDE context (cursor pos, selection, live diagnostics) | More intent-signal per token | ⚠️ **Partial** (context files + LSP, no live cursor/selection) |
 | 5 | Per-task model routing (tiny model for autocomplete, frontier for reasoning) | Right tool per job | ✅ **Have** (`agent/auxiliary_client.py`) |
 ---
 ## 1. Indexed semantic retrieval — the actual moat
 The headline trick is *not* the system prompt. It's a vector index over the repo
 that is kept fresh cheaply:
 - **Merkle tree for change detection.** Every file is SHA-256 hashed; folder hashes
  derive from children; the root summarizes the repo. An edit changes only that
  file's hash plus the path to the root, so the indexer walks **only the branches
  that differ** instead of rescanning. (This is git's own content-addressing trick
  repurposed for indexing.)
 - **Syntax-aware chunking.** Changed files are split on function/class boundaries,
  not arbitrary token windows, then embedded.
 - **Content-addressed embedding cache.** Embeddings are keyed by the hash of the
  chunk content. Re-indexing unchanged code is a cache hit → zero embedding cost.
  Embedding is the expensive step, so this is the whole ballgame for speed.
 - **Cross-clone index reuse.** Vendors observe that clones of one repo are ~92%
  identical across an org; a "simhash" lets a new clone reuse a teammate's index,
  collapsing time-to-first-query from hours (at the 99th percentile) to seconds. Access is gated
  cryptographically: you can only compute a Merkle node's hash if you actually hold
  the file, so results you can't *prove* you possess are dropped.
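The change-detection bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's indexer: the in-memory `dict` tree and the names `build` and `changed_files` are hypothetical, a real indexer would read from disk, and deletions are not reported here.

```python
import hashlib


def build(node):
    """Hash a tree bottom-up.

    `node` is bytes (a file) or a dict of name -> node (a folder).
    Returns (hash, children); children is None for files.
    """
    if isinstance(node, bytes):
        return hashlib.sha256(node).hexdigest(), None
    children = {name: build(child) for name, child in sorted(node.items())}
    h = hashlib.sha256()
    for name, (child_hash, _) in children.items():
        # A folder's hash derives from its children's names and hashes.
        h.update(name.encode("utf-8") + b"\0" + child_hash.encode("ascii") + b"\n")
    return h.hexdigest(), children


def changed_files(old, new, prefix=""):
    """Yield paths of files in `new` that are added or modified relative to `old`."""
    if old is not None and old[0] == new[0]:
        return  # identical subtree: nothing below here needs re-indexing
    _, new_children = new
    if new_children is None:
        yield prefix  # a file whose hash differs (or that did not exist before)
        return
    old_children = old[1] if old is not None and old[1] is not None else {}
    for name, child in new_children.items():
        path = f"{prefix}/{name}" if prefix else name
        yield from changed_files(old_children.get(name), child, path)


# Hypothetical two-revision repo: only src/a.py changes.
v1 = {"README.md": b"hi", "src": {"a.py": b"print(1)", "b.py": b"print(2)"}}
v2 = {"README.md": b"hi", "src": {"a.py": b"print(10)", "b.py": b"print(2)"}}
dirty = list(changed_files(build(v1), build(v2)))
```

Here `dirty` contains only `src/a.py`; the `README.md` leaf and the unchanged `src/b.py` are skipped by hash comparison alone, which is the property that makes refreshes cheap.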
 **Hermes status: this is the real gap.** Core Hermes has no vector index, no
 embedding store, no Merkle delta-sync. Codebase awareness is achieved at task time
 via lexical tools (`search_files` → ripgrep) and session recall via SQLite FTS5
 (`hermes_state.py`, `tools/session_search_tool.py`). The only embedding-flavored
 retrieval lives in an optional plugin (`plugins/memory/holographic/`), and even
 that is FTS5-backed, not dense-vector.
 This is a *defensible* design choice for a terminal-first agent — ripgrep over a
 known working tree is fast, dependency-free, and always current — but it means
 Hermes "discovers" a codebase cold each task rather than walking in pre-indexed.
 ## 2. Retrieval is the accuracy driver (empirically)
 Vendor evals attribute **~+12.5% answer accuracy** and higher edit-retention to
 semantic search alone, with the same model and only better-retrieved context. This
 is the strongest empirical support for the thesis: *a worse model with better
 context beats a better model with worse context.*
 **Hermes status: approximated, lexically.** Hermes gets the *shape* of this through
 aggressive context assembly — `agent/prompt_builder.py` and `agent/system_prompt.py`
 inject project context files (AGENTS.md / CLAUDE.md / .cursorrules), and
 `agent/subdirectory_hints.py` surfaces local structure. What's missing is *ranked
 semantic* retrieval: Hermes finds text by pattern, not by meaning. For
 "where is the thing that does X" questions, lexical search is strictly weaker than
 embeddings.
 ## 3. Decoupled apply model — the most underrated trick
 Leading products split edits into two stages:
 1. **Plan** — the frontier model emits a *terse* edit, often with
   `// ... existing code ...` placeholders.
 2. **Apply** — a separate, cheap, often self-hosted model turns that sketch into the
   final file.
 Why bother? Frontier models are *lazy and inaccurate* at large rewrites: they drop
 code, emit `...`, "helpfully" reformat unrelated lines, miscount line numbers, and
 can trap the agent in retry loops. Three published findings drive the design:
 - **Whole-file rewrites beat diffs** for the model, for three reasons: diffs force
  fewer output tokens (less room to "think"); diffs are out-of-distribution (models
  saw far more whole files in training); and line numbers are tokenizer poison (a
  number is one token, forcing a one-shot commit, and models can't count lines).
 - So edits use **search/replace blocks with no line numbers**, with redundant
  context lines so the parser tolerates model slips.
 - **Speculative decoding** makes apply fast (~1000 tok/s) because the unchanged file
  *is* the draft — and because that can't be built into hosted Anthropic/OpenAI
  models, vendors train and self-host their own apply model.
 **Hermes status: it has a deterministic analog, and it's good.**
 `tools/fuzzy_match.py` implements an **8-strategy search/replace matcher** (exact →
 line-trimmed → whitespace-normalized → indentation-flexible → escape-normalized →
 trimmed-boundary → block-anchor → context-aware-similarity) that is *precisely* the
 "tolerate model slips in line-number-free search/replace" idea — just solved with
 `difflib.SequenceMatcher` instead of a trained model. `tools/patch_parser.py` and
 `tools/file_tools.py` wire it into the `patch` tool. Hermes also already adopts the
 correct *interface*: the model emits `old_string`/`new_string`, never line numbers.
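To make the fallback-chain idea concrete, here is a condensed sketch with four strategies instead of eight. It is not the code in `tools/fuzzy_match.py`: the function names, the uniqueness rule, and the 0.9 similarity threshold are all assumptions chosen for illustration.

```python
import difflib


def _find_block(lines, needle, key):
    """Start index of the single window equal to `needle` under `key`, else None."""
    n = len(needle)
    target = [key(s) for s in needle]
    hits = [
        i
        for i in range(len(lines) - n + 1)
        if [key(s) for s in lines[i : i + n]] == target
    ]
    # Refuse ambiguous matches rather than guess which one the model meant.
    return hits[0] if len(hits) == 1 else None


def _similar_block(lines, needle, threshold=0.9):
    """Start index of the most similar window, if it clears `threshold`."""
    n = len(needle)
    target = "\n".join(needle)
    best_ratio, best_index = 0.0, None
    for i in range(len(lines) - n + 1):
        window = "\n".join(lines[i : i + n])
        ratio = difflib.SequenceMatcher(None, window, target).ratio()
        if ratio > best_ratio:
            best_ratio, best_index = ratio, i
    return best_index if best_ratio >= threshold else None


# Ordered from strictest to most tolerant; the first strategy that hits wins.
STRATEGIES = [
    ("exact", lambda ls, nd: _find_block(ls, nd, lambda s: s)),
    ("line-trimmed", lambda ls, nd: _find_block(ls, nd, str.strip)),
    ("whitespace-normalized", lambda ls, nd: _find_block(ls, nd, lambda s: " ".join(s.split()))),
    ("similarity", _similar_block),
]


def fuzzy_replace(text, old_string, new_string):
    """Replace `old_string` with `new_string`; return (new_text, strategy_name)."""
    lines = text.splitlines()
    needle = old_string.splitlines()
    if not needle:
        raise ValueError("old_string must not be empty")
    for name, strategy in STRATEGIES:
        index = strategy(lines, needle)
        if index is not None:
            lines[index : index + len(needle)] = new_string.splitlines()
            return "\n".join(lines), name
    raise ValueError("old_string not found by any strategy")


# The model dropped the indentation in old_string; the second strategy recovers.
source = "def f():\n    x = 1\n    return x\n"
patched, used = fuzzy_replace(source, "x = 1\nreturn x", "    x = 2\n    return x")
```

The ordering matters: stricter strategies run first so that a tolerant match is only accepted when nothing more exact exists, which keeps the patch deterministic and easy to explain when it fails.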
 Where the vendors go further: a *trained* apply model can reconstruct intent from a
 sketch (resolve `// ... existing ...` placeholders against the real file), whereas a
 fuzzy matcher can only locate-and-substitute text the model actually wrote. Hermes
 trades that capability for zero latency, zero cost, and full determinism — a sound
 trade for a local agent, but worth naming.
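The gap described here can be narrowed deterministically in simple cases. The sketch below resolves placeholder lines against the real file, assuming each placeholder is bracketed by unchanged anchor lines that appear verbatim in the original; `resolve_sketch` and `_find` are hypothetical names, and this is not an existing Hermes feature.

```python
PLACEHOLDER = "// ... existing code ..."


def _find(lines, line, start):
    """Index of `line` in `lines` at or after `start`, with a clearer error."""
    try:
        return lines.index(line, start)
    except ValueError:
        raise ValueError(f"anchor line not found in original: {line!r}") from None


def resolve_sketch(original, sketch, placeholder=PLACEHOLDER):
    """Expand placeholder lines in `sketch` using text from `original`.

    Assumption: the sketch line just before and just after each placeholder
    is an unchanged line of the original, used as an anchor.
    """
    orig = original.splitlines()

    # Split the sketch into runs of real lines separated by placeholders.
    segments = [[]]
    for line in sketch.splitlines():
        if line.strip() == placeholder:
            segments.append([])
        else:
            segments[-1].append(line)

    out, cursor = [], 0  # cursor: first original line not yet accounted for
    last = len(segments) - 1
    for i, seg in enumerate(segments):
        if not seg:
            if i == last and i > 0:
                out.extend(orig[cursor:])  # trailing placeholder: keep the rest
            continue
        start = cursor
        if i > 0:
            # A placeholder precedes this segment: copy the skipped original lines.
            start = _find(orig, seg[0], cursor)
            out.extend(orig[cursor:start])
        out.extend(seg)
        if i < last:
            # A placeholder follows: resume after this segment's closing anchor.
            cursor = _find(orig, seg[-1], start) + 1
    return "\n".join(out) + ("\n" if original.endswith("\n") else "")


original = (
    "def add(a, b):\n    return a + b\n\n"
    "def sub(a, b):\n    return a - b\n\n"
    "def mul(a, b):\n    return a * b\n"
)
sketch = (
    "// ... existing code ...\n"
    "def sub(a, b):\n    # subtract b from a\n    return a - b\n"
    "// ... existing code ...\n"
)
resolved = resolve_sketch(original, sketch)
```

A trained apply model can cope with sketches whose anchors were paraphrased or reordered; this version simply raises when an anchor is missing, which is the deterministic end of the same trade-off.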
 ## 4. Ambient context — the editor's free advantage
 Because the product *is* the editor, it injects for free: the open file, **cursor
 position**, current selection, **live LSP/linter diagnostics**, and recent diffs.
 Terminal agents must spend tool calls reconstructing all of this. More intent-signal
 per token → better output from the same model.
 **Hermes status: partial.** Hermes injects project context files and has LSP
 plumbing (`agent/lsp/`), and the ACP adapter (`acp_adapter/`) gives editors a way to
 feed edits/approvals back. What it lacks is the *passive* signal: it doesn't know
 where your cursor is or what you've selected, because in a terminal there is no
 cursor to read. The ACP integration narrows this gap when Hermes runs inside an
 editor, but the default terminal surface is signal-poorer by construction.
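For a sense of how little data the passive signal involves, here is a hypothetical payload and renderer. The `EditorContext` and `Diagnostic` types and the output format are invented for illustration; they are not the ACP schema and not an existing Hermes structure.

```python
from dataclasses import dataclass, field


@dataclass
class Diagnostic:
    line: int
    severity: str
    message: str


@dataclass
class EditorContext:
    # Hypothetical shape of what an editor could hand over for free.
    path: str
    cursor_line: int
    cursor_col: int
    selection: str = ""
    diagnostics: list = field(default_factory=list)


def render_ambient_block(ctx):
    """Render editor state as a compact text block for prompt assembly."""
    lines = [
        f"Open file: {ctx.path}",
        f"Cursor: line {ctx.cursor_line}, column {ctx.cursor_col}",
    ]
    if ctx.selection:
        lines.append("Selection:")
        lines.extend("    " + s for s in ctx.selection.splitlines())
    for d in ctx.diagnostics:
        lines.append(f"Diagnostic ({d.severity}) at line {d.line}: {d.message}")
    return "\n".join(lines)


ctx = EditorContext(
    path="tools/example.py",  # illustrative path, not a real Hermes file
    cursor_line=42,
    cursor_col=8,
    selection="total = price * qty",
    diagnostics=[Diagnostic(42, "error", "name 'qty' is not defined")],
)
block = render_ambient_block(ctx)
```

A handful of lines like these can replace several tool calls a terminal agent would otherwise spend working out which file and which error the user is looking at.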
 ## 5. Per-task model routing
 Autocomplete uses a tiny fast model; chat uses a frontier model; apply uses the
 custom fast model; the agent loop uses a frontier model plus tools. Nothing is
 forced through one monolith.
 **Hermes status: have it.** `agent/auxiliary_client.py` provides a routed auxiliary
 model used for cheaper sub-tasks — title generation (`agent/title_generator.py`),
 vision routing (`tools/computer_use/vision_routing.py`), background review
 (`agent/background_review.py`), and conversation compression
 (`agent/context_compressor.py`, `trajectory_compressor.py`). The pattern — reserve
 the expensive model for reasoning, route mechanical sub-tasks to a cheap one — is
 already core to Hermes.
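The routing pattern itself is small. The sketch below is a generic illustration, not the API of `agent/auxiliary_client.py`: the model names, the task labels, and `pick_model` are hypothetical.

```python
# Hypothetical model identifiers; substitute whatever the deployment provides.
FRONTIER_MODEL = "frontier-model"
CHEAP_MODEL = "small-fast-model"

# Mechanical sub-tasks that do not need the expensive model (labels assumed).
AUXILIARY_TASKS = frozenset(
    {"title_generation", "vision_routing", "background_review", "compression"}
)


def pick_model(task):
    """Route mechanical sub-tasks to the cheap model, everything else to frontier."""
    return CHEAP_MODEL if task in AUXILIARY_TASKS else FRONTIER_MODEL


choices = {task: pick_model(task) for task in ("title_generation", "agent_loop")}
```

Defaulting unknown tasks to the frontier model is a deliberate choice in this sketch: a misrouted reasoning task costs quality, while a misrouted mechanical task only costs money.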
 ---
 ## Synthesis: where Hermes wins, ties, and trails
 **Ties or wins:**
 - **Apply reliability** — the 8-strategy fuzzy matcher is a genuinely strong,
  zero-cost analog to a trained apply model, and uses the same line-number-free
  search/replace interface the research converged on.
 - **Model routing** — auxiliary-client routing already reserves the frontier model
  for reasoning.
 - **Context-file injection & session memory** — robust, and FTS5 session search is a
  real recall capability for a terminal-first agent.
 **Trails:**
 - **Semantic codebase retrieval** is the one structural gap. Hermes is lexical
  (ripgrep + FTS5) where the leaders are dense-vector with a cheaply-maintained
  index. This is the highest-leverage area if Hermes ever wants to close the
  "feels like it already knows my repo" gap.
 - **Ambient passive context** (cursor/selection/live diagnostics) is inherently
  weaker outside an editor; the ACP path is the right place to invest if that
  matters.
 **The single most transferable insight:** the research independently concluded that
 it is *better to rewrite via fuzzy-tolerant, line-number-free search/replace than to
 trust the smart model to emit a precise diff* — and Hermes already lands on the same
 answer in `tools/fuzzy_match.py`. That convergence is a good sign the harness
 fundamentals here are sound; the missing piece is retrieval, not editing.
 ## Suggested follow-ups (not implemented here — analysis only)
 1. **Optional semantic index plugin.** A content-addressed embedding cache keyed by
   file hash, behind the existing plugin interface, would give ranked semantic
   retrieval without bloating the core terminal path. Merkle delta-sync keeps it
   cheap to refresh.
 2. **Apply-from-sketch mode.** Let the model emit `// ... existing ...` placeholders
   and resolve them against the real file before handing to the fuzzy matcher —
   captures most of a trained apply model's benefit deterministically.
 3. **Richer ambient context over ACP.** Pipe editor cursor/selection/diagnostics
   into prompt assembly when running embedded, closing the passive-signal gap.
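As a possible starting point for follow-up 1, the sketch below pairs syntax-aware chunking with a content-addressed cache. It keys on the hash of each chunk, as in section 1, rather than on a whole-file hash. Everything here is an assumption for illustration: `chunk_python`, `EmbeddingCache`, and especially `fake_embed`, which stands in for a real embedding model. Only top-level functions and classes are chunked, decorators are not included in the chunk text, and `end_lineno` requires Python 3.8 or later.

```python
import ast
import hashlib


def chunk_python(source):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive.
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks


class EmbeddingCache:
    """Cache embeddings under the SHA-256 of the chunk text."""

    def __init__(self, embed):
        self._embed = embed
        self._store = {}
        self.misses = 0  # number of times the embedder actually ran

    def get(self, chunk):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(chunk)
        return self._store[key]


def fake_embed(text):
    # Placeholder for a real embedding call; returns a tiny fixed-size vector.
    return [float(len(text)), float(text.count("\n"))]


v1 = "def a():\n    return 1\n\ndef b():\n    return 2\n"
v2 = "def a():\n    return 1\n\ndef b():\n    return 3\n"

cache = EmbeddingCache(fake_embed)
for chunk in chunk_python(v1):
    cache.get(chunk)
misses_after_first_index = cache.misses

for chunk in chunk_python(v2):
    cache.get(chunk)
misses_after_reindex = cache.misses
```

Re-indexing the second revision embeds only the chunk for `b`; the unchanged `a` is a cache hit. Chunk-level keys give finer reuse than file-level keys, at the cost of parsing every changed file.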