Compare commits

...

1 Commits

Author SHA1 Message Date
Brooklyn Nicholson
d459868776 docs(analysis): map IDE coding-agent harness techniques onto Hermes
Study of the five harness subsystems that make in-editor coding agents
outperform the raw model (same models underneath), with a grounded map of
each onto the Hermes codebase: indexed semantic retrieval, retrieval-as-
accuracy-driver, decoupled apply model, ambient context, per-task routing.

Findings reference real files (tools/fuzzy_match.py, agent/auxiliary_client.py,
agent/prompt_builder.py, hermes_state.py). Identifies semantic codebase
retrieval as the one structural gap; apply-reliability and model-routing as
existing strengths.
2026-06-04 01:54:49 -05:00

View File

@@ -0,0 +1,179 @@
# Why IDE-Embedded Coding Agents Feel Better — and Where Hermes Stands
A study of the *harness* techniques that make in-editor coding agents punch above
the raw model, and an honest map of which ones Hermes already implements, which it
approximates, and which it lacks.
## TL;DR
The leading IDE-embedded coding products are **not better models** — they call the
same frontier models (Claude, GPT) that power terminal agents like this one. Their
edge comes entirely from the **harness**: how context is retrieved and assembled,
how mechanical edits are applied reliably, and how cheap specialized models are
routed in for sub-tasks. The model is a commodity; the meal it's fed is not.
This document breaks the advantage into five concrete subsystems, each backed by
published engineering from the vendors, and maps each onto the Hermes codebase.
| # | Technique | What it buys | Hermes status |
|---|-----------|--------------|---------------|
| 1 | Indexed semantic retrieval (Merkle delta-sync + content-addressed embedding cache) | Knows the repo in ms; feeds the *right* snippets | ❌ **Gap** (grep/FTS5 only, no vector index) |
| 2 | Retrieval as the accuracy driver | +~12.5% answer accuracy (vendor eval) | ⚠️ **Approximated** (lexical search, not semantic) |
| 3 | Decoupled "apply" model + line-number-free search/replace | Frontier model only *reasons*; mechanical patching never botches the file | ✅ **Have an analog** (`tools/fuzzy_match.py`) |
| 4 | Ambient IDE context (cursor pos, selection, live diagnostics) | More intent-signal per token | ⚠️ **Partial** (context files + LSP, no live cursor/selection) |
| 5 | Per-task model routing (tiny model for autocomplete, frontier for reasoning) | Right tool per job | ✅ **Have** (`agent/auxiliary_client.py`) |
---
## 1. Indexed semantic retrieval — the actual moat
The headline trick is *not* the system prompt. It's a vector index over the repo
that is kept fresh cheaply:
- **Merkle tree for change detection.** Every file is SHA-256 hashed; folder hashes
derive from children; the root summarizes the repo. An edit changes only that
file's hash plus the path to the root, so the indexer walks **only the branches
that differ** instead of rescanning. (This is git's own content-addressing trick
repurposed for indexing.)
- **Syntax-aware chunking.** Changed files are split on function/class boundaries,
not arbitrary token windows, then embedded.
- **Content-addressed embedding cache.** Embeddings are keyed by the hash of the
chunk content. Re-indexing unchanged code is a cache hit → zero embedding cost.
Embedding is the expensive step, so this is the whole ballgame for speed.
- **Cross-clone index reuse.** Vendors observe that clones of one repo are ~92%
identical across an org; a "simhash" lets a new clone reuse a teammate's index,
collapsing time-to-first-query from hours (99th pct) to seconds. Access is gated
cryptographically: you can only compute a Merkle node's hash if you actually hold
the file, so results you can't *prove* you possess are dropped.
**Hermes status: this is the real gap.** Core Hermes has no vector index, no
embedding store, no Merkle delta-sync. Codebase awareness is achieved at task time
via lexical tools (`search_files` → ripgrep) and session recall via SQLite FTS5
(`hermes_state.py`, `tools/session_search_tool.py`). The only embedding-flavored
retrieval lives in an optional plugin (`plugins/memory/holographic/`), and even
that is FTS5-backed, not dense-vector.
This is a *defensible* design choice for a terminal-first agent — ripgrep over a
known working tree is fast, dependency-free, and always current — but it means
Hermes "discovers" a codebase cold each task rather than walking in pre-indexed.
## 2. Retrieval is the accuracy driver (empirically)
Vendor evals attribute **~+12.5% answer accuracy** and higher edit-retention to
semantic search alone — same model, better-retrieved context. This is the
empirical proof of the thesis: *a worse model with better context beats a better
model with worse context.*
**Hermes status: approximated, lexically.** Hermes gets the *shape* of this through
aggressive context assembly — `agent/prompt_builder.py` and `agent/system_prompt.py`
inject project context files (AGENTS.md / CLAUDE.md / .cursorrules), and
`agent/subdirectory_hints.py` surfaces local structure. What's missing is *ranked
semantic* retrieval: Hermes finds text by pattern, not by meaning. For
"where is the thing that does X" questions, lexical search is strictly weaker than
embeddings.
## 3. Decoupled apply model — the most underrated trick
Leading products split edits into two stages:
1. **Plan** — the frontier model emits a *terse* edit, often with
`// ... existing code ...` placeholders.
2. **Apply** — a separate, cheap, often self-hosted model turns that sketch into the
final file.
Why bother? Frontier models are *lazy and inaccurate* at large rewrites: they drop
code, emit `...`, "helpfully" reformat unrelated lines, miscount line numbers, and
can trap the agent in retry loops. Three published findings drive the design:
- **Whole-file rewrites beat diffs** for the model, because diffs force fewer output
tokens (less room to "think"), are out-of-distribution (models saw far more whole
files in training), and line numbers are tokenizer poison (a number is one token,
forcing a one-shot commit, and models can't count lines).
- So edits use **search/replace blocks with no line numbers**, with redundant
context lines so the parser tolerates model slips.
- **Speculative decoding** makes apply fast (~1000 tok/s) because the unchanged file
*is* the draft — and because that can't be built into hosted Anthropic/OpenAI
models, vendors train and self-host their own apply model.
**Hermes status: it has a deterministic analog, and it's good.**
`tools/fuzzy_match.py` implements an **8-strategy search/replace matcher** (exact →
line-trimmed → whitespace-normalized → indentation-flexible → escape-normalized →
trimmed-boundary → block-anchor → context-aware-similarity) that is *precisely* the
"tolerate model slips in line-number-free search/replace" idea — just solved with
`difflib.SequenceMatcher` instead of a trained model. `tools/patch_parser.py` and
`tools/file_tools.py` wire it into the `patch` tool. Hermes also already adopts the
correct *interface*: the model emits `old_string`/`new_string`, never line numbers.
Where the vendors go further: a *trained* apply model can reconstruct intent from a
sketch (resolve `// ... existing ...` placeholders against the real file), whereas a
fuzzy matcher can only locate-and-substitute text the model actually wrote. Hermes
trades that capability for zero latency, zero cost, and full determinism — a sound
trade for a local agent, but worth naming.
## 4. Ambient context — the editor's free advantage
Because the product *is* the editor, it injects for free: the open file, **cursor
position**, current selection, **live LSP/linter diagnostics**, and recent diffs.
Terminal agents must spend tool calls reconstructing all of this. More intent-signal
per token → better output from the same model.
**Hermes status: partial.** Hermes injects project context files and has LSP
plumbing (`agent/lsp/`), and the ACP adapter (`acp_adapter/`) gives editors a way to
feed edits/approvals back. What it lacks is the *passive* signal: it doesn't know
where your cursor is or what you've selected, because in a terminal there is no
cursor to read. The ACP integration narrows this gap when Hermes runs inside an
editor, but the default terminal surface is signal-poorer by construction.
## 5. Per-task model routing
Autocomplete uses a tiny fast model; chat uses a frontier model; apply uses the
custom fast model; the agent loop uses a frontier model plus tools. Nothing is
forced through one monolith.
**Hermes status: have it.** `agent/auxiliary_client.py` provides a routed auxiliary
model used for cheaper sub-tasks — title generation (`agent/title_generator.py`),
vision routing (`tools/computer_use/vision_routing.py`), background review
(`agent/background_review.py`), and conversation compression
(`agent/context_compressor.py`, `trajectory_compressor.py`). The pattern — reserve
the expensive model for reasoning, route mechanical sub-tasks to a cheap one — is
already core to Hermes.
---
## Synthesis: where Hermes wins, ties, and trails
**Ties or wins:**
- **Apply reliability** — the 8-strategy fuzzy matcher is a genuinely strong,
zero-cost analog to a trained apply model, and uses the same line-number-free
search/replace interface the research converged on.
- **Model routing** — auxiliary-client routing already reserves the frontier model
for reasoning.
- **Context-file injection & session memory** — robust, and FTS5 session search is a
real recall capability terminal-first.
**Trails:**
- **Semantic codebase retrieval** is the one structural gap. Hermes is lexical
(ripgrep + FTS5) where the leaders are dense-vector with a cheaply-maintained
index. This is the highest-leverage area if Hermes ever wants to close the
"feels like it already knows my repo" gap.
- **Ambient passive context** (cursor/selection/live diagnostics) is inherently
weaker outside an editor; the ACP path is the right place to invest if that
matters.
**The single most transferable insight:** the research independently concluded that
it is *better to rewrite via fuzzy-tolerant, line-number-free search/replace than to
trust the smart model to emit a precise diff* — and Hermes already lands on the same
answer in `tools/fuzzy_match.py`. That convergence is a good sign the harness
fundamentals here are sound; the missing piece is retrieval, not editing.
## Suggested follow-ups (not implemented here — analysis only)
1. **Optional semantic index plugin.** A content-addressed embedding cache keyed by
file hash, behind the existing plugin interface, would give ranked semantic
retrieval without bloating the core terminal path. Merkle delta-sync keeps it
cheap to refresh.
2. **Apply-from-sketch mode.** Let the model emit `// ... existing ...` placeholders
and resolve them against the real file before handing to the fuzzy matcher —
captures most of a trained apply model's benefit deterministically.
3. **Richer ambient context over ACP.** Pipe editor cursor/selection/diagnostics
into prompt assembly when running embedded, closing the passive-signal gap.