mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-10 20:29:00 +08:00
Compare commits
1 Commits
feat/deskt
...
bb/coding-
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
d459868776 |
179
docs/analysis/coding-agent-harness.md
Normal file
179
docs/analysis/coding-agent-harness.md
Normal file
@@ -0,0 +1,179 @@
|
||||
# Why IDE-Embedded Coding Agents Feel Better — and Where Hermes Stands
|
||||
|
||||
A study of the *harness* techniques that make in-editor coding agents punch above
|
||||
the raw model, and an honest map of which ones Hermes already implements, which it
|
||||
approximates, and which it lacks.
|
||||
|
||||
## TL;DR
|
||||
|
||||
The leading IDE-embedded coding products are **not better models** — they call the
|
||||
same frontier models (Claude, GPT) that power terminal agents like this one. Their
|
||||
edge comes entirely from the **harness**: how context is retrieved and assembled,
|
||||
how mechanical edits are applied reliably, and how cheap specialized models are
|
||||
routed in for sub-tasks. The model is a commodity; the meal it's fed is not.
|
||||
|
||||
This document breaks the advantage into five concrete subsystems, each backed by
|
||||
published engineering from the vendors, and maps each onto the Hermes codebase.
|
||||
|
||||
| # | Technique | What it buys | Hermes status |
|
||||
|---|-----------|--------------|---------------|
|
||||
| 1 | Indexed semantic retrieval (Merkle delta-sync + content-addressed embedding cache) | Knows the repo in ms; feeds the *right* snippets | ❌ **Gap** (grep/FTS5 only, no vector index) |
|
||||
| 2 | Retrieval as the accuracy driver | +~12.5% answer accuracy (vendor eval) | ⚠️ **Approximated** (lexical search, not semantic) |
|
||||
| 3 | Decoupled "apply" model + line-number-free search/replace | Frontier model only *reasons*; mechanical patching never botches the file | ✅ **Have an analog** (`tools/fuzzy_match.py`) |
|
||||
| 4 | Ambient IDE context (cursor pos, selection, live diagnostics) | More intent-signal per token | ⚠️ **Partial** (context files + LSP, no live cursor/selection) |
|
||||
| 5 | Per-task model routing (tiny model for autocomplete, frontier for reasoning) | Right tool per job | ✅ **Have** (`agent/auxiliary_client.py`) |
|
||||
|
||||
---
|
||||
|
||||
## 1. Indexed semantic retrieval — the actual moat
|
||||
|
||||
The headline trick is *not* the system prompt. It's a vector index over the repo
|
||||
that is kept fresh cheaply:
|
||||
|
||||
- **Merkle tree for change detection.** Every file is SHA-256 hashed; folder hashes
|
||||
derive from children; the root summarizes the repo. An edit changes only that
|
||||
file's hash plus the path to the root, so the indexer walks **only the branches
|
||||
that differ** instead of rescanning. (This is git's own content-addressing trick
|
||||
repurposed for indexing.)
|
||||
- **Syntax-aware chunking.** Changed files are split on function/class boundaries,
|
||||
not arbitrary token windows, then embedded.
|
||||
- **Content-addressed embedding cache.** Embeddings are keyed by the hash of the
|
||||
chunk content. Re-indexing unchanged code is a cache hit → zero embedding cost.
|
||||
Embedding is the expensive step, so this is the whole ballgame for speed.
|
||||
- **Cross-clone index reuse.** Vendors observe that clones of one repo are ~92%
|
||||
identical across an org; a "simhash" lets a new clone reuse a teammate's index,
|
||||
collapsing time-to-first-query from hours (99th pct) to seconds. Access is gated
|
||||
cryptographically: you can only compute a Merkle node's hash if you actually hold
|
||||
the file, so results you can't *prove* you possess are dropped.
|
||||
|
||||
**Hermes status: this is the real gap.** Core Hermes has no vector index, no
|
||||
embedding store, no Merkle delta-sync. Codebase awareness is achieved at task time
|
||||
via lexical tools (`search_files` → ripgrep) and session recall via SQLite FTS5
|
||||
(`hermes_state.py`, `tools/session_search_tool.py`). The only embedding-flavored
|
||||
retrieval lives in an optional plugin (`plugins/memory/holographic/`), and even
|
||||
that is FTS5-backed, not dense-vector.
|
||||
|
||||
This is a *defensible* design choice for a terminal-first agent — ripgrep over a
|
||||
known working tree is fast, dependency-free, and always current — but it means
|
||||
Hermes "discovers" a codebase cold each task rather than walking in pre-indexed.
|
||||
|
||||
## 2. Retrieval is the accuracy driver (empirically)
|
||||
|
||||
Vendor evals attribute **~+12.5% answer accuracy** and higher edit-retention to
|
||||
semantic search alone — same model, better-retrieved context. This is the
|
||||
empirical proof of the thesis: *a worse model with better context beats a better
|
||||
model with worse context.*
|
||||
|
||||
**Hermes status: approximated, lexically.** Hermes gets the *shape* of this through
|
||||
aggressive context assembly — `agent/prompt_builder.py` and `agent/system_prompt.py`
|
||||
inject project context files (AGENTS.md / CLAUDE.md / .cursorrules), and
|
||||
`agent/subdirectory_hints.py` surfaces local structure. What's missing is *ranked
|
||||
semantic* retrieval: Hermes finds text by pattern, not by meaning. For
|
||||
"where is the thing that does X" questions, lexical search is strictly weaker than
|
||||
embeddings.
|
||||
|
||||
## 3. Decoupled apply model — the most underrated trick
|
||||
|
||||
Leading products split edits into two stages:
|
||||
|
||||
1. **Plan** — the frontier model emits a *terse* edit, often with
|
||||
`// ... existing code ...` placeholders.
|
||||
2. **Apply** — a separate, cheap, often self-hosted model turns that sketch into the
|
||||
final file.
|
||||
|
||||
Why bother? Frontier models are *lazy and inaccurate* at large rewrites: they drop
|
||||
code, emit `...`, "helpfully" reformat unrelated lines, miscount line numbers, and
|
||||
can trap the agent in retry loops. Three published findings drive the design:
|
||||
|
||||
- **Whole-file rewrites beat diffs** for the model, because diffs force fewer output
|
||||
tokens (less room to "think"), are out-of-distribution (models saw far more whole
|
||||
files in training), and line numbers are tokenizer poison (a number is one token,
|
||||
forcing a one-shot commit, and models can't count lines).
|
||||
- So edits use **search/replace blocks with no line numbers**, with redundant
|
||||
context lines so the parser tolerates model slips.
|
||||
- **Speculative decoding** makes apply fast (~1000 tok/s) because the unchanged file
|
||||
*is* the draft — and because that can't be built into hosted Anthropic/OpenAI
|
||||
models, vendors train and self-host their own apply model.
|
||||
|
||||
**Hermes status: it has a deterministic analog, and it's good.**
|
||||
`tools/fuzzy_match.py` implements an **8-strategy search/replace matcher** (exact →
|
||||
line-trimmed → whitespace-normalized → indentation-flexible → escape-normalized →
|
||||
trimmed-boundary → block-anchor → context-aware-similarity) that is *precisely* the
|
||||
"tolerate model slips in line-number-free search/replace" idea — just solved with
|
||||
`difflib.SequenceMatcher` instead of a trained model. `tools/patch_parser.py` and
|
||||
`tools/file_tools.py` wire it into the `patch` tool. Hermes also already adopts the
|
||||
correct *interface*: the model emits `old_string`/`new_string`, never line numbers.
|
||||
|
||||
Where the vendors go further: a *trained* apply model can reconstruct intent from a
|
||||
sketch (resolve `// ... existing ...` placeholders against the real file), whereas a
|
||||
fuzzy matcher can only locate-and-substitute text the model actually wrote. Hermes
|
||||
trades that capability for zero latency, zero cost, and full determinism — a sound
|
||||
trade for a local agent, but worth naming.
|
||||
|
||||
## 4. Ambient context — the editor's free advantage
|
||||
|
||||
Because the product *is* the editor, it injects for free: the open file, **cursor
|
||||
position**, current selection, **live LSP/linter diagnostics**, and recent diffs.
|
||||
Terminal agents must spend tool calls reconstructing all of this. More intent-signal
|
||||
per token → better output from the same model.
|
||||
|
||||
**Hermes status: partial.** Hermes injects project context files and has LSP
|
||||
plumbing (`agent/lsp/`), and the ACP adapter (`acp_adapter/`) gives editors a way to
|
||||
feed edits/approvals back. What it lacks is the *passive* signal: it doesn't know
|
||||
where your cursor is or what you've selected, because in a terminal there is no
|
||||
cursor to read. The ACP integration narrows this gap when Hermes runs inside an
|
||||
editor, but the default terminal surface is signal-poorer by construction.
|
||||
|
||||
## 5. Per-task model routing
|
||||
|
||||
Autocomplete uses a tiny fast model; chat uses a frontier model; apply uses the
|
||||
custom fast model; the agent loop uses a frontier model plus tools. Nothing is
|
||||
forced through one monolith.
|
||||
|
||||
**Hermes status: have it.** `agent/auxiliary_client.py` provides a routed auxiliary
|
||||
model used for cheaper sub-tasks — title generation (`agent/title_generator.py`),
|
||||
vision routing (`tools/computer_use/vision_routing.py`), background review
|
||||
(`agent/background_review.py`), and conversation compression
|
||||
(`agent/context_compressor.py`, `trajectory_compressor.py`). The pattern — reserve
|
||||
the expensive model for reasoning, route mechanical sub-tasks to a cheap one — is
|
||||
already core to Hermes.
|
||||
|
||||
---
|
||||
|
||||
## Synthesis: where Hermes wins, ties, and trails
|
||||
|
||||
**Ties or wins:**
|
||||
- **Apply reliability** — the 8-strategy fuzzy matcher is a genuinely strong,
|
||||
zero-cost analog to a trained apply model, and uses the same line-number-free
|
||||
search/replace interface the research converged on.
|
||||
- **Model routing** — auxiliary-client routing already reserves the frontier model
|
||||
for reasoning.
|
||||
- **Context-file injection & session memory** — robust, and FTS5 session search is a
|
||||
real recall capability terminal-first.
|
||||
|
||||
**Trails:**
|
||||
- **Semantic codebase retrieval** is the one structural gap. Hermes is lexical
|
||||
(ripgrep + FTS5) where the leaders are dense-vector with a cheaply-maintained
|
||||
index. This is the highest-leverage area if Hermes ever wants to close the
|
||||
"feels like it already knows my repo" gap.
|
||||
- **Ambient passive context** (cursor/selection/live diagnostics) is inherently
|
||||
weaker outside an editor; the ACP path is the right place to invest if that
|
||||
matters.
|
||||
|
||||
**The single most transferable insight:** the research independently concluded that
|
||||
it is *better to rewrite via fuzzy-tolerant, line-number-free search/replace than to
|
||||
trust the smart model to emit a precise diff* — and Hermes already lands on the same
|
||||
answer in `tools/fuzzy_match.py`. That convergence is a good sign the harness
|
||||
fundamentals here are sound; the missing piece is retrieval, not editing.
|
||||
|
||||
## Suggested follow-ups (not implemented here — analysis only)
|
||||
|
||||
1. **Optional semantic index plugin.** A content-addressed embedding cache keyed by
|
||||
file hash, behind the existing plugin interface, would give ranked semantic
|
||||
retrieval without bloating the core terminal path. Merkle delta-sync keeps it
|
||||
cheap to refresh.
|
||||
2. **Apply-from-sketch mode.** Let the model emit `// ... existing ...` placeholders
|
||||
and resolve them against the real file before handing to the fuzzy matcher —
|
||||
captures most of a trained apply model's benefit deterministically.
|
||||
3. **Richer ambient context over ACP.** Pipe editor cursor/selection/diagnostics
|
||||
into prompt assembly when running embedded, closing the passive-signal gap.
|
||||
Reference in New Issue
Block a user