mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-07-05 01:27:52 +08:00
Compare commits
1 Commits
feat/apify
...
bb/coding-
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
d459868776 |
179
docs/analysis/coding-agent-harness.md
Normal file
179
docs/analysis/coding-agent-harness.md
Normal file
@@ -0,0 +1,179 @@
|
|||||||
|
# Why IDE-Embedded Coding Agents Feel Better — and Where Hermes Stands
|
||||||
|
|
||||||
|
A study of the *harness* techniques that make in-editor coding agents punch above
|
||||||
|
the raw model, and an honest map of which ones Hermes already implements, which it
|
||||||
|
approximates, and which it lacks.
|
||||||
|
|
||||||
|
## TL;DR
|
||||||
|
|
||||||
|
The leading IDE-embedded coding products are **not better models** — they call the
|
||||||
|
same frontier models (Claude, GPT) that power terminal agents like this one. Their
|
||||||
|
edge comes entirely from the **harness**: how context is retrieved and assembled,
|
||||||
|
how mechanical edits are applied reliably, and how cheap specialized models are
|
||||||
|
routed in for sub-tasks. The model is a commodity; the meal it's fed is not.
|
||||||
|
|
||||||
|
This document breaks the advantage into five concrete subsystems, each backed by
|
||||||
|
published engineering from the vendors, and maps each onto the Hermes codebase.
|
||||||
|
|
||||||
|
| # | Technique | What it buys | Hermes status |
|
||||||
|
|---|-----------|--------------|---------------|
|
||||||
|
| 1 | Indexed semantic retrieval (Merkle delta-sync + content-addressed embedding cache) | Knows the repo in ms; feeds the *right* snippets | ❌ **Gap** (grep/FTS5 only, no vector index) |
|
||||||
|
| 2 | Retrieval as the accuracy driver | +~12.5% answer accuracy (vendor eval) | ⚠️ **Approximated** (lexical search, not semantic) |
|
||||||
|
| 3 | Decoupled "apply" model + line-number-free search/replace | Frontier model only *reasons*; mechanical patching never botches the file | ✅ **Have an analog** (`tools/fuzzy_match.py`) |
|
||||||
|
| 4 | Ambient IDE context (cursor pos, selection, live diagnostics) | More intent-signal per token | ⚠️ **Partial** (context files + LSP, no live cursor/selection) |
|
||||||
|
| 5 | Per-task model routing (tiny model for autocomplete, frontier for reasoning) | Right tool per job | ✅ **Have** (`agent/auxiliary_client.py`) |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Indexed semantic retrieval — the actual moat
|
||||||
|
|
||||||
|
The headline trick is *not* the system prompt. It's a vector index over the repo
|
||||||
|
that is kept fresh cheaply:
|
||||||
|
|
||||||
|
- **Merkle tree for change detection.** Every file is SHA-256 hashed; folder hashes
|
||||||
|
derive from children; the root summarizes the repo. An edit changes only that
|
||||||
|
file's hash plus the path to the root, so the indexer walks **only the branches
|
||||||
|
that differ** instead of rescanning. (This is git's own content-addressing trick
|
||||||
|
repurposed for indexing.)
|
||||||
|
- **Syntax-aware chunking.** Changed files are split on function/class boundaries,
|
||||||
|
not arbitrary token windows, then embedded.
|
||||||
|
- **Content-addressed embedding cache.** Embeddings are keyed by the hash of the
|
||||||
|
chunk content. Re-indexing unchanged code is a cache hit → zero embedding cost.
|
||||||
|
Embedding is the expensive step, so this is the whole ballgame for speed.
|
||||||
|
- **Cross-clone index reuse.** Vendors observe that clones of one repo are ~92%
|
||||||
|
identical across an org; a "simhash" lets a new clone reuse a teammate's index,
|
||||||
|
collapsing time-to-first-query from hours (99th pct) to seconds. Access is gated
|
||||||
|
cryptographically: you can only compute a Merkle node's hash if you actually hold
|
||||||
|
the file, so results you can't *prove* you possess are dropped.
|
||||||
|
|
||||||
|
**Hermes status: this is the real gap.** Core Hermes has no vector index, no
|
||||||
|
embedding store, no Merkle delta-sync. Codebase awareness is achieved at task time
|
||||||
|
via lexical tools (`search_files` → ripgrep) and session recall via SQLite FTS5
|
||||||
|
(`hermes_state.py`, `tools/session_search_tool.py`). The only embedding-flavored
|
||||||
|
retrieval lives in an optional plugin (`plugins/memory/holographic/`), and even
|
||||||
|
that is FTS5-backed, not dense-vector.
|
||||||
|
|
||||||
|
This is a *defensible* design choice for a terminal-first agent — ripgrep over a
|
||||||
|
known working tree is fast, dependency-free, and always current — but it means
|
||||||
|
Hermes "discovers" a codebase cold each task rather than walking in pre-indexed.
|
||||||
|
|
||||||
|
## 2. Retrieval is the accuracy driver (empirically)
|
||||||
|
|
||||||
|
Vendor evals attribute **~+12.5% answer accuracy** and higher edit-retention to
|
||||||
|
semantic search alone — same model, better-retrieved context. This is the
|
||||||
|
empirical proof of the thesis: *a worse model with better context beats a better
|
||||||
|
model with worse context.*
|
||||||
|
|
||||||
|
**Hermes status: approximated, lexically.** Hermes gets the *shape* of this through
|
||||||
|
aggressive context assembly — `agent/prompt_builder.py` and `agent/system_prompt.py`
|
||||||
|
inject project context files (AGENTS.md / CLAUDE.md / .cursorrules), and
|
||||||
|
`agent/subdirectory_hints.py` surfaces local structure. What's missing is *ranked
|
||||||
|
semantic* retrieval: Hermes finds text by pattern, not by meaning. For
|
||||||
|
"where is the thing that does X" questions, lexical search is strictly weaker than
|
||||||
|
embeddings.
|
||||||
|
|
||||||
|
## 3. Decoupled apply model — the most underrated trick
|
||||||
|
|
||||||
|
Leading products split edits into two stages:
|
||||||
|
|
||||||
|
1. **Plan** — the frontier model emits a *terse* edit, often with
|
||||||
|
`// ... existing code ...` placeholders.
|
||||||
|
2. **Apply** — a separate, cheap, often self-hosted model turns that sketch into the
|
||||||
|
final file.
|
||||||
|
|
||||||
|
Why bother? Frontier models are *lazy and inaccurate* at large rewrites: they drop
|
||||||
|
code, emit `...`, "helpfully" reformat unrelated lines, miscount line numbers, and
|
||||||
|
can trap the agent in retry loops. Three published findings drive the design:
|
||||||
|
|
||||||
|
- **Whole-file rewrites beat diffs** for the model, because diffs force fewer output
|
||||||
|
tokens (less room to "think"), are out-of-distribution (models saw far more whole
|
||||||
|
files in training), and line numbers are tokenizer poison (a number is one token,
|
||||||
|
forcing a one-shot commit, and models can't count lines).
|
||||||
|
- So edits use **search/replace blocks with no line numbers**, with redundant
|
||||||
|
context lines so the parser tolerates model slips.
|
||||||
|
- **Speculative decoding** makes apply fast (~1000 tok/s) because the unchanged file
|
||||||
|
*is* the draft — and because that can't be built into hosted Anthropic/OpenAI
|
||||||
|
models, vendors train and self-host their own apply model.
|
||||||
|
|
||||||
|
**Hermes status: it has a deterministic analog, and it's good.**
|
||||||
|
`tools/fuzzy_match.py` implements an **8-strategy search/replace matcher** (exact →
|
||||||
|
line-trimmed → whitespace-normalized → indentation-flexible → escape-normalized →
|
||||||
|
trimmed-boundary → block-anchor → context-aware-similarity) that is *precisely* the
|
||||||
|
"tolerate model slips in line-number-free search/replace" idea — just solved with
|
||||||
|
`difflib.SequenceMatcher` instead of a trained model. `tools/patch_parser.py` and
|
||||||
|
`tools/file_tools.py` wire it into the `patch` tool. Hermes also already adopts the
|
||||||
|
correct *interface*: the model emits `old_string`/`new_string`, never line numbers.
|
||||||
|
|
||||||
|
Where the vendors go further: a *trained* apply model can reconstruct intent from a
|
||||||
|
sketch (resolve `// ... existing ...` placeholders against the real file), whereas a
|
||||||
|
fuzzy matcher can only locate-and-substitute text the model actually wrote. Hermes
|
||||||
|
trades that capability for zero latency, zero cost, and full determinism — a sound
|
||||||
|
trade for a local agent, but worth naming.
|
||||||
|
|
||||||
|
## 4. Ambient context — the editor's free advantage
|
||||||
|
|
||||||
|
Because the product *is* the editor, it injects for free: the open file, **cursor
|
||||||
|
position**, current selection, **live LSP/linter diagnostics**, and recent diffs.
|
||||||
|
Terminal agents must spend tool calls reconstructing all of this. More intent-signal
|
||||||
|
per token → better output from the same model.
|
||||||
|
|
||||||
|
**Hermes status: partial.** Hermes injects project context files and has LSP
|
||||||
|
plumbing (`agent/lsp/`), and the ACP adapter (`acp_adapter/`) gives editors a way to
|
||||||
|
feed edits/approvals back. What it lacks is the *passive* signal: it doesn't know
|
||||||
|
where your cursor is or what you've selected, because in a terminal there is no
|
||||||
|
cursor to read. The ACP integration narrows this gap when Hermes runs inside an
|
||||||
|
editor, but the default terminal surface is signal-poorer by construction.
|
||||||
|
|
||||||
|
## 5. Per-task model routing
|
||||||
|
|
||||||
|
Autocomplete uses a tiny fast model; chat uses a frontier model; apply uses the
|
||||||
|
custom fast model; the agent loop uses a frontier model plus tools. Nothing is
|
||||||
|
forced through one monolith.
|
||||||
|
|
||||||
|
**Hermes status: have it.** `agent/auxiliary_client.py` provides a routed auxiliary
|
||||||
|
model used for cheaper sub-tasks — title generation (`agent/title_generator.py`),
|
||||||
|
vision routing (`tools/computer_use/vision_routing.py`), background review
|
||||||
|
(`agent/background_review.py`), and conversation compression
|
||||||
|
(`agent/context_compressor.py`, `trajectory_compressor.py`). The pattern — reserve
|
||||||
|
the expensive model for reasoning, route mechanical sub-tasks to a cheap one — is
|
||||||
|
already core to Hermes.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Synthesis: where Hermes wins, ties, and trails
|
||||||
|
|
||||||
|
**Ties or wins:**
|
||||||
|
- **Apply reliability** — the 8-strategy fuzzy matcher is a genuinely strong,
|
||||||
|
zero-cost analog to a trained apply model, and uses the same line-number-free
|
||||||
|
search/replace interface the research converged on.
|
||||||
|
- **Model routing** — auxiliary-client routing already reserves the frontier model
|
||||||
|
for reasoning.
|
||||||
|
- **Context-file injection & session memory** — robust, and FTS5 session search is a
|
||||||
|
real recall capability terminal-first.
|
||||||
|
|
||||||
|
**Trails:**
|
||||||
|
- **Semantic codebase retrieval** is the one structural gap. Hermes is lexical
|
||||||
|
(ripgrep + FTS5) where the leaders are dense-vector with a cheaply-maintained
|
||||||
|
index. This is the highest-leverage area if Hermes ever wants to close the
|
||||||
|
"feels like it already knows my repo" gap.
|
||||||
|
- **Ambient passive context** (cursor/selection/live diagnostics) is inherently
|
||||||
|
weaker outside an editor; the ACP path is the right place to invest if that
|
||||||
|
matters.
|
||||||
|
|
||||||
|
**The single most transferable insight:** the research independently concluded that
|
||||||
|
it is *better to rewrite via fuzzy-tolerant, line-number-free search/replace than to
|
||||||
|
trust the smart model to emit a precise diff* — and Hermes already lands on the same
|
||||||
|
answer in `tools/fuzzy_match.py`. That convergence is a good sign the harness
|
||||||
|
fundamentals here are sound; the missing piece is retrieval, not editing.
|
||||||
|
|
||||||
|
## Suggested follow-ups (not implemented here — analysis only)
|
||||||
|
|
||||||
|
1. **Optional semantic index plugin.** A content-addressed embedding cache keyed by
|
||||||
|
file hash, behind the existing plugin interface, would give ranked semantic
|
||||||
|
retrieval without bloating the core terminal path. Merkle delta-sync keeps it
|
||||||
|
cheap to refresh.
|
||||||
|
2. **Apply-from-sketch mode.** Let the model emit `// ... existing ...` placeholders
|
||||||
|
and resolve them against the real file before handing to the fuzzy matcher —
|
||||||
|
captures most of a trained apply model's benefit deterministically.
|
||||||
|
3. **Richer ambient context over ACP.** Pipe editor cursor/selection/diagnostics
|
||||||
|
into prompt assembly when running embedded, closing the passive-signal gap.
|
||||||
Reference in New Issue
Block a user