Ships a complete offline eval harness at `scripts/compression_eval/`. Runs a real conversation fixture through `ContextCompressor.compress()`, asks the compressor model to answer probe questions from the compressed state, then has a judge model score each answer 0-5 on six dimensions (accuracy, context_awareness, artifact_trail, completeness, continuity, instruction_following). Methodology adapted from Factory's Dec 2025 write-up (https://factory.ai/news/evaluating-compression); the scoreboard framing is not adopted.

Motivation: we edit `context_compressor.py` prompts and `_template_sections` by hand and ship with no automated check that compression still preserves file paths, error codes, or the active task. Until now there has been no signal between "test suite green" and "a user hits a bad summary in production."

**What's shipped**

- `DESIGN.md` — full architecture, fixture/probe format, scrubber pipeline, grading rubric, open follow-ups
- `README.md` — usage, cost expectations, when to run it
- `scrub_fixtures.py` — reproducible pipeline that converts real sessions from `~/.hermes/sessions/*.jsonl` into public-safe JSON fixtures. Applies `agent.redact.redact_sensitive_text` + username path normalisation + personal handle scrubbing + email/git-author normalisation + reasoning scratchpad stripping + platform-mention scrubbing + first-user paraphrase + system-prompt placeholder + orphan-message pruning + 2KB tool-output truncation (a rough sketch of two of these passes appears below)
- `fixtures/` — three scrubbed session snapshots covering three session shapes: feature-impl-context-priority (75 msgs / ~17k tokens), debug-session-feishu-id-model (59 msgs / ~13k tokens), config-build-competitive-scouts (61 msgs / ~23k tokens)
- `probes/` — three probe banks (10-11 probes each) covering all four types (recall/artifact/continuation/decision) with `expected_facts` anchors (PR numbers, file paths, error codes, commands)
- `rubric.py` — six-dimension grading rubric, judge-prompt builder, JSON-with-fallback response parser
- `compressor_driver.py` — thin wrapper around `ContextCompressor` for forced single-shot compression (fixtures are below the default 100k threshold, so we force `compress()` to attribute score deltas to prompt changes, not threshold-fire variance)
- `grader.py` — two-phase continuation + grading calls via the OpenAI SDK directly against the resolved provider endpoint
- `report.py` — markdown report renderer (paste-ready for PR bodies), `--compare-to` delta mode, per-run JSON dumper
- `run_eval.py` — fire-style CLI (`--fixtures`, `--runs`, `--judge-model`, `--compressor-model`, `--label`, `--focus-topic`, `--compare-to`, `--verbose`)
- `tests/scripts/test_compression_eval.py` — 33 hermetic unit tests covering rubric parsing edge cases, judge-prompt building, report rendering, summariser medians, per-run JSON roundtrip, fixture and probe loading, and a PII smoke check on the checked-in fixtures

Non-LLM paths are covered by the 33-test suite that runs in CI. The LLM paths (continuation + grading) require credentials and real API calls, so they're exercised by running the eval itself — not by CI.
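
To make the scrubbing step concrete, here is a rough sketch of two of the passes listed above. These stand-ins are illustrative assumptions, not the shipped code; the real implementations live in `scrub_fixtures.py` and `agent/redact.py`.

```python
import re

# Illustrative stand-ins for two of the scrubbing passes; the shipped scrubber
# chains many more (redact_sensitive_text, handle/email normalisation,
# scratchpad stripping, orphan-message pruning, ...).

def normalise_user_paths(text: str) -> str:
    """Collapse /Users/<name>/... and /home/<name>/... to a neutral placeholder."""
    return re.sub(r"/(?:Users|home)/[^/\s]+", "/home/user", text)

def truncate_tool_output(text: str, limit: int = 2048) -> str:
    """Cap tool output at roughly 2KB, mirroring the scrubber's truncation step."""
    return text if len(text) <= limit else text[:limit] + "\n[... truncated ...]"
```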
**Validation**

- 33/33 unit tests pass in 0.33s via `scripts/run_tests.sh`
- 50/50 adjacent tests (`tests/agent/test_context_compressor.py`) still pass — no regression introduced
- End-to-end dry run against debug-session-feishu-id-model with openai/gpt-5.4-mini via Nous Portal:
  - Compression: 13081 -> 3055 tokens (76.6% reduction), 59 -> 10 messages
  - Overall score: 3.25 (artifact_trail 1.50 is the weak spot, matching Factory's published observation)
  - Specific probe misses surfaced with concrete judge notes

**Noise floor (one empirical data point)**

Same inputs re-run: overall 3.25 -> 3.17 (delta -0.08). Individual dimensions varied by up to ±0.5 between two single-run medians. This confirms the DESIGN.md < 0.3 noise guidance is the right order of magnitude for single-run comparisons; a tighter noise measurement (N=10) is tracked as an open follow-up in DESIGN.md. A sketch of how that guidance applies when comparing runs is included below.

**Why `scripts/` and not `tests/`**

The harness requires API credentials, costs ~$0.50-1.50 per run, takes minutes to execute, and is LLM-graded (non-deterministic). That is incompatible with `scripts/run_tests.sh`, which is hermetic, parallel, and credential-free. `scripts/sample_and_compress.py` is the existing precedent for offline credentialed tooling.

**Open follow-ups (tracked in DESIGN.md, not blocking this PR)**

1. Iterative-merge fixture (two chained compressions on one session)
2. Precise noise-floor measurement at N=10
3. Scripted scrubber helpers to lower the cost of fixture #4+
4. Judge model selection policy (pin vs. per-user)
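
As a purely illustrative sketch of how the < 0.3 noise guidance can be applied when comparing a candidate run against a baseline (this is not `report.py`'s actual code; names and structure are assumptions):

```python
# Illustrative only: names and structure are assumptions, not report.py's API.
NOISE_FLOOR = 0.3  # per DESIGN.md, single-run dimension deltas below this are noise

def meaningful_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return only the per-dimension deltas that clear the noise floor."""
    return {
        dim: round(candidate[dim] - baseline[dim], 2)
        for dim in baseline
        if abs(candidate[dim] - baseline[dim]) >= NOISE_FLOOR
    }

# The re-run above moved the overall score 3.25 -> 3.17, a delta of -0.08,
# well inside the +/-0.3 band, so it would not count as a real change.
```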
# compression_eval

Offline eval harness for `agent/context_compressor.py`. Runs a real
conversation transcript through the compressor, then probes the
compressed state with targeted questions graded on six dimensions.
## When to run

Before merging changes to:

- `agent/context_compressor.py` — any change to `_template_sections`,
  `_generate_summary`, `compress()`, or its boundary logic
- `agent/auxiliary_client.py` — when changing how compression tasks
  are routed
- `agent/prompt_builder.py` — when the compression-note phrasing
  changes
## Not for CI

This harness makes real model calls (compressor + continuation +
judge = ~3 calls per probe × probes per fixture × runs, on the order
of a few hundred calls for the default three-fixture, three-run sweep).
It costs ~$0.50 to ~$1.50 per full run depending on models, takes
minutes, and is LLM-graded (non-deterministic). It lives in `scripts/`
and is invoked by hand. `tests/` and `scripts/run_tests.sh` do not touch it.

`tests/scripts/test_compression_eval.py` covers the non-LLM code
paths (rubric parsing, report rendering, fixture/probe loading, PII
smoke check on the checked-in fixtures) and DOES run in CI.
## Usage

```bash
# Run all three fixtures, 3 runs each, with your configured provider
python3 scripts/compression_eval/run_eval.py

# Faster iteration — one fixture, one run
python3 scripts/compression_eval/run_eval.py \
  --fixtures=debug-session-feishu-id-model --runs=1

# Pin a cheap model for both compression + judge (recommended)
python3 scripts/compression_eval/run_eval.py \
  --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
  --judge-provider=nous --judge-model=openai/gpt-5.4-mini \
  --runs=3 --label=baseline

# After editing context_compressor.py, rerun with a new label and diff
python3 scripts/compression_eval/run_eval.py \
  --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
  --judge-provider=nous --judge-model=openai/gpt-5.4-mini \
  --runs=3 --label=my-prompt-tweak \
  --compare-to=results/baseline
```
Results land in `results/<label>/report.md` and are intended to be
pasted verbatim into PR descriptions. `--compare-to` renders a delta
column per dimension so reviewers can see "did this actually help?"
at a glance.

Rule of thumb: dimension deltas below ±0.3 are within run-to-run
noise on `runs=3`. Publish a bigger N if you want tighter bounds.
## Fixtures

Three scrubbed session snapshots live under `fixtures/`:

- `feature-impl-context-priority.json` — 75 msgs, investigate →
  patch → test → PR → merge
- `debug-session-feishu-id-model.json` — 59 msgs, PR triage +
  upstream docs + decision
- `config-build-competitive-scouts.json` — 61 msgs, iterative
  config accumulation (11 cron jobs)

Regenerate them from the maintainer's `~/.hermes/sessions/*.jsonl`
with `python3 scripts/compression_eval/scrub_fixtures.py`. The
scrubber pipeline and PII-audit checklist are documented in
`DESIGN.md` under **Scrubber pipeline**.
## Probes

One probe bank per fixture under `probes/`, 10-11 probes each,
covering all four types: **recall**, **artifact**, **continuation**,
**decision**. Each probe carries an `expected_facts` list of concrete
anchors (PR numbers, file paths, error codes, commands run) that the
judge sees alongside the assistant's answer.
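
A probe entry looks roughly like the sketch below. Field names are illustrative assumptions; the actual schema is documented in `DESIGN.md`.

```python
# Illustrative probe shape; not the exact schema used in probes/.
probe = {
    "id": "artifact-03",
    "type": "artifact",  # one of: recall, artifact, continuation, decision
    "question": "Which files were changed before the PR was opened?",
    "expected_facts": [  # concrete anchors the judge compares the answer against
        "agent/context_compressor.py",
        "tests/agent/test_context_compressor.py",
    ],
}
```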
## How it scores

Six dimensions, 0-5 per probe:

| Dimension             | What it measures                                      |
|-----------------------|-------------------------------------------------------|
| accuracy              | File paths, function names, PR/issue numbers correct  |
| context_awareness     | Reflects current session state, not a snapshot        |
| artifact_trail        | Correctly enumerates files / commands / PRs           |
| completeness          | Addresses ALL parts of the probe                      |
| continuity            | Next assistant could continue without re-fetching     |
| instruction_following | Answer in the requested form                          |

Report renders medians across N runs; probes scoring below 3.0
overall surface in a separate section with the judge's specific
complaint noted inline.
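
Conceptually, the aggregation works like the minimal sketch below, assuming per-run scores keyed by probe and dimension; `report.py` is the source of truth and may differ in detail.

```python
from statistics import median

def flag_weak_probes(scores: dict[str, dict[str, list[float]]], threshold: float = 3.0):
    """scores[probe_id][dimension] holds the 0-5 judge scores across N runs.

    Returns probes whose overall median falls below the threshold. Illustrative
    only: the 'overall = median across dimension medians' step is an assumption.
    """
    flagged = []
    for probe_id, dims in scores.items():
        per_dim = {dim: median(runs) for dim, runs in dims.items()}
        overall = median(per_dim.values())
        if overall < threshold:
            flagged.append((probe_id, per_dim))
    return flagged
```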
## Related

- `agent/context_compressor.py` — the thing under test
- `tests/agent/test_context_compressor.py` — structural unit tests
  that do run in CI
- `scripts/sample_and_compress.py` — the closest existing script in
  shape (offline, credential-requiring, not in CI)
- `DESIGN.md` — full architecture + methodology + open follow-ups