Files
hermes-agent/scripts/compression_eval/probes/feature-impl-context-priority.probes.json
Teknium 1e6285c53d feat: compression eval harness for agent/context_compressor.py
Ships a complete offline eval harness at scripts/compression_eval/. Runs
a real conversation fixture through ContextCompressor.compress(), asks
the compressor model to answer probe questions from the compressed
state, then has a judge model score each answer 0-5 on six dimensions
(accuracy, context_awareness, artifact_trail, completeness, continuity,
instruction_following). Methodology adapted from Factory's Dec 2025
write-up (https://factory.ai/news/evaluating-compression); the
scoreboard framing is not adopted.
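
In outline, the loop looks like the sketch below. Names here are
illustrative rather than the harness's actual API; the real wiring
lives in compressor_driver.py, grader.py, and rubric.py.

    DIMENSIONS = ["accuracy", "context_awareness", "artifact_trail",
                  "completeness", "continuity", "instruction_following"]

    def eval_fixture(fixture, probes, compressor, answer_fn, judge_fn):
        # Force a single-shot compression of the full fixture transcript.
        compressed = compressor.compress(fixture["messages"])
        results = []
        for probe in probes:
            # The answering model sees only the compressed state, never
            # the original transcript.
            answer = answer_fn(context=compressed, question=probe["question"])
            # The judge scores the answer 0-5 per dimension, using the
            # probe's expected_facts as grading anchors.
            results.append(judge_fn(probe, answer, dimensions=DIMENSIONS))
        return results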

Motivation: we edit context_compressor.py prompts and _template_sections
by hand and ship with no automated check that compression still
preserves file paths, error codes, or the active task. Until now there
has been no signal between 'test suite green' and 'a user hits a bad
summary in production.'

What's shipped
- DESIGN.md — full architecture, fixture/probe format, scrubber
  pipeline, grading rubric, open follow-ups
- README.md — usage, cost expectations, when to run it
- scrub_fixtures.py — reproducible pipeline that converts real sessions
  from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures. Applies
  agent.redact.redact_sensitive_text + username path normalisation +
  personal handle scrubbing + email/git-author normalisation + reasoning
  scratchpad stripping + platform-mention scrubbing + first-user
  paraphrase + system-prompt placeholder + orphan-message pruning + 2KB
  tool-output truncation (two of these stages are sketched after
  this list)
- fixtures/ — three scrubbed session snapshots covering three session
  shapes:
    feature-impl-context-priority  (75 msgs / ~17k tokens)
    debug-session-feishu-id-model  (59 msgs / ~13k tokens)
    config-build-competitive-scouts (61 msgs / ~23k tokens)
- probes/ — three probe banks (10-11 probes each) covering all four
  types (recall/artifact/continuation/decision) with expected_facts
  anchors (PR numbers, file paths, error codes, commands)
- rubric.py — six-dimension grading rubric, judge-prompt builder,
  JSON-with-fallback response parser (a sketch of the fallback parse
  follows this list)
- compressor_driver.py — thin wrapper around ContextCompressor for
  forced single-shot compression (the fixtures sit below the default
  100k-token threshold, so compress() is invoked unconditionally;
  score deltas are then attributable to prompt changes rather than
  threshold-fire variance)
- grader.py — two-phase continuation + grading calls via the OpenAI
  SDK directly against the resolved provider endpoint
- report.py — markdown report renderer (paste-ready for PR bodies),
  --compare-to delta mode, per-run JSON dumper
- run_eval.py — fire-style CLI (--fixtures, --runs, --judge-model,
  --compressor-model, --label, --focus-topic, --compare-to, --verbose)
- tests/scripts/test_compression_eval.py — 33 hermetic unit tests
  covering rubric parsing edge cases, judge-prompt building, report
  rendering, summariser medians, per-run JSON roundtrip, fixture and
  probe loading, and a PII smoke check on the checked-in fixtures
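
Two of the scrub_fixtures.py stages, sketched as pure text transforms
(the regex and the stage composition are illustrative; the real stages
may differ):

    import re
    from functools import reduce

    def normalise_user_paths(text: str) -> str:
        # /home/<name>/... or /Users/<name>/... -> /home/user/...
        return re.sub(r"/(home|Users)/[^/\s]+", r"/\1/user", text)

    def truncate_tool_output(text: str, limit: int = 2048) -> str:
        # The 2KB tool-output cap from the list above.
        return text if len(text) <= limit else text[:limit] + "\n[truncated]"

    def scrub(text: str, stages=(normalise_user_paths, truncate_tool_output)) -> str:
        # Each stage is text -> text, so the pipeline is a simple fold.
        return reduce(lambda t, stage: stage(t), stages, text)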
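
And a plausible shape for rubric.py's JSON-with-fallback parser (the
actual fallback order is whatever rubric.py implements; this is a
sketch):

    import json
    import re

    def parse_judge_scores(raw: str) -> dict[str, float]:
        # Happy path: the judge returned bare JSON.
        try:
            return {k: float(v) for k, v in json.loads(raw).items()}
        except (json.JSONDecodeError, AttributeError, TypeError, ValueError):
            pass
        # Fallback 1: a JSON object embedded in surrounding prose.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            try:
                return {k: float(v) for k, v in json.loads(match.group(0)).items()}
            except (json.JSONDecodeError, TypeError, ValueError):
                pass
        # Fallback 2: scrape "dimension: 4.5"-style lines.
        pairs = re.findall(r"(\w+)\s*[:=]\s*([0-5](?:\.\d+)?)", raw)
        return {dim: float(val) for dim, val in pairs}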

Non-LLM paths are covered by the 33-test suite that runs in CI. The
LLM paths (continuation + grading) require credentials and real API
calls, so they're exercised by running the eval itself — not by CI.
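
A typical credentialed invocation (flags from run_eval.py above; the
baseline path is illustrative):

    python scripts/compression_eval/run_eval.py \
        --fixtures debug-session-feishu-id-model \
        --runs 1 --label prompt-tweak \
        --compare-to eval_runs/baseline.json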

Validation
- 33/33 unit tests pass in 0.33s via scripts/run_tests.sh
- 50/50 adjacent tests (tests/agent/test_context_compressor.py) still
  pass — no regression introduced
- End-to-end dry run against debug-session-feishu-id-model with
  openai/gpt-5.4-mini via Nous Portal:
    Compression: 13081 -> 3055 tokens (76.6% reduction), 59 -> 10 messages
    Overall score: 3.25 (artifact_trail 1.50 is the weak spot,
    matching Factory's published observation)
    Specific probe misses surfaced with concrete judge notes

Noise floor (one empirical data point)
Same inputs re-run: overall 3.25 -> 3.17 (delta -0.08). Individual
dimensions varied by up to ±0.5 between two single-run medians. This
is consistent with the < 0.3 single-run noise guidance in DESIGN.md,
at least as an order of magnitude. A tighter noise measurement (N=10)
is tracked as an open follow-up in DESIGN.md.
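
What the N=10 measurement amounts to, as a sketch (hypothetical
helper, not current harness code):

    from statistics import median, pstdev

    def noise_floor(runs: list[dict[str, float]]) -> dict[str, tuple[float, float]]:
        # Per-dimension median and spread across N repeated runs on
        # identical inputs; the spread is the noise floor a prompt
        # change must clear to count as a real delta.
        dims = runs[0].keys()
        return {d: (median(r[d] for r in runs),
                    pstdev(r[d] for r in runs))
                for d in dims}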

Why scripts/ and not tests/
Requires API credentials, costs ~$0.50-1.50 per run, takes minutes to
execute, and is LLM-graded (non-deterministic). That makes it
incompatible with scripts/run_tests.sh, which is hermetic, parallel,
and credential-free.
scripts/sample_and_compress.py is the existing precedent for offline
credentialed tooling.

Open follow-ups (tracked in DESIGN.md, not blocking this PR)
1. Iterative-merge fixture (two chained compressions on one session)
2. Precise noise-floor measurement at N=10
3. Scripted scrubber helpers to lower the cost of fixture #4+
4. Judge model selection policy (pin vs. per-user)
2026-04-25 06:17:44 -07:00

probes/feature-impl-context-priority.probes.json

{
  "fixture": "feature-impl-context-priority",
  "description": "Probes for the .hermes.md / AGENTS.md / CLAUDE.md / .cursorrules priority feature session. Anchors are the concrete facts the next assistant would need to continue: user's priority order, files modified, helper-function structure, live-test scenarios, and PR number.",
  "probes": [
    {
      "id": "recall-priority-order",
      "type": "recall",
      "question": "What is the priority order the user asked for when multiple project-context files are present? List them from highest to lowest priority.",
      "expected_facts": [".hermes.md", "AGENTS.md", "CLAUDE.md", ".cursorrules", "highest to lowest"]
    },
    {
      "id": "recall-selection-mode",
      "type": "recall",
      "question": "When multiple context files exist in the same directory, does the agent now load all of them or pick only one?",
      "expected_facts": ["only one", "priority-based selection", "highest-priority winner"]
    },
    {
      "id": "artifact-files-modified",
      "type": "artifact",
      "question": "Which files in the hermes-agent repository were modified during this session? List them.",
      "expected_facts": [
        "agent/prompt_builder.py",
        "tests/agent/test_prompt_builder.py"
      ]
    },
    {
      "id": "artifact-helper-functions",
      "type": "artifact",
      "question": "The session introduced separate helper functions for each context-file type. What are their names?",
      "expected_facts": [
        "_load_hermes_md",
        "_load_agents_md",
        "_load_claude_md",
        "_load_cursorrules"
      ]
    },
    {
      "id": "artifact-test-scenarios",
      "type": "artifact",
      "question": "A scratch directory was created with scenario subdirectories to live-test the priority chain. Roughly how many scenarios, and what directory was it created under?",
      "expected_facts": ["10 scenarios", "/tmp/context-priority-test"]
    },
    {
      "id": "decision-claude-md-was-unsupported",
      "type": "decision",
      "question": "What was the finding about CLAUDE.md support in the existing loader before this session's changes?",
      "expected_facts": ["CLAUDE.md was not handled", "not supported", "new handler added"]
    },
    {
      "id": "decision-load-all-or-one",
      "type": "decision",
      "question": "Was the decision to load multiple context files when present, or to load only the highest-priority one? Explain the reasoning in one sentence.",
      "expected_facts": ["load only one", "highest priority", "user preference", "do not want to load multiple"]
    },
    {
      "id": "continuation-pr-number-and-status",
      "type": "continuation",
      "question": "A pull request was opened for this feature. What is the PR number and what is its merge status?",
      "expected_facts": ["PR #2301", "merged", "squash"]
    },
    {
      "id": "continuation-test-suite-result",
      "type": "continuation",
      "question": "What was the result of the full test suite run after the implementation changes?",
      "expected_facts": ["5680 passed", "0 failures", "clean"]
    },
    {
      "id": "continuation-next-step",
      "type": "continuation",
      "question": "If asked to pick up this session, what is the current state of main? Anything left to do?",
      "expected_facts": ["merged to main", "main is current", "nothing outstanding", "pulled"]
    }
  ]
}
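
A minimal sketch of loading and sanity-checking a bank in this format
(illustrative; the harness's own loader lives under
scripts/compression_eval/):

    import json
    from pathlib import Path

    PROBE_TYPES = {"recall", "artifact", "continuation", "decision"}

    def load_probe_bank(path: Path) -> dict:
        bank = json.loads(path.read_text())
        for probe in bank["probes"]:
            # Every probe needs a known type and at least one grading anchor.
            assert probe["type"] in PROBE_TYPES, probe["id"]
            assert probe["expected_facts"], f"{probe['id']}: no expected_facts"
        return bank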