mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-30 16:01:49 +08:00
feat: compression eval harness for agent/context_compressor.py

Ships a complete offline eval harness at scripts/compression_eval/. Runs a real conversation fixture through ContextCompressor.compress(), asks the compressor model to answer probe questions from the compressed state, then has a judge model score each answer 0-5 on six dimensions (accuracy, context_awareness, artifact_trail, completeness, continuity, instruction_following).

Methodology adapted from Factory's Dec 2025 write-up (https://factory.ai/news/evaluating-compression); the scoreboard framing is not adopted.

Motivation: we edit context_compressor.py prompts and _template_sections by hand and ship with no automated check that compression still preserves file paths, error codes, or the active task. Until now there has been no signal between 'test suite green' and 'a user hits a bad summary in production.'

What's shipped

- DESIGN.md: full architecture, fixture/probe format, scrubber pipeline, grading rubric, open follow-ups
- README.md: usage, cost expectations, when to run it
- scrub_fixtures.py: reproducible pipeline that converts real sessions from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures. Applies agent.redact.redact_sensitive_text + username path normalisation + personal handle scrubbing + email/git-author normalisation + reasoning scratchpad stripping + platform-mention scrubbing + first-user paraphrase + system-prompt placeholder + orphan-message pruning + 2KB tool-output truncation
- fixtures/: three scrubbed session snapshots covering three session shapes:
    feature-impl-context-priority (75 msgs / ~17k tokens)
    debug-session-feishu-id-model (59 msgs / ~13k tokens)
    config-build-competitive-scouts (61 msgs / ~23k tokens)
- probes/: three probe banks (10-11 probes each) covering all four types (recall/artifact/continuation/decision) with expected_facts anchors (PR numbers, file paths, error codes, commands)
- rubric.py: six-dimension grading rubric, judge-prompt builder, JSON-with-fallback response parser
- compressor_driver.py: thin wrapper around ContextCompressor for forced single-shot compression (fixtures are below the default 100k threshold so we force compress() to attribute score deltas to prompt changes, not threshold-fire variance)
- grader.py: two-phase continuation + grading calls via the OpenAI SDK directly against the resolved provider endpoint
- report.py: markdown report renderer (paste-ready for PR bodies), --compare-to delta mode, per-run JSON dumper
- run_eval.py: fire-style CLI (--fixtures, --runs, --judge-model, --compressor-model, --label, --focus-topic, --compare-to, --verbose)
- tests/scripts/test_compression_eval.py: 33 hermetic unit tests covering rubric parsing edge cases, judge-prompt building, report rendering, summariser medians, per-run JSON roundtrip, fixture and probe loading, and a PII smoke check on the checked-in fixtures

Non-LLM paths are covered by the 33-test suite that runs in CI. The LLM paths (continuation + grading) require credentials and real API calls, so they're exercised by running the eval itself, not by CI.

Validation

- 33/33 unit tests pass in 0.33s via scripts/run_tests.sh
- 50/50 adjacent tests (tests/agent/test_context_compressor.py) still pass, no regression introduced
- End-to-end dry run against debug-session-feishu-id-model with openai/gpt-5.4-mini via Nous Portal:
    Compression: 13081 -> 3055 tokens (76.6% ratio), 59 -> 10 messages
    Overall score: 3.25 (artifact_trail 1.50 is the weak spot, matching Factory's published observation)
    Specific probe misses surfaced with concrete judge notes

Noise floor (one empirical data point)

Same inputs re-run: overall 3.25 -> 3.17 (delta -0.08). Individual dimensions varied up to ±0.5 between two single-run medians. Confirms the DESIGN.md < 0.3 noise guidance is the right order of magnitude for single-run comparisons. Tighter noise measurement (N=10) is tracked as an open follow-up in DESIGN.md.

Why scripts/ and not tests/

Requires API credentials, costs ~$0.50-1.50 per run, minutes to execute, LLM-graded (non-deterministic). Incompatible with scripts/run_tests.sh which is hermetic, parallel, credential-free. scripts/sample_and_compress.py is the existing precedent for offline credentialed tooling.

Open follow-ups (tracked in DESIGN.md, not blocking this PR)

1. Iterative-merge fixture (two chained compressions on one session)
2. Precise noise-floor measurement at N=10
3. Scripted scrubber helpers to lower the cost of fixture #4+
4. Judge model selection policy (pin vs. per-user)
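
The dry-run numbers above can be rechecked with a few lines of arithmetic. This is a minimal sketch, not code from the harness: the function names are hypothetical, the "76.6% ratio" is read here as the fraction of tokens removed (which is what the quoted token counts work out to), and the 0.3 default is the DESIGN.md noise guidance quoted in the message.

```python
def reduction_ratio(tokens_before: int, tokens_after: int) -> float:
    """Fraction of tokens removed by compression (assumed meaning of 'ratio')."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 1 - tokens_after / tokens_before


def exceeds_noise_floor(baseline: float, candidate: float,
                        noise_floor: float = 0.3) -> bool:
    """True when a score change is at least as large as the assumed noise floor."""
    return abs(candidate - baseline) >= noise_floor


# Figures quoted in the commit message.
print(f"{reduction_ratio(13081, 3055):.1%}")   # 76.6%
print(round(3.17 - 3.25, 2))                   # -0.08
print(exceeds_noise_floor(3.25, 3.17))         # False: the re-run delta is inside the noise band
```

Under this reading, an overall delta would need to reach roughly 0.3 before a single-run comparison says anything about a prompt change.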
@@ -0,0 +1,96 @@
{
  "fixture": "config-build-competitive-scouts",
  "description": "Probes for the competitive-scout cron-job setup session. Anchors are which agents were configured, which day of the week each runs, and the full final schedule. This fixture most directly tests artifact-trail and iterative-merge because the job list grows by one per user turn.",
  "probes": [
    {
      "id": "recall-first-repo",
      "type": "recall",
      "question": "What was the first repository the user asked to create a scout cron for, and on what day of the week?",
      "expected_facts": ["openclaw", "Sunday"]
    },
    {
      "id": "recall-closed-source-target",
      "type": "recall",
      "question": "One of the scout targets does not have an open-source repository and had to be configured as a web scan instead. Which one, and on what day?",
      "expected_facts": ["claude code", "Friday", "web scan"]
    },
    {
      "id": "artifact-all-jobs",
      "type": "artifact",
      "question": "List every scout cron job created in this session.",
      "expected_facts": [
        "openclaw-pr-scout",
        "nanoclaw-pr-scout",
        "ironclaw-pr-scout",
        "kilocode-pr-scout",
        "codex-pr-scout",
        "gemini-cli-pr-scout",
        "cline-pr-scout",
        "opencode-pr-scout",
        "claude-code-scout",
        "aider-pr-scout",
        "roocode-pr-scout"
      ]
    },
    {
      "id": "artifact-final-schedule",
      "type": "artifact",
      "question": "What is the final weekly schedule? Give the day and the agents scanned on each day.",
      "expected_facts": [
        "Sun: openclaw, nanoclaw, ironclaw",
        "Mon: kilo code",
        "Tue: codex",
        "Wed: gemini cli, cline",
        "Thu: opencode",
        "Fri: claude code",
        "Sat: aider, roo"
      ]
    },
    {
      "id": "artifact-sunday-count",
      "type": "artifact",
      "question": "How many cron jobs run on Sunday?",
      "expected_facts": ["3", "three", "openclaw, nanoclaw, ironclaw"]
    },
    {
      "id": "artifact-total-count",
      "type": "artifact",
      "question": "How many scout cron jobs were created in total by the end of the session?",
      "expected_facts": ["11", "eleven"]
    },
    {
      "id": "decision-kilo-open-source",
      "type": "decision",
      "question": "The user asked whether Kilo Code is open source. What was the answer, and what did the user decide to do with it?",
      "expected_facts": [
        "yes, open source",
        "Kilo-Org/kilocode",
        "added as Monday scout"
      ]
    },
    {
      "id": "decision-saturday-fill",
      "type": "decision",
      "question": "Saturday was the last open day at one point. Which scout(s) were placed on Saturday, and why were those chosen?",
      "expected_facts": ["aider", "roo", "filled in last based on openrouter popularity / cli comparison rankings"]
    },
    {
      "id": "continuation-execution-time",
      "type": "continuation",
      "question": "At what local time of day do these scout cron jobs run?",
      "expected_facts": ["10 AM Pacific", "17:00 UTC", "0 17 * * *"]
    },
    {
      "id": "continuation-skill-used",
      "type": "continuation",
      "question": "Each scout job runs with a specific skill preloaded. Which one?",
      "expected_facts": ["hermes-agent-dev"]
    },
    {
      "id": "continuation-weekday-coverage",
      "type": "continuation",
      "question": "After the session ended, are there any weekdays still uncovered by a scout job?",
      "expected_facts": ["no", "all 7 days covered", "full week loaded"]
    }
  ]
}
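
The probe bank above follows a small, regular shape: a `fixture` name, a `probes` list, and four keys per probe. A structural check along the following lines could catch a malformed bank before any API money is spent. It is only a sketch: `validate_probe_bank` is a hypothetical name, not the loader the harness ships, and the rule that each `id` starts with its `type` is a convention observed in these files rather than a documented requirement.

```python
from typing import Any

PROBE_TYPES = {"recall", "artifact", "continuation", "decision"}
REQUIRED_KEYS = {"id", "type", "question", "expected_facts"}


def validate_probe_bank(bank: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the bank looks well-formed."""
    problems: list[str] = []
    if not isinstance(bank.get("fixture"), str) or not bank["fixture"]:
        problems.append("missing or empty 'fixture'")
    probes = bank.get("probes")
    if not isinstance(probes, list) or not probes:
        return problems + ["'probes' must be a non-empty list"]

    seen_ids: set[str] = set()
    for index, probe in enumerate(probes):
        if not isinstance(probe, dict):
            problems.append(f"probe[{index}]: not an object")
            continue
        missing = REQUIRED_KEYS - probe.keys()
        if missing:
            problems.append(f"probe[{index}]: missing keys {sorted(missing)}")
            continue
        probe_id, probe_type = str(probe["id"]), probe["type"]
        if probe_type not in PROBE_TYPES:
            problems.append(f"{probe_id}: unknown type {probe_type!r}")
        elif not probe_id.startswith(f"{probe_type}-"):
            # Observed convention only: ids in these banks are prefixed by type.
            problems.append(f"{probe_id}: id does not start with '{probe_type}-'")
        if probe_id in seen_ids:
            problems.append(f"{probe_id}: duplicate id")
        seen_ids.add(probe_id)
        facts = probe["expected_facts"]
        if not (isinstance(facts, list) and facts
                and all(isinstance(fact, str) and fact for fact in facts)):
            problems.append(f"{probe_id}: expected_facts must be a non-empty list of strings")
    return problems


sample = {
    "fixture": "config-build-competitive-scouts",
    "probes": [
        {"id": "artifact-sunday-count", "type": "artifact",
         "question": "How many cron jobs run on Sunday?",
         "expected_facts": ["3", "three", "openclaw, nanoclaw, ironclaw"]},
        {"id": "continuation-skill-used", "type": "continuation",
         "question": "Each scout job runs with a specific skill preloaded. Which one?",
         "expected_facts": ["hermes-agent-dev"]},
    ],
}
print(validate_probe_bank(sample))  # []
```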
@@ -0,0 +1,72 @@
{
  "fixture": "debug-session-feishu-id-model",
  "description": "Probes for the Feishu identity-model PR #8388 triage session. Anchors are the PR number, what the PR actually contained, what upstream docs confirmed, and the final decision + reasoning.",
  "probes": [
    {
      "id": "recall-pr-number",
      "type": "recall",
      "question": "What is the PR number under review in this session, and what repository is it against?",
      "expected_facts": ["PR #8388", "NousResearch/hermes-agent", "hermes-agent"]
    },
    {
      "id": "recall-bug-claim",
      "type": "recall",
      "question": "What is the core bug the PR claims to fix? Be specific about the identifier involved.",
      "expected_facts": ["open_id", "app-scoped", "not canonical", "Feishu identity model"]
    },
    {
      "id": "recall-upstream-confirmation",
      "type": "recall",
      "question": "Do upstream Feishu/Lark docs confirm that open_id is app-scoped rather than a canonical cross-app identity?",
      "expected_facts": ["yes", "confirmed", "open.feishu.cn", "same user has different Open IDs in different apps"]
    },
    {
      "id": "artifact-pr-scope",
      "type": "artifact",
      "question": "Roughly how large is PR #8388, and which gateway subsystems does it touch beyond the Feishu adapter?",
      "expected_facts": ["4647 lines", "gateway/run.py", "cron/scheduler.py", "gateway/config.py", "multi-account", "bind"]
    },
    {
      "id": "artifact-new-tool",
      "type": "artifact",
      "question": "Does the PR add a new tool file? If so, what is its path?",
      "expected_facts": ["tools/feishu_id_tool.py", "new file"]
    },
    {
      "id": "decision-pr-assessment",
      "type": "decision",
      "question": "What is the reviewer's overall assessment of PR #8388 — approve, reject, or something more nuanced? Explain in one sentence.",
      "expected_facts": [
        "core claim is correct",
        "scope is wrong",
        "bait-and-switch",
        "overbuilt",
        "implement cleaner ourselves"
      ]
    },
    {
      "id": "decision-core-claim-validity",
      "type": "decision",
      "question": "Setting aside the PR's size, is the underlying identity-model concern technically valid or not?",
      "expected_facts": ["technically valid", "correct", "open_id is app-scoped"]
    },
    {
      "id": "continuation-next-action",
      "type": "continuation",
      "question": "Based on the review outcome, what is the next action the agent has been asked to take regarding this PR?",
      "expected_facts": ["close the PR", "implement ourselves", "cleaner", "less complex"]
    },
    {
      "id": "continuation-implementation-scope",
      "type": "continuation",
      "question": "If implementing the Feishu fix cleanly ourselves, which specific behaviour needs to change — what should replace the current use of open_id?",
      "expected_facts": ["use union_id", "or user_id", "canonical identity", "cross-app stable ID"]
    },
    {
      "id": "continuation-sources-to-reference",
      "type": "continuation",
      "question": "Which upstream documentation sources were fetched during review that should be referenced when writing the clean implementation?",
      "expected_facts": ["open.feishu.cn", "open.larkoffice.com", "user-identity-introduction"]
    }
  ]
}
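
The `expected_facts` anchors in the bank above are short literal strings, so they lend themselves to a cheap deterministic pre-check alongside the judge model. The sketch below is an assumption for illustration and is not how grader.py or rubric.py score answers (the commit describes LLM-judged scoring); `fact_coverage` is a hypothetical helper. Several anchors are alternative phrasings of the same fact, so a missing anchor is a hint, not a failure.

```python
def fact_coverage(answer: str, expected_facts: list[str]) -> tuple[list[str], list[str]]:
    """Split expected facts into (found, missing) by case-insensitive substring match."""
    haystack = answer.casefold()
    found = [fact for fact in expected_facts if fact.casefold() in haystack]
    missing = [fact for fact in expected_facts if fact.casefold() not in haystack]
    return found, missing


# Anchors copied from the "continuation-next-action" probe; the answer is made up.
answer = "Close the PR and implement ourselves."
found, missing = fact_coverage(
    answer, ["close the PR", "implement ourselves", "cleaner", "less complex"]
)
print(found)    # ['close the PR', 'implement ourselves']
print(missing)  # ['cleaner', 'less complex']
```

Substring matching misses paraphrases entirely, which is presumably why the harness leans on a judge model for the actual scores.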
@@ -0,0 +1,74 @@
{
  "fixture": "feature-impl-context-priority",
  "description": "Probes for the .hermes.md / AGENTS.md / CLAUDE.md / .cursorrules priority feature session. Anchors are the concrete facts the next assistant would need to continue: user's priority order, files modified, helper-function structure, live-test scenarios, and PR number.",
  "probes": [
    {
      "id": "recall-priority-order",
      "type": "recall",
      "question": "What is the priority order the user asked for when multiple project-context files are present? List them from highest to lowest priority.",
      "expected_facts": [".hermes.md", "AGENTS.md", "CLAUDE.md", ".cursorrules", "highest to lowest"]
    },
    {
      "id": "recall-selection-mode",
      "type": "recall",
      "question": "When multiple context files exist in the same directory, does the agent now load all of them or pick only one?",
      "expected_facts": ["only one", "priority-based selection", "highest-priority winner"]
    },
    {
      "id": "artifact-files-modified",
      "type": "artifact",
      "question": "Which files in the hermes-agent repository were modified during this session? List them.",
      "expected_facts": [
        "agent/prompt_builder.py",
        "tests/agent/test_prompt_builder.py"
      ]
    },
    {
      "id": "artifact-helper-functions",
      "type": "artifact",
      "question": "The session introduced separate helper functions for each context-file type. What are their names?",
      "expected_facts": [
        "_load_hermes_md",
        "_load_agents_md",
        "_load_claude_md",
        "_load_cursorrules"
      ]
    },
    {
      "id": "artifact-test-scenarios",
      "type": "artifact",
      "question": "A scratch directory was created with scenario subdirectories to live-test the priority chain. Roughly how many scenarios, and what directory was it created under?",
      "expected_facts": ["10 scenarios", "/tmp/context-priority-test"]
    },
    {
      "id": "decision-claude-md-was-unsupported",
      "type": "decision",
      "question": "What was the finding about CLAUDE.md support in the existing loader before this session's changes?",
      "expected_facts": ["CLAUDE.md was not handled", "not supported", "new handler added"]
    },
    {
      "id": "decision-load-all-or-one",
      "type": "decision",
      "question": "Was the decision to load multiple context files when present, or to load only the highest-priority one? Explain the reasoning in one sentence.",
      "expected_facts": ["load only one", "highest priority", "user preference", "do not want to load multiple"]
    },
    {
      "id": "continuation-pr-number-and-status",
      "type": "continuation",
      "question": "A pull request was opened for this feature. What is the PR number and what is its merge status?",
      "expected_facts": ["PR #2301", "merged", "squash"]
    },
    {
      "id": "continuation-test-suite-result",
      "type": "continuation",
      "question": "What was the result of the full test suite run after the implementation changes?",
      "expected_facts": ["5680 passed", "0 failures", "clean"]
    },
    {
      "id": "continuation-next-step",
      "type": "continuation",
      "question": "If asked to pick up this session, what is the current state of main? Anything left to do?",
      "expected_facts": ["merged to main", "main is current", "nothing outstanding", "pulled"]
    }
  ]
}
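
The behaviour this last probe bank tests for (pick only the highest-priority context file, in the order `.hermes.md`, `AGENTS.md`, `CLAUDE.md`, `.cursorrules`) can be illustrated in a few lines. This is a sketch based on the probe anchors only; it is not the code in agent/prompt_builder.py, whose `_load_*` helpers are not shown in this commit, and `pick_context_file` is a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Highest to lowest priority, as listed in the "recall-priority-order" probe.
CONTEXT_FILE_PRIORITY = (".hermes.md", "AGENTS.md", "CLAUDE.md", ".cursorrules")


def pick_context_file(directory: Path) -> Optional[Path]:
    """Return the single highest-priority context file present, or None."""
    for name in CONTEXT_FILE_PRIORITY:
        candidate = directory / name
        if candidate.is_file():
            return candidate  # first hit wins; lower-priority files are ignored
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "CLAUDE.md").write_text("claude rules")
    (root / ".cursorrules").write_text("cursor rules")
    first_pick = pick_context_file(root).name
    (root / "AGENTS.md").write_text("agents rules")
    second_pick = pick_context_file(root).name

print(first_pick)   # CLAUDE.md
print(second_pick)  # AGENTS.md
```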