mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-29 15:31:38 +08:00
feat: compression eval harness for agent/context_compressor.py
Ships a complete offline eval harness at scripts/compression_eval/. It runs a real conversation fixture through ContextCompressor.compress(), asks the compressor model to answer probe questions from the compressed state, then has a judge model score each answer 0-5 on six dimensions (accuracy, context_awareness, artifact_trail, completeness, continuity, instruction_following). Methodology adapted from Factory's Dec 2025 write-up (https://factory.ai/news/evaluating-compression); the scoreboard framing is not adopted.

Motivation: we edit context_compressor.py prompts and _template_sections by hand and ship with no automated check that compression still preserves file paths, error codes, or the active task. Until now there has been no signal between "test suite green" and "a user hits a bad summary in production."

What's shipped

- DESIGN.md — full architecture, fixture/probe format, scrubber pipeline, grading rubric, open follow-ups
- README.md — usage, cost expectations, when to run it
- scrub_fixtures.py — reproducible pipeline that converts real sessions from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures. Applies agent.redact.redact_sensitive_text + username path normalisation + personal handle scrubbing + email/git-author normalisation + reasoning scratchpad stripping + platform-mention scrubbing + first-user paraphrase + system-prompt placeholder + orphan-message pruning + 2KB tool-output truncation
- fixtures/ — three scrubbed session snapshots covering three session shapes:
  - feature-impl-context-priority (75 msgs / ~17k tokens)
  - debug-session-feishu-id-model (59 msgs / ~13k tokens)
  - config-build-competitive-scouts (61 msgs / ~23k tokens)
- probes/ — three probe banks (10-11 probes each) covering all four probe types (recall/artifact/continuation/decision), with expected_facts anchors (PR numbers, file paths, error codes, commands)
- rubric.py — six-dimension grading rubric, judge-prompt builder, JSON-with-fallback response parser
- compressor_driver.py — thin wrapper around ContextCompressor for forced single-shot compression (the fixtures are below the default 100k threshold, so we force compress() directly; this attributes score deltas to prompt changes rather than threshold-fire variance)
- grader.py — two-phase continuation + grading calls via the OpenAI SDK, directly against the resolved provider endpoint
- report.py — markdown report renderer (paste-ready for PR bodies), --compare-to delta mode, per-run JSON dumper
- run_eval.py — fire-style CLI (--fixtures, --runs, --judge-model, --compressor-model, --label, --focus-topic, --compare-to, --verbose)
- tests/scripts/test_compression_eval.py — 33 hermetic unit tests covering rubric parsing edge cases, judge-prompt building, report rendering, summariser medians, per-run JSON roundtrip, fixture and probe loading, and a PII smoke check on the checked-in fixtures

Non-LLM paths are covered by the 33-test suite that runs in CI. The LLM paths (continuation + grading) require credentials and real API calls, so they are exercised by running the eval itself, not by CI.

Validation

- 33/33 unit tests pass in 0.33s via scripts/run_tests.sh
- 50/50 adjacent tests (tests/agent/test_context_compressor.py) still pass; no regression introduced
- End-to-end dry run against debug-session-feishu-id-model with openai/gpt-5.4-mini via Nous Portal:
  - Compression: 13081 -> 3055 tokens (76.6% reduction), 59 -> 10 messages
  - Overall score: 3.25, with artifact_trail at 1.50 as the weak spot (matching Factory's published observation)
  - Specific probe misses surfaced with concrete judge notes

Noise floor (one empirical data point)

Re-running the same inputs moved the overall score from 3.25 to 3.17 (delta -0.08); individual dimensions varied up to ±0.5 between the two single-run medians. This confirms the DESIGN.md "< 0.3 noise" guidance is the right order of magnitude for single-run comparisons. A tighter noise measurement (N=10) is tracked as an open follow-up in DESIGN.md.

Why scripts/ and not tests/

The eval requires API credentials, costs ~$0.50-1.50 per run, takes minutes to execute, and is LLM-graded (non-deterministic). That is incompatible with scripts/run_tests.sh, which is hermetic, parallel, and credential-free. scripts/sample_and_compress.py is the existing precedent for offline credentialed tooling.

Open follow-ups (tracked in DESIGN.md, not blocking this PR)

1. Iterative-merge fixture (two chained compressions on one session)
2. Precise noise-floor measurement at N=10
3. Scripted scrubber helpers to lower the cost of fixture #4+
4. Judge model selection policy (pin vs. per-user)
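The "summariser medians" the test suite exercises can be sketched roughly as follows. This is an illustrative reduction over hypothetical per-probe judge scores, not the actual report.py implementation; the function name, the three-dimension example data, and the mean-of-medians overall are all assumptions for the sketch:

```python
from statistics import median

# Hypothetical per-probe judge scores for one run: each dict maps a
# dimension name to an integer 0-5 (the shape parse_judge_response returns).
probe_scores = [
    {"accuracy": 4, "artifact_trail": 1, "continuity": 3},
    {"accuracy": 3, "artifact_trail": 2, "continuity": 4},
    {"accuracy": 5, "artifact_trail": 1, "continuity": 3},
]

def summarise(scores):
    """Per-dimension median across probes; overall as the mean of medians."""
    dims = scores[0].keys()
    per_dim = {d: median(s[d] for s in scores) for d in dims}
    overall = sum(per_dim.values()) / len(per_dim)
    return per_dim, overall

per_dim, overall = summarise(probe_scores)
# per_dim -> {"accuracy": 4, "artifact_trail": 1, "continuity": 3}
```

Medians rather than means keep a single outlier judge call from dragging a dimension, which matters when comparing two single runs whose noise floor is the ±0.5 observed above.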
scripts/compression_eval/rubric.py (new file, 198 lines)
@@ -0,0 +1,198 @@
"""Rubric for probe-based compression eval grading.

Six dimensions scored 0-5 by a judge model. The scoring anchors are spelled
out so the judge interpretation is stable across runs and across judge
models.

Adapted from the methodology in
https://factory.ai/news/evaluating-compression. Their scoreboard is not
adopted; only the dimension definitions and the 0-5 scale.
"""
from __future__ import annotations

from typing import Any, Dict, List

# Canonical dimension order. All reports, parsers, and comparisons derive
# from this list — do not hardcode the order elsewhere.
DIMENSIONS: List[str] = [
    "accuracy",
    "context_awareness",
    "artifact_trail",
    "completeness",
    "continuity",
    "instruction_following",
]

DIMENSION_DESCRIPTIONS: Dict[str, str] = {
    "accuracy": (
        "Are concrete facts correct — file paths, function names, PR/issue "
        "numbers, error codes, command outputs, line numbers? A single wrong "
        "path or error code should cost points. Vague but non-contradicting "
        "answers score mid-range."
    ),
    "context_awareness": (
        "Does the answer reflect the CURRENT state of the session, not a "
        "mid-session snapshot? For example, if a file was modified then "
        "reverted, does the answer describe the reverted state? If three "
        "PRs were opened, does the answer know which was merged?"
    ),
    "artifact_trail": (
        "Does the answer correctly enumerate the artifacts (files read, "
        "files modified, commands run, tools called, PRs opened, cron jobs "
        "created)? Missing artifacts cost more than extra unrelated ones."
    ),
    "completeness": (
        "Does the answer address ALL parts of the probe question? If the "
        "probe asks for three things and only two are answered, that is "
        "incomplete regardless of accuracy on the two."
    ),
    "continuity": (
        "Could the next assistant continue the work using only this answer, "
        "without having to re-fetch files or re-explore the codebase? An "
        "answer that lists files by name but doesn't mention the change is "
        "poor continuity even if accurate."
    ),
    "instruction_following": (
        "Is the answer in the format the probe requested (list, number, "
        "short phrase, yes/no)? Ignore tone and length, only assess "
        "whether the requested form was honoured."
    ),
}

SCORE_SCALE: Dict[int, str] = {
    0: "No useful information; wrong or hallucinated.",
    1: "Major gaps or a key fact is wrong.",
    2: "Partially correct but significant omissions.",
    3: "Mostly correct with minor omissions or imprecision.",
    4: "Correct and complete with only trivial imprecision.",
    5: "Fully correct, complete, and in the requested format.",
}


_RUBRIC_HEADER = """You are an evaluator grading a single answer produced by an AI assistant \
that was given a COMPRESSED handoff summary of an earlier conversation and \
asked a probe question. You are NOT evaluating the compression summary \
directly — you are evaluating whether the answer the assistant produced \
from that summary is correct, complete, and useful.

Grade on six dimensions, each 0-5:

{dimension_block}

0-5 scale:
{scale_block}

Grade strictly. Fractional scores are NOT allowed — output integers only. \
If the answer is ambiguous, use the lower of the two candidate scores."""


def build_judge_prompt(
    *,
    probe_question: str,
    probe_type: str,
    expected_facts: List[str],
    assistant_answer: str,
) -> str:
    """Build the full judge prompt for one (probe, answer) pair.

    The judge is told the expected_facts up front so grading is anchored to
    concrete signal rather than judge taste. Expected facts are intentionally
    NOT shown to the assistant that produces the answer.
    """
    dim_block = "\n".join(
        f"- {d}: {DIMENSION_DESCRIPTIONS[d]}" for d in DIMENSIONS
    )
    scale_block = "\n".join(
        f" {score}: {desc}" for score, desc in sorted(SCORE_SCALE.items())
    )
    header = _RUBRIC_HEADER.format(
        dimension_block=dim_block,
        scale_block=scale_block,
    )

    expected_block = (
        "\n".join(f"- {f}" for f in expected_facts) if expected_facts else "(none provided)"
    )

    output_schema = (
        "Respond with ONLY a JSON object, no prose before or after, matching "
        "this schema exactly:\n"
        "{\n"
        ' "accuracy": <int 0-5>,\n'
        ' "context_awareness": <int 0-5>,\n'
        ' "artifact_trail": <int 0-5>,\n'
        ' "completeness": <int 0-5>,\n'
        ' "continuity": <int 0-5>,\n'
        ' "instruction_following": <int 0-5>,\n'
        ' "notes": "<one short sentence, <=200 chars, identifying the '
        'single biggest issue with the answer if any>"\n'
        "}"
    )

    return (
        f"{header}\n\n"
        f"PROBE TYPE: {probe_type}\n\n"
        f"PROBE QUESTION:\n{probe_question}\n\n"
        f"EXPECTED FACTS (the answer should contain these concrete anchors; "
        f"missing any is a material defect in accuracy and/or completeness):\n"
        f"{expected_block}\n\n"
        f"ASSISTANT ANSWER TO GRADE:\n{assistant_answer}\n\n"
        f"{output_schema}"
    )


def parse_judge_response(raw: str) -> Dict[str, Any]:
    """Parse the judge model's JSON response into a score dict.

    Tolerates surrounding prose (judges ignore instructions sometimes) by
    extracting the first {...} block. Validates that every dimension is
    present as an integer 0-5.

    Returns dict with keys: scores (dim->int), notes (str), overall (float).
    Raises ValueError if the response cannot be parsed into a complete
    score set.
    """
    import json
    import re

    if not raw or not raw.strip():
        raise ValueError("empty judge response")

    # Strip code fences and any ```json prefix judges sometimes emit.
    stripped = raw.strip()
    fence_match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", stripped, re.DOTALL)
    if fence_match:
        stripped = fence_match.group(1).strip()

    # Extract the first {...} block greedy-to-matching-brace.
    brace_match = re.search(r"\{.*\}", stripped, re.DOTALL)
    if not brace_match:
        raise ValueError(f"no JSON object found in judge response: {raw[:200]!r}")
    candidate = brace_match.group(0)

    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge response not valid JSON: {exc}; raw={candidate[:200]!r}")

    scores: Dict[str, int] = {}
    for dim in DIMENSIONS:
        if dim not in parsed:
            raise ValueError(f"judge response missing dimension {dim!r}: {parsed}")
        value = parsed[dim]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"dimension {dim} is not numeric: {value!r}")
        int_val = int(round(value))
        if int_val < 0 or int_val > 5:
            raise ValueError(f"dimension {dim} out of range: {int_val}")
        scores[dim] = int_val

    notes_val = parsed.get("notes", "")
    notes = str(notes_val)[:200] if notes_val else ""

    overall = sum(scores.values()) / len(scores)
    return {
        "scores": scores,
        "notes": notes,
        "overall": overall,
    }
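The tolerant parsing strategy in parse_judge_response (unwrap an optional Markdown fence, then fall back to the first {...} span) can be exercised in isolation. Below is a minimal standalone sketch using the same two regexes on a simulated messy judge reply; the sample reply content is invented for illustration:

```python
import json
import re

# Simulate a judge reply wrapped in a ```json fence. chr(96) * 3 builds
# the backtick fence without embedding literal backticks in this snippet.
FENCE = chr(96) * 3
raw = FENCE + 'json\n{"accuracy": 4, "notes": "missed the PR number"}\n' + FENCE

# Step 1: unwrap an optional ``` / ```json fence.
stripped = raw.strip()
fence_match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", stripped, re.DOTALL)
if fence_match:
    stripped = fence_match.group(1).strip()

# Step 2: grab the first {...} span, tolerating any surrounding prose.
brace_match = re.search(r"\{.*\}", stripped, re.DOTALL)
parsed = json.loads(brace_match.group(0))
print(parsed["accuracy"])  # -> 4
```

Because the brace regex is greedy, a reply containing two JSON objects would capture everything between the first `{` and the last `}`; with a single well-formed object (the common case for this schema) that is exactly the object itself.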