mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-01 00:11:39 +08:00
feat: compression eval harness for agent/context_compressor.py
Ships a complete offline eval harness at scripts/compression_eval/. Runs a real conversation fixture through ContextCompressor.compress(), asks the compressor model to answer probe questions from the compressed state, then has a judge model score each answer 0-5 on six dimensions (accuracy, context_awareness, artifact_trail, completeness, continuity, instruction_following). Methodology adapted from Factory's Dec 2025 write-up (https://factory.ai/news/evaluating-compression); the scoreboard framing is not adopted. Motivation: we edit context_compressor.py prompts and _template_sections by hand and ship with no automated check that compression still preserves file paths, error codes, or the active task. Until now there has been no signal between 'test suite green' and 'a user hits a bad summary in production.' What's shipped - DESIGN.md — full architecture, fixture/probe format, scrubber pipeline, grading rubric, open follow-ups - README.md — usage, cost expectations, when to run it - scrub_fixtures.py — reproducible pipeline that converts real sessions from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures. Applies agent.redact.redact_sensitive_text + username path normalisation + personal handle scrubbing + email/git-author normalisation + reasoning scratchpad stripping + platform-mention scrubbing + first-user paraphrase + system-prompt placeholder + orphan-message pruning + 2KB tool-output truncation - fixtures/ — three scrubbed session snapshots covering three session shapes: feature-impl-context-priority (75 msgs / ~17k tokens) debug-session-feishu-id-model (59 msgs / ~13k tokens) config-build-competitive-scouts (61 msgs / ~23k tokens) - probes/ — three probe banks (10-11 probes each) covering all four types (recall/artifact/continuation/decision) with expected_facts anchors (PR numbers, file paths, error codes, commands) - rubric.py — six-dimension grading rubric, judge-prompt builder, JSON-with-fallback response parser - compressor_driver.py — thin wrapper around ContextCompressor for forced single-shot compression (fixtures are below the default 100k threshold so we force compress() to attribute score deltas to prompt changes, not threshold-fire variance) - grader.py — two-phase continuation + grading calls via the OpenAI SDK directly against the resolved provider endpoint - report.py — markdown report renderer (paste-ready for PR bodies), --compare-to delta mode, per-run JSON dumper - run_eval.py — fire-style CLI (--fixtures, --runs, --judge-model, --compressor-model, --label, --focus-topic, --compare-to, --verbose) - tests/scripts/test_compression_eval.py — 33 hermetic unit tests covering rubric parsing edge cases, judge-prompt building, report rendering, summariser medians, per-run JSON roundtrip, fixture and probe loading, and a PII smoke check on the checked-in fixtures Non-LLM paths are covered by the 33-test suite that runs in CI. The LLM paths (continuation + grading) require credentials and real API calls, so they're exercised by running the eval itself — not by CI. 
Validation
- 33/33 unit tests pass in 0.33s via scripts/run_tests.sh
- 50/50 adjacent tests (tests/agent/test_context_compressor.py) still pass; no regression introduced
- End-to-end dry run against debug-session-feishu-id-model with openai/gpt-5.4-mini via Nous Portal:
  - Compression: 13081 -> 3055 tokens (76.6% reduction), 59 -> 10 messages
  - Overall score: 3.25 (artifact_trail 1.50 is the weak spot, matching Factory's published observation)
  - Specific probe misses surfaced with concrete judge notes

Noise floor (one empirical data point)
Re-running the same inputs moved the overall score from 3.25 to 3.17 (delta -0.08); individual dimensions varied up to ±0.5 between the two single-run medians. This confirms the DESIGN.md < 0.3 noise guidance is the right order of magnitude for single-run comparisons (see the sketch below for how to read such a delta). A tighter noise measurement (N=10) is tracked as an open follow-up in DESIGN.md.

Why scripts/ and not tests/
The eval requires API credentials, costs ~$0.50-1.50 per run, takes minutes to execute, and is LLM-graded (non-deterministic). That is incompatible with scripts/run_tests.sh, which is hermetic, parallel, and credential-free. scripts/sample_and_compress.py is the existing precedent for offline credentialed tooling.

Open follow-ups (tracked in DESIGN.md, not blocking this PR)
1. Iterative-merge fixture (two chained compressions on one session)
2. Precise noise-floor measurement at N=10
3. Scripted scrubber helpers to lower the cost of fixture #4+
4. Judge model selection policy (pin vs. per-user)
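To make the noise guidance concrete, a minimal sketch of how a single-run comparison should be read. This is an assumed interpretation, not code from report.py; the only non-placeholder numbers are the 3.25/3.17 re-run and the < 0.3 guidance quoted above.

```python
# Hedged sketch: DESIGN.md's < 0.3 guidance applies to the overall score of a
# single-run comparison. Individual dimensions were observed to vary up to
# ±0.5 between runs, so per-dimension deltas need more runs before they count.
OVERALL_NOISE_FLOOR = 0.3


def overall_delta_is_signal(baseline_overall: float, candidate_overall: float) -> bool:
    """True only when a single-run overall delta clears the noise floor."""
    return abs(candidate_overall - baseline_overall) >= OVERALL_NOISE_FLOOR


print(overall_delta_is_signal(3.25, 3.17))  # False: the -0.08 re-run delta is noise
```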
scripts/compression_eval/grader.py (new file, 181 lines)
@@ -0,0 +1,181 @@
"""Two-phase probe grading.

Phase 1 — **Continuation**: simulate the next assistant turn. Feed the
compressed message list plus the probe question and ask the continuing
model to answer using only the compressed context. This is exactly what
a real next-turn call would look like.

Phase 2 — **Grading**: a separate judge-model call scores the answer on
the six rubric dimensions using ``rubric.build_judge_prompt``.

Both phases use the OpenAI SDK directly against the resolved provider
endpoint, so the explicit api_key + base_url we pass always reaches the
wire. (``agent.auxiliary_client.call_llm`` is designed for task-tagged
auxiliary calls backed by config lookups; for eval we need the explicit
credentials to win unconditionally.)
"""
from __future__ import annotations

import logging
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional

_REPO_ROOT = Path(__file__).resolve().parents[2]
if str(_REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(_REPO_ROOT))

from openai import OpenAI  # noqa: E402

from rubric import build_judge_prompt, parse_judge_response  # noqa: E402

logger = logging.getLogger(__name__)


_CONTINUATION_SYSTEM = (
    "You are the continuing assistant in a long session. Earlier turns have "
    "been compacted into a handoff summary that is now part of the "
    "conversation history. The user has just asked you a question. "
    "Answer using ONLY what you can determine from the conversation history "
    "you see (including the handoff summary). Do NOT invent details. If the "
    "summary does not contain a specific fact, say so explicitly rather "
    "than guessing. Be direct and concrete — cite file paths, PR numbers, "
    "error codes, and exact values when they are present in the summary."
)


def answer_probe(
    *,
    compressed_messages: List[Dict[str, Any]],
    probe_question: str,
    model: str,
    provider: str,
    base_url: str,
    api_key: str,
    max_tokens: int = 1024,
    timeout: Optional[float] = 120.0,
) -> str:
    """Run the continuation call: what does the next assistant answer?

    Builds a messages list of [system_continuation, *compressed, probe_user]
    and asks the configured model. Returns the answer content as a string.
    """
    # Strip any pre-existing system message from the compressed list and
    # replace with our continuation system prompt. The fixture's generic
    # system is not the right frame for the continuation simulation.
    history = [m for m in compressed_messages if m.get("role") != "system"]
    messages = (
        [{"role": "system", "content": _CONTINUATION_SYSTEM}]
        + _sanitize_for_chat_api(history)
        + [{"role": "user", "content": probe_question}]
    )

    client = OpenAI(api_key=api_key, base_url=base_url, timeout=timeout)
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
    )
    content = response.choices[0].message.content
    if not isinstance(content, str):
        content = "" if content is None else str(content)
    return content.strip()


def grade_probe(
    *,
    probe_question: str,
    probe_type: str,
    expected_facts: List[str],
    assistant_answer: str,
    judge_model: str,
    judge_provider: str,
    judge_base_url: str,
    judge_api_key: str,
    max_tokens: int = 512,
    timeout: Optional[float] = 120.0,
) -> Dict[str, Any]:
    """Run the judge call and parse the six dimension scores.

    Returns dict {scores: {dim: int}, notes: str, overall: float,
    raw: str, parse_error: str|None}. On parse failure, scores are zeros
    and parse_error is populated — the caller decides whether to retry
    or accept.
    """
    prompt = build_judge_prompt(
        probe_question=probe_question,
        probe_type=probe_type,
        expected_facts=expected_facts,
        assistant_answer=assistant_answer,
    )
    client = OpenAI(api_key=judge_api_key, base_url=judge_base_url, timeout=timeout)
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    raw = response.choices[0].message.content or ""
    if not isinstance(raw, str):
        raw = str(raw)

    try:
        parsed = parse_judge_response(raw)
        parsed["raw"] = raw
        parsed["parse_error"] = None
        return parsed
    except ValueError as exc:
        logger.warning("Judge response parse failed: %s | raw=%r", exc, raw[:200])
        from rubric import DIMENSIONS
        return {
            "scores": {d: 0 for d in DIMENSIONS},
            "notes": "",
            "overall": 0.0,
            "raw": raw,
            "parse_error": str(exc),
        }


def _sanitize_for_chat_api(
    messages: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Drop tool_calls/tool pairs that are incomplete.

    A compressed message list may contain tool_call references whose matching
    ``tool`` result was summarized away, which breaks strict-validator
    providers (Anthropic, OpenAI). Easiest correct behaviour for the eval:
    strip tool_calls entirely and drop ``tool`` role messages — the
    continuation model only needs the summary + recent turns to answer the
    probe, not the precise tool-call bookkeeping.
    """
    clean: List[Dict[str, Any]] = []
    for m in messages:
        role = m.get("role")
        if role == "tool":
            # Convert tool result to a plain user note so the continuation
            # model still sees the content without needing the structured
            # tool_call_id pairing.
            content = m.get("content")
            if isinstance(content, list):
                content = "\n".join(
                    p.get("text", "") for p in content if isinstance(p, dict)
                )
            clean.append({
                "role": "user",
                "content": f"[earlier tool result]\n{content or ''}",
            })
            continue
        new = {"role": role, "content": m.get("content", "")}
        # Drop tool_calls — the downstream assistant message's content
        # still describes what the agent was doing.
        clean.append(new)
    # Collapse consecutive same-role turns into one (alternation rule)
    merged: List[Dict[str, Any]] = []
    for m in clean:
        if merged and merged[-1]["role"] == m["role"]:
            prev = merged[-1]
            prev_c = prev.get("content") or ""
            new_c = m.get("content") or ""
            prev["content"] = f"{prev_c}\n\n{new_c}" if prev_c else new_c
        else:
            merged.append(m)
    return merged