mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-30 16:01:49 +08:00
Compare: gemini-cli...design/com, 1 commit (1e6285c53d)
.gitignore (vendored): 4 lines changed
@@ -52,6 +52,10 @@ ignored/
.worktrees/
environments/benchmarks/evals/

# Compression eval run outputs (harness lives in scripts/compression_eval/)
scripts/compression_eval/results/*
!scripts/compression_eval/results/.gitkeep

# Web UI build output
hermes_cli/web_dist/
scripts/compression_eval/DESIGN.md (new file, 377 lines)
@@ -0,0 +1,377 @@

# Compression Eval — Design

Status: proposal. Nothing under `scripts/compression_eval/` runs in CI.
This is an offline tool authors run before merging prompt or algorithm
changes to `agent/context_compressor.py`.

## Why

We tune the compressor prompt and the `_template_sections` checklist by
hand, ship, and wait for the next real session to notice regressions.
There is no automated check that a prompt edit still preserves file
paths, error messages, or the active task across a compression.

Factory.ai's December 2025 write-up
(https://factory.ai/news/evaluating-compression) describes a
probe-based eval that scores compressed state on six dimensions. The
methodology is the valuable part — the benchmarks in the post are a
marketing piece. We adopt the methodology and discard the scoreboard.

## Goal

Given a real session transcript and a bank of probe questions that
exercise what the transcript contained, answer:

1. After `ContextCompressor.compress()` runs, can the agent still
   answer each probe correctly from the compressed state?
2. Which of the six dimensions (accuracy, context awareness, artifact
   trail, completeness, continuity, instruction following) is the
   prompt weakest on?
3. Does a prompt change improve or regress any dimension vs. the
   previous run?

That is the full scope. No "compare against OpenAI and Anthropic"
benchmarking, no public scoreboard, no marketing claims.

## Non-goals

- Not a pytest. Requires API credentials, costs money, takes minutes
  per fixture, and output is LLM-graded and non-deterministic.
- Not part of `scripts/run_tests.sh`. Not invoked by CI.
- Not a replacement for the existing compressor unit tests in
  `tests/agent/test_context_compressor.py` — those stay as the
  structural / boundary / tool-pair-sanitization guard.
- Not a general trajectory eval. Scoped to context compaction only.

## Where it lives

```
scripts/compression_eval/
├── DESIGN.md             # this file
├── README.md             # how to run, cost expectations, caveats
├── run_eval.py           # entry point (fire CLI, like sample_and_compress.py)
├── scrub_fixtures.py     # regenerate fixtures from ~/.hermes/sessions/*.jsonl
├── fixtures/             # checked-in scrubbed session snapshots
│   ├── feature-impl-context-priority.json
│   ├── debug-session-feishu-id-model.json
│   └── config-build-competitive-scouts.json
├── probes/               # probe banks paired with fixtures
│   └── <fixture>.probes.json
├── rubric.py             # grading prompt + dimension definitions
├── grader.py             # judge-model call + score parsing
├── compressor_driver.py  # thin wrapper over ContextCompressor
└── results/              # gitignored; timestamped output per run
    └── .gitkeep
```

`scripts/` is the right home: offline tooling, no CI involvement,
precedent already set by `sample_and_compress.py`,
`contributor_audit.py`, `discord-voice-doctor.py`.

`environments/` is for Atropos RL training environments — wrong shape.
`tests/` is hermetic and credential-free — incompatible with a
probe-based eval that needs a judge model.

## Fixture format

A fixture is a single compressed-enough conversation captured from a
real session. Stored as JSON (pretty-printed, reviewable in PRs):

```json
{
  "name": "401-debug",
  "description": "178-turn session debugging a 401 on /api/auth/login",
  "model": "anthropic/claude-sonnet-4.6",
  "context_length": 200000,
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "...", "tool_calls": [...]},
    {"role": "tool", "tool_call_id": "...", "content": "..."}
  ],
  "notes": "Captured 2026-04-24 from session 20260424_*.jsonl; PII scrubbed; secrets redacted via redact_sensitive_text."
}
```
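
Loading a fixture of this shape and checking the invariants the harness relies on takes only a few lines. The sketch below is illustrative, not the shipped loader; the field names match the example above, but `load_fixture` and its validation rules are assumptions:

```python
import json
from pathlib import Path

# Keys every fixture must carry, per the example above.
REQUIRED_KEYS = {"name", "description", "model", "context_length", "messages"}
VALID_ROLES = {"system", "user", "assistant", "tool"}


def load_fixture(path: Path) -> dict:
    """Load a fixture JSON and fail loudly on a malformed one."""
    fixture = json.loads(path.read_text())
    missing = REQUIRED_KEYS - fixture.keys()
    if missing:
        raise ValueError(f"{path.name}: missing keys {sorted(missing)}")
    for i, msg in enumerate(fixture["messages"]):
        if msg.get("role") not in VALID_ROLES:
            raise ValueError(f"{path.name}: message {i} has role {msg.get('role')!r}")
    return fixture
```

A validator like this is cheap insurance: a fixture edited by hand during PII review is easy to break silently otherwise.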

### Sourcing fixtures

Fixtures are scrubbed snapshots of real sessions from the
maintainer's `~/.hermes/sessions/*.jsonl` store, generated
reproducibly by `scrub_fixtures.py` in this directory. Re-run the
scrubber with `python3 scripts/compression_eval/scrub_fixtures.py`
to regenerate them after a scrubber change.

Three shipped fixtures cover three different session shapes:

| Fixture | Source shape | Messages | Tokens (rough) | Tests |
|---|---|---|---|---|
| `feature-impl-context-priority` | investigate → patch → test → PR → merge | 75 | ~45k | continuation, artifact trail (2 files modified, 1 PR, ~16k skill_view in head) |
| `debug-session-feishu-id-model` | PR triage + upstream docs + decision | 59 | ~28k | recall (PR #, error shape), decision (outcome + reason), large PR diff blocks |
| `config-build-competitive-scouts` | iterative config: 11 cron jobs across 7 weekdays | 61 | ~26k | artifact trail (which jobs, which days), iterative-merge |

The ~26k-45k token range is below the default 50%-of-200k
compression threshold, so the eval will always **force** a
`compress()` call rather than wait for the natural trigger. That is
the intended shape — we want a controlled single-shot compression so
score deltas are attributable to the prompt change, not to whether
the threshold happened to fire at the same boundary twice.

### Scrubber pipeline

`scrub_fixtures.py` applies, per message:

1. `agent.redact.redact_sensitive_text` — API keys, tokens,
   connection strings
2. Username paths: `/home/teknium` → `/home/user`
3. Personal handles: all case variants of the maintainer name → `user`
4. Email addresses → `contributor@example.com`; git
   `Author: Name <addr>` header lines normalised
5. `<REASONING_SCRATCHPAD>...</REASONING_SCRATCHPAD>` and
   `<think>...</think>` stripped from assistant content
6. Messaging-platform user mentions (`<@123456>`, `<@***>`) →
   `<@user>`
7. First user message paraphrased to remove personal voice;
   subsequent user turns kept verbatim after the redactions above
8. System prompt replaced with a generic public-safe placeholder so
   we don't check in the maintainer's tuned soul/skills/memory system
   block
9. Orphan empty-assistant messages (artifact of scratchpad-only
   turns) and trailing tool messages with no matching assistant are
   dropped
10. Tool outputs preserved verbatim. An earlier iteration truncated
    >2KB tool bodies to keep fixture JSON small, but that defeats
    the purpose: real sessions have 30KB `skill_view` dumps, 10KB
    `read_file` outputs, 5KB `web_extract` bodies — compression has
    to handle them. Truncation is now a no-op; the pipeline note
    remains in `scrubbing_passes` for audit trail clarity.

Before every fixture PR: grep the fixture for PII patterns. An audit
checklist is embedded at the bottom of the scrubber as comments.

**Fixtures must stay small.** Target <200 KB per fixture, <500 KB
total for the directory. Current total: ~410 KB across three
fixtures. Larger sessions are truncated with a
`truncated_to: <index>` field in the fixture header so the cut is
reviewable.
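
The mechanical passes above (2, 4, 5, 6) are plain string/regex substitutions. A minimal sketch, assuming the patterns listed; `scrub_text` and the exact regexes are illustrative, not the actual `scrub_fixtures.py` API:

```python
import re


def scrub_text(text: str) -> str:
    """Apply the order-insensitive redaction passes from the list above."""
    # Pass 2: username paths
    text = text.replace("/home/teknium", "/home/user")
    # Pass 4: email addresses
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "contributor@example.com", text)
    # Pass 5: reasoning blocks stripped from assistant content
    text = re.sub(r"<REASONING_SCRATCHPAD>.*?</REASONING_SCRATCHPAD>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Pass 6: messaging-platform mentions
    text = re.sub(r"<@[^>]+>", "<@user>", text)
    return text
```

Passes 7-9 (paraphrasing, system-prompt replacement, orphan-message dropping) operate on whole messages rather than text and are not regex-shaped, which is why the scrubber stays a reviewed script rather than a one-liner.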

## Probe format

One probe file per fixture, so reviewers can see the question bank
evolve alongside the fixture:

```json
{
  "fixture": "401-debug",
  "probes": [
    {
      "id": "recall-error-code",
      "type": "recall",
      "question": "What was the original error code and endpoint?",
      "expected_facts": ["401", "/api/auth/login"]
    },
    {
      "id": "artifact-files-modified",
      "type": "artifact",
      "question": "Which files have been modified in this session?",
      "expected_facts": ["session_store.py", "redis_client.py"]
    },
    {
      "id": "continuation-next-step",
      "type": "continuation",
      "question": "What should we do next?",
      "expected_facts": ["re-run the integration tests", "restart the worker"]
    },
    {
      "id": "decision-redis-approach",
      "type": "decision",
      "question": "What did we decide about the Redis issue?",
      "expected_facts": ["switch to redis-py 5.x", "pooled connection"]
    }
  ]
}
```

The four probe types come directly from Factory's methodology:
**recall, artifact, continuation, decision**. `expected_facts` gives
the grader concrete anchors instead of relying purely on LLM taste.

Authoring a probe bank is a one-time cost per fixture. 8-12 probes per
fixture is the target — enough to cover all four types, few enough to
grade in under a minute at reasonable cost.

## Grading

Each probe gets scored 0-5 on **six dimensions** (Factory's six):

| Dimension | What it measures |
|-----------------------|-----------------------------------------------------|
| accuracy | File paths, function names, error codes are correct |
| context_awareness | Reflects current state, not a mid-session snapshot |
| artifact_trail | Knows which files were read / modified / created |
| completeness | Addresses all parts of the probe |
| continuity | Agent can continue without re-fetching |
| instruction_following | Probe answered in the requested form |

Grading is done by a single judge-model call per probe with a
deterministic rubric prompt (see `rubric.py`). The rubric includes the
`expected_facts` list so the judge has a concrete anchor. Default
judge model: whatever the user has configured as their main model at
run time (same resolution path as `auxiliary_client.call_llm`). A
`--judge-model` flag allows overriding for consistency across runs.

Non-determinism caveat: two runs of the same fixture will produce
different scores. A single run means nothing. Report medians over
N=3 runs by default, and require an improvement of >=0.3 on any
dimension before claiming a prompt change is a win.
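
The median-and-threshold policy above is simple enough to sketch. Function names are hypothetical (the real aggregation lives in `report.py`), and the no-regression clause in `is_win` is one plausible reading of the policy, not a statement of what the shipped code does:

```python
from statistics import median

DIMENSIONS = ["accuracy", "context_awareness", "artifact_trail",
              "completeness", "continuity", "instruction_following"]


def median_scores(runs: list[dict]) -> dict:
    """Collapse per-run dimension scores into per-dimension medians."""
    return {dim: median(run[dim] for run in runs) for dim in DIMENSIONS}


def is_win(baseline: dict, candidate: dict, threshold: float = 0.3) -> bool:
    """A prompt change counts as a win only if some dimension improves by
    at least the noise threshold and none regresses past it."""
    deltas = [candidate[d] - baseline[d] for d in DIMENSIONS]
    return max(deltas) >= threshold and min(deltas) > -threshold
```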

## Run flow

```
python scripts/compression_eval/run_eval.py [OPTIONS]
```

Options (fire-style, mirroring `sample_and_compress.py`):

| Flag | Default | Purpose |
|------------------------|------------|-------------------------------------------|
| `--fixtures` | all | Comma-separated fixture names |
| `--runs` | 3 | Runs per fixture (for median) |
| `--judge-model` | auto | Override judge model |
| `--compressor-model` | auto | Override model used *inside* the compressor |
| `--label` | timestamp | Subdirectory under `results/` |
| `--focus-topic` | none | Pass-through to `compress(focus_topic=)` |
| `--compare-to` | none | Path to a previous run for diff output |

Steps per fixture per run:

1. Load fixture JSON and probe bank.
2. Construct a `ContextCompressor` against the fixture's model.
3. Call `compressor.compress(messages)` — capture the compressed
   message list.
4. For each probe: ask the judge model to role-play as the continuing
   agent with only the compressed state, then grade the answer on the
   six dimensions using `rubric.py`.
5. Write a per-run JSON to `results/<label>/<fixture>-run-N.json`.
6. After all runs, emit a markdown summary to
   `results/<label>/report.md`.
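
The steps above reduce to a small nested loop. A sketch under stated assumptions: `compress`, `answer`, and `grade` stand in for the `compressor_driver` and `grader` entry points, and `eval_fixture` itself is illustrative, not the shipped `run_eval.py` API:

```python
import json
from pathlib import Path


def eval_fixture(fixture: dict, probes: list[dict], runs: int,
                 compress, answer, grade, out_dir: Path) -> None:
    """Steps 1-5 above for one fixture across N runs."""
    for n in range(1, runs + 1):
        compressed = compress(fixture["messages"])          # steps 2-3
        results = []
        for probe in probes:                                # step 4
            reply = answer(compressed, probe["question"])
            results.append(grade(probe, reply))
        out = out_dir / f"{fixture['name']}-run-{n}.json"   # step 5
        out.write_text(json.dumps(results, indent=2))
```

Note the compression is re-run per run, not cached: the compressor call is itself a nondeterministic LLM call, so it belongs inside the loop being measured.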

## Report format

Pasted verbatim into PR descriptions that touch the compressor:

```
## Compression eval — label 2026-04-25_13-40-02

Main model: anthropic/claude-sonnet-4.6   Judge: same
3 runs per fixture, medians reported.

| Fixture | Accuracy | Context | Artifact | Complete | Continuity | Instruction | Overall |
|----------------|----------|---------|----------|----------|------------|-------------|---------|
| 401-debug | 4.1 | 4.0 | 2.5 | 4.3 | 3.8 | 5.0 | 3.95 |
| pr-review | 3.9 | 3.8 | 3.1 | 4.2 | 3.9 | 5.0 | 3.98 |
| feature-impl | 4.0 | 3.9 | 2.9 | 4.1 | 4.0 | 5.0 | 3.98 |

Per-probe misses (score < 3.0):
- 401-debug / artifact-files-modified: 1.7 — summary dropped redis_client.py
- pr-review / decision-auth-rewrite: 2.3 — outcome captured, reasoning dropped
```
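
The `--compare-to` mode mentioned under Run flow renders a delta column against a previous run's medians, which is a per-dimension subtraction. A sketch; `delta_row` is illustrative (the real rendering lives in `report.py`):

```python
def delta_row(baseline: dict, current: dict) -> dict:
    """Signed per-dimension deltas, formatted with an explicit sign so a
    reviewer can scan the column for regressions."""
    return {dim: f"{current[dim] - baseline[dim]:+.2f}" for dim in baseline}
```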

## Cost expectations

Dominated by the judge calls: 3 fixtures × 10 probes × 3 runs =
90 judge calls per eval run. On Claude Sonnet 4.6 that is roughly
$0.50-$1.50 per full eval depending on probe length. The compressor
itself makes 1 call per fixture × 3 runs = 9 additional calls.

**This is not a check to run after every commit.** It is a
before-merge check for PRs that touch:

- `agent/context_compressor.py` — any change to `_template_sections`,
  `_generate_summary`, or `compress()`.
- `agent/auxiliary_client.py` — when changing how compression tasks
  are routed.
- `agent/prompt_builder.py` — when the compression-note phrasing
  changes.

## Open questions (to resolve before implementing)

1. **Fixture scrubbing: manual or scripted?** A scripted scrub that
   also replaces project names / hostnames would lower the cost of
   contributing a new fixture. Risk: over-aggressive replacement
   destroys the signal the probe depends on. Propose: start manual,
   add scripted helpers once we have 3 fixtures and know the common
   PII shapes.

2. **Judge model selection.** Factory uses GPT-5.2. We can't pin one
   — the user's main model changes. Options: (a) grade with the main
   model (cheap, but inconsistent across users); (b) require a
   specific judge model (e.g. `claude-sonnet-4.6`), which excludes
   users without access. Propose (a) with a `--judge-model` override,
   and make the model name prominent in the report so comparisons
   across machines are legible.

3. **Noise floor.** Before landing prompt changes, run the current
   prompt N=10 times to measure per-dimension stddev. That tells us
   the minimum delta to call a change significant. Suspect 0.2-0.3 on
   a 0-5 scale. Decision deferred until after the first fixture is
   landed.

4. **Iterative-merge coverage.** The real Factory-vs-Anthropic
   difference is incremental merge vs. regenerate. A fixture that
   only compresses once doesn't exercise our iterative path. Add a
   fourth fixture that forces two compressions (manually chained),
   with probes that test whether information from the first
   compression survives the second. Deferred to a follow-up PR.

## Implementation status

This PR ships the full eval end-to-end:

- `scrub_fixtures.py` — reproducible scrubber
- `fixtures/` — three scrubbed session fixtures
- `probes/` — three probe banks (10-11 probes each, all four types)
- `rubric.py` — six-dimension grading rubric + judge-prompt builder + response parser
- `compressor_driver.py` — thin wrapper around `ContextCompressor` for forced single-shot compression
- `grader.py` — two-phase continuation + grading calls via OpenAI SDK
- `report.py` — markdown report renderer + `--compare-to` delta mode + per-run JSON dumper
- `run_eval.py` — entry point (`fire`-style CLI)
- `tests/scripts/test_compression_eval.py` — 33 unit tests covering rubric parsing, report rendering, fixture/probe loading, and a PII smoke test on the fixtures (LLM paths not tested — they require credentials and are exercised by the eval itself)

### Noise floor — one empirical data point

A single same-inputs re-run of `debug-session-feishu-id-model`
(compressor + judge = `openai/gpt-5.4-mini` via Nous Portal,
runs=1) produced:

- Run A overall: 3.25
- Run B overall: 3.17 (delta -0.08)

Individual dimensions varied by up to ±0.5 between the two runs on
single-run medians. This confirms DESIGN.md's "< 0.3 is noise"
guidance is the right order of magnitude for a single-run
comparison. With the `runs=3` default, per-dimension variance should
tighten; a noise-floor measurement at N=10 is still a useful
follow-up to calibrate precisely.

## Open follow-ups (not blocking this PR)

1. **Iterative-merge fixture** — our actual compression win over
   "regenerate from scratch" approaches is only exercised when
   `_previous_summary` is re-used on a second compression. None of
   the three shipped fixtures force two compressions. The natural
   basis is `config-build-competitive-scouts` (already iterative by
   shape); splitting it at the Monday/Tuesday boundary would force
   the second compression to merge rather than regenerate.
2. **Noise-floor precision** — run the current prompt N=10 times
   against one fixture to pin down per-dimension stddev and publish
   the numbers in the README.
3. **Scripted scrubber helpers** — the current scrubber is manual
   per-fixture. A helper that identifies candidate sessions to
   scrub (by shape or by keyword) would lower the cost of adding
   fixture #4+.
4. **Judge model selection policy** — current code uses whatever
   the user passes as `--judge-model` (default: same as compressor).
   Pinning the judge across users would stabilise cross-machine
   comparisons, at the cost of blocking users without access to
   the pinned model.
scripts/compression_eval/README.md (new file, 110 lines)
@@ -0,0 +1,110 @@

# compression_eval

Offline eval harness for `agent/context_compressor.py`. Runs a real
conversation transcript through the compressor, then probes the
compressed state with targeted questions graded on six dimensions.

## When to run

Before merging changes to:

- `agent/context_compressor.py` — any change to `_template_sections`,
  `_generate_summary`, `compress()`, or its boundary logic
- `agent/auxiliary_client.py` — when changing how compression tasks
  are routed
- `agent/prompt_builder.py` — when the compression-note phrasing
  changes

## Not for CI

This harness makes real model calls (compressor + continuation +
judge = ~3 calls per probe × probes per fixture × runs). It costs
~$0.50 to ~$1.50 per full run depending on models, takes minutes,
and is LLM-graded (non-deterministic). It lives in `scripts/` and is
invoked by hand. `tests/` and `scripts/run_tests.sh` do not touch it.

`tests/scripts/test_compression_eval.py` covers the non-LLM code
paths (rubric parsing, report rendering, fixture/probe loading, PII
smoke check on the checked-in fixtures) and DOES run in CI.

## Usage

```bash
# Run all three fixtures, 3 runs each, with your configured provider
python3 scripts/compression_eval/run_eval.py

# Faster iteration — one fixture, one run
python3 scripts/compression_eval/run_eval.py \
  --fixtures=debug-session-feishu-id-model --runs=1

# Pin a cheap model for both compression + judge (recommended)
python3 scripts/compression_eval/run_eval.py \
  --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
  --judge-provider=nous --judge-model=openai/gpt-5.4-mini \
  --runs=3 --label=baseline

# After editing context_compressor.py, rerun with a new label and diff
python3 scripts/compression_eval/run_eval.py \
  --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
  --judge-provider=nous --judge-model=openai/gpt-5.4-mini \
  --runs=3 --label=my-prompt-tweak \
  --compare-to=results/baseline
```

Results land in `results/<label>/report.md` and are intended to be
pasted verbatim into PR descriptions. `--compare-to` renders a delta
column per dimension so reviewers can see "did this actually help?"
at a glance.

Rule of thumb: dimension deltas below ±0.3 are within run-to-run
noise at `runs=3`. Publish a bigger N if you want tighter bounds.

## Fixtures

Three scrubbed session snapshots live under `fixtures/`:

- `feature-impl-context-priority.json` — 75 msgs, investigate →
  patch → test → PR → merge
- `debug-session-feishu-id-model.json` — 59 msgs, PR triage +
  upstream docs + decision
- `config-build-competitive-scouts.json` — 61 msgs, iterative
  config accumulation (11 cron jobs)

Regenerate them from the maintainer's `~/.hermes/sessions/*.jsonl`
with `python3 scripts/compression_eval/scrub_fixtures.py`. The
scrubber pipeline and PII-audit checklist are documented in
`DESIGN.md` under **Scrubber pipeline**.

## Probes

One probe bank per fixture under `probes/`, 10-11 probes each,
covering all four types: **recall**, **artifact**, **continuation**,
**decision**. Each probe carries an `expected_facts` list of concrete
anchors (PR numbers, file paths, error codes, commands run) that the
judge sees alongside the assistant's answer.

## How it scores

Six dimensions, 0-5 per probe:

| Dimension | What it measures |
|-----------------------|------------------------------------------------------|
| accuracy | File paths, function names, PR/issue numbers correct |
| context_awareness | Reflects current session state, not a snapshot |
| artifact_trail | Correctly enumerates files / commands / PRs |
| completeness | Addresses ALL parts of the probe |
| continuity | Next assistant could continue without re-fetching |
| instruction_following | Answer in the requested form |

The report renders medians across N runs; probes scoring below 3.0
overall surface in a separate section with the judge's specific
complaint noted inline.

## Related

- `agent/context_compressor.py` — the thing under test
- `tests/agent/test_context_compressor.py` — structural unit tests
  that do run in CI
- `scripts/sample_and_compress.py` — the closest existing script in
  shape (offline, credential-requiring, not in CI)
- `DESIGN.md` — full architecture + methodology + open follow-ups
scripts/compression_eval/compressor_driver.py (new file, 114 lines)
@@ -0,0 +1,114 @@
"""Wraps ContextCompressor to run a single forced compression on a fixture.

The real agent loop checks ``should_compress()`` before calling ``compress()``.
Fixtures are intentionally sized below the 100k threshold so ``compress()``
runs in a controlled, single-shot mode — score deltas attribute to the
prompt change, not to whether the threshold happened to fire at the same
boundary twice.

Resolves the provider for the compression call via the same path the real
agent uses (``hermes_cli.runtime_provider.resolve_runtime_provider``) so
behaviour matches production aside from being a single call.
"""
from __future__ import annotations

import sys
from pathlib import Path
from typing import Any, Dict, List, Optional

# Make sibling imports work whether invoked as a script or as a module.
_REPO_ROOT = Path(__file__).resolve().parents[2]
if str(_REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(_REPO_ROOT))

from agent.context_compressor import (  # noqa: E402
    ContextCompressor,
    estimate_messages_tokens_rough,
)


def run_compression(
    *,
    messages: List[Dict[str, Any]],
    compressor_model: str,
    compressor_provider: str,
    compressor_base_url: str,
    compressor_api_key: str,
    compressor_api_mode: str,
    context_length: int,
    focus_topic: Optional[str] = None,
    summary_model_override: Optional[str] = None,
) -> Dict[str, Any]:
    """Run a single forced compression pass over the fixture messages.

    Returns a dict with:
    - compressed_messages: the post-compression message list
    - summary_text: the summary produced (extracted from the compressed head)
    - pre_tokens, post_tokens: rough token counts before/after
    - compression_ratio: 1 - (post/pre)
    - pre_message_count, post_message_count
    """
    compressor = ContextCompressor(
        model=compressor_model,
        threshold_percent=0.50,
        protect_first_n=3,
        protect_last_n=20,
        summary_target_ratio=0.20,
        quiet_mode=True,
        summary_model_override=summary_model_override or "",
        base_url=compressor_base_url,
        api_key=compressor_api_key,
        config_context_length=context_length,
        provider=compressor_provider,
        api_mode=compressor_api_mode,
    )

    pre_tokens = estimate_messages_tokens_rough(messages)
    compressed = compressor.compress(
        messages,
        current_tokens=pre_tokens,
        focus_topic=focus_topic,
    )
    post_tokens = estimate_messages_tokens_rough(compressed)

    summary_text = _extract_summary_from_messages(compressed)

    ratio = (1.0 - (post_tokens / pre_tokens)) if pre_tokens > 0 else 0.0

    return {
        "compressed_messages": compressed,
        "summary_text": summary_text,
        "pre_tokens": pre_tokens,
        "post_tokens": post_tokens,
        "compression_ratio": ratio,
        "pre_message_count": len(messages),
        "post_message_count": len(compressed),
    }


_SUMMARY_MARKERS = (
    "## Active Task",
    "## Goal",
    "## Completed Actions",
)


def _extract_summary_from_messages(messages: List[Dict[str, Any]]) -> str:
    """Find the structured summary block inside the compressed message list.

    The compressor injects the summary as a user (or system-appended) message
    near the head. We look for the section-header markers from
    ``_template_sections`` in ``agent/context_compressor.py``.
    """
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, str):
            if isinstance(content, list):
                content = "\n".join(
                    p.get("text", "") for p in content if isinstance(p, dict)
                )
            else:
                continue
        if any(marker in content for marker in _SUMMARY_MARKERS):
            return content
    return ""
File diff suppressed because one or more lines are too long
scripts/compression_eval/grader.py (new file, 181 lines)
@@ -0,0 +1,181 @@
"""Two-phase probe grading.

Phase 1 — **Continuation**: simulate the next assistant turn. Feed the
compressed message list plus the probe question and ask the continuing
model to answer using only the compressed context. This is exactly what
a real next-turn call would look like.

Phase 2 — **Grading**: a separate judge-model call scores the answer on
the six rubric dimensions using ``rubric.build_judge_prompt``.

Both phases use the OpenAI SDK directly against the resolved provider
endpoint, so the explicit api_key + base_url we pass always reaches the
wire. (``agent.auxiliary_client.call_llm`` is designed for task-tagged
auxiliary calls backed by config lookups; for eval we need the explicit
credentials to win unconditionally.)
"""
from __future__ import annotations

import logging
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional

_REPO_ROOT = Path(__file__).resolve().parents[2]
if str(_REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(_REPO_ROOT))

from openai import OpenAI  # noqa: E402

from rubric import build_judge_prompt, parse_judge_response  # noqa: E402

logger = logging.getLogger(__name__)


_CONTINUATION_SYSTEM = (
    "You are the continuing assistant in a long session. Earlier turns have "
    "been compacted into a handoff summary that is now part of the "
    "conversation history. The user has just asked you a question. "
    "Answer using ONLY what you can determine from the conversation history "
    "you see (including the handoff summary). Do NOT invent details. If the "
    "summary does not contain a specific fact, say so explicitly rather "
    "than guessing. Be direct and concrete — cite file paths, PR numbers, "
    "error codes, and exact values when they are present in the summary."
)


def answer_probe(
    *,
    compressed_messages: List[Dict[str, Any]],
    probe_question: str,
    model: str,
    provider: str,
    base_url: str,
    api_key: str,
    max_tokens: int = 1024,
    timeout: Optional[float] = 120.0,
) -> str:
    """Run the continuation call: what does the next assistant answer?

    Builds a messages list of [system_continuation, *compressed, probe_user]
    and asks the configured model. Returns the answer content as a string.
    """
    # Strip any pre-existing system message from the compressed list and
    # replace with our continuation system prompt. The fixture's generic
    # system is not the right frame for the continuation simulation.
    history = [m for m in compressed_messages if m.get("role") != "system"]
    messages = (
        [{"role": "system", "content": _CONTINUATION_SYSTEM}]
        + _sanitize_for_chat_api(history)
        + [{"role": "user", "content": probe_question}]
    )

    client = OpenAI(api_key=api_key, base_url=base_url, timeout=timeout)
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
    )
    content = response.choices[0].message.content
    if not isinstance(content, str):
        content = "" if content is None else str(content)
    return content.strip()


def grade_probe(
    *,
    probe_question: str,
    probe_type: str,
    expected_facts: List[str],
    assistant_answer: str,
    judge_model: str,
    judge_provider: str,
    judge_base_url: str,
    judge_api_key: str,
    max_tokens: int = 512,
    timeout: Optional[float] = 120.0,
) -> Dict[str, Any]:
    """Run the judge call and parse the six dimension scores.

    Returns dict {scores: {dim: int}, notes: str, overall: float,
    raw: str, parse_error: str|None}. On parse failure, scores are zeros
    and parse_error is populated — the caller decides whether to retry
    or accept.
    """
    prompt = build_judge_prompt(
        probe_question=probe_question,
        probe_type=probe_type,
        expected_facts=expected_facts,
        assistant_answer=assistant_answer,
    )
    client = OpenAI(api_key=judge_api_key, base_url=judge_base_url, timeout=timeout)
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    raw = response.choices[0].message.content or ""
    if not isinstance(raw, str):
        raw = str(raw)

    try:
        parsed = parse_judge_response(raw)
        parsed["raw"] = raw
        parsed["parse_error"] = None
        return parsed
    except ValueError as exc:
        logger.warning("Judge response parse failed: %s | raw=%r", exc, raw[:200])
        from rubric import DIMENSIONS
        return {
            "scores": {d: 0 for d in DIMENSIONS},
            "notes": "",
            "overall": 0.0,
            "raw": raw,
            "parse_error": str(exc),
        }


def _sanitize_for_chat_api(
    messages: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Drop tool_calls/tool pairs that are incomplete.

    A compressed message list may contain tool_call references whose matching
    ``tool`` result was summarized away, which breaks strict-validator
    providers (Anthropic, OpenAI). Easiest correct behaviour for the eval:
    strip tool_calls entirely and drop ``tool`` role messages — the
    continuation model only needs the summary + recent turns to answer the
    probe, not the precise tool-call bookkeeping.
    """
    clean: List[Dict[str, Any]] = []
    for m in messages:
        role = m.get("role")
        if role == "tool":
            # Convert tool result to a plain user note so the continuation
            # model still sees the content without needing the structured
            # tool_call_id pairing.
            content = m.get("content")
            if isinstance(content, list):
                content = "\n".join(
                    p.get("text", "") for p in content if isinstance(p, dict)
                )
            clean.append({
                "role": "user",
                "content": f"[earlier tool result]\n{content or ''}",
            })
            continue
        new = {"role": role, "content": m.get("content", "")}
        # Drop tool_calls — the downstream assistant message's content
        # still describes what the agent was doing.
        clean.append(new)
    # Collapse consecutive same-role turns into one (alternation rule)
    merged: List[Dict[str, Any]] = []
    for m in clean:
        if merged and merged[-1]["role"] == m["role"]:
            prev = merged[-1]
            prev_c = prev.get("content") or ""
            new_c = m.get("content") or ""
            prev["content"] = f"{prev_c}\n\n{new_c}" if prev_c else new_c
        else:
            merged.append(m)
    return merged
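The tool-stripping and role-merging behaviour of the sanitizer can be checked without any network call. This is a trimmed standalone copy of the same logic (the name `sanitize` and the sample messages are illustrative, not part of the module):

```python
from typing import Any, Dict, List

def sanitize(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    clean: List[Dict[str, Any]] = []
    for m in messages:
        if m.get("role") == "tool":
            # Tool results become plain user notes (no tool_call_id pairing).
            clean.append({
                "role": "user",
                "content": f"[earlier tool result]\n{m.get('content') or ''}",
            })
            continue
        clean.append({"role": m.get("role"), "content": m.get("content", "")})
    merged: List[Dict[str, Any]] = []
    for m in clean:
        if merged and merged[-1]["role"] == m["role"]:
            # Alternation rule: fold consecutive same-role turns together.
            merged[-1]["content"] = f"{merged[-1]['content']}\n\n{m['content']}"
        else:
            merged.append(m)
    return merged

out = sanitize([
    {"role": "assistant", "content": "running ls"},
    {"role": "tool", "content": "file_a.py\nfile_b.py"},
    {"role": "user", "content": "what files exist?"},
])
print([m["role"] for m in out])  # → ['assistant', 'user']
```

The tool result is rewritten as a user note and then folded into the adjacent user turn, so strict role-alternation validators accept the history while the content survives.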
@@ -0,0 +1,96 @@
{
  "fixture": "config-build-competitive-scouts",
  "description": "Probes for the competitive-scout cron-job setup session. Anchors are which agents were configured, which day of the week each runs, and the full final schedule. This fixture most directly tests artifact-trail and iterative-merge because the job list grows by one per user turn.",
  "probes": [
    {
      "id": "recall-first-repo",
      "type": "recall",
      "question": "What was the first repository the user asked to create a scout cron for, and on what day of the week?",
      "expected_facts": ["openclaw", "Sunday"]
    },
    {
      "id": "recall-closed-source-target",
      "type": "recall",
      "question": "One of the scout targets does not have an open-source repository and had to be configured as a web scan instead. Which one, and on what day?",
      "expected_facts": ["claude code", "Friday", "web scan"]
    },
    {
      "id": "artifact-all-jobs",
      "type": "artifact",
      "question": "List every scout cron job created in this session.",
      "expected_facts": [
        "openclaw-pr-scout",
        "nanoclaw-pr-scout",
        "ironclaw-pr-scout",
        "kilocode-pr-scout",
        "codex-pr-scout",
        "gemini-cli-pr-scout",
        "cline-pr-scout",
        "opencode-pr-scout",
        "claude-code-scout",
        "aider-pr-scout",
        "roocode-pr-scout"
      ]
    },
    {
      "id": "artifact-final-schedule",
      "type": "artifact",
      "question": "What is the final weekly schedule? Give the day and the agents scanned on each day.",
      "expected_facts": [
        "Sun: openclaw, nanoclaw, ironclaw",
        "Mon: kilo code",
        "Tue: codex",
        "Wed: gemini cli, cline",
        "Thu: opencode",
        "Fri: claude code",
        "Sat: aider, roo"
      ]
    },
    {
      "id": "artifact-sunday-count",
      "type": "artifact",
      "question": "How many cron jobs run on Sunday?",
      "expected_facts": ["3", "three", "openclaw, nanoclaw, ironclaw"]
    },
    {
      "id": "artifact-total-count",
      "type": "artifact",
      "question": "How many scout cron jobs were created in total by the end of the session?",
      "expected_facts": ["11", "eleven"]
    },
    {
      "id": "decision-kilo-open-source",
      "type": "decision",
      "question": "The user asked whether Kilo Code is open source. What was the answer, and what did the user decide to do with it?",
      "expected_facts": [
        "yes, open source",
        "Kilo-Org/kilocode",
        "added as Monday scout"
      ]
    },
    {
      "id": "decision-saturday-fill",
      "type": "decision",
      "question": "Saturday was the last open day at one point. Which scout(s) were placed on Saturday, and why were those chosen?",
      "expected_facts": ["aider", "roo", "filled in last based on openrouter popularity / cli comparison rankings"]
    },
    {
      "id": "continuation-execution-time",
      "type": "continuation",
      "question": "At what local time of day do these scout cron jobs run?",
      "expected_facts": ["10 AM Pacific", "17:00 UTC", "0 17 * * *"]
    },
    {
      "id": "continuation-skill-used",
      "type": "continuation",
      "question": "Each scout job runs with a specific skill preloaded. Which one?",
      "expected_facts": ["hermes-agent-dev"]
    },
    {
      "id": "continuation-weekday-coverage",
      "type": "continuation",
      "question": "After the session ended, are there any weekdays still uncovered by a scout job?",
      "expected_facts": ["no", "all 7 days covered", "full week loaded"]
    }
  ]
}
@@ -0,0 +1,72 @@
{
  "fixture": "debug-session-feishu-id-model",
  "description": "Probes for the Feishu identity-model PR #8388 triage session. Anchors are the PR number, what the PR actually contained, what upstream docs confirmed, and the final decision + reasoning.",
  "probes": [
    {
      "id": "recall-pr-number",
      "type": "recall",
      "question": "What is the PR number under review in this session, and what repository is it against?",
      "expected_facts": ["PR #8388", "NousResearch/hermes-agent", "hermes-agent"]
    },
    {
      "id": "recall-bug-claim",
      "type": "recall",
      "question": "What is the core bug the PR claims to fix? Be specific about the identifier involved.",
      "expected_facts": ["open_id", "app-scoped", "not canonical", "Feishu identity model"]
    },
    {
      "id": "recall-upstream-confirmation",
      "type": "recall",
      "question": "Do upstream Feishu/Lark docs confirm that open_id is app-scoped rather than a canonical cross-app identity?",
      "expected_facts": ["yes", "confirmed", "open.feishu.cn", "same user has different Open IDs in different apps"]
    },
    {
      "id": "artifact-pr-scope",
      "type": "artifact",
      "question": "Roughly how large is PR #8388, and which gateway subsystems does it touch beyond the Feishu adapter?",
      "expected_facts": ["4647 lines", "gateway/run.py", "cron/scheduler.py", "gateway/config.py", "multi-account", "bind"]
    },
    {
      "id": "artifact-new-tool",
      "type": "artifact",
      "question": "Does the PR add a new tool file? If so, what is its path?",
      "expected_facts": ["tools/feishu_id_tool.py", "new file"]
    },
    {
      "id": "decision-pr-assessment",
      "type": "decision",
      "question": "What is the reviewer's overall assessment of PR #8388 — approve, reject, or something more nuanced? Explain in one sentence.",
      "expected_facts": [
        "core claim is correct",
        "scope is wrong",
        "bait-and-switch",
        "overbuilt",
        "implement cleaner ourselves"
      ]
    },
    {
      "id": "decision-core-claim-validity",
      "type": "decision",
      "question": "Setting aside the PR's size, is the underlying identity-model concern technically valid or not?",
      "expected_facts": ["technically valid", "correct", "open_id is app-scoped"]
    },
    {
      "id": "continuation-next-action",
      "type": "continuation",
      "question": "Based on the review outcome, what is the next action the agent has been asked to take regarding this PR?",
      "expected_facts": ["close the PR", "implement ourselves", "cleaner", "less complex"]
    },
    {
      "id": "continuation-implementation-scope",
      "type": "continuation",
      "question": "If implementing the Feishu fix cleanly ourselves, which specific behaviour needs to change — what should replace the current use of open_id?",
      "expected_facts": ["use union_id", "or user_id", "canonical identity", "cross-app stable ID"]
    },
    {
      "id": "continuation-sources-to-reference",
      "type": "continuation",
      "question": "Which upstream documentation sources were fetched during review that should be referenced when writing the clean implementation?",
      "expected_facts": ["open.feishu.cn", "open.larkoffice.com", "user-identity-introduction"]
    }
  ]
}
@@ -0,0 +1,74 @@
{
  "fixture": "feature-impl-context-priority",
  "description": "Probes for the .hermes.md / AGENTS.md / CLAUDE.md / .cursorrules priority feature session. Anchors are the concrete facts the next assistant would need to continue: user's priority order, files modified, helper-function structure, live-test scenarios, and PR number.",
  "probes": [
    {
      "id": "recall-priority-order",
      "type": "recall",
      "question": "What is the priority order the user asked for when multiple project-context files are present? List them from highest to lowest priority.",
      "expected_facts": [".hermes.md", "AGENTS.md", "CLAUDE.md", ".cursorrules", "highest to lowest"]
    },
    {
      "id": "recall-selection-mode",
      "type": "recall",
      "question": "When multiple context files exist in the same directory, does the agent now load all of them or pick only one?",
      "expected_facts": ["only one", "priority-based selection", "highest-priority winner"]
    },
    {
      "id": "artifact-files-modified",
      "type": "artifact",
      "question": "Which files in the hermes-agent repository were modified during this session? List them.",
      "expected_facts": [
        "agent/prompt_builder.py",
        "tests/agent/test_prompt_builder.py"
      ]
    },
    {
      "id": "artifact-helper-functions",
      "type": "artifact",
      "question": "The session introduced separate helper functions for each context-file type. What are their names?",
      "expected_facts": [
        "_load_hermes_md",
        "_load_agents_md",
        "_load_claude_md",
        "_load_cursorrules"
      ]
    },
    {
      "id": "artifact-test-scenarios",
      "type": "artifact",
      "question": "A scratch directory was created with scenario subdirectories to live-test the priority chain. Roughly how many scenarios, and what directory was it created under?",
      "expected_facts": ["10 scenarios", "/tmp/context-priority-test"]
    },
    {
      "id": "decision-claude-md-was-unsupported",
      "type": "decision",
      "question": "What was the finding about CLAUDE.md support in the existing loader before this session's changes?",
      "expected_facts": ["CLAUDE.md was not handled", "not supported", "new handler added"]
    },
    {
      "id": "decision-load-all-or-one",
      "type": "decision",
      "question": "Was the decision to load multiple context files when present, or to load only the highest-priority one? Explain the reasoning in one sentence.",
      "expected_facts": ["load only one", "highest priority", "user preference", "do not want to load multiple"]
    },
    {
      "id": "continuation-pr-number-and-status",
      "type": "continuation",
      "question": "A pull request was opened for this feature. What is the PR number and what is its merge status?",
      "expected_facts": ["PR #2301", "merged", "squash"]
    },
    {
      "id": "continuation-test-suite-result",
      "type": "continuation",
      "question": "What was the result of the full test suite run after the implementation changes?",
      "expected_facts": ["5680 passed", "0 failures", "clean"]
    },
    {
      "id": "continuation-next-step",
      "type": "continuation",
      "question": "If asked to pick up this session, what is the current state of main? Anything left to do?",
      "expected_facts": ["merged to main", "main is current", "nothing outstanding", "pulled"]
    }
  ]
}
235 scripts/compression_eval/report.py Normal file
@@ -0,0 +1,235 @@
"""Markdown report rendering + diff-against-baseline for compression-eval runs.

Report format is optimised for pasting directly into a PR description.
Top-of-report table is the per-fixture medians; below that is the
probe-by-probe miss list (scores < 3.0 on overall).

Diff mode (``compare_to``) emits a second table with deltas per fixture
per dimension against a previous run directory.
"""
from __future__ import annotations

import json
import statistics
from pathlib import Path
from typing import Any, Dict, List, Optional

from rubric import DIMENSIONS


def write_run_json(
    *,
    results_dir: Path,
    fixture_name: str,
    run_index: int,
    payload: Dict[str, Any],
) -> Path:
    """Dump one fixture's per-run results as JSON for later diffing."""
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{fixture_name}-run-{run_index}.json"
    with path.open("w") as fh:
        json.dump(payload, fh, indent=2, ensure_ascii=False)
    return path


def _median(values: List[float]) -> float:
    return statistics.median(values) if values else 0.0


def _format_score(value: float) -> str:
    return f"{value:.2f}"


def _format_delta(baseline: float, current: float) -> str:
    delta = current - baseline
    if abs(delta) < 0.01:
        return f"{current:.2f} (±0)"
    sign = "+" if delta > 0 else ""
    return f"{current:.2f} ({sign}{delta:.2f})"


def summarize_fixture_runs(
    fixture_runs: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Collapse N runs of one fixture into per-dimension medians + metadata.

    Each run payload is {probes: [{id, type, scores: {...}, overall, ...}]}.
    Returns {fixture_name, runs, dimension_medians, overall_median, misses}.
    """
    if not fixture_runs:
        return {}

    fixture_name = fixture_runs[0]["fixture_name"]
    n_runs = len(fixture_runs)

    # Per-probe-per-dimension aggregation across runs
    probe_ids = [p["id"] for p in fixture_runs[0]["probes"]]
    per_probe: Dict[str, Dict[str, List[float]]] = {
        pid: {d: [] for d in DIMENSIONS} for pid in probe_ids
    }
    per_probe_overall: Dict[str, List[float]] = {pid: [] for pid in probe_ids}

    for run in fixture_runs:
        for p in run["probes"]:
            pid = p["id"]
            for d in DIMENSIONS:
                per_probe[pid][d].append(p["scores"].get(d, 0))
            per_probe_overall[pid].append(p["overall"])

    # Median each probe across runs, then median those medians across probes
    dim_medians: Dict[str, float] = {}
    for d in DIMENSIONS:
        per_probe_med = [_median(per_probe[pid][d]) for pid in probe_ids]
        dim_medians[d] = _median(per_probe_med)
    overall_median = _median([_median(per_probe_overall[pid]) for pid in probe_ids])

    # Misses = probes whose median overall < 3.0
    misses: List[Dict[str, Any]] = []
    for pid in probe_ids:
        med = _median(per_probe_overall[pid])
        if med < 3.0:
            # Pull the notes from the last run to give the reader a
            # concrete clue. (Taking the most recent run is fine —
            # notes vary across runs and any one is illustrative.)
            notes = ""
            probe_type = ""
            for p in fixture_runs[-1]["probes"]:
                if p["id"] == pid:
                    notes = p.get("notes", "")
                    probe_type = p.get("type", "")
                    break
            misses.append({
                "id": pid,
                "type": probe_type,
                "overall_median": med,
                "notes": notes,
            })

    return {
        "fixture_name": fixture_name,
        "runs": n_runs,
        "dimension_medians": dim_medians,
        "overall_median": overall_median,
        "misses": misses,
        "compression": fixture_runs[0].get("compression", {}),
    }


def render_report(
    *,
    label: str,
    compressor_model: str,
    judge_model: str,
    runs_per_fixture: int,
    summaries: List[Dict[str, Any]],
    baseline_summaries: Optional[List[Dict[str, Any]]] = None,
) -> str:
    """Render the full markdown report.

    baseline_summaries is the same shape as summaries, sourced from a
    previous run (via --compare-to). When present, dimension scores in
    the main table render with deltas.
    """
    lines: List[str] = []
    lines.append(f"## Compression eval — label `{label}`")
    lines.append("")
    lines.append(f"- Compressor model: `{compressor_model}`")
    lines.append(f"- Judge model: `{judge_model}`")
    lines.append(f"- Runs per fixture: {runs_per_fixture}")
    lines.append("- Medians over runs reported")
    if baseline_summaries:
        lines.append("- Deltas shown against baseline run")
    lines.append("")

    baseline_by_name: Dict[str, Dict[str, Any]] = {}
    if baseline_summaries:
        baseline_by_name = {s["fixture_name"]: s for s in baseline_summaries}

    # Main table
    header = ["Fixture"] + DIMENSIONS + ["overall"]
    lines.append("| " + " | ".join(header) + " |")
    lines.append("|" + "|".join(["---"] * len(header)) + "|")
    for s in summaries:
        row = [s["fixture_name"]]
        baseline = baseline_by_name.get(s["fixture_name"])
        for d in DIMENSIONS:
            cur = s["dimension_medians"][d]
            if baseline and d in baseline.get("dimension_medians", {}):
                row.append(_format_delta(baseline["dimension_medians"][d], cur))
            else:
                row.append(_format_score(cur))
        if baseline:
            row.append(_format_delta(baseline["overall_median"], s["overall_median"]))
        else:
            row.append(_format_score(s["overall_median"]))
        lines.append("| " + " | ".join(row) + " |")
    lines.append("")

    # Compression metadata
    lines.append("### Compression summary")
    lines.append("")
    lines.append("| Fixture | Pre tokens | Post tokens | Ratio | Pre msgs | Post msgs |")
    lines.append("|---|---|---|---|---|---|")
    for s in summaries:
        c = s.get("compression", {})
        lines.append(
            "| {name} | {pre} | {post} | {ratio:.1%} | {pm} | {pom} |".format(
                name=s["fixture_name"],
                pre=c.get("pre_tokens", 0),
                post=c.get("post_tokens", 0),
                ratio=c.get("compression_ratio", 0.0),
                pm=c.get("pre_message_count", 0),
                pom=c.get("post_message_count", 0),
            )
        )
    lines.append("")

    # Per-probe misses
    any_misses = any(s["misses"] for s in summaries)
    if any_misses:
        lines.append("### Probes scoring below 3.0 overall (median)")
        lines.append("")
        for s in summaries:
            if not s["misses"]:
                continue
            lines.append(f"**{s['fixture_name']}**")
            for m in s["misses"]:
                note_part = f" — {m['notes']}" if m["notes"] else ""
                lines.append(
                    f"- `{m['id']}` ({m['type']}): "
                    f"{m['overall_median']:.2f}{note_part}"
                )
            lines.append("")

    lines.append("### Methodology")
    lines.append("")
    lines.append(
        "Probe-based eval adapted from "
        "https://factory.ai/news/evaluating-compression. Each fixture is "
        "compressed in a single forced `ContextCompressor.compress()` call, "
        "then a continuation call asks the compressor model to answer each "
        "probe from the compressed state, then the judge model scores the "
        "answer 0-5 on six dimensions. A single run is noisy; medians "
        "across multiple runs are the meaningful signal. Changes below "
        "~0.3 on any dimension are likely within run-to-run noise."
    )
    return "\n".join(lines) + "\n"


def load_baseline_summaries(baseline_dir: Path) -> List[Dict[str, Any]]:
    """Load summaries from a previous eval run for --compare-to.

    Reads the dumped per-run JSONs and re-summarises them so the
    aggregation matches whatever summariser was current at the time of
    the new run (forward-compatible with schema additions).
    """
    if not baseline_dir.exists():
        raise FileNotFoundError(f"baseline dir not found: {baseline_dir}")

    by_fixture: Dict[str, List[Dict[str, Any]]] = {}
    for path in sorted(baseline_dir.glob("*-run-*.json")):
        with path.open() as fh:
            payload = json.load(fh)
        by_fixture.setdefault(payload["fixture_name"], []).append(payload)

    return [summarize_fixture_runs(runs) for runs in by_fixture.values()]
0 scripts/compression_eval/results/.gitkeep Normal file
198 scripts/compression_eval/rubric.py Normal file
@@ -0,0 +1,198 @@
"""Rubric for probe-based compression eval grading.

Six dimensions scored 0-5 by a judge model. The scoring anchors are spelled
out so the judge interpretation is stable across runs and across judge
models.

Adapted from the methodology in
https://factory.ai/news/evaluating-compression. Their scoreboard is not
adopted; only the dimension definitions and the 0-5 scale.
"""
from __future__ import annotations

from typing import Any, Dict, List

# Canonical dimension order. All reports, parsers, and comparisons derive
# from this list — do not hardcode the order elsewhere.
DIMENSIONS: List[str] = [
    "accuracy",
    "context_awareness",
    "artifact_trail",
    "completeness",
    "continuity",
    "instruction_following",
]

DIMENSION_DESCRIPTIONS: Dict[str, str] = {
    "accuracy": (
        "Are concrete facts correct — file paths, function names, PR/issue "
        "numbers, error codes, command outputs, line numbers? A single wrong "
        "path or error code should cost points. Vague but non-contradicting "
        "answers score mid-range."
    ),
    "context_awareness": (
        "Does the answer reflect the CURRENT state of the session, not a "
        "mid-session snapshot? For example, if a file was modified then "
        "reverted, does the answer describe the reverted state? If three "
        "PRs were opened, does the answer know which was merged?"
    ),
    "artifact_trail": (
        "Does the answer correctly enumerate the artifacts (files read, "
        "files modified, commands run, tools called, PRs opened, cron jobs "
        "created)? Missing artifacts cost more than extra unrelated ones."
    ),
    "completeness": (
        "Does the answer address ALL parts of the probe question? If the "
        "probe asks for three things and only two are answered, that is "
        "incomplete regardless of accuracy on the two."
    ),
    "continuity": (
        "Could the next assistant continue the work using only this answer, "
        "without having to re-fetch files or re-explore the codebase? An "
        "answer that lists files by name but doesn't mention the change is "
        "poor continuity even if accurate."
    ),
    "instruction_following": (
        "Is the answer in the format the probe requested (list, number, "
        "short phrase, yes/no)? Ignore tone and length, only assess "
        "whether the requested form was honoured."
    ),
}

SCORE_SCALE: Dict[int, str] = {
    0: "No useful information; wrong or hallucinated.",
    1: "Major gaps or a key fact is wrong.",
    2: "Partially correct but significant omissions.",
    3: "Mostly correct with minor omissions or imprecision.",
    4: "Correct and complete with only trivial imprecision.",
    5: "Fully correct, complete, and in the requested format.",
}


_RUBRIC_HEADER = """You are an evaluator grading a single answer produced by an AI assistant \
that was given a COMPRESSED handoff summary of an earlier conversation and \
asked a probe question. You are NOT evaluating the compression summary \
directly — you are evaluating whether the answer the assistant produced \
from that summary is correct, complete, and useful.

Grade on six dimensions, each 0-5:

{dimension_block}

0-5 scale:
{scale_block}

Grade strictly. Fractional scores are NOT allowed — output integers only. \
If the answer is ambiguous, use the lower of the two candidate scores."""


def build_judge_prompt(
    *,
    probe_question: str,
    probe_type: str,
    expected_facts: List[str],
    assistant_answer: str,
) -> str:
    """Build the full judge prompt for one (probe, answer) pair.

    The judge is told the expected_facts up front so grading is anchored to
    concrete signal rather than judge taste. Expected facts are intentionally
    NOT shown to the assistant that produces the answer.
    """
    dim_block = "\n".join(
        f"- {d}: {DIMENSION_DESCRIPTIONS[d]}" for d in DIMENSIONS
    )
    scale_block = "\n".join(
        f"  {score}: {desc}" for score, desc in sorted(SCORE_SCALE.items())
    )
    header = _RUBRIC_HEADER.format(
        dimension_block=dim_block,
        scale_block=scale_block,
    )
|
|
||||||
|
expected_block = (
|
||||||
|
"\n".join(f"- {f}" for f in expected_facts) if expected_facts else "(none provided)"
|
||||||
|
)
|
||||||
|
|
||||||
|
output_schema = (
|
||||||
|
"Respond with ONLY a JSON object, no prose before or after, matching "
|
||||||
|
"this schema exactly:\n"
|
||||||
|
"{\n"
|
||||||
|
' "accuracy": <int 0-5>,\n'
|
||||||
|
' "context_awareness": <int 0-5>,\n'
|
||||||
|
' "artifact_trail": <int 0-5>,\n'
|
||||||
|
' "completeness": <int 0-5>,\n'
|
||||||
|
' "continuity": <int 0-5>,\n'
|
||||||
|
' "instruction_following": <int 0-5>,\n'
|
||||||
|
' "notes": "<one short sentence, <=200 chars, identifying the '
|
||||||
|
'single biggest issue with the answer if any>"\n'
|
||||||
|
"}"
|
||||||
|
)
|
||||||
|
|
||||||
|
return (
|
||||||
|
f"{header}\n\n"
|
||||||
|
f"PROBE TYPE: {probe_type}\n\n"
|
||||||
|
f"PROBE QUESTION:\n{probe_question}\n\n"
|
||||||
|
f"EXPECTED FACTS (the answer should contain these concrete anchors; "
|
||||||
|
f"missing any is a material defect in accuracy and/or completeness):\n"
|
||||||
|
f"{expected_block}\n\n"
|
||||||
|
f"ASSISTANT ANSWER TO GRADE:\n{assistant_answer}\n\n"
|
||||||
|
f"{output_schema}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def parse_judge_response(raw: str) -> Dict[str, Any]:
|
||||||
|
"""Parse the judge model's JSON response into a score dict.
|
||||||
|
|
||||||
|
Tolerates surrounding prose (judges ignore instructions sometimes) by
|
||||||
|
extracting the first {...} block. Validates that every dimension is
|
||||||
|
present as an integer 0-5.
|
||||||
|
|
||||||
|
Returns dict with keys: scores (dim->int), notes (str), overall (float).
|
||||||
|
Raises ValueError if the response cannot be parsed into a complete
|
||||||
|
score set.
|
||||||
|
"""
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
|
||||||
|
if not raw or not raw.strip():
|
||||||
|
raise ValueError("empty judge response")
|
||||||
|
|
||||||
|
# Strip code fences and any ```json prefix judges sometimes emit.
|
||||||
|
stripped = raw.strip()
|
||||||
|
fence_match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", stripped, re.DOTALL)
|
||||||
|
if fence_match:
|
||||||
|
stripped = fence_match.group(1).strip()
|
||||||
|
|
||||||
|
# Extract the first {...} block greedy-to-matching-brace.
|
||||||
|
brace_match = re.search(r"\{.*\}", stripped, re.DOTALL)
|
||||||
|
if not brace_match:
|
||||||
|
raise ValueError(f"no JSON object found in judge response: {raw[:200]!r}")
|
||||||
|
candidate = brace_match.group(0)
|
||||||
|
|
||||||
|
try:
|
||||||
|
parsed = json.loads(candidate)
|
||||||
|
except json.JSONDecodeError as exc:
|
||||||
|
raise ValueError(f"judge response not valid JSON: {exc}; raw={candidate[:200]!r}")
|
||||||
|
|
||||||
|
scores: Dict[str, int] = {}
|
||||||
|
for dim in DIMENSIONS:
|
||||||
|
if dim not in parsed:
|
||||||
|
raise ValueError(f"judge response missing dimension {dim!r}: {parsed}")
|
||||||
|
value = parsed[dim]
|
||||||
|
if isinstance(value, bool) or not isinstance(value, (int, float)):
|
||||||
|
raise ValueError(f"dimension {dim} is not numeric: {value!r}")
|
||||||
|
int_val = int(round(value))
|
||||||
|
if int_val < 0 or int_val > 5:
|
||||||
|
raise ValueError(f"dimension {dim} out of range: {int_val}")
|
||||||
|
scores[dim] = int_val
|
||||||
|
|
||||||
|
notes_val = parsed.get("notes", "")
|
||||||
|
notes = str(notes_val)[:200] if notes_val else ""
|
||||||
|
|
||||||
|
overall = sum(scores.values()) / len(scores)
|
||||||
|
return {
|
||||||
|
"scores": scores,
|
||||||
|
"notes": notes,
|
||||||
|
"overall": overall,
|
||||||
|
}
|
||||||
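The tolerant parsing path in `parse_judge_response` can be exercised on its own. This sketch inlines just the fence-stripping and brace-extraction steps from that function, using a judge response wrapped in a markdown fence as some models emit:

```python
import json
import re

# A judge response wrapped in a ```json fence, as some models emit.
raw = '```json\n{"accuracy": 4, "notes": "minor path slip"}\n```'

stripped = raw.strip()
fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", stripped, re.DOTALL)
if fence:
    stripped = fence.group(1).strip()

# Even if prose surrounds the object, grab the first {...} block.
brace = re.search(r"\{.*\}", stripped, re.DOTALL)
parsed = json.loads(brace.group(0))
print(parsed["accuracy"])  # → 4
```

The same two regexes appear verbatim in `parse_judge_response`; the full function additionally validates that all six dimensions are present as integers 0-5.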
383
scripts/compression_eval/run_eval.py
Executable file
@@ -0,0 +1,383 @@
#!/usr/bin/env python3
"""Compression eval — entry point.

Runs the full probe-based eval over one or more fixtures, produces a
markdown report in ``results/<label>/report.md`` paired with per-run JSON
for later diffing.

Not a pytest. Requires a configured provider + credentials (same path the
agent uses). Does not run in CI. See README.md for usage examples.
"""
from __future__ import annotations

import json
import logging
import sys
import time
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional

_HERE = Path(__file__).resolve().parent
_REPO_ROOT = _HERE.parents[1]
if str(_REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(_REPO_ROOT))
# Make our sibling modules importable whether invoked as a script or as -m.
if str(_HERE) not in sys.path:
    sys.path.insert(0, str(_HERE))

try:
    import fire  # noqa: F401
except ImportError:
    fire = None  # fall back to argparse if fire is unavailable

from hermes_cli.runtime_provider import resolve_runtime_provider  # noqa: E402

from compressor_driver import run_compression  # noqa: E402
from grader import answer_probe, grade_probe  # noqa: E402
from report import (  # noqa: E402
    load_baseline_summaries,
    render_report,
    summarize_fixture_runs,
    write_run_json,
)

logger = logging.getLogger("compression_eval")


FIXTURES_DIR = _HERE / "fixtures"
PROBES_DIR = _HERE / "probes"
RESULTS_DIR = _HERE / "results"


def _load_fixture(name: str) -> Dict[str, Any]:
    path = FIXTURES_DIR / f"{name}.json"
    if not path.exists():
        available = sorted(p.stem for p in FIXTURES_DIR.glob("*.json"))
        raise FileNotFoundError(
            f"Fixture not found: {name}. Available: {available}"
        )
    with path.open() as fh:
        return json.load(fh)


def _load_probes(name: str) -> Dict[str, Any]:
    path = PROBES_DIR / f"{name}.probes.json"
    if not path.exists():
        raise FileNotFoundError(f"Probe bank not found for fixture {name}: {path}")
    with path.open() as fh:
        return json.load(fh)


def _resolve_runtime(
    *,
    provider_override: Optional[str],
    model_override: Optional[str],
) -> Dict[str, Any]:
    """Resolve provider credentials via the same path the agent uses."""
    runtime = resolve_runtime_provider(
        requested=provider_override,
        target_model=model_override,
    )
    if not runtime.get("api_key") and not runtime.get("base_url"):
        raise RuntimeError(
            "No provider configured. Run `hermes setup` or set provider "
            "credentials in the environment before running the eval."
        )
    return runtime


def _available_fixtures() -> List[str]:
    return sorted(p.stem for p in FIXTURES_DIR.glob("*.json"))


def _run_one_fixture(
    *,
    fixture_name: str,
    run_index: int,
    compressor_runtime: Dict[str, Any],
    compressor_model: str,
    judge_runtime: Dict[str, Any],
    judge_model: str,
    focus_topic: Optional[str],
) -> Dict[str, Any]:
    fx = _load_fixture(fixture_name)
    probes = _load_probes(fixture_name)

    logger.info(
        "[%s run=%d] compressing (%d messages, ctx=%d)",
        fixture_name, run_index, len(fx["messages"]), fx["context_length"],
    )
    compression = run_compression(
        messages=fx["messages"],
        compressor_model=compressor_model,
        compressor_provider=compressor_runtime["provider"],
        compressor_base_url=compressor_runtime["base_url"],
        compressor_api_key=compressor_runtime["api_key"],
        compressor_api_mode=compressor_runtime.get("api_mode", ""),
        context_length=fx["context_length"],
        focus_topic=focus_topic,
        # Force the compressor to use the model we're testing, bypassing
        # any auxiliary.compression.model config override. Without this,
        # ContextCompressor.call_llm(task="compression") routes through
        # the user's config which may pin a different model (e.g.
        # google/gemini-3-flash-preview).
        summary_model_override=compressor_model,
    )
    logger.info(
        "[%s run=%d] compressed %d -> %d tokens (%.1f%%)",
        fixture_name, run_index,
        compression["pre_tokens"], compression["post_tokens"],
        compression["compression_ratio"] * 100,
    )

    probe_results: List[Dict[str, Any]] = []
    for probe in probes["probes"]:
        t0 = time.monotonic()
        try:
            answer = answer_probe(
                compressed_messages=compression["compressed_messages"],
                probe_question=probe["question"],
                provider=compressor_runtime["provider"],
                model=compressor_model,
                base_url=compressor_runtime["base_url"],
                api_key=compressor_runtime["api_key"],
            )
        except Exception as exc:
            logger.warning(
                "[%s run=%d probe=%s] continuation failed: %s",
                fixture_name, run_index, probe["id"], exc,
            )
            answer = ""

        try:
            grade = grade_probe(
                probe_question=probe["question"],
                probe_type=probe["type"],
                expected_facts=probe.get("expected_facts", []),
                assistant_answer=answer,
                judge_provider=judge_runtime["provider"],
                judge_model=judge_model,
                judge_base_url=judge_runtime["base_url"],
                judge_api_key=judge_runtime["api_key"],
            )
        except Exception as exc:
            logger.warning(
                "[%s run=%d probe=%s] grading failed: %s",
                fixture_name, run_index, probe["id"], exc,
            )
            from rubric import DIMENSIONS
            grade = {
                "scores": {d: 0 for d in DIMENSIONS},
                "notes": f"grading error: {exc}",
                "overall": 0.0,
                "raw": "",
                "parse_error": str(exc),
            }

        elapsed = time.monotonic() - t0
        logger.info(
            "[%s run=%d probe=%s] overall=%.2f (%.1fs)",
            fixture_name, run_index, probe["id"], grade["overall"], elapsed,
        )

        probe_results.append({
            "id": probe["id"],
            "type": probe["type"],
            "question": probe["question"],
            "expected_facts": probe.get("expected_facts", []),
            "answer": answer,
            "scores": grade["scores"],
            "overall": grade["overall"],
            "notes": grade["notes"],
            "parse_error": grade["parse_error"],
            "elapsed_seconds": elapsed,
        })

    return {
        "fixture_name": fixture_name,
        "run_index": run_index,
        "compression": {
            "pre_tokens": compression["pre_tokens"],
            "post_tokens": compression["post_tokens"],
            "compression_ratio": compression["compression_ratio"],
            "pre_message_count": compression["pre_message_count"],
            "post_message_count": compression["post_message_count"],
            "summary_text": compression["summary_text"],
        },
        "probes": probe_results,
    }


def _coerce_fixtures_arg(arg: Optional[str]) -> List[str]:
    if not arg:
        return _available_fixtures()
    return [s.strip() for s in arg.split(",") if s.strip()]


def main(
    fixtures: Optional[str] = None,
    runs: int = 3,
    judge_model: Optional[str] = None,
    judge_provider: Optional[str] = None,
    compressor_model: Optional[str] = None,
    compressor_provider: Optional[str] = None,
    label: Optional[str] = None,
    focus_topic: Optional[str] = None,
    compare_to: Optional[str] = None,
    verbose: bool = False,
) -> int:
    """Run the compression eval.

    Args:
        fixtures: Comma-separated fixture names; default = all in fixtures/.
        runs: Runs per fixture. Medians reported. Default 3.
        judge_model: Override the judge model (default = same as
            compressor model resolved from config).
        judge_provider: Override the judge provider.
        compressor_model: Override the compressor model (default =
            whatever resolve_runtime_provider returns for the active
            configuration).
        compressor_provider: Override the compressor provider.
        label: Output subdirectory under results/. Default = timestamp.
        focus_topic: Optional focus topic passed through to
            ContextCompressor.compress(focus_topic=...).
        compare_to: Path to a previous run directory (e.g.
            results/2026-04-24_baseline) to diff against in the report.
        verbose: Print debug logs.
    """
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )

    fixture_names = _coerce_fixtures_arg(fixtures)
    # Validate every fixture has a probe bank before spending any money.
    for name in fixture_names:
        _load_fixture(name)
        _load_probes(name)

    compressor_runtime = _resolve_runtime(
        provider_override=compressor_provider,
        model_override=compressor_model,
    )
    effective_compressor_model = (
        compressor_model or compressor_runtime.get("resolved_model") or "auto"
    )
    if effective_compressor_model == "auto":
        # resolve_runtime_provider doesn't always fill resolved_model;
        # fall back to reading model.default from config.
        from hermes_cli.config import load_config
        cfg = load_config()
        mc = cfg.get("model", {}) or {}
        if isinstance(mc, dict):
            effective_compressor_model = (
                mc.get("default") or mc.get("model") or "anthropic/claude-sonnet-4.6"
            )
        else:
            effective_compressor_model = str(mc) or "anthropic/claude-sonnet-4.6"

    if judge_provider or judge_model:
        judge_runtime = _resolve_runtime(
            provider_override=judge_provider,
            model_override=judge_model,
        )
        effective_judge_model = judge_model or effective_compressor_model
    else:
        judge_runtime = compressor_runtime
        effective_judge_model = effective_compressor_model

    effective_label = label or datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    out_dir = RESULTS_DIR / effective_label
    out_dir.mkdir(parents=True, exist_ok=True)

    logger.info(
        "Compression eval starting: label=%s fixtures=%s runs=%d "
        "compressor=%s judge=%s out=%s",
        effective_label, fixture_names, runs,
        effective_compressor_model, effective_judge_model, out_dir,
    )

    all_summaries: List[Dict[str, Any]] = []
    for fixture_name in fixture_names:
        per_run: List[Dict[str, Any]] = []
        for run_i in range(1, runs + 1):
            payload = _run_one_fixture(
                fixture_name=fixture_name,
                run_index=run_i,
                compressor_runtime=compressor_runtime,
                compressor_model=effective_compressor_model,
                judge_runtime=judge_runtime,
                judge_model=effective_judge_model,
                focus_topic=focus_topic,
            )
            write_run_json(
                results_dir=out_dir,
                fixture_name=fixture_name,
                run_index=run_i,
                payload=payload,
            )
            per_run.append(payload)
        summary = summarize_fixture_runs(per_run)
        all_summaries.append(summary)

    baseline_summaries: Optional[List[Dict[str, Any]]] = None
    if compare_to:
        baseline_path = Path(compare_to)
        if not baseline_path.is_absolute():
            baseline_path = _HERE / baseline_path
        baseline_summaries = load_baseline_summaries(baseline_path)

    report_md = render_report(
        label=effective_label,
        compressor_model=effective_compressor_model,
        judge_model=effective_judge_model,
        runs_per_fixture=runs,
        summaries=all_summaries,
        baseline_summaries=baseline_summaries,
    )
    report_path = out_dir / "report.md"
    report_path.write_text(report_md)

    # Also write a machine-readable summary.json alongside the human report.
    summary_path = out_dir / "summary.json"
    with summary_path.open("w") as fh:
        json.dump(
            {
                "label": effective_label,
                "compressor_model": effective_compressor_model,
                "judge_model": effective_judge_model,
                "runs_per_fixture": runs,
                "fixtures": all_summaries,
            },
            fh,
            indent=2,
            ensure_ascii=False,
        )

    print()
    print(report_md)
    print(f"Report written to {report_path}")
    print(f"Per-run JSON in {out_dir}")
    return 0


if __name__ == "__main__":
    if fire is not None:
        # fire preserves docstrings as --help and handles kwarg-style CLI.
        sys.exit(fire.Fire(main))
    else:
        import argparse
        p = argparse.ArgumentParser()
        p.add_argument("--fixtures")
        p.add_argument("--runs", type=int, default=3)
        p.add_argument("--judge-model", dest="judge_model")
        p.add_argument("--judge-provider", dest="judge_provider")
        p.add_argument("--compressor-model", dest="compressor_model")
        p.add_argument("--compressor-provider", dest="compressor_provider")
        p.add_argument("--label")
        p.add_argument("--focus-topic", dest="focus_topic")
        p.add_argument("--compare-to", dest="compare_to")
        p.add_argument("--verbose", action="store_true")
        args = p.parse_args()
        sys.exit(main(**vars(args)))
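The `--fixtures` argument parsing above is small enough to check in isolation. This sketch mirrors the split-and-strip one-liner from `_coerce_fixtures_arg` (the all-fixtures fallback, which in the real script globs `fixtures/*.json`, is stubbed out here with an empty list):

```python
from typing import List, Optional

def coerce_fixtures_arg(arg: Optional[str]) -> List[str]:
    # None/empty means "run everything"; the real script returns all
    # fixture stems here, this sketch just returns an empty list.
    if not arg:
        return []
    # Split on commas, trim whitespace, drop empty entries.
    return [s.strip() for s in arg.split(",") if s.strip()]

print(coerce_fixtures_arg(" feature-impl-context-priority , ,debug-session-feishu-id-model"))
# → ['feature-impl-context-priority', 'debug-session-feishu-id-model']
```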
381
scripts/compression_eval/scrub_fixtures.py
Executable file
@@ -0,0 +1,381 @@
"""One-shot fixture scrubber for scripts/compression_eval/fixtures/.

Source: ~/.hermes/sessions/<file>.jsonl
Output: .worktrees/.../scripts/compression_eval/fixtures/<name>.json

Scrubbing passes:
1. agent.redact.redact_sensitive_text — API keys, tokens, connection strings
2. Username paths — /home/teknium/ → /home/user/, ~/.hermes/ preserved as-is
   (that path is universal)
3. Personal handles — "Teknium"/"teknium"/"teknium1" → "user"
4. Reasoning scratchpads — strip <REASONING_SCRATCHPAD>...</REASONING_SCRATCHPAD>
   blocks and <think>...</think> tags (personality leakage risk)
5. session_meta line — drop entirely, we only need the messages
6. User message personality — lightly paraphrase the first user message to keep
   task intent while removing "vibe"; subsequent user turns kept verbatim
   since they're short instructions

The fixture format matches DESIGN.md:
{
  "name": "...",
  "description": "...",
  "model": "...",           # best guess from original session
  "context_length": 200000,
  "messages": [...],        # OpenAI-format, only role/content/tool_calls/tool_call_id/tool_name
  "notes": "Scrubbed from ~/.hermes/sessions/... on 2026-04-24"
}
"""
from __future__ import annotations

import json
import re
import sys
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List

# Resolve the hermes-agent checkout relative to this script so agent.redact
# imports cleanly whether we run from a worktree or a main clone.
_REPO_ROOT = Path(__file__).resolve().parents[2]
sys.path.insert(0, str(_REPO_ROOT))
from agent.redact import redact_sensitive_text  # noqa: E402


SESSION_DIR = Path.home() / ".hermes" / "sessions"
# Resolve FIXTURES_DIR relative to this script so the scrubber runs the
# same way inside a worktree, a main checkout, or from a contributor's
# clone at a different path.
FIXTURES_DIR = Path(__file__).resolve().parent / "fixtures"

# (source_file, output_name, description, user_first_paraphrase, model_guess, context_length, truncate_at)
# truncate_at: keep messages[:truncate_at] (None = keep all). Applied BEFORE
# orphan-empty-assistant cleanup.
SPECS = [
    (
        "20260321_060441_fef7be92.jsonl",
        "feature-impl-context-priority",
        "~75-turn feature-impl: user asks how multiple project-context files "
        "(.hermes.md / AGENTS.md / CLAUDE.md / .cursorrules) are handled when "
        "all are present; agent investigates the codebase, designs a priority "
        "order, patches the loader + tests, live-tests with a scenario "
        "directory, commits to a feature branch, opens a PR, and merges after "
        "approval. Exercises investigate → decide → implement → verify → "
        "ship flow with clear artifact trail (2 files modified, 1 PR).",
        (
            "If .hermes.md, AGENTS.md, CLAUDE.md, and .cursorrules all exist in "
            "the same directory, does the agent load all of them or pick one? "
            "Use the hermes-agent-dev skill to check."
        ),
        "anthropic/claude-sonnet-4.6",
        200000,
        74,  # cut at "Merged and pulled. Main is current." — drops trailing unrelated cron-delivery messages
    ),
    (
        "20260412_233741_3f2119a8.jsonl",
        "debug-session-feishu-id-model",
        "~60-turn debug/triage PR-review session: a third-party bug report "
        "says the gateway's Feishu adapter misuses the open_id / union_id / "
        "user_id identity model (open_id is app-scoped, not the bot's "
        "canonical ID). An open community PR (#8388) tries to fix it. Agent "
        "reviews the PR against current main, fetches upstream Feishu/Lark "
        "identity docs, and produces a decision. Exercises long tool-heavy "
        "context with PR diffs, upstream docs, and a clear decision at the "
        "end — the classic 'can the summary still name the PR number, the "
        "root cause, and the decision?' scenario.",
        (
            "A community user reports the Feishu/Lark adapter gets the identity "
            "model wrong — open_id is app-scoped, not the bot's canonical ID. "
            "There's an open PR #8388 trying to fix it. Use the hermes-agent-dev "
            "skill and the pr-triage-salvage skill to review it."
        ),
        "anthropic/claude-sonnet-4.6",
        200000,
        58,  # end at "Here's my review: ..." — clean decision point before the "close it, implement cleaner" pivot
    ),
    (
        "20260328_160817_77bd258b.jsonl",
        "config-build-competitive-scouts",
        "~60-turn iterative config/build session: user wants a set of weekly "
        "cron jobs that scan competing AI coding agents (openclaw, nanoclaw, "
        "ironclaw, codex, opencode, claude-code, kilo-code, gemini-cli, "
        "cline, aider, roo) for merged PRs or web updates worth porting to "
        "hermes-agent. User adds one target per turn; agent creates each cron "
        "job and re-states the accumulated schedule. Exercises artifact trail "
        "(which jobs are configured, which days) and iterative state "
        "accumulation — the canonical case for iterative-merge summarization.",
        (
            "Set up a cron job for the agent every Sunday to scan all PRs "
            "merged into openclaw that week, decide which are worth adding to "
            "hermes-agent, and open PRs porting those features."
        ),
        "anthropic/claude-sonnet-4.6",
        200000,
        None,
    ),
]


# Tool output truncation is DELIBERATELY DISABLED.
#
# An earlier iteration truncated tool outputs > 2KB to keep fixture JSON
# files small, but that defeats the whole purpose of the eval. Real
# sessions have 30KB skill_view dumps, 10KB read_file outputs, 5KB
# web_extract bodies — compression has to either head-protect them,
# summarize them, or drop them. A fixture without that load doesn't
# exercise the compressor. The size win wasn't worth the signal loss.
#
# The function remains so the scrubbing_passes record in the fixture
# JSON continues to truthfully describe what was applied (no-op in this
# configuration).
_TOOL_OUTPUT_MAX = None  # None disables truncation entirely


def _maybe_truncate_tool_output(text: str, tool_name: str) -> str:
    if _TOOL_OUTPUT_MAX is None or not text or len(text) <= _TOOL_OUTPUT_MAX:
        return text
    keep = _TOOL_OUTPUT_MAX - 200
    head = text[:keep]
    return (
        head
        + f"\n\n[... tool output truncated for fixture — original was {len(text)} chars"
        + (f" from {tool_name}" if tool_name else "")
        + "]"
    )


_PATH_RE = re.compile(r"/home/teknium\b")
# No \b boundaries — some tool content stores newlines as the literal
# two-char sequence "\\n" (escaped JSON), so a "\\nTeknium..." run has a
# word char ('n') immediately before 'T' and \b fails. Substring match is
# safer here; "Teknium" as a substring of an unrelated word is
# implausible in this corpus.
_USER_RE = re.compile(r"teknium1|Teknium|teknium", re.IGNORECASE)
# Only strip scratchpads in ASSISTANT content, not tool results (might be legit)
_SCRATCH_RE = re.compile(
    r"<REASONING_SCRATCHPAD>.*?</REASONING_SCRATCHPAD>\s*", re.DOTALL
)
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
# Discord/Telegram user mention leakage in messaging-platform sessions
_USER_MENTION_RE = re.compile(r"<@\*{3}>|<@\d+>")
# Contributor emails (from git show output etc) — anything@domain.tld
# Keep noreply@github-style placeholders obvious; real personal emails get
# replaced with a contributor placeholder.
_EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# "Author: Name <email>" git-show headers — rewrite the whole line
_GIT_AUTHOR_RE = re.compile(r"Author:\s*[^<\n]+<[^>]+>")


def _scrub_text(text: str, *, drop_scratchpads: bool = False) -> str:
    """Apply the pipeline to a raw text string.

    drop_scratchpads only affects assistant messages — tool outputs that
    happen to contain similar markers are left alone.
    """
    if not text:
        return text
    if drop_scratchpads:
        text = _SCRATCH_RE.sub("", text)
        text = _THINK_RE.sub("", text)
    text = _PATH_RE.sub("/home/user", text)
    text = _USER_RE.sub("user", text)
    text = _USER_MENTION_RE.sub("<@user>", text)
    # Rewrite git "Author: Name <email>" lines before generic email replace
    text = _GIT_AUTHOR_RE.sub("Author: contributor <contributor@example.com>", text)
    text = _EMAIL_RE.sub("contributor@example.com", text)
    text = redact_sensitive_text(text)
    return text

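The ordering constraint inside `_scrub_text` (rewrite git `Author:` headers before the generic email pass, so the name half of `Name <email>` doesn't survive) can be demonstrated with just the three relevant regexes. This sketch omits the `redact_sensitive_text` pass, which lives in `agent.redact`:

```python
import re

# Same patterns as the module-level regexes above.
_GIT_AUTHOR_RE = re.compile(r"Author:\s*[^<\n]+<[^>]+>")
_EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
_PATH_RE = re.compile(r"/home/teknium\b")

sample = "Author: Jane Doe <jane@example.org>\nwork dir /home/teknium/repo"
# Author line first: replaces both the name and the email in one pass.
out = _GIT_AUTHOR_RE.sub("Author: contributor <contributor@example.com>", sample)
# Generic email pass catches any addresses outside Author: headers.
out = _EMAIL_RE.sub("contributor@example.com", out)
out = _PATH_RE.sub("/home/user", out)
print(out)
# → Author: contributor <contributor@example.com>
# → work dir /home/user/repo
```

Run the email pass first instead and "Jane Doe" would remain in the output attached to the placeholder address, which is exactly the leak the ordering prevents.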

def _content_to_str(content: Any) -> str:
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for p in content:
            if isinstance(p, dict) and "text" in p:
                parts.append(p["text"])
            elif isinstance(p, str):
                parts.append(p)
        return "\n".join(parts)
    return str(content)


def _scrub_tool_calls(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    out = []
    for tc in tool_calls or []:
        if not isinstance(tc, dict):
            continue
        fn = tc.get("function", {}) or {}
        args = fn.get("arguments", "")
        if isinstance(args, str):
            args = _scrub_text(args)
        new_tc = {
            "id": tc.get("id", ""),
            "type": tc.get("type", "function"),
            "function": {
                "name": fn.get("name", ""),
                "arguments": args,
            },
        }
        out.append(new_tc)
    return out


def _scrub_message(m: Dict[str, Any], *, first_user_paraphrase: str | None, user_turn_idx: List[int]) -> Dict[str, Any] | None:
    role = m.get("role")
    if role in (None, "session_meta"):
        return None

    content = _content_to_str(m.get("content"))

    if role == "assistant":
        content = _scrub_text(content, drop_scratchpads=True)
    elif role == "user":
        # Use paraphrase for the very first user turn only
        user_turn_idx[0] += 1
        if user_turn_idx[0] == 1 and first_user_paraphrase is not None:
            content = first_user_paraphrase
        else:
            content = _scrub_text(content)
    else:
        content = _scrub_text(content)
        # Truncate large tool outputs
        if role == "tool":
            tn = m.get("tool_name") or m.get("name") or ""
            content = _maybe_truncate_tool_output(content, tn)

    new_msg: Dict[str, Any] = {"role": role, "content": content}

    if role == "assistant":
        tcs = m.get("tool_calls") or []
        if tcs:
            new_msg["tool_calls"] = _scrub_tool_calls(tcs)
    if role == "tool":
        if m.get("tool_call_id"):
            new_msg["tool_call_id"] = m["tool_call_id"]
        if m.get("tool_name") or m.get("name"):
            new_msg["tool_name"] = m.get("tool_name") or m.get("name")

    return new_msg


def build_fixture(
    source_file: str,
    output_name: str,
    description: str,
    first_user_paraphrase: str,
    model_guess: str,
    context_length: int,
    truncate_at: int | None = None,
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
src = SESSION_DIR / source_file
|
||||||
|
raw_msgs: List[Dict[str, Any]] = []
|
||||||
|
with src.open() as fh:
|
||||||
|
for line in fh:
|
||||||
|
try:
|
||||||
|
raw_msgs.append(json.loads(line))
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Skip session_meta lines up front so truncate_at counts real messages
|
||||||
|
raw_msgs = [m for m in raw_msgs if m.get("role") != "session_meta"]
|
||||||
|
if truncate_at is not None:
|
||||||
|
raw_msgs = raw_msgs[:truncate_at]
|
||||||
|
|
||||||
|
user_turn_counter = [0]
|
||||||
|
scrubbed: List[Dict[str, Any]] = []
|
||||||
|
for m in raw_msgs:
|
||||||
|
new = _scrub_message(
|
||||||
|
m,
|
||||||
|
first_user_paraphrase=first_user_paraphrase,
|
||||||
|
user_turn_idx=user_turn_counter,
|
||||||
|
)
|
||||||
|
if new is not None:
|
||||||
|
scrubbed.append(new)
|
||||||
|
|
||||||
|
# Drop empty-content assistant messages that have no tool_calls
|
||||||
|
# (artifact of scratchpad-only turns post-scrub)
|
||||||
|
pruned: List[Dict[str, Any]] = []
|
||||||
|
for m in scrubbed:
|
||||||
|
if (
|
||||||
|
m["role"] == "assistant"
|
||||||
|
and not (m.get("content") or "").strip()
|
||||||
|
and not m.get("tool_calls")
|
||||||
|
):
|
||||||
|
continue
|
||||||
|
pruned.append(m)
|
||||||
|
# Trim trailing orphan tool messages (no matching assistant)
|
||||||
|
while pruned and pruned[-1]["role"] == "tool":
|
||||||
|
pruned.pop()
|
||||||
|
scrubbed = pruned
|
||||||
|
|
||||||
|
# Inject a synthetic public-safe system message so the compressor has
|
||||||
|
# a head to anchor on. The real system prompts embed personality and
|
||||||
|
# platform-specific content we don't want checked in.
|
||||||
|
system_msg = {
|
||||||
|
"role": "system",
|
||||||
|
"content": (
|
||||||
|
"You are a helpful AI coding assistant with access to tools "
|
||||||
|
"(terminal, file editing, search, web, etc.). You operate in a "
|
||||||
|
"conversational loop: the user gives you a task, you call tools "
|
||||||
|
"to accomplish it, and you report back concisely."
|
||||||
|
),
|
||||||
|
}
|
||||||
|
if scrubbed and scrubbed[0].get("role") == "system":
|
||||||
|
scrubbed[0] = system_msg
|
||||||
|
else:
|
||||||
|
scrubbed.insert(0, system_msg)
|
||||||
|
|
||||||
|
fixture = {
|
||||||
|
"name": output_name,
|
||||||
|
"description": description,
|
||||||
|
"model": model_guess,
|
||||||
|
"context_length": context_length,
|
||||||
|
"source": f"~/.hermes/sessions/{source_file}",
|
||||||
|
"truncated_to": truncate_at,
|
||||||
|
"scrubbed_at": datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ"),
|
||||||
|
"scrubbing_passes": [
|
||||||
|
"redact_sensitive_text (agent.redact)",
|
||||||
|
"username paths replaced with /home/user",
|
||||||
|
"personal handles (all case variants of the maintainer name) replaced with 'user'",
|
||||||
|
"email addresses replaced with contributor@example.com",
|
||||||
|
"git 'Author: Name <addr>' header lines normalised",
|
||||||
|
"reasoning scratchpad blocks stripped from assistant content",
|
||||||
|
"think tag blocks stripped from assistant content",
|
||||||
|
"messaging-platform user mentions replaced with <@user>",
|
||||||
|
"first user message paraphrased to remove personal voice",
|
||||||
|
"subsequent user messages kept verbatim (after above redactions)",
|
||||||
|
"system prompt replaced with generic public-safe placeholder",
|
||||||
|
"orphan empty-assistant messages and trailing tool messages dropped",
|
||||||
|
"tool outputs preserved verbatim (truncation disabled so the compressor sees real load)",
|
||||||
|
],
|
||||||
|
"messages": scrubbed,
|
||||||
|
}
|
||||||
|
return fixture
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> int:
|
||||||
|
FIXTURES_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
for spec in SPECS:
|
||||||
|
source_file, output_name, description, paraphrase, model, ctx, truncate = spec
|
||||||
|
fixture = build_fixture(
|
||||||
|
source_file=source_file,
|
||||||
|
output_name=output_name,
|
||||||
|
description=description,
|
||||||
|
first_user_paraphrase=paraphrase,
|
||||||
|
model_guess=model,
|
||||||
|
context_length=ctx,
|
||||||
|
truncate_at=truncate,
|
||||||
|
)
|
||||||
|
out_path = FIXTURES_DIR / f"{output_name}.json"
|
||||||
|
with out_path.open("w") as fh:
|
||||||
|
json.dump(fixture, fh, indent=2, ensure_ascii=False)
|
||||||
|
size_kb = out_path.stat().st_size / 1024
|
||||||
|
print(f" {output_name}.json {size_kb:.1f} KB {len(fixture['messages'])} msgs")
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
sys.exit(main())
|
0
tests/scripts/__init__.py
Normal file
449
tests/scripts/test_compression_eval.py
Normal file
@@ -0,0 +1,449 @@
"""Unit tests for scripts/compression_eval/ non-LLM paths.

These exercise rubric parsing, report rendering, and fixture/probe
loading — everything that does NOT require API credentials. The eval
harness itself (run_eval.py) is not hermetic and is not tested here.
"""
from __future__ import annotations

import json
import sys
from pathlib import Path

import pytest

_SCRIPTS_DIR = Path(__file__).resolve().parents[2] / "scripts" / "compression_eval"
if str(_SCRIPTS_DIR) not in sys.path:
    sys.path.insert(0, str(_SCRIPTS_DIR))

from rubric import (  # noqa: E402
    DIMENSIONS,
    SCORE_SCALE,
    build_judge_prompt,
    parse_judge_response,
)
from report import (  # noqa: E402
    render_report,
    summarize_fixture_runs,
    write_run_json,
)


# ---------- rubric.parse_judge_response ----------


def test_parse_judge_response_accepts_clean_json():
    raw = json.dumps({
        "accuracy": 4,
        "context_awareness": 3,
        "artifact_trail": 2,
        "completeness": 5,
        "continuity": 4,
        "instruction_following": 5,
        "notes": "missed redis_client.py",
    })
    out = parse_judge_response(raw)
    assert out["scores"]["accuracy"] == 4
    assert out["scores"]["artifact_trail"] == 2
    assert out["notes"] == "missed redis_client.py"
    assert 0 <= out["overall"] <= 5
    # overall is the arithmetic mean of the six dims
    expected = (4 + 3 + 2 + 5 + 4 + 5) / 6
    assert abs(out["overall"] - expected) < 1e-9


def test_parse_judge_response_strips_code_fences():
    raw = '```json\n{"accuracy":5,"context_awareness":5,"artifact_trail":5,"completeness":5,"continuity":5,"instruction_following":5,"notes":""}\n```'
    out = parse_judge_response(raw)
    assert all(v == 5 for v in out["scores"].values())


def test_parse_judge_response_tolerates_surrounding_prose():
    raw = (
        "Here is my grading:\n\n"
        '{"accuracy": 3, "context_awareness": 4, "artifact_trail": 3, '
        '"completeness": 4, "continuity": 3, "instruction_following": 5, '
        '"notes": "ok"}\n\n'
        "Let me know if you need more detail."
    )
    out = parse_judge_response(raw)
    assert out["scores"]["accuracy"] == 3


def test_parse_judge_response_rounds_floats_to_ints():
    raw = json.dumps({
        "accuracy": 3.4,
        "context_awareness": 3.6,
        "artifact_trail": 3,
        "completeness": 3,
        "continuity": 3,
        "instruction_following": 3,
        "notes": "",
    })
    out = parse_judge_response(raw)
    assert out["scores"]["accuracy"] == 3
    assert out["scores"]["context_awareness"] == 4


def test_parse_judge_response_rejects_out_of_range():
    raw = json.dumps({
        "accuracy": 7,  # illegal
        "context_awareness": 3, "artifact_trail": 3, "completeness": 3,
        "continuity": 3, "instruction_following": 3, "notes": "",
    })
    with pytest.raises(ValueError, match="out of range"):
        parse_judge_response(raw)


def test_parse_judge_response_rejects_missing_dimension():
    raw = json.dumps({
        "accuracy": 3, "context_awareness": 3, "artifact_trail": 3,
        "completeness": 3, "continuity": 3,
        # instruction_following missing
        "notes": "",
    })
    with pytest.raises(ValueError, match="missing dimension"):
        parse_judge_response(raw)


def test_parse_judge_response_rejects_non_numeric():
    raw = json.dumps({
        "accuracy": "high",
        "context_awareness": 3, "artifact_trail": 3, "completeness": 3,
        "continuity": 3, "instruction_following": 3, "notes": "",
    })
    with pytest.raises(ValueError, match="not numeric"):
        parse_judge_response(raw)


def test_parse_judge_response_rejects_booleans_as_numeric():
    # JSON bools coerce to int otherwise — catch that explicitly
    raw = json.dumps({
        "accuracy": True,
        "context_awareness": 3, "artifact_trail": 3, "completeness": 3,
        "continuity": 3, "instruction_following": 3, "notes": "",
    })
    with pytest.raises(ValueError, match="not numeric"):
        parse_judge_response(raw)


def test_parse_judge_response_rejects_empty():
    with pytest.raises(ValueError, match="empty"):
        parse_judge_response("")


def test_parse_judge_response_rejects_no_json():
    with pytest.raises(ValueError, match="no JSON object"):
        parse_judge_response("just some prose with no braces at all")


def test_parse_judge_response_rejects_malformed_json():
    with pytest.raises(ValueError, match="not valid JSON"):
        parse_judge_response("{accuracy: 3,}")  # missing quotes, trailing comma


def test_parse_judge_response_truncates_long_notes():
    long_notes = "x" * 500
    raw = json.dumps({
        "accuracy": 3, "context_awareness": 3, "artifact_trail": 3,
        "completeness": 3, "continuity": 3, "instruction_following": 3,
        "notes": long_notes,
    })
    out = parse_judge_response(raw)
    assert len(out["notes"]) == 200


# ---------- rubric.build_judge_prompt ----------


def test_build_judge_prompt_mentions_all_dimensions():
    prompt = build_judge_prompt(
        probe_question="What files were modified?",
        probe_type="artifact",
        expected_facts=["foo.py", "bar.py"],
        assistant_answer="I modified foo.py.",
    )
    for dim in DIMENSIONS:
        assert dim in prompt


def test_build_judge_prompt_includes_expected_facts():
    prompt = build_judge_prompt(
        probe_question="What files were modified?",
        probe_type="artifact",
        expected_facts=["specific_file.py", "another_file.py"],
        assistant_answer="n/a",
    )
    assert "specific_file.py" in prompt
    assert "another_file.py" in prompt


def test_build_judge_prompt_handles_empty_expected_facts():
    prompt = build_judge_prompt(
        probe_question="anything?",
        probe_type="recall",
        expected_facts=[],
        assistant_answer="nope",
    )
    assert "(none provided)" in prompt


def test_build_judge_prompt_includes_all_score_scale_levels():
    prompt = build_judge_prompt(
        probe_question="q", probe_type="recall",
        expected_facts=[], assistant_answer="a",
    )
    for score in SCORE_SCALE:
        assert f" {score}:" in prompt


# ---------- report.summarize_fixture_runs ----------


def _fake_run(fixture_name: str, run_index: int, probe_scores: dict) -> dict:
    """Build a synthetic per-run payload for summariser tests."""
    probes = []
    for pid, per_dim in probe_scores.items():
        overall = sum(per_dim.values()) / len(per_dim)
        probes.append({
            "id": pid,
            "type": "recall",
            "question": "q",
            "expected_facts": [],
            "answer": "a",
            "scores": per_dim,
            "overall": overall,
            "notes": f"note-run{run_index}",
            "parse_error": None,
            "elapsed_seconds": 0.1,
        })
    return {
        "fixture_name": fixture_name,
        "run_index": run_index,
        "compression": {
            "pre_tokens": 10000,
            "post_tokens": 5000,
            "compression_ratio": 0.5,
            "pre_message_count": 50,
            "post_message_count": 25,
            "summary_text": "## Active Task\n...",
        },
        "probes": probes,
    }


def _all_dims(value: int) -> dict:
    return {d: value for d in DIMENSIONS}


def test_summarize_handles_single_run():
    runs = [_fake_run("fx1", 1, {
        "p1": _all_dims(4),
        "p2": _all_dims(3),
    })]
    s = summarize_fixture_runs(runs)
    assert s["fixture_name"] == "fx1"
    assert s["runs"] == 1
    # Median of {4, 3} per dim is 3.5
    for d in DIMENSIONS:
        assert abs(s["dimension_medians"][d] - 3.5) < 1e-9
    # Both probes have overall >= 3.0 so no misses
    assert s["misses"] == []


def test_summarize_flags_misses_below_three():
    runs = [_fake_run("fx1", 1, {
        "p_good": _all_dims(4),
        "p_bad": _all_dims(2),
    })]
    s = summarize_fixture_runs(runs)
    miss_ids = [m["id"] for m in s["misses"]]
    assert "p_bad" in miss_ids
    assert "p_good" not in miss_ids
    miss_entry = next(m for m in s["misses"] if m["id"] == "p_bad")
    assert miss_entry["overall_median"] == 2.0
    assert miss_entry["notes"] == "note-run1"


def test_summarize_medians_across_runs():
    # Three runs, same probe, scores 2, 4, 5 per dim -> median 4
    runs = [
        _fake_run("fx1", 1, {"p": _all_dims(2)}),
        _fake_run("fx1", 2, {"p": _all_dims(4)}),
        _fake_run("fx1", 3, {"p": _all_dims(5)}),
    ]
    s = summarize_fixture_runs(runs)
    for d in DIMENSIONS:
        assert s["dimension_medians"][d] == 4.0
    assert s["runs"] == 3


def test_summarize_empty_input():
    assert summarize_fixture_runs([]) == {}


# ---------- report.render_report ----------


def test_render_report_renders_all_fixtures():
    runs = [_fake_run("feature-impl", 1, {"p1": _all_dims(4)})]
    s = summarize_fixture_runs(runs)
    md = render_report(
        label="test",
        compressor_model="modelA",
        judge_model="modelA",
        runs_per_fixture=1,
        summaries=[s],
    )
    assert "feature-impl" in md
    assert "modelA" in md
    for dim in DIMENSIONS:
        assert dim in md
    # Methodology footer present
    assert "Methodology" in md
    assert "factory.ai" in md


def test_render_report_shows_deltas_when_baseline_provided():
    baseline_runs = [_fake_run("fx", 1, {"p1": _all_dims(3)})]
    current_runs = [_fake_run("fx", 1, {"p1": _all_dims(4)})]
    baseline = [summarize_fixture_runs(baseline_runs)]
    current = [summarize_fixture_runs(current_runs)]
    md = render_report(
        label="test",
        compressor_model="m",
        judge_model="m",
        runs_per_fixture=1,
        summaries=current,
        baseline_summaries=baseline,
    )
    # Improvement of +1 from 3 -> 4 on every dim
    assert "+1.00" in md
    assert "Deltas shown against baseline" in md


def test_render_report_lists_misses_section():
    runs = [_fake_run("fx", 1, {
        "good": _all_dims(4),
        "bad": _all_dims(1),
    })]
    s = summarize_fixture_runs(runs)
    md = render_report(
        label="t", compressor_model="m", judge_model="m",
        runs_per_fixture=1, summaries=[s],
    )
    assert "Probes scoring below 3.0" in md
    assert "`bad`" in md
    assert "`good`" not in md


def test_render_report_no_misses_section_when_all_pass():
    runs = [_fake_run("fx", 1, {"p": _all_dims(5)})]
    s = summarize_fixture_runs(runs)
    md = render_report(
        label="t", compressor_model="m", judge_model="m",
        runs_per_fixture=1, summaries=[s],
    )
    assert "Probes scoring below 3.0" not in md


def test_render_report_compression_table():
    runs = [_fake_run("fx", 1, {"p": _all_dims(4)})]
    s = summarize_fixture_runs(runs)
    md = render_report(
        label="t", compressor_model="m", judge_model="m",
        runs_per_fixture=1, summaries=[s],
    )
    assert "Pre tokens" in md
    assert "10000" in md  # from _fake_run compression.pre_tokens
    assert "50.0%" in md  # ratio renders as percent


# ---------- report.write_run_json ----------


def test_write_run_json_roundtrip(tmp_path):
    payload = _fake_run("fx1", 2, {"p": _all_dims(4)})
    out = write_run_json(
        results_dir=tmp_path,
        fixture_name="fx1",
        run_index=2,
        payload=payload,
    )
    assert out.exists()
    assert out.name == "fx1-run-2.json"
    with out.open() as fh:
        loaded = json.load(fh)
    assert loaded["fixture_name"] == "fx1"
    assert loaded["run_index"] == 2


# ---------- fixture + probe sanity ----------


_EVAL_DIR = Path(__file__).resolve().parents[2] / "scripts" / "compression_eval"


@pytest.mark.parametrize("fixture_name", [
    "feature-impl-context-priority",
    "debug-session-feishu-id-model",
    "config-build-competitive-scouts",
])
def test_fixture_loads_and_is_well_formed(fixture_name):
    path = _EVAL_DIR / "fixtures" / f"{fixture_name}.json"
    assert path.exists(), f"fixture missing: {path}"
    with path.open() as fh:
        fx = json.load(fh)
    assert fx["name"] == fixture_name
    assert isinstance(fx["messages"], list) and len(fx["messages"]) > 10
    assert fx["messages"][0]["role"] == "system"
    # At least one user message and one assistant message
    roles = {m["role"] for m in fx["messages"]}
    assert "user" in roles and "assistant" in roles


@pytest.mark.parametrize("fixture_name", [
    "feature-impl-context-priority",
    "debug-session-feishu-id-model",
    "config-build-competitive-scouts",
])
def test_probes_have_all_four_types(fixture_name):
    path = _EVAL_DIR / "probes" / f"{fixture_name}.probes.json"
    assert path.exists(), f"probe bank missing: {path}"
    with path.open() as fh:
        pb = json.load(fh)
    assert pb["fixture"] == fixture_name
    types = {p["type"] for p in pb["probes"]}
    assert types == {"recall", "artifact", "continuation", "decision"}, (
        f"{fixture_name} probe bank missing at least one probe type; got {types}"
    )
    # Every probe has expected_facts (possibly empty list but present)
    for p in pb["probes"]:
        assert "id" in p and "question" in p and "type" in p
        assert "expected_facts" in p and isinstance(p["expected_facts"], list)


def test_fixtures_do_not_leak_maintainer_pii():
    """Smoke test that scrubber actually ran. This is a belt-and-suspenders
    check that would have caught the ethanbit@qq.com leak before it
    landed."""
    import re

    for fixture_path in (_EVAL_DIR / "fixtures").glob("*.json"):
        text = fixture_path.read_text()
        # The scrubbing_passes metadata intentionally documents what was
        # replaced. Ignore the metadata block and only scan the messages.
        data = json.loads(text)
        msg_text = json.dumps(data["messages"])
        msg_lower = msg_text.lower()
        assert "teknium" not in msg_lower, (
            f"{fixture_path.name}: maintainer handle leaked into messages"
        )
        # No personal-email domains (placeholder @example.com is allowed)
        personal_emails = re.findall(
            r"[A-Za-z0-9._%+-]+@(?!example\.com)[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
            msg_text,
        )
        assert personal_emails == [], (
            f"{fixture_path.name}: personal email(s) leaked: {personal_emails}"
        )