# compression_eval

Offline eval harness for `agent/context_compressor.py`. Runs a real conversation transcript through the compressor, then probes the compressed state with targeted questions graded on six dimensions.

## When to run

Before merging changes to:

- `agent/context_compressor.py` — any change to `_template_sections`, `_generate_summary`, `compress()`, or its boundary logic
- `agent/auxiliary_client.py` — when changing how compression tasks are routed
- `agent/prompt_builder.py` — when the compression-note phrasing changes

## Not for CI

This harness makes real model calls (compressor + continuation + judge = ~3 calls per probe × probes per fixture × runs). A full run costs roughly $0.50 to $1.50 depending on the models, takes minutes, and is LLM-graded, so scores are non-deterministic. It lives in `scripts/` and is invoked by hand; `tests/` and `scripts/run_tests.sh` do not touch it.

`tests/scripts/test_compression_eval.py` covers the non-LLM code paths (rubric parsing, report rendering, fixture/probe loading, PII smoke check on the checked-in fixtures) and DOES run in CI.

## Usage

```bash
# Run all three fixtures, 3 runs each, with your configured provider
python3 scripts/compression_eval/run_eval.py

# Faster iteration — one fixture, one run
python3 scripts/compression_eval/run_eval.py \
  --fixtures=debug-session-feishu-id-model --runs=1

# Pin a cheap model for both compression + judge (recommended)
python3 scripts/compression_eval/run_eval.py \
  --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
  --judge-provider=nous --judge-model=openai/gpt-5.4-mini \
  --runs=3 --label=baseline

# After editing context_compressor.py, rerun with a new label and diff
python3 scripts/compression_eval/run_eval.py \
  --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
  --judge-provider=nous --judge-model=openai/gpt-5.4-mini \
  --runs=3 --label=my-prompt-tweak \
  --compare-to=results/baseline
```

Results land in `results/`.
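The exact layout under `results/` isn't documented here; the `--label=baseline` / `--compare-to=results/baseline` pairing above suggests one directory per label. A hedged sketch for eyeballing two runs by hand under that assumption (use `--compare-to` for the actual score comparison):

```bash
# Assumption: run_eval.py writes one directory per --label under results/.
ls results/
# baseline/  my-prompt-tweak/

# --compare-to already reports the score delta; this only inspects raw artifacts.
diff -r results/baseline results/my-prompt-tweak | less
```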
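The per-run cost quoted above follows from the call-count formula (~3 calls per probe × probes per fixture × fixtures × runs). A back-of-envelope sketch, with an assumed probe count since the real number depends on the checked-in fixtures:

```bash
# probes_per_fixture is a guess for illustration; fixtures and runs match the
# default invocation above (3 fixtures, --runs=3).
probes_per_fixture=10
fixtures=3
runs=3
echo $((3 * probes_per_fixture * fixtures * runs))   # ~270 model calls per full run
```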
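To exercise the CI-covered, non-LLM paths locally before pushing, the unit test file mentioned above can be run on its own. A minimal sketch, assuming the repo uses pytest as its test runner (check `scripts/run_tests.sh` for the canonical invocation):

```bash
# Runs only the deterministic compression_eval tests — no model calls, no cost.
python3 -m pytest tests/scripts/test_compression_eval.py -q
```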