docs: add Environments, Benchmarks & Data Generation guide

Comprehensive developer guide covering:
- Architecture (BaseEnv → HermesAgentBaseEnv → concrete envs)
- All three benchmarks (TerminalBench2, TBLite, YC-Bench)
- Training environments (TerminalTestEnv, HermesSweEnv)
- Core components (AgentLoop, ToolContext, Tool Call Parsers)
- Two-phase operation (Phase 1 OpenAI, Phase 2 VLLM)
- Running environments (evaluate, process, serve modes)
- Creating new environments (training + eval-only)
- Configuration reference and prerequisites

Also updates the environments/README.md directory tree to include the TBLite and YC-Bench benchmarks.
2026-04-28 06:51:16 +08:00 · 2026-03-06 23:31:45 -08:00
parent f55f625277
commit 55a21fe37b
3 changed files with 509 additions and 2 deletions
--- a/environments/README.md
+++ b/environments/README.md
@@ -195,8 +195,12 @@ environments/
 │   └── hermes_swe_env.py
 │
 └── benchmarks/                   # Evaluation benchmarks
-    └── terminalbench_2/
-        └── terminalbench2_env.py
+    ├── terminalbench_2/          # 89 terminal tasks, Modal sandboxes
+    │   └── terminalbench2_env.py
+    ├── tblite/                   # 100 calibrated tasks (fast TB2 proxy)
+    │   └── tblite_env.py
+    └── yc_bench/                 # Long-horizon strategic benchmark
+        └── yc_bench_env.py
 ```

 ## Concrete Environments