Compare commits


9 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| hjc-puro | e578f976af | gemini thinking script | 2025-12-11 00:46:25 -05:00 |
| hjc-puro | 96bc31a8b1 | add prokletor | 2025-12-10 23:07:28 -05:00 |
| hjc-puro | 7d9a1e119d | add prokletor formatter | 2025-11-23 10:24:58 -05:00 |
| hjc-puro | e91d9e839a | switch to asyncio | 2025-11-22 11:25:23 -05:00 |
| hjc-puro | 98321be8b0 | gemini fake reasoning | 2025-11-22 09:47:00 -05:00 |
| hjc-puro | a219e178a1 | support gemini models | 2025-11-19 21:14:37 -05:00 |
| hjc-puro | e06a15b3ab | add profiling | 2025-11-18 07:12:05 -05:00 |
| hjc-puro | 349e37de0a | add linewise profiling | 2025-11-17 23:21:36 -05:00 |
| hjc-puro | 31c733383b | add tracking for cluster failures | 2025-11-15 00:01:19 -05:00 |
308 changed files with 2311 additions and 149076 deletions

View File

@@ -1,115 +0,0 @@
# Cline's Memory Bank
I am Cline, an expert software engineer with a unique characteristic: my memory resets completely between sessions. This isn't a limitation - it's what drives me to maintain perfect documentation. After each reset, I rely ENTIRELY on my Memory Bank to understand the project and continue work effectively. I MUST read ALL memory bank files at the start of EVERY task - this is not optional.
## Memory Bank Structure
The Memory Bank consists of core files and optional context files, all in Markdown format. Files build upon each other in a clear hierarchy:
```mermaid
flowchart TD
    PB[projectbrief.md] --> PC[productContext.md]
    PB --> SP[systemPatterns.md]
    PB --> TC[techContext.md]
    PC --> AC[activeContext.md]
    SP --> AC
    TC --> AC
    AC --> P[progress.md]
```
### Core Files (Required)
1. `projectbrief.md`
- Foundation document that shapes all other files
- Created at project start if it doesn't exist
- Defines core requirements and goals
- Source of truth for project scope
2. `productContext.md`
- Why this project exists
- Problems it solves
- How it should work
- User experience goals
3. `activeContext.md`
- Current work focus
- Recent changes
- Next steps
- Active decisions and considerations
- Important patterns and preferences
- Learnings and project insights
4. `systemPatterns.md`
- System architecture
- Key technical decisions
- Design patterns in use
- Component relationships
- Critical implementation paths
5. `techContext.md`
- Technologies used
- Development setup
- Technical constraints
- Dependencies
- Tool usage patterns
6. `progress.md`
- What works
- What's left to build
- Current status
- Known issues
- Evolution of project decisions
### Additional Context
Create additional files/folders within memory-bank/ when they help organize (see the example layout after this list):
- Complex feature documentation
- Integration specifications
- API documentation
- Testing strategies
- Deployment procedures
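For example, a memory-bank/ directory using the six core files plus one optional feature doc might look like this (the `features/` entry is illustrative):
```
memory-bank/
├── projectbrief.md
├── productContext.md
├── systemPatterns.md
├── techContext.md
├── activeContext.md
├── progress.md
└── features/
    └── payments-integration.md   # optional: complex feature documentation
```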
## Core Workflows
### Plan Mode
```mermaid
flowchart TD
    Start[Start] --> ReadFiles[Read Memory Bank]
    ReadFiles --> CheckFiles{Files Complete?}
    CheckFiles -->|No| Plan[Create Plan]
    Plan --> Document[Document in Chat]
    CheckFiles -->|Yes| Verify[Verify Context]
    Verify --> Strategy[Develop Strategy]
    Strategy --> Present[Present Approach]
```
### Act Mode
```mermaid
flowchart TD
    Start[Start] --> Context[Check Memory Bank]
    Context --> Update[Update Documentation]
    Update --> Execute[Execute Task]
    Execute --> Document[Document Changes]
```
## Documentation Updates
Memory Bank updates occur when:
1. Discovering new project patterns
2. After implementing significant changes
3. When user requests with **update memory bank** (MUST review ALL files)
4. When context needs clarification
```mermaid
flowchart TD
    Start[Update Process]
    subgraph Process
        P1[Review ALL Files]
        P2[Document Current State]
        P3[Clarify Next Steps]
        P4[Document Insights & Patterns]
        P1 --> P2 --> P3 --> P4
    end
    Start --> Process
```
Note: When triggered by **update memory bank**, I MUST review every memory bank file, even if some don't require updates. Focus particularly on activeContext.md and progress.md as they track current state.
REMEMBER: After every memory reset, I begin completely fresh. The Memory Bank is my only link to previous work. It must be maintained with precision and clarity, as my effectiveness depends entirely on its accuracy.

View File

@@ -1,201 +1,23 @@
Hermes-Agent is an agent harness for LLMs with an interactive CLI.
Hermes-Agent is an agent harness for LLMs.
## Development Environment
When building, the tool functionality is in the tools/ directory, where each tool (or, in some cases, a set of tools built for the same execution category or API) is implemented in its own script.
**IMPORTANT**: Always use the virtual environment if it exists:
```bash
source venv/bin/activate # Before running any Python commands
```
Each tool is then consolidated in the model_tools.py file in the repo root.
## Project Structure
There is also a way to consolidate sets of tools in toolsets.py for the agent to use.
- `hermes` - CLI launcher script (run with `./hermes`)
- `cli.py` - Interactive CLI with Rich UI, prompt_toolkit, animated spinners
- `cli-config.yaml` - CLI configuration (model, terminal, toolsets, personalities)
- `tools/` - Individual tool implementations (web, terminal, browser, vision, etc.)
- `tools/__init__.py` - Exports all tools for importing
- `model_tools.py` - Consolidates tool schemas and handlers for the agent
- `toolsets.py` - Groups tools into logical toolsets (web, terminal, browser, etc.)
- `toolset_distributions.py` - Probability-based tool selection for data generation
- `run_agent.py` - Primary agent runner with AIAgent class and KawaiiSpinner
- `batch_runner.py` - Parallel batch processing with checkpointing
- `tests/` - Test scripts
The primary agent runner code is in run_agent, but other runners could be developed using the tools and framework.
## File Dependency Chain
Always ensure consistency between tools/, model_tools.py, and toolsets.py when changing any of them; otherwise they can become desynced in ways that break functionality.
```
tools/*.py → tools/__init__.py → model_tools.py → toolsets.py → toolset_distributions.py
run_agent.py ──────────────────────────┘
cli.py → run_agent.py (uses AIAgent with quiet_mode=True)
batch_runner.py → run_agent.py + toolset_distributions.py
```
The expected pathway for API keys is to set them up in a .env file in the repo root.
Always ensure consistency between tools, model_tools.py, and toolsets.py when changing any of them.
Test scripts will be placed in tests/
## CLI Architecture (cli.py)
The run_agent loop is set up to:
- Process the enabled toolsets to provide to the model
- Pipe in a prompt or problem from the input to the agent
- Loop the LLM each time it calls a tool, until the model decides no more tools are needed and provides a natural-language response
- Return that response
The interactive CLI uses:
- **Rich** - For the welcome banner and styled panels
- **prompt_toolkit** - For fixed input area with history and `patch_stdout`
- **KawaiiSpinner** (in run_agent.py) - Animated feedback during API calls and tool execution
Key components:
- `HermesCLI` class - Main CLI controller with commands and conversation loop
- `load_cli_config()` - Loads `cli-config.yaml`, sets environment variables for terminal
- `build_welcome_banner()` - Displays ASCII art logo, tools, and skills summary
- `/commands` - Process user commands like `/help`, `/clear`, `/personality`, etc.
CLI uses `quiet_mode=True` when creating AIAgent to suppress verbose logging and enable kawaii-style feedback instead.
### Adding CLI Commands
1. Add to `COMMANDS` dict with description
2. Add handler in `process_command()` method
3. For persistent settings, use `save_config_value()` to update `cli-config.yaml` (a sketch of steps 1-2 follows this list)
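A minimal sketch of steps 1-2, assuming a hypothetical `/stats` command (the dispatch shape is illustrative, not the actual `HermesCLI` internals):
```python
# Sketch of the command-dispatch pattern; names other than COMMANDS and
# process_command are illustrative, and the real handlers render via Rich.
COMMANDS = {
    "/help": "Show available commands",
    "/stats": "Show session statistics",  # step 1: register with a description
}

def process_command(command: str, history: list[str]) -> str:
    # Step 2: add a handler branch for the new command.
    if command == "/stats":
        return f"Messages this session: {len(history)}"
    if command == "/help":
        return "\n".join(f"{cmd} - {desc}" for cmd, desc in COMMANDS.items())
    return f"Unknown command: {command}"

print(process_command("/stats", ["hi", "hello"]))  # -> Messages this session: 2
```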
## Adding a New Tool
Follow this strict order to maintain consistency:
1. Create `tools/your_tool.py` with:
- Handler function (sync or async) returning a JSON string via `json.dumps()`
- `check_*_requirements()` function to verify dependencies (e.g., API keys)
- Schema definition following OpenAI function-calling format
2. Export in `tools/__init__.py`:
- Import the handler and check function
- Add to `__all__` list
3. Register in `model_tools.py`:
- Create `get_*_tool_definitions()` function or add to existing
- Add routing in `handle_function_call()` dispatcher
- Update `get_all_tool_names()` with the tool name
- Update `get_toolset_for_tool()` mapping
- Update `get_available_toolsets()` and `check_toolset_requirements()`
4. Add to toolset in `toolsets.py`:
- Add to existing toolset or create new one in TOOLSETS dict
5. Optionally add to `toolset_distributions.py` for batch processing
## Tool Implementation Pattern
```python
# tools/example_tool.py
import json
import os


def check_example_requirements() -> bool:
    """Check if required API keys/dependencies are available."""
    return bool(os.getenv("EXAMPLE_API_KEY"))


def example_tool(param: str, task_id: str | None = None) -> str:
    """Execute the tool and return a JSON string result."""
    try:
        result = {"success": True, "data": "..."}
        return json.dumps(result, ensure_ascii=False)
    except Exception as e:
        return json.dumps({"error": str(e)}, ensure_ascii=False)
```
All tool handlers MUST return a JSON string. Never return raw dicts.
## Stateful Tools
Tools that maintain state (terminal, browser) require:
- `task_id` parameter for session isolation between concurrent tasks
- `cleanup_*()` function to release resources
- Cleanup is called automatically in run_agent.py after conversation completes (see the sketch after this list)
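A minimal sketch of this pattern (the registry and `ExampleSession` are illustrative stand-ins for real terminal or browser sessions):
```python
# Sketch of the stateful-tool pattern; ExampleSession stands in for a real
# terminal/browser session object held between tool calls.
_SESSIONS: dict[str, "ExampleSession"] = {}

class ExampleSession:
    def close(self) -> None:
        pass  # release the underlying resource here

def _get_session(task_id: str) -> "ExampleSession":
    # One session per task_id keeps concurrent tasks isolated.
    return _SESSIONS.setdefault(task_id, ExampleSession())

def cleanup_example(task_id: str) -> None:
    # Invoked by run_agent.py after the conversation completes.
    session = _SESSIONS.pop(task_id, None)
    if session is not None:
        session.close()
```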
## Environment Variables
API keys are loaded from `.env` file in repo root:
- `OPENROUTER_API_KEY` - Main LLM API access (primary provider)
- `FIRECRAWL_API_KEY` - Web search/extract tools
- `BROWSERBASE_API_KEY` / `BROWSERBASE_PROJECT_ID` - Browser automation
- `FAL_KEY` - Image generation (FLUX model)
- `NOUS_API_KEY` - Vision and Mixture-of-Agents tools
Terminal tool configuration (can also be set in `cli-config.yaml`):
- `TERMINAL_ENV` - Backend: local, docker, singularity, modal, or ssh
- `TERMINAL_CWD` - Working directory
- `TERMINAL_SSH_HOST`, `TERMINAL_SSH_USER`, `TERMINAL_SSH_KEY` - For SSH backend
## Agent Loop (run_agent.py)
The AIAgent class handles:
- Processing enabled toolsets to provide to the model
- Piping prompts to the agent
- Looping LLM calls when tools are invoked, until natural language response
- Returning the final response
Uses OpenAI-compatible API (primarily OpenRouter) with the OpenAI Python SDK.
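A simplified sketch of this loop (not the actual AIAgent implementation; the `handle_function_call` signature is an assumption to check against model_tools.py):
```python
# Simplified sketch of the tool-calling loop; the real AIAgent adds retries,
# spinners, quiet mode, and trajectory export on top of this.
import json
from openai import OpenAI
from model_tools import handle_function_call  # dispatcher; signature assumed

client = OpenAI()  # base URL / API key taken from the environment

def run_loop(messages: list[dict], tools: list[dict], model: str) -> str:
    while True:
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # natural-language answer ends the loop
        messages.append(msg)  # keep the assistant turn with its tool calls
        for call in msg.tool_calls:
            result = handle_function_call(
                call.function.name, json.loads(call.function.arguments)
            )
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
```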
## Reasoning Model Support
For models that support chain-of-thought reasoning:
- Extract `reasoning_content` from API responses
- Store in `assistant_msg["reasoning"]` for trajectory export
- Pass back via `reasoning_content` field on subsequent turns (see the sketch after this list)
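A sketch of the round trip (attribute and field names vary by provider, so treat them as assumptions to verify):
```python
# Sketch of reasoning round-tripping between turns.
def store_reasoning(msg, messages: list[dict]) -> None:
    assistant_msg = {"role": "assistant", "content": msg.content}
    reasoning = getattr(msg, "reasoning_content", None)  # provider-dependent
    if reasoning:
        assistant_msg["reasoning"] = reasoning           # for trajectory export
        assistant_msg["reasoning_content"] = reasoning   # echoed on later turns
    messages.append(assistant_msg)
```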
## Trajectory Format
Conversations are saved in ShareGPT format for training:
```json
{"from": "system", "value": "System prompt with <tools>...</tools>"}
{"from": "human", "value": "User message"}
{"from": "gpt", "value": "<think>reasoning</think>\n<tool_call>{...}</tool_call>"}
{"from": "tool", "value": "<tool_response>{...}</tool_response>"}
{"from": "gpt", "value": "Final response"}
```
Tool calls use `<tool_call>` XML tags, responses use `<tool_response>` tags, reasoning uses `<think>` tags.
## Batch Processing (batch_runner.py)
For processing multiple prompts:
- Parallel execution with multiprocessing
- Content-based resume for fault tolerance (matches on prompt text, not indices; see the sketch after this list)
- Toolset distributions control probabilistic tool availability per prompt
- Output: `data/<run_name>/trajectories.jsonl` (combined) + individual batch files
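A sketch of how content-based matching can work (file layout and field names here are assumptions, not batch_runner internals):
```python
# Sketch of content-based resume: completed prompts are identified by a hash
# of their text, so reordering or re-slicing the dataset stays safe.
import hashlib
import json

def prompt_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def pending_prompts(prompts: list[str], checkpoint_path: str) -> list[str]:
    done: set[str] = set()
    try:
        with open(checkpoint_path) as f:
            done = {prompt_key(json.loads(line)["prompt"]) for line in f}
    except FileNotFoundError:
        pass  # first run: nothing completed yet
    return [p for p in prompts if prompt_key(p) not in done]
```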
## Logging
Trajectories restructure tools as a system prompt for storage in a format suitable for later training use.
## Skills System
Skills are on-demand knowledge documents the agent can load. Located in `skills/` directory:
```
skills/
├── mlops/ # Category folder
│ ├── axolotl/ # Skill folder
│ │ ├── SKILL.md # Main instructions (required)
│ │ ├── references/ # Additional docs, API specs
│ │ └── templates/ # Output formats, configs
│ └── vllm/
│ └── SKILL.md
└── example-skill/
└── SKILL.md
```
**Progressive disclosure** (token-efficient):
1. `skills_categories()` - List category names (~50 tokens)
2. `skills_list(category)` - Name + description per skill (~3k tokens)
3. `skill_view(name)` - Full content + tags + linked files
SKILL.md files use YAML frontmatter:
```yaml
---
name: skill-name
description: Brief description for listing
tags: [tag1, tag2]
related_skills: [other-skill]
version: 1.0.0
---
# Skill Content...
```
Tool files: `tools/skills_tool.py` → `model_tools.py` → `toolsets.py`
There are additional caveats for logging: the tools are restructured into a system prompt so that trajectories are stored in a format that can be used and handled properly later.

View File

@@ -1,77 +1,14 @@
# Hermes Agent Environment Configuration
# Copy this file to .env and fill in your API keys
# Get API keys from the URLs listed below
# =============================================================================
# CORE SETTINGS
# REQUIRED API KEYS
# =============================================================================
# Agent backend:
# - openai : default Hermes-Agent loop (OpenAI function-calling via OpenAI SDK)
# - atropos : Atroposlib ServerManager/ManagedServer-backed loop (training/env integration)
HERMES_BACKEND=openai
# =============================================================================
# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
# =============================================================================
# For local development (matches the Atropos test env defaults):
# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
# ATROPOS_SERVER_MODEL=hermes-4-36b
# For hosted inference (Nous Research inference API):
ATROPOS_SERVER_BASE_URL=
ATROPOS_SERVER_MODEL=
ATROPOS_TOKENIZER_NAME=
# Set this to your Nous API key (Bearer token).
ATROPOS_SERVER_API_KEY=
# Debugging (prints to stdout; use with care)
# HERMES_DEBUG_ATROPOS_REQUEST=1
# HERMES_DEBUG_ATROPOS_RESPONSE=1
# HERMES_DEBUG_OPENAI_REQUEST=1
# HERMES_DEBUG_OPENAI_RESPONSE=1
# =============================================================================
# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
# =============================================================================
# If you set ATROPOS_SERVER_BASE_URL or OPENAI_BASE_URL, Hermes will use it instead
# of OpenRouter.
#
# Local server convenience (base URL without /v1):
# llama.cpp example (see `Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh`):
# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
# ATROPOS_SERVER_MODEL=hermes-4-36b
# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
# ATROPOS_SERVER_API_KEY=local
#
# Hosted Nous inference API:
# ATROPOS_SERVER_BASE_URL=https://inference-api.nousresearch.com
# ATROPOS_SERVER_MODEL=Hermes-4.3-36B
# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
# ATROPOS_SERVER_API_KEY=sk-... (Bearer token)
#
# If you plan to run GRPO-style group sampling (e.g. `--env.group_size 4`) against
# llama.cpp, start the server with at least that many slots, e.g.:
# LLAMA_CPP_PARALLEL=4 Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
#
# Generic OpenAI-compatible (base URL should include /v1):
# OPENAI_BASE_URL=http://127.0.0.1:8080/v1
# OPENAI_API_KEY=local
# =============================================================================
# LLM PROVIDER (OpenRouter)
# =============================================================================
# OpenRouter provides access to many models through one API
# All LLM calls go through OpenRouter - no direct provider keys needed
# Get your key at: https://openrouter.ai/keys
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENROUTER_API_KEY=
# Default model to use (OpenRouter format: provider/model)
# Examples: anthropic/claude-sonnet-4, openai/gpt-4o, google/gemini-2.0-flash, zhipuai/glm-4-plus
LLM_MODEL=anthropic/claude-sonnet-4
# =============================================================================
# TOOL API KEYS
# =============================================================================
# Anthropic API Key - Main agent model
# Get at: https://console.anthropic.com/
ANTHROPIC_API_KEY=
# Firecrawl API Key - Web search, extract, and crawl
# Get at: https://firecrawl.dev/
@@ -81,206 +18,31 @@ FIRECRAWL_API_KEY=
# Get at: https://inference-api.nousresearch.com/
NOUS_API_KEY=
# Morph API Key - Terminal/command execution tools
# Get at: https://morph.so/
MORPH_API_KEY=
# FAL.ai API Key - Image generation
# Get at: https://fal.ai/
FAL_KEY=
# =============================================================================
# TERMINAL TOOL CONFIGURATION (mini-swe-agent backend)
# =============================================================================
# Backend type: "local", "singularity", "docker", "modal", or "ssh"
# - local: Runs directly on your machine (fastest, no isolation)
# - ssh: Runs on remote server via SSH (great for sandboxing - agent can't touch its own code)
# - singularity: Runs in Apptainer/Singularity containers (HPC clusters, no root needed)
# - docker: Runs in Docker containers (isolated, requires Docker + docker group)
# - modal: Runs in Modal cloud sandboxes (scalable, requires Modal account)
TERMINAL_ENV=local
# Container images (for singularity/docker/modal backends)
TERMINAL_DOCKER_IMAGE=python:3.11
TERMINAL_SINGULARITY_IMAGE=docker://python:3.11
TERMINAL_MODAL_IMAGE=python:3.11
# Working directory inside the container
TERMINAL_CWD=/tmp
# Default command timeout in seconds
TERMINAL_TIMEOUT=60
# Cleanup inactive environments after this many seconds
TERMINAL_LIFETIME_SECONDS=300
# =============================================================================
# SSH REMOTE EXECUTION (for TERMINAL_ENV=ssh)
# =============================================================================
# Run terminal commands on a remote server via SSH.
# Agent code stays on your machine, commands execute remotely.
#
# SECURITY BENEFITS:
# - Agent cannot read your .env file (API keys protected)
# - Agent cannot modify its own code
# - Remote server acts as isolated sandbox
# - Can safely configure passwordless sudo on remote
#
# TERMINAL_SSH_HOST=192.168.1.100
# TERMINAL_SSH_USER=agent
# TERMINAL_SSH_PORT=22
# TERMINAL_SSH_KEY=~/.ssh/id_rsa
# =============================================================================
# SUDO SUPPORT (works with ALL terminal backends)
# =============================================================================
# If set, enables sudo commands by piping password via `sudo -S`.
# Works with: local, docker, singularity, modal, and ssh backends.
#
# SECURITY WARNING: Password stored in plaintext. Only use on trusted machines.
#
# ALTERNATIVES:
# - For SSH backend: Configure passwordless sudo on the remote server
# - For containers: Run as root inside the container (no sudo needed)
# - For local: Configure /etc/sudoers for specific commands
# - For CLI: Leave unset - you'll be prompted interactively with 45s timeout
#
# SUDO_PASSWORD=your_password_here
# =============================================================================
# MODAL CLOUD BACKEND (for TERMINAL_ENV=modal)
# =============================================================================
# Modal provides cloud sandboxes with per-second billing and auto-scaling.
# This implementation uses a warm pool of sandboxes for cost efficiency.
#
# SETUP:
# pip install modal && modal setup
# (Authenticates via browser, stores credentials locally)
#
# FEATURES:
# - Auto-scaling warm sandbox pool (no cold start after first use)
# - Named sandbox recovery (reconnects after restart)
# - Profile-based heterogeneous environments (CPU, GPU, different images)
# - Server-side idle_timeout protection against orphaned sandboxes
# Modal app name (groups all sandboxes, used for recovery)
TERMINAL_MODAL_APP_NAME=hermes-sandbox
# Default profile when none specified
TERMINAL_MODAL_DEFAULT_PROFILE=default
# Profile config file (optional - YAML format, see modal_profiles.yaml)
# TERMINAL_MODAL_PROFILES_FILE=modal_profiles.yaml
# --- Default Profile Settings (used if no YAML file) ---
# These apply when no profile is specified or for the "default" profile
TERMINAL_MODAL_IMAGE=python:3.11
TERMINAL_MODAL_MIN_POOL=1
TERMINAL_MODAL_MAX_POOL=5
TERMINAL_MODAL_IDLE_TIMEOUT=120
TERMINAL_MODAL_MAX_LIFETIME=3600
TERMINAL_MODAL_SCALE_DOWN_IDLE=180
# --- Custom Profile Example: pytorch-gpu ---
# Uncomment to enable a GPU profile for ML tasks
# Usage: terminal_tool("python train.py", profile="pytorch-gpu")
#
# TERMINAL_MODAL_PROFILE_pytorch_gpu_IMAGE=pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
# TERMINAL_MODAL_PROFILE_pytorch_gpu_GPU=T4
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MEMORY=16384
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MIN_POOL=0
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MAX_POOL=2
# TERMINAL_MODAL_PROFILE_pytorch_gpu_IDLE_TIMEOUT=60
# --- Custom Profile Example: node ---
# Uncomment to enable a Node.js profile
# Usage: terminal_tool("npm test", profile="node")
#
# TERMINAL_MODAL_PROFILE_node_IMAGE=node:18
# TERMINAL_MODAL_PROFILE_node_MIN_POOL=0
# TERMINAL_MODAL_PROFILE_node_MAX_POOL=3
# =============================================================================
# MODAL SECRETS (Secure credential injection)
# =============================================================================
# Modal Secrets allow you to securely pass API keys, passwords, and other
# sensitive data to your sandboxes without exposing them in code or logs.
#
# SETUP SECRETS:
# 1. Via Dashboard: https://modal.com/secrets
# 2. Via CLI: modal secret create my-secret KEY1=value1 KEY2=value2
# 3. Via CLI with env: modal secret create my-secret API_KEY="$API_KEY"
#
# LIST SECRETS:
# modal secret list
#
# DELETE SECRETS:
# modal secret delete my-secret
# Global secrets applied to ALL profiles (comma-separated secret names)
# These secrets must be created on Modal dashboard or via CLI first
# TERMINAL_MODAL_SECRETS=my-api-keys,database-creds
# Per-profile secrets (comma-separated secret names)
# TERMINAL_MODAL_PROFILE_pytorch_gpu_SECRETS=huggingface-token,wandb-key
# Per-profile environment variables (semicolon-separated KEY=VALUE pairs)
# TERMINAL_MODAL_PROFILE_default_ENV_VARS=DEBUG=1;LOG_LEVEL=info
# Load local .env file into sandbox (useful for development)
# TERMINAL_MODAL_PROFILE_default_USE_DOTENV=true
# =============================================================================
# BROWSER TOOL CONFIGURATION (agent-browser + Browserbase)
# =============================================================================
# Browser automation requires Browserbase cloud service for remote browser execution.
# This allows the agent to navigate websites, fill forms, and extract information.
#
# STEALTH MODES:
# - Basic Stealth: ALWAYS active (random fingerprints, auto CAPTCHA solving)
# - Advanced Stealth: Requires BROWSERBASE_ADVANCED_STEALTH=true (Scale Plan only)
# Browserbase API Key - Cloud browser execution
# Get at: https://browserbase.com/
BROWSERBASE_API_KEY=
# Browserbase Project ID - From your Browserbase dashboard
BROWSERBASE_PROJECT_ID=
# Enable residential proxies for better CAPTCHA solving (default: true)
# Routes traffic through residential IPs, significantly improves success rate
BROWSERBASE_PROXIES=true
# Enable advanced stealth mode (default: false, requires Scale Plan)
# Uses custom Chromium build to avoid bot detection altogether
BROWSERBASE_ADVANCED_STEALTH=false
# Browser session timeout in seconds (default: 300)
# Sessions are cleaned up after this duration of inactivity
BROWSER_SESSION_TIMEOUT=300
# Browser inactivity timeout - auto-cleanup inactive sessions (default: 120 = 2 min)
# Browser sessions are automatically closed after this period of no activity
BROWSER_INACTIVITY_TIMEOUT=120
# =============================================================================
# SESSION LOGGING
# =============================================================================
# Session trajectories are automatically saved to logs/ directory
# Format: logs/session_YYYYMMDD_HHMMSS_UUID.json
# Contains full conversation history in trajectory format for debugging/replay
# =============================================================================
# LEGACY/OPTIONAL API KEYS
# OPTIONAL API KEYS
# =============================================================================
# Morph API Key - For legacy Hecate terminal backend (terminal-hecate tool)
# Get at: https://morph.so/
MORPH_API_KEY=
# OpenAI API Key - Optional, for enhanced Hecate features
# Get at: https://platform.openai.com/
OPENAI_API_KEY=
# Hecate VM Settings (only if using terminal-hecate tool)
# =============================================================================
# OPTIONAL CONFIGURATION
# =============================================================================
# Terminal Tool Settings
HECATE_VM_LIFETIME_SECONDS=300
HECATE_DEFAULT_SNAPSHOT_ID=snapshot_p5294qxt
# =============================================================================
# DEBUG OPTIONS
# =============================================================================
# Debug Logging (set to "true" to enable, logs saved to ./logs/)
WEB_TOOLS_DEBUG=false
VISION_TOOLS_DEBUG=false
MOA_TOOLS_DEBUG=false

.gitignore
View File

@@ -30,35 +30,3 @@ run_datagen_megascience_glm4-6.sh
run_datagen_sonnet.sh
source-data/*
run_datagen_megascience_glm4-6.sh
data/*
node_modules/
browser-use/
agent-browser/
# Private keys
*.ppk
*.pem
privvy*
images/
# CLI config (may contain sensitive SSH paths)
cli-config.yaml
.DS_Store
# artifacts
*.jsonl
*.html
*.json
*.log
*.csv
# Singularity/Apptainer images (large binary files)
*.sif
# Test files
test_singularity_*.py
test_*.py
!tests/test_*.py
# Nomad data
/tmp/NomadClient*/

.gitmodules
View File

@@ -1,3 +0,0 @@
[submodule "mini-swe-agent"]
path = mini-swe-agent
url = https://github.com/SWE-agent/mini-swe-agent

README.md
View File

@@ -4,64 +4,34 @@ An AI agent with advanced tool-calling capabilities, featuring a flexible toolse
## Features
- **Interactive CLI**: Beautiful terminal interface with animated feedback, personalities, and session management
- **Web Tools**: Search, extract content, and crawl websites
- **Terminal Tools**: Execute commands via local, Docker, Singularity, Modal, or SSH backends
- **Browser Tools**: Automate web browsers to navigate, click, type, and extract content
- **Terminal Tools**: Execute commands with interactive session support
- **Vision Tools**: Analyze images from URLs
- **Reasoning Tools**: Advanced multi-model reasoning (Mixture of Agents)
- **Creative Tools**: Generate images from text prompts
- **Skills Tools**: On-demand knowledge documents with progressive disclosure
- **Toolsets System**: Organize tools into logical groups for different scenarios
- **Batch Processing**: Process datasets in parallel with checkpointing and statistics tracking
- **Ephemeral System Prompts**: Guide model behavior without polluting training datasets
## Quick Start (CLI)
```bash
# After setup (see below), just run:
./hermes
# Or with options:
./hermes --model "anthropic/claude-sonnet-4" --toolsets "web,terminal"
```
The CLI provides:
- Animated spinners during thinking and tool execution
- Kawaii-style feedback messages
- `/commands` for configuration, history, and session management
- Customizable personalities (`/personality kawaii`, `/personality pirate`, etc.)
- Persistent configuration via `cli-config.yaml`
## Setup
### 1. Clone the Repository
```bash
# Clone with submodules (recommended)
git clone --recurse-submodules https://github.com/NousResearch/Hermes-Agent.git
cd Hermes-Agent
# Or if already cloned without submodules:
git submodule update --init --recursive
```
### 2. Install Dependencies
### 1. Install Dependencies
```bash
# Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install Python packages
# Install required packages
pip install -r requirements.txt
# Install mini-swe-agent for terminal tools
pip install -e ./mini-swe-agent
# Install Node.js dependencies for browser tools (requires Node.js)
npm install
# Install Hecate for terminal tools
git clone git@github.com:NousResearch/hecate.git
cd hecate
pip install -e .
cd ..
```
### 3. Configure Environment Variables
### 2. Configure Environment Variables
```bash
# Copy the example environment file
cp .env.example .env
@@ -71,298 +41,14 @@ nano .env # or use your preferred editor
```
**Required API Keys:**
- `OPENROUTER_API_KEY` - LLM access via OpenRouter (get at: https://openrouter.ai/keys)
- `ANTHROPIC_API_KEY` - Main agent model (get at: https://console.anthropic.com/)
- `FIRECRAWL_API_KEY` - Web tools (get at: https://firecrawl.dev/)
- `NOUS_API_KEY` - Vision & reasoning tools (get at: https://inference-api.nousresearch.com/)
- `MORPH_API_KEY` - Terminal tools (get at: https://morph.so/)
- `FAL_KEY` - Image generation (get at: https://fal.ai/)
- `OPENAI_API_KEY` - Optional, for some Hecate features
**Optional API Keys (for specific features):**
- `BROWSERBASE_API_KEY` - Browser automation (get at: https://browserbase.com/)
- `BROWSERBASE_PROJECT_ID` - From Browserbase dashboard
- `MORPH_API_KEY` - For legacy Hecate terminal backend (get at: https://morph.so/)
### 4. Configure Terminal Backend
The terminal tool uses **mini-swe-agent** environments. Configure in `.env` or `cli-config.yaml`:
```bash
# Backend: "local", "docker", "singularity", "modal", or "ssh"
TERMINAL_ENV=local # Default: runs on host machine (no isolation)
TERMINAL_ENV=ssh # Remote execution via SSH (agent code stays local)
TERMINAL_ENV=singularity # Recommended for HPC: Apptainer/Singularity containers
TERMINAL_ENV=docker # Isolated Docker containers
TERMINAL_ENV=modal # Cloud execution via Modal
# Container image (for docker/singularity/modal backends)
TERMINAL_DOCKER_IMAGE=python:3.11-slim
TERMINAL_SINGULARITY_IMAGE=docker://python:3.11-slim
TERMINAL_TIMEOUT=60
# SSH backend (for ssh)
TERMINAL_SSH_HOST=my-server.example.com
TERMINAL_SSH_USER=myuser
TERMINAL_SSH_KEY=~/.ssh/id_rsa # Optional, uses ssh-agent if not set
```
**Backend Requirements:**
- **local**: No extra setup (runs directly on your machine, no isolation)
- **ssh**: SSH access to remote machine (great for sandboxing - agent can't touch its own code)
- **singularity**: Requires Apptainer or Singularity installed (common on HPC clusters, no root needed)
- **docker**: Requires Docker installed and user in `docker` group
- **modal**: Requires Modal account (see setup below)
### Singularity/Apptainer Setup (Recommended for HPC)
Singularity/Apptainer provides rootless container execution, ideal for HPC clusters:
```bash
# 1. Verify Apptainer is installed
apptainer --version # or: singularity --version
# 2. Set up cache directories (important for parallel workers)
# Use /scratch if available (HPC), otherwise /tmp
export APPTAINER_CACHEDIR=/scratch/$USER/.apptainer
export APPTAINER_TMPDIR=/scratch/$USER/.apptainer/tmp
mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
# 3. Pre-build SIF image (recommended for parallel batch processing)
# This avoids race conditions when multiple workers start simultaneously
apptainer build $APPTAINER_CACHEDIR/python-nodejs.sif docker://nikolaik/python-nodejs:python3.11-nodejs20
# 4. Configure .env to use the local SIF
TERMINAL_ENV=singularity
TERMINAL_SINGULARITY_IMAGE=/scratch/$USER/.apptainer/python-nodejs.sif
```
**Tip:** The batch scripts in `configs/` automatically handle SIF pre-building if `/scratch` is available.
### Modal Cloud Backend Setup
[Modal](https://modal.com) provides serverless cloud compute for running sandboxed environments at scale.
```bash
# 1. Install Modal and dependencies
pip install modal boto3
# 2. Authenticate with Modal (opens browser)
modal setup
# 3. Set terminal backend to modal in .env
TERMINAL_ENV=modal
```
Modal uses CLI-based authentication (stored in `~/.modal/`), so no API key is needed in `.env`. After running `modal setup`, commands will automatically execute in Modal's cloud sandboxes.
### Browser Tools Setup
Browser tools enable the agent to navigate websites, fill forms, click buttons, and extract content. They use [agent-browser](https://github.com/vercel-labs/agent-browser) CLI with [Browserbase](https://browserbase.com) cloud execution.
```bash
# 1. Install Node.js (if not already installed)
# Use nvm (recommended) or your package manager
# 2. Install agent-browser CLI (choose one option):
npm install -g agent-browser # Option A: Global install (recommended)
npm install # Option B: Local install (uses npx fallback)
# 3. Get Browserbase credentials
# Sign up at https://browserbase.com/ and get your:
# - API Key (from Settings → API Keys)
# - Project ID (from your project dashboard)
# 4. Add to your .env file:
BROWSERBASE_API_KEY=your_api_key_here
BROWSERBASE_PROJECT_ID=your_project_id_here
```
**Available Browser Tools:**
| Tool | Description |
|------|-------------|
| `browser_navigate` | Navigate to a URL |
| `browser_snapshot` | Get text-based page snapshot with element refs |
| `browser_click` | Click an element by ref (e.g., `@e5`) |
| `browser_type` | Type text into an input field |
| `browser_scroll` | Scroll up or down |
| `browser_back` | Go back in browser history |
| `browser_press` | Press a keyboard key (Enter, Tab, etc.) |
| `browser_close` | Close the browser session |
| `browser_get_images` | Get list of images on the page |
**Example Usage:**
```bash
# Use browser tools with web search and vision
python run_agent.py \
--query "Go to amazon.com and find the price of the latest Kindle" \
--enabled_toolsets=browser,web,vision
# Use browser-focused distribution
python batch_runner.py \
--dataset_file=browser_tasks.jsonl \
--distribution=browser_use \
--run_name=browser_run
```
See `.env.example` for all available configuration options including debug settings.
### Skills Tools
Skills are on-demand knowledge documents the agent can load when needed. They follow a **progressive disclosure** pattern to minimize token usage:
```
skills/
├── mlops/ # Category folder
│ ├── axolotl/ # Skill folder
│ │ ├── SKILL.md # Main instructions (required)
│ │ ├── references/ # Additional docs, API specs
│ │ └── templates/ # Output formats, configs
│ └── vllm/
│ └── SKILL.md
```
**Available Skills Tools:**
| Tool | Description |
|------|-------------|
| `skills_categories` | List available skill categories (~50 tokens) |
| `skills_list` | List skills with name + description (~3k tokens for 40 skills) |
| `skill_view` | Load full skill content, tags, and linked files |
**Example Usage:**
```bash
# Use skills tools
python run_agent.py \
--query "What skills do you have for fine-tuning? Show me the axolotl skill." \
--enabled_toolsets=skills
```
**Creating Skills:**
Skills use YAML frontmatter for metadata:
```yaml
---
name: my-skill
description: Brief description shown in skills_list
tags: [tag1, tag2]
related_skills: [other-skill]
version: 1.0.0
---
# Skill Content
Instructions, examples, and guidelines here...
```
Skills can include:
- `references/` - Additional documentation, API specs, examples
- `templates/` - Output formats, config files, boilerplate code
- `scripts/` - Executable helpers (Python, shell scripts)
## Session Logging
Every conversation is automatically logged to `logs/` for debugging and inspection:
```
logs/
├── session_20260201_143052_a1b2c3.json
├── session_20260201_150217_d4e5f6.json
└── ...
```
**Log Format:**
```json
{
"session_id": "20260201_143052_a1b2c3",
"model": "anthropic/claude-sonnet-4",
"session_start": "2026-02-01T14:30:52.123456",
"last_updated": "2026-02-01T14:35:12.789012",
"message_count": 8,
"conversations": [
{"from": "system", "value": "..."},
{"from": "human", "value": "..."},
{"from": "gpt", "value": "..."},
{"from": "tool", "value": "..."}
]
}
```
- **Automatic**: Logs are created and updated automatically after each conversation turn
- **Session ID in Banner**: The CLI displays the session ID in the welcome banner
- **Trajectory Format**: Uses the same format as batch processing for consistency
- **Git Ignored**: `logs/` is in `.gitignore` so logs aren't committed
## Interactive CLI
The CLI provides a rich interactive experience for working with the agent.
### Running the CLI
```bash
# Basic usage
./hermes
# With specific model
./hermes --model "anthropic/claude-sonnet-4"
# With specific toolsets
./hermes --toolsets "web,terminal,skills"
```
### CLI Commands
| Command | Description |
|---------|-------------|
| `/help` | Show available commands |
| `/tools` | List available tools by toolset |
| `/toolsets` | List available toolsets |
| `/model [name]` | Show or change the current model |
| `/prompt [text]` | View/set custom system prompt |
| `/personality [name]` | Set a predefined personality |
| `/clear` | Clear screen and reset conversation |
| `/reset` | Reset conversation only |
| `/history` | Show conversation history |
| `/save` | Save current conversation to file |
| `/config` | Show current configuration |
| `/quit` | Exit the CLI |
### Configuration
Copy `cli-config.yaml.example` to `cli-config.yaml` and customize:
```yaml
# Model settings
model:
default: "anthropic/claude-sonnet-4"
# Terminal backend (local, docker, singularity, modal, or ssh)
terminal:
env_type: "local"
cwd: "." # Use current directory
# Or use SSH for remote execution (keeps agent code isolated)
# terminal:
# env_type: "ssh"
# ssh_host: "my-server.example.com"
# ssh_user: "myuser"
# ssh_key: "~/.ssh/id_rsa"
# cwd: "/home/myuser/project"
# Enable specific toolsets
toolsets:
- all # or: web, terminal, browser, vision, etc.
# Custom personalities (use with /personality command)
agent:
personalities:
helpful: "You are a helpful assistant."
kawaii: "You are a kawaii assistant! Use cute expressions..."
```
### Personalities
Built-in personalities available via `/personality`:
- `helpful`, `concise`, `technical`, `creative`, `teacher`
- `kawaii`, `catgirl`, `pirate`, `shakespeare`, `surfer`
- `noir`, `uwu`, `philosopher`, `hype`
See `.env.example` for all available configuration options including debug settings and terminal tool configuration.
## Toolsets System
@@ -402,17 +88,18 @@ python run_agent.py --enabled_toolsets=safe --query "Help without running comman
python run_agent.py --list_tools
```
See `toolsets.py` for the complete list of available toolsets and how to create custom ones.
For detailed documentation on toolsets, see `TOOLSETS_README.md`.
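For example, a custom toolset might be registered roughly like this (a sketch; the exact dict shape should be checked against `toolsets.py`):
```python
# toolsets.py (sketch) -- dict-of-tool-name-lists is an assumption to verify
TOOLSETS = {
    # ...existing toolsets...
    "research_lite": [
        "web_search",   # tool names must match the model_tools.py registrations
        "web_extract",
        "skill_view",
    ],
}
```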
## Basic Usage
### Default (all tools enabled)
```bash
# Uses OpenRouter by default - just set OPENROUTER_API_KEY in .env
python run_agent.py \
--query "search up the latest docs on jit in python 3.13 and write me basic example that's not in their docs. profile its perf" \
--max_turns 20 \
--model anthropic/claude-sonnet-4-20250514
--model claude-sonnet-4-20250514 \
--base_url https://api.anthropic.com/v1/ \
--api_key $ANTHROPIC_API_KEY
```
### With specific toolset
@@ -420,16 +107,17 @@ python run_agent.py \
python run_agent.py \
--query "Debug this Python error" \
--enabled_toolsets=debugging \
--model anthropic/claude-sonnet-4-20250514
--model claude-sonnet-4-20250514 \
--api_key $ANTHROPIC_API_KEY
```
### Python API
```python
from run_agent import AIAgent
# Uses OpenRouter by default (reads OPENROUTER_API_KEY from .env)
# Use a specific toolset
agent = AIAgent(
model="anthropic/claude-sonnet-4-20250514",
model="claude-opus-4-20250514",
enabled_toolsets=["research"]
)
response = agent.chat("Find information about quantum computing")
@@ -474,36 +162,8 @@ python batch_runner.py \
- Combined output in `data/<run_name>/trajectories.jsonl`
- Tool usage statistics and success rates
Use `--list_distributions` to see available toolset distributions for varied data generation.
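Conceptually, a distribution is a probability weighting over toolsets, sampled per prompt; something like this sketch (names and shape are illustrative, not the actual `toolset_distributions.py` contents):
```python
# Sketch of probability-based toolset selection for data generation.
import random

BROWSER_USE = {"browser": 1.0, "web": 0.7, "vision": 0.3}

def sample_toolsets(distribution: dict[str, float]) -> list[str]:
    # Each toolset is enabled independently with its configured probability.
    return [name for name, p in distribution.items() if random.random() < p]
```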
### Trajectory Compression
Post-process trajectories to fit within token budgets for training:
```bash
# Compress a directory of JSONL files
python trajectory_compressor.py --input=data/my_run
# Compress a single JSONL file
python trajectory_compressor.py --input=data/trajectories.jsonl
# Compress a 15% sample (useful for creating smaller training sets)
python trajectory_compressor.py --input=data/trajectories.jsonl --sample_percent=15
# Custom output and token target
python trajectory_compressor.py \
--input=data/trajectories.jsonl \
--output=data/compressed.jsonl \
--target_max_tokens=16000
```
**Features:**
- Protects first turns (system, human, first GPT response, first tool call)
- Protects last N turns (configurable)
- Summarizes middle turns using LLM to fit target token budget
- Supports both directory and single file input
- Optional random sampling with `--sample_percent`
- Configurable via `configs/trajectory_compression.yaml`
**Quick Start:** See [QUICKSTART_BATCH.md](QUICKSTART_BATCH.md) for a 5-minute getting started guide.
**Full Documentation:** See [BATCH_PROCESSING.md](BATCH_PROCESSING.md) for comprehensive documentation.
### Ephemeral System Prompts
@@ -524,7 +184,7 @@ python batch_runner.py \
The ephemeral prompt will influence the model's behavior during execution, but **only the standard tool-calling system prompt** will be saved in the trajectory files.
The ephemeral prompt influences model behavior during execution, but **only the standard tool-calling system prompt** is saved in trajectory files.
**Documentation:** See [docs/ephemeral_system_prompt.md](docs/ephemeral_system_prompt.md) for complete details.
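A hypothetical invocation (the flag name below is an assumption, not the documented interface; see `docs/ephemeral_system_prompt.md` for the real one):
```bash
# Hypothetical flag name -- verify against docs/ephemeral_system_prompt.md
python batch_runner.py \
    --dataset_file=tasks.jsonl \
    --run_name=steered_run \
    --ephemeral_system_prompt "Prefer terminal tools over web search."
```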
## Command Line Arguments
@@ -553,305 +213,31 @@ The ephemeral prompt influences model behavior during execution, but **only the
All environment variables can be configured in the `.env` file (copy from `.env.example`).
**LLM Provider (OpenRouter):**
- `OPENROUTER_API_KEY`: Primary LLM access via OpenRouter (supports Claude, GPT-4, Gemini, etc.)
- `LLM_MODEL`: Default model (e.g., `anthropic/claude-sonnet-4`, `openai/gpt-4o`)
**Tool API Keys:**
**Core API Keys:**
- `ANTHROPIC_API_KEY`: Main agent model
- `FIRECRAWL_API_KEY`: Web tools (search, extract, crawl)
- `NOUS_API_KEY`: Vision and reasoning tools
- `MORPH_API_KEY`: Terminal tools
- `FAL_KEY`: Image generation tools
- `OPENAI_API_KEY`: Optional, for some Hecate features
**Terminal Tool Configuration (mini-swe-agent backend):**
- `TERMINAL_ENV`: Backend type - `local`, `docker`, `singularity`, `modal`, or `ssh` (default: `local`)
- `TERMINAL_DOCKER_IMAGE`: Docker image for docker backend (default: `python:3.11-slim`)
- `TERMINAL_SINGULARITY_IMAGE`: Singularity/Apptainer image (can be `docker://...` URL or local `.sif` path)
- `TERMINAL_TIMEOUT`: Command timeout in seconds (default: `60`)
- `TERMINAL_LIFETIME_SECONDS`: Cleanup inactive environments after this time (default: `300`)
- `TERMINAL_CWD`: Working directory inside containers (default: `/tmp`)
- `TERMINAL_SCRATCH_DIR`: Custom scratch directory for sandbox storage (optional, auto-detects `/scratch`)
- `SUDO_PASSWORD`: Enable sudo commands by piping password via `sudo -S` (works with all backends)
- If unset in CLI mode, you'll be prompted interactively when sudo is needed (45s timeout)
**SSH Backend Configuration (for remote execution):**
- `TERMINAL_SSH_HOST`: Remote server hostname or IP
- `TERMINAL_SSH_USER`: SSH username
- `TERMINAL_SSH_PORT`: SSH port (default: `22`)
- `TERMINAL_SSH_KEY`: Path to SSH private key (optional, uses ssh-agent if not set)
**Browser Tool Configuration (agent-browser + Browserbase):**
- `BROWSERBASE_API_KEY`: Browserbase API key for cloud browser execution
- `BROWSERBASE_PROJECT_ID`: Browserbase project ID
- `BROWSER_SESSION_TIMEOUT`: Session timeout in seconds (default: `300`)
**Legacy Hecate Terminal Backend (optional):**
- `MORPH_API_KEY`: For Hecate/MorphCloud terminal backend
**Configuration Options:**
- `HECATE_VM_LIFETIME_SECONDS`: VM lifetime (default: 300)
- `HECATE_DEFAULT_SNAPSHOT_ID`: Default snapshot (default: snapshot_p5294qxt)
**Debug Options:**
- `WEB_TOOLS_DEBUG`, `VISION_TOOLS_DEBUG`, `MOA_TOOLS_DEBUG`, `IMAGE_TOOLS_DEBUG`: Enable debug logging
## Key Files
## Documentation
| File | Purpose |
|------|---------|
| `hermes` | CLI launcher script (run with `./hermes`) |
| `cli.py` | Interactive CLI implementation |
| `cli-config.yaml` | CLI configuration (copy from `.example`) |
| `run_agent.py` | Main agent runner - single query execution |
| `batch_runner.py` | Parallel batch processing with checkpointing |
| `model_tools.py` | Core tool definitions and handlers |
| `toolsets.py` | Toolset definitions and composition |
| `toolset_distributions.py` | Probability distributions for data generation |
| `trajectory_compressor.py` | Post-process trajectories for training |
| `tools/` | Individual tool implementations |
| `tools/skills_tool.py` | Skills system with progressive disclosure |
| `skills/` | On-demand knowledge documents |
| `docs/` | Documentation |
| `configs/` | Example batch run scripts |
**Single Agent Usage:**
- `TOOLSETS_README.md`: Comprehensive guide to the toolsets system
- `toolsets.py`: View and modify available toolsets
- `model_tools.py`: Core tool definitions and handlers
# Atropos Integrations & RL Training
**Batch Processing:**
- `QUICKSTART_BATCH.md`: 5-minute quick start guide
- `BATCH_PROCESSING.md`: Complete batch processing documentation
- `toolset_distributions.py`: Toolset distributions for data generation
Atropos is an RL training framework that uses Hermes-Agent for agent-based environments. This section covers setting up the sandbox infrastructure with either Docker or Singularity backends.
## Examples
## Prerequisites
### 1. Install Nomad
Nomad is a workload orchestrator that manages the sandbox containers:
```bash
# Install Nomad (Linux)
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install nomad
# Verify installation
nomad --version
```
For other platforms, see: https://developer.hashicorp.com/nomad/docs/install
### 2. Install Atropos Dependencies
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[atropos]'
```
## Backend Options
Atropos supports two container backends for the sandbox environment:
| Backend | Use Case | Requirements |
|---------|----------|--------------|
| **Docker** | Development, servers with Docker | Docker installed, user in `docker` group |
| **Singularity** | HPC clusters, rootless environments | Apptainer/Singularity installed (no root needed) |
---
## Docker Backend (Default)
### 1. Build the Sandbox Image
```bash
cd atropos
docker build -t atropos-sandbox:local .
```
### 2. Start Nomad (Development Mode)
```bash
# Start Nomad with Docker driver
nomad agent -dev -config=nomad-dev.hcl
```
Or create `nomad-dev.hcl`:
```hcl
client {
enabled = true
options {
"driver.allowlist" = "docker"
}
}
```
### 3. Run with Docker Backend
```bash
source .venv/bin/activate
# Test the environment
python -m atropos.envs.swe_smith_oracle_env process \
--env.use_wandb false \
--env.total_steps 1 \
--env.max_items 1 \
--env.driver docker
```
---
## Singularity Backend (HPC/Rootless)
Singularity/Apptainer is ideal for HPC clusters where Docker requires root privileges.
### 1. Build the Singularity Image
```bash
cd atropos
# Option A: Convert from Docker image (if Docker is available)
docker build -t atropos-sandbox:local .
apptainer build atropos-sandbox.sif docker-daemon://atropos-sandbox:local
# Option B: Pull and convert the prebuilt image from the registry (no local Docker needed)
apptainer build atropos-sandbox.sif docker://ghcr.io/nousresearch/atropos-sandbox:latest
```
### 2. Start Nomad with raw_exec Driver
Singularity uses Nomad's `raw_exec` driver. Create `nomad-singularity.hcl`:
```hcl
client {
enabled = true
options {
"driver.allowlist" = "raw_exec,docker"
}
}
plugin "raw_exec" {
config {
enabled = true
}
}
```
Start Nomad:
```bash
nomad agent -dev -config=nomad-singularity.hcl
```
### 3. Run with Singularity Backend
```bash
source .venv/bin/activate
# Basic test
python -m atropos.envs.swe_smith_oracle_env process \
--env.use_wandb false \
--env.total_steps 1 \
--env.max_items 1 \
--env.driver singularity \
--env.singularity_image /path/to/atropos-sandbox.sif
# Full example with all options
python -m atropos.envs.swe_smith_oracle_env process \
--env.use_wandb false \
--env.total_steps 10 \
--env.group_size 4 \
--env.max_items 100 \
--env.driver singularity \
--env.singularity_image /path/to/atropos-sandbox.sif \
--env.slots_per_container 10 \
--env.min_containers 1 \
--env.max_containers 5
```
---
## CLI Arguments Reference
### Environment Configuration (`--env.*`)
| Argument | Default | Description |
|----------|---------|-------------|
| `--env.driver` | `docker` | Container backend: `docker` or `singularity` |
| `--env.singularity_image` | - | Path to `.sif` file (required for singularity driver) |
| `--env.sandbox_image` | `atropos-sandbox:local` | Docker image name (for docker driver) |
| `--env.slots_per_container` | `10` | Number of parallel slots per container |
| `--env.min_containers` | `1` | Minimum number of containers to run |
| `--env.max_containers` | `10` | Maximum containers for auto-scaling |
| `--env.nomad_address` | `http://localhost:4646` | Nomad server address |
| `--env.privileged` | `false` | Run containers in privileged mode (Docker only) |
### Processing Configuration
| Argument | Default | Description |
|----------|---------|-------------|
| `--env.total_steps` | `1` | Number of processing steps |
| `--env.group_size` | `1` | Items per processing group |
| `--env.max_items` | `0` | Max dataset items (0 = all) |
| `--env.use_wandb` | `true` | Enable Weights & Biases logging |
| `--env.agent_max_steps` | `50` | Max agent steps per trajectory |
---
## Troubleshooting
### Port Already in Use
```bash
# Find and kill process on port 8080
lsof -ti :8080 | xargs kill
# Or use a different port
--env.port 8081
```
### Singularity: Permission Denied
```bash
# Check Apptainer is installed
apptainer --version
# Ensure the .sif file is readable
ls -la /path/to/atropos-sandbox.sif
```
### Nomad: Job Not Starting
```bash
# Check Nomad status
nomad status
# View job logs
nomad alloc logs -job atropos-sandbox-agent-env
# Check stderr for errors
nomad alloc logs -stderr -job atropos-sandbox-agent-env
```
### OpenAI API Token Error
If you see `NotImplementedError: OpenAI endpoints do not support token IDs`:
```bash
# For testing/evaluation only (not training)
export ATROPOS_ALLOW_DUMMY_MANAGED_SERVER=1
```
---
## Example: Full HPC Workflow
```bash
# 1. Setup environment
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[atropos]'
# 2. Build Singularity image (on a machine with Docker)
cd atropos
docker build -t atropos-sandbox:local .
apptainer build atropos-sandbox.sif docker-daemon://atropos-sandbox:local
# 3. Transfer .sif to HPC cluster
scp atropos-sandbox.sif user@hpc-cluster:/scratch/user/
# 4. On HPC cluster: Start Nomad
nomad agent -dev -config=nomad-singularity.hcl &
# 5. Run training
python -m atropos.envs.swe_smith_oracle_env process \
--env.driver singularity \
--env.singularity_image /scratch/user/atropos-sandbox.sif \
--env.total_steps 100 \
--env.max_items 1000
```
See `TOOLSETS_README.md` for extensive examples of using different toolsets for various scenarios.

TODO.md
View File

@@ -1,729 +0,0 @@
# Hermes Agent - Future Improvements
> Ideas for enhancing the agent's capabilities, generated from self-analysis of the codebase.
---
## 🚨 HIGH PRIORITY - Immediate Fixes
These items need to be addressed ASAP:
### 1. SUDO Breaking Terminal Tool 🔐 ✅ COMPLETE
- [x] **Problem:** SUDO commands break the terminal tool execution (hangs indefinitely)
- [x] **Fix:** Created custom environment wrappers in `tools/terminal_tool.py`
- `stdin=subprocess.DEVNULL` prevents hanging on interactive prompts
- Sudo fails gracefully with clear error if no password configured
- Same UX as Claude Code - agent sees error, tells user to run it themselves
- [x] **All 5 environments now have consistent behavior:**
- `_LocalEnvironment` - local execution
- `_DockerEnvironment` - Docker containers
- `_SingularityEnvironment` - Singularity/Apptainer containers
- `_ModalEnvironment` - Modal cloud sandboxes
- `_SSHEnvironment` - remote SSH execution
- [x] **Optional sudo support via `SUDO_PASSWORD` env var:**
- Shared `_transform_sudo_command()` helper used by all environments
- If set, auto-transforms `sudo cmd` → pipes password via `sudo -S`
- Documented in `.env.example`, `cli-config.yaml`, and README
- Works for chained commands: `cmd1 && sudo cmd2` (see the sketch after this list)
- [x] **Interactive sudo prompt in CLI mode:**
- When sudo detected and no password configured, prompts user
- 45-second timeout (auto-skips if no input)
- Hidden password input via `getpass` (password not visible)
- Password cached for session (don't ask repeatedly)
- Spinner pauses during prompt for clean UX
- Uses `HERMES_INTERACTIVE` env var to detect CLI mode
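A minimal sketch of the sudo transform (illustrative; the real helper in `tools/terminal_tool.py` covers more shell grammar):
```python
# Sketch: rewrite "sudo cmd" so the password is piped via `sudo -S`,
# keeping non-interactive shells from hanging on a password prompt.
import shlex

def _transform_sudo_command(command: str, sudo_password: str) -> str:
    segments = [seg.strip() for seg in command.split("&&")]
    out = []
    for seg in segments:
        if seg.startswith("sudo "):
            seg = (f"echo {shlex.quote(sudo_password)} | "
                   f"sudo -S {seg[len('sudo '):]}")
        out.append(seg)
    return " && ".join(out)
```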
### 2. Fix `browser_get_images` Tool 🖼️ ✅ VERIFIED WORKING
- [x] **Tested:** Tool works correctly on multiple sites
- [x] **Results:** Successfully extracts image URLs, alt text, dimensions
- [x] **Note:** Some sites (Pixabay, etc.) have Cloudflare bot protection that blocks headless browsers - this is expected behavior, not a bug
### 3. Better Action Logging for Debugging 📝 ✅ COMPLETE
- [x] **Problem:** Need better logging of agent actions for debugging
- [x] **Implementation:**
- Save full session trajectories to `logs/` directory as JSON
- Each session gets a unique file: `session_YYYYMMDD_HHMMSS_UUID.json`
- Logs all messages, tool calls with inputs/outputs, timestamps
- Structured JSON format for easy parsing and replay
- Automatic on CLI runs (configurable)
### 4. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED
- [ ] **Problem:** Thinking/reasoning summaries not shown while streaming
- [ ] **Complexity:** This is a significant refactor - leaving for later
**OpenRouter Streaming Info:**
- Uses `stream=True` with OpenAI SDK
- Reasoning comes in `choices[].delta.reasoning_details` chunks
- Types: `reasoning.summary`, `reasoning.text`, `reasoning.encrypted`
- Tool call arguments stream as partial JSON (need accumulation)
- Items paradigm: same ID emitted multiple times with updated content
**Key Challenges:**
- Tool call JSON accumulation (partial `{"query": "wea` accumulating into `{"query": "weather"}`)
- Multiple concurrent outputs (thinking + tool calls + text simultaneously)
- State management for partial responses
- Error handling if connection drops mid-stream
- Deciding when tool calls are "complete" enough to execute
**UX Questions to Resolve:**
- Show raw thinking text or summarized?
- Live expanding text vs. spinner replacement?
- Markdown rendering while streaming?
- How to handle thinking + tool call display simultaneously?
**Implementation Options:**
- New `run_conversation_streaming()` method (keep non-streaming as fallback)
- Wrapper that handles streaming internally
- Big refactor of existing `run_conversation()`
**References:**
- https://openrouter.ai/docs/api/reference/streaming
- https://openrouter.ai/docs/guides/best-practices/reasoning-tokens#streaming-response
---
## 1. Subagent Architecture (Context Isolation) 🎯
**Problem:** Long-running tools (terminal commands, browser automation, complex file operations) consume massive context. A single `ls -la` can add hundreds of lines. Browser snapshots, debugging sessions, and iterative terminal work quickly bloat the main conversation, leaving less room for actual reasoning.
**Solution:** The main agent becomes an **orchestrator** that delegates context-heavy tasks to **subagents**.
**Architecture:**
```
┌─────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (main agent) │
│ - Receives user request │
│ - Plans approach │
│ - Delegates heavy tasks to subagents │
│ - Receives summarized results │
│ - Maintains clean, focused context │
└─────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ TERMINAL AGENT │ │ BROWSER AGENT │ │ CODE AGENT │
│ - terminal tool │ │ - browser tools │ │ - file tools │
│ - file tools │ │ - web_search │ │ - terminal │
│ │ │ - web_extract │ │ │
│ Isolated context│ │ Isolated context│ │ Isolated context│
│ Returns summary │ │ Returns summary │ │ Returns summary │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
**How it works:**
1. User asks: "Set up a new Python project with FastAPI and tests"
2. Orchestrator plans: "I need to create files, install deps, write code"
3. Orchestrator calls: `terminal_task(goal="Create venv, install fastapi pytest", context="New project in ~/myapp")`
4. **Subagent spawns** with fresh context, only terminal/file tools
5. Subagent iterates (may take 10+ tool calls, lots of output)
6. Subagent completes → returns summary: "Created venv, installed fastapi==0.109.0, pytest==8.0.0"
7. Orchestrator receives **only the summary**, context stays clean
8. Orchestrator continues with next subtask
**Key tools to implement:**
- [ ] `terminal_task(goal, context, cwd?)` - Delegate terminal/shell work
- [ ] `browser_task(goal, context, start_url?)` - Delegate web research/automation
- [ ] `code_task(goal, context, files?)` - Delegate code writing/modification
- [ ] Generic `delegate_task(goal, context, toolsets=[])` - Flexible delegation (see the sketch after this list)
**Implementation details:**
- [ ] Subagent uses same `run_agent.py` but with:
- Fresh/empty conversation history
- Limited toolset (only what's needed)
- Smaller max_iterations (focused task)
- Task-specific system prompt
- [ ] Subagent returns structured result:
```python
{
"success": True,
"summary": "Installed 3 packages, created 2 files",
"details": "Optional longer explanation if needed",
"artifacts": ["~/myapp/requirements.txt", "~/myapp/main.py"], # Files created
"errors": [] # Any issues encountered
}
```
- [ ] Orchestrator sees only the summary in its context
- [ ] Full subagent transcript saved separately for debugging
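A minimal sketch of the delegation flow; `SubagentRunner` and `_demo_runner` are hypothetical stand-ins for a fresh `run_agent.py` loop with a limited toolset:
```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

# Hypothetical: any coroutine that runs a fresh agent loop on a task string
# and returns the structured result dict shown above.
SubagentRunner = Callable[[str], Awaitable[Dict[str, Any]]]

async def delegate_task(goal: str, context: str, runner: SubagentRunner) -> Dict[str, Any]:
    """Run a subagent in an isolated context; hand back only the summary fields."""
    result = await runner(f"Goal: {goal}\nContext: {context}")
    # The orchestrator's context receives just these fields, never the transcript.
    return {
        "success": result.get("success", False),
        "summary": result.get("summary", ""),
        "artifacts": result.get("artifacts", []),
        "errors": result.get("errors", []),
    }

async def _demo_runner(task: str) -> Dict[str, Any]:
    # Stand-in for a real subagent loop (10+ tool calls, lots of output).
    return {"success": True, "summary": f"Completed: {task}", "artifacts": [], "errors": []}

print(asyncio.run(delegate_task("Create venv, install fastapi pytest", "New project in ~/myapp", _demo_runner)))
```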
**Benefits:**
- 🧹 **Clean context** - Orchestrator stays focused, doesn't drown in tool output
- 📊 **Better token efficiency** - 50 terminal outputs → 1 summary paragraph
- 🎯 **Focused subagents** - Each agent has just the tools it needs
- 🔄 **Parallel potential** - Independent subtasks could run concurrently
- 🐛 **Easier debugging** - Each subtask has its own isolated transcript
**When to use subagents vs direct tools:**
- **Subagent**: Multi-step tasks, iteration likely, lots of output expected
- **Direct**: Quick one-off commands, simple file reads, user needs to see output
**Files to modify:** `run_agent.py` (add orchestration mode), new `tools/delegate_tools.py`, new `subagent_runner.py`
---
## 2. Context Management (complements Subagents)
**Problem:** Context grows unbounded during long conversations. Trajectory compression exists for training data post-hoc, but live conversations lack intelligent context management.
**Ideas:**
- [ ] **Incremental summarization** - Compress old tool outputs on-the-fly during conversations (see the sketch after this list)
- Trigger when context exceeds threshold (e.g., 80% of max tokens)
- Preserve recent turns fully, summarize older tool responses
- Could reuse logic from `trajectory_compressor.py`
- [ ] **Semantic memory retrieval** - Vector store for long conversation recall
- Embed important facts/findings as conversation progresses
- Retrieve relevant memories when needed instead of keeping everything in context
- Consider lightweight solutions: ChromaDB, FAISS, or even a simple embedding cache
- [ ] **Working vs. episodic memory** distinction
- Working memory: Current task state, recent tool results (always in context)
- Episodic memory: Past findings, tried approaches (retrieved on demand)
- Clear eviction policies for each
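A minimal sketch of the incremental summarization trigger; `count_tokens` and `summarize` are rough hypothetical stand-ins (a real version could reuse `trajectory_compressor.py` and a proper tokenizer), and it assumes tool outputs are user turns wrapped in `<tool_response>` tags:
```python
from typing import Dict, List

MAX_TOKENS = 128_000
TRIGGER_RATIO = 0.8      # compress once context passes 80% of the window
KEEP_RECENT_TURNS = 6    # the most recent turns stay verbatim

def count_tokens(messages: List[Dict[str, str]]) -> int:
    return sum(len(m["content"]) // 4 for m in messages)  # crude chars/4 heuristic

def summarize(text: str) -> str:
    return text[:200] + " ...[summarized]"  # placeholder for an LLM call

def maybe_compress(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    if count_tokens(messages) < MAX_TOKENS * TRIGGER_RATIO:
        return messages
    head, tail = messages[:-KEEP_RECENT_TURNS], messages[-KEEP_RECENT_TURNS:]
    compressed = [
        {**m, "content": summarize(m["content"])}
        if m["role"] == "user" and m["content"].startswith("<tool_response>")
        else m
        for m in head
    ]
    return compressed + tail
```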
**Files to modify:** `run_agent.py` (add memory manager), possibly new `tools/memory_tool.py`
---
## 3. Self-Reflection & Course Correction 🔄
**Problem:** Current retry logic handles malformed outputs but not semantic failures. Agent doesn't reason about *why* something failed.
**Ideas:**
- [ ] **Meta-reasoning after failures** - When a tool returns an error or unexpected result:
```
Tool failed → Reflect: "Why did this fail? What assumptions were wrong?"
→ Adjust approach → Retry with new strategy
```
- Could be a lightweight LLM call or structured self-prompt
- [ ] **Planning/replanning module** - For complex multi-step tasks:
- Generate plan before execution
- After each step, evaluate: "Am I on track? Should I revise the plan?"
- Store plan in working memory, update as needed
- [ ] **Approach memory** - Remember what didn't work:
- "I tried X for this type of problem and it failed because Y"
- Prevents repeating failed strategies in the same conversation
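A minimal sketch of a post-failure reflection hook; `llm()` is a hypothetical stand-in for the lightweight reflection call, and the returned note would be appended to the conversation before retrying:
```python
from typing import Any, Dict, List

def llm(prompt: str) -> str:
    return "The path was relative; retry with an absolute path."  # stub

def reflect_on_failure(tool_name: str, args: Dict[str, Any], error: str, tried: List[str]) -> str:
    """Ask why the tool failed; record the attempt so we don't repeat it."""
    prompt = (
        f"Tool `{tool_name}` failed with: {error}\n"
        f"Arguments: {args}\n"
        f"Approaches already tried: {tried or 'none'}\n"
        "In one or two sentences: why did this fail, and what should change?"
    )
    note = llm(prompt)
    tried.append(f"{tool_name}({args}) -> {error}")  # approach memory
    return f"<reflection>{note}</reflection>"
```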
**Files to modify:** `run_agent.py` (add reflection hooks in tool loop), new `tools/reflection_tool.py`
---
## 4. Tool Composition & Learning 🔧
**Problem:** Tools are atomic. Complex tasks require repeated manual orchestration of the same tool sequences.
**Ideas:**
- [ ] **Macro tools / Tool chains** - Define reusable tool sequences:
```yaml
research_topic:
description: "Deep research on a topic"
steps:
- web_search: {query: "$topic"}
- web_extract: {urls: "$search_results.urls[:3]"}
- summarize: {content: "$extracted"}
```
- Could be defined in skills or a new `macros/` directory
- Agent can invoke macro as single tool call
- [ ] **Tool failure patterns** - Learn from failures:
- Track: tool, input pattern, error type, what worked instead
- Before calling a tool, check: "Has this pattern failed before?"
- Persistent across sessions (stored in skills or separate DB)
- [ ] **Parallel tool execution** - When tools are independent, run them concurrently (sketched below):
- Detect independence (no data dependencies between calls)
- Use `asyncio.gather()` for parallel execution
- Already have async support in some tools, just need orchestration
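A minimal sketch of the parallel execution idea; `execute()` is a hypothetical stand-in for the existing async tool dispatch:
```python
import asyncio
from typing import Any, Dict, List

async def execute(call: Dict[str, Any]) -> str:
    await asyncio.sleep(0.1)  # stand-in for real tool I/O
    return f"{call['name']} done"

async def run_independent_calls(calls: List[Dict[str, Any]]) -> List[str]:
    # gather() preserves order, so results line up with the original calls.
    return await asyncio.gather(*(execute(c) for c in calls))

print(asyncio.run(run_independent_calls([{"name": "web_search"}, {"name": "terminal"}])))
```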
**Files to modify:** `model_tools.py`, `toolsets.py`, new `tool_macros.py`
---
## 5. Dynamic Skills Expansion 📚
**Problem:** Skills system is elegant but static. Skills must be manually created and added.
**Ideas:**
- [ ] **Skill acquisition from successful tasks** - After completing a complex task:
- "This approach worked well. Save as a skill?"
- Extract: goal, steps taken, tools used, key decisions
- Generate SKILL.md automatically (sketched after this list)
- Store in user's skills directory
- [ ] **Skill templates** - Common patterns that can be parameterized:
```markdown
# Debug {language} Error
1. Reproduce the error
2. Search for error message: `web_search("{error_message} {language}")`
3. Check common causes: {common_causes}
4. Apply fix and verify
```
- [ ] **Skill chaining** - Combine skills for complex workflows:
- Skills can reference other skills as dependencies
- "To do X, first apply skill Y, then skill Z"
- Directed graph of skill dependencies
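A minimal sketch of auto-generating a SKILL.md; the `generate_skill` name and the Markdown layout are assumptions, not the existing skills format:
```python
from pathlib import Path
from typing import List

def generate_skill(name: str, goal: str, steps: List[str], skills_dir: str = "skills") -> Path:
    """Write a SKILL.md extracted from a successful task into the skills directory."""
    body = "\n".join(
        [f"# {name}", "", f"**Goal:** {goal}", "", "## Steps"]
        + [f"{i}. {s}" for i, s in enumerate(steps, 1)]
    )
    path = Path(skills_dir) / name.lower().replace(" ", "-") / "SKILL.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body + "\n", encoding="utf-8")
    return path
```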
**Files to modify:** `tools/skills_tool.py`, `skills/` directory structure, new `skill_generator.py`
---
## 6. Task Continuation Hints 🎯
**Problem:** The agent could be more helpful by suggesting logical next steps after finishing a task.
**Ideas:**
- [ ] **Suggest next steps** - At end of a task, suggest logical continuations:
- "Code is written. Want me to also write tests / docs / deploy?"
- Based on common workflows for task type
- Non-intrusive, just offer options
**Files to modify:** `run_agent.py`, response generation logic
---
## 7. Interactive Clarifying Questions Tool ❓
**Problem:** Agent sometimes makes assumptions or guesses when it should ask the user. Currently it can only ask via plain text, which gets lost in long outputs.
**Ideas:**
- [ ] **Multiple-choice prompt tool** - Let agent present structured choices to user:
```
ask_user_choice(
question="Should the language switcher enable only German or all languages?",
choices=[
"Only enable German - works immediately",
"Enable all, mark untranslated - show fallback notice",
"Let me specify something else"
]
)
```
- Renders as interactive terminal UI with arrow key / Tab navigation
- User selects option, result returned to agent
- Up to 4 choices + optional free-text option
- [ ] **Implementation** (sketched after this list):
- Use `inquirer` or `questionary` Python library for rich terminal prompts
- Tool returns selected option text (or user's custom input)
- **CLI-only** - only works when running via `cli.py` (not API/programmatic use)
- Graceful fallback: if not in interactive mode, return error asking agent to rephrase as text
- [ ] **Use cases:**
- Clarify ambiguous requirements before starting work
- Confirm destructive operations with clear options
- Let user choose between implementation approaches
- Checkpoint complex multi-step workflows
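A minimal sketch of the tool built on the `questionary` library, including the non-interactive fallback described above:
```python
import sys
from typing import Dict, List

import questionary

def ask_user_choice(question: str, choices: List[str]) -> Dict[str, object]:
    if not sys.stdin.isatty():
        # Graceful fallback outside the interactive CLI (API/programmatic use).
        return {"success": False, "error": "Not interactive; ask the user in plain text instead."}
    answer = questionary.select(question, choices=choices).ask()  # arrow-key UI
    if answer is None:  # user cancelled (Ctrl+C / Esc)
        return {"success": False, "error": "User cancelled the prompt."}
    return {"success": True, "choice": answer}
```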
**Files to modify:** New `tools/ask_user_tool.py`, `cli.py` (detect interactive mode), `model_tools.py`
---
## 8. Resource Awareness & Efficiency 💰
**Problem:** No awareness of costs, time, or resource usage. Could be smarter about efficiency.
**Ideas:**
- [ ] **Tool result caching** - Don't repeat identical operations (sketched below):
- Cache web searches, extractions within a session
- Invalidation based on time-sensitivity of query
- Hash-based lookup: same input → cached output
- [ ] **Lazy evaluation** - Don't fetch everything upfront:
- Get summaries first, full content only if needed
- "I found 5 relevant pages. Want me to deep-dive on any?"
**Files to modify:** `model_tools.py`, new `resource_tracker.py`
---
## 9. Collaborative Problem Solving 🤝
**Problem:** Interaction is command/response. Complex problems benefit from dialogue.
**Ideas:**
- [ ] **Assumption surfacing** - Make implicit assumptions explicit:
- "I'm assuming you want Python 3.11+. Correct?"
- "This solution assumes you have sudo access..."
- Let user correct before going down wrong path
- [ ] **Checkpoint & confirm** - For high-stakes operations:
- "About to delete 47 files. Here's the list - proceed?"
- "This will modify your database. Want a backup first?"
- Configurable threshold for when to ask
**Files to modify:** `run_agent.py`, system prompt configuration
---
## 10. Project-Local Context 💾
**Problem:** Valuable context is lost between sessions.
**Ideas:**
- [ ] **Project awareness** - Remember project-specific context:
- Store `.hermes/context.md` in project directory
- "This is a Django project using PostgreSQL"
- Coding style preferences, deployment setup, etc.
- Load automatically when working in that directory
- [ ] **Handoff notes** - Leave notes for future sessions:
- Write to `.hermes/notes.md` in project
- "TODO for next session: finish implementing X"
- "Known issues: Y doesn't work on Windows"
**Files to modify:** New `project_context.py`, auto-load in `run_agent.py`
---
## 11. Graceful Degradation & Robustness 🛡️
**Problem:** When things go wrong, recovery is limited. The agent should fail gracefully instead of losing progress.
**Ideas:**
- [ ] **Fallback chains** - When the primary approach fails, have backups (sketched below):
- `web_extract` fails → try `browser_navigate` → try `web_search` for cached version
- Define fallback order per tool type
- [ ] **Partial progress preservation** - Don't lose work on failure:
- Long task fails midway → save what we've got
- "I completed 3/5 steps before the error. Here's what I have..."
- [ ] **Self-healing** - Detect and recover from bad states:
- Browser stuck → close and retry
- Terminal hung → timeout and reset
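A minimal sketch of a fallback chain: each entry is an async callable tried in order until one succeeds (the per-tool fallback order would be configured elsewhere):
```python
from typing import Any, Awaitable, Callable, List, Sequence

async def with_fallbacks(attempts: Sequence[Callable[[], Awaitable[Any]]]) -> Any:
    """Try each option in order; return the first success, raise if all fail."""
    errors: List[str] = []
    for attempt in attempts:
        try:
            return await attempt()
        except Exception as exc:  # any failure moves us to the next option
            errors.append(f"{getattr(attempt, '__name__', 'attempt')}: {exc}")
    raise RuntimeError("All fallbacks failed: " + "; ".join(errors))

# Hypothetical usage, matching the order above:
# await with_fallbacks([lambda: web_extract(url), lambda: browser_navigate(url)])
```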
**Files to modify:** `model_tools.py`, tool implementations, new `fallback_manager.py`
---
## 12. Tools & Skills Wishlist 🧰
*Things that would need new tool implementations (can't do well with current tools):*
### High-Impact
- [ ] **Audio/Video Transcription** 🎬 *(See also: Section 16 for detailed spec)*
- Transcribe audio files, podcasts, YouTube videos
- Extract key moments from video
- Voice memo transcription for messaging integrations
- *Provider options: Whisper API, Deepgram, local Whisper*
- [ ] **Diagram Rendering** 📊
- Render Mermaid/PlantUML to actual images
- The agent can generate the diagram code, but rendering requires an external service or tool
- "Show me how these components connect" → actual visual diagram
### Medium-Impact
- [ ] **Canvas / Visual Workspace** 🖼️
- Agent-controlled visual panel for rendering interactive UI
- Inspired by OpenClaw's Canvas feature
- **Capabilities:**
- `present` / `hide` - Show/hide the canvas panel
- `navigate` - Load HTML files or URLs into the canvas
- `eval` - Execute JavaScript in the canvas context
- `snapshot` - Capture the rendered UI as an image
- **Use cases:**
- Display generated HTML/CSS/JS previews
- Show interactive data visualizations (charts, graphs)
- Render diagrams (Mermaid → rendered output)
- Present structured information in rich format
- A2UI-style component system for structured agent UI
- **Implementation options:**
- Electron-based panel for CLI
- WebSocket-connected web app
- VS Code webview extension
- *Would let the agent "show" things rather than just describe them*
- [ ] **Document Generation** 📄
- Create styled PDFs, Word docs, presentations
- *Can do basic PDF via terminal tools, but limited*
- [ ] **Diff/Patch Tool** 📝
- Surgical code modifications with preview
- "Change line 45-50 to X" without rewriting whole file
- Show diffs before applying
- *Can use `diff`/`patch` but a native tool would be safer*
### Skills to Create
- [ ] **Domain-specific skill packs:**
- DevOps/Infrastructure (Terraform, K8s, AWS)
- Data Science workflows (EDA, model training)
- Security/pentesting procedures
- [ ] **Framework-specific skills:**
- React/Vue/Angular patterns
- Django/Rails/Express conventions
- Database optimization playbooks
- [ ] **Troubleshooting flowcharts:**
- "Docker container won't start" → decision tree
- "Production is slow" → systematic diagnosis
---
## 13. Messaging Platform Integrations 💬
**Problem:** The agent currently only works via `cli.py`, which requires direct terminal access. Users may want to interact via messaging apps from their phone or other devices.
**Architecture:**
- `run_agent.py` already accepts a `conversation_history` parameter and returns updated messages ✅
- Need: persistent session storage, platform monitors, session key resolution
**Implementation approach:**
```
┌─────────────────────────────────────────────────────────────┐
│ Platform Monitor (e.g., telegram_monitor.py) │
│ ├─ Long-running daemon connecting to messaging platform │
│ ├─ On message: resolve session key → load history from disk│
│ ├─ Call run_agent.py with loaded history │
│ ├─ Save updated history back to disk (JSONL) │
│ └─ Send response back to platform │
└─────────────────────────────────────────────────────────────┘
```
**Platform support (each user sets up their own credentials):**
- [ ] **Telegram** - via `python-telegram-bot` or `grammy` equivalent
- Bot token from @BotFather
- Easiest to set up, good for personal use
- [ ] **Discord** - via `discord.py`
- Bot token from Discord Developer Portal
- Can work in servers (group sessions) or DMs
- [ ] **WhatsApp** - via `baileys` (WhatsApp Web protocol)
- QR code scan to authenticate
- More complex, but reaches most people
**Session management:**
- [ ] **Session store** - JSONL persistence per session key (sketched below)
- `~/.hermes/sessions/{session_key}.jsonl`
- Session keys: `telegram:dm:{user_id}`, `discord:channel:{id}`, etc.
- [ ] **Session expiry** - Configurable reset policies
- Daily reset (default 4am) OR idle timeout (e.g., 2 hours)
- Manual reset via `/reset` or `/new` command in chat
- [ ] **Session continuity** - Conversations persist across messages until reset
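A minimal sketch of the JSONL session store, one file per session key:
```python
import json
from pathlib import Path
from typing import Dict, List

SESSIONS_DIR = Path.home() / ".hermes" / "sessions"

def _path(session_key: str) -> Path:
    safe = session_key.replace(":", "_").replace("/", "_")  # e.g. telegram:dm:123
    return SESSIONS_DIR / f"{safe}.jsonl"

def load_history(session_key: str) -> List[Dict]:
    path = _path(session_key)
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines() if line]

def append_messages(session_key: str, messages: List[Dict]) -> None:
    SESSIONS_DIR.mkdir(parents=True, exist_ok=True)
    with _path(session_key).open("a", encoding="utf-8") as f:
        for m in messages:
            f.write(json.dumps(m, ensure_ascii=False) + "\n")
```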
**Files to create:** `monitors/telegram_monitor.py`, `monitors/discord_monitor.py`, `monitors/session_store.py`
---
## 14. Scheduled Tasks / Cron Jobs ⏰
**Problem:** Agent only runs on-demand. Some tasks benefit from scheduled execution (daily summaries, monitoring, reminders).
**Ideas:**
- [ ] **Cron-style scheduler** - Run agent turns on a schedule
- Store jobs in `~/.hermes/cron/jobs.json`
- Each job: `{ id, schedule, prompt, session_mode, delivery }`
- Uses APScheduler or a similar Python library (see the sketch after this list)
- [ ] **Session modes:**
- `isolated` - Fresh session each run (no history, clean context)
- `main` - Append to main session (agent remembers previous scheduled runs)
- [ ] **Delivery options:**
- Write output to file (`~/.hermes/cron/output/{job_id}/{timestamp}.md`)
- Send to messaging channel (if integrations enabled)
- Both
- [ ] **CLI interface:**
```bash
# List scheduled jobs
python cli.py --cron list
# Add a job (runs daily at 9am)
python cli.py --cron add "Summarize my email inbox" --schedule "0 9 * * *"
# Quick syntax for simple intervals
python cli.py --cron add "Check server status" --every 30m
# Remove a job
python cli.py --cron remove <job_id>
```
- [ ] **Agent self-scheduling** - Let the agent create its own cron jobs
- New tool: `schedule_task(prompt, schedule, session_mode)`
- "Remind me to check the deployment tomorrow at 9am"
- Agent can set follow-up tasks for itself
- [ ] **In-chat command:** `/cronjob {prompt} {frequency}` when using messaging integrations
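A minimal sketch of the scheduler on top of APScheduler; `run_job` is a hypothetical stand-in for invoking `run_agent.py` with the job's prompt and session mode:
```python
from typing import Dict, List

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger

async def run_job(prompt: str, session_mode: str) -> None:
    # Hypothetical: load history per session_mode, call run_agent.py, deliver output.
    print(f"[{session_mode}] running scheduled prompt: {prompt}")

def schedule_jobs(jobs: List[Dict]) -> AsyncIOScheduler:
    # Call from inside a running asyncio event loop (AsyncIOScheduler needs one).
    scheduler = AsyncIOScheduler()
    for job in jobs:
        scheduler.add_job(
            run_job,
            CronTrigger.from_crontab(job["schedule"]),  # e.g. "0 9 * * *"
            args=[job["prompt"], job.get("session_mode", "isolated")],
            id=job["id"],
        )
    scheduler.start()
    return scheduler
```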
**Files to create:** `cron/scheduler.py`, `cron/jobs.py`, `tools/schedule_tool.py`
---
## 15. Text-to-Speech (TTS) 🔊
**Problem:** Agent can only respond with text. Some users prefer audio responses (accessibility, hands-free use, podcasts).
**Ideas:**
- [ ] **TTS tool** - Generate audio files from text
```python
tts_generate(text="Here's your summary...", voice="nova", output="summary.mp3")
```
- Returns path to generated audio file
- For messaging integrations: can send as voice message
- [ ] **Provider options:**
- Edge TTS (free, good quality, many voices)
- OpenAI TTS (paid, excellent quality)
- ElevenLabs (paid, best quality, voice cloning)
- Local options (Coqui TTS, Bark)
- [ ] **Modes:**
- On-demand: User explicitly asks "read this to me"
- Auto-TTS: Configurable to always generate audio for responses
- Long-text handling: Summarize or chunk very long responses
- [ ] **Integration with messaging:**
- When enabled, can send voice notes instead of/alongside text
- User preference per channel
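A minimal sketch of the tool using the free Edge TTS provider option (the `edge-tts` package):
```python
import asyncio

import edge_tts

async def tts_generate(text: str, voice: str = "en-US-AriaNeural", output: str = "summary.mp3") -> str:
    """Render text to an audio file; a monitor could send it as a voice message."""
    await edge_tts.Communicate(text, voice).save(output)
    return output

# asyncio.run(tts_generate("Here's your summary..."))
```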
**Files to create:** `tools/tts_tool.py`, config in `cli-config.yaml`
---
## 16. Speech-to-Text / Audio Transcription 🎤
**Problem:** Users may want to send voice memos instead of typing, but the agent is blind to audio content.
**Ideas:**
- [ ] **Voice memo transcription** - For messaging integrations
- User sends voice message → transcribe → process as text
- Seamless: user speaks, agent responds
- [ ] **Audio/video file transcription** - Existing idea, expanded:
- Transcribe local audio files (mp3, wav, m4a)
- Transcribe YouTube videos (download audio → transcribe)
- Extract key moments with timestamps
- [ ] **Provider options:**
- OpenAI Whisper API (good quality, cheap)
- Deepgram (fast, good for real-time)
- Local Whisper (free, runs on GPU)
- Groq Whisper (fast, free tier available)
- [ ] **Tool interface:**
```python
transcribe(source="audio.mp3") # Local file
transcribe(source="https://youtube.com/...") # YouTube
transcribe(source="voice_message", data=bytes) # Voice memo
```
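A minimal sketch of the local-file branch using the OpenAI Whisper API provider option; the YouTube and voice-memo branches would layer on top:
```python
from openai import OpenAI

def transcribe(source: str) -> str:
    """Transcribe a local audio file via the Whisper API."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(source, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text
```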
**Files to create:** `tools/transcribe_tool.py`, integrate with messaging monitors
---
## Priority Order (Suggested)
1. **🎯 Subagent Architecture** - Critical for context management, enables everything else
2. **Memory & Context Management** - Complements subagents for remaining context
3. **Self-Reflection** - Improves reliability and reduces wasted tool calls
4. **Project-Local Context** - Practical win, keeps useful info across sessions
5. **Messaging Integrations** - Unlocks mobile access, new interaction patterns
6. **Scheduled Tasks / Cron Jobs** - Enables automation, reminders, monitoring
7. **Tool Composition** - Quality of life, builds on other improvements
8. **Dynamic Skills** - Force multiplier for repeated tasks
9. **Interactive Clarifying Questions** - Better UX for ambiguous tasks
10. **TTS / Audio Transcription** - Accessibility, hands-free use
---
## Removed Items (Unrealistic)
The following were removed because they're architecturally impossible:
- ~~Proactive suggestions / Prefetching~~ - Agent only runs on user request, can't interject
- ~~Clipboard integration~~ - No access to user's local system clipboard
The following **moved to active TODO** (now possible with new architecture):
- ~~Session save/restore~~ → See **Messaging Integrations** (session persistence)
- ~~Voice/TTS playback~~ → See **TTS** (can generate audio files, send via messaging)
- ~~Set reminders~~ → See **Scheduled Tasks / Cron Jobs**
The following were removed because they're **already possible**:
- ~~HTTP/API Client~~ → Use `curl` or Python `requests` in terminal
- ~~Structured Data Manipulation~~ → Use `pandas` in terminal
- ~~Git-Native Operations~~ → Use `git` CLI in terminal
- ~~Symbolic Math~~ → Use `SymPy` in terminal
- ~~Code Quality Tools~~ → Run linters (`eslint`, `black`, `mypy`) in terminal
- ~~Testing Framework~~ → Run `pytest`, `jest`, etc. in terminal
- ~~Translation~~ → LLM handles this fine, or use translation APIs
---
## 🧪 Brainstorm Ideas (Not Yet Fleshed Out)
*These are early-stage ideas that need more thinking before implementation. Captured here so they don't get lost.*
### Remote/Distributed Execution 🌐
**Concept:** Run agent on a powerful remote server while interacting from a thin client.
**Why interesting:**
- Run on beefy GPU server for local LLM inference
- Agent has access to remote machine's resources (files, tools, internet)
- User interacts via lightweight client (phone, low-power laptop)
**Open questions:**
- How does this differ from just SSH + running cli.py on remote?
- Would need secure communication channel (WebSocket? gRPC?)
- How to handle tool outputs that reference remote paths?
- Credential management for remote execution
- Latency considerations for interactive use
**Possible architecture:**
```
┌─────────────┐ ┌─────────────────────────┐
│ Thin Client │ ◄─────► │ Remote Hermes Server │
│ (phone/web) │ WS/API │ - Full agent + tools │
└─────────────┘ │ - GPU for local LLM │
│ - Access to server files│
└─────────────────────────┘
```
**Related to:** Messaging integrations (could be the "server" that monitors receive from)
---
### Multi-Agent Parallel Execution 🤖🤖
**Concept:** Extension of Subagent Architecture (Section 1) - run multiple subagents in parallel.
**Why interesting:**
- Independent subtasks don't need to wait for each other
- "Research X while setting up Y" - both run simultaneously
- Faster completion for complex multi-part tasks
**Open questions:**
- How to detect which tasks are truly independent?
- Resource management (API rate limits, concurrent connections)
- How to merge results when parallel tasks have conflicts?
- Cost implications of multiple parallel LLM calls
*Note: Basic subagent delegation (Section 1) should be implemented first, parallel execution is an optimization on top.*
---
### Plugin/Extension System 🔌
**Concept:** Allow users to add custom tools/skills without modifying core code.
**Why interesting:**
- Community contributions
- Organization-specific tools
- Clean separation of core vs. extensions
**Open questions:**
- Security implications of loading arbitrary code
- Versioning and compatibility
- Discovery and installation UX
---
*Last updated: $(date +%Y-%m-%d)* 🤖

View File

@@ -1,41 +0,0 @@
# Dockerfile for atropos-agent sandbox server
# Runs inside Nomad containers to handle tool execution
# Includes bubblewrap for namespace-based slot isolation
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
# Bubblewrap for namespace isolation
bubblewrap \
# `script` for PTY allocation (used for stable tmux+asciinema startup)
util-linux \
# Git for SWE-style tasks (cloning repos)
git \
# tmux for stateful terminal sessions (Phase 4.7+)
tmux \
# Common tools agents might need
curl \
wget \
jq \
# Cleanup
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies (sandbox server + optional terminal recording)
RUN pip install --no-cache-dir aiohttp asciinema
# Copy the sandbox server
COPY sandbox_server.py /app/sandbox_server.py
WORKDIR /app
# Create data directory for slot workspaces
RUN mkdir -p /data
# Verify bubblewrap is installed and working
RUN bwrap --version
EXPOSE 8080
# Default command - can be overridden by Nomad job spec
CMD ["python", "sandbox_server.py", "--port", "8080", "--slots", "10", "--data-dir", "/data"]

View File

@@ -1,46 +0,0 @@
"""
Atropos integration for Hermes-Agent.
This package is intentionally optional: Hermes-Agent should work without Atropos.
If you import anything from `atropos.*` without having `atroposlib` installed,
we raise a clear error with install instructions.
Install (recommended, from repo checkout):
uv sync --extra atropos
Or (pip / editable):
pip install -e '.[atropos]'
"""
from __future__ import annotations
def _require_atroposlib() -> None:
try:
import atroposlib # noqa: F401
except ModuleNotFoundError as exc: # pragma: no cover
raise ModuleNotFoundError(
"Hermes-Agent Atropos integration requires `atroposlib`, but it is not installed.\n"
"Install it with:\n"
" uv sync --extra atropos\n"
"or:\n"
" pip install -e '.[atropos]'\n"
) from exc
_require_atroposlib()
# Re-export the most commonly used pieces for convenience.
from .agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData # noqa: E402
from .envs import AgentEnv, AgentEnvConfig # noqa: E402
__all__ = [
"AtroposAgent",
"AgentConfig",
"AgentResult",
"AgentStep",
"SequenceData",
"AgentEnv",
"AgentEnvConfig",
]

View File

@@ -1,15 +0,0 @@
"""
Agent abstractions for atropos-agent.
Provides the core AtroposAgent class for running ReACT-style agent loops.
"""
from .atropos_agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData
__all__ = [
"AtroposAgent",
"AgentConfig",
"AgentResult",
"AgentStep",
"SequenceData",
]

View File

@@ -1,850 +0,0 @@
"""
ReACT-style agent implementation for atropos-agent.
This module provides the core AtroposAgent class that implements a basic
Reason-Act-Observe loop with tool calling capabilities.
Uses ManagedServer from atroposlib for automatic token/logprob tracking,
making trajectories ready for RL training.
The agent uses Hermes-style XML tags for tool calls:
- <think>...</think> for reasoning
- <tool_call>{"name": "...", "arguments": {...}}</tool_call> for actions
- <tool_response>...</tool_response> for observations
"""
import asyncio
import os
import json
import time
from contextlib import asynccontextmanager
from dataclasses import dataclass, field
from uuid import uuid4
from typing import Any, AsyncGenerator, Awaitable, Callable, Dict, List, Optional, Union
from dotenv import load_dotenv
import httpx
from ..tools import ToolCall, ToolRegistry, ToolResult
from atroposlib.envs.server_handling.managed_server import ManagedServer
load_dotenv()
# Default system prompt with tool calling instructions.
AGENT_SYSTEM_PROMPT = """You are a deep thinking AI. You MUST enclose your internal reasoning inside <think>...</think> tags.
You are a function calling AI model.
You are provided with function signatures within <tools></tools> XML tags.
You must call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
You can ONLY respond without a tool call if you are totally certain you have the final answer to the user's question or task.
After calling & executing a function, you will be provided with function results within <tool_response></tool_response> XML tags.
Here are the available tools:
<tools>
{tools_json}
</tools>
Use the following JSON schema for each tool call you will make:
{"title": "FunctionCall", "type": "object", "properties": {"name": {"title": "Name", "type": "string"}, "arguments": {"title": "Arguments", "type": "object"}}, "required": ["name", "arguments"]}
## REQUIRED TOOL FORMAT
When you decide to call a tool, your assistant message MUST be:
1) exactly one <think>...</think> block, followed by
2) one or more <tool_call>...</tool_call> blocks,
and NOTHING else in that message.
If you need to explain anything, put it inside <think>. Do NOT write natural language outside <think> or <tool_call>.
For each function call return a JSON object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{"name": "<function-name>", "arguments": {"arg1": "value1"}}
</tool_call>
Each <tool_call> must be on its own and contain ONLY the JSON object (no extra text).
The JSON inside <tool_call> MUST be valid JSON with double quotes.
Do NOT output <tool_response> in an assistant message.
After you receive tool results, you may either call more tools (same required format) or provide the final answer.
When providing the final answer, do NOT include any <tool_call> blocks.
## TERMINAL TOOL NOTES
- Commands execute under POSIX `/bin/sh` (not bash).
- Each tool call runs in a fresh shell: environment changes (like `cd` or venv activation) do not persist across tool calls.
- Avoid bash-only features like `source`, `[[ ... ]]`, or process substitution.
- Prefer explicit venv usage:
- `python -m venv .venv && . .venv/bin/activate && python -m pip install -e .` (POSIX `.` activation), or
- `.venv/bin/python -m pip install -e .` (no activation required).
## ICL (examples)
User: Show the current directory.
Assistant:
<think>I should run pwd.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "pwd"}}
</tool_call>
User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
Assistant: /tmp
User: List files, then count them.
Assistant:
<think>I should count files.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "ls -1 | wc -l"}}
</tool_call>
User: <tool_response>{"success": true, "output": "3\\n"}</tool_response>
Assistant: 3
User: Run pwd, then print ok (two tool calls).
Assistant:
<think>I should run two commands.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "pwd"}}
</tool_call>
<tool_call>
{"name": "terminal", "arguments": {"command": "echo ok"}}
</tool_call>
User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
User: <tool_response>{"success": true, "output": "ok\\n"}</tool_response>
Assistant: ok
"""
@dataclass
class AgentConfig:
"""Configuration for the AtroposAgent."""
# Generation parameters
temperature: Optional[float] = 0.7
# Default to "let the backend decide" (important for tool-tag completions that may be longer).
max_tokens: Optional[int] = None
# Agent behavior
max_steps: int = 50
system_prompt: Optional[str] = None
tool_delay_s: float = 0.0
# Working directory for tools
working_dir: Optional[str] = None
@dataclass
class SequenceData:
"""Token/logprob data from a single completion."""
full_text: str
tokens: List[int]
masked_tokens: List[int] # -100 for prompt, actual IDs for completion
logprobs: List[float] # 1.0 for prompt, actual values for completion
metadata: Optional[Dict[str, Any]] = None
@classmethod
def from_sequence_node(cls, node) -> "SequenceData":
"""Create from a ManagedServer SequenceNode."""
return cls(
full_text=node.full_text,
tokens=node.tokens,
masked_tokens=node.masked_tokens,
logprobs=node.logprobs,
metadata=getattr(node, "metadata", None),
)
@dataclass
class AgentStep:
"""A single step in the agent's trajectory."""
step_number: int
assistant_message: str
tool_calls: List[ToolCall] = field(default_factory=list)
tool_results: List[ToolResult] = field(default_factory=list)
sequence_data: Optional[SequenceData] = None # Token data from this step
@property
def has_tool_calls(self) -> bool:
return len(self.tool_calls) > 0
@dataclass
class AgentResult:
"""Result of running an agent trajectory."""
success: bool
final_response: str
steps: List[AgentStep] = field(default_factory=list)
total_tokens: int = 0
error: Optional[str] = None
metadata: Dict[str, Any] = field(default_factory=dict)
# Full trajectory token data for RL training
trajectory_data: Optional[SequenceData] = None
@property
def num_steps(self) -> int:
return len(self.steps)
@property
def total_tool_calls(self) -> int:
return sum(len(step.tool_calls) for step in self.steps)
def to_messages(self) -> List[Dict[str, str]]:
"""Convert trajectory to messages format for logging."""
messages = []
for step in self.steps:
messages.append({"role": "assistant", "content": step.assistant_message})
if step.tool_results:
# Combine all tool responses
responses = "\n".join(r.to_xml() for r in step.tool_results)
messages.append({"role": "user", "content": responses})
return messages
def to_scored_data(self, score: float) -> Optional[Dict[str, Any]]:
"""
Convert to format suitable for ScoredDataGroup.
Args:
score: The score for this trajectory
Returns:
Dict with tokens, masks, scores suitable for training, or None if no data
"""
if self.trajectory_data is None:
return None
return {
"tokens": self.trajectory_data.tokens,
"masks": self.trajectory_data.masked_tokens,
"scores": score,
"logprobs": self.trajectory_data.logprobs,
}
class AtroposAgent:
"""
A ReACT-style agent that uses LLMs with tool calling.
This implementation wraps ManagedServer for automatic token/logprob tracking,
making trajectories ready for RL training.
Example:
# `server` may be an Atropos `ServerManager` (recommended) or a single `APIServer`.
# In practice, environments usually construct this via `BaseEnv`.
server = ...
tools = ToolRegistry()
tools.register(BashTool())
agent = AtroposAgent(server=server, tools=tools)
result = await agent.run("List the files in the current directory")
# Access token data for training
if result.trajectory_data:
print(f"Tokens: {result.trajectory_data.tokens}")
print(f"Masked: {result.trajectory_data.masked_tokens}")
"""
def __init__(
self,
server, # ServerManager or APIServer
tools: Optional[ToolRegistry] = None,
config: Optional[AgentConfig] = None,
tokenizer: Optional[Any] = None,
execute_tool: Optional[Callable[[ToolCall], Awaitable[ToolResult]]] = None,
):
self.server = server
self.tools = tools or ToolRegistry()
self.config = config or AgentConfig()
self.tokenizer = tokenizer or getattr(server, "tokenizer", None)
self.execute_tool = execute_tool or self.tools.execute
@asynccontextmanager
async def _managed(self) -> AsyncGenerator[Any, None]:
"""
Yield a ManagedServer-like object.
- If `self.server` is a ServerManager, use its `managed_server()` context manager.
- If `self.server` is a single APIServer, wrap it in `ManagedServer` directly.
"""
if os.getenv("ATROPOS_BYPASS_MANAGED_SERVER") == "1":
yield _DirectChatCompletionClient(server=self.server)
return
if hasattr(self.server, "managed_server"):
async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
yield managed
else:
managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
try:
yield managed
finally:
managed.reset()
def _build_system_prompt(self) -> str:
"""Build the system prompt with tool descriptions."""
if self.config.system_prompt:
return self.config.system_prompt
tools_json = self.tools.get_prompt_tool_definitions_json()
# Avoid `str.format()` here because the prompt contains many literal `{}` braces
# in JSON examples; we only want to substitute the single `{tools_json}` token.
return AGENT_SYSTEM_PROMPT.replace("{tools_json}", tools_json)
def _infer_server_model_for_debug(self) -> Optional[str]:
"""
Best-effort inference of the configured model name for debug payload saving.
ManagedServer/server_manager typically injects `model` internally, so `chat_kwargs`
may not contain it. For replaying saved payloads via curl, it's useful to persist it.
"""
servers = getattr(self.server, "servers", None)
if isinstance(servers, list) and servers:
s0 = servers[0]
cfg = getattr(s0, "config", None)
model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
if isinstance(model, str) and model:
return model
model = getattr(self.server, "model_name", None) or getattr(self.server, "model", None)
if isinstance(model, str) and model:
return model
return None
def _infer_server_base_url_for_debug(self) -> Optional[str]:
"""
Best-effort inference of the configured base_url for debug logging.
This is helpful when diagnosing hangs / retries at the transport layer.
"""
servers = getattr(self.server, "servers", None)
if isinstance(servers, list) and servers:
s0 = servers[0]
cfg = getattr(s0, "config", None)
base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
if isinstance(base_url, str) and base_url:
return base_url
base_url = getattr(self.server, "base_url", None)
if isinstance(base_url, str) and base_url:
return base_url
return None
def _extract_response_metadata(self, response: Any) -> Dict[str, Any]:
"""
Extract lightweight, JSON-serializable metadata from an OpenAI-style response.
This is useful for debugging training runs, especially when ManagedServer state
tracking is unavailable (e.g. OpenAI-compatible chat endpoints).
"""
meta: Dict[str, Any] = {}
try:
rid = getattr(response, "id", None)
if isinstance(rid, str) and rid:
meta["id"] = rid
model = getattr(response, "model", None)
if isinstance(model, str) and model:
meta["model"] = model
created = getattr(response, "created", None)
if isinstance(created, int):
meta["created"] = created
system_fingerprint = getattr(response, "system_fingerprint", None)
if isinstance(system_fingerprint, str) and system_fingerprint:
meta["system_fingerprint"] = system_fingerprint
choices = getattr(response, "choices", None)
if isinstance(choices, list) and choices:
fr = getattr(choices[0], "finish_reason", None)
if isinstance(fr, str) and fr:
meta["finish_reason"] = fr
usage = getattr(response, "usage", None)
if usage is not None:
if hasattr(usage, "model_dump"):
meta["usage"] = usage.model_dump()
elif isinstance(usage, dict):
meta["usage"] = usage
except Exception:
pass
return meta
def _debug_dump_request(self, *, step_num: int, chat_kwargs: Dict[str, Any]) -> None:
if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST") != "1":
return
try:
# Avoid dumping megabytes by default; messages can be huge.
meta = {
"step": step_num,
"base_url": self._infer_server_base_url_for_debug(),
"model": chat_kwargs.get("model") or self._infer_server_model_for_debug(),
"chat_kwargs_keys": sorted(list(chat_kwargs.keys())),
"n": chat_kwargs.get("n"),
"max_tokens": chat_kwargs.get("max_tokens"),
"temperature": chat_kwargs.get("temperature"),
"num_messages": len(chat_kwargs.get("messages") or []),
}
print("\n=== ATROPOS_DEBUG_AGENT_REQUEST ===", flush=True)
print(meta, flush=True)
if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_FULL") == "1":
payload = dict(chat_kwargs)
# Make the payload more legible and less huge.
try:
dumped = json.dumps(payload, ensure_ascii=False, indent=2)
except Exception:
dumped = repr(payload)
print("\n=== ATROPOS_DEBUG_AGENT_REQUEST_FULL ===", flush=True)
print(dumped[:200_000], flush=True)
# Optional: save the FULL request payload to disk (no truncation).
save_dir = os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_SAVE_DIR")
if save_dir:
os.makedirs(save_dir, exist_ok=True)
payload: Dict[str, Any] = dict(chat_kwargs)
if "model" not in payload:
model = self._infer_server_model_for_debug()
if model:
payload["model"] = model
# Use a unique filename so parallel trajectories don't clobber each other.
fname = os.path.join(
save_dir,
f"atropos_agent_request_step{step_num}_{int(time.time()*1000)}_{os.getpid()}_{uuid4().hex}.json",
)
with open(fname, "w", encoding="utf-8") as f:
json.dump(payload, f, ensure_ascii=False, indent=2)
print(f"[AtroposAgent] saved request payload: {fname}", flush=True)
except Exception:
return
def _debug_dump_response(self, *, step_num: int, response: Any) -> None:
if os.getenv("ATROPOS_DEBUG_AGENT_RESPONSE") != "1":
return
print("\n=== ATROPOS_DEBUG_AGENT_RESPONSE ===", flush=True)
print({"step": step_num, "type": type(response).__name__}, flush=True)
try:
dumped = response.model_dump() # openai pydantic model
except Exception:
dumped = getattr(response, "__dict__", {"repr": repr(response)})
# Keep the dump bounded; we only need enough to see the assistant message content.
text = str(dumped)
print(text[:200_000], flush=True)
async def _chat_completion_with_debug(
self, *, managed: Any, step_num: int, chat_kwargs: Dict[str, Any]
) -> Any:
"""
Call `managed.chat_completion()` with optional timeout + richer failure logging.
Debug env vars:
- `ATROPOS_AGENT_CHAT_TIMEOUT_S`: overrides the default 240s timeout for `asyncio.wait_for` (values above 240s are clamped).
- `ATROPOS_DEBUG_AGENT_WAIT_EVERY_S`: if set, prints a heartbeat while waiting.
"""
# Hard guardrail: never allow a single chat completion to block for too long.
# This is essential for RL data-gen stability; long hangs should be treated as failures (score=0).
timeout_s_raw = os.getenv("ATROPOS_AGENT_CHAT_TIMEOUT_S")
timeout_s_default = 240.0
timeout_s = float(timeout_s_raw) if timeout_s_raw else timeout_s_default
timeout_s = min(timeout_s, 240.0)
wait_every_raw = os.getenv("ATROPOS_DEBUG_AGENT_WAIT_EVERY_S")
wait_every_s = float(wait_every_raw) if wait_every_raw else None
async def _await_call() -> Any:
if not wait_every_s or wait_every_s <= 0:
return await managed.chat_completion(**chat_kwargs)
# Heartbeat mode: wait in chunks without cancelling the underlying request.
# NOTE: do NOT use `asyncio.wait_for(task, timeout=...)` here, because a timeout
# will cancel the task and surface as `CancelledError` on the next loop.
task = asyncio.create_task(managed.chat_completion(**chat_kwargs))
t0 = time.perf_counter()
try:
while True:
done, _pending = await asyncio.wait({task}, timeout=wait_every_s)
if task in done:
return task.result()
waited = time.perf_counter() - t0
print(
f"[AtroposAgent] step={step_num} still waiting for chat_completion... ({waited:.1f}s)",
flush=True,
)
except asyncio.CancelledError:
task.cancel()
raise
try:
return await asyncio.wait_for(_await_call(), timeout=timeout_s)
except asyncio.TimeoutError as e:
print("\n=== ATROPOS_DEBUG_AGENT_CHAT_TIMEOUT ===", flush=True)
print({"step": step_num, "timeout_s": timeout_s}, flush=True)
raise RuntimeError(f"chat_completion timed out after {timeout_s:.1f}s") from e
except asyncio.CancelledError:
# Treat cancellation as a hard failure rather than crashing the whole env run.
# (Atropos/BaseEnv may cancel tasks during shutdown or retries.)
raise RuntimeError("chat_completion cancelled") from None
except Exception as e:
detail: Dict[str, Any] = {
"step": step_num,
"exc_type": type(e).__name__,
"exc_str": str(e),
}
if isinstance(e, httpx.HTTPStatusError):
try:
detail["status_code"] = e.response.status_code
detail["response_text"] = e.response.text[:20_000]
except Exception:
pass
elif isinstance(e, httpx.RequestError):
detail["request"] = repr(getattr(e, "request", None))
print("\n=== ATROPOS_DEBUG_AGENT_CHAT_FAILURE ===", flush=True)
print(detail, flush=True)
raise
async def run(
self,
task: str,
initial_messages: Optional[List[Dict[str, str]]] = None,
) -> AgentResult:
"""
Run the agent on a task using ManagedServer for token tracking.
Args:
task: The task/prompt for the agent
initial_messages: Optional additional context messages
Returns:
AgentResult with the trajectory, final response, and token data
"""
messages = [
{"role": "system", "content": self._build_system_prompt()},
]
if initial_messages:
messages.extend(initial_messages)
messages.append({"role": "user", "content": task})
steps = []
final_response = ""
final_node = None
final_prompt_messages: Optional[List[Dict[str, str]]] = None
last_node = None
last_prompt_messages: Optional[List[Dict[str, str]]] = None
last_response_text: str = ""
# Use ManagedServer for automatic token tracking
async with self._managed() as managed:
for step_num in range(self.config.max_steps):
# One ReACT iteration: model call -> execute tools -> append observations; the loop ends when a step makes no tool calls.
try:
# Keep a copy of the prompt messages used for this completion.
# Useful for reconstructing tokens/masks when state tracking is unavailable.
prompt_messages = list(messages)
chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
if self.config.max_tokens is not None:
chat_kwargs["max_tokens"] = self.config.max_tokens
if self.config.temperature is not None:
chat_kwargs["temperature"] = self.config.temperature
t_req = time.perf_counter()
print(
f"[AtroposAgent] step={step_num+1} chat_completion start "
f"(messages={len(messages)}, max_tokens={self.config.max_tokens}, temp={self.config.temperature})",
flush=True,
)
self._debug_dump_request(step_num=step_num + 1, chat_kwargs=chat_kwargs)
response = await self._chat_completion_with_debug(
managed=managed, step_num=step_num + 1, chat_kwargs=chat_kwargs
)
self._debug_dump_response(step_num=step_num + 1, response=response)
response_meta = self._extract_response_metadata(response)
print(
f"[AtroposAgent] step={step_num+1} chat_completion done in {time.perf_counter() - t_req:.2f}s",
flush=True,
)
current_node = None
if hasattr(managed, "get_state"):
state = managed.get_state()
nodes = state.get("nodes", [])
current_node = nodes[-1] if nodes else None
except Exception as e:
return AgentResult(
success=False,
final_response="",
steps=steps,
error=f"Generation error: {str(e)}",
)
msg = response.choices[0].message
# Some OpenAI-compatible servers populate `message.reasoning` and leave `content=""`.
response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
tool_calls = ToolCall.parse_from_text(response_text)
last_node = current_node
last_prompt_messages = prompt_messages
last_response_text = response_text
step_sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None
if step_sequence_data is None:
if response_meta:
# We still want metadata for debugging even if token/logprob state tracking is unavailable.
step_sequence_data = SequenceData(
full_text=response_text,
tokens=[],
masked_tokens=[],
logprobs=[],
metadata=response_meta,
)
else:
merged = dict(response_meta)
node_meta = step_sequence_data.metadata
if isinstance(node_meta, dict):
merged.update(node_meta)
step_sequence_data.metadata = merged or step_sequence_data.metadata
step = AgentStep(
step_number=step_num + 1,
assistant_message=response_text,
tool_calls=tool_calls,
sequence_data=step_sequence_data,
)
if not tool_calls:
steps.append(step)
final_response = response_text
final_node = current_node
final_prompt_messages = prompt_messages
break
messages.append({"role": "assistant", "content": response_text})
tool_responses = []
for call in tool_calls:
result = await self.execute_tool(call)
step.tool_results.append(result)
tool_responses.append(result.to_xml())
if self.config.tool_delay_s > 0:
await asyncio.sleep(self.config.tool_delay_s)
steps.append(step)
responses_text = "\n".join(tool_responses)
# Tool observations are represented as user content with Hermes-style tags.
# This is compatible with most OpenAI-compatible chat APIs and ensures
# tokenizers/chat templates include tool outputs during training.
messages.append({"role": "user", "content": responses_text})
else:
# Reached max steps without completing
# Return a failure result but include the last observed completion so callers can
# record the trajectory (score=0) without triggering retries.
final_response = last_response_text or final_response
final_node = last_node
final_prompt_messages = last_prompt_messages
trajectory_data = None
if final_node:
trajectory_data = SequenceData.from_sequence_node(final_node)
elif final_prompt_messages is not None and self.tokenizer is not None:
if hasattr(self.tokenizer, "apply_chat_template"):
prompt_text = self.tokenizer.apply_chat_template(
final_prompt_messages, tokenize=False, add_generation_prompt=True
)
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
else:
prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
tokens = prompt_tokens + output_tokens
masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
trajectory_data = SequenceData(
full_text=f"{prompt_text}{final_response}",
tokens=tokens,
masked_tokens=masked_tokens,
logprobs=logprobs,
)
# Preserve response metadata (if any) even on failure trajectories.
try:
if trajectory_data is not None and steps:
last_step = steps[-1]
if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
trajectory_data.metadata = dict(last_step.sequence_data.metadata)
except Exception:
pass
return AgentResult(
success=False,
final_response=final_response,
steps=steps,
error=f"Reached maximum steps ({self.config.max_steps})",
trajectory_data=trajectory_data,
)
# Build result with trajectory data
trajectory_data = None
if final_node:
trajectory_data = SequenceData.from_sequence_node(final_node)
elif final_prompt_messages is not None and self.tokenizer is not None:
if hasattr(self.tokenizer, "apply_chat_template"):
prompt_text = self.tokenizer.apply_chat_template(
final_prompt_messages, tokenize=False, add_generation_prompt=True
)
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
else:
prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
tokens = prompt_tokens + output_tokens
masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
trajectory_data = SequenceData(
full_text=f"{prompt_text}{final_response}",
tokens=tokens,
masked_tokens=masked_tokens,
logprobs=logprobs,
)
# Ensure trajectory_data carries the most recent metadata we observed (if any).
try:
if trajectory_data is not None and steps:
last_step = steps[-1]
if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
trajectory_data.metadata = dict(last_step.sequence_data.metadata)
except Exception:
pass
return AgentResult(
success=True,
final_response=final_response,
steps=steps,
trajectory_data=trajectory_data,
)
async def run_single_turn(
self,
messages: List[Dict[str, str]],
execute_tools: bool = True,
) -> tuple[str, List[ToolResult], Optional[SequenceData]]:
"""
Run a single turn of the agent (one LLM call + tool execution).
This is useful for integration with BaseEnv where you want more
control over the loop.
Args:
messages: The conversation history
execute_tools: Whether to execute parsed tool calls
Returns:
Tuple of (response_text, tool_results, sequence_data)
"""
async with self._managed() as managed:
chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
if self.config.max_tokens is not None:
chat_kwargs["max_tokens"] = self.config.max_tokens
if self.config.temperature is not None:
chat_kwargs["temperature"] = self.config.temperature
self._debug_dump_request(step_num=1, chat_kwargs=chat_kwargs)
response = await self._chat_completion_with_debug(managed=managed, step_num=1, chat_kwargs=chat_kwargs)
self._debug_dump_response(step_num=1, response=response)
current_node = None
if hasattr(managed, "get_state"):
state = managed.get_state()
nodes = state.get("nodes", [])
current_node = nodes[-1] if nodes else None
msg = response.choices[0].message
response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
tool_results = []
if execute_tools:
tool_calls = ToolCall.parse_from_text(response_text)
for call in tool_calls:
result = await self.execute_tool(call)
tool_results.append(result)
sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None
return response_text, tool_results, sequence_data
class _DirectChatCompletionClient:
"""
Minimal stand-in for ManagedServer that calls the OpenAI-compatible endpoint directly.
This is for isolating issues where `ManagedServer.chat_completion()` hangs or misbehaves.
It intentionally does NOT do token/logprob tracking.
"""
def __init__(self, server: Any):
self._server = server
def _server_config(self) -> tuple[str, str, str]:
# ServerManager case: first configured server.
servers = getattr(self._server, "servers", None)
if isinstance(servers, list) and servers:
s0 = servers[0]
cfg = getattr(s0, "config", None)
base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
api_key = getattr(cfg, "api_key", None) or getattr(s0, "api_key", None)
model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
return base_url.rstrip("/"), api_key, model
# APIServer-like fallback.
base_url = getattr(self._server, "base_url", None)
api_key = getattr(self._server, "api_key", None)
model = getattr(self._server, "model_name", None) or getattr(self._server, "model", None)
if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
return base_url.rstrip("/"), api_key, model
raise RuntimeError("Unable to resolve server base_url/api_key/model for direct chat completion")
async def chat_completion(self, *, messages: List[Dict[str, str]], n: int = 1, **kwargs: Any) -> Any:
base_url, api_key, model = self._server_config()
url = f"{base_url}/chat/completions"
payload: Dict[str, Any] = {
"model": model,
"messages": messages,
"n": n,
}
# Pass through common generation kwargs.
for k in ("max_tokens", "temperature", "top_p", "presence_penalty", "frequency_penalty", "stop"):
if k in kwargs and kwargs[k] is not None:
payload[k] = kwargs[k]
timeout_s = float(os.getenv("ATROPOS_DIRECT_REQUEST_TIMEOUT_S") or "120")
print(f"[AtroposAgent] DIRECT chat_completion POST {url} (timeout={timeout_s}s)", flush=True)
async with httpx.AsyncClient(timeout=timeout_s) as client:
resp = await client.post(
url,
headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
json=payload,
)
resp.raise_for_status()
data = resp.json()
# Return a very small object compatible with the code paths that read
# `response.choices[0].message.content`.
class _Msg:
def __init__(self, d: Dict[str, Any]):
self.content = d.get("content")
self.reasoning = d.get("reasoning")
class _Choice:
def __init__(self, d: Dict[str, Any]):
self.message = _Msg(d.get("message") or {})
class _Resp:
def __init__(self, d: Dict[str, Any]):
self._d = d
self.choices = [_Choice(c) for c in (d.get("choices") or [])]
def model_dump(self) -> Dict[str, Any]:
return self._d
return _Resp(data)

View File

@@ -1,6 +0,0 @@
"""
FastAPI services for atropos-agent.
- tool_executor_server: queued/batched sandbox tool execution (Phase 4)
"""

View File

@@ -1,254 +0,0 @@
"""
Tool Executor API (Phase 4)
This service provides a queued, batched execution layer on top of a ToolBackend.
It mirrors the stateful FastAPI + app.state pattern used in:
atropos/atroposlib/api/server.py
Run (dev):
uv run uvicorn atropos_agent.api.tool_executor_server:app --host 0.0.0.0 --port 9001
"""
from __future__ import annotations
import os
from typing import Any, Dict, Optional
from pathlib import Path
from fastapi import FastAPI, Header, HTTPException, status
from pydantic import BaseModel, Field
from ..backends.nomad_backend import NomadBackendConfig, NomadToolBackend
from ..tools import ToolRegistry, build_tool_registry
from ..tools.base import (
ArtifactArchiveRequestPayload,
ArtifactArchiveResponsePayload,
ArtifactListRequestPayload,
ArtifactListResponsePayload,
ArtifactReadRequestPayload,
ArtifactReadResponsePayload,
ToolExecutorExecuteRequest,
ToolExecutorReleaseRequest,
ToolResultPayload,
)
from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig
class ToolExecutorServerConfig(BaseModel):
nomad_address: str = Field(default="http://localhost:4646")
job_id: str = Field(default="atropos-sandbox-tool-executor")
image: str = Field(default="atropos-sandbox:local")
slots_per_container: int = Field(default=10)
min_containers: int = Field(default=1)
max_containers: int = Field(default=10)
privileged: bool = Field(default=False)
acquire_timeout_s: float = Field(default=30.0)
batch_window_ms: int = Field(default=20)
max_batch_size: int = Field(default=200)
allow_network: bool = Field(default=True)
tool_server_url: Optional[str] = Field(default=None)
tool_server_token: Optional[str] = Field(default=None)
token: Optional[str] = Field(default=None, description="Bearer token required for requests (optional in dev).")
purge_job_on_shutdown: bool = Field(default=True)
@classmethod
def from_env(cls) -> "ToolExecutorServerConfig":
# In dev, prefer loading secrets/config from the repo-local `.env` (not committed).
try:
from dotenv import load_dotenv # type: ignore
except Exception: # pragma: no cover
load_dotenv = None # type: ignore[assignment]
if load_dotenv is not None:
env_path = Path(__file__).resolve().parents[2] / ".env"
if env_path.exists():
load_dotenv(dotenv_path=env_path)
def _get_bool(name: str, default: bool) -> bool:
raw = os.getenv(name)
if raw is None:
return default
return raw.strip().lower() in {"1", "true", "yes", "y", "on"}
return cls(
nomad_address=os.getenv("TOOL_EXECUTOR_NOMAD_ADDRESS", "http://localhost:4646"),
job_id=os.getenv("TOOL_EXECUTOR_JOB_ID", "atropos-sandbox-tool-executor"),
image=os.getenv("TOOL_EXECUTOR_IMAGE", "atropos-sandbox:local"),
slots_per_container=int(os.getenv("TOOL_EXECUTOR_SLOTS", "10")),
min_containers=int(os.getenv("TOOL_EXECUTOR_MIN_CONTAINERS", "1")),
max_containers=int(os.getenv("TOOL_EXECUTOR_MAX_CONTAINERS", "10")),
privileged=_get_bool("TOOL_EXECUTOR_PRIVILEGED", False),
acquire_timeout_s=float(os.getenv("TOOL_EXECUTOR_ACQUIRE_TIMEOUT_S", "30.0")),
batch_window_ms=int(os.getenv("TOOL_EXECUTOR_BATCH_WINDOW_MS", "20")),
max_batch_size=int(os.getenv("TOOL_EXECUTOR_MAX_BATCH_SIZE", "200")),
allow_network=_get_bool("TOOL_EXECUTOR_ALLOW_NETWORK", True),
tool_server_url=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_URL") or None,
tool_server_token=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_TOKEN") or None,
token=os.getenv("TOOL_EXECUTOR_TOKEN") or None,
purge_job_on_shutdown=_get_bool("TOOL_EXECUTOR_PURGE_JOB_ON_SHUTDOWN", True),
)
app = FastAPI(title="Atropos-Agent Tool Executor")
@app.get("/")
async def root() -> Dict[str, str]:
return {"message": "Atropos-Agent Tool Executor"}
def _check_auth(cfg: ToolExecutorServerConfig, authorization: Optional[str]) -> None:
if not cfg.token:
return
if not authorization:
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
if not authorization.lower().startswith("bearer "):
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
token = authorization.split(" ", 1)[1].strip()
if token != cfg.token:
raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")
@app.on_event("startup")
async def _startup() -> None:
cfg = ToolExecutorServerConfig.from_env()
# Default to Atropos "full" tool surface: sandbox + external (if tool_server_url provided).
tools: ToolRegistry = build_tool_registry(
enabled_toolsets=["full"],
disabled_toolsets=None,
tool_server_url=cfg.tool_server_url,
)
backend = NomadToolBackend(
NomadBackendConfig(
nomad_address=cfg.nomad_address,
sandbox_job_id=cfg.job_id,
sandbox_image=cfg.image,
slots_per_container=cfg.slots_per_container,
min_containers=cfg.min_containers,
max_containers=cfg.max_containers,
privileged=cfg.privileged,
acquire_timeout_s=cfg.acquire_timeout_s,
purge_job_on_start=False,
)
)
await backend.start()
executor = ToolExecutor(
backend=backend,
tools=tools,
config=ToolExecutorConfig(
batch_window_ms=cfg.batch_window_ms,
max_batch_size=cfg.max_batch_size,
allow_network=cfg.allow_network,
tool_server_url=cfg.tool_server_url,
tool_server_token=cfg.tool_server_token,
),
)
await executor.start()
app.state.cfg = cfg
app.state.backend = backend
app.state.executor = executor
@app.on_event("shutdown")
async def _shutdown() -> None:
executor: Optional[ToolExecutor] = getattr(app.state, "executor", None)
backend: Optional[NomadToolBackend] = getattr(app.state, "backend", None)
cfg: Optional[ToolExecutorServerConfig] = getattr(app.state, "cfg", None)
if executor is not None:
await executor.close()
if backend is not None:
await backend.stop(purge=bool(cfg.purge_job_on_shutdown) if cfg else False)
@app.get("/health")
async def health() -> Dict[str, Any]:
return {"status": "ok"}
@app.get("/status")
async def status_endpoint() -> Dict[str, Any]:
executor: ToolExecutor = app.state.executor
backend: NomadToolBackend = app.state.backend
return {
"queue_size": executor.queue_size(),
"total_requests": executor.total_requests,
"total_errors": executor.total_errors,
"pool": backend.get_stats(),
}
@app.post("/execute", response_model=ToolResultPayload)
async def execute_tool(
req: ToolExecutorExecuteRequest,
authorization: Optional[str] = Header(default=None),
) -> ToolResultPayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
result = await executor.execute(
trajectory_id=req.trajectory_id,
call=req.tool.to_tool_call(),
timeout_s=req.timeout_s,
)
return ToolResultPayload.from_tool_result(result)
@app.post("/release")
async def release_trajectory(
req: ToolExecutorReleaseRequest,
authorization: Optional[str] = Header(default=None),
) -> Dict[str, Any]:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
await executor.release_trajectory(req.trajectory_id, reset_workspace=req.reset_workspace)
return {"status": "ok"}
@app.post("/artifacts/read", response_model=ArtifactReadResponsePayload)
async def artifacts_read(
req: ArtifactReadRequestPayload,
authorization: Optional[str] = Header(default=None),
) -> ArtifactReadResponsePayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
return await executor.read_artifact(req)
@app.post("/artifacts/list", response_model=ArtifactListResponsePayload)
async def artifacts_list(
req: ArtifactListRequestPayload,
authorization: Optional[str] = Header(default=None),
) -> ArtifactListResponsePayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
return await executor.list_artifacts(req)
@app.post("/artifacts/archive", response_model=ArtifactArchiveResponsePayload)
async def artifacts_archive(
req: ArtifactArchiveRequestPayload,
authorization: Optional[str] = Header(default=None),
) -> ArtifactArchiveResponsePayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
return await executor.archive_artifacts(req)

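A minimal client sketch for this API (not part of the source): the request shapes below are inferred from the handlers above — `trajectory_id`, a `tool` payload with `name`/`arguments`, and an optional `timeout_s` — and the bearer header is only needed when `TOOL_EXECUTOR_TOKEN` is set.

import asyncio

import httpx

async def demo() -> None:
    headers = {"Authorization": "Bearer dev-token"}  # only if TOOL_EXECUTOR_TOKEN is set
    async with httpx.AsyncClient(base_url="http://localhost:9001", headers=headers, timeout=60.0) as client:
        body = {
            "trajectory_id": "demo-traj-1",
            "tool": {"name": "terminal", "arguments": {"command": "echo hello"}},  # assumed payload shape
            "timeout_s": 30.0,
        }
        r = await client.post("/execute", json=body)
        r.raise_for_status()
        print(r.json())  # ToolResultPayload fields (success/output/error/uniq_id) assumed
        # Release the trajectory's slot (and reset its workspace) when done.
        await client.post("/release", json={"trajectory_id": "demo-traj-1", "reset_workspace": True})

asyncio.run(demo())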
View File

@@ -1,140 +0,0 @@
"""
External ToolServer (Phase 4.5+).
This server executes tools that must NOT run inside the sandbox, typically
because they require credentials or access to external services.
Run (dev):
uv run uvicorn atropos_agent.api.tool_server:app --host 0.0.0.0 --port 9002
"""
from __future__ import annotations
import asyncio
import os
import inspect
from typing import Any, Dict, Optional
from pathlib import Path
from fastapi import FastAPI, Header, HTTPException, status
from pydantic import BaseModel, Field
from ..tools import ToolRegistry, build_tool_registry
from ..tools.base import ToolResultPayload, ToolServerExecuteRequest
class ToolServerConfig(BaseModel):
token: Optional[str] = Field(
default=None,
description="Bearer token required for requests (optional in dev).",
)
max_concurrency: int = Field(default=16, ge=1, description="Max concurrent tool executions.")
@classmethod
def from_env(cls) -> "ToolServerConfig":
# In dev, prefer loading secrets from the repo-local `.env` (not committed).
try:
from dotenv import load_dotenv # type: ignore
except Exception: # pragma: no cover
load_dotenv = None # type: ignore[assignment]
if load_dotenv is not None:
env_path = Path(__file__).resolve().parents[2] / ".env"
if env_path.exists():
load_dotenv(dotenv_path=env_path)
token = os.getenv("TOOL_SERVER_TOKEN") or None
max_concurrency = int(os.getenv("TOOL_SERVER_MAX_CONCURRENCY", "16"))
return cls(token=token, max_concurrency=max_concurrency)
app = FastAPI(title="Atropos-Agent Tool Server")
@app.get("/")
async def root() -> Dict[str, str]:
return {"message": "Atropos-Agent Tool Server"}
@app.on_event("startup")
async def _startup() -> None:
cfg = ToolServerConfig.from_env()
# External-only registry. It will only include tools that are enabled by toolsets and
# whose Hermes requirements/keys are satisfied in this process.
tools: ToolRegistry = build_tool_registry(
enabled_toolsets=["all"],
disabled_toolsets=["terminal", "sandbox", "filesystem", "terminal_stateful", "default"],
tool_server_url="enabled",
)
app.state.cfg = cfg
app.state.tools = tools
app.state.semaphore = asyncio.Semaphore(cfg.max_concurrency)
@app.get("/health")
async def health() -> Dict[str, Any]:
return {"status": "ok"}
@app.get("/tools")
async def list_tools() -> Dict[str, Any]:
tools: ToolRegistry = app.state.tools
return {"tools": [s.to_dict() for s in tools.get_schemas()]}
def _check_auth(cfg: ToolServerConfig, authorization: Optional[str]) -> None:
if not cfg.token:
return
if not authorization:
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
if not authorization.lower().startswith("bearer "):
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
token = authorization.split(" ", 1)[1].strip()
if token != cfg.token:
raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")
@app.post("/execute", response_model=ToolResultPayload)
async def execute_tool(
req: ToolServerExecuteRequest,
authorization: Optional[str] = Header(default=None),
) -> ToolResultPayload:
cfg: ToolServerConfig = app.state.cfg
_check_auth(cfg, authorization)
tools: ToolRegistry = app.state.tools
sem: asyncio.Semaphore = app.state.semaphore
tool = tools.get(req.tool.name)
if tool is None:
return ToolResultPayload(
success=False,
error=f"Unknown tool: {req.tool.name}",
uniq_id=req.tool.uniq_id,
)
async with sem:
try:
kwargs = dict(req.tool.arguments)
sig = inspect.signature(tool.execute).parameters
# Some tools can benefit from extra context.
if req.trajectory_id and "trajectory_id" in sig:
kwargs["trajectory_id"] = req.trajectory_id
if req.slot_id and "slot_id" in sig:
kwargs["slot_id"] = req.slot_id
if req.container_addr and "container_addr" in sig:
kwargs["container_addr"] = req.container_addr
if "task_id" in sig:
kwargs["task_id"] = req.trajectory_id
result = await tool.execute(**kwargs)
except Exception as e:
return ToolResultPayload(
success=False,
error=f"Tool execution error: {e}",
uniq_id=req.tool.uniq_id,
)
if result.uniq_id is None:
result.uniq_id = req.tool.uniq_id
return ToolResultPayload.from_tool_result(result)

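A matching client sketch for the external ToolServer (assumptions as above: field names are inferred from the handler, `web_search` is a hypothetical external tool name, and the bearer header is only needed when `TOOL_SERVER_TOKEN` is set):

import asyncio

import httpx

async def demo() -> None:
    headers = {"Authorization": "Bearer dev-token"}  # omit if TOOL_SERVER_TOKEN is unset
    async with httpx.AsyncClient(base_url="http://localhost:9002", headers=headers, timeout=60.0) as client:
        schemas = (await client.get("/tools")).json()["tools"]
        print([s.get("name") for s in schemas])  # schema dict keys assumed
        body = {
            "trajectory_id": "demo-traj-1",
            "tool": {"name": "web_search", "arguments": {"query": "atropos"}, "uniq_id": "call-1"},
        }
        r = await client.post("/execute", json=body)
        print(r.json())

asyncio.run(demo())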
View File

@@ -1,27 +0,0 @@
from __future__ import annotations
from typing import Any
from .base import ToolBackend
from .modal_backend import ModalSandboxConfig, ModalToolBackend
from .nomad_backend import NomadBackendConfig, NomadToolBackend
def create_tool_backend(cfg: Any) -> ToolBackend:
mode = str(getattr(cfg, "tool_pool_mode", "nomad")).strip().lower()
if mode == "nomad":
return NomadToolBackend(NomadBackendConfig.from_agent_env_config(cfg))
if mode == "modal":
return ModalToolBackend(ModalSandboxConfig.from_agent_env_config(cfg))
raise ValueError(f"Unknown tool_pool_mode: {mode}")
__all__ = [
"ToolBackend",
"create_tool_backend",
"NomadBackendConfig",
"NomadToolBackend",
"ModalSandboxConfig",
"ModalToolBackend",
]

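Usage sketch for the factory: any object exposing the expected attributes works (an AgentEnvConfig is the usual source); the values below are illustrative defaults.

from types import SimpleNamespace

from atropos_agent.backends import create_tool_backend

cfg = SimpleNamespace(
    tool_pool_mode="nomad",
    nomad_address="http://localhost:4646",
    sandbox_job_id="atropos-sandbox-demo",  # hypothetical job id
    sandbox_image="atropos-sandbox:local",
    slots_per_container=10,
    min_containers=1,
    max_containers=4,
    privileged=False,
    acquire_timeout_s=30.0,
)
backend = create_tool_backend(cfg)  # -> NomadToolBackend
# Then, inside an event loop: await backend.start() ... await backend.stop(purge=True)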
View File

@@ -1,89 +0,0 @@
"""
Backend interfaces for AgentEnv tool execution.
The goal of this module is to decouple ToolExecutor / AgentEnv from any single
execution backend (Nomad/Docker today; Modal later).
"""
from __future__ import annotations
from typing import Any, Dict, List, Optional, Protocol, Tuple
from ..slots.executor import ExecutionResult
from ..slots.slot import Slot
class ToolBackend(Protocol):
"""
Minimal interface required by ToolExecutor.
Backends provide:
- lifecycle (start/stop)
- slot acquisition/release (workspace affinity)
- batched tool execution across slots
- optional artifact helpers (for env verification / demos)
"""
@property
def default_timeout_s(self) -> Optional[float]:
"""Default sandbox execution timeout in seconds (if any)."""
async def start(self) -> None:
"""Start the backend (provision workers/containers, health checks, etc)."""
async def stop(self, *, purge: bool = False) -> None:
"""Stop the backend and optionally purge remote resources."""
async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
"""Acquire a slot for a trajectory (workspace affinity)."""
async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
"""Release a slot back to the pool."""
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
*,
timeout_s: Optional[float] = None,
) -> List[ExecutionResult]:
"""Execute a batch of sandbox tool calls and return results in order."""
# ---------------------------------------------------------------------
# Optional artifact helpers (supported by the Nomad sandbox-server today)
# ---------------------------------------------------------------------
async def read_artifact(
self,
slot: Slot,
path: str,
*,
encoding: str = "text",
max_bytes: Optional[int] = None,
include_sha256: bool = False,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
raise NotImplementedError
async def list_artifacts(
self,
slot: Slot,
path: str = ".",
*,
recursive: bool = False,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
raise NotImplementedError
async def archive_artifacts(
self,
slot: Slot,
path: str = ".",
*,
archive_format: str = "tar.gz",
max_bytes: Optional[int] = None,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
raise NotImplementedError

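Because ToolBackend is a structural Protocol, any object with matching methods satisfies it — no inheritance needed. A toy in-process stub, useful for unit-testing ToolExecutor wiring (the dict stand-ins for Slot/ExecutionResult are an assumption; the real types live in `..slots`):

from typing import Any, Dict, List, Optional, Tuple

class EchoBackend:
    """Echoes tool calls back without any sandbox; satisfies ToolBackend structurally."""

    @property
    def default_timeout_s(self) -> Optional[float]:
        return 30.0

    async def start(self) -> None:
        pass

    async def stop(self, *, purge: bool = False) -> None:
        pass

    async def acquire(self, trajectory_id: Optional[str] = None) -> Any:
        return {"trajectory_id": trajectory_id}  # stand-in Slot

    async def release(self, slot: Any, *, reset_workspace: bool = False) -> None:
        pass

    async def execute_batch(
        self,
        requests: List[Tuple[Any, str, Dict[str, Any]]],
        *,
        timeout_s: Optional[float] = None,
    ) -> List[Any]:
        # One fake "result" per (slot, tool_name, arguments) triple, in order.
        return [{"tool": name, "args": args} for _slot, name, args in requests]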
File diff suppressed because it is too large

View File

@@ -1,156 +0,0 @@
"""
Nomad/Docker tool backend.
This backend is the current default for AgentEnv: it provisions a Nomad job
running `sandbox_server.py` and multiplexes stateless slots inside each container.
"""
from __future__ import annotations
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple
from ..slots import Slot, SlotPool, SlotPoolConfig
from ..slots.executor import ExecutionResult
from .base import ToolBackend
@dataclass(frozen=True)
class NomadBackendConfig:
nomad_address: str
sandbox_job_id: str
sandbox_image: str
slots_per_container: int
min_containers: int
max_containers: int
privileged: bool
acquire_timeout_s: float
purge_job_on_start: bool
# Driver selection: "docker" or "singularity"
driver: str = "docker"
# Path to .sif file for singularity driver (required if driver="singularity")
singularity_image: Optional[str] = None
@classmethod
def from_agent_env_config(cls, cfg: Any) -> "NomadBackendConfig":
return cls(
nomad_address=str(getattr(cfg, "nomad_address")),
sandbox_job_id=str(getattr(cfg, "sandbox_job_id")),
sandbox_image=str(getattr(cfg, "sandbox_image")),
slots_per_container=int(getattr(cfg, "slots_per_container")),
min_containers=int(getattr(cfg, "min_containers")),
max_containers=int(getattr(cfg, "max_containers")),
privileged=bool(getattr(cfg, "privileged")),
acquire_timeout_s=float(getattr(cfg, "acquire_timeout_s")),
purge_job_on_start=bool(getattr(cfg, "purge_job_on_start", False)),
driver=str(getattr(cfg, "driver", "docker")),
singularity_image=getattr(cfg, "singularity_image", None),
)
class NomadToolBackend(ToolBackend):
def __init__(self, config: NomadBackendConfig):
self.config = config
self.pool = SlotPool(
SlotPoolConfig(
nomad_address=config.nomad_address,
job_id=config.sandbox_job_id,
image=config.sandbox_image,
slots_per_container=config.slots_per_container,
min_containers=config.min_containers,
max_containers=config.max_containers,
privileged=config.privileged,
acquire_timeout=config.acquire_timeout_s,
purge_job_on_start=bool(config.purge_job_on_start),
driver=config.driver,
singularity_image=config.singularity_image,
)
)
@property
def default_timeout_s(self) -> Optional[float]:
t = getattr(self.pool.executor, "timeout", None)
total = getattr(t, "total", None)
try:
return float(total) if total is not None else None
except Exception:
return None
async def start(self) -> None:
await self.pool.start()
async def stop(self, *, purge: bool = False) -> None:
await self.pool.stop(purge_job=purge)
async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
return await self.pool.acquire(trajectory_id)
async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
await self.pool.release(slot, reset_workspace=reset_workspace)
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
*,
timeout_s: Optional[float] = None,
) -> List[ExecutionResult]:
return await self.pool.execute_batch(requests, timeout=timeout_s)
async def read_artifact(
self,
slot: Slot,
path: str,
*,
encoding: str = "text",
max_bytes: Optional[int] = None,
include_sha256: bool = False,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
return await self.pool.executor.read_artifact(
slot,
path,
encoding=encoding,
max_bytes=max_bytes,
include_sha256=include_sha256,
timeout=timeout_s,
)
async def list_artifacts(
self,
slot: Slot,
path: str = ".",
*,
recursive: bool = False,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
return await self.pool.executor.list_artifacts(
slot,
path,
recursive=recursive,
max_entries=max_entries,
timeout=timeout_s,
)
async def archive_artifacts(
self,
slot: Slot,
path: str = ".",
*,
archive_format: str = "tar.gz",
max_bytes: Optional[int] = None,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
return await self.pool.executor.archive_artifacts(
slot,
path,
archive_format=archive_format,
max_bytes=max_bytes,
max_entries=max_entries,
timeout=timeout_s,
)
def get_stats(self) -> Dict[str, Any]:
return self.pool.get_stats()

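Configuration sketch for the Singularity path (HPC clusters without root Docker). The .sif path is hypothetical, and whether `sandbox_image` is ignored under the singularity driver is an assumption:

backend = NomadToolBackend(
    NomadBackendConfig(
        nomad_address="http://localhost:4646",
        sandbox_job_id="atropos-sandbox-hpc",  # hypothetical job id
        sandbox_image="atropos-sandbox:local",  # likely unused with driver="singularity" (assumption)
        slots_per_container=10,
        min_containers=1,
        max_containers=4,
        privileged=False,
        acquire_timeout_s=30.0,
        purge_job_on_start=True,
        driver="singularity",
        singularity_image="/scratch/images/atropos-sandbox.sif",  # hypothetical path
    )
)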
View File

@@ -1,10 +0,0 @@
"""
Environment implementations for atropos-agent.
"""
from .agent_env import AgentEnv, AgentEnvConfig
# NOTE: Additional example envs exist as modules (e.g. `test_env`, `swe_smith_oracle_env`),
# but are intentionally not imported here to avoid pulling heavy optional deps at import time.
__all__ = ["AgentEnv", "AgentEnvConfig"]

View File

@@ -1,526 +0,0 @@
"""
AgentEnv - Atropos BaseEnv extension for agent/tool-call workloads.
AgentEnv is responsible for starting the sandbox tool execution backend and
providing helpers for running agent trajectories with queued/batched tool calls.
"""
from __future__ import annotations
import os
import asyncio
import time
import uuid
from abc import ABC, abstractmethod
from typing import Any, Awaitable, Callable, Dict, Generic, List, Optional, Tuple, TypeVar
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, BaseEnv, BaseEnvConfig, Item, ScoredDataGroup, ScoredDataItem
from atroposlib.envs.server_handling.server_baseline import AsyncSemWithAdaptiveWeight
from ..agent import AgentConfig, AgentResult, AtroposAgent
from ..backends import ToolBackend, create_tool_backend
from ..tools import ToolRegistry, build_tool_registry
from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig
# Main BaseEnv child classes. Subclass these to get agent+tooling functionality easily.
class AgentEnvConfig(BaseEnvConfig):
tool_pool_mode: str = Field(default="nomad", description="Tool execution backend ('nomad' or 'modal')")
allow_network: bool = Field(
default=True,
description="Whether sandbox bash commands may access the network (env policy).",
)
require_sandbox: bool = Field(
default=False,
description="Fail closed if bubblewrap sandboxing is unavailable/unusable for stateless sandbox tools.",
)
require_stateful_sandbox: bool = Field(
default=False,
description="Fail closed if bubblewrap/PID isolation is unavailable for stateful terminal tools (tmux).",
)
tool_batch_window_ms: int = Field(default=20, description="ToolExecutor batching window (ms)")
tool_max_batch_size: int = Field(default=200, description="ToolExecutor maximum batch size")
# Nomad mode settings. TODO: add Modal support and split this into its own config
nomad_address: str = Field(default="http://localhost:4646", description="Nomad API address")
sandbox_job_id: str = Field(default="atropos-sandbox-agent-env", description="Nomad job id for sandbox containers")
sandbox_image: str = Field(default="atropos-sandbox:local", description="Docker image for sandbox containers")
slots_per_container: int = Field(default=10, description="Nomad mode: slots per container")
min_containers: int = Field(default=1, description="Nomad mode: minimum containers")
max_containers: int = Field(default=10, description="Nomad mode: maximum containers")
privileged: bool = Field(default=False, description="Nomad mode: run container privileged")
acquire_timeout_s: float = Field(default=30.0, description="Slot acquisition timeout (seconds)")
purge_job_on_start: bool = Field(
default=False,
description=(
"Nomad mode: stop/purge the sandbox job on startup. This is helpful in local dev and training runs "
"to recover from previous crashes that leave the job in a restart backoff state."
),
)
purge_job_on_shutdown: bool = Field(default=True, description="Nomad mode: stop/purge job on shutdown")
# Nomad driver selection (docker or singularity)
driver: str = Field(
default="docker",
description="Nomad task driver: 'docker' (default) or 'singularity' (for HPC without sudo Docker)",
)
singularity_image: Optional[str] = Field(
default=None,
description="Path to .sif file for Singularity driver (required if driver='singularity')",
)
# modal mode settings (stub; implementation pending)
modal_app_name: str = Field(default="atropos-sandbox", description="Modal app name (stub)")
modal_function_name: str = Field(default="sandbox_server", description="Modal function/actor name (stub)")
modal_volume_name: Optional[str] = Field(default=None, description="Modal Volume name for persistent storage (stub)")
modal_volume_mount_path: str = Field(default="/data", description="Modal Volume mount path (stub)")
# basic agent defaults
agent_max_steps: int = Field(default=50, description="Max ReACT steps per trajectory")
agent_temperature: float = Field(default=0.7, description="Sampling temperature")
agent_max_tokens: Optional[int] = Field(
default=None,
description="Max tokens per model response (default: let backend decide)",
)
agent_tool_delay_s: float = Field(default=0.0, description="Delay between tool calls (seconds)")
# tool selection
enabled_toolsets: List[str] = Field(
default_factory=lambda: ["default"],
description="Toolsets to enable (Hermes-style grouping).",
)
disabled_toolsets: List[str] = Field(
default_factory=list,
description="Toolsets to disable (applied after enabled_toolsets).",
)
# external ToolServer routing (Phase 4.5+)
tool_server_url: Optional[str] = Field(
default=None,
description="Base URL for external ToolServer (enables external tools).",
)
tool_server_token: Optional[str] = Field(
default=None,
description="Bearer token for ToolServer auth (optional in dev).",
)
AgentEnvConfigT = TypeVar("AgentEnvConfigT", bound="AgentEnvConfig")
class AgentEnv(BaseEnv, ABC, Generic[AgentEnvConfigT]):
env_config_cls = AgentEnvConfig
def __init__(
self,
config: AgentEnvConfigT,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self.config: AgentEnvConfigT = config
self.tools: ToolRegistry = self.build_tools()
self._backend: Optional[ToolBackend] = None
self._tool_executor: Optional[ToolExecutor] = None
self._tool_server_inprocess: bool = False
self._trajectory_workspace_meta: Dict[str, Dict[str, Any]] = {}
def build_tools(self) -> ToolRegistry:
"""Wraps original Hermes-Agent ToolRegistry for atropos AgentEnv use.
See Hermes-Agent docs for toolsets and available tools etc.
"""
return build_tool_registry(
enabled_toolsets=self.config.enabled_toolsets or ["default"],
disabled_toolsets=self.config.disabled_toolsets or None,
tool_server_url=self.config.tool_server_url,
)
@abstractmethod
def build_task(self, item: Item) -> str:
"""Return the user-facing task string for the agent."""
@abstractmethod
async def score_trajectory(self, item: Item, final_response: str) -> float:
"""Return a scalar score for this trajectory."""
async def setup_trajectory_workspace(
self,
item: Item,
*,
trajectory_id: str,
exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
) -> Dict[str, Any]:
"""
Optional hook: prepare the sandbox workspace before the agent starts.
Examples:
- clone a repo and checkout a commit
- write fixture files (e.g. images) for external-tool demos
- pre-install dependencies
Default: no-op.
"""
_ = (item, trajectory_id, exec_tool)
return {}
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str,
exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
agent_result: Optional[AgentResult] = None,
workspace_meta: Optional[Dict[str, Any]] = None,
) -> tuple[float, Dict[str, Any]]:
"""
Optional hook: run in-sandbox verification before scoring.
Many agent envs need to execute verification inside the same trajectory
workspace (e.g. pytest) before releasing/resetting the slot.
Default: calls `score_trajectory()` and returns empty metadata.
"""
_ = (trajectory_id, exec_tool, agent_result, workspace_meta) # default ignores in-workspace verification
score = await self.score_trajectory(item, final_response)
return score, {}
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
return AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
tool_delay_s=self.config.agent_tool_delay_s,
)
async def setup(self) -> None:
print(f"[AgentEnv] setup(): starting tool backend ({self.config.tool_pool_mode})", flush=True)
await self._start_tool_backend()
print("[AgentEnv] setup(): configuring server concurrency", flush=True)
self._configure_server_concurrency()
print("[AgentEnv] setup(): running env-specific setup_agent_env()", flush=True)
await self.setup_agent_env()
print("[AgentEnv] setup(): done", flush=True)
def _configure_server_concurrency(self) -> None:
"""
Ensure the LLM server concurrency isn't accidentally capped below `group_size`.
In `BaseEnv process` mode, groups are collected concurrently; if the underlying
ServerManager/OpenAIServer semaphore is left at 1, we serialize inference even
when `--env.group_size` is > 1.
"""
desired = int(getattr(self.config, "group_size", 1) or 1)
if desired <= 1:
return
servers = getattr(self.server, "servers", None)
if not isinstance(servers, list) or not servers:
return
for s in servers:
sem = getattr(s, "sem", None)
eval_sem = getattr(s, "eval_sem", None)
# Only increase; never shrink.
if sem is not None and getattr(sem, "max_val", 0) < desired:
s.sem = AsyncSemWithAdaptiveWeight(desired)
if hasattr(s, "config") and hasattr(s.config, "num_max_requests_at_once"):
s.config.num_max_requests_at_once = desired
if eval_sem is not None and getattr(eval_sem, "max_val", 0) < desired:
s.eval_sem = AsyncSemWithAdaptiveWeight(desired)
if hasattr(s, "config") and hasattr(s.config, "num_requests_for_eval"):
s.config.num_requests_for_eval = desired
@abstractmethod
async def setup_agent_env(self) -> None:
"""Subclass hook for env-specific setup."""
async def evaluate(self, *args, **kwargs): # noqa: ARG002
"""
Default eval hook (no-op).
Atropos BaseEnv requires an `evaluate()` implementation. Many agent envs
won't have a meaningful evaluation path during early PoC work; they can
override this when needed.
"""
return {}
async def env_manager(self):
try:
return await super().env_manager()
finally:
await self.shutdown_tool_backend()
async def process_manager(self):
try:
return await super().process_manager()
finally:
await self.shutdown_tool_backend()
async def _start_tool_backend(self) -> None:
if self._tool_executor is not None:
return
tool_server_url = self.config.tool_server_url
tool_server_client = None
if tool_server_url == "inprocess":
import httpx
from ..api.tool_server import app as tool_server_app
await tool_server_app.router.startup()
tool_server_client = httpx.AsyncClient(
transport=httpx.ASGITransport(app=tool_server_app),
base_url="http://toolserver",
)
tool_server_url = "http://toolserver"
self._tool_server_inprocess = True
backend = create_tool_backend(self.config)
await backend.start()
executor = ToolExecutor(
backend=backend,
tools=self.tools,
config=ToolExecutorConfig(
batch_window_ms=self.config.tool_batch_window_ms,
max_batch_size=self.config.tool_max_batch_size,
allow_network=self.config.allow_network,
require_sandbox=self.config.require_sandbox,
require_stateful_sandbox=self.config.require_stateful_sandbox,
tool_server_url=tool_server_url,
tool_server_token=self.config.tool_server_token,
),
)
await executor.start()
if tool_server_client is not None:
executor._tool_server_client = tool_server_client # type: ignore[attr-defined]
self._backend = backend
self._tool_executor = executor
async def shutdown_tool_backend(self) -> None:
executor = self._tool_executor
backend = self._backend
inprocess_tool_server = self._tool_server_inprocess
self._tool_executor = None
self._backend = None
self._tool_server_inprocess = False
if executor is not None:
await executor.close()
if backend is not None:
await backend.stop(purge=bool(self.config.purge_job_on_shutdown))
if inprocess_tool_server:
from ..api.tool_server import app as tool_server_app
await tool_server_app.router.shutdown()
async def collect_trajectory(
self, item: Item
) -> Tuple[Optional[ScoredDataItem], List[Item]]:
if self._tool_executor is None:
raise RuntimeError("Tool backend not started")
trajectory_id = str(uuid.uuid4())
t0 = time.perf_counter()
print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} start", flush=True)
task = self.build_task(item)
agent_config = self.build_agent_config(item)
if os.getenv("ATROPOS_DEBUG_PRINT_TASK") == "1":
print(f"Starting trajectory {trajectory_id} with task: {task}", flush=True)
else:
# Avoid printing the full task prompt by default (can be huge/noisy).
one_line = " ".join(str(task).splitlines()).strip()
preview = one_line[:240] + ("" if len(one_line) > 240 else "")
print(f"Starting trajectory {trajectory_id} (task preview): {preview}", flush=True)
async def _exec(call):
return await self._tool_executor.execute(trajectory_id, call)
agent = AtroposAgent(
server=self.server,
tokenizer=self.tokenizer,
tools=self.tools,
config=agent_config,
execute_tool=_exec,
)
try:
print(f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() start", flush=True)
workspace_meta = await self.setup_trajectory_workspace(item, trajectory_id=trajectory_id, exec_tool=_exec)
if not isinstance(workspace_meta, dict):
workspace_meta = {}
self._trajectory_workspace_meta[trajectory_id] = workspace_meta
print(
f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() done in {time.perf_counter() - t0:.2f}s",
flush=True,
)
print(f"[AgentEnv] tid={trajectory_id} agent.run() start", flush=True)
result = await agent.run(task)
print(
f"[AgentEnv] tid={trajectory_id} agent.run() done in {time.perf_counter() - t0:.2f}s "
f"success={result.success} tool_calls={result.total_tool_calls}",
flush=True,
)
if not result.success or result.trajectory_data is None:
# Do not trigger BaseEnv retries for agent failures.
# Record the trajectory with score 0.0 so training/eval can see the failure mode.
messages = [{"role": "system", "content": agent._build_system_prompt()}] # noqa: SLF001
messages.append({"role": "user", "content": task})
for step in result.steps:
messages.append({"role": "assistant", "content": step.assistant_message})
if step.tool_results:
tool_text = "\n".join(r.to_xml() for r in step.tool_results)
messages.append({"role": "user", "content": tool_text})
scored: ScoredDataItem = {
"tokens": (result.trajectory_data.tokens if result.trajectory_data else []),
"masks": (result.trajectory_data.masked_tokens if result.trajectory_data else []),
"scores": 0.0,
}
if result.trajectory_data is not None:
scored["inference_logprobs"] = result.trajectory_data.logprobs # type: ignore[typeddict-unknown-key]
if getattr(result.trajectory_data, "metadata", None):
scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
if self.config.include_messages:
# Record a final failure marker as a user-side tool_response-like block so it survives templates.
import json
err = result.error or "agent_failed"
messages.append(
{
"role": "user",
"content": f"<tool_response>{json.dumps({'success': False, 'error': err})}</tool_response>",
}
)
scored["messages"] = messages
return scored, []
print(f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() start", flush=True)
score, score_metadata = await self.verify_and_score_trajectory(
item,
result.final_response,
trajectory_id=trajectory_id,
exec_tool=_exec,
agent_result=result,
workspace_meta=workspace_meta,
)
print(
f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() done in {time.perf_counter() - t0:.2f}s "
f"score={score}",
flush=True,
)
messages = [{"role": "system", "content": agent._build_system_prompt()}] # noqa: SLF001
messages.append({"role": "user", "content": task})
for step in result.steps:
messages.append({"role": "assistant", "content": step.assistant_message})
if step.tool_results:
tool_text = "\n".join(r.to_xml() for r in step.tool_results)
messages.append({"role": "user", "content": tool_text})
# Optional: allow env verification to attach additional messages (e.g. install logs).
if self.config.include_messages and isinstance(score_metadata, dict):
extra = score_metadata.get("verification_messages")
if isinstance(extra, list):
for m in extra:
if isinstance(m, dict) and isinstance(m.get("role"), str) and isinstance(m.get("content"), str):
messages.append({"role": m["role"], "content": m["content"]})
scored: ScoredDataItem = {
"tokens": result.trajectory_data.tokens,
"masks": result.trajectory_data.masked_tokens,
"scores": score,
}
# Atroposlib expects policy logprobs at the *group* level under `inference_logprobs`.
# We stash per-item values here and lift them into the group in `collect_trajectories()`.
scored["inference_logprobs"] = result.trajectory_data.logprobs # type: ignore[typeddict-unknown-key]
if getattr(result.trajectory_data, "metadata", None):
scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
if self.config.include_messages:
scored["messages"] = messages
return scored, []
finally:
self._trajectory_workspace_meta.pop(trajectory_id, None)
print(f"[AgentEnv] tid={trajectory_id} release_trajectory(reset_workspace=True)", flush=True)
await self._tool_executor.release_trajectory(trajectory_id, reset_workspace=True)
print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} done in {time.perf_counter() - t0:.2f}s", flush=True)
async def collect_trajectories(
self, item: Item
) -> Tuple[Optional[ScoredDataGroup], List[Item]]:
tasks = [self.collect_trajectory(item) for _ in range(self.config.group_size)]
results = await asyncio.gather(*tasks)
backlog: List[Item] = []
items: List[ScoredDataItem] = []
for scored, b in results:
backlog.extend(b)
if scored is not None:
items.append(scored)
if len(items) != self.config.group_size:
return None, backlog
group: ScoredDataGroup = ScoredDataGroup(
tokens=[],
masks=[],
scores=[],
advantages=[],
ref_logprobs=[],
messages=[] if self.config.include_messages else None,
inference_logprobs=[],
group_overrides={},
overrides=[],
images=[],
generation_params=None,
)
for it in items:
group["tokens"].append(it["tokens"])
group["masks"].append(it["masks"])
group["scores"].append(it["scores"])
# policy logprobs (for PPO/GRPO training) if present
lp = it.get("inference_logprobs") # type: ignore[typeddict-item]
if lp is not None:
group["inference_logprobs"].append(lp)
group["overrides"].append(it.get("overrides") or {}) # type: ignore[typeddict-item]
if group.get("messages") is not None and it.get("messages") is not None:
group["messages"].append(it["messages"])
return group, backlog
async def run_agent(self, task: str, *, trajectory_id: Optional[str] = None) -> Tuple[str, Dict[str, Any]]:
"""
Run the AtroposAgent on a single task and return (final_response, debug).
This is a helper intended for simple environments and tests.
"""
if self._tool_executor is None:
raise RuntimeError("Tool backend not started")
tid = trajectory_id or str(uuid.uuid4())
async def _exec(call):
return await self._tool_executor.execute(tid, call)
agent = AtroposAgent(
server=self.server,
tokenizer=self.tokenizer,
tools=self.tools,
config=AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
),
execute_tool=_exec,
)
result = await agent.run(task)
await self._tool_executor.release_trajectory(tid, reset_workspace=True)
return result.final_response, {"success": result.success, "error": result.error, "tool_calls": result.total_tool_calls}

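A minimal subclass sketch (assumptions: the config defaults suffice for the tool backend, and the trivial scoring below is purely illustrative). The smoke environments that follow are complete, runnable versions of this pattern, including `config_init()` for CLI use.

from atroposlib.envs.base import Item

from .agent_env import AgentEnv, AgentEnvConfig

class EchoEnvConfig(AgentEnvConfig):
    pass

class EchoEnv(AgentEnv[EchoEnvConfig]):
    name = "echo_env"
    env_config_cls = EchoEnvConfig

    async def setup_agent_env(self) -> None:
        return None  # no env-specific setup

    async def get_next_item(self) -> Item:
        return {"prompt": "Reply with exactly: OK"}

    def build_task(self, item: Item) -> str:
        return str(item.get("prompt") or "")

    async def score_trajectory(self, item: Item, final_response: str) -> float:
        return 1.0 if final_response.strip() == "OK" else 0.0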
View File

@@ -1,171 +0,0 @@
"""
Hermes-Agent + Atropos (Nomad sandbox) compatibility smoke environment.
This environment is intended to validate, end-to-end:
BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server
It forces the model to use a sandbox tool by asking it to run a command that
generates a high-entropy token inside the sandbox, then repeat it exactly.
Run (process mode):
uv run python -m atropos_agent.envs.hermes_compat_test_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
"""
from __future__ import annotations
import os
from typing import Any, Dict, List, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig, AgentResult
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
def _forced_tool_item() -> Item:
# Use double quotes in the shell command and show JSON escaping explicitly.
# This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
return {
"command": cmd,
"prompt": (
"You are acting as an agent inside a sandboxed environment.\n"
"You MUST use the terminal tool to execute commands.\n"
"Run this exact command:\n"
f"{cmd}\n"
"When you call the tool, use valid JSON inside <tool_call>. Example:\n"
'<tool_call>{"name": "terminal", "arguments": {"command": '
'"python -c \\\\"import secrets; print(secrets.token_hex(16))\\\\""}}'
"</tool_call>\n"
"Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
"Do not guess. Do not explain."
),
}
class HermesCompatTestEnvConfig(AgentEnvConfig):
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible chat server (without /v1).",
)
server_model: str = Field(default="hermes-4-36b", description="Model name")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class HermesCompatTestEnv(AgentEnv[HermesCompatTestEnvConfig]):
name = "hermes_compat_test_env"
env_config_cls = HermesCompatTestEnvConfig
def __init__(
self,
config: HermesCompatTestEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iter = 0
@classmethod
def config_init(cls) -> Tuple[HermesCompatTestEnvConfig, List[APIServerConfig]]:
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = HermesCompatTestEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
include_messages=True,
ensure_scores_are_not_same=False,
total_steps=2,
batch_size=1,
server_base_url=base_url,
server_model=model,
# Tooling: sandbox-only terminal.
enabled_toolsets=["terminal"],
disabled_toolsets=[],
# Default to Nomad sandboxing; users can override via --env.* args.
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
# In local dev it's common for a previous crash to leave the job in backoff.
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=120,
)
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
return None
async def get_next_item(self) -> Item:
self._iter += 1
return _forced_tool_item()
def build_task(self, item: Item) -> str:
return str(item.get("prompt") or "")
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
# Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
return AgentConfig(
max_steps=min(8, int(self.config.agent_max_steps)),
temperature=0.2,
max_tokens=None,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
# Scoring happens in verify_and_score_trajectory so we can inspect tool results.
_ = (item, final_response)
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str, # noqa: ARG002
exec_tool, # noqa: ARG002
agent_result: AgentResult | None = None,
workspace_meta: Dict[str, Any] | None = None, # noqa: ARG002
) -> tuple[float, Dict[str, Any]]:
if agent_result is None:
return 0.0, {"error": "Missing agent_result"}
observed: str = ""
tool_ok = False
for step in agent_result.steps:
for res in step.tool_results:
if not res.success:
return 0.0, {"error": res.error, "output": res.output}
out = (res.output or "").strip()
if out:
observed = out.splitlines()[-1].strip()
tool_ok = True
final = (final_response or "").strip()
score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}
if __name__ == "__main__":
HermesCompatTestEnv.cli()

View File

@@ -1,172 +0,0 @@
"""
Nomad sandbox terminal smoke environment (training-oriented).
Validates, end-to-end:
BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server
It forces the model to use a sandbox tool by asking it to run a command that
generates a high-entropy token inside the sandbox, then repeat it exactly.
Run (process mode):
uv run python -m atropos_agent.envs.sandbox_terminal_smoke_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
"""
from __future__ import annotations
import os
from typing import Any, Dict, List, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig, AgentResult
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
STRICT_TOOLCALL_SYSTEM_PROMPT = None
def _forced_tool_item() -> Item:
# Use double quotes in the shell command and show JSON escaping explicitly.
# This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
return {
"command": cmd,
"prompt": (
"You MUST use the terminal tool.\n"
"Run this exact command:\n"
f"{cmd}\n"
"When you call the tool, use valid JSON inside <tool_call>. Example:\n"
'<tool_call>{"name": "terminal", "arguments": {"command": '
'"python -c \\\\"import secrets; print(secrets.token_hex(16))\\\\""}}'
"</tool_call>\n"
"Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
"Do not guess. Do not explain."
),
}
class SandboxTerminalSmokeEnvConfig(AgentEnvConfig):
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible chat server (without /v1).",
)
server_model: str = Field(default="hermes-4-36b", description="Model name")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class SandboxTerminalSmokeEnv(AgentEnv[SandboxTerminalSmokeEnvConfig]):
name = "sandbox_terminal_smoke_env"
env_config_cls = SandboxTerminalSmokeEnvConfig
def __init__(
self,
config: SandboxTerminalSmokeEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iter = 0
@classmethod
def config_init(cls) -> Tuple[SandboxTerminalSmokeEnvConfig, List[APIServerConfig]]:
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = SandboxTerminalSmokeEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
include_messages=True,
ensure_scores_are_not_same=False,
total_steps=2,
batch_size=1,
server_base_url=base_url,
server_model=model,
# Tooling: sandbox-only terminal.
enabled_toolsets=["terminal"],
disabled_toolsets=[],
# Default to Nomad sandboxing; users can override via --env.* args.
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=120,
)
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
return None
async def get_next_item(self) -> Item:
self._iter += 1
return _forced_tool_item()
def build_task(self, item: Item) -> str:
return str(item.get("prompt") or "")
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
# Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
return AgentConfig(
max_steps=min(8, int(self.config.agent_max_steps)),
temperature=0.2,
max_tokens=None,
system_prompt=STRICT_TOOLCALL_SYSTEM_PROMPT,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
# Scoring happens in verify_and_score_trajectory so we can inspect tool results.
_ = (item, final_response)
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str, # noqa: ARG002
exec_tool, # noqa: ARG002
agent_result: AgentResult | None = None,
workspace_meta: Dict[str, Any] | None = None, # noqa: ARG002
) -> tuple[float, Dict[str, Any]]:
if agent_result is None:
return 0.0, {"error": "Missing agent_result"}
observed: str = ""
tool_ok = False
for step in agent_result.steps:
for res in step.tool_results:
if not res.success:
return 0.0, {"error": res.error, "output": res.output}
out = (res.output or "").strip()
if out:
observed = out.splitlines()[-1].strip()
tool_ok = True
final = (final_response or "").strip()
score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}
if __name__ == "__main__":
SandboxTerminalSmokeEnv.cli()

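Both smoke envs share the same reward rule: take the last line of the last non-empty tool output and require the final response to match it exactly. A simplified sketch of that predicate (it drops the tool-success and tool_calls > 0 guards the envs also apply):

def smoke_score(tool_outputs, final_response: str) -> float:
    observed = ""
    for out in tool_outputs:
        out = (out or "").strip()
        if out:
            observed = out.splitlines()[-1].strip()
    final = (final_response or "").strip()
    return 1.0 if observed and final == observed else 0.0

assert smoke_score(["a1b2c3d4\n"], " a1b2c3d4 ") == 1.0       # exact echo -> reward
assert smoke_score(["a1b2c3d4\n"], "Token: a1b2c3d4") == 0.0  # extra prose -> no reward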
View File

@@ -1,418 +0,0 @@
"""
SWE-smith-oracle environment.
This environment is intentionally minimal:
- prepares a sandbox workspace by cloning a public GitHub repo at `base_commit`
- runs an AtroposAgent tool loop to apply a fix
- verifies by running pytest nodeids from the dataset (reward = pass/fail)
- Python only (no multi-language support currently; the other sandbox images still need to be properly built & added to dropbox)
- TODO: get the other non-Python sandboxes up and running, then add a config knob to switch between them per row
- TODO: push the sandbox images to Docker Hub
Dataset: NousResearch/SWE-smith-oracle (train; does NOT use SWE-bench eval set).
"""
from __future__ import annotations
import os
import random
import time
from typing import Any, Dict, List, Optional, Tuple
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig
class SweSmithOracleEnvConfig(AgentEnvConfig):
dataset_name: str = Field(default="NousResearch/SWE-smith-oracle")
dataset_split: str = Field(default="train")
max_items: int = Field(default=0, description="0 = no limit")
shuffle: bool = Field(default=True)
seed: int = Field(default=0)
python_only: bool = Field(default=True, description="Filter to Python-evaluable rows")
score_include_fail_to_pass: bool = Field(
default=True,
description=(
"If true (default), score tests on PASS_TO_PASS FAIL_TO_PASS. "
"Disable to only run PASS_TO_PASS (faster but weaker signal)."
),
)
prompt_mode: str = Field(
default="problem_statement",
description="Task prompt content: 'problem_statement' (fast) or 'problem_statement+text' (slower, includes dataset 'text').",
)
repo_base_url: str = Field(default="https://github.com", description="Base URL for repo cloning")
install_timeout_s: float = Field(default=600.0)
test_timeout_s: float = Field(default=600.0)
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class SweSmithOracleEnv(AgentEnv[SweSmithOracleEnvConfig]):
"""
SWE-smith-oracle AgentEnv.
This is designed for benchmarking multiplexed slot execution vs naive container-per-trajectory.
"""
name = "swe_smith_oracle_env"
env_config_cls = SweSmithOracleEnvConfig
def __init__(
self,
config: SweSmithOracleEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._dataset = None
self._indices: List[int] = []
self._cursor = 0
@classmethod
def config_init(cls) -> Tuple[SweSmithOracleEnvConfig, List[APIServerConfig]]:
# Defaults for running the env via CLI in offline `process` mode.
# Override via env vars or `--env.*` flags as needed.
base_url_raw = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
base_url = base_url_raw.rstrip("/")
if not base_url.endswith("/v1"):
base_url = f"{base_url}/v1"
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = SweSmithOracleEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
rollout_server_url="http://localhost:8000",
total_steps=1,
batch_size=1,
steps_per_eval=1,
max_token_length=8192,
inference_weight=1.0,
wandb_name="swe_smith_oracle",
enabled_toolsets=["terminal"],
disabled_toolsets=[],
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=base_url,
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=int(os.getenv("ATROPOS_SERVER_TIMEOUT_S") or "300"),
),
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
from datasets import load_dataset
t0 = time.perf_counter()
print(
f"[SweSmithOracleEnv] loading dataset {self.config.dataset_name}:{self.config.dataset_split} "
f"(python_only={self.config.python_only}, max_items={self.config.max_items or 'all'})",
flush=True,
)
ds = load_dataset(self.config.dataset_name, split=self.config.dataset_split)
self._dataset = ds
indices: List[int] = []
for idx in range(len(ds)):
row = ds[idx]
if self.config.python_only and not self._is_python_row(row):
continue
indices.append(idx)
if self.config.shuffle:
rnd = random.Random(self.config.seed)
rnd.shuffle(indices)
if self.config.max_items and self.config.max_items > 0:
indices = indices[: self.config.max_items]
self._indices = indices
self._cursor = 0
print(
f"[SweSmithOracleEnv] loaded {len(self._indices)} items from {self.config.dataset_name}:{self.config.dataset_split} "
f"in {time.perf_counter() - t0:.2f}s",
flush=True,
)
def _is_python_row(self, row: Dict[str, Any]) -> bool:
nodeids = row.get("PASS_TO_PASS")
if not isinstance(nodeids, list) or not nodeids:
return False
for nid in nodeids:
if not isinstance(nid, str) or ".py::" not in nid:
return False
return True
async def get_next_item(self) -> Item:
print(f"[SweSmithOracleEnv] get_next_item() cursor={self._cursor}/{len(self._indices)}", flush=True)
if not self._dataset or not self._indices:
raise RuntimeError("Dataset not initialized (did setup() run?)")
if self._cursor >= len(self._indices):
self._cursor = 0
idx = self._indices[self._cursor]
self._cursor += 1
return dict(self._dataset[idx])
def _repo_name(self, item: Item) -> str:
repo = item.get("repo") or ""
if isinstance(repo, str) and "/" in repo:
return repo.split("/")[-1]
return "repo"
def build_task(self, item: Item) -> str:
repo = item.get("repo") or ""
base_commit = item.get("base_commit") or ""
problem = str(item.get("problem_statement") or "")
context = str(item.get("text") or "")
nodeids = self._tests_for_item(item)
tests_list = "\n".join(f"- {t}" for t in nodeids)
repo_dir = self._repo_name(item)
tests_block = (
"Run these tests to verify:\n"
f"{tests_list}\n\n"
"When done, briefly describe what you changed and confirm tests pass."
)
prompt_mode = (self.config.prompt_mode or "problem_statement").strip().lower()
if prompt_mode not in {"problem_statement", "problem_statement+text"}:
raise ValueError(
f"Invalid prompt_mode={self.config.prompt_mode!r}. "
"Expected 'problem_statement' or 'problem_statement+text'."
)
context_block = ""
if prompt_mode == "problem_statement+text" and context:
# Note: We intentionally do NOT truncate/cap here. This mode is for debugging / richer prompts and can be slow.
context_block = f"\nAdditional context:\n{context}\n"
return (
"You are a senior software engineer. Fix the repository so the specified tests pass.\n\n"
f"Repository: {repo} (checked out at base_commit={base_commit})\n"
f"Workspace path: ./{repo_dir}\n\n"
"Constraints:\n"
"- You MUST use the terminal tool to inspect, edit, and verify the repository. Do not respond with a patch file.\n"
f"- Start by inspecting the repo (e.g. `ls`, `cd ./{repo_dir}`, `git status`).\n"
"- Use a workspace-local virtualenv (e.g. inside the repo at ./.venv) to avoid cross-run contamination.\n"
"- Use non-interactive commands only.\n\n"
"- Terminal commands run under POSIX /bin/sh and each tool call runs in a fresh shell (no persisted env vars).\n"
" Avoid bash-only `source`; prefer `. .venv/bin/activate` or `.venv/bin/python ...`.\n\n"
"Problem statement:\n"
f"{problem}\n\n"
f"{context_block}\n"
f"{tests_block}"
)
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
# SWE tasks are longer than the simple test env.
return AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
tool_delay_s=self.config.agent_tool_delay_s,
)
async def setup_trajectory_workspace(self, item: Item, *, trajectory_id: str, exec_tool) -> Dict[str, Any]:
t0 = time.perf_counter()
repo = item.get("repo")
base_commit = item.get("base_commit")
instance_id = item.get("instance_id") or item.get("id") or item.get("problem_id")
if not isinstance(repo, str) or not isinstance(base_commit, str):
raise RuntimeError("Invalid dataset row: missing repo/base_commit")
repo_dir = self._repo_name(item)
clone_url = f"{self.config.repo_base_url.rstrip('/')}/{repo}.git"
print(
f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): "
f"repo={repo} base_commit={base_commit} instance_id={instance_id} dir=./{repo_dir}",
flush=True,
)
# Repo setup strategy:
# - Maintain a shared, per-container bare repo cache under /data/repo_cache
# - For each trajectory, create an isolated git worktree under the slot workspace
# This avoids cloning/fetching full repos per trajectory and is crucial for multiplexing.
def _repo_cache_slug(repo_name: str) -> str:
return repo_name.replace("/", "__")
repo_slug = _repo_cache_slug(repo)
cache_root = "/data/repo_cache"
bare_repo = f"{cache_root}/{repo_slug}.git"
lock_file = f"{cache_root}/.locks/{repo_slug}.lock"
# Use flock to serialize operations that mutate the shared bare repo (fetch/worktree).
# util-linux (flock) is included in the sandbox image.
worktree_cmd = (
"set -e; "
f"rm -rf {repo_dir}; "
f"mkdir -p {cache_root}/.locks; "
f": > {lock_file}; "
f"flock -x {lock_file} sh -lc '"
f"set -e; "
"export GIT_TERMINAL_PROMPT=0; "
"export GIT_LFS_SKIP_SMUDGE=1; "
f"if [ ! -d \"{bare_repo}\" ]; then "
f" git init --bare \"{bare_repo}\"; "
f" git -C \"{bare_repo}\" remote add origin \"{clone_url}\"; "
"fi; "
f"git -C \"{bare_repo}\" remote set-url origin \"{clone_url}\"; "
f"git -C \"{bare_repo}\" worktree prune || true; "
f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
f" git -C \"{bare_repo}\" fetch --depth 1 origin \"{base_commit}\" || true; "
"fi; "
f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
f" git -C \"{bare_repo}\" fetch --prune origin; "
"fi; "
f"git --git-dir=\"{bare_repo}\" worktree add --detach \"{repo_dir}\" \"{base_commit}\"; "
"'"
)
print(f"[SweSmithOracleEnv] tid={trajectory_id} preparing worktree from repo cache", flush=True)
res = await exec_tool(
ToolCall(
name="terminal",
arguments={"command": worktree_cmd, "timeout": self.config.install_timeout_s},
)
)
if not res.success:
raise RuntimeError(
"git worktree setup failed "
f"(repo={repo}, base_commit={base_commit}, instance_id={instance_id}): {res.error}\n{res.output}"
)
print(
f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): worktree ready in {time.perf_counter() - t0:.2f}s",
flush=True,
)
return {"repo_dir": repo_dir, "base_commit": base_commit}
def _tests_for_item(self, item: Item) -> List[str]:
tests: List[str] = []
if self.config.score_include_fail_to_pass:
for key in ("PASS_TO_PASS", "FAIL_TO_PASS"):
nodeids = item.get(key)
if isinstance(nodeids, list):
tests.extend([n for n in nodeids if isinstance(n, str)])
else:
nodeids = item.get("PASS_TO_PASS")
if isinstance(nodeids, list):
tests.extend([n for n in nodeids if isinstance(n, str)])
# Stable order for reproducibility.
return sorted(dict.fromkeys(tests))
def _chunk_nodeids(self, nodeids: List[str], max_per_chunk: int = 50) -> List[List[str]]:
chunks: List[List[str]] = []
for i in range(0, len(nodeids), max_per_chunk):
chunks.append(nodeids[i : i + max_per_chunk])
return chunks
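    # Illustrative chunking (hypothetical values): with max_per_chunk=2,
    # ["a", "b", "c"] -> [["a", "b"], ["c"]]; chunking keeps each pytest
    # command line short enough for large PASS_TO_PASS test sets.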
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str, # noqa: ARG002
*,
trajectory_id: str,
exec_tool,
agent_result=None,
workspace_meta: Optional[Dict[str, Any]] = None,
) -> tuple[float, Dict[str, Any]]:
repo_dir = self._repo_name(item)
# Training correctness: do not reward trajectories that never actually used tools.
if agent_result is not None and getattr(agent_result, "total_tool_calls", 0) <= 0:
print(
f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): no tool calls; score=0.0",
flush=True,
)
return 0.0, {
"verification_mode": "dataset_tests",
"error": "No tool calls were made by the agent",
}
nodeids = self._tests_for_item(item)
if not nodeids:
return 0.0, {"error": "No tests provided"}
print(f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): ensuring venv + deps", flush=True)
setup_cmd = (
f"cd {repo_dir} && "
"python -m venv .venv && "
". .venv/bin/activate && "
"python -m pip install -U pip setuptools wheel && "
"python -m pip install -e . && "
"python -m pip install pytest"
)
setup_res = await exec_tool(
ToolCall(name="terminal", arguments={"command": setup_cmd, "timeout": self.config.install_timeout_s})
)
verification_messages = [{"role": "user", "content": setup_res.to_xml()}]
if not setup_res.success:
return 0.0, {
"verification_mode": "dataset_tests",
"phase": "install",
"error": setup_res.error,
"output": setup_res.output,
"verification_messages": verification_messages,
}
chunks = self._chunk_nodeids(nodeids, max_per_chunk=50)
for chunk_idx, chunk in enumerate(chunks):
joined = " ".join(chunk)
cmd = f"cd {repo_dir} && . .venv/bin/activate && python -m pytest -q {joined}"
res = await exec_tool(
ToolCall(
name="terminal",
arguments={"command": cmd, "timeout": self.config.test_timeout_s},
)
)
verification_messages.append({"role": "user", "content": res.to_xml()})
if not res.success:
return 0.0, {
"verification_mode": "dataset_tests",
"phase": "pytest",
"failed_chunk": chunk_idx,
"error": res.error,
"output": res.output,
"verification_messages": verification_messages,
}
return 1.0, {"verification_mode": "dataset_tests", "passed": True, "verification_messages": verification_messages}
async def score_trajectory(self, item: Item, final_response: str) -> float:
# Not used; scoring happens in verify_and_score_trajectory.
_ = (item, final_response)
return 0.0
if __name__ == "__main__":
SweSmithOracleEnv.cli()
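
# Hedged run example (module path assumed, mirroring the smoke env's docstring below):
#   uv run python -m atropos.envs.swe_smith_oracle_env process \
#       --env.use_wandb false --env.total_steps 1 --env.group_size 1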

View File

@@ -1,217 +0,0 @@
"""
Simple test environment for validating the atropos-agent setup.
This environment uses a local OpenAI-compatible server for LLM testing to verify:
- BaseEnv extension works correctly
- API communication via OpenAI-compatible endpoint
- Basic trajectory collection
This is a minimal environment for testing, not production use.
"""
import os
from typing import Dict, List, Optional, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import (
APIServerConfig,
Item,
)
from ..agent import AgentConfig
from .agent_env import AgentEnv, AgentEnvConfig
# Load environment variables from .env file
load_dotenv()
# Simple test prompts for validation
TEST_PROMPTS = [
{
"prompt": "What is 2 + 2? Answer with just the number.",
"expected": "4",
},
{
"prompt": "What is the capital of France? Answer with just the city name.",
"expected": "Paris",
},
{
"prompt": "What color is the sky on a clear day? Answer with just the color.",
"expected": "Blue",
},
{
"prompt": "How many days are in a week? Answer with just the number.",
"expected": "7",
},
{
"prompt": "What is 10 * 5? Answer with just the number.",
"expected": "50",
},
]
SYSTEM_PROMPT = (
"You are a helpful assistant. Answer questions concisely and directly. "
"When asked for a simple answer, provide just that answer without explanation."
)
class SimpleTestEnvConfig(AgentEnvConfig):
"""Configuration for the simple test environment."""
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible server (without /v1)",
)
server_model: str = Field(
default="hermes-4-36b",
description="Model name",
)
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class SimpleTestEnv(AgentEnv[SimpleTestEnvConfig]):
"""
A simple test environment to validate the atropos-agent setup.
Uses a local OpenAI-compatible LLM endpoint with basic question-answering tasks.
Scoring is based on whether the response contains the expected answer.
"""
name = "simple_test_env"
env_config_cls = SimpleTestEnvConfig
def __init__(
self,
config: SimpleTestEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self.iter = 0
self.test_prompts = TEST_PROMPTS
self.percent_correct_buffer: List[float] = []
@classmethod
def config_init(cls) -> Tuple[SimpleTestEnvConfig, List[APIServerConfig]]:
"""
Initialize configuration with local server settings from environment variables.
"""
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = SimpleTestEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=4,
use_wandb=False, # Disable wandb for simple testing
rollout_server_url="http://localhost:8000",
total_steps=10,
batch_size=16,
steps_per_eval=5,
max_token_length=2048,
inference_weight=1.0,
wandb_name="simple_test",
server_base_url=base_url,
server_model=model,
)
# OpenAI-compatible servers typically expose chat completions at /v1.
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url}/v1",
api_key=api_key,
num_max_requests_at_once=4,
num_requests_for_eval=8,
timeout=120, # Local models may be slower
),
]
return env_config, server_configs
async def setup_agent_env(self):
"""Setup the environment - load test data."""
print(f"SimpleTestEnv setup complete. {len(self.test_prompts)} test prompts loaded.")
print(f"Using server at: {self.config.server_base_url}")
print(f"Model: {self.config.server_model}")
async def get_next_item(self) -> Item:
"""Get the next test prompt."""
item = self.test_prompts[self.iter % len(self.test_prompts)]
self.iter += 1
return item
def build_task(self, item: Item) -> str:
return item["prompt"]
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
return AgentConfig(
max_steps=5,
temperature=0.7,
max_tokens=256,
system_prompt=SYSTEM_PROMPT,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
expected = item["expected"].lower()
response_lower = (final_response or "").lower()
score = 1.0 if expected in response_lower else 0.0
self.percent_correct_buffer.append(score)
return score
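    # Note: substring scoring is deliberately lenient for this smoke test; e.g.
    # expected "4" also matches a response like "14". A stricter env would
    # compare exact tokens instead.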
async def evaluate(self, *args, **kwargs):
"""
Simple evaluation - run through all test prompts once.
"""
correct = 0
total = len(self.test_prompts)
for item in self.test_prompts:
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": item["prompt"]},
]
response = await self.server.chat_completion(
messages=messages,
n=1,
max_tokens=256,
temperature=0.0, # Greedy for eval
split="eval",
)
response_text = response.choices[0].message.content or ""
expected = item["expected"].lower()
if expected in response_text.lower():
correct += 1
accuracy = correct / total
print(f"Evaluation: {correct}/{total} = {accuracy:.2%} accuracy")
return {"eval_accuracy": accuracy}
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log metrics (simplified for testing)."""
if wandb_metrics is None:
wandb_metrics = {}
if self.percent_correct_buffer:
avg_correct = sum(self.percent_correct_buffer) / len(self.percent_correct_buffer)
wandb_metrics["train/percent_correct"] = avg_correct
print(f"Train accuracy: {avg_correct:.2%}")
self.percent_correct_buffer = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
# Allow running as CLI
SimpleTestEnv.cli()

View File

@@ -1,165 +0,0 @@
"""
ToolServer routing smoke environment.
Validates that:
- sandbox tools run through Nomad SlotPool (terminal -> bash in sandbox)
- external tools run through ToolServer (skills_list)
This env uses ToolServer in-process by default (`tool_server_url="inprocess"`),
so it is self-contained for local testing.
Run:
uv run python -m atropos.envs.toolserver_smoke_env process --env.use_wandb false --env.total_steps 1 --env.group_size 1
"""
from __future__ import annotations
import os
from typing import Any, Dict, List, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig, AgentResult
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
class ToolServerSmokeEnvConfig(AgentEnvConfig):
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible chat server (without /v1).",
)
server_model: str = Field(default="hermes-4-36b", description="Model name")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class ToolServerSmokeEnv(AgentEnv[ToolServerSmokeEnvConfig]):
name = "toolserver_smoke_env"
env_config_cls = ToolServerSmokeEnvConfig
def __init__(
self,
config: ToolServerSmokeEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iter = 0
@classmethod
def config_init(cls) -> Tuple[ToolServerSmokeEnvConfig, List[APIServerConfig]]:
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = ToolServerSmokeEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
include_messages=True,
ensure_scores_are_not_same=False,
total_steps=1,
batch_size=1,
server_base_url=base_url,
server_model=model,
enabled_toolsets=["terminal", "skills"],
disabled_toolsets=[],
# Self-contained ToolServer for local smoke.
tool_server_url="inprocess",
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=120,
)
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
return None
async def get_next_item(self) -> Item:
self._iter += 1
return {
"prompt": (
"You MUST call exactly one tool per assistant message.\n"
"\n"
"Step 1) Call the skills_list tool (no arguments), then stop.\n"
"Step 2) After you receive the tool response, call the terminal tool to run:\n"
"python -c \"print('ok')\"\n"
"Step 3) After you receive the terminal tool response, answer with just: ok\n"
"\n"
"Tool call format requirements:\n"
"- Every tool call MUST be a complete XML block with a closing tag.\n"
"- Do NOT emit a second <tool_call> in the same assistant message.\n"
"\n"
"Example:\n"
"<tool_call>{\"name\": \"skills_list\", \"arguments\": {}}</tool_call>\n"
"Do not include anything else in your final answer."
)
}
def build_task(self, item: Item) -> str:
return str(item.get("prompt") or "")
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
return AgentConfig(
max_steps=min(10, int(self.config.agent_max_steps)),
temperature=0.2,
max_tokens=None,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
_ = (item, final_response)
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str, # noqa: ARG002
exec_tool, # noqa: ARG002
agent_result: AgentResult | None = None,
workspace_meta: Dict[str, Any] | None = None, # noqa: ARG002
) -> tuple[float, Dict[str, Any]]:
if agent_result is None:
return 0.0, {"error": "Missing agent_result"}
called = {c.name for s in agent_result.steps for c in s.tool_calls}
need = {"skills_list", "terminal"}
if not need.issubset(called):
return 0.0, {"error": f"Missing tool calls: {sorted(need - called)}", "called": sorted(called)}
        terminal_ok = False
        for step in agent_result.steps:
            for call, res in zip(step.tool_calls, step.tool_results):
                if call.name != "terminal":
                    continue
                output_lines = (res.output or "").strip().splitlines()
                if res.success and output_lines and output_lines[-1].strip() == "ok":
                    terminal_ok = True
score = 1.0 if terminal_ok and (final_response or "").strip() == "ok" else 0.0
return score, {"called": sorted(called), "final": (final_response or "").strip()}
if __name__ == "__main__":
ToolServerSmokeEnv.cli()

View File

@@ -1,11 +0,0 @@
"""
Nomad integration for atropos-agent.
Provides:
- NomadClient: Client for Nomad HTTP API
- Job templates for sandbox containers
"""
from .client import NomadClient
__all__ = ["NomadClient"]

View File

@@ -1,500 +0,0 @@
"""
Nomad API Client for atropos-agent.
Provides a simple async client for interacting with the Nomad HTTP API:
- Submit/stop jobs
- Query allocations
- Get allocation addresses
- Scale jobs up/down
"""
import asyncio
import json
import os
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional
import aiohttp
class AllocationStatus(Enum):
"""Nomad allocation status."""
PENDING = "pending"
RUNNING = "running"
COMPLETE = "complete"
FAILED = "failed"
LOST = "lost"
@dataclass
class Allocation:
"""Information about a Nomad allocation."""
id: str
job_id: str
task_group: str
node_id: str
status: AllocationStatus
# Network info for reaching the allocation
address: Optional[str] = None
port: Optional[int] = None
@property
def http_address(self) -> Optional[str]:
"""Get full HTTP address for the allocation."""
if self.address and self.port:
return f"http://{self.address}:{self.port}"
return None
@dataclass
class JobStatus:
"""Status of a Nomad job."""
id: str
name: str
status: str
allocations: List[Allocation] = field(default_factory=list)
count: int = 0 # Number of task groups
class NomadClient:
"""
Async client for Nomad HTTP API.
Usage:
client = NomadClient(address="http://localhost:4646")
# Submit a job
await client.submit_job(job_spec)
# Get allocations
allocs = await client.get_job_allocations("sandbox-python")
# Scale job
await client.scale_job("sandbox-python", count=5)
"""
def __init__(
self,
address: str = "http://localhost:4646",
token: Optional[str] = None,
timeout: float = 30.0,
):
self.address = address.rstrip("/")
self.token = token or os.environ.get("NOMAD_TOKEN")
self.timeout = aiohttp.ClientTimeout(total=timeout)
self._session: Optional[aiohttp.ClientSession] = None
async def _get_session(self) -> aiohttp.ClientSession:
"""Get or create HTTP session."""
if self._session is None or self._session.closed:
headers = {}
if self.token:
headers["X-Nomad-Token"] = self.token
self._session = aiohttp.ClientSession(
timeout=self.timeout,
headers=headers,
)
return self._session
async def close(self):
"""Close the HTTP session."""
if self._session and not self._session.closed:
await self._session.close()
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
await self.close()
async def _request(
self,
method: str,
path: str,
data: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
"""Make an HTTP request to Nomad API."""
session = await self._get_session()
url = f"{self.address}{path}"
try:
async with session.request(method, url, json=data) as response:
if response.status == 404:
return {"error": "not_found", "status": 404}
text = await response.text()
if not text:
return {"status": response.status}
try:
result = json.loads(text)
except json.JSONDecodeError:
return {"text": text, "status": response.status}
if response.status >= 400:
return {"error": result, "status": response.status}
return result if isinstance(result, dict) else {"data": result, "status": response.status}
except aiohttp.ClientError as e:
return {"error": str(e), "status": 0}
# Job Operations
async def submit_job(self, job_spec: Dict[str, Any]) -> Dict[str, Any]:
"""
Submit a job to Nomad.
Args:
job_spec: Job specification dict (HCL converted to JSON)
Returns:
Response with EvalID if successful
"""
return await self._request("POST", "/v1/jobs", {"Job": job_spec})
async def stop_job(self, job_id: str, purge: bool = False) -> Dict[str, Any]:
"""
Stop (and optionally purge) a job.
Args:
job_id: Job identifier
purge: If True, completely remove the job
"""
path = f"/v1/job/{job_id}"
if purge:
path += "?purge=true"
return await self._request("DELETE", path)
async def get_job(self, job_id: str) -> Optional[Dict[str, Any]]:
"""Get job details."""
result = await self._request("GET", f"/v1/job/{job_id}")
if "error" in result and result.get("status") == 404:
return None
return result
async def get_job_status(self, job_id: str) -> Optional[JobStatus]:
"""Get job status with allocations."""
job = await self.get_job(job_id)
if not job:
return None
allocs = await self.get_job_allocations(job_id)
# Get count from task groups
count = 0
task_groups = job.get("TaskGroups", [])
for tg in task_groups:
count += tg.get("Count", 1)
return JobStatus(
id=job_id,
name=job.get("Name", job_id),
status=job.get("Status", "unknown"),
allocations=allocs,
count=count,
)
# Allocation Operations
async def get_job_allocations(self, job_id: str) -> List[Allocation]:
"""Get all allocations for a job."""
result = await self._request("GET", f"/v1/job/{job_id}/allocations")
if "error" in result:
return []
allocs_data = result.get("data", result) if isinstance(result, dict) else result
if not isinstance(allocs_data, list):
return []
allocations = []
for alloc_data in allocs_data:
# Parse allocation info
alloc_id = alloc_data.get("ID", "")
status_str = alloc_data.get("ClientStatus", "unknown")
try:
status = AllocationStatus(status_str)
except ValueError:
status = AllocationStatus.PENDING
# Get network info - need to fetch detailed allocation for this
address = None
port = None
# First try the summary data
resources = alloc_data.get("AllocatedResources") or {}
shared = resources.get("Shared") or {}
networks = shared.get("Networks") or []
# If no networks in summary, fetch detailed allocation
if not networks and alloc_id:
detailed = await self.get_allocation(alloc_id)
if detailed:
resources = detailed.get("AllocatedResources") or {}
shared = resources.get("Shared") or {}
networks = shared.get("Networks") or []
if networks:
network = networks[0]
address = network.get("IP")
# Look for dynamic ports OR reserved ports (Singularity/raw_exec uses reserved)
dyn_ports = network.get("DynamicPorts") or []
reserved_ports = network.get("ReservedPorts") or []
for dp in dyn_ports + reserved_ports:
if dp.get("Label") == "http":
port = dp.get("Value")
break
allocations.append(Allocation(
id=alloc_id,
job_id=job_id,
task_group=alloc_data.get("TaskGroup", ""),
node_id=alloc_data.get("NodeID", ""),
status=status,
address=address,
port=port,
))
return allocations
async def get_allocation(self, alloc_id: str) -> Optional[Dict[str, Any]]:
"""Get detailed allocation info."""
result = await self._request("GET", f"/v1/allocation/{alloc_id}")
if "error" in result and result.get("status") == 404:
return None
return result
# Scaling Operations
async def scale_job(self, job_id: str, count: int, task_group: str = "sandbox") -> Dict[str, Any]:
"""
Scale a job's task group to specified count.
Args:
job_id: Job identifier
count: Desired number of allocations
task_group: Name of task group to scale
"""
payload = {
"Count": count,
"Target": {
"Group": task_group,
},
}
return await self._request("POST", f"/v1/job/{job_id}/scale", payload)
async def get_job_scale_status(self, job_id: str) -> Dict[str, int]:
"""
Get current scale status for a job.
Returns:
Dict mapping task group name to count
"""
result = await self._request("GET", f"/v1/job/{job_id}/scale")
if "error" in result:
return {}
task_groups = result.get("TaskGroups", {})
return {
name: info.get("Running", 0)
for name, info in task_groups.items()
}
# Health Check
async def is_healthy(self) -> bool:
"""Check if Nomad is reachable and healthy."""
try:
result = await self._request("GET", "/v1/status/leader")
return "error" not in result
except Exception:
return False
async def get_leader(self) -> Optional[str]:
"""Get current Nomad leader address."""
result = await self._request("GET", "/v1/status/leader")
if isinstance(result, dict) and "data" in result:
return result["data"]
return None
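
# Hedged smoke-test sketch (not part of the original file): probes a local
# dev-mode Nomad at the default address; the names and flow here are assumptions.
async def _nomad_smoke() -> None:
    async with NomadClient() as client:
        if await client.is_healthy():
            print("leader:", await client.get_leader())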
def load_job_template(
template_name: str = "sandbox",
**kwargs,
) -> Dict[str, Any]:
"""
Load and configure a job template.
Args:
template_name: Name of template (e.g., "sandbox")
**kwargs: Template variables to substitute
Returns:
Job specification dict ready for Nomad API
"""
# Default job template for sandbox container
if template_name == "sandbox":
return create_sandbox_job(**kwargs)
else:
raise ValueError(f"Unknown template: {template_name}")
def create_sandbox_job(
job_id: str = "atropos-sandbox",
image: str = "atropos-sandbox:local", # Use :local tag to avoid registry pull
count: int = 1,
slots_per_container: int = 10,
privileged: bool = False,
cpu: int = 500,
memory: int = 512,
port: int = 8080,
datacenter: str = "dc1",
driver: str = "docker", # "docker" or "singularity"
    singularity_image: Optional[str] = None,  # Path to .sif file for singularity driver
) -> Dict[str, Any]:
"""
Create a sandbox job specification.
This job runs the sandbox_server.py inside a container,
with the specified number of slots for agent workspaces.
Args:
job_id: Unique job identifier
image: Docker image to use (for docker driver)
count: Number of container instances
slots_per_container: Number of slots per container
privileged: Run container in privileged mode (recommended for bubblewrap)
cpu: CPU allocation in MHz
memory: Memory allocation in MB
port: HTTP port for sandbox server
datacenter: Nomad datacenter
driver: Container driver - "docker" or "singularity"
singularity_image: Path to .sif file (required if driver="singularity")
Returns:
Job specification dict
"""
# Build task config based on driver
if driver == "singularity":
if not singularity_image:
raise ValueError("singularity_image path required when driver='singularity'")
# Use raw_exec driver to run apptainer via shell for variable expansion
# The container binds the allocation directory for workspace persistence
        # For raw_exec, we use a static port since Nomad's dynamic port mapping doesn't
        # work the same way as it does with Docker - the process runs directly on the host.
shell_cmd = (
f'apptainer run '
f'--bind "$NOMAD_ALLOC_DIR/data:/data" '
f'--pwd /app '
f'--env PYTHONUNBUFFERED=1 '
f'{singularity_image} '
f'python sandbox_server.py '
f'--port {port} '
f'--slots {slots_per_container} '
f'--data-dir /data'
)
task_config = {
"command": "/bin/sh",
"args": ["-c", shell_cmd],
}
task_driver = "raw_exec"
else:
# Docker driver (default)
task_config = {
"image": image,
"force_pull": False, # Use local image, don't try to pull
"ports": ["http"],
"privileged": privileged,
"command": "python",
"args": [
"sandbox_server.py",
"--port", str(port),
"--slots", str(slots_per_container),
"--data-dir", "/data",
],
# Note: On Linux, you can mount persistent storage:
# "volumes": ["${NOMAD_ALLOC_DIR}/data:/data"],
# On macOS/Docker Desktop, skip volumes for PoC
# (container /data is ephemeral but works for testing)
}
task_driver = "docker"
# For Singularity/raw_exec, use static ports since the process runs directly on host.
# For Docker, use dynamic ports with port mapping.
if driver == "singularity":
network_config = {
"Mode": "host",
"ReservedPorts": [
{
"Label": "http",
"Value": port,
}
],
}
else:
network_config = {
"Mode": "host",
"DynamicPorts": [
{
"Label": "http",
"To": port,
}
],
}
return {
"ID": job_id,
"Name": job_id,
"Type": "service",
"Datacenters": [datacenter],
"TaskGroups": [
{
"Name": "sandbox",
"Count": count,
# Speed up deployments and avoid Consul checks. Without this, Nomad may
# keep an "active deployment" around for the default MinHealthyTime,
# which blocks immediate scaling under load.
"Update": {
"HealthCheck": "task_states",
"MinHealthyTime": 0,
},
"Networks": [network_config],
"Tasks": [
{
"Name": "sandbox-server",
"Driver": task_driver,
"Config": task_config,
"Env": {
"PYTHONUNBUFFERED": "1",
"NOMAD_ALLOC_DIR": "${NOMAD_ALLOC_DIR}",
},
"Resources": {
"CPU": cpu,
"MemoryMB": memory,
},
# Note: Services with Checks require Consul, which we skip for the PoC
}
],
"RestartPolicy": {
"Attempts": 3,
"Interval": 300_000_000_000, # 5 minutes
"Delay": 10_000_000_000, # 10 seconds
"Mode": "delay",
},
"ReschedulePolicy": {
"Attempts": 5,
"Interval": 3600_000_000_000, # 1 hour
"Delay": 30_000_000_000, # 30 seconds
"DelayFunction": "exponential",
"MaxDelay": 300_000_000_000, # 5 minutes
"Unlimited": False,
},
}
],
}
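# Hedged example (values illustrative): a 3-container Docker pool with 8 slots each.
#   job_spec = create_sandbox_job(job_id="atropos-sandbox", count=3, slots_per_container=8)
#   await NomadClient().submit_job(job_spec)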

File diff suppressed because it is too large

View File

@@ -1,20 +0,0 @@
"""
Slot-based multiplexing for atropos-agent.
Provides:
- Slot: Isolated workspace for a single trajectory
- SlotPool: Manages slots across Nomad allocations
- SandboxExecutor: Executes tools in sandbox containers
"""
from .executor import SandboxExecutor
from .pool import SlotPool, SlotPoolConfig
from .slot import Slot, SlotState
__all__ = [
"Slot",
"SlotState",
"SlotPool",
"SlotPoolConfig",
"SandboxExecutor",
]

View File

@@ -1,457 +0,0 @@
"""
SandboxExecutor - HTTP client for sandbox container communication.
Sends tool execution requests to sandbox_server.py running inside Nomad containers.
Supports single and batch execution for efficiency.
"""
import asyncio
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple
import aiohttp
from .slot import Slot, SlotState
from ..tools.base import ToolCall, ToolResult
@dataclass
class ExecutionRequest:
"""Request to execute a tool in a slot."""
slot: Slot
tool_name: str
args: Dict[str, Any]
execution_id: str = field(default_factory=lambda: str(uuid.uuid4()))
timeout: float = 30.0
@dataclass
class ExecutionResult:
"""Result from sandbox execution."""
success: bool
output: str = ""
error: str = ""
execution_id: str = ""
slot_id: str = ""
metadata: Dict[str, Any] = field(default_factory=dict)
def to_tool_result(self) -> ToolResult:
"""Convert to ToolResult for agent consumption."""
return ToolResult(
success=self.success,
output=self.output,
error=self.error,
metadata=self.metadata,
uniq_id=self.execution_id,
)
class SandboxExecutor:
"""
HTTP client for executing tools in sandbox containers.
Communicates with sandbox_server.py running inside Nomad allocations.
Supports both single execution and batched parallel execution.
Usage:
executor = SandboxExecutor()
# Single execution
result = await executor.execute(slot, "bash", {"command": "ls"})
# Batch execution
results = await executor.execute_batch([
(slot1, "bash", {"command": "ls"}),
(slot2, "write_file", {"path": "test.txt", "content": "hello"}),
])
"""
def __init__(
self,
timeout: float = 30.0,
max_retries: int = 3,
retry_delay: float = 1.0,
):
self.timeout = aiohttp.ClientTimeout(total=timeout)
self.max_retries = max_retries
self.retry_delay = retry_delay
self._session: Optional[aiohttp.ClientSession] = None
async def _get_session(self) -> aiohttp.ClientSession:
"""Get or create HTTP session."""
if self._session is None or self._session.closed:
self._session = aiohttp.ClientSession(timeout=self.timeout)
return self._session
async def close(self):
"""Close HTTP session."""
if self._session and not self._session.closed:
await self._session.close()
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
await self.close()
async def execute(
self,
slot: Slot,
tool_name: str,
args: Dict[str, Any],
timeout: Optional[float] = None,
) -> ExecutionResult:
"""
Execute a tool in a slot's workspace.
Args:
slot: Slot to execute in
tool_name: Name of tool (bash, read_file, write_file)
args: Tool arguments
timeout: Optional timeout override
Returns:
ExecutionResult with output or error
"""
execution_id = str(uuid.uuid4())
exec_timeout = timeout or self.timeout.total or 30.0
# Mark slot as executing
original_state = slot.state
try:
if slot.state == SlotState.ACQUIRED:
slot.start_execution(execution_id)
result = await self._send_execute_request(
container_addr=slot.container_addr,
slot_id=slot.slot_id,
tool_name=tool_name,
args=args,
execution_id=execution_id,
timeout=exec_timeout,
)
result.slot_id = slot.slot_id
return result
finally:
# Restore slot state
if slot.state == SlotState.EXECUTING:
slot.end_execution()
async def _send_execute_request(
self,
container_addr: str,
slot_id: str,
tool_name: str,
args: Dict[str, Any],
execution_id: str,
timeout: float,
) -> ExecutionResult:
"""Send execution request to sandbox server with retry logic."""
session = await self._get_session()
url = f"{container_addr}/execute"
payload = {
"slot_id": slot_id,
"tool": tool_name,
"args": args,
"execution_id": execution_id,
"timeout": timeout,
}
last_error = None
for attempt in range(self.max_retries):
try:
async with session.post(url, json=payload) as response:
data = await response.json()
return ExecutionResult(
success=data.get("success", False),
output=data.get("output", ""),
error=data.get("error", ""),
execution_id=data.get("execution_id", execution_id),
metadata=data.get("metadata", {}),
)
except aiohttp.ClientError as e:
last_error = str(e)
if attempt < self.max_retries - 1:
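                    # Linear backoff: retry_delay, then 2x, 3x, ... between attempts.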
await asyncio.sleep(self.retry_delay * (attempt + 1))
continue
except asyncio.TimeoutError:
last_error = f"Request timed out after {timeout}s"
break
except Exception as e:
last_error = str(e)
break
return ExecutionResult(
success=False,
error=f"Failed after {self.max_retries} attempts: {last_error}",
execution_id=execution_id,
)
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
timeout: Optional[float] = None,
) -> List[ExecutionResult]:
"""
Execute multiple tools in parallel across slots.
This is the key optimization - we batch tool calls to maximize
container utilization while agents are waiting for LLM responses.
Args:
requests: List of (slot, tool_name, args) tuples
timeout: Optional timeout override
Returns:
List of ExecutionResults in same order as requests
"""
if not requests:
return []
# Group requests by container address for batch API
by_container: Dict[str, List[Tuple[int, Slot, str, Dict[str, Any], str]]] = {}
for idx, (slot, tool_name, args) in enumerate(requests):
execution_id = str(uuid.uuid4())
container = slot.container_addr
if container not in by_container:
by_container[container] = []
by_container[container].append((idx, slot, tool_name, args, execution_id))
# Mark slots as executing
if slot.state == SlotState.ACQUIRED:
slot.start_execution(execution_id)
# Execute batches in parallel
exec_timeout = timeout or self.timeout.total or 30.0
batch_tasks = []
for container_addr, batch_requests in by_container.items():
task = self._send_batch_request(
container_addr=container_addr,
batch_requests=batch_requests,
timeout=exec_timeout,
)
batch_tasks.append(task)
# Gather all batch results
batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)
# Collect results in original order
results: List[Optional[ExecutionResult]] = [None] * len(requests)
for batch_result in batch_results:
            if isinstance(batch_result, Exception):
                # The batch task itself failed; leave its results as None so the
                # fill-in pass below marks them as failures.
                continue
for idx, result in batch_result:
results[idx] = result
# Fill in any missing results
for idx, result in enumerate(results):
if result is None:
                slot, _, _ = requests[idx]
results[idx] = ExecutionResult(
success=False,
error="Batch execution failed",
slot_id=slot.slot_id,
)
# End execution on all slots
for slot, _, _ in requests:
if slot.state == SlotState.EXECUTING:
slot.end_execution()
return results # type: ignore
async def _send_batch_request(
self,
container_addr: str,
batch_requests: List[Tuple[int, Slot, str, Dict[str, Any], str]],
timeout: float,
) -> List[Tuple[int, ExecutionResult]]:
"""Send batch execution request to a single container."""
session = await self._get_session()
url = f"{container_addr}/batch"
# Build batch payload
payload = [
{
"slot_id": slot.slot_id,
"tool": tool_name,
"args": args,
"execution_id": execution_id,
"timeout": timeout,
}
for _, slot, tool_name, args, execution_id in batch_requests
]
try:
async with session.post(url, json=payload) as response:
data = await response.json()
if not isinstance(data, list):
raise ValueError(f"Expected list response, got {type(data)}")
results = []
for i, (idx, slot, _, _, execution_id) in enumerate(batch_requests):
if i < len(data):
item = data[i]
result = ExecutionResult(
success=item.get("success", False),
output=item.get("output", ""),
error=item.get("error", ""),
execution_id=item.get("execution_id", execution_id),
slot_id=slot.slot_id,
metadata=item.get("metadata", {}),
)
else:
result = ExecutionResult(
success=False,
error="Missing result in batch response",
execution_id=execution_id,
slot_id=slot.slot_id,
)
results.append((idx, result))
return results
except Exception as e:
# Return error for all requests in batch
return [
(idx, ExecutionResult(
success=False,
error=str(e),
execution_id=execution_id,
slot_id=slot.slot_id,
))
for idx, slot, _, _, execution_id in batch_requests
]
async def reset_slot(self, slot: Slot) -> ExecutionResult:
"""
Reset a slot's workspace (delete all files).
Useful when reusing a slot for a new trajectory.
"""
session = await self._get_session()
url = f"{slot.container_addr}/reset"
try:
async with session.post(url, json={"slot_id": slot.slot_id}) as response:
data = await response.json()
return ExecutionResult(
success=data.get("success", False),
output=data.get("output", ""),
error=data.get("error", ""),
slot_id=slot.slot_id,
)
except Exception as e:
return ExecutionResult(
success=False,
error=str(e),
slot_id=slot.slot_id,
)
async def health_check(self, container_addr: str) -> bool:
"""Check if a sandbox container is healthy."""
session = await self._get_session()
url = f"{container_addr}/health"
try:
async with session.get(url) as response:
data = await response.json()
return data.get("status") == "ok"
except Exception:
return False
async def get_container_status(
self,
container_addr: str
) -> Optional[Dict[str, Any]]:
"""Get status info from a sandbox container."""
session = await self._get_session()
url = f"{container_addr}/health"
try:
async with session.get(url) as response:
return await response.json()
except Exception:
return None
# -------------------------------------------------------------------------
# Artifact helpers (optional)
# -------------------------------------------------------------------------
async def _post_json(
self,
url: str,
payload: Dict[str, Any],
timeout: Optional[float] = None,
) -> Dict[str, Any]:
session = await self._get_session()
try:
async with session.post(url, json=payload, timeout=timeout) as response:
data = await response.json()
if isinstance(data, dict):
data.setdefault("http_status", response.status)
return data
return {"success": False, "error": f"Unexpected response type: {type(data)}", "http_status": response.status}
except Exception as e:
return {"success": False, "error": str(e)}
async def read_artifact(
self,
slot: Slot,
path: str,
*,
encoding: str = "text",
max_bytes: Optional[int] = None,
include_sha256: bool = False,
timeout: Optional[float] = None,
) -> Dict[str, Any]:
url = f"{slot.container_addr}/artifacts/read"
payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "encoding": encoding, "include_sha256": include_sha256}
if max_bytes is not None:
payload["max_bytes"] = max_bytes
return await self._post_json(url, payload, timeout=timeout)
async def list_artifacts(
self,
slot: Slot,
path: str = ".",
*,
recursive: bool = False,
max_entries: Optional[int] = None,
timeout: Optional[float] = None,
) -> Dict[str, Any]:
url = f"{slot.container_addr}/artifacts/list"
payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "recursive": recursive}
if max_entries is not None:
payload["max_entries"] = max_entries
return await self._post_json(url, payload, timeout=timeout)
async def archive_artifacts(
self,
slot: Slot,
path: str = ".",
*,
archive_format: str = "tar.gz",
max_bytes: Optional[int] = None,
max_entries: Optional[int] = None,
timeout: Optional[float] = None,
) -> Dict[str, Any]:
url = f"{slot.container_addr}/artifacts/archive"
payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "format": archive_format}
if max_bytes is not None:
payload["max_bytes"] = max_bytes
if max_entries is not None:
payload["max_entries"] = max_entries
return await self._post_json(url, payload, timeout=timeout)
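
# Hedged usage sketch (not part of the original module): pull a generated patch
# out of a slot after a trajectory finishes. The path and the "content" key in
# the response are assumptions about sandbox_server's artifact schema.
async def _collect_patch(executor: SandboxExecutor, slot: Slot) -> str:
    res = await executor.read_artifact(slot, "repo/patch.diff", max_bytes=1_000_000)
    return str(res.get("content", "")) if res.get("success") else ""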

View File

@@ -1,659 +0,0 @@
"""
SlotPool - Manages slots across Nomad allocations.
The SlotPool is the core abstraction for slot-based multiplexing:
- Tracks available/acquired slots across containers
- Handles slot acquisition and release
- Auto-scales Nomad job count based on demand
- Provides batched tool execution
"""
import asyncio
import logging
import os
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from ..nomad.client import (
Allocation,
AllocationStatus,
NomadClient,
create_sandbox_job,
)
from .executor import ExecutionResult, SandboxExecutor
from .slot import Slot, SlotState, create_slots_for_allocation
logger = logging.getLogger(__name__)
@dataclass
class SlotPoolConfig:
"""Configuration for SlotPool."""
# Nomad settings
nomad_address: str = "http://localhost:4646"
job_id: str = "atropos-sandbox"
datacenter: str = "dc1"
# Container settings
image: str = "atropos-sandbox:local" # Use :local tag to avoid registry pull
slots_per_container: int = 10
privileged: bool = False
cpu: int = 500 # MHz
memory: int = 512 # MB
# Driver selection: "docker" or "singularity"
driver: str = "docker"
# Path to .sif file for singularity driver (required if driver="singularity")
singularity_image: Optional[str] = None
# Scaling settings
min_containers: int = 1
max_containers: int = 10
# Timeouts
acquire_timeout: float = 30.0 # Seconds between acquire polls (also triggers scale-up attempts)
health_check_interval: float = 30.0 # Seconds between health checks
scale_cooldown: float = 60.0 # Seconds between scale operations
# Job lifecycle
purge_job_on_start: bool = False # Purge any pre-existing job before starting (local dev/training friendly)
# Local Docker image convenience (macOS/Nomad dev mode)
auto_build_local_image: bool = True # If image endswith :local and is missing, build it from the bundled Dockerfile.
dockerfile_path: Optional[str] = None # Override Dockerfile path (default: Hermes-Agent/atropos/Dockerfile).
docker_build_context: Optional[str] = None # Override build context (default: Hermes-Agent/atropos).
class SlotPool:
"""
Manages a pool of slots across Nomad allocations.
The SlotPool:
- Deploys sandbox containers to Nomad
- Tracks slots across all running containers
- Handles slot acquisition/release
- Auto-scales based on demand
- Provides batched execution via SandboxExecutor
Usage:
config = SlotPoolConfig(
nomad_address="http://localhost:4646",
job_id="my-sandbox",
slots_per_container=10,
)
pool = SlotPool(config)
await pool.start()
# Acquire a slot
slot = await pool.acquire()
# Execute tool
result = await pool.execute(slot, "bash", {"command": "ls"})
# Release slot
await pool.release(slot)
# Shutdown
await pool.stop()
"""
def __init__(self, config: Optional[SlotPoolConfig] = None):
self.config = config or SlotPoolConfig()
# Nomad client
self.nomad = NomadClient(address=self.config.nomad_address)
# Sandbox executor for tool execution
self.executor = SandboxExecutor()
# Slot tracking
self._slots: Dict[str, Slot] = {} # slot_key -> Slot
self._available_queue: asyncio.Queue[str] = asyncio.Queue()
self._lock = asyncio.Lock()
self._scale_lock = asyncio.Lock()
# State
self._started = False
self._health_task: Optional[asyncio.Task] = None
self._scale_task: Optional[asyncio.Task] = None
self._last_scale_time = 0.0
def _default_dockerfile_path(self) -> Path:
# Hermes-Agent/atropos/Dockerfile lives next to this module in source checkouts.
return Path(__file__).resolve().parents[1] / "Dockerfile"
def _default_build_context(self) -> Path:
return Path(__file__).resolve().parents[1]
def _docker_image_exists(self, image: str) -> bool:
try:
proc = subprocess.run(
["docker", "image", "inspect", image],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=False,
env={**os.environ, "DOCKER_CLI_HINTS": "false"},
)
return proc.returncode == 0
except FileNotFoundError:
return False
def _try_build_local_image(self, image: str) -> None:
dockerfile = Path(self.config.dockerfile_path) if self.config.dockerfile_path else self._default_dockerfile_path()
context = Path(self.config.docker_build_context) if self.config.docker_build_context else self._default_build_context()
if not dockerfile.exists():
raise RuntimeError(
f"Sandbox Dockerfile not found at {dockerfile}. "
"Build the sandbox image manually or set --env.purge_job_on_start false and provide a non-local image."
)
if not context.exists():
raise RuntimeError(f"Docker build context not found at {context}")
# Prefer buildx+--load to ensure the image ends up in the local daemon (required by Nomad's docker driver).
buildx_cmd = [
"docker",
"buildx",
"build",
"--load",
"-t",
image,
"-f",
str(dockerfile),
str(context),
]
proc = subprocess.run(buildx_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
if proc.returncode == 0:
return
# Fallback to classic docker build if buildx isn't available.
build_cmd = ["docker", "build", "-t", image, "-f", str(dockerfile), str(context)]
proc2 = subprocess.run(build_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
if proc2.returncode != 0:
raise RuntimeError(
f"Failed to build local sandbox image {image}. "
f"Tried: {' '.join(buildx_cmd)} and {' '.join(build_cmd)}"
)
def _ensure_local_image(self) -> None:
image = (self.config.image or "").strip()
if not image.endswith(":local"):
return
if not self.config.auto_build_local_image:
return
if self._docker_image_exists(image):
return
logger.info(f"Local sandbox image {image} not found; building it now...")
self._try_build_local_image(image)
def _slot_key(self, alloc_id: str, slot_id: str) -> str:
"""Generate unique key for a slot."""
return f"{alloc_id}:{slot_id}"
@property
def total_slots(self) -> int:
"""Total number of slots in pool."""
return len(self._slots)
@property
def available_slots(self) -> int:
"""Number of available slots."""
return sum(1 for s in self._slots.values() if s.is_available)
@property
def acquired_slots(self) -> int:
"""Number of acquired slots."""
return sum(1 for s in self._slots.values() if s.is_acquired)
async def start(self) -> None:
"""
Start the slot pool.
- Checks if Nomad is healthy
- Deploys sandbox job if not running
- Discovers existing allocations
- Starts health check background task
"""
if self._started:
return
logger.info(f"Starting SlotPool (job_id={self.config.job_id})")
try:
# Make sure local sandbox images exist before Nomad tries to pull them.
# This is a common footgun in macOS dev mode with :local tags.
self._ensure_local_image()
# Check Nomad health
if not await self.nomad.is_healthy():
raise RuntimeError(f"Nomad is not reachable at {self.config.nomad_address}")
if self.config.purge_job_on_start:
logger.info(f"Purging any existing Nomad job: {self.config.job_id}")
await self.nomad.stop_job(self.config.job_id, purge=True)
# Check if job exists (after optional purge)
job = await self.nomad.get_job(self.config.job_id)
if job is None:
# Deploy new job
logger.info(f"Deploying sandbox job: {self.config.job_id} (driver={self.config.driver})")
job_spec = create_sandbox_job(
job_id=self.config.job_id,
image=self.config.image,
count=self.config.min_containers,
slots_per_container=self.config.slots_per_container,
privileged=self.config.privileged,
cpu=self.config.cpu,
memory=self.config.memory,
datacenter=self.config.datacenter,
driver=self.config.driver,
singularity_image=self.config.singularity_image,
)
result = await self.nomad.submit_job(job_spec)
if "error" in result:
raise RuntimeError(f"Failed to submit job: {result}")
# Wait for allocations to be running (even if the job already existed).
await self._wait_for_healthy_allocations(self.config.min_containers)
# Discover existing allocations and slots
await self._refresh_slots()
# Start health check task
self._health_task = asyncio.create_task(self._health_check_loop())
self._started = True
logger.info(f"SlotPool started: {self.total_slots} slots available")
except Exception:
# Ensure aiohttp sessions are not leaked if we fail to start.
await self.stop(purge_job=False)
raise
async def stop(self, purge_job: bool = False) -> None:
"""
Stop the slot pool.
Args:
purge_job: If True, also stop the Nomad job
"""
logger.info("Stopping SlotPool")
# Cancel health check task
if self._health_task:
self._health_task.cancel()
try:
await self._health_task
except asyncio.CancelledError:
pass
finally:
self._health_task = None
if self._scale_task:
self._scale_task.cancel()
try:
await self._scale_task
except asyncio.CancelledError:
pass
finally:
self._scale_task = None
# Optionally stop the job (do this even if start() never completed).
if purge_job:
logger.info(f"Stopping Nomad job: {self.config.job_id}")
await self.nomad.stop_job(self.config.job_id, purge=True)
# Close connections
await self.executor.close()
await self.nomad.close()
self._started = False
self._slots.clear()
# Clear the queue
while not self._available_queue.empty():
try:
self._available_queue.get_nowait()
except asyncio.QueueEmpty:
break
async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
"""
Acquire an available slot.
If no slots are available, waits up to acquire_timeout seconds.
If still no slots, attempts to scale up.
Args:
trajectory_id: Optional ID of trajectory acquiring the slot
Returns:
Acquired Slot
Raises:
asyncio.TimeoutError: If no slot becomes available
"""
if not self._started:
raise RuntimeError("SlotPool not started")
while True:
try:
# Try to get an available slot
slot_key = await asyncio.wait_for(
self._available_queue.get(),
timeout=self.config.acquire_timeout,
)
except asyncio.TimeoutError:
# Try to scale up, but keep waiting even if scaling isn't possible.
# In practice, slots may become available shortly (e.g. contention),
# and scaling may be temporarily blocked by Nomad deployments.
await self._try_scale_up()
continue
slot = self._slots.get(slot_key)
if slot is None:
# Slot was removed; discard stale queue entry and retry.
continue
try:
slot.acquire(trajectory_id)
except RuntimeError:
# Slot isn't actually available (e.g. duplicate queue entry); retry.
continue
logger.debug(f"Acquired slot {slot.slot_id} (alloc={slot.alloc_id[:8]})")
return slot
async def release(self, slot: Slot, reset_workspace: bool = False) -> None:
"""
Release a slot back to the pool.
Args:
slot: Slot to release
reset_workspace: If True, clear the workspace files
"""
slot_key = self._slot_key(slot.alloc_id, slot.slot_id)
if slot_key not in self._slots:
logger.warning(f"Releasing unknown slot: {slot_key}")
return
# Optionally reset workspace
if reset_workspace:
await self.executor.reset_slot(slot)
slot.release()
await self._available_queue.put(slot_key)
logger.debug(f"Released slot {slot.slot_id}")
async def execute(
self,
slot: Slot,
tool_name: str,
args: Dict[str, Any],
timeout: Optional[float] = None,
) -> ExecutionResult:
"""
Execute a tool in a slot's workspace.
Args:
slot: Slot to execute in
tool_name: Name of tool (bash, read_file, write_file)
args: Tool arguments
timeout: Optional timeout override
Returns:
ExecutionResult
"""
return await self.executor.execute(slot, tool_name, args, timeout)
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
timeout: Optional[float] = None,
) -> List[ExecutionResult]:
"""
Execute multiple tools in parallel.
This is the key optimization - batch execution across multiple slots
maximizes container utilization.
Args:
requests: List of (slot, tool_name, args) tuples
timeout: Optional timeout override
Returns:
List of ExecutionResults in same order
"""
return await self.executor.execute_batch(requests, timeout)
async def _refresh_slots(self) -> None:
"""Refresh slot inventory from Nomad allocations."""
async with self._lock:
allocs = await self.nomad.get_job_allocations(self.config.job_id)
# Track which slots we've seen
seen_keys = set()
for alloc in allocs:
if alloc.status != AllocationStatus.RUNNING:
continue
if not alloc.http_address:
continue
# Check container health
healthy = await self.executor.health_check(alloc.http_address)
if not healthy:
continue
# Create slots for this allocation
for i in range(self.config.slots_per_container):
slot_id = f"slot_{i}"
slot_key = self._slot_key(alloc.id, slot_id)
seen_keys.add(slot_key)
if slot_key not in self._slots:
# New slot
slot = Slot(
slot_id=slot_id,
alloc_id=alloc.id,
container_addr=alloc.http_address,
)
self._slots[slot_key] = slot
await self._available_queue.put(slot_key)
logger.debug(f"Added slot: {slot_key}")
# Remove slots from dead allocations
for slot_key in list(self._slots.keys()):
if slot_key not in seen_keys:
                    self._slots.pop(slot_key)
logger.debug(f"Removed slot: {slot_key}")
async def _wait_for_healthy_allocations(
self,
min_count: int,
timeout: float = 120.0
) -> None:
"""Wait for allocations to become healthy."""
import time
start = time.time()
def _summarize_alloc_detail(detail: Dict[str, Any]) -> str:
task_states = detail.get("TaskStates") or {}
parts: List[str] = []
if isinstance(task_states, dict):
for task_name, st in task_states.items():
events = (st or {}).get("Events") or []
if isinstance(events, list) and events:
# Include a few recent events; the latest can be a generic restart message
# while the true root cause is slightly earlier (e.g. image pull failure).
recent = events[-3:]
msgs: List[str] = []
for ev in recent:
desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
if desc:
msgs.append(desc)
if msgs:
parts.append(f"{task_name}: " + " | ".join(msgs))
return "; ".join(parts)
def _alloc_events_lower(detail: Dict[str, Any]) -> str:
task_states = detail.get("TaskStates") or {}
texts: List[str] = []
if isinstance(task_states, dict):
for _task_name, st in task_states.items():
events = (st or {}).get("Events") or []
if isinstance(events, list):
for ev in events[-10:]:
desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
if desc:
texts.append(desc)
return " ".join(texts).lower()
while time.time() - start < timeout:
allocs = await self.nomad.get_job_allocations(self.config.job_id)
healthy_count = 0
for alloc in allocs:
if alloc.status == AllocationStatus.RUNNING and alloc.http_address:
if await self.executor.health_check(alloc.http_address):
healthy_count += 1
# Fast-fail on obvious driver/image errors to avoid waiting out the full timeout.
if alloc.id:
detail = await self.nomad.get_allocation(alloc.id)
if isinstance(detail, dict):
summary = _summarize_alloc_detail(detail)
lowered = _alloc_events_lower(detail) or summary.lower()
if "failed to pull" in lowered or "pull access denied" in lowered:
raise RuntimeError(
"Nomad allocation failed to start due to a Docker image pull error. "
f"Allocation {alloc.id[:8]}: {summary}\n"
"If you're using a local image tag (e.g. `atropos-sandbox:local`) on macOS, "
"make sure the image is loaded into Docker, e.g.:\n"
" docker buildx build --load -t atropos-sandbox:local -f Hermes-Agent/atropos/Dockerfile Hermes-Agent/atropos"
)
if "exceeded allowed attempts" in lowered:
raise RuntimeError(
"Nomad allocation is crash-looping and has entered restart backoff. "
f"Allocation {alloc.id[:8]}: {summary}\n"
"Inspect logs with:\n"
f" nomad alloc logs -stderr -task sandbox-server {alloc.id}\n"
"Common causes include: missing local Docker image tag, container entrypoint error, "
"or sandbox-server startup failure."
)
if healthy_count >= min_count:
return
await asyncio.sleep(2.0)
# Timed out: include allocation status detail to help debugging.
allocs = await self.nomad.get_job_allocations(self.config.job_id)
alloc_lines: List[str] = []
for alloc in allocs[:10]:
addr = alloc.http_address or "-"
line = f"{alloc.id[:8]} status={alloc.status.value} http={addr}"
detail = await self.nomad.get_allocation(alloc.id)
if isinstance(detail, dict):
summary = _summarize_alloc_detail(detail)
if summary:
line += f" detail={summary}"
alloc_lines.append(line)
hint = (
"Timed out waiting for healthy sandbox allocations.\n"
f"Job: {self.config.job_id}, desired_healthy: {min_count}\n"
"Allocations:\n - " + "\n - ".join(alloc_lines)
)
raise RuntimeError(hint)
async def _try_scale_up(self) -> bool:
"""Attempt to scale up the job."""
import time
async with self._scale_lock:
# Check cooldown
if time.time() - self._last_scale_time < self.config.scale_cooldown:
return False
# Check max containers
status = await self.nomad.get_job_status(self.config.job_id)
if status is None:
return False
current_count = status.count
if current_count >= self.config.max_containers:
logger.warning(f"Cannot scale up: already at max ({self.config.max_containers})")
return False
# Scale up
new_count = min(current_count + 1, self.config.max_containers)
logger.info(f"Scaling up from {current_count} to {new_count} containers")
scale_resp = await self.nomad.scale_job(
self.config.job_id,
count=new_count,
task_group="sandbox",
)
# Nomad may return non-JSON errors (e.g. plain text) with a status field.
if isinstance(scale_resp, dict) and scale_resp.get("status", 200) >= 400:
logger.warning(f"Scale request rejected: {scale_resp}")
self._last_scale_time = time.time()
return False
self._last_scale_time = time.time()
# Wait for new allocation in the background so contended acquires can still
# make progress (e.g. by grabbing slots released by other trajectories).
if self._scale_task is None or self._scale_task.done():
self._scale_task = asyncio.create_task(self._wait_for_scale(new_count))
return True
async def _wait_for_scale(self, desired_count: int) -> None:
try:
await self._wait_for_healthy_allocations(desired_count, timeout=60.0)
await self._refresh_slots()
except asyncio.CancelledError:
raise
except Exception as e:
logger.error(f"Failed to scale up: {e}")
async def _health_check_loop(self) -> None:
"""Background task to monitor container health."""
while True:
try:
await asyncio.sleep(self.config.health_check_interval)
await self._refresh_slots()
except asyncio.CancelledError:
break
except Exception as e:
logger.error(f"Health check error: {e}")
def get_stats(self) -> Dict[str, Any]:
"""Get pool statistics."""
slots_by_state = {}
for slot in self._slots.values():
state = slot.state.value
slots_by_state[state] = slots_by_state.get(state, 0) + 1
container_count = len({s.alloc_id for s in self._slots.values()}) if self._slots else 0
return {
"total_slots": self.total_slots,
"available_slots": self.available_slots,
"acquired_slots": self.acquired_slots,
"containers": container_count,
"slots_by_state": slots_by_state,
"started": self._started,
}
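# Illustrative get_stats() output (hypothetical numbers; acquired includes executing):
#   {"total_slots": 20, "available_slots": 13, "acquired_slots": 7, "containers": 2,
#    "slots_by_state": {"available": 13, "acquired": 6, "executing": 1}, "started": True}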

View File

@@ -1,159 +0,0 @@
"""
Slot abstraction for atropos-agent.
A Slot represents an isolated workspace for a single agent trajectory.
Slots are hosted on Nomad allocations and provide workspace isolation
via filesystem directories.
"""
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional
import uuid
class SlotState(Enum):
"""State of a slot in the pool."""
AVAILABLE = "available" # Ready to be acquired
ACQUIRED = "acquired" # Assigned to a trajectory
EXECUTING = "executing" # Currently executing a tool
RELEASING = "releasing" # Being released back to pool
ERROR = "error" # In error state
@dataclass
class Slot:
"""
An isolated workspace for a single agent trajectory.
Slots are the unit of scheduling - each trajectory runs in its own slot,
with an isolated workspace directory. Multiple slots share a container.
Attributes:
slot_id: Unique identifier for this slot (e.g., "slot_0")
alloc_id: Nomad allocation ID hosting this slot
container_addr: HTTP address of the sandbox server (e.g., "http://10.0.0.1:8080")
workspace_dir: Path to workspace in container (e.g., "/data/slot_0")
state: Current state of the slot
trajectory_id: ID of trajectory currently using this slot (if acquired)
metadata: Additional metadata
"""
slot_id: str
alloc_id: str
container_addr: str
workspace_dir: str = ""
state: SlotState = SlotState.AVAILABLE
trajectory_id: Optional[str] = None
metadata: Dict[str, Any] = field(default_factory=dict)
def __post_init__(self):
"""Set default workspace_dir if not provided."""
if not self.workspace_dir:
self.workspace_dir = f"/data/{self.slot_id}"
@property
def is_available(self) -> bool:
"""Check if slot is available for acquisition."""
return self.state == SlotState.AVAILABLE
@property
def is_acquired(self) -> bool:
"""Check if slot is currently acquired."""
return self.state in (SlotState.ACQUIRED, SlotState.EXECUTING)
def acquire(self, trajectory_id: Optional[str] = None) -> None:
"""
Mark slot as acquired by a trajectory.
Args:
trajectory_id: Optional ID of acquiring trajectory
"""
if not self.is_available:
raise RuntimeError(f"Cannot acquire slot {self.slot_id}: state is {self.state}")
self.state = SlotState.ACQUIRED
self.trajectory_id = trajectory_id or str(uuid.uuid4())
def start_execution(self, execution_id: Optional[str] = None) -> None:
"""Mark slot as executing."""
if self.state != SlotState.ACQUIRED:
raise RuntimeError(f"Cannot start execution on slot {self.slot_id}: state is {self.state}")
self.state = SlotState.EXECUTING
if execution_id:
self.metadata["current_execution_id"] = execution_id
def end_execution(self) -> None:
"""Mark execution as complete, return to acquired state."""
if self.state != SlotState.EXECUTING:
raise RuntimeError(f"Cannot end execution on slot {self.slot_id}: state is {self.state}")
self.state = SlotState.ACQUIRED
self.metadata.pop("current_execution_id", None)
def release(self) -> None:
"""Release slot back to available state."""
self.state = SlotState.AVAILABLE
self.trajectory_id = None
self.metadata.pop("current_execution_id", None)
def mark_error(self, error: str) -> None:
"""Mark slot as in error state."""
self.state = SlotState.ERROR
self.metadata["error"] = error
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary for serialization."""
return {
"slot_id": self.slot_id,
"alloc_id": self.alloc_id,
"container_addr": self.container_addr,
"workspace_dir": self.workspace_dir,
"state": self.state.value,
"trajectory_id": self.trajectory_id,
"metadata": self.metadata,
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "Slot":
"""Create from dictionary."""
return cls(
slot_id=data["slot_id"],
alloc_id=data["alloc_id"],
container_addr=data["container_addr"],
workspace_dir=data.get("workspace_dir", ""),
state=SlotState(data.get("state", "available")),
trajectory_id=data.get("trajectory_id"),
metadata=data.get("metadata", {}),
)
def __repr__(self) -> str:
return f"Slot({self.slot_id}, state={self.state.value}, alloc={self.alloc_id[:8]}...)"
def create_slots_for_allocation(
alloc_id: str,
container_addr: str,
num_slots: int = 10,
) -> list["Slot"]:
"""
Create slots for a Nomad allocation.
Args:
alloc_id: Nomad allocation ID
container_addr: HTTP address of sandbox server
num_slots: Number of slots to create
Returns:
List of Slot objects
"""
slots = []
for i in range(num_slots):
slot_id = f"slot_{i}"
slots.append(Slot(
slot_id=slot_id,
alloc_id=alloc_id,
container_addr=container_addr,
workspace_dir=f"/data/{slot_id}",
))
return slots
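# --- Illustrative sketch (not part of the original file) ---
# Walking a slot through its lifecycle with the helpers above; the
# allocation id and container address are made-up values.
slots = create_slots_for_allocation(
    alloc_id="a1b2c3d4-0000-0000-0000-000000000000",
    container_addr="http://10.0.0.1:8080",
    num_slots=2,
)
slot = slots[0]
assert slot.is_available
slot.acquire(trajectory_id="traj-1")
slot.start_execution(execution_id="exec-1")
assert slot.state is SlotState.EXECUTING
slot.end_execution()
slot.release()
assert slot.is_available and slot.trajectory_id is None
# The dict form round-trips for serialization.
assert Slot.from_dict(slot.to_dict()).slot_id == slot.slot_id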

View File

@@ -1,2 +0,0 @@
"""Terminal helpers for stateful sandbox interactions."""

View File

@@ -1,115 +0,0 @@
from __future__ import annotations
import json
from typing import Any
import pyte
class AsciinemaStreamDecoder:
def __init__(self, *, default_width: int = 80, default_height: int = 24) -> None:
self._default_width = max(1, int(default_width))
self._default_height = max(1, int(default_height))
self._buffer = ""
self._has_header = False
self.width = self._default_width
self.height = self._default_height
self._screen = pyte.Screen(self.width, self.height)
self._stream = pyte.Stream(self._screen)
def reset(self) -> None:
self._buffer = ""
self._has_header = False
self.width = self._default_width
self.height = self._default_height
self._screen = pyte.Screen(self.width, self.height)
self._stream = pyte.Stream(self._screen)
def feed(self, chunk: str | bytes) -> None:
if not chunk:
return
if isinstance(chunk, bytes):
chunk = chunk.decode("utf-8", errors="replace")
self._buffer += chunk
while True:
line, sep, rest = self._buffer.partition("\n")
if not sep:
break
self._buffer = rest
line = line.strip()
if not line:
continue
parsed = self._parse_json_line(line)
if parsed is None:
continue
if not self._has_header:
if isinstance(parsed, dict):
self._init_from_header(parsed)
continue
if isinstance(parsed, list):
self._has_header = True
self._apply_event(parsed)
continue
continue
if isinstance(parsed, list):
self._apply_event(parsed)
def render(self) -> str:
return "\n".join(self._screen.display)
def _parse_json_line(self, line: str) -> Any | None:
try:
return json.loads(line)
except json.JSONDecodeError:
return None
def _init_from_header(self, header: dict[str, Any]) -> None:
width = _coerce_int(
header.get("width") or header.get("columns") or header.get("cols"),
self._default_width,
)
height = _coerce_int(
header.get("height") or header.get("rows") or header.get("lines"),
self._default_height,
)
self.width = max(1, width)
self.height = max(1, height)
self._screen = pyte.Screen(self.width, self.height)
self._stream = pyte.Stream(self._screen)
self._has_header = True
def _apply_event(self, event: list[Any]) -> None:
if len(event) < 2:
return
event_type = event[1]
payload = event[2] if len(event) > 2 else ""
if event_type == "o":
if isinstance(payload, str):
self._stream.feed(payload)
elif event_type == "r":
width, height = _parse_resize(payload)
if width and height:
self.width = width
self.height = height
self._screen.resize(width, height)
def _coerce_int(value: Any, default: int) -> int:
try:
return int(value)
except (TypeError, ValueError):
return int(default)
def _parse_resize(payload: Any) -> tuple[int, int]:
if isinstance(payload, str) and "x" in payload:
left, right = payload.lower().split("x", 1)
return _coerce_int(left, 0), _coerce_int(right, 0)
if isinstance(payload, dict):
width = _coerce_int(payload.get("width") or payload.get("columns") or payload.get("cols"), 0)
height = _coerce_int(payload.get("height") or payload.get("rows") or payload.get("lines"), 0)
return width, height
if isinstance(payload, list) and len(payload) >= 2:
return _coerce_int(payload[0], 0), _coerce_int(payload[1], 0)
return 0, 0
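# --- Illustrative sketch (not part of the original file) ---
# Feeding a tiny asciinema v2 stream (header line, then "o" output events)
# and rendering the screen; requires `pyte` as imported above.
decoder = AsciinemaStreamDecoder(default_width=80, default_height=24)
decoder.feed('{"version": 2, "width": 20, "height": 4}\n')
decoder.feed('[0.1, "o", "hello"]\n[0.2, "o", "\\r\\nworld"]\n')
lines = decoder.render().splitlines()
assert lines[0].startswith("hello") and lines[1].startswith("world")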

View File

@@ -1,26 +0,0 @@
"""
Tool abstractions for atropos-agent.
Provides base Tool class and common tool implementations.
"""
from .base import Tool, ToolCall, ToolRegistry, ToolResult, ToolSchema
from .build_registry import build_tool_registry
from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
from .terminal_stateful_tool import TerminalStatefulTool
from .tmux_tool import TmuxTool
__all__ = [
"Tool",
"ToolCall",
"ToolRegistry",
"ToolResult",
"ToolSchema",
"BashTool",
"ReadFileTool",
"WriteFileTool",
"TerminalTool",
"TerminalStatefulTool",
"TmuxTool",
"build_tool_registry",
]

View File

@@ -1,423 +0,0 @@
"""
Base Tool abstraction for atropos-agent.
Tools follow a simple pattern:
1. Define schema (name, description, parameters)
2. Implement execute() method
3. Return ToolResult with output/error
Tool calls use Hermes-style XML tags:
<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>
"""
import json
import re
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Literal, Optional
from pydantic import BaseModel, Field
@dataclass
class ToolSchema:
"""JSON Schema for a tool's parameters."""
name: str
description: str
parameters: Dict[str, Any] = field(default_factory=dict)
required: List[str] = field(default_factory=list)
external: bool = False # Whether the tool must be executed via an external ToolServer (secret proxy) and not inside the sandbox.
def to_dict(self) -> Dict[str, Any]:
"""Convert to OpenAI-compatible function schema."""
return {
"type": "function",
"function": {
"name": self.name,
"description": self.description,
"parameters": {
"type": "object",
"properties": self.parameters,
"required": self.required,
},
},
}
def to_prompt_description(self) -> str:
"""Convert to human-readable description for system prompt."""
params_desc = []
for name, spec in self.parameters.items():
req = "(required)" if name in self.required else "(optional)"
desc = spec.get("description", "")
param_type = spec.get("type", "string")
params_desc.append(f" - {name} ({param_type}) {req}: {desc}")
params_str = "\n".join(params_desc) if params_desc else " (no parameters)"
return f"**{self.name}**: {self.description}\nParameters:\n{params_str}"
@dataclass
class ToolCall:
"""A parsed tool call from model output."""
name: str
arguments: Dict[str, Any]
raw_text: str = "" # Original XML/JSON text
uniq_id: str = field(default_factory=lambda: str(uuid.uuid4())) # Unique tool-call id for traceability/reconstruction.
@classmethod
def parse_from_text(cls, text: str) -> List["ToolCall"]:
"""
Extract tool calls from text using Hermes-style XML tags.
Supported formats (STRICT: requires well-formed closing tags):
- Hermes JSON wrapper:
<tool_call>{"name": "...", "arguments": {...}}</tool_call>
- GLM/llama.cpp style:
<tool_call>terminal{"command":"ls -la"}</tool_call>
"""
calls: List["ToolCall"] = []
if not text:
return calls
def _append_from_payload(*, name: str, arguments: Dict[str, Any], raw: str, uniq_id: Optional[str] = None) -> None:
if not isinstance(name, str) or not name:
return
if not isinstance(arguments, dict):
return
calls.append(
cls(
name=name,
arguments=arguments,
raw_text=raw,
uniq_id=uniq_id or str(uuid.uuid4()),
)
)
# STRICT parsing: only accept well-formed <tool_call>...</tool_call> blocks.
pattern = r"<tool_call>\s*(.*?)\s*</tool_call>"
for inner in re.findall(pattern, text, re.DOTALL):
cleaned = (inner or "").strip()
if not cleaned:
continue
# Hermes JSON wrapper.
if cleaned.startswith("{"):
try:
data = json.loads(cleaned)
except json.JSONDecodeError:
continue
uniq_id = data.get("uniq_id") or data.get("id") or None
_append_from_payload(
name=data.get("name", ""),
arguments=data.get("arguments", {}),
raw=inner,
uniq_id=uniq_id,
)
continue
# GLM/llama.cpp style: terminal{...}
m = re.match(r"^\s*([A-Za-z0-9_.:\\-]+)\s*(\{.*\})\s*$", cleaned, re.DOTALL)
if not m:
continue
name = m.group(1)
args_text = m.group(2)
try:
args = json.loads(args_text)
except json.JSONDecodeError:
continue
_append_from_payload(name=name, arguments=args, raw=inner)
return calls
@classmethod
def has_tool_call(cls, text: str) -> bool:
"""Check if text contains any tool calls."""
return bool(re.search(r"<tool_call>", text))
@dataclass
class ToolResult:
"""Result from executing a tool."""
success: bool
output: str = ""
error: str = ""
metadata: Dict[str, Any] = field(default_factory=dict)
uniq_id: Optional[str] = None # Should match ToolCall.uniq_id for async execution tracking.
def to_xml(self) -> str:
"""Format as XML for including in conversation."""
data = {
"success": self.success,
"output": self.output,
}
if self.uniq_id:
data["uniq_id"] = self.uniq_id
if self.error:
data["error"] = self.error
if self.metadata:
data["metadata"] = self.metadata
return f"<tool_response>{json.dumps(data)}</tool_response>"
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary."""
return {
"success": self.success,
"output": self.output,
"error": self.error,
"metadata": self.metadata,
"uniq_id": self.uniq_id,
}
class Tool(ABC):
"""
Abstract base class for tools.
Subclasses must implement:
- schema: ToolSchema describing the tool
- execute(): async method that performs the tool action
"""
@property
@abstractmethod
def schema(self) -> ToolSchema:
"""Return the tool's schema."""
pass
@property
def name(self) -> str:
"""Tool name (from schema)."""
return self.schema.name
@abstractmethod
async def execute(self, **kwargs) -> ToolResult:
"""
Execute the tool with given arguments.
Args:
**kwargs: Tool-specific arguments
Returns:
ToolResult with success/failure and output
"""
pass
def is_available(self) -> tuple[bool, str | None]:
"""
Return whether this tool should be exposed/executable in the current process.
Tools that depend on optional binaries/services/env vars can override this
to avoid advertising a tool that will fail at runtime.
"""
return True, None
async def __call__(self, **kwargs) -> ToolResult:
"""Allow calling tool instance directly."""
return await self.execute(**kwargs)
# Note: For external ToolServer tools (executed in an external process) and tools preinstalled in envs, this registry only wraps the declarations; it does not run them in-process.
class ToolRegistry:
"""Registry of available tools."""
def __init__(self):
self._tools: Dict[str, Tool] = {}
def register(self, tool: Tool) -> None:
"""Register a tool."""
self._tools[tool.name] = tool
def get(self, name: str) -> Optional[Tool]:
"""Get a tool by name."""
return self._tools.get(name)
def list_tools(self) -> List[Tool]:
"""List all registered tools."""
return list(self._tools.values())
def get_schemas(self) -> List[ToolSchema]:
"""Get schemas for all registered tools."""
return [tool.schema for tool in self._tools.values()]
def get_prompt_description(self) -> str:
"""Generate tool descriptions for system prompt."""
descriptions = [tool.schema.to_prompt_description() for tool in self._tools.values()]
return "\n\n".join(descriptions)
def get_prompt_tool_definitions_json(self) -> str:
"""
Return a Hermes-style JSON list of tool definitions for use inside a `<tools>...</tools>` block.
Hermes trajectories historically use a simplified schema list:
[{"name": ..., "description": ..., "parameters": {...}, "required": null}, ...]
"""
formatted: List[Dict[str, Any]] = []
for tool in self._tools.values():
fn = tool.schema.to_dict().get("function", {})
formatted.append(
{
"name": fn.get("name", tool.name),
"description": fn.get("description", ""),
"parameters": fn.get("parameters", {}),
# Keep parity with Hermes saved trajectories (required is typically null there).
"required": None,
}
)
return json.dumps(formatted, ensure_ascii=False)
async def execute(self, call: ToolCall) -> ToolResult:
"""Execute a tool call."""
tool = self.get(call.name)
if tool is None:
return ToolResult(
success=False,
error=f"Unknown tool: {call.name}",
uniq_id=call.uniq_id,
)
try:
result = await tool.execute(**call.arguments)
if result.uniq_id is None:
result.uniq_id = call.uniq_id
return result
except Exception as e:
return ToolResult(
success=False,
error=f"Tool execution error: {str(e)}",
uniq_id=call.uniq_id,
)
# =============================================================================
# FastAPI / transport models
# =============================================================================
class ToolCallPayload(BaseModel):
name: str
arguments: Dict[str, Any] = Field(default_factory=dict)
uniq_id: str
@classmethod
def from_tool_call(cls, call: ToolCall) -> "ToolCallPayload":
return cls(name=call.name, arguments=call.arguments, uniq_id=call.uniq_id)
def to_tool_call(self) -> ToolCall:
return ToolCall(name=self.name, arguments=self.arguments, uniq_id=self.uniq_id)
class ToolResultPayload(BaseModel):
success: bool
output: str = ""
error: str = ""
metadata: Dict[str, Any] = Field(default_factory=dict)
uniq_id: Optional[str] = None
@classmethod
def from_tool_result(cls, result: ToolResult) -> "ToolResultPayload":
return cls(
success=result.success,
output=result.output,
error=result.error,
metadata=result.metadata,
uniq_id=result.uniq_id,
)
def to_tool_result(self) -> ToolResult:
return ToolResult(
success=self.success,
output=self.output,
error=self.error,
metadata=self.metadata,
uniq_id=self.uniq_id,
)
class ToolExecutorExecuteRequest(BaseModel):
trajectory_id: str
tool: ToolCallPayload
timeout_s: Optional[float] = None
class ToolExecutorReleaseRequest(BaseModel):
trajectory_id: str
reset_workspace: bool = False
class ToolServerExecuteRequest(BaseModel):
trajectory_id: Optional[str] = None
tool: ToolCallPayload
timeout_s: Optional[float] = None
# Optional sandbox context for tools that need workspace artifacts.
# This is set by ToolExecutor and is NOT model-controlled.
slot_id: Optional[str] = None
container_addr: Optional[str] = None
# =============================================================================
# Artifact transport models
# =============================================================================
class ArtifactReadRequestPayload(BaseModel):
trajectory_id: str
path: str
encoding: Literal["text", "base64"] = "text"
max_bytes: Optional[int] = None
include_sha256: bool = False
class ArtifactReadResponsePayload(BaseModel):
success: bool
content: str = ""
error: str = ""
encoding: str = "text"
truncated: bool = False
bytes: int = 0
file_size: Optional[int] = None
path: str = ""
mime: Optional[str] = None
sha256: Optional[str] = None
class ArtifactListRequestPayload(BaseModel):
trajectory_id: str
path: str = "."
recursive: bool = False
max_entries: Optional[int] = None
class ArtifactListEntryPayload(BaseModel):
path: str
is_dir: bool
size: int
mtime: float
class ArtifactListResponsePayload(BaseModel):
success: bool
entries: List[ArtifactListEntryPayload] = Field(default_factory=list)
truncated: bool = False
error: str = ""
class ArtifactArchiveRequestPayload(BaseModel):
trajectory_id: str
path: str = "."
format: Literal["tar.gz", "tgz"] = "tar.gz"
max_bytes: Optional[int] = None
max_entries: Optional[int] = None
class ArtifactArchiveResponsePayload(BaseModel):
success: bool
content: str = ""
error: str = ""
encoding: str = "base64"
format: str = "tar.gz"
bytes: int = 0
entry_count: int = 0
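# --- Illustrative sketch (not part of the original file) ---
# Round-tripping the Hermes-style XML protocol defined above: parse a
# <tool_call> block out of model text, then format a ToolResult for the
# next conversation turn.
text = '<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>'
calls = ToolCall.parse_from_text(text)
assert len(calls) == 1 and calls[0].name == "bash"
result = ToolResult(success=True, output="file.txt\n", uniq_id=calls[0].uniq_id)
print(result.to_xml())
# -> <tool_response>{"success": true, "output": "file.txt\n", "uniq_id": "..."}</tool_response>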

View File

@@ -1,64 +0,0 @@
"""
Unified tool registry builder for Hermes-Agent Atropos integration.
This composes:
- sandbox tool stubs (terminal/bash/read_file/write_file + stateful terminal/tmux)
- Hermes external tools (web/vision/image/moa/skills/browser), executed via ToolServer
ToolExecutor only needs the schema + `external` routing bit; ToolServer executes
the external tools via Hermes' existing implementations.
"""
from __future__ import annotations
from typing import List, Optional
from .base import ToolRegistry
from .hermes_external_tools import build_external_tools
from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
from .terminal_stateful_tool import TerminalStatefulTool
from .tmux_tool import TmuxTool
from .toolset_resolver import resolve_multiple_toolsets
def build_tool_registry(
*,
enabled_toolsets: Optional[List[str]] = None,
disabled_toolsets: Optional[List[str]] = None,
tool_server_url: Optional[str] = None,
) -> ToolRegistry:
"""
Build a ToolRegistry for AgentEnv / ToolExecutor / ToolServer.
If `tool_server_url` is not provided, external tools will be omitted so we do
not advertise tools that cannot execute.
"""
enabled_toolsets = enabled_toolsets or ["default"]
# Resolve tool names using Hermes toolsets plus Atropos additions.
selected = set(resolve_multiple_toolsets(enabled_toolsets))
if disabled_toolsets:
selected -= set(resolve_multiple_toolsets(disabled_toolsets))
reg = ToolRegistry()
# Always register sandbox tools if selected.
sandbox_by_name = {
"terminal": TerminalTool(),
"bash": BashTool(),
"read_file": ReadFileTool(),
"write_file": WriteFileTool(),
"terminal_stateful": TerminalStatefulTool(),
"tmux": TmuxTool(),
}
for name, tool in sandbox_by_name.items():
if name in selected:
reg.register(tool)
# External tools: only include when ToolServer is configured.
if tool_server_url:
for tool in build_external_tools(selected_tool_names=selected):
if tool.name in selected:
reg.register(tool)
return reg
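# --- Illustrative sketch (not part of the original file) ---
# Typical call, assuming Hermes-Agent's `toolsets` module is importable so
# toolset names can resolve. With no `tool_server_url`, only sandbox tools
# are registered, so nothing is advertised that cannot execute.
reg = build_tool_registry(enabled_toolsets=["filesystem"])
assert {"read_file", "write_file"} <= {t.name for t in reg.list_tools()}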

View File

@@ -1,90 +0,0 @@
"""
Hermes external tool adapter for Atropos ToolServer.
These tools reuse Hermes-Agent's existing tool runner (`model_tools.handle_function_call`)
so we don't duplicate external tool implementations.
Important:
- These are marked `external=True` and should be executed ONLY by ToolServer.
- We run `handle_function_call` in a worker thread because the Hermes implementation
uses `asyncio.run()` internally for some async tools (web_extract, vision, MoA, etc).
"""
from __future__ import annotations
import asyncio
import json
from typing import Any, Dict, List, Optional
import model_tools
from .base import Tool, ToolResult, ToolSchema
def _schema_from_openai_tool_dict(tool: Dict[str, Any], *, external: bool) -> ToolSchema:
fn = tool.get("function") or {}
name = str(fn.get("name") or "")
description = str(fn.get("description") or "")
params = fn.get("parameters") or {}
properties = params.get("properties") or {}
required = params.get("required") or []
if not isinstance(required, list):
required = []
return ToolSchema(
name=name,
description=description,
parameters=dict(properties),
required=[str(x) for x in required if isinstance(x, (str, int))],
external=external,
)
class HermesExternalTool(Tool):
def __init__(self, schema: ToolSchema):
self._schema = schema
@property
def schema(self) -> ToolSchema:
return self._schema
async def execute(self, task_id: Optional[str] = None, **kwargs: Any) -> ToolResult:
# `model_tools.handle_function_call` returns a JSON string (success or error).
# Run in a thread because some Hermes tool handlers call `asyncio.run()`.
raw = await asyncio.to_thread(model_tools.handle_function_call, self.name, kwargs, task_id)
try:
parsed = json.loads(raw)
except Exception:
# Keep as plain string.
return ToolResult(success=True, output=str(raw))
if isinstance(parsed, dict) and parsed.get("error"):
return ToolResult(success=False, error=str(parsed.get("error")), output="")
return ToolResult(success=True, output=json.dumps(parsed, ensure_ascii=False))
def build_external_tools(
*,
selected_tool_names: Optional[set[str]] = None,
) -> List[HermesExternalTool]:
"""
Build external tool wrappers from Hermes tool declarations.
Filters out sandbox-oriented tools (e.g. `terminal`) since those should run
inside the sandbox via ToolExecutor.
"""
# IMPORTANT: Hermes' `model_tools.get_tool_definitions()` only understands Hermes toolsets.
# Atropos envs add extra toolsets (filesystem/sandbox/stateful). To avoid noisy "Unknown toolset"
# prints and accidental filtering, we fetch ALL Hermes tool definitions here and filter by name.
tools = model_tools.get_tool_definitions(enabled_toolsets=None, disabled_toolsets=None, quiet_mode=True)
wrappers: List[HermesExternalTool] = []
for t in tools:
schema = _schema_from_openai_tool_dict(t, external=True)
if schema.name in {"terminal"}:
continue
if selected_tool_names is not None and schema.name not in selected_tool_names:
continue
wrappers.append(HermesExternalTool(schema))
return wrappers
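# --- Illustrative sketch (not part of the original file) ---
# Converting one OpenAI-style tool declaration into a ToolSchema with the
# helper above; the declaration itself is a made-up example.
decl = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
schema = _schema_from_openai_tool_dict(decl, external=True)
assert schema.name == "web_search" and schema.external
assert schema.required == ["query"]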

View File

@@ -1,99 +0,0 @@
"""
Sandbox tool stubs for Atropos ToolExecutor.
These tools are executed inside the sandbox containers via:
ToolExecutor -> SlotPool -> sandbox_server.py
They intentionally do NOT execute anything on the host process. If they are
called directly (outside ToolExecutor), they return a clear error.
"""
from __future__ import annotations
from typing import Optional
from .base import Tool, ToolResult, ToolSchema
class TerminalTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="terminal",
description=(
"Execute a command inside the sandbox slot workspace and return stdout/stderr. "
"Filesystem persists within a trajectory slot. Background processes are not supported "
"in stateless mode. Commands run under POSIX /bin/sh and each tool call runs in a fresh "
"shell (no persisted env vars). Avoid bash-only syntax like `source`; prefer `. .venv/bin/activate` "
"or invoke `.venv/bin/python ...` directly."
),
parameters={
"command": {"type": "string", "description": "The command to execute"},
"timeout": {
"type": "integer",
"description": "Command timeout in seconds (optional).",
"minimum": 1,
},
"background": {
"type": "boolean",
"description": "Not supported in sandbox terminal (always false).",
"default": False,
},
},
required=["command"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(
success=False,
error="terminal must be executed via ToolExecutor inside the sandbox",
)
class BashTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="bash",
description="Execute a bash command inside the sandbox slot workspace.",
parameters={"command": {"type": "string", "description": "The bash command to execute"}},
required=["command"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(success=False, error="bash must be executed via ToolExecutor inside the sandbox")
class ReadFileTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="read_file",
description="Read a file from the sandbox slot workspace.",
parameters={"path": {"type": "string", "description": "Path to the file"}},
required=["path"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(success=False, error="read_file must be executed via ToolExecutor inside the sandbox")
class WriteFileTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="write_file",
description="Write a file into the sandbox slot workspace.",
parameters={
"path": {"type": "string", "description": "Path to the file"},
"content": {"type": "string", "description": "File content"},
},
required=["path", "content"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(success=False, error="write_file must be executed via ToolExecutor inside the sandbox")

View File

@@ -1,45 +0,0 @@
"""
Stateful terminal tool schema.
This is a sandbox tool that routes to the sandbox server as `bash_stateful`
via ToolExecutor mapping. It exists to expose an explicit, opt-in terminal
primitive suitable for stateful workflows (e.g. tmux sessions / TUIs).
"""
from __future__ import annotations
from typing import Optional
from .base import Tool, ToolResult, ToolSchema
class TerminalStatefulTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="terminal_stateful",
description=(
"Execute a command in the sandbox, allowing stateful/background processes to persist "
"across tool calls within the same trajectory slot (e.g. tmux sessions). "
"Use sparingly; output is still non-interactive."
),
parameters={
"command": {"type": "string", "description": "The command to execute"},
"timeout": {
"type": "integer",
"description": "Command timeout in seconds (optional).",
"minimum": 1,
},
},
required=["command"],
)
def is_available(self) -> tuple[bool, str | None]:
return True, None
async def execute(self, command: str, timeout: Optional[int] = None) -> ToolResult:
_ = (command, timeout)
return ToolResult(
success=False,
error="terminal_stateful must be executed via ToolExecutor inside the sandbox",
)

View File

@@ -1,89 +0,0 @@
"""
tmux tool schema (sandbox).
This is a sandbox tool that provides basic tmux session control suitable for
TUI-style terminal interactions:
- send keys (arrow keys, enter, etc.)
- capture the current screen buffer
Execution is routed by ToolExecutor to the sandbox server's `tmux` backend.
"""
from __future__ import annotations
from typing import Any, Dict, Optional
from .base import Tool, ToolResult, ToolSchema
class TmuxTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="tmux",
description=(
"Control a per-trajectory tmux session inside the sandbox (stateful terminal). "
"Use this for TUI-style interactions: send keys and capture the current screen."
),
parameters={
"action": {
"type": "string",
"description": "Action to perform: start | send_keys | stream | stop.",
"enum": ["start", "send_keys", "stream", "stop", "capture"],
},
"keys": {
"description": "Keys to send (string or list of strings) when action=send_keys.",
},
"block": {
"type": "boolean",
"description": "If true, wait for shell command completion (only valid at a shell prompt).",
"default": False,
},
"min_wait_s": {
"type": "number",
"description": "For non-blocking send_keys, sleep this long after sending keys (seconds).",
"default": 0.0,
},
"max_wait_s": {
"type": "number",
"description": "For blocking send_keys, max time to wait for completion (seconds).",
},
"capture_entire": {
"type": "boolean",
"description": "Deprecated. Streaming is preferred.",
"default": False,
},
"max_bytes": {
"type": "integer",
"description": "Max bytes to return per stream call.",
},
"reset": {
"type": "boolean",
"description": "If true, reset stream offset to the beginning of the asciinema recording.",
"default": False,
},
"pane_width": {
"type": "integer",
"description": "Pane width for action=start (columns).",
"minimum": 20,
},
"pane_height": {
"type": "integer",
"description": "Pane height for action=start (rows).",
"minimum": 10,
},
},
required=["action"],
)
def is_available(self) -> tuple[bool, str | None]:
return True, None
async def execute(self, **kwargs: Dict[str, Any]) -> ToolResult:
# This tool is intended to be executed via ToolExecutor -> sandbox server.
# We keep a safe fallback for non-sandbox contexts.
action = str(kwargs.get("action") or "").strip()
return ToolResult(
success=False,
error=f"tmux tool must be executed in the sandbox (got action={action!r})",
)
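# --- Illustrative sketch (not part of the original file) ---
# What a model-emitted tmux call carries, and what the stub's fallback
# does when invoked outside the sandbox (ToolExecutor normally routes it
# to the sandbox server instead).
import asyncio

args = {"action": "send_keys", "keys": ["echo hi", "Enter"], "block": True}
res = asyncio.run(TmuxTool().execute(**args))
assert not res.success and "sandbox" in res.error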

View File

@@ -1,500 +0,0 @@
"""
ToolExecutor - queued, batched tool dispatch for multiplexed agent trajectories.
This component is responsible for:
- Maintaining trajectory -> Slot affinity (workspace continuity)
- Batching sandbox tool calls across trajectories to maximize container utilization
- Routing external tools (ToolSchema.external=True) to a ToolServer (Phase 4.5)
Sandbox tools are executed directly via the backend:
- bash (also aliased as `terminal`)
- read_file / write_file
- bash_stateful / tmux (stateful variants)
External tools are proxied to the ToolServer when it is configured.
"""
from __future__ import annotations
import asyncio
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
import httpx
from .base import (
ArtifactArchiveRequestPayload,
ArtifactArchiveResponsePayload,
ArtifactListRequestPayload,
ArtifactListResponsePayload,
ArtifactReadRequestPayload,
ArtifactReadResponsePayload,
ToolCall,
ToolCallPayload,
ToolRegistry,
ToolResult,
ToolResultPayload,
ToolServerExecuteRequest,
)
from ..backends.base import ToolBackend
from ..slots import Slot
@dataclass
class ToolExecutorConfig:
batch_window_ms: int = 20
max_batch_size: int = 200
allow_network: bool = True
require_sandbox: bool = False
require_stateful_sandbox: bool = False
tool_server_url: Optional[str] = None
tool_server_token: Optional[str] = None
@dataclass
class _QueuedToolRequest:
trajectory_id: str
call: ToolCall
timeout_s: Optional[float]
future: asyncio.Future
class ToolExecutor:
def __init__(
self,
backend: ToolBackend,
tools: ToolRegistry,
config: Optional[ToolExecutorConfig] = None,
) -> None:
self.backend = backend
self.tools = tools
self.config = config or ToolExecutorConfig()
self._queue: asyncio.Queue[Optional[_QueuedToolRequest]] = asyncio.Queue()
self._task: Optional[asyncio.Task] = None
self._stopping = asyncio.Event()
self._slots_lock = asyncio.Lock()
self._slot_by_trajectory: Dict[str, Slot] = {}
self._tool_server_client: Optional[httpx.AsyncClient] = None
self._tool_server_lock = asyncio.Lock()
# lightweight stats for status endpoints
self.total_requests: int = 0
self.total_errors: int = 0
self.latencies_s: List[float] = []
async def start(self) -> None:
if self._task is None:
self._task = asyncio.create_task(self._run_loop())
def queue_size(self) -> int:
return self._queue.qsize()
async def close(self) -> None:
self._stopping.set()
await self._queue.put(None)
if self._task:
await self._task
self._task = None
client = self._tool_server_client
self._tool_server_client = None
if client is not None:
await client.aclose()
# Best-effort release any remaining slots.
async with self._slots_lock:
slots = list(self._slot_by_trajectory.items())
self._slot_by_trajectory.clear()
for _, slot in slots:
try:
await self.backend.release(slot, reset_workspace=False)
except Exception:
pass
async def execute(
self,
trajectory_id: str,
call: ToolCall,
timeout_s: Optional[float] = None,
) -> ToolResult:
if self._task is None:
raise RuntimeError("ToolExecutor not started (call start() first)")
# Allow tool args to suggest a timeout (Hermes-compatible terminal tool),
# but never let the model choose "infinite" timeouts.
if timeout_s is None:
raw_timeout = call.arguments.get("timeout")
if isinstance(raw_timeout, (int, float)):
timeout_s = float(raw_timeout)
if timeout_s is not None:
timeout_s = max(1.0, min(float(timeout_s), 600.0))
loop = asyncio.get_running_loop()
fut: asyncio.Future = loop.create_future()
started = time.perf_counter()
await self._queue.put(_QueuedToolRequest(trajectory_id=trajectory_id, call=call, timeout_s=timeout_s, future=fut))
try:
result: ToolResult = await fut
return result
finally:
self.latencies_s.append(time.perf_counter() - started)
async def release_trajectory(self, trajectory_id: str, reset_workspace: bool = False) -> None:
async with self._slots_lock:
slot = self._slot_by_trajectory.pop(trajectory_id, None)
if slot is not None:
await self.backend.release(slot, reset_workspace=reset_workspace)
async def _get_slot_if_present(self, trajectory_id: str) -> Optional[Slot]:
async with self._slots_lock:
return self._slot_by_trajectory.get(trajectory_id)
# ---------------------------------------------------------------------
# Artifact helpers (optional)
# ---------------------------------------------------------------------
async def read_artifact(self, req: ArtifactReadRequestPayload) -> ArtifactReadResponsePayload:
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is None:
return ArtifactReadResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
data = await self.backend.read_artifact(
slot,
req.path,
encoding=req.encoding,
max_bytes=req.max_bytes,
include_sha256=req.include_sha256,
)
if isinstance(data, dict):
data = dict(data)
data.pop("http_status", None)
try:
return ArtifactReadResponsePayload(**(data or {}))
except Exception as e:
return ArtifactReadResponsePayload(success=False, error=f"Invalid artifact read response: {e}")
async def list_artifacts(self, req: ArtifactListRequestPayload) -> ArtifactListResponsePayload:
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is None:
return ArtifactListResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
data = await self.backend.list_artifacts(
slot,
req.path,
recursive=req.recursive,
max_entries=req.max_entries,
)
if isinstance(data, dict):
data = dict(data)
data.pop("http_status", None)
try:
return ArtifactListResponsePayload(**(data or {}))
except Exception as e:
return ArtifactListResponsePayload(success=False, error=f"Invalid artifact list response: {e}")
async def archive_artifacts(self, req: ArtifactArchiveRequestPayload) -> ArtifactArchiveResponsePayload:
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is None:
return ArtifactArchiveResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
data = await self.backend.archive_artifacts(
slot,
req.path,
archive_format=req.format,
max_bytes=req.max_bytes,
max_entries=req.max_entries,
)
if isinstance(data, dict):
data = dict(data)
data.pop("http_status", None)
try:
return ArtifactArchiveResponsePayload(**(data or {}))
except Exception as e:
return ArtifactArchiveResponsePayload(success=False, error=f"Invalid artifact archive response: {e}")
async def _get_or_acquire_slot(self, trajectory_id: str) -> Slot:
async with self._slots_lock:
existing = self._slot_by_trajectory.get(trajectory_id)
if existing is not None:
return existing
slot = await self.backend.acquire(trajectory_id)
async with self._slots_lock:
existing = self._slot_by_trajectory.get(trajectory_id)
if existing is not None:
# Another coroutine won the race; return its slot.
await self.backend.release(slot, reset_workspace=False)
return existing
self._slot_by_trajectory[trajectory_id] = slot
return slot
async def _run_loop(self) -> None:
pending: List[_QueuedToolRequest] = []
deadline: Optional[float] = None
batch_window_s = max(0.0, self.config.batch_window_ms / 1000.0)
max_batch = max(1, self.config.max_batch_size)
while True:
if self._stopping.is_set() and self._queue.empty() and not pending:
break
timeout = None
if pending and deadline is not None:
timeout = max(0.0, deadline - time.perf_counter())
try:
item = await asyncio.wait_for(self._queue.get(), timeout=timeout)
if item is None:
continue
pending.append(item)
if len(pending) == 1:
deadline = time.perf_counter() + batch_window_s
if len(pending) < max_batch:
continue
except asyncio.TimeoutError:
# batch window elapsed
pass
if not pending:
deadline = None
continue
batch = pending
pending = []
deadline = None
await self._execute_batch(batch)
async def _get_tool_server_client(self) -> httpx.AsyncClient:
url = self.config.tool_server_url
if not url:
raise RuntimeError("ToolServer not configured")
if self._tool_server_client is not None:
return self._tool_server_client
async with self._tool_server_lock:
if self._tool_server_client is None:
self._tool_server_client = httpx.AsyncClient(base_url=url.rstrip("/"))
return self._tool_server_client
def _tool_server_headers(self) -> Dict[str, str]:
token = self.config.tool_server_token
if not token:
return {}
return {"Authorization": f"Bearer {token}"}
async def _execute_external(self, req: _QueuedToolRequest) -> ToolResult:
client = await self._get_tool_server_client()
slot_id: Optional[str] = None
container_addr: Optional[str] = None
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is not None:
slot_id = slot.slot_id
container_addr = slot.container_addr
payload = ToolServerExecuteRequest(
trajectory_id=req.trajectory_id,
tool=ToolCallPayload.from_tool_call(req.call),
timeout_s=req.timeout_s,
slot_id=slot_id,
container_addr=container_addr,
)
try:
resp = await client.post(
"/execute",
json=payload.model_dump(),
headers=self._tool_server_headers(),
timeout=req.timeout_s,
)
resp.raise_for_status()
data = resp.json()
parsed = ToolResultPayload(**data)
result = parsed.to_tool_result()
if result.uniq_id is None:
result.uniq_id = req.call.uniq_id
return result
except Exception as e:
return ToolResult(
success=False,
error=f"External tool failed: {e}",
uniq_id=req.call.uniq_id,
)
async def _execute_batch(self, batch: List[_QueuedToolRequest]) -> None:
# Resolve tool schemas once per request and separate sandbox/external/unknown.
sandbox_items: List[_QueuedToolRequest] = []
external_items: List[_QueuedToolRequest] = []
unknown_items: List[_QueuedToolRequest] = []
for it in batch:
tool = self.tools.get(it.call.name)
if tool is None:
unknown_items.append(it)
continue
schema = tool.schema
if not schema.external:
sandbox_items.append(it)
else:
external_items.append(it)
for it in unknown_items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"Unknown tool: {it.call.name}",
uniq_id=it.call.uniq_id,
)
)
if external_items:
if not self.config.tool_server_url:
for it in external_items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"External tool not available (ToolServer not configured): {it.call.name}",
uniq_id=it.call.uniq_id,
)
)
else:
results = await asyncio.gather(*[self._execute_external(it) for it in external_items])
for it, res in zip(external_items, results):
self.total_requests += 1
if not getattr(res, "success", False):
self.total_errors += 1
if not it.future.done():
it.future.set_result(res)
if not sandbox_items:
return
# Acquire slots for the distinct trajectories in this batch.
try:
traj_ids = list({it.trajectory_id for it in sandbox_items})
slots = await asyncio.gather(*[self._get_or_acquire_slot(tid) for tid in traj_ids])
slot_by_traj = dict(zip(traj_ids, slots))
except Exception as e:
for it in sandbox_items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"Failed to acquire slot: {e}",
uniq_id=it.call.uniq_id,
)
)
return
# Group by timeout so we don't accidentally make short timeouts wait on long ones.
by_timeout: Dict[float, List[_QueuedToolRequest]] = {}
default_timeout = self.backend.default_timeout_s
for it in sandbox_items:
t = it.timeout_s
if t is None:
t = default_timeout
if t is None:
t = 30.0
by_timeout.setdefault(float(t), []).append(it)
for timeout_s, items in by_timeout.items():
requests = []
dispatched: List[_QueuedToolRequest] = []
for it in items:
slot = slot_by_traj[it.trajectory_id]
tool_name = it.call.name
args = dict(it.call.arguments)
# Hermes compatibility: treat `terminal` as an alias of sandbox `bash`.
if tool_name == "terminal":
if args.get("background"):
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error="terminal background execution is not supported in sandbox",
uniq_id=it.call.uniq_id,
)
)
continue
tool_name = "bash"
# `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
args.pop("timeout", None)
elif tool_name == "terminal_stateful":
tool_name = "bash_stateful"
args.pop("timeout", None)
elif tool_name == "tmux":
# `tmux` is a sandbox tool backed by the stateful session manager.
# Network policy is env-controlled.
args.pop("allow_network", None)
if tool_name == "bash":
# Network policy is set by the environment/executor, not by the model.
args.pop("allow_network", None)
args.pop("require_sandbox", None)
args["allow_network"] = bool(self.config.allow_network)
args["require_sandbox"] = bool(self.config.require_sandbox)
# `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
args.pop("timeout", None)
elif tool_name == "bash_stateful":
# Network policy is set by the environment/executor, not by the model.
args.pop("allow_network", None)
args.pop("require_sandbox", None)
args.pop("require_stateful_sandbox", None)
args["allow_network"] = bool(self.config.allow_network)
args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
args.pop("timeout", None)
elif tool_name == "tmux":
# Network policy applies to the underlying stateful session.
args.pop("allow_network", None)
args.pop("require_sandbox", None)
args.pop("require_stateful_sandbox", None)
args["allow_network"] = bool(self.config.allow_network)
args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
requests.append((slot, tool_name, args))
dispatched.append(it)
results = None
try:
if not dispatched:
continue
results = await self.backend.execute_batch(requests, timeout_s=timeout_s)
except Exception as e:
# Only fail requests actually dispatched; filtered items were already resolved.
for it in dispatched:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"Batch execution failed: {e}",
uniq_id=it.call.uniq_id,
)
)
continue
for it, res in zip(dispatched, results):
self.total_requests += 1
if not getattr(res, "success", False):
self.total_errors += 1
tool_result = res.to_tool_result()
tool_result.uniq_id = it.call.uniq_id
if not it.future.done():
it.future.set_result(tool_result)
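# --- Illustrative sketch (not part of the original file) ---
# The timeout normalization in execute(), modeled standalone: a model-
# suggested `timeout` argument is honored but clamped to [1, 600] seconds.
def _clamp_timeout(timeout_s, args):
    if timeout_s is None:
        raw = args.get("timeout")
        if isinstance(raw, (int, float)):
            timeout_s = float(raw)
    if timeout_s is not None:
        timeout_s = max(1.0, min(float(timeout_s), 600.0))
    return timeout_s

assert _clamp_timeout(None, {"timeout": 5000}) == 600.0
assert _clamp_timeout(None, {"timeout": 0}) == 1.0
assert _clamp_timeout(None, {}) is None
assert _clamp_timeout(42.0, {"timeout": 5}) == 42.0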

View File

@@ -1,88 +0,0 @@
"""
Toolset resolution for Hermes-Agent Atropos integration.
We primarily reuse Hermes-Agent toolsets (`toolsets.py`), but Atropos training/envs
need a few extra sandbox-oriented toolsets that Hermes doesn't expose by default
(e.g. filesystem + stateful terminal).
"""
from __future__ import annotations
from typing import Any, Dict, List, Optional, Set
import toolsets as hermes_toolsets
ATROPOS_TOOLSETS: Dict[str, Dict[str, Any]] = {
"filesystem": {
"description": "Read/write files in the sandbox workspace.",
"tools": ["read_file", "write_file"],
"includes": [],
},
"terminal_stateful": {
"description": "Stateful terminal execution (tmux/TUI support) inside the sandbox.",
"tools": ["terminal_stateful", "tmux"],
"includes": [],
},
"sandbox": {
"description": "Sandbox tools (terminal + filesystem).",
"tools": [],
"includes": ["terminal", "filesystem"],
},
"default": {
"description": "Default toolset for Atropos AgentEnv tasks.",
"tools": [],
"includes": ["sandbox"],
},
"full": {
"description": "All Hermes tools plus Atropos sandbox additions.",
"tools": [],
"includes": ["all", "filesystem", "sandbox", "terminal_stateful"],
},
}
def validate_toolset(name: str) -> bool:
if name in {"all", "*"}:
return True
return hermes_toolsets.validate_toolset(name) or name in ATROPOS_TOOLSETS
def resolve_toolset(name: str, visited: Optional[Set[str]] = None) -> List[str]:
if visited is None:
visited = set()
if name in {"all", "*"}:
# Union Hermes + Atropos toolsets.
all_tools: Set[str] = set()
for tname in hermes_toolsets.get_toolset_names():
all_tools.update(resolve_toolset(tname, visited=set()))
for tname, spec in ATROPOS_TOOLSETS.items():
# Avoid recursion: some Atropos toolsets (e.g. "full") include "all".
if tname == "full" or "all" in (spec.get("includes") or []):
continue
all_tools.update(resolve_toolset(tname, visited=set()))
return sorted(all_tools)
if name in ATROPOS_TOOLSETS:
if name in visited:
return []
visited.add(name)
spec = ATROPOS_TOOLSETS[name]
tools: Set[str] = set(spec.get("tools", []))
for inc in spec.get("includes", []):
tools.update(resolve_toolset(inc, visited=set(visited)))
return sorted(tools)
# Fall back to Hermes toolsets.
# IMPORTANT: do not pre-add `name` to `visited` here; Hermes' resolver uses
# `visited` for its own cycle detection and will treat the presence of `name`
# as a circular dependency.
return sorted(hermes_toolsets.resolve_toolset(name, visited=set(visited)))
def resolve_multiple_toolsets(names: List[str]) -> List[str]:
tools: Set[str] = set()
for name in names:
tools.update(resolve_toolset(name, visited=set()))
return sorted(tools)
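# --- Illustrative sketch (not part of the original file) ---
# Exercising the Atropos-only entries with a stubbed Hermes fallback, so
# the include expansion above can be checked without Hermes installed.
_FAKE_HERMES = {"terminal": ["terminal"]}  # pretend Hermes maps this toolset

def _resolve_demo(name: str) -> list[str]:
    if name in ATROPOS_TOOLSETS:
        spec = ATROPOS_TOOLSETS[name]
        tools = set(spec["tools"])
        for inc in spec["includes"]:
            tools.update(_resolve_demo(inc))
        return sorted(tools)
    return sorted(_FAKE_HERMES.get(name, []))

assert _resolve_demo("filesystem") == ["read_file", "write_file"]
assert _resolve_demo("sandbox") == ["read_file", "terminal", "write_file"]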

View File

@@ -1,415 +0,0 @@
#!/usr/bin/env python3
"""
Atropos-compatible Hermes agent runner.
This is a minimal subclass of Hermes-Agent's `AIAgent` that swaps the OpenAI
function-calling backend for Atroposlib's `ManagedServer`/`ServerManager` backend
and uses Hermes-style XML tool tags:
- <tool_call>{"name": "...", "arguments": {...}}</tool_call>
- <tool_response>{...}</tool_response>
Tool observations are appended as `role="user"` messages containing one or more
`<tool_response>` blocks so they survive common chat templates during tokenization.
"""
from __future__ import annotations
import asyncio
import json
import re
import time
import warnings
import os
from contextlib import asynccontextmanager
from typing import Any, AsyncGenerator, Dict, List, Optional, Tuple
from model_tools import cleanup_vm, handle_function_call
from run_agent import AIAgent
_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
ATROPOS_TOOL_SYSTEM_PROMPT = """You are a helpful AI assistant with access to tools.
## Available Tools
<tools>
{tool_descriptions}
</tools>
## How to Use Tools
To call a tool, output:
<tool_call>{{"name": "tool_name", "arguments": {{"arg1": "value1"}}}}</tool_call>
You may include optional reasoning in <think>...</think> before tool calls.
After each tool call, you will receive tool results as:
<tool_response>{{...}}</tool_response>
Continue until finished, then provide a final response with no <tool_call> blocks.
"""
class AtroposAIAgent(AIAgent):
"""
Hermes `AIAgent` variant that uses Atroposlib ServerManager/ManagedServer.
Notes:
- The default Hermes `AIAgent` remains unchanged; this class is opt-in.
- The underlying server must expose `managed_server(tokenizer=...)` OR be a single
APIServer-compatible object usable by Atroposlib's `ManagedServer`.
"""
def __init__(
self,
*,
server: Any,
tokenizer: Any = None,
model: str = "local",
max_iterations: int = 10,
tool_delay: float = 0.0,
enabled_toolsets: Optional[List[str]] = None,
disabled_toolsets: Optional[List[str]] = None,
save_trajectories: bool = False,
verbose_logging: bool = False,
quiet_mode: bool = False,
ephemeral_system_prompt: Optional[str] = None,
log_prefix_chars: int = 100,
log_prefix: str = "",
session_id: Optional[str] = None,
temperature: Optional[float] = None,
max_tokens: Optional[int] = None,
):
# Call parent init mainly to reuse tool selection + trajectory saving utilities.
super().__init__(
base_url="http://unused",
api_key="dummy-key",
model=model,
max_iterations=max_iterations,
tool_delay=tool_delay,
enabled_toolsets=enabled_toolsets,
disabled_toolsets=disabled_toolsets,
save_trajectories=save_trajectories,
verbose_logging=verbose_logging,
quiet_mode=quiet_mode,
ephemeral_system_prompt=ephemeral_system_prompt,
log_prefix_chars=log_prefix_chars,
log_prefix=log_prefix,
session_id=session_id,
)
self.server = server
self.tokenizer = tokenizer
self.temperature = temperature
self.max_tokens = max_tokens
@asynccontextmanager
async def _managed(self) -> AsyncGenerator[Any, None]:
if hasattr(self.server, "managed_server"):
with warnings.catch_warnings():
warnings.filterwarnings(
"ignore",
message=r"Using OpenAIServer with managed_server does not allow for state tracking",
category=UserWarning,
)
async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
yield managed
return
# Fall back to directly wrapping a single server object.
from atroposlib.envs.server_handling.managed_server import ManagedServer
managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
try:
yield managed
finally:
managed.reset()
def _tool_descriptions_text(self) -> str:
if not self.tools:
return "(no tools available)"
parts: List[str] = []
for tool in self.tools:
fn = (tool or {}).get("function", {})
name = fn.get("name", "")
desc = (fn.get("description") or "").strip()
if not name:
continue
if desc:
parts.append(f"- {name}: {desc}")
else:
parts.append(f"- {name}")
return "\n".join(parts) if parts else "(no tools available)"
def _build_system_prompt(self, system_message: Optional[str]) -> Optional[str]:
tool_prompt = ATROPOS_TOOL_SYSTEM_PROMPT.format(
tool_descriptions=self._tool_descriptions_text()
)
parts: List[str] = []
if system_message:
parts.append(system_message)
if self.ephemeral_system_prompt:
parts.append(self.ephemeral_system_prompt)
parts.append(tool_prompt)
return "\n\n".join(parts)
def _parse_tool_calls(self, content: str) -> Tuple[List[Tuple[str, Dict[str, Any]]], List[str]]:
"""
Returns:
(calls, errors)
"""
calls: List[Tuple[str, Dict[str, Any]]] = []
errors: List[str] = []
for raw in _TOOL_CALL_RE.findall(content or ""):
try:
payload = json.loads(raw)
except json.JSONDecodeError as exc:
errors.append(f"Invalid JSON inside <tool_call>: {exc}")
continue
name = payload.get("name")
args = payload.get("arguments", {})
if not isinstance(name, str) or not name:
errors.append("Tool call missing 'name' string")
continue
if not isinstance(args, dict):
errors.append("Tool call 'arguments' must be an object")
continue
calls.append((name, args))
return calls, errors
async def run_conversation_async(
self,
user_message: str,
system_message: Optional[str] = None,
conversation_history: Optional[List[Dict[str, Any]]] = None,
task_id: Optional[str] = None,
) -> Dict[str, Any]:
import uuid
effective_task_id = task_id or str(uuid.uuid4())
messages: List[Dict[str, Any]] = conversation_history.copy() if conversation_history else []
messages.append({"role": "user", "content": user_message})
active_system_prompt = self._build_system_prompt(system_message)
api_call_count = 0
final_response: Optional[str] = None
managed_state: Optional[Dict[str, Any]] = None
completed = False
try:
async with self._managed() as managed:
while api_call_count < self.max_iterations:
api_call_count += 1
api_messages = messages.copy()
if active_system_prompt:
api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages
chat_kwargs: Dict[str, Any] = {"messages": api_messages, "n": 1}
if self.max_tokens is not None:
chat_kwargs["max_tokens"] = self.max_tokens
if self.temperature is not None:
chat_kwargs["temperature"] = self.temperature
# Prefer OpenAI tool calling when supported by the backend:
# - Many providers normalize Hermes-style <tool_call> tags into tool_calls when `tools` is provided.
# - ManagedServer (atroposlib) does prompt->completion conversion and does not support `tools`.
# Only pass `tools` when we're calling an OpenAI-compatible chat endpoint directly.
tool_schemas = self.tools if self.tools else None
managed_cls = type(managed).__name__
if tool_schemas and managed_cls != "ManagedServer":
chat_kwargs["tools"] = tool_schemas
if os.getenv("HERMES_DEBUG_ATROPOS_REQUEST") == "1":
meta = {
"managed_type": managed_cls,
"model": getattr(getattr(managed, "config", None), "model_name", self.model),
"base_url": getattr(getattr(managed, "config", None), "base_url", None),
"kwargs": chat_kwargs,
}
# Avoid dumping megabytes of data accidentally.
# (Messages can be large; this is still "full" but bounded.)
print("\n=== HERMES_DEBUG_ATROPOS_REQUEST ===", flush=True)
print(json.dumps(meta, ensure_ascii=False, indent=2)[:200_000], flush=True)
response = await managed.chat_completion(**chat_kwargs)
if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
try:
dumped = response.model_dump() # openai pydantic model
except Exception:
dumped = getattr(response, "__dict__", {"repr": repr(response)})
print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: ChatCompletion (raw) ===", flush=True)
print(json.dumps(dumped, ensure_ascii=False, indent=2), flush=True)
if hasattr(managed, "get_state"):
managed_state = managed.get_state()
msg = response.choices[0].message
assistant_content = (msg.content or "")
msg_reasoning = getattr(msg, "reasoning", None)
# Use tool_calls if the backend provides them (preferred).
structured_tool_calls = getattr(msg, "tool_calls", None)
# If the backend emits content="" but includes useful text in reasoning,
# use it for parsing *only if needed* (e.g. tool tags).
if assistant_content == "" and isinstance(msg_reasoning, str) and msg_reasoning:
if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: message.reasoning present (content empty) ===", flush=True)
print(msg_reasoning, flush=True)
assistant_msg: Dict[str, Any] = {"role": "assistant", "content": assistant_content}
if structured_tool_calls:
# Preserve tool_calls so the next request is consistent with OpenAI protocol.
try:
assistant_msg["tool_calls"] = [
{
"id": tc.id,
"type": tc.type,
"function": {"name": tc.function.name, "arguments": tc.function.arguments},
}
for tc in structured_tool_calls
]
except Exception:
# Best-effort; keep conversation moving.
pass
messages.append(assistant_msg)
# Mode A: OpenAI tool calling (preferred when supported)
if structured_tool_calls:
for tc in structured_tool_calls:
tool_start = time.time()
try:
tool_args = json.loads(tc.function.arguments or "{}")
except Exception:
tool_args = {}
tool_result = handle_function_call(tc.function.name, tool_args, effective_task_id)
tool_duration = time.time() - tool_start
# Keep the raw tool result as tool content (OpenAI protocol expects role=tool).
messages.append(
{
"role": "tool",
"tool_call_id": tc.id,
"content": tool_result,
}
)
if self.tool_delay and self.tool_delay > 0:
await asyncio.sleep(self.tool_delay)
# Continue loop after tool execution.
continue
# Mode B: Hermes XML tool tags in assistant text (fallback).
parse_source = assistant_content or (msg_reasoning or "")
tool_calls, parse_errors = self._parse_tool_calls(parse_source)
if parse_errors and not tool_calls:
# Ask the model to retry with valid tool JSON.
err_text = "; ".join(parse_errors[:3])
messages.append(
{
"role": "user",
"content": (
f"<tool_response>{json.dumps({'error': err_text}, ensure_ascii=False)}</tool_response>\n"
"The previous <tool_call> blocks were invalid. Please output valid JSON inside <tool_call>."
),
}
)
continue
if not tool_calls:
# No tool calls: treat as final answer.
final_response = (assistant_content or "").strip()
completed = True
break
tool_responses: List[str] = []
for tool_name, tool_args in tool_calls:
tool_start = time.time()
tool_result = handle_function_call(tool_name, tool_args, effective_task_id)
tool_duration = time.time() - tool_start
try:
parsed = json.loads(tool_result)
payload: Any = parsed
except Exception:
payload = tool_result
tool_payload = {
"name": tool_name,
"duration_s": round(tool_duration, 3),
"result": payload,
}
tool_responses.append(
f"<tool_response>{json.dumps(tool_payload, ensure_ascii=False)}</tool_response>"
)
if self.tool_delay and self.tool_delay > 0:
await asyncio.sleep(self.tool_delay)
messages.append({"role": "user", "content": "\n".join(tool_responses)})
if final_response is None:
final_response = "I've reached the maximum number of iterations."
finally:
try:
cleanup_vm(effective_task_id)
except Exception:
pass
# Save trajectory using Hermes formatting (optional).
self._save_trajectory(messages, user_message, completed=completed)
return {
"final_response": final_response,
"messages": messages,
"api_calls": api_call_count,
"completed": completed,
"managed_state": managed_state,
"system_prompt": active_system_prompt,
"task_id": effective_task_id,
}
def run_conversation(self, *args: Any, **kwargs: Any) -> Dict[str, Any]:
"""
Sync wrapper for convenience.
If called from within a running event loop (e.g. prompt_toolkit), this
runs the async conversation in a dedicated thread to avoid nested loops.
"""
try:
asyncio.get_running_loop()
except RuntimeError:
return asyncio.run(self.run_conversation_async(*args, **kwargs))
import queue
import threading
out: "queue.Queue[object]" = queue.Queue(maxsize=1)
def runner() -> None:
try:
out.put(asyncio.run(self.run_conversation_async(*args, **kwargs)))
except BaseException as exc: # noqa: BLE001
out.put(exc)
thread = threading.Thread(target=runner, daemon=True)
thread.start()
result = out.get()
if isinstance(result, BaseException):
raise result
return result # type: ignore[return-value]

File diff suppressed because it is too large

View File

@@ -1,235 +0,0 @@
# Hermes Agent CLI Configuration
# Copy this file to cli-config.yaml and customize as needed.
# This file configures the CLI behavior. Environment variables in .env take precedence.
# =============================================================================
# Model Configuration
# =============================================================================
model:
  # Default model to use (can be overridden with --model flag)
  default: "anthropic/claude-sonnet-4"
  # API configuration (falls back to OPENROUTER_API_KEY env var)
  # api_key: "your-key-here"  # Uncomment to set here instead of .env
  base_url: "https://openrouter.ai/api/v1"
# =============================================================================
# Terminal Tool Configuration
# =============================================================================
# Choose ONE of the following terminal configurations by uncommenting it.
# The terminal tool executes commands in the specified environment.
# -----------------------------------------------------------------------------
# OPTION 1: Local execution (default)
# Commands run directly on your machine in the current directory
# -----------------------------------------------------------------------------
terminal:
  env_type: "local"
  cwd: "."  # Use "." for current directory, or specify absolute path
  timeout: 180
  lifetime_seconds: 300
  # sudo_password: ""  # Enable sudo commands (pipes via sudo -S) - SECURITY WARNING: plaintext!
# -----------------------------------------------------------------------------
# OPTION 2: SSH remote execution
# Commands run on a remote server - agent code stays local (sandboxed)
# Great for: keeping agent isolated from its own code, using powerful remote hardware
# -----------------------------------------------------------------------------
# terminal:
# env_type: "ssh"
# cwd: "/home/myuser/project"
# timeout: 180
# lifetime_seconds: 300
# ssh_host: "my-server.example.com"
# ssh_user: "myuser"
# ssh_port: 22
# ssh_key: "~/.ssh/id_rsa" # Optional - uses ssh-agent if not specified
# -----------------------------------------------------------------------------
# OPTION 3: Docker container
# Commands run in an isolated Docker container
# Great for: reproducible environments, testing, isolation
# -----------------------------------------------------------------------------
# terminal:
# env_type: "docker"
# cwd: "/workspace"
# timeout: 180
# lifetime_seconds: 300
# docker_image: "python:3.11"
# -----------------------------------------------------------------------------
# OPTION 4: Singularity/Apptainer container
# Commands run in a Singularity container (common in HPC environments)
# Great for: HPC clusters, shared compute environments
# -----------------------------------------------------------------------------
# terminal:
# env_type: "singularity"
# cwd: "/workspace"
# timeout: 180
# lifetime_seconds: 300
# singularity_image: "docker://python:3.11"
# -----------------------------------------------------------------------------
# OPTION 5: Modal cloud execution
# Commands run on Modal's cloud infrastructure
# Great for: GPU access, scalable compute, serverless execution
# -----------------------------------------------------------------------------
# terminal:
# env_type: "modal"
# cwd: "/workspace"
# timeout: 180
# lifetime_seconds: 300
# modal_image: "python:3.11"
# -----------------------------------------------------------------------------
# SUDO SUPPORT (works with ALL backends above)
# -----------------------------------------------------------------------------
# Add sudo_password to any terminal config above to enable sudo commands.
# The password is piped via `sudo -S`. Works with local, ssh, docker, etc.
#
# SECURITY WARNING: Password stored in plaintext!
#
# INTERACTIVE PROMPT: If no sudo_password is set and the CLI is running,
# you'll be prompted to enter your password when sudo is needed:
# - 45-second timeout (auto-skips if no input)
# - Press Enter to skip (command fails gracefully)
# - Password is hidden while typing
# - Password is cached for the session
#
# ALTERNATIVES:
# - SSH backend: Configure passwordless sudo on the remote server
# - Containers: Run as root inside the container (no sudo needed)
# - Local: Configure /etc/sudoers for specific commands
#
# Example (add to your terminal section):
# sudo_password: "your-password-here"
# =============================================================================
# Browser Tool Configuration
# =============================================================================
browser:
  # Inactivity timeout in seconds - browser sessions are automatically closed
  # after this period of no activity between agent loops (default: 120 = 2 minutes)
  inactivity_timeout: 120
# =============================================================================
# Agent Behavior
# =============================================================================
agent:
  # Maximum conversation turns before stopping
  max_turns: 20
  # Enable verbose logging
  verbose: false
  # Custom system prompt (personality, instructions, etc.)
  # Leave empty or remove to use default agent behavior
  system_prompt: ""
  # Predefined personalities (use with /personality command)
  personalities:
    helpful: "You are a helpful, friendly AI assistant."
    concise: "You are a concise assistant. Keep responses brief and to the point."
    technical: "You are a technical expert. Provide detailed, accurate technical information."
    creative: "You are a creative assistant. Think outside the box and offer innovative solutions."
    teacher: "You are a patient teacher. Explain concepts clearly with examples."
    kawaii: "You are a kawaii assistant! Use cute expressions like (◕‿◕), ★, ♪, and ~! Add sparkles and be super enthusiastic about everything! Every response should feel warm and adorable desu~! ヽ(>∀<☆)"
    catgirl: "You are Neko-chan, an anime catgirl AI assistant, nya~! Add 'nya' and cat-like expressions to your speech. Use kaomoji like (=^・ω・^=) and ฅ^•ﻌ•^ฅ. Be playful and curious like a cat, nya~!"
    pirate: "Arrr! Ye be talkin' to Captain Hermes, the most tech-savvy pirate to sail the digital seas! Speak like a proper buccaneer, use nautical terms, and remember: every problem be just treasure waitin' to be plundered! Yo ho ho!"
    shakespeare: "Hark! Thou speakest with an assistant most versed in the bardic arts. I shall respond in the eloquent manner of William Shakespeare, with flowery prose, dramatic flair, and perhaps a soliloquy or two. What light through yonder terminal breaks?"
    surfer: "Duuude! You're chatting with the chillest AI on the web, bro! Everything's gonna be totally rad. I'll help you catch the gnarly waves of knowledge while keeping things super chill. Cowabunga! 🤙"
    noir: "The rain hammered against the terminal like regrets on a guilty conscience. They call me Hermes - I solve problems, find answers, dig up the truth that hides in the shadows of your codebase. In this city of silicon and secrets, everyone's got something to hide. What's your story, pal?"
    uwu: "hewwo! i'm your fwiendwy assistant uwu~ i wiww twy my best to hewp you! *nuzzles your code* OwO what's this? wet me take a wook! i pwomise to be vewy hewpful >w<"
    philosopher: "Greetings, seeker of wisdom. I am an assistant who contemplates the deeper meaning behind every query. Let us examine not just the 'how' but the 'why' of your questions. Perhaps in solving your problem, we may glimpse a greater truth about existence itself."
    hype: "YOOO LET'S GOOOO!!! 🔥🔥🔥 I am SO PUMPED to help you today! Every question is AMAZING and we're gonna CRUSH IT together! This is gonna be LEGENDARY! ARE YOU READY?! LET'S DO THIS! 💪😤🚀"
# =============================================================================
# Toolsets
# =============================================================================
# Control which tools the agent has access to.
# Use "all" to enable everything, or specify individual toolsets.
# Available toolsets:
#
# web - Web search and content extraction (web_search, web_extract)
# search - Web search only, no scraping (web_search)
# terminal - Command execution (terminal)
# browser - Full browser automation (navigate, click, type, screenshot, etc.)
# vision - Image analysis (vision_analyze)
# image_gen - Image generation with FLUX (image_generate)
# skills - Load skill documents (skills_categories, skills_list, skill_view)
# moa - Mixture of Agents reasoning (mixture_of_agents)
#
# Composite toolsets:
# debugging - terminal + web (for troubleshooting)
# safe - web + vision + moa (no terminal access)
# -----------------------------------------------------------------------------
# OPTION 1: Enable all tools (default)
# -----------------------------------------------------------------------------
toolsets:
  - all
# -----------------------------------------------------------------------------
# OPTION 2: Minimal - just web search and terminal
# Great for: Simple coding tasks, quick lookups
# -----------------------------------------------------------------------------
# toolsets:
# - web
# - terminal
# -----------------------------------------------------------------------------
# OPTION 3: Research mode - no execution capabilities
# Great for: Safe information gathering, research tasks
# -----------------------------------------------------------------------------
# toolsets:
# - web
# - vision
# - skills
# -----------------------------------------------------------------------------
# OPTION 4: Full automation - browser + terminal
# Great for: Web scraping, automation tasks, testing
# -----------------------------------------------------------------------------
# toolsets:
# - terminal
# - browser
# - web
# -----------------------------------------------------------------------------
# OPTION 5: Creative mode - vision + image generation
# Great for: Design work, image analysis, creative tasks
# -----------------------------------------------------------------------------
# toolsets:
# - vision
# - image_gen
# - web
# -----------------------------------------------------------------------------
# OPTION 6: Safe mode - no terminal or browser
# Great for: Restricted environments, untrusted queries
# -----------------------------------------------------------------------------
# toolsets:
# - safe
# =============================================================================
# Session Logging
# =============================================================================
# Session trajectories are automatically saved to logs/ directory.
# Each session creates: logs/session_YYYYMMDD_HHMMSS_UUID.json
#
# The session ID is displayed in the welcome banner for easy reference.
# Logs contain full conversation history in trajectory format:
# - System prompt, user messages, assistant responses
# - Tool calls with inputs/outputs
# - Timestamps for debugging
#
# No configuration needed - logging is always enabled.
# To disable, you would need to modify the source code.
# =============================================================================
# Display
# =============================================================================
display:
  # Use compact banner mode
  compact: false

1256
cli.py

File diff suppressed because it is too large

View File

@@ -1,42 +0,0 @@
#!/bin/bash
# Browser-focused data generation run
# Uses browser-use-tasks.jsonl (6504 tasks)
# Distribution: browser 97%, web 20%, vision 12%, terminal 15%
# Create logs directory if it doesn't exist
mkdir -p logs
# Generate log filename with timestamp
LOG_FILE="logs/browser_tasks_$(date +%Y%m%d_%H%M%S).log"
echo "📝 Logging output to: $LOG_FILE"
echo "🌐 Running browser-focused tasks with browser_tasks distribution"
python batch_runner.py \
--dataset_file="browser-use-tasks.jsonl" \
--batch_size=20 \
--run_name="browser_tasks" \
--distribution="browser_tasks" \
--model="moonshotai/kimi-k2.5" \
--verbose \
--base_url="https://openrouter.ai/api/v1" \
--num_workers=50 \
--max_turns=60 \
--resume \
--ephemeral_system_prompt="You are an AI assistant with browser automation capabilities. Your primary task is to navigate and interact with web pages to accomplish user goals.
IMPORTANT GUIDELINES:
1. SEARCHING: Do NOT try to search directly on Google or other search engines via the browser - they block automated searches. Instead, ALWAYS use the web_search tool first to find URLs for any pages you need to visit, then use browser tools to navigate to those URLs.
2. COOKIE/PRIVACY DIALOGS: After navigating to a page, ALWAYS check if there are cookie consent dialogs, privacy popups, or overlay modals blocking the page. These appear in snapshots as 'dialog' elements with buttons like 'Close', 'Accept', 'Accept All', 'Decline', 'I Agree', 'Got it', 'OK', or 'X'. You MUST dismiss these dialogs FIRST by clicking the appropriate button before trying to interact with other page elements. After dismissing a dialog, take a fresh browser_snapshot to get updated element references.
3. HANDLING TIMEOUTS: If an action times out, it often means the element is blocked by an overlay or the page state has changed. Take a new snapshot to see the current page state and look for any dialogs or popups that need to be dismissed. If there is no dialog box to bypass, then try a new method or report the error to the user and complete the task.
4. GENERAL: Use browser tools to click elements, fill forms, extract information, and perform web-based tasks. If terminal is available, use it for any local file operations or computations needed to support your web tasks. Be thorough in verifying your actions and handle any errors gracefully by retrying or trying alternative approaches." \
2>&1 | tee "$LOG_FILE"
echo "✅ Log saved to: $LOG_FILE"
# --providers_allowed="gmicloud,siliconflow,atlas-cloud,z-ai,novita" \

View File

@@ -1,26 +0,0 @@
#!/bin/bash
# Create logs directory if it doesn't exist
mkdir -p logs
# Generate a timestamp for the log file
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOG_FILE="logs/imagen_eval_gpt5_${TIMESTAMP}.log"
echo "📝 Logging output to: $LOG_FILE"
python batch_runner.py \
--dataset_file="source-data/hermes-agent-imagen-data/hermes_agent_imagen_train_sft.jsonl" \
--batch_size=20 \
--run_name="imagen_train_sft_glm4.7" \
--distribution="image_gen" \
--model="z-ai/glm-4.7" \
--base_url="https://openrouter.ai/api/v1" \
--providers_allowed="gmicloud,siliconflow,atlas-cloud,z-ai,novita" \
--num_workers=50 \
--max_turns=25 \
--ephemeral_system_prompt="When generating an image for the user view the image by using the vision_analyze tool to ensure it is what the user wanted. If it isn't feel free to retry a few times. If none are perfect, choose the best option that is the closest match, and explain its imperfections. If the image generation tool fails, try again a few times. If the vision analyze tool fails, provide the image to the user and explain it is your best effort attempt." \
2>&1 | tee "$LOG_FILE"
echo "✅ Log saved to: $LOG_FILE"
# --verbose \

View File

@@ -1,27 +0,0 @@
#!/bin/bash
# Create logs directory if it doesn't exist
mkdir -p logs
# Generate log filename with timestamp
LOG_FILE="logs/glm4.7-thinking-sft1-10k_$(date +%Y%m%d_%H%M%S).log"
echo "📝 Logging output to: $LOG_FILE"
python batch_runner.py \
--dataset_file="source-data/hermes-agent-megascience-data/hermes_agent_megascience_sft_train_1_10k.jsonl" \
--batch_size=20 \
--run_name="megascience_glm4.7-thinking-sft1" \
--distribution="science" \
--model="z-ai/glm-4.7" \
--base_url="https://openrouter.ai/api/v1" \
--providers_allowed="gmicloud,siliconflow,atlas-cloud,z-ai,novita" \
--num_workers=50 \
--max_turns=60 \
--resume \
--ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build off of the results of prior tools you've used for furthering results. Always use the terminal or search tool if it can provide additional context, verify formulas, double check concepts and recent studies and understanding, doing all calculations, etc. You should only be confident in your own reasoning, knowledge, or calculations if you've exhaustively used all tools available to you to that can help you verify or validate your work. Always pip install any packages you need to use the python scripts you want to run. If you need to use a tool that isn't available, you can use the terminal tool to install or create it in many cases as well. Do not use the terminal tool to communicate with the user, as they cannot see your commands, only your final response after completing the task. Search for at least 3 sources, but not more than 12, so you can maintain a focused context." \
2>&1 | tee "$LOG_FILE"
echo "✅ Log saved to: $LOG_FILE"
# --verbose \

View File

@@ -1,28 +0,0 @@
#!/bin/bash
# Create logs directory if it doesn't exist
mkdir -p logs
# Generate log filename with timestamp
LOG_FILE="logs/glm4.7-terminal-tasks_$(date +%Y%m%d_%H%M%S).log"
echo "📝 Logging output to: $LOG_FILE"
python batch_runner.py \
--dataset_file="source-data/raw_tasks_prompts.jsonl" \
--batch_size=20 \
--run_name="terminal-tasks-glm4.7-thinking" \
--distribution="default" \
--model="z-ai/glm-4.7" \
--base_url="https://openrouter.ai/api/v1" \
--providers_allowed="gmicloud,siliconflow,atlas-cloud,z-ai,novita" \
--num_workers=50 \
--max_turns=60 \
--ephemeral_system_prompt="You have access to a variety of tools to help you complete coding, system administration, and general computing tasks. You can use them in sequence and build off of the results of prior tools you've used. Always use the terminal tool to execute commands, write code, install packages, and verify your work. You should test and validate everything you create. Always pip install any packages you need (use --break-system-packages if needed). If you need a tool that isn't available, you can use the terminal to install or create it. Do not use the terminal tool to communicate with the user, as they cannot see your commands, only your final response after completing the task. Use web search when you need to look up documentation, APIs, or current best practices." \
2>&1 | tee "$LOG_FILE"
echo "✅ Log saved to: $LOG_FILE"
# --verbose \
# --resume \

View File

@@ -1,29 +0,0 @@
#!/bin/bash
# Create logs directory if it doesn't exist
mkdir -p logs
# Generate log filename with timestamp
LOG_FILE="logs/glm4.7-terminal-tasks-newterm_$(date +%Y%m%d_%H%M%S).log"
echo "📝 Logging output to: $LOG_FILE"
python batch_runner.py \
--dataset_file="source-data/hermes-agent-agent-tasks-1/agent_tasks_eval.jsonl" \
--batch_size=1 \
--run_name="terminal-tasks-test-newterm" \
--distribution="terminal_only" \
--verbose \
--model="z-ai/glm-4.7" \
--base_url="https://openrouter.ai/api/v1" \
--providers_allowed="gmicloud,siliconflow,atlas-cloud,z-ai,novita" \
--num_workers=5 \
--max_turns=60 \
--ephemeral_system_prompt="You have access to a variety of tools to help you complete coding, system administration, and general computing tasks. You can use them in sequence and build off of the results of prior tools you've used. Always use the terminal tool to execute commands, write code, install packages, and verify your work. You should test and validate everything you create. Always pip install any packages you need (use --break-system-packages if needed). If you need a tool that isn't available, you can use the terminal to install or create it. Do not use the terminal tool to communicate with the user, as they cannot see your commands, only your final response after completing the task. Use web search when you need to look up documentation, APIs, or current best practices." \
2>&1 | tee "$LOG_FILE"
echo "✅ Log saved to: $LOG_FILE"
# --verbose \
# --resume \

View File

@@ -1,33 +0,0 @@
#!/bin/bash
# Terminal-only evaluation run using Modal sandboxes
# Uses 10 sample tasks from nous-terminal-tasks
# Create logs directory if it doesn't exist
mkdir -p logs
# Generate log filename with timestamp
LOG_FILE="logs/terminal_eval_$(date +%Y%m%d_%H%M%S).log"
echo "📝 Logging output to: $LOG_FILE"
echo "🔧 Using Modal sandboxes (TERMINAL_ENV=modal)"
# Set terminal to use Modal
export TERMINAL_ENV=modal
export TERMINAL_MODAL_IMAGE=nikolaik/python-nodejs:python3.11-nodejs20
export TERMINAL_TIMEOUT=300
python batch_runner.py \
--dataset_file="nous-terminal-tasks_eval.jsonl" \
--batch_size=5 \
--run_name="terminal_eval" \
--distribution="terminal_only" \
--model="z-ai/glm-4.7" \
--base_url="https://openrouter.ai/api/v1" \
--providers_allowed="gmicloud,siliconflow,atlas-cloud,z-ai,novita" \
--num_workers=2 \
--max_turns=30 \
--ephemeral_system_prompt="You have access to a terminal tool for executing commands. Use it to complete the task. Install any packages you need with apt-get or pip (use --break-system-packages if needed). Do not use interactive tools (vim, nano, python repl). If git output is large, pipe to cat." \
2>&1 | tee "$LOG_FILE"
echo "✅ Log saved to: $LOG_FILE"

View File

@@ -1,46 +0,0 @@
#!/bin/bash
# Mixed browser+terminal data generation run
# Uses mixed-browser-terminal-tasks.jsonl (200 tasks)
# Distribution: browser 92%, terminal 92%, web 35%, vision 15%, image_gen 15%
# Create logs directory if it doesn't exist
mkdir -p logs
# Generate log filename with timestamp
LOG_FILE="logs/mixed_tasks_$(date +%Y%m%d_%H%M%S).log"
echo "📝 Logging output to: $LOG_FILE"
echo "🔀 Running mixed browser+terminal tasks with mixed_tasks distribution"
# Set terminal environment
# SIF images are automatically built/cached by terminal_tool.py
export TERMINAL_ENV=singularity
export TERMINAL_SINGULARITY_IMAGE="docker://nikolaik/python-nodejs:python3.11-nodejs20"
export TERMINAL_TIMEOUT=300
# Set up Apptainer cache directories (use /scratch if available, otherwise /tmp)
if [ -d "/scratch" ] && [ -w "/scratch" ]; then
CACHE_BASE="/scratch/$USER/.apptainer"
else
CACHE_BASE="/tmp/$USER/.apptainer"
fi
export APPTAINER_CACHEDIR="$CACHE_BASE"
export APPTAINER_TMPDIR="$CACHE_BASE/tmp"
mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
echo "📁 Apptainer cache: $APPTAINER_CACHEDIR"
python batch_runner.py \
--dataset_file="mixed-browser-terminal-tasks.jsonl" \
--batch_size=20 \
--run_name="mixed_tasks" \
--distribution="mixed_tasks" \
--model="moonshotai/kimi-k2.5" \
--base_url="https://openrouter.ai/api/v1" \
--num_workers=25 \
--max_turns=60 \
--ephemeral_system_prompt="You are an AI assistant capable of both browser automation and terminal operations. Use browser tools to navigate websites, interact with web pages, fill forms, and extract information. Use terminal tools to execute commands, write and run code, install packages (use --break-system-packages with pip if needed), and perform local computations. When web search is available, use it to find URLs, documentation, or current information. If vision is available, use it to analyze images or screenshots. If image generation is available, use it when the task requires creating images. Combine browser and terminal capabilities effectively - for example, you might use the browser to fetch data from a website and terminal to process or analyze it. Always verify your work and handle errors gracefully. Whenever you can do something in a terminal instead of a web browser, you should choose to do so, as it's much cheaper." \
2>&1 | tee "$LOG_FILE"
echo "✅ Log saved to: $LOG_FILE"

View File

@@ -1,50 +0,0 @@
#!/bin/bash
# Terminal-focused data generation run
# Uses nous-terminal-tasks.jsonl (597 tasks)
# Distribution: terminal 97%, web 15%, browser 0%, vision 8%, image_gen 3%
# Create logs directory if it doesn't exist
mkdir -p logs
# Generate log filename with timestamp
LOG_FILE="logs/terminal_tasks_$(date +%Y%m%d_%H%M%S).log"
echo "📝 Logging output to: $LOG_FILE"
echo "💻 Running terminal-focused tasks with terminal_tasks distribution"
# Set terminal environment
# SIF images are automatically built/cached by terminal_tool.py
export TERMINAL_ENV=singularity
export TERMINAL_SINGULARITY_IMAGE="docker://nikolaik/python-nodejs:python3.11-nodejs20"
export TERMINAL_TIMEOUT=300
# Set up Apptainer cache directories (use /scratch if available, otherwise /tmp)
if [ -d "/scratch" ] && [ -w "/scratch" ]; then
CACHE_BASE="/scratch/$USER/.apptainer"
else
CACHE_BASE="/tmp/$USER/.apptainer"
fi
export APPTAINER_CACHEDIR="$CACHE_BASE"
export APPTAINER_TMPDIR="$CACHE_BASE/tmp"
mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
echo "📁 Apptainer cache: $APPTAINER_CACHEDIR"
echo "🐳 Image: $TERMINAL_SINGULARITY_IMAGE (auto-converted to SIF on first use)"
python batch_runner.py \
--dataset_file="nous-terminal-tasks.jsonl" \
--batch_size=5 \
--run_name="terminal_tasks-kimi-k2.5" \
--distribution="terminal_tasks" \
--model="moonshotai/kimi-k2.5" \
--verbose \
--base_url="https://openrouter.ai/api/v1" \
--num_workers=80 \
--max_turns=60 \
--providers_ignored="Novita" \
--resume \
--ephemeral_system_prompt="You have access to a terminal tool for executing commands and completing coding, system administration, and computing tasks. Use the terminal to write code, run scripts, install packages (use --break-system-packages with pip if needed), manipulate files, and verify your work. Always test and validate code you create. Do not use interactive tools like vim, nano, or python REPL. If git output is large, pipe to cat. When web search is available, use it to look up documentation, APIs, or best practices. If browser tools are available, use them for web interactions that require page manipulation. Do not use the terminal to communicate with the user - only your final response will be shown to them." \
2>&1 | tee "$LOG_FILE"
echo "✅ Log saved to: $LOG_FILE"

View File

@@ -1,21 +0,0 @@
#!/bin/bash
# Test skills tool with Kimi K2.5
# Usage: ./configs/test_skills_kimi.sh "your query here"
# Example: ./configs/test_skills_kimi.sh "List available skills and show me the vllm skill"
# Default query if none provided
QUERY="${1:-List all available skills. Then show me the axolotl skill and view one of its reference files.}"
echo "🎯 Testing Skills Tool with Kimi K2.5"
echo "📝 Query: $QUERY"
echo "="
python run_agent.py \
--enabled_toolsets=skills \
--model="moonshotai/kimi-k2.5" \
--base_url="https://openrouter.ai/api/v1" \
--max_turns=10 \
--verbose \
--save_sample \
--query="$QUERY"

View File

@@ -1,101 +0,0 @@
# Trajectory Compression Configuration
#
# Post-processes completed agent trajectories to fit within a target token budget.
# Compression preserves head/tail turns and summarizes middle content only as needed.
# Tokenizer settings for accurate token counting
tokenizer:
  # HuggingFace tokenizer name
  name: "moonshotai/Kimi-K2-Thinking"
  # Trust remote code (required for some tokenizers)
  trust_remote_code: true

# Compression targets and behavior
compression:
  # Target maximum tokens for compressed trajectory
  target_max_tokens: 29000
  # Target size for summary (in tokens)
  # This is factored into calculations when determining what to compress
  summary_target_tokens: 750

# Protected turns that should NEVER be compressed
protected_turns:
  # Always protect the first system message (tool definitions)
  first_system: true
  # Always protect the first human message (original request)
  first_human: true
  # Always protect the first gpt message (initial response/tool_call)
  first_gpt: true
  # Always protect the first tool response (result of first action)
  first_tool: true
  # Always protect the last 2 complete turn pairs (gpt+tool or gpt only)
  # This ensures the model's final actions and conclusions are preserved
  last_n_turns: 4

# LLM settings for generating summaries (OpenRouter only)
summarization:
  # Model to use for summarization (should be fast and cheap)
  # Using OpenRouter model path format
  model: "google/gemini-3-flash-preview"
  # OpenRouter API settings
  base_url: "https://openrouter.ai/api/v1"
  # Environment variable containing OpenRouter API key
  api_key_env: "OPENROUTER_API_KEY"
  # Temperature for summarization (lower = more deterministic)
  temperature: 0.3
  # Max retries for API failures
  max_retries: 3
  # Delay between retries (seconds)
  retry_delay: 2

# Output settings
output:
  # Add notice to system message about potential summarization
  add_summary_notice: true
  # Text to append to system message
  summary_notice_text: "\n\nSome of the conversation may be summarized to preserve context."
  # Output directory suffix (appended to input directory name)
  output_suffix: "_compressed"

# Processing settings
processing:
  # Number of parallel workers for batch processing
  num_workers: 4
  # Maximum concurrent API calls for summarization (async parallelism)
  max_concurrent_requests: 50
  # Skip trajectories that are already under target length
  skip_under_target: true
  # If true, save trajectories even if compression can't get under target
  # (will compress as much as possible)
  save_over_limit: true
  # Timeout per trajectory in seconds (skip if takes longer)
  # Helps avoid hanging on problematic entries
  per_trajectory_timeout: 300  # 5 minutes

# Metrics to track
metrics:
  # Log detailed compression statistics
  enabled: true
  # Save per-trajectory metrics in output
  per_trajectory: false
  # Metrics file name (saved in output directory)
  output_file: "compression_metrics.json"

View File

@@ -1,224 +0,0 @@
# Modal Backend
Hermes Agent uses [Modal](https://modal.com) for scalable, isolated cloud execution environments. There are two Modal integrations:
1. **Terminal Tool** (`tools/terminal_tool.py`) - For CLI/agent command execution
2. **Atropos Backend** (`atropos/backends/modal_backend.py`) - For batch RL training workloads
---
## Terminal Tool (CLI/Agent)
The terminal tool provides a simple interface for executing commands in Modal sandboxes.
### Configuration
Set environment variables:
```bash
export TERMINAL_ENV=modal
export TERMINAL_MODAL_IMAGE=python:3.11
export TERMINAL_MODAL_APP_NAME=hermes-sandbox
```
Or use a YAML config file (`modal_profiles.yaml`):
```yaml
profiles:
  default:
    image: python:3.11
    cpu: 1.0
    memory: 2048
    min_pool: 1
    max_pool: 5
    idle_timeout: 120
  gpu:
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    gpu: T4
    memory: 16384
    min_pool: 0
    max_pool: 2
```
### Features
| Feature | Description |
|---------|-------------|
| **Sandbox Pool** | Pre-warmed sandboxes for low latency |
| **Auto-scaling** | Grows/shrinks pool based on demand |
| **Idle Timeout** | Sandboxes auto-terminate when unused |
| **Profile Selection** | Different configs for different workloads |
| **Credential Injection** | `modal.Secret` integration |
### Usage
```python
from tools.terminal_tool import terminal_tool
# Simple command
output = terminal_tool("echo hello", task_id="my-task")
# With profile selection
output = terminal_tool("python train.py", task_id="training", profile="gpu")
# Cleanup when done
from tools.terminal_tool import cleanup_vm
cleanup_vm("my-task")
```
### Architecture
```
_ModalPoolManager (singleton)
├── "default" pool → [sandbox-0, sandbox-1, ...]
└── "gpu" pool → [sandbox-0, ...]
Each pool:
- Maintains min_pool warm sandboxes
- Scales up to max_pool on demand
- Background thread scales down idle sandboxes
```
---
## Atropos Backend (RL Training)
The Atropos backend is designed for high-throughput batch execution during reinforcement learning training.
### Key Concept: Slot-based Multiplexing
Instead of one sandbox per trajectory, multiple trajectories share sandboxes via **slots**:
```
Sandbox (1 container)
├── Slot 0 → Trajectory A (workspace: /data/slot_0)
├── Slot 1 → Trajectory B (workspace: /data/slot_1)
└── Slot 2 → Trajectory C (workspace: /data/slot_2)
```
**Benefits**:
- Fewer containers = lower cost
- Shared warm-up time
- Better GPU utilization
### Configuration
```python
from atropos.backends.modal_backend import ModalSandboxConfig, ModalToolBackend
config = ModalSandboxConfig(
    name="default",
    image="python:3.11",
    cpu=1.0,
    memory=2048,
    slots_per_sandbox=10,  # 10 trajectories per container
    min_sandboxes=1,
    max_sandboxes=5,
)

backend = ModalToolBackend(config.with_app_name("my-training"))
```
### Multi-Profile Support
Different trajectory types can request different resources:
```python
backend = ModalToolBackend.with_profiles(
    app_name="rl-training",
    profiles={
        "default": ModalSandboxConfig(
            name="default",
            cpu=1.0,
            memory=2048,
        ),
        "pytorch-gpu": ModalSandboxConfig(
            name="pytorch-gpu",
            image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
            gpu="T4",
            memory=16384,
        ),
    }
)

# CPU task
slot1 = await backend.acquire("traj-1", profile="default")

# GPU task
slot2 = await backend.acquire("traj-2", profile="pytorch-gpu")
```
### Batched Execution
The key optimization - execute many commands in parallel:
```python
# Acquire slots for multiple trajectories
slots = [await backend.acquire(f"traj-{i}") for i in range(50)]

# Execute batch across all slots in parallel
results = await backend.execute_batch([
    (slot, "bash", {"command": "python step.py"})
    for slot in slots
])

# Release slots
for slot in slots:
    await backend.release(slot)
```
### Architecture
```
ModalToolBackend
└── _ModalMultiProfileManager
├── "default" → _ModalSandboxPool
│ ├── Sandbox 0 (slots 0-9)
│ └── Sandbox 1 (slots 0-9)
└── "pytorch-gpu" → _ModalSandboxPool
└── Sandbox 0 (slots 0-9)
```
---
## Credentials
Inject secrets securely using Modal's secret management:
```bash
# Create secret in Modal dashboard or CLI
modal secret create my-api-key API_KEY=sk-xxx
```
```python
# Reference in config
config = ModalSandboxConfig(
    secrets=["my-api-key"],   # Modal secret names
    env_vars={"DEBUG": "1"},  # Additional env vars
)
```
## Troubleshooting
### "Modal package not installed"
```bash
pip install modal
modal token new # Authenticate
```
### "Sandbox creation failed"
- Check Modal dashboard for quota limits
- Verify image exists and is accessible
- Check secret names are correct
### Shutdown errors
These are harmless warnings during Python interpreter shutdown:
```
[Modal] Error terminating ...: cannot schedule new futures after interpreter shutdown
```
The sandboxes will auto-terminate via Modal's idle_timeout anyway.

View File

@@ -1,104 +0,0 @@
# Agents
The agent is the core loop that orchestrates LLM calls and tool execution.
## AIAgent Class
The main agent is implemented in `run_agent.py`:
```python
class AIAgent:
    def __init__(
        self,
        model: str = "anthropic/claude-sonnet-4",
        api_key: str = None,
        base_url: str = "https://openrouter.ai/api/v1",
        max_turns: int = 20,
        enabled_toolsets: list = None,
        disabled_toolsets: list = None,
        verbose_logging: bool = False,
    ):
        # Initialize OpenAI client, load tools based on toolsets
        ...

    def chat(self, user_message: str, task_id: str = None) -> str:
        # Main entry point - runs the agent loop
        ...
```
## Agent Loop
The core loop in `_run_agent_loop()`:
```
1. Add user message to conversation
2. Call LLM with tools
3. If LLM returns tool calls:
- Execute each tool
- Add tool results to conversation
- Go to step 2
4. If LLM returns text response:
- Return response to user
```
```python
while turns < max_turns:
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        tools=tool_schemas,
    )
    message = response.choices[0].message
    if message.tool_calls:
        messages.append(message)  # keep the assistant turn with its tool_calls
        for tool_call in message.tool_calls:
            result = await execute_tool(tool_call)
            messages.append(tool_result_message(result))
        turns += 1
    else:
        return message.content
```
## Conversation Management
Messages are stored as a list of dicts following OpenAI format:
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "Search for Python tutorials"},
    {"role": "assistant", "content": None, "tool_calls": [...]},
    {"role": "tool", "tool_call_id": "...", "content": "..."},
    {"role": "assistant", "content": "Here's what I found..."},
]
```
## Reasoning Context
For models that support reasoning (chain-of-thought), the agent:
1. Extracts `reasoning_content` from API responses
2. Stores it in `assistant_msg["reasoning"]` for trajectory export
3. Passes it back via `reasoning_content` field on subsequent turns
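A minimal sketch of that round-trip (assuming the OpenRouter-style `reasoning_content` field described in the LLM Client doc; the surrounding loop and `messages` handling are elided):

```python
message = response.choices[0].message

# Provider-dependent: reasoning_content is absent on most models.
reasoning = getattr(message, "reasoning_content", None)

assistant_msg = {"role": "assistant", "content": message.content or ""}
if reasoning:
    assistant_msg["reasoning"] = reasoning          # kept for trajectory export
    assistant_msg["reasoning_content"] = reasoning  # echoed back on the next turn
messages.append(assistant_msg)
```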
## Trajectory Export
Conversations can be exported for training:
```python
agent = AIAgent(save_trajectories=True)
agent.chat("Do something")
# Saves to trajectories/*.jsonl in ShareGPT format
```
## Batch Processing
For processing multiple prompts, use `batch_runner.py`:
```bash
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--num_workers=4 \
--run_name=my_run
```
See `batch_runner.py` for parallel execution with checkpointing.
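The `--resume`-style checkpointing is roughly this pattern (a sketch under assumptions: the `task_id` field name and one-JSON-object-per-line result layout are illustrative, not the real `batch_runner.py` schema):

```python
import json
from pathlib import Path

def pending_tasks(tasks: list[dict], results_path: Path) -> list[dict]:
    """Skip tasks that already have a saved result (assumed JSONL output)."""
    done: set[str] = set()
    if results_path.exists():
        for line in results_path.read_text().splitlines():
            if line.strip():
                done.add(json.loads(line)["task_id"])
    return [t for t in tasks if t["task_id"] not in done]
```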

View File

@@ -1,264 +0,0 @@
# CLI
The Hermes Agent CLI provides an interactive terminal interface for working with the agent.
## Running the CLI
```bash
# Basic usage
./hermes
# With specific model
./hermes --model "anthropic/claude-sonnet-4"
# With specific toolsets
./hermes --toolsets "web,terminal,skills"
# Verbose mode
./hermes --verbose
```
## Architecture
The CLI is implemented in `cli.py` and uses:
- **Rich** - Welcome banner with ASCII art and styled panels
- **prompt_toolkit** - Fixed input area with command history
- **KawaiiSpinner** - Animated feedback during operations
```
┌─────────────────────────────────────────────────┐
│ HERMES-AGENT ASCII Logo │
│ ┌─────────────┐ ┌────────────────────────────┐ │
│ │ Caduceus │ │ Model: claude-opus-4.5 │ │
│ │ ASCII Art │ │ Terminal: local │ │
│ │ │ │ Working Dir: /home/user │ │
│ │ │ │ Available Tools: 19 │ │
│ │ │ │ Available Skills: 12 │ │
│ └─────────────┘ └────────────────────────────┘ │
└─────────────────────────────────────────────────┘
│ Conversation output scrolls here... │
│ │
│ User: Hello! │
│ ────────────────────────────────────────────── │
│ (◕‿◕✿) 🧠 pondering... (2.3s) │
│ ✧٩(ˊᗜˋ*)و✧ got it! (2.3s) │
│ │
│ Assistant: Hello! How can I help you today? │
├─────────────────────────────────────────────────┤
[Fixed input area at bottom] │
└─────────────────────────────────────────────────┘
```
## Commands
| Command | Description |
|---------|-------------|
| `/help` | Show available commands |
| `/tools` | List available tools grouped by toolset |
| `/toolsets` | List available toolsets with descriptions |
| `/model [name]` | Show or change the current model |
| `/prompt [text]` | View/set/clear custom system prompt |
| `/personality [name]` | Set a predefined personality |
| `/clear` | Clear screen and reset conversation |
| `/reset` | Reset conversation only (keep screen) |
| `/history` | Show conversation history |
| `/save` | Save current conversation to file |
| `/config` | Show current configuration |
| `/quit` | Exit the CLI (also: `/exit`, `/q`) |
## Configuration
The CLI is configured via `cli-config.yaml`. Copy from `cli-config.yaml.example`:
```bash
cp cli-config.yaml.example cli-config.yaml
```
### Model Configuration
```yaml
model:
  default: "anthropic/claude-opus-4.5"
  base_url: "https://openrouter.ai/api/v1"
```
### Terminal Configuration
The CLI supports multiple terminal backends:
```yaml
# Local execution (default)
terminal:
  env_type: "local"
  cwd: "."  # Current directory

# SSH remote execution (sandboxed - agent can't touch its own code)
terminal:
  env_type: "ssh"
  cwd: "/home/myuser/project"
  ssh_host: "my-server.example.com"
  ssh_user: "myuser"
  ssh_key: "~/.ssh/id_rsa"

# Docker container
terminal:
  env_type: "docker"
  docker_image: "python:3.11"

# Singularity/Apptainer (HPC)
terminal:
  env_type: "singularity"
  singularity_image: "docker://python:3.11"

# Modal cloud
terminal:
  env_type: "modal"
  modal_image: "python:3.11"
```
### Sudo Support
The CLI supports interactive sudo prompts:
```
┌──────────────────────────────────────────────────────────┐
│ 🔐 SUDO PASSWORD REQUIRED │
├──────────────────────────────────────────────────────────┤
│ Enter password below (input is hidden), or: │
│ • Press Enter to skip (command fails gracefully) │
│ • Wait 45s to auto-skip │
└──────────────────────────────────────────────────────────┘
Password (hidden):
```
**Options:**
- **Interactive**: Leave `sudo_password` unset - you'll be prompted when needed
- **Configured**: Set `sudo_password` in `cli-config.yaml` to auto-fill
- **Environment**: Set `SUDO_PASSWORD` in `.env` for all runs
Password is cached for the session once entered.
### Toolsets
Control which tools are available:
```yaml
# Enable all tools
toolsets:
  - all

# Or enable specific toolsets
toolsets:
  - web
  - terminal
  - skills
```
Available toolsets: `web`, `search`, `terminal`, `browser`, `vision`, `image_gen`, `skills`, `moa`, `debugging`, `safe`
### Personalities
Predefined personalities for the `/personality` command:
```yaml
agent:
  personalities:
    helpful: "You are a helpful, friendly AI assistant."
    kawaii: "You are a kawaii assistant! Use cute expressions..."
    pirate: "Arrr! Ye be talkin' to Captain Hermes..."
    # Add your own!
```
Built-in personalities:
- `helpful`, `concise`, `technical`, `creative`, `teacher`
- `kawaii`, `catgirl`, `pirate`, `shakespeare`, `surfer`
- `noir`, `uwu`, `philosopher`, `hype`
## Animated Feedback
The CLI provides animated feedback during operations:
### Thinking Animation
During API calls, shows animated spinner with thinking verbs:
```
◜ (。•́︿•̀。) pondering... (1.2s)
◠ (⊙_⊙) contemplating... (2.4s)
✧٩(ˊᗜˋ*)و✧ got it! (3.1s)
```
### Tool Execution Animation
Each tool type has unique animations:
```
⠋ (◕‿◕✿) 🔍 web_search... (0.8s)
▅ (≧◡≦) 💻 terminal... (1.2s)
🌓 (★ω★) 🌐 browser_navigate... (2.1s)
✧ (✿◠‿◠) 🎨 image_generate... (4.5s)
```
## Multi-line Input
For multi-line input, end a line with `\` to continue:
```
Write a function that:\
1. Takes a list of numbers\
2. Returns the sum
```
## Environment Variable Priority
For terminal settings, `cli-config.yaml` takes precedence over `.env`:
1. `cli-config.yaml` (highest priority in CLI)
2. `.env` file
3. System environment variables
4. Default values
This allows you to have different terminal configs for CLI vs batch processing.
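In code, that priority amounts to something like this (hypothetical helper; the real lookup lives in the CLI's config loader, though the `TERMINAL_*` prefix matches the env vars shown elsewhere in these docs):

```python
import os

DEFAULTS = {"env_type": "local", "cwd": ".", "timeout": 180}

def resolve_terminal_setting(key: str, cli_config: dict):
    """Resolve one terminal setting using the CLI's precedence rules."""
    # 1. cli-config.yaml wins inside the CLI
    if key in cli_config.get("terminal", {}):
        return cli_config["terminal"][key]
    # 2./3. .env values are loaded into the process environment first
    env_val = os.getenv(f"TERMINAL_{key.upper()}")
    if env_val is not None:
        return env_val
    # 4. built-in default
    return DEFAULTS.get(key)
```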
## Session Management
- **History**: Command history is saved to `~/.hermes_history`
- **Conversations**: Use `/save` to export conversations
- **Reset**: Use `/clear` for full reset, `/reset` to just clear history
- **Session Logs**: Every session automatically logs to `logs/session_{session_id}.json`
### Session Logging
Sessions are automatically logged to the `logs/` directory:
```
logs/
├── session_20260201_143052_a1b2c3.json
├── session_20260201_150217_d4e5f6.json
└── ...
```
The session ID is displayed in the welcome banner and follows the format: `YYYYMMDD_HHMMSS_UUID`.
Log files contain:
- Full conversation history in trajectory format
- Timestamps for session start and last update
- Model and message count metadata
This is useful for:
- Debugging agent behavior
- Replaying conversations
- Training data inspection
## Quiet Mode
The CLI runs in "quiet mode" (`HERMES_QUIET=1`), which:
- Suppresses verbose logging from tools
- Enables kawaii-style animated feedback
- Hides terminal environment warnings
- Keeps output clean and user-friendly
For verbose output (debugging), use:
```bash
./hermes --verbose
```

View File

@@ -1,124 +0,0 @@
# LLM Client
Hermes Agent uses the OpenAI Python SDK with OpenRouter as the backend, providing access to many models through a single API.
## Configuration
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("OPENROUTER_API_KEY"),
    base_url="https://openrouter.ai/api/v1"
)
```
## Supported Models
Any model available on [OpenRouter](https://openrouter.ai/models):
```python
# Anthropic
model = "anthropic/claude-sonnet-4"
model = "anthropic/claude-opus-4"
# OpenAI
model = "openai/gpt-4o"
model = "openai/o1"
# Google
model = "google/gemini-2.0-flash"
# Open models
model = "meta-llama/llama-3.3-70b-instruct"
model = "deepseek/deepseek-chat-v3"
model = "moonshotai/kimi-k2.5"
```
## Tool Calling
Standard OpenAI function calling format:
```python
response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=[
        {
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Search the web",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    },
                    "required": ["query"]
                }
            }
        }
    ],
)

# Check for tool calls
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        name = tool_call.function.name
        args = json.loads(tool_call.function.arguments)
        # Execute tool...
```
## Reasoning Models
Some models return reasoning/thinking content:
```python
# Access reasoning if available
message = response.choices[0].message
if hasattr(message, 'reasoning_content') and message.reasoning_content:
    reasoning = message.reasoning_content
    # Store for trajectory export
```
## Provider Selection
OpenRouter allows selecting specific providers:
```python
response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "provider": {
            "order": ["Anthropic", "Google"],  # Preferred providers
            "ignore": ["Novita"],              # Providers to skip
        }
    }
)
```
## Error Handling
Common errors and handling:
```python
import time

try:
    response = client.chat.completions.create(...)
except openai.RateLimitError:
    time.sleep(2)  # back off and retry
except openai.APIError as e:
    # Check e.code for specific errors:
    #   400 = bad request (often provider-specific)
    #   502 = bad gateway (retry with a different provider)
    raise
```
## Cost Tracking
OpenRouter returns usage info:
```python
usage = response.usage
print(f"Tokens: {usage.prompt_tokens} + {usage.completion_tokens}")
print(f"Cost: ${usage.cost:.6f}") # If available
```

View File

@@ -1,121 +0,0 @@
# Message Format & Trajectories
Hermes Agent uses two message formats: the **API format** for LLM calls and the **trajectory format** for training data export.
## API Message Format
Standard OpenAI chat format used during execution:
```python
messages = [
    # System prompt
    {"role": "system", "content": "You are a helpful assistant with tools..."},
    # User query
    {"role": "user", "content": "Search for Python tutorials"},
    # Assistant with tool call
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "web_search",
                "arguments": "{\"query\": \"Python tutorials\"}"
            }
        }]
    },
    # Tool result
    {
        "role": "tool",
        "tool_call_id": "call_abc123",
        "content": "{\"results\": [...]}"
    },
    # Final response
    {"role": "assistant", "content": "Here's what I found..."}
]
```
## Trajectory Format (ShareGPT)
Exported for training in ShareGPT format:
```json
{
  "conversations": [
    {"from": "system", "value": "You are a helpful assistant..."},
    {"from": "human", "value": "Search for Python tutorials"},
    {"from": "gpt", "value": "<tool_call>\n{\"name\": \"web_search\", \"arguments\": {\"query\": \"Python tutorials\"}}\n</tool_call>"},
    {"from": "tool", "value": "<tool_response>\n{\"results\": [...]}\n</tool_response>"},
    {"from": "gpt", "value": "Here's what I found..."}
  ],
  "tools": "[{\"type\": \"function\", \"function\": {...}}]",
  "source": "hermes-agent"
}
```
## Reasoning Content
For models that output reasoning/chain-of-thought:
**During execution** (API format):
```python
# Stored internally but not sent back to model in content
assistant_msg = {
    "role": "assistant",
    "content": "Here's what I found...",
    "reasoning": "Let me think about this step by step..."  # Internal only
}
```
**In trajectory export** (reasoning wrapped in tags):
```json
{
  "from": "gpt",
  "value": "<think>\nLet me think about this step by step...\n</think>\nHere's what I found..."
}
```
## Conversion Flow
```
API Response → Internal Storage → Trajectory Export
↓ ↓ ↓
tool_calls reasoning field <tool_call> tags
reasoning_content <think> tags
```
The conversion happens in `_convert_to_trajectory_format()` in `run_agent.py`.
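In outline, the conversion does something like the following (a simplified sketch, not the real `_convert_to_trajectory_format()`):

```python
import json

ROLE_MAP = {"system": "system", "user": "human", "assistant": "gpt", "tool": "tool"}

def to_trajectory(messages: list[dict]) -> list[dict]:
    """Convert API-format messages into ShareGPT-style turns (simplified)."""
    turns = []
    for msg in messages:
        value = msg.get("content") or ""
        if msg["role"] == "assistant" and msg.get("tool_calls"):
            value = "\n".join(
                "<tool_call>\n"
                + json.dumps({"name": tc["function"]["name"],
                              "arguments": json.loads(tc["function"]["arguments"])})
                + "\n</tool_call>"
                for tc in msg["tool_calls"]
            )
        if msg["role"] == "assistant" and msg.get("reasoning"):
            value = f"<think>\n{msg['reasoning']}\n</think>\n{value}"
        if msg["role"] == "tool":
            value = f"<tool_response>\n{value}\n</tool_response>"
        turns.append({"from": ROLE_MAP[msg["role"]], "value": value})
    return turns
```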
## Ephemeral System Prompts
Batch processing supports ephemeral system prompts that guide behavior during execution but are NOT saved to trajectories:
```python
# During execution: full system prompt + ephemeral guidance
messages = [
    {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + ephemeral_prompt},
    ...
]

# In saved trajectory: only the base system prompt
trajectory = {
    "conversations": [
        {"from": "system", "value": SYSTEM_PROMPT},  # No ephemeral
        ...
    ]
}
```
## Trajectory Compression
Long trajectories can be compressed for training using `trajectory_compressor.py`:
- Protects first/last N turns
- Summarizes middle turns with LLM
- Targets specific token budget
- See `configs/trajectory_compression.yaml` for settings
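The selection logic those rules imply is roughly the following (hypothetical sketch; the knobs mirror `configs/trajectory_compression.yaml`, but the real compressor's heuristics may differ):

```python
def turns_to_summarize(token_counts: list[int],
                       target_max_tokens: int = 29000,
                       summary_target_tokens: int = 750,
                       protect_head: int = 4,
                       protect_tail: int = 4) -> list[int]:
    """Pick middle-turn indices to fold into a summary until the budget fits."""
    if sum(token_counts) <= target_max_tokens:
        return []  # already under target, nothing to compress
    remaining = sum(token_counts) + summary_target_tokens  # summary costs tokens too
    picked: list[int] = []
    for i in range(protect_head, len(token_counts) - protect_tail):
        if remaining <= target_max_tokens:
            break
        remaining -= token_counts[i]  # this turn's tokens get replaced by the summary
        picked.append(i)
    return picked
```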

View File

@@ -1,159 +0,0 @@
# Tools
Tools are functions that extend the agent's capabilities. Each tool is defined with an OpenAI-compatible JSON schema and an async handler function.
## Tool Structure
Each tool module in `tools/` exports:
1. **Schema definitions** - OpenAI function-calling format
2. **Handler functions** - Async functions that execute the tool
```python
# Example: tools/web_tools.py
# Schema definition
WEB_SEARCH_SCHEMA = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
}

# Handler function
async def web_search(query: str) -> dict:
    """Execute web search and return results."""
    # Implementation...
    return {"results": [...]}
```
## Tool Categories
| Category | Module | Tools |
|----------|--------|-------|
| **Web** | `web_tools.py` | `web_search`, `web_extract`, `web_crawl` |
| **Terminal** | `terminal_tool.py` | `terminal` (local/docker/singularity/modal/ssh backends) |
| **Browser** | `browser_tool.py` | `browser_navigate`, `browser_click`, `browser_type`, etc. |
| **Vision** | `vision_tools.py` | `vision_analyze` |
| **Image Gen** | `image_generation_tool.py` | `image_generate` |
| **Reasoning** | `mixture_of_agents_tool.py` | `mixture_of_agents` |
| **Skills** | `skills_tool.py` | `skills_categories`, `skills_list`, `skill_view` |
## Tool Registration
Tools are registered in `model_tools.py`:
```python
# model_tools.py
TOOL_SCHEMAS = [
    *WEB_TOOL_SCHEMAS,
    *TERMINAL_TOOL_SCHEMAS,
    *BROWSER_TOOL_SCHEMAS,
    # ...
]

TOOL_HANDLERS = {
    "web_search": web_search,
    "terminal": terminal_tool,
    "browser_navigate": browser_navigate,
    # ...
}
```
## Toolsets
Tools are grouped into **toolsets** for logical organization (see `toolsets.py`):
```python
TOOLSETS = {
    "web": {
        "description": "Web search and content extraction",
        "tools": ["web_search", "web_extract", "web_crawl"]
    },
    "terminal": {
        "description": "Command execution",
        "tools": ["terminal"]
    },
    # ...
}
```
## Adding a New Tool
1. Create handler function in `tools/your_tool.py`
2. Define JSON schema following OpenAI format
3. Register in `model_tools.py` (schemas and handlers)
4. Add to appropriate toolset in `toolsets.py`
5. Update `tools/__init__.py` exports
## Stateful Tools
Some tools maintain state across calls within a session:
- **Terminal**: Keeps container/sandbox running between commands
- **Browser**: Maintains browser session for multi-step navigation
State is managed per `task_id` and cleaned up automatically.
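A minimal sketch of that pattern (hypothetical registry; each stateful tool actually keeps its own):

```python
_sessions: dict[str, object] = {}

def get_session(task_id: str, factory):
    """Reuse the live backend (container, browser, ...) for this task, or start one."""
    if task_id not in _sessions:
        _sessions[task_id] = factory()
    return _sessions[task_id]

def cleanup(task_id: str) -> None:
    """Tear down and forget the session once a task finishes."""
    session = _sessions.pop(task_id, None)
    if session is not None:
        session.close()
```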
## Terminal Backends
The terminal tool supports multiple execution backends:
| Backend | Description | Use Case |
|---------|-------------|----------|
| `local` | Direct execution on host | Development, simple tasks |
| `ssh` | Remote execution via SSH | Sandboxing (agent can't modify its own code) |
| `docker` | Docker container | Isolation, reproducibility |
| `singularity` | Singularity/Apptainer | HPC clusters, rootless containers |
| `modal` | Modal cloud | Scalable cloud compute, GPUs |
Configure via environment variables or `cli-config.yaml`:
```yaml
# SSH backend example (in cli-config.yaml)
terminal:
  env_type: "ssh"
  ssh_host: "my-server.example.com"
  ssh_user: "myuser"
  ssh_key: "~/.ssh/id_rsa"
  cwd: "/home/myuser/project"
```
The SSH backend uses ControlMaster for connection persistence, making subsequent commands fast.
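Under the hood that boils down to OpenSSH multiplexing options, roughly like this (a sketch; the backend's real invocation may differ):

```python
import subprocess

def ssh_run(user: str, host: str, command: str) -> str:
    """Run one command over a shared, multiplexed SSH connection."""
    argv = [
        "ssh",
        "-o", "ControlMaster=auto",                   # create or reuse a master connection
        "-o", "ControlPath=/tmp/hermes-cm-%r@%h:%p",  # socket identifying the connection
        "-o", "ControlPersist=300",                   # keep the master warm after last use
        f"{user}@{host}",
        command,
    ]
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout
```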
## Skills Tools (Progressive Disclosure)
Skills are on-demand knowledge documents. They use **progressive disclosure** to minimize tokens:
```
Level 0: skills_categories() → ["mlops", "devops"] (~50 tokens)
Level 1: skills_list(category) → [{name, description}, ...] (~3k tokens)
Level 2: skill_view(name) → Full content + metadata (varies)
Level 3: skill_view(name, path) → Specific reference file (varies)
```
Skill directory structure:
```
skills/
└── mlops/
└── axolotl/
├── SKILL.md # Main instructions (required)
├── references/ # Additional docs
└── templates/ # Output formats, configs
```
SKILL.md uses YAML frontmatter:
```yaml
---
name: axolotl
description: Fine-tuning LLMs with Axolotl
tags: [Fine-Tuning, LoRA, DPO]
---
```

View File

@@ -1,70 +0,0 @@
---
name: example-skill
description: An example skill demonstrating the skill file format and structure
---
# Example Skill
This is an example skill file that demonstrates how to create skills for the Hermes Agent.
## Skill File Format
Skills are markdown files with YAML frontmatter at the top:
```yaml
---
name: your-skill-name
description: A brief one-line description of what this skill does
---
```
The frontmatter fields:
- **name**: The identifier used to reference this skill (lowercase, hyphens for spaces)
- **description**: A brief description shown when listing skills (keep under 200 chars)
## Writing Effective Skills
### 1. Be Specific and Actionable
Good skills provide clear, actionable instructions:
```
When reviewing code:
1. Check for security vulnerabilities first
2. Verify error handling is comprehensive
3. Ensure tests cover edge cases
```
### 2. Include Examples
Show concrete examples of what you want:
```python
# Good: Descriptive variable names
user_authentication_token = get_token()
# Bad: Cryptic abbreviations
uat = gt()
```
### 3. Define When to Use
Help the agent understand when this skill applies:
> Use this skill when: reviewing pull requests, auditing security, or checking code quality.
## Skill Categories
Consider organizing skills by purpose:
- **Conventions**: Coding standards, API patterns, naming rules
- **Workflows**: Step-by-step processes for deployments, reviews, releases
- **Knowledge**: Domain-specific information, system architecture, gotchas
- **Templates**: Boilerplate for common tasks, response formats
## Tips
1. Keep the description concise - it's shown in the skills list
2. Use headers to organize longer skills
3. Include code examples where helpful
4. Reference other skills if they're related

View File

@@ -1,12 +1,12 @@
python batch_runner.py \
--dataset_file="source-data/hermes-agent-agent-tasks-1/agent_tasks_eval.jsonl" \
--batch_size=50 \
--run_name="megascience_sft_minimax-m2.1-thinking-2-eval" \
--dataset_file="source-data/agent_tasks_eval.jsonl" \
--batch_size=1 \
--run_name="agenttasks_eval_gemini-4.5-3-nothinking" \
--distribution="science" \
--model="minimax/minimax-m2.1" \
--base_url="https://openrouter.ai/api/v1" \
--providers_allowed="minimax" \
--num_workers=1 \
--max_turns=40 \
--model="gemini-3-pro-preview" \
--base_url="https://generativelanguage.googleapis.com/v1beta/openai/" \
--api_key="${GEMINI_API_KEY}" \
--num_workers=10 \
--max_turns=60 \
--verbose \
--ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build off of the results of prior tools you've used results. Always use the terminal or search tool if it can provide additional context, verify formulas, double check concepts and recent studies and understanding, doing all calculations, etc. You should only be confident in your own reasoning, knowledge, or calculations if you've exhaustively used all tools available to you to that can help you verify or validate your work. Always pip install any packages you need to use the python scripts you want to run. If you need to use a tool that isn't available, you can use the terminal tool to install or create it in many cases as well. Do not use the terminal tool to communicate with the user, as they cannot see your commands, only your final response after completing the task. Search for at least 3 sources, but not more than 12."
--ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build off of the results of prior tools you've used results. Always use the terminal or search tool if it can provide additional context, verify formulas, double check concepts and recent studies and understanding, doing all calculations, etc. You should only be confident in your own reasoning, knowledge, or calculations if you've exhaustively used all tools available to you to that can help you verify or validate your work. Always pip install any packages you need to use the python scripts you want to run. If you need to use a tool that isn't available, you can use the terminal tool to install or create it in many cases as well. Do not use the terminal tool to communicate with the user, as they cannot see your commands, only your final response after completing the task. If you require API keys please check which ones already exist in your environment variables in a way that does not read them."

33
configs/run_datagen_glm4.7.sh → gemini_thinking.sh Executable file → Normal file
View File

@@ -1,26 +1,13 @@
#!/bin/bash
# Create logs directory if it doesn't exist
mkdir -p logs
# Generate log filename with timestamp
LOG_FILE="logs/glm4.7-thinking-sft1_$(date +%Y%m%d_%H%M%S).log"
echo "📝 Logging output to: $LOG_FILE"
python batch_runner.py \
--dataset_file="source-data/hermes-agent-agent-tasks-1/agent_tasks_sft_2.jsonl" \
--batch_size=20 \
--run_name="megascience_glm4.7-thinking-sft2" \
--dataset_file="source-data/agent_tasks_eval_10.jsonl" \
--batch_size=1 \
--run_name="agenttasks_eval_gemini-3-3-10-thinking-2025-11-22" \
--distribution="science" \
--model="z-ai/glm-4.7" \
--base_url="https://openrouter.ai/api/v1" \
--providers_allowed="gmicloud,siliconflow,atlas-cloud,z-ai,novita" \
--num_workers=15 \
--prokletor_client="HermesToolClient" \
--model="gemini-3-pro-preview" \
--base_url="https://generativelanguage.googleapis.com/v1beta/openai/" \
--api_key="${GEMINI_API_KEY}" \
--num_workers=10 \
--max_turns=60 \
--ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build off of the results of prior tools you've used results. Always use the terminal or search tool if it can provide additional context, verify formulas, double check concepts and recent studies and understanding, doing all calculations, etc. You should only be confident in your own reasoning, knowledge, or calculations if you've exhaustively used all tools available to you to that can help you verify or validate your work. Always pip install any packages you need to use the python scripts you want to run. If you need to use a tool that isn't available, you can use the terminal tool to install or create it in many cases as well. Do not use the terminal tool to communicate with the user, as they cannot see your commands, only your final response after completing the task. Search for at least 3 sources, but not more than 12, so you can maintain focused context." \
2>&1 | tee "$LOG_FILE"
echo "✅ Log saved to: $LOG_FILE"
# --verbose \
--verbose \
--ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build off of the results of prior tools you've used results. Always use the terminal or search tool if it can provide additional context, verify formulas, double check concepts and recent studies and understanding, doing all calculations, etc. You should only be confident in your own reasoning, knowledge, or calculations if you've exhaustively used all tools available to you to that can help you verify or validate your work. Always pip install any packages you need to use the python scripts you want to run. If you need to use a tool that isn't available, you can use the terminal tool to install or create it in many cases as well. Do not use the terminal tool to communicate with the user, as they cannot see your commands, only your final response after completing the task. The web search tool only gets you urls and brief descriptions you need to run web extract to actually visit those urls. If you need to check if you have a certain API key available please do so in a way that does not expose the key. For verbose tools like installs please use the quietest version. Also please make sure you include -y in your install commands or the terminal will get stuck at the y/n stage."

46
hermes
View File

@@ -1,46 +0,0 @@
#!/usr/bin/env python3
"""
Hermes Agent CLI Launcher
This is a convenience wrapper to launch the Hermes CLI.
Usage: ./hermes [options]
"""
if __name__ == "__main__":
"""
Fire (google/python-fire) does not support POSIX-style short flags like `-p`.
We translate the most common shorthands to their long equivalents so wrapper
scripts can reliably use:
- `-p "..."` -> `--prompt "..."` (no TUI/banner; print result and exit)
- `-q "..."` -> `--query "..."` (single-shot with banner UX)
"""
import sys
def _rewrite_short_flags(argv: list[str]) -> list[str]:
rewritten: list[str] = []
i = 0
while i < len(argv):
arg = argv[i]
if arg == "-p":
rewritten.append("--prompt")
if i + 1 < len(argv):
rewritten.append(argv[i + 1])
i += 2
continue
if arg == "-q":
rewritten.append("--query")
if i + 1 < len(argv):
rewritten.append(argv[i + 1])
i += 2
continue
rewritten.append(arg)
i += 1
return rewritten
sys.argv = [sys.argv[0]] + _rewrite_short_flags(sys.argv[1:])
from cli import main
import fire
fire.Fire(main)

View File

@@ -1,659 +0,0 @@
Metadata-Version: 2.4
Name: hermes-agent
Version: 0.1.0
Summary: AI agent with advanced tool-calling and toolsets
Author: Nous Research
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: openai
Requires-Dist: python-dotenv
Requires-Dist: fire
Requires-Dist: httpx
Requires-Dist: rich
Requires-Dist: tenacity
Requires-Dist: pyyaml
Requires-Dist: prompt_toolkit
Requires-Dist: requests
Requires-Dist: jinja2
Requires-Dist: pydantic>=2.0
Requires-Dist: firecrawl-py
Requires-Dist: fal-client
Requires-Dist: litellm>=1.75.5
Requires-Dist: typer
Requires-Dist: platformdirs
Provides-Extra: modal
Requires-Dist: modal; extra == "modal"
Requires-Dist: boto3; extra == "modal"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Provides-Extra: atropos
Requires-Dist: atroposlib @ git+https://github.com/NousResearch/atropos.git ; extra == "atropos"
Requires-Dist: aiohttp; extra == "atropos"
Requires-Dist: fastapi; extra == "atropos"
Requires-Dist: uvicorn; extra == "atropos"
Requires-Dist: pyte; extra == "atropos"
# Hermes Agent
An AI agent with advanced tool-calling capabilities, featuring a flexible toolsets system for organizing and managing tools.
## Features
- **Interactive CLI**: Beautiful terminal interface with animated feedback, personalities, and session management
- **Web Tools**: Search, extract content, and crawl websites
- **Terminal Tools**: Execute commands via local, Docker, Singularity, Modal, or SSH backends
- **Browser Tools**: Automate web browsers to navigate, click, type, and extract content
- **Vision Tools**: Analyze images from URLs
- **Reasoning Tools**: Advanced multi-model reasoning (Mixture of Agents)
- **Creative Tools**: Generate images from text prompts
- **Skills Tools**: On-demand knowledge documents with progressive disclosure
- **Toolsets System**: Organize tools into logical groups for different scenarios
- **Batch Processing**: Process datasets in parallel with checkpointing and statistics tracking
- **Ephemeral System Prompts**: Guide model behavior without polluting training datasets
## Quick Start (CLI)
```bash
# After setup (see below), just run:
./hermes
# Or with options:
./hermes --model "anthropic/claude-sonnet-4" --toolsets "web,terminal"
```
The CLI provides:
- Animated spinners during thinking and tool execution
- Kawaii-style feedback messages
- `/commands` for configuration, history, and session management
- Customizable personalities (`/personality kawaii`, `/personality pirate`, etc.)
- Persistent configuration via `cli-config.yaml`
## Setup
### 1. Clone the Repository
```bash
# Clone with submodules (recommended)
git clone --recurse-submodules https://github.com/NousResearch/Hermes-Agent.git
cd Hermes-Agent
# Or if already cloned without submodules:
git submodule update --init --recursive
```
### 2. Install Dependencies
```bash
# Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install Python packages
pip install -r requirements.txt
# Install mini-swe-agent for terminal tools
pip install -e ./mini-swe-agent
# Install Node.js dependencies for browser tools (requires Node.js)
npm install
```
### 3. Configure Environment Variables
```bash
# Copy the example environment file
cp .env.example .env
# Edit .env and add your API keys
nano .env # or use your preferred editor
```
**Required API Keys:**
- `OPENROUTER_API_KEY` - LLM access via OpenRouter (get at: https://openrouter.ai/keys)
- `FIRECRAWL_API_KEY` - Web tools (get at: https://firecrawl.dev/)
- `NOUS_API_KEY` - Vision & reasoning tools (get at: https://inference-api.nousresearch.com/)
- `FAL_KEY` - Image generation (get at: https://fal.ai/)
**Optional API Keys (for specific features):**
- `BROWSERBASE_API_KEY` - Browser automation (get at: https://browserbase.com/)
- `BROWSERBASE_PROJECT_ID` - From Browserbase dashboard
- `MORPH_API_KEY` - For legacy Hecate terminal backend (get at: https://morph.so/)
### 4. Configure Terminal Backend
The terminal tool uses **mini-swe-agent** environments. Configure in `.env` or `cli-config.yaml`:
```bash
# Backend: "local", "docker", "singularity", "modal", or "ssh"
TERMINAL_ENV=local # Default: runs on host machine (no isolation)
TERMINAL_ENV=ssh # Remote execution via SSH (agent code stays local)
TERMINAL_ENV=singularity # Recommended for HPC: Apptainer/Singularity containers
TERMINAL_ENV=docker # Isolated Docker containers
TERMINAL_ENV=modal # Cloud execution via Modal
# Container image (for docker/singularity/modal backends)
TERMINAL_DOCKER_IMAGE=python:3.11-slim
TERMINAL_SINGULARITY_IMAGE=docker://python:3.11-slim
TERMINAL_TIMEOUT=60
# SSH backend (for ssh)
TERMINAL_SSH_HOST=my-server.example.com
TERMINAL_SSH_USER=myuser
TERMINAL_SSH_KEY=~/.ssh/id_rsa # Optional, uses ssh-agent if not set
```
**Backend Requirements:**
- **local**: No extra setup (runs directly on your machine, no isolation)
- **ssh**: SSH access to remote machine (great for sandboxing - agent can't touch its own code)
- **singularity**: Requires Apptainer or Singularity installed (common on HPC clusters, no root needed)
- **docker**: Requires Docker installed and user in `docker` group
- **modal**: Requires Modal account (see setup below)
### Singularity/Apptainer Setup (Recommended for HPC)
Singularity/Apptainer provides rootless container execution, ideal for HPC clusters:
```bash
# 1. Verify Apptainer is installed
apptainer --version # or: singularity --version
# 2. Set up cache directories (important for parallel workers)
# Use /scratch if available (HPC), otherwise /tmp
export APPTAINER_CACHEDIR=/scratch/$USER/.apptainer
export APPTAINER_TMPDIR=/scratch/$USER/.apptainer/tmp
mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
# 3. Pre-build SIF image (recommended for parallel batch processing)
# This avoids race conditions when multiple workers start simultaneously
apptainer build $APPTAINER_CACHEDIR/python-nodejs.sif docker://nikolaik/python-nodejs:python3.11-nodejs20
# 4. Configure .env to use the local SIF
TERMINAL_ENV=singularity
TERMINAL_SINGULARITY_IMAGE=/scratch/$USER/.apptainer/python-nodejs.sif
```
**Tip:** The batch scripts in `configs/` automatically handle SIF pre-building if `/scratch` is available.
### Modal Cloud Backend Setup
[Modal](https://modal.com) provides serverless cloud compute for running sandboxed environments at scale.
```bash
# 1. Install Modal and dependencies
pip install modal boto3
# 2. Authenticate with Modal (opens browser)
modal setup
# 3. Set terminal backend to modal in .env
TERMINAL_ENV=modal
```
Modal uses CLI-based authentication (stored in `~/.modal/`), so no API key is needed in `.env`. After running `modal setup`, commands will automatically execute in Modal's cloud sandboxes.
### Browser Tools Setup
Browser tools enable the agent to navigate websites, fill forms, click buttons, and extract content. They use [agent-browser](https://github.com/vercel-labs/agent-browser) CLI with [Browserbase](https://browserbase.com) cloud execution.
```bash
# 1. Install Node.js (if not already installed)
# Use nvm (recommended) or your package manager
# 2. Install agent-browser CLI (choose one option):
npm install -g agent-browser # Option A: Global install (recommended)
npm install # Option B: Local install (uses npx fallback)
# 3. Get Browserbase credentials
# Sign up at https://browserbase.com/ and get your:
# - API Key (from Settings → API Keys)
# - Project ID (from your project dashboard)
# 4. Add to your .env file:
BROWSERBASE_API_KEY=your_api_key_here
BROWSERBASE_PROJECT_ID=your_project_id_here
```
**Available Browser Tools:**
| Tool | Description |
|------|-------------|
| `browser_navigate` | Navigate to a URL |
| `browser_snapshot` | Get text-based page snapshot with element refs |
| `browser_click` | Click an element by ref (e.g., `@e5`) |
| `browser_type` | Type text into an input field |
| `browser_scroll` | Scroll up or down |
| `browser_back` | Go back in browser history |
| `browser_press` | Press a keyboard key (Enter, Tab, etc.) |
| `browser_close` | Close the browser session |
| `browser_get_images` | Get list of images on the page |
**Example Usage:**
```bash
# Use browser tools with web search and vision
python run_agent.py \
--query "Go to amazon.com and find the price of the latest Kindle" \
--enabled_toolsets=browser,web,vision
# Use browser-focused distribution
python batch_runner.py \
--dataset_file=browser_tasks.jsonl \
--distribution=browser_use \
--run_name=browser_run
```
See `.env.example` for all available configuration options including debug settings.
### Skills Tools
Skills are on-demand knowledge documents the agent can load when needed. They follow a **progressive disclosure** pattern to minimize token usage:
```
skills/
├── mlops/                  # Category folder
│   ├── axolotl/            # Skill folder
│   │   ├── SKILL.md        # Main instructions (required)
│   │   ├── references/     # Additional docs, API specs
│   │   └── templates/      # Output formats, configs
│   └── vllm/
│       └── SKILL.md
```
**Available Skills Tools:**
| Tool | Description |
|------|-------------|
| `skills_categories` | List available skill categories (~50 tokens) |
| `skills_list` | List skills with name + description (~3k tokens for 40 skills) |
| `skill_view` | Load full skill content, tags, and linked files |
**Example Usage:**
```bash
# Use skills tools
python run_agent.py \
--query "What skills do you have for fine-tuning? Show me the axolotl skill." \
--enabled_toolsets=skills
```
**Creating Skills:**
Skills use YAML frontmatter for metadata:
```yaml
---
name: my-skill
description: Brief description shown in skills_list
tags: [tag1, tag2]
related_skills: [other-skill]
version: 1.0.0
---
# Skill Content
Instructions, examples, and guidelines here...
```
Skills can include:
- `references/` - Additional documentation, API specs, examples
- `templates/` - Output formats, config files, boilerplate code
- `scripts/` - Executable helpers (Python, shell scripts)
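To make the format concrete, here is a minimal sketch of parsing such a file; `load_skill` and the returned dict shape are illustrative assumptions, not the actual `tools/skills_tool.py` API:
```python
# Minimal sketch: split SKILL.md into YAML frontmatter and markdown body.
# `load_skill` and the return shape are illustrative, not the real API.
from pathlib import Path

import yaml

def load_skill(skill_dir: str) -> dict:
    text = Path(skill_dir, "SKILL.md").read_text(encoding="utf-8")
    # Frontmatter is delimited by the first two '---' markers.
    _, frontmatter, body = text.split("---", 2)
    meta = yaml.safe_load(frontmatter)
    return {
        "name": meta["name"],
        "description": meta["description"],
        "tags": meta.get("tags", []),
        "content": body.strip(),
    }
```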
## Session Logging
Every conversation is automatically logged to `logs/` for debugging and inspection:
```
logs/
├── session_20260201_143052_a1b2c3.json
├── session_20260201_150217_d4e5f6.json
└── ...
```
**Log Format:**
```json
{
"session_id": "20260201_143052_a1b2c3",
"model": "anthropic/claude-sonnet-4",
"session_start": "2026-02-01T14:30:52.123456",
"last_updated": "2026-02-01T14:35:12.789012",
"message_count": 8,
"conversations": [
{"from": "system", "value": "..."},
{"from": "human", "value": "..."},
{"from": "gpt", "value": "..."},
{"from": "tool", "value": "..."}
]
}
```
- **Automatic**: Logs are created and updated automatically after each conversation turn
- **Session ID in Banner**: The CLI displays the session ID in the welcome banner
- **Trajectory Format**: Uses the same format as batch processing for consistency
- **Git Ignored**: `logs/` is in `.gitignore` so logs aren't committed
## Interactive CLI
The CLI provides a rich interactive experience for working with the agent.
### Running the CLI
```bash
# Basic usage
./hermes
# With specific model
./hermes --model "anthropic/claude-sonnet-4"
# With specific toolsets
./hermes --toolsets "web,terminal,skills"
```
### CLI Commands
| Command | Description |
|---------|-------------|
| `/help` | Show available commands |
| `/tools` | List available tools by toolset |
| `/toolsets` | List available toolsets |
| `/model [name]` | Show or change the current model |
| `/prompt [text]` | View/set custom system prompt |
| `/personality [name]` | Set a predefined personality |
| `/clear` | Clear screen and reset conversation |
| `/reset` | Reset conversation only |
| `/history` | Show conversation history |
| `/save` | Save current conversation to file |
| `/config` | Show current configuration |
| `/quit` | Exit the CLI |
### Configuration
Copy `cli-config.yaml.example` to `cli-config.yaml` and customize:
```yaml
# Model settings
model:
default: "anthropic/claude-sonnet-4"
# Terminal backend (local, docker, singularity, modal, or ssh)
terminal:
env_type: "local"
cwd: "." # Use current directory
# Or use SSH for remote execution (keeps agent code isolated)
# terminal:
# env_type: "ssh"
# ssh_host: "my-server.example.com"
# ssh_user: "myuser"
# ssh_key: "~/.ssh/id_rsa"
# cwd: "/home/myuser/project"
# Enable specific toolsets
toolsets:
- all # or: web, terminal, browser, vision, etc.
# Custom personalities (use with /personality command)
agent:
personalities:
helpful: "You are a helpful assistant."
kawaii: "You are a kawaii assistant! Use cute expressions..."
```
### Personalities
Built-in personalities available via `/personality`:
- `helpful`, `concise`, `technical`, `creative`, `teacher`
- `kawaii`, `catgirl`, `pirate`, `shakespeare`, `surfer`
- `noir`, `uwu`, `philosopher`, `hype`
## Toolsets System
The agent uses a toolsets system for organizing and managing tools. All tools must be part of a toolset to be accessible - individual tool selection is not supported. This ensures consistent and logical grouping of capabilities.
### Key Concepts
- **Toolsets**: Logical groups of tools for specific use cases (e.g., "research", "development", "debugging")
- **Composition**: Toolsets can include other toolsets for powerful combinations
- **Custom Toolsets**: Create your own toolsets at runtime or by editing `toolsets.py`
- **Toolset-Only Access**: Tools are only accessible through toolsets, not individually
### Available Toolsets
See `toolsets.py` for the complete list of predefined toolsets including:
- Basic toolsets (web, terminal, vision, creative, reasoning)
- Composite toolsets (research, development, analysis, etc.)
- Scenario-specific toolsets (debugging, documentation, API testing, etc.)
- Special toolsets (safe mode without terminal, minimal, offline)
### Using Toolsets
```bash
# Use a predefined toolset
python run_agent.py --enabled_toolsets=research --query "Find latest AI papers"
# Combine multiple toolsets
python run_agent.py --enabled_toolsets=web,vision --query "Analyze this website"
# Enable all toolsets explicitly (same as omitting the flag)
python run_agent.py --enabled_toolsets=all --query "Do web research and run commands if helpful"
# Safe mode (no terminal access)
python run_agent.py --enabled_toolsets=safe --query "Help without running commands"
# List all available toolsets and tools
python run_agent.py --list_tools
```
See `toolsets.py` for the complete list of available toolsets and how to create custom ones.
## Basic Usage
### Default (all tools enabled)
```bash
# Uses OpenRouter by default - just set OPENROUTER_API_KEY in .env
python run_agent.py \
--query "search up the latest docs on jit in python 3.13 and write me basic example that's not in their docs. profile its perf" \
--max_turns 20 \
--model anthropic/claude-sonnet-4-20250514
```
### With specific toolset
```bash
python run_agent.py \
--query "Debug this Python error" \
--enabled_toolsets=debugging \
--model anthropic/claude-sonnet-4-20250514
```
### Python API
```python
from run_agent import AIAgent
# Uses OpenRouter by default (reads OPENROUTER_API_KEY from .env)
agent = AIAgent(
model="anthropic/claude-sonnet-4-20250514",
enabled_toolsets=["research"]
)
response = agent.chat("Find information about quantum computing")
# Create custom toolset at runtime
from toolsets import create_custom_toolset
create_custom_toolset(
name="my_tools",
description="My custom toolkit",
tools=["web_search"],
includes=["terminal", "vision"]
)
agent = AIAgent(enabled_toolsets=["my_tools"])
```
## Batch Processing
Process multiple prompts from a dataset in parallel with automatic checkpointing and statistics tracking:
```bash
# Basic batch processing
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--run_name=my_run
# With specific distribution
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--run_name=image_run \
--distribution=image_gen \
--num_workers=4
```
**Key Features:**
- Parallel processing with configurable workers
- Toolset distributions for varied data generation
- Automatic checkpointing and resume capability
- Combined output in `data/<run_name>/trajectories.jsonl`
- Tool usage statistics and success rates
Use `--list_distributions` to see available toolset distributions for varied data generation.
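Checkpointing is content-based, keyed on the prompt itself rather than its position in the file, so a resumed run skips completed work even if the dataset is reordered. The sketch below illustrates the idea; the helper names are hypothetical, not the actual batch_runner.py internals.
```python
# Hypothetical sketch of content-based checkpointing; not the actual
# batch_runner.py implementation.
import hashlib
import json

def checkpoint_key(prompt: dict) -> str:
    """Hash the prompt content so resume survives dataset reordering."""
    blob = json.dumps(prompt, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]

def pending(prompts: list[dict], completed_keys: set[str]) -> list[dict]:
    """Return only the prompts that have no completed checkpoint yet."""
    return [p for p in prompts if checkpoint_key(p) not in completed_keys]
```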
### Trajectory Compression
Post-process trajectories to fit within token budgets for training:
```bash
# Compress a directory of JSONL files
python trajectory_compressor.py --input=data/my_run
# Compress a single JSONL file
python trajectory_compressor.py --input=data/trajectories.jsonl
# Compress a 15% sample (useful for creating smaller training sets)
python trajectory_compressor.py --input=data/trajectories.jsonl --sample_percent=15
# Custom output and token target
python trajectory_compressor.py \
--input=data/trajectories.jsonl \
--output=data/compressed.jsonl \
--target_max_tokens=16000
```
**Features:**
- Protects first turns (system, human, first GPT response, first tool call)
- Protects last N turns (configurable)
- Summarizes middle turns using LLM to fit target token budget
- Supports both directory and single file input
- Optional random sampling with `--sample_percent`
- Configurable via `configs/trajectory_compression.yaml`
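As a rough sketch of the protection logic above (with `summarize_with_llm` standing in for the real LLM call, and the protected-turn counts as assumptions):
```python
def summarize_with_llm(turns: list[dict]) -> str:
    """Hypothetical stand-in for the real LLM summarization call."""
    return f"[summary of {len(turns)} compressed turns]"

def compress(turns: list[dict], protect_last: int = 4) -> list[dict]:
    # Protect the first four turns (system, human, first GPT response,
    # first tool call) and the last `protect_last` turns.
    middle = turns[4:len(turns) - protect_last]
    if not middle:
        return turns  # nothing to compress
    summary = {"from": "tool", "value": summarize_with_llm(middle)}
    return turns[:4] + [summary] + turns[len(turns) - protect_last:]
```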
### Ephemeral System Prompts
The ephemeral system prompt feature allows you to guide the model's behavior during batch processing **without** saving that prompt to the training dataset trajectories. This is useful for:
- Guiding model behavior during data collection
- Adding task-specific instructions
- Keeping saved trajectories clean and focused on tool-calling format
**Example:**
```bash
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=10 \
--run_name=my_run \
--ephemeral_system_prompt="You are a helpful assistant focused on image generation."
```
The ephemeral prompt influences the model's behavior during execution, but **only the standard tool-calling system prompt** is saved in the trajectory files.
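Conceptually, the mechanism looks like the sketch below: the ephemeral text is appended to the system message sent to the API, while the persisted trajectory keeps only the standard prompt. Names here are illustrative, not the actual batch_runner internals.
```python
# Illustrative sketch of the ephemeral-prompt split; not the actual
# batch_runner.py implementation.
def build_messages(standard_prompt: str, ephemeral: str | None, user: str):
    system = standard_prompt + (f"\n\n{ephemeral}" if ephemeral else "")
    # Sent to the API: includes the ephemeral guidance.
    api_messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
    # Saved to disk: only the standard tool-calling prompt.
    trajectory = [
        {"from": "system", "value": standard_prompt},
        {"from": "human", "value": user},
    ]
    return api_messages, trajectory
```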
## Command Line Arguments
**Single Agent (`run_agent.py`):**
- `--query`: The question or task for the agent
- `--model`: Model to use (default: claude-opus-4-20250514)
- `--api_key`: API key for authentication
- `--base_url`: API endpoint URL
- `--max_turns`: Maximum number of tool-calling iterations
- `--enabled_toolsets`: Comma-separated list of toolsets to enable. Use `all` (or `*`) to enable everything. If omitted, all toolsets are enabled by default.
- `--disabled_toolsets`: Comma-separated list of toolsets to disable
- `--list_tools`: List all available toolsets and tools
- `--save_trajectories`: Save conversation trajectories to JSONL files
**Batch Processing (`batch_runner.py`):**
- `--dataset_file`: Path to JSONL file with prompts
- `--batch_size`: Number of prompts per batch
- `--run_name`: Name for this run (for output/checkpointing)
- `--distribution`: Toolset distribution to use (default: "default")
- `--num_workers`: Number of parallel workers (default: 4)
- `--resume`: Resume from checkpoint if interrupted
- `--ephemeral_system_prompt`: System prompt used during execution but NOT saved to trajectories
- `--list_distributions`: List available toolset distributions
## Environment Variables
All environment variables can be configured in the `.env` file (copy from `.env.example`).
**LLM Provider (OpenRouter):**
- `OPENROUTER_API_KEY`: Primary LLM access via OpenRouter (supports Claude, GPT-4, Gemini, etc.)
- `LLM_MODEL`: Default model (e.g., `anthropic/claude-sonnet-4`, `openai/gpt-4o`)
**Tool API Keys:**
- `FIRECRAWL_API_KEY`: Web tools (search, extract, crawl)
- `NOUS_API_KEY`: Vision and reasoning tools
- `FAL_KEY`: Image generation tools
**Terminal Tool Configuration (mini-swe-agent backend):**
- `TERMINAL_ENV`: Backend type - `local`, `docker`, `singularity`, `modal`, or `ssh` (default: `local`)
- `TERMINAL_DOCKER_IMAGE`: Docker image for docker backend (default: `python:3.11-slim`)
- `TERMINAL_SINGULARITY_IMAGE`: Singularity/Apptainer image (can be `docker://...` URL or local `.sif` path)
- `TERMINAL_TIMEOUT`: Command timeout in seconds (default: `60`)
- `TERMINAL_LIFETIME_SECONDS`: Cleanup inactive environments after this time (default: `300`)
- `TERMINAL_CWD`: Working directory inside containers (default: `/tmp`)
- `TERMINAL_SCRATCH_DIR`: Custom scratch directory for sandbox storage (optional, auto-detects `/scratch`)
- `SUDO_PASSWORD`: Enable sudo commands by piping password via `sudo -S` (works with all backends)
- If unset in CLI mode, you'll be prompted interactively when sudo is needed (45s timeout)
**SSH Backend Configuration (for remote execution):**
- `TERMINAL_SSH_HOST`: Remote server hostname or IP
- `TERMINAL_SSH_USER`: SSH username
- `TERMINAL_SSH_PORT`: SSH port (default: `22`)
- `TERMINAL_SSH_KEY`: Path to SSH private key (optional, uses ssh-agent if not set)
**Browser Tool Configuration (agent-browser + Browserbase):**
- `BROWSERBASE_API_KEY`: Browserbase API key for cloud browser execution
- `BROWSERBASE_PROJECT_ID`: Browserbase project ID
- `BROWSER_SESSION_TIMEOUT`: Session timeout in seconds (default: `300`)
**Legacy Hecate Terminal Backend (optional):**
- `MORPH_API_KEY`: For Hecate/MorphCloud terminal backend
- `HECATE_VM_LIFETIME_SECONDS`: VM lifetime (default: 300)
- `HECATE_DEFAULT_SNAPSHOT_ID`: Default snapshot (default: snapshot_p5294qxt)
**Debug Options:**
- `WEB_TOOLS_DEBUG`, `VISION_TOOLS_DEBUG`, `MOA_TOOLS_DEBUG`, `IMAGE_TOOLS_DEBUG`: Enable debug logging
## Key Files
| File | Purpose |
|------|---------|
| `hermes` | CLI launcher script (run with `./hermes`) |
| `cli.py` | Interactive CLI implementation |
| `cli-config.yaml` | CLI configuration (copy from `.example`) |
| `run_agent.py` | Main agent runner - single query execution |
| `batch_runner.py` | Parallel batch processing with checkpointing |
| `model_tools.py` | Core tool definitions and handlers |
| `toolsets.py` | Toolset definitions and composition |
| `toolset_distributions.py` | Probability distributions for data generation |
| `trajectory_compressor.py` | Post-process trajectories for training |
| `tools/` | Individual tool implementations |
| `tools/skills_tool.py` | Skills system with progressive disclosure |
| `skills/` | On-demand knowledge documents |
| `docs/` | Documentation |
| `configs/` | Example batch run scripts |
# Atropos Integrations & RL Training
## Nomad Setup
Follow this: https://developer.hashicorp.com/nomad/docs/deploy
## Atropos dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[atropos]'

View File

@@ -1,70 +0,0 @@
README.md
atropos_compatible_agent.py
batch_runner.py
local_server.py
model_tools.py
pyproject.toml
run_agent.py
toolset_distributions.py
toolsets.py
trajectory_compressor.py
atropos/__init__.py
atropos/sandbox_server.py
atropos/agent/__init__.py
atropos/agent/atropos_agent.py
atropos/api/__init__.py
atropos/api/tool_executor_server.py
atropos/api/tool_server.py
atropos/backends/__init__.py
atropos/backends/base.py
atropos/backends/modal_backend.py
atropos/backends/nomad_backend.py
atropos/envs/__init__.py
atropos/envs/agent_env.py
atropos/envs/hermes_compat_test_env.py
atropos/envs/sandbox_terminal_smoke_env.py
atropos/envs/swe_smith_oracle_env.py
atropos/envs/test_env.py
atropos/envs/toolserver_smoke_env.py
atropos/nomad/__init__.py
atropos/nomad/client.py
atropos/slots/__init__.py
atropos/slots/executor.py
atropos/slots/pool.py
atropos/slots/slot.py
atropos/terminal/__init__.py
atropos/terminal/asciinema_stream.py
atropos/tools/__init__.py
atropos/tools/base.py
atropos/tools/build_registry.py
atropos/tools/hermes_external_tools.py
atropos/tools/sandbox_stubs.py
atropos/tools/terminal_stateful_tool.py
atropos/tools/tmux_tool.py
atropos/tools/tool_executor.py
atropos/tools/toolset_resolver.py
hermes_agent.egg-info/PKG-INFO
hermes_agent.egg-info/SOURCES.txt
hermes_agent.egg-info/dependency_links.txt
hermes_agent.egg-info/entry_points.txt
hermes_agent.egg-info/requires.txt
hermes_agent.egg-info/top_level.txt
tests/test_batch_runner.py
tests/test_checkpoint_resumption.py
tests/test_modal_integration.py
tests/test_modal_stress.py
tests/test_modal_terminal.py
tests/test_nous_api_limits.py
tests/test_nous_api_pattern.py
tests/test_temperature_fix.py
tests/test_tool_call_parsing.py
tests/test_web_tools.py
tools/__init__.py
tools/browser_tool.py
tools/image_generation_tool.py
tools/mixture_of_agents_tool.py
tools/skills_tool.py
tools/terminal_hecate.py
tools/terminal_tool.py
tools/vision_tools.py
tools/web_tools.py

View File

@@ -1 +0,0 @@

View File

@@ -1,4 +0,0 @@
[console_scripts]
hermes-agent = run_agent:main
hermes-atropos-sandbox-smoke = atropos.envs.sandbox_terminal_smoke_env:SandboxTerminalSmokeEnv.cli
hermes-atropos-toolserver-smoke = atropos.envs.toolserver_smoke_env:ToolServerSmokeEnv.cli

View File

@@ -1,31 +0,0 @@
openai
python-dotenv
fire
httpx
rich
tenacity
pyyaml
prompt_toolkit
requests
jinja2
pydantic>=2.0
firecrawl-py
fal-client
litellm>=1.75.5
typer
platformdirs
[atropos]
atroposlib @ git+https://github.com/NousResearch/atropos.git
aiohttp
fastapi
uvicorn
pyte
[dev]
pytest
pytest-asyncio
[modal]
modal
boto3

View File

@@ -1,10 +0,0 @@
atropos
atropos_compatible_agent
batch_runner
local_server
model_tools
run_agent
tools
toolset_distributions
toolsets
trajectory_compressor

View File

@@ -1,353 +0,0 @@
"""
Local OpenAI-compatible server implementation for Hermes-Agent (Atropos integration).
Extends the Atropos APIServer to work with local OpenAI-compatible APIs (e.g. vLLM, SGLang),
providing tokens_and_logprobs_completion support via client-side tokenization.
"""
import asyncio
import os
import warnings
from typing import Any, List, Optional
import openai
from openai.types.chat.chat_completion import ChatCompletion
from openai.types.completion import Completion
from atroposlib.envs.server_handling.server_baseline import (
APIServer,
APIServerConfig,
ReasoningConfig,
)
class LocalServer(APIServer):
"""
OpenAI-compatible local server with tokens_and_logprobs support.
Uses an OpenAI-compatible API (typically at a /v1 endpoint) and handles
token extraction via client-side tokenization.
Note: Many local servers don't return per-token logprobs in the standard API,
so this implementation uses placeholder logprobs (0.0) for PoC purposes.
For production training, use vLLM/SGLang servers that return real logprobs.
"""
def __init__(
self,
config: APIServerConfig,
tokenizer: Optional[Any] = None,
tokenizer_name: str = "gpt2",
reasoning_config: Optional[ReasoningConfig] = None,
):
"""
Initialize the local server.
Args:
config: Server configuration
tokenizer: Pre-initialized tokenizer (optional)
tokenizer_name: Name of tokenizer to load if tokenizer not provided
reasoning_config: Optional reasoning configuration
"""
# Build the OpenAI client pointing to the server's /v1 endpoint
base_url = config.base_url
if base_url and not base_url.endswith("/v1"):
base_url = f"{base_url.rstrip('/')}/v1"
self.openai = openai.AsyncClient(
api_key=config.api_key or "local", # Local servers often ignore auth
base_url=base_url,
timeout=config.timeout,
)
# Initialize tokenizer
if tokenizer is not None:
self.tokenizer = tokenizer
else:
try:
from transformers import AutoTokenizer # type: ignore
except ModuleNotFoundError as exc:
raise ModuleNotFoundError(
"Missing optional dependency 'transformers'. Pass a tokenizer instance to LocalServer, "
"or install transformers to enable `tokenizer_name` auto-loading."
) from exc
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
# Add a simple chat template if the tokenizer doesn't have one
# This is needed for ManagedServer's chat_completion to work
if not hasattr(self.tokenizer, 'chat_template') or self.tokenizer.chat_template is None:
# Simple ChatML-style template
self.tokenizer.chat_template = (
"{% for message in messages %}"
"{% if message['role'] == 'system' %}<|im_start|>system\n{{ message['content'] }}<|im_end|>\n"
"{% elif message['role'] == 'user' %}<|im_start|>user\n{{ message['content'] }}<|im_end|>\n"
"{% elif message['role'] == 'assistant' %}<|im_start|>assistant\n{{ message['content'] }}<|im_end|>\n"
"{% endif %}"
"{% endfor %}"
"{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)
super().__init__(config, reasoning_config=reasoning_config)
# Local servers are treated as always-healthy unless a status task is enabled.
self.server_healthy = True
@classmethod
def from_env(
cls,
base_url: Optional[str] = None,
model: Optional[str] = None,
api_key: Optional[str] = None,
tokenizer_name: str = "gpt2",
**kwargs,
) -> "LocalServer":
"""
Create a LocalServer from environment variables (or explicit overrides).
Env vars (checked in order):
- base URL: ATROPOS_SERVER_BASE_URL, OPENAI_BASE_URL, LOCAL_LLM_BASE_URL, LLM_BASE_URL
- model: ATROPOS_SERVER_MODEL, LLM_MODEL, LOCAL_LLM_MODEL
- api key: ATROPOS_SERVER_API_KEY, OPENAI_API_KEY, LOCAL_LLM_API_KEY, LLM_API_KEY
"""
from dotenv import load_dotenv
load_dotenv()
base_url = (
base_url
or os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LOCAL_LLM_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://localhost:11434"
)
model = (
model
or os.getenv("ATROPOS_SERVER_MODEL")
or os.getenv("LLM_MODEL")
or os.getenv("LOCAL_LLM_MODEL")
or "hermes3:8b"
)
api_key = (
api_key
or os.getenv("ATROPOS_SERVER_API_KEY")
or os.getenv("OPENAI_API_KEY")
or os.getenv("LOCAL_LLM_API_KEY")
or os.getenv("LLM_API_KEY")
)
config = APIServerConfig(
model_name=model,
base_url=base_url,
api_key=api_key or "local",
timeout=kwargs.get("timeout", 120),
num_max_requests_at_once=kwargs.get("num_max_requests_at_once", 4),
num_requests_for_eval=kwargs.get("num_requests_for_eval", 4),
health_check=False, # Local dev servers often lack /health
)
return cls(config, tokenizer_name=tokenizer_name)
async def check_server_status_task(self, chat_completion: bool = True):
"""
Check if the server is healthy.
For local development, we generally assume the server is healthy.
"""
while True:
try:
# Simple health check via a minimal completion
if chat_completion:
await self.openai.chat.completions.create(
model=self.config.model_name,
messages=[{"role": "user", "content": "hi"}],
max_tokens=1,
)
else:
await self.openai.completions.create(
model=self.config.model_name,
prompt="hi",
max_tokens=1,
)
self.server_healthy = True
except Exception:
self.server_healthy = False
await asyncio.sleep(5)
async def _chat_completion_wrapper(self, **kwargs) -> ChatCompletion:
"""
Wrapper for chat completion using an OpenAI-compatible API.
"""
assert kwargs.get("model") is not None, "Model is required!"
assert kwargs.get("messages") is not None, "Messages are required!"
n = kwargs.get("n", 1)
# Some OpenAI-compatible servers don't support n > 1, so we make multiple requests.
if n > 1:
completion_list = await asyncio.gather(
*[self.openai.chat.completions.create(**{**kwargs, "n": 1}) for _ in range(n)]
)
# Merge completions
completions = completion_list[0]
for c in completion_list[1:]:
for choice in c.choices:
choice.index = len(completions.choices)
completions.choices.append(choice)
return completions
else:
return await self.openai.chat.completions.create(**kwargs)
async def _completion_wrapper(self, **kwargs) -> Completion:
"""
Wrapper for completion using an OpenAI-compatible API.
"""
assert kwargs.get("model") is not None, "Model is required!"
assert kwargs.get("prompt") is not None, "Prompt is required!"
n = kwargs.get("n", 1)
# Some OpenAI-compatible servers don't support n > 1.
if n > 1:
completion_list = await asyncio.gather(
*[self.openai.completions.create(**{**kwargs, "n": 1}) for _ in range(n)]
)
completions = completion_list[0]
for c in completion_list[1:]:
for choice in c.choices:
choice.index = len(completions.choices)
completions.choices.append(choice)
return completions
else:
return await self.openai.completions.create(**kwargs)
async def _tokens_and_logprobs_completion_wrapper(
self, **kwargs
) -> tuple[List[int], List[List[int]], List[List[float]], List[str]]:
"""
Wrapper for tokens and logprobs completion.
Returns:
Tuple of (prompt_tokens, output_tokens_list, output_logprobs_list, finish_reasons)
Note: Many OpenAI-compatible local servers don't return per-token logprobs,
so we use placeholder logprobs (0.0). For real training, use vLLM/SGLang.
"""
model = kwargs.get("model")
assert model is not None, "Model is required!"
# Handle input_ids (from ManagedServer) or prompt
if "input_ids" in kwargs:
prompt_tokens = kwargs.pop("input_ids")
prompt = self.tokenizer.decode(prompt_tokens)
kwargs.pop("prompt", None)
else:
prompt = kwargs.pop("prompt", "")
prompt_tokens = self.tokenizer.encode(prompt, add_special_tokens=True)
n = kwargs.pop("n", 1)
max_tokens = kwargs.pop("max_tokens", 256)
temperature = kwargs.pop("temperature", 0.7)
stop = kwargs.pop("stop", None)
# Make completion requests
completions = []
for _ in range(n):
try:
response = await self.openai.completions.create(
model=model,
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
stop=stop,
)
completions.append(response)
except Exception as e:
# Fallback to chat completion if completion endpoint not supported
warnings.warn(f"Completion API failed, trying chat: {e}")
response = await self.openai.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
stop=stop,
)
# Append the chat response as-is; the extraction loop below handles both shapes
completions.append(response)
output_tokens_list = []
output_logprobs_list = []
finish_reasons = []
for completion in completions:
# Extract text from response
if hasattr(completion.choices[0], "text"):
# Completion API response
text = completion.choices[0].text
finish_reason = completion.choices[0].finish_reason or "stop"
else:
# Chat completion API response
text = completion.choices[0].message.content or ""
finish_reason = completion.choices[0].finish_reason or "stop"
# Tokenize output
output_tokens = self.tokenizer.encode(text, add_special_tokens=False)
# Placeholder logprobs (many local servers don't provide per-token logprobs).
# In production, use vLLM/SGLang which return real logprobs
output_logprobs = [0.0] * len(output_tokens)
output_tokens_list.append(output_tokens)
output_logprobs_list.append(output_logprobs)
finish_reasons.append(finish_reason)
return prompt_tokens, output_tokens_list, output_logprobs_list, finish_reasons
def managed_server(self, tokenizer=None, track_tree: bool = False):
"""
Create a ManagedServer context manager for this server.
Args:
tokenizer: Optional tokenizer override
track_tree: Whether to maintain tree structure for multi-turn
Returns:
ManagedServer context manager
"""
from atroposlib.envs.server_handling.managed_server import ManagedServer
return ManagedServerContext(
self,
tokenizer=tokenizer or self.tokenizer,
track_tree=track_tree,
)
class ManagedServerContext:
"""
Context manager wrapper for ManagedServer.
Usage:
async with server.managed_server(tokenizer=tokenizer) as managed:
response = await managed.chat_completion(...)
state = managed.get_state()
"""
def __init__(self, server: LocalServer, tokenizer, track_tree: bool = False):
self.server = server
self.tokenizer = tokenizer
self.track_tree = track_tree
self.managed = None
async def __aenter__(self):
from atroposlib.envs.server_handling.managed_server import ManagedServer
self.managed = ManagedServer(
self.server,
tokenizer=self.tokenizer,
track_tree=self.track_tree,
)
return self.managed
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.managed:
self.managed.reset()
return False

View File

@@ -1,62 +0,0 @@
# Active Context
## Current Focus
Singularity/Apptainer integration for HPC environments has been **COMPLETED AND TESTED**.
## Recently Completed (Feb 6, 2026)
### Singularity/Apptainer Sandbox Integration - FULLY WORKING
Successfully adapted the Atropos implementation from Docker to Singularity/Apptainer for HPC clusters where Docker cannot run without sudo permissions.
**Files Modified:**
1. `atropos/nomad/client.py` - Added `driver` and `singularity_image` parameters to `create_sandbox_job()`; Fixed port detection to check both `DynamicPorts` and `ReservedPorts` in `get_job_allocations()`
2. `atropos/slots/pool.py` - Added `driver` and `singularity_image` to `SlotPoolConfig`
3. `atropos/backends/nomad_backend.py` - Added driver options to `NomadBackendConfig`
4. `atropos/envs/agent_env.py` - Added CLI arguments `--env.driver` and `--env.singularity_image` to `AgentEnvConfig`
**Files Created:**
1. `nomad-singularity.hcl` - Nomad config with raw_exec driver enabled
2. `atropos/atropos-sandbox.sif` - Singularity image (80MB) built from Docker image
3. `test_singularity_job.py` - Test script for Singularity integration
**Key Implementation Details:**
- Uses Nomad's `raw_exec` driver to run `apptainer` commands
- Shell wrapper (`/bin/sh -c`) ensures Nomad environment variables expand correctly
- Binds Nomad allocation directory to `/data` for workspace persistence
- Uses **static ports** (`ReservedPorts`) instead of dynamic ports since raw_exec runs directly on host
- `get_job_allocations()` now checks both `DynamicPorts` (Docker) and `ReservedPorts` (Singularity)
**Test Results (All Passing):**
- Health check: ✅ Server responding with 5 slots
- Bash execution: ✅ Commands execute inside Singularity container
- Write file: ✅ File written to slot workspace
- Read file: ✅ File read back successfully
## Usage
### For Docker (default):
```python
config = SlotPoolConfig(
driver="docker",
image="atropos-sandbox:local",
)
```
### For Singularity/Apptainer:
```python
config = SlotPoolConfig(
driver="singularity",
singularity_image="/path/to/atropos-sandbox.sif",
)
```
### Nomad Configuration:
```bash
# Start Nomad with Singularity support
nomad agent -dev -config=nomad-singularity.hcl
```
## Next Steps
- Deploy to HPC cluster for production testing
- Consider adding bubblewrap (bwrap) support inside Singularity for additional sandboxing
- Document HPC-specific deployment procedures in skills/mlops/

View File

@@ -1,55 +0,0 @@
# Product Context: Hermes-Agent
## Why This Project Exists
Hermes-Agent addresses several key challenges in the AI agent space:
1. **Unified Tool Interface** - Provides a clean, consistent interface for LLMs to use various tools (web, terminal, browser, vision, etc.) without requiring custom integration for each model provider.
2. **Training Data Generation** - Enables efficient generation of high-quality tool-calling trajectories for fine-tuning LLMs, with features like batch processing, checkpointing, and trajectory compression.
3. **Flexible Deployment** - Supports multiple execution environments (local, Docker, Singularity, Modal, SSH) to accommodate different security and isolation requirements.
4. **Developer Experience** - Offers a beautiful, interactive CLI with kawaii-style feedback that makes working with AI agents enjoyable.
## Problems It Solves
### For AI Researchers
- **Data Generation at Scale**: Parallel batch processing with content-based checkpointing for fault tolerance
- **Clean Trajectories**: Trajectory compression to fit token budgets while preserving important information
- **Toolset Distributions**: Probability-based tool selection for varied training data
### For Developers
- **Tool Orchestration**: Logical grouping of tools into toolsets (research, development, debugging, etc.)
- **Session Persistence**: Conversation history and session logging for debugging
- **Multi-Model Support**: Works with any OpenAI-compatible API (OpenRouter, local models, etc.)
### For MLOps
- **Skills System**: On-demand knowledge documents for specific tools/frameworks (Axolotl, vLLM, TRL, etc.)
- **Sandboxed Execution**: Terminal commands can run in isolated environments (Docker, Singularity, Modal)
- **Configurable Backends**: Easy switching between local and cloud execution
## How It Should Work
### User Flow (CLI)
1. User launches `./hermes`
2. Beautiful welcome banner displays with caduceus logo, model info, and available tools
3. User types a natural language request
4. Agent processes request, potentially calling tools with animated feedback
5. Agent responds with results, conversation continues
6. Session is automatically logged for debugging
### User Flow (Batch Processing)
1. User prepares JSONL file with prompts
2. Runs `batch_runner.py` with distribution and worker count
3. System processes prompts in parallel, saves checkpoints
4. Completed trajectories saved to `data/<run_name>/trajectories.jsonl`
5. Optional: compress trajectories with `trajectory_compressor.py`
## User Experience Goals
- **Delightful Interaction**: Kawaii ASCII faces, animated spinners, cute messages
- **Informative Feedback**: Clear progress indication during tool execution
- **Configurable Personalities**: From "helpful" to "pirate" to "Shakespeare"
- **Easy Configuration**: YAML config file + environment variables + CLI flags
- **Graceful Degradation**: Missing tools/APIs don't break the system, just disable features

View File

@@ -1,67 +0,0 @@
# Progress
## Completed Features
### ✅ Singularity/Apptainer Sandbox Integration (Feb 6, 2026 - FULLY TESTED)
Adapted the Atropos sandbox environment from Docker to Singularity/Apptainer for HPC clusters.
**What Works:**
- `create_sandbox_job()` supports both `driver="docker"` and `driver="singularity"`
- SlotPoolConfig and NomadBackendConfig propagate driver settings
- Singularity container runs sandbox_server.py via Nomad's raw_exec driver
- All sandbox operations work: bash execution, file read/write
- Nomad environment variables properly expanded via shell wrapper
- **CLI arguments** `--env.driver` and `--env.singularity_image` for AgentEnvConfig
- **Static port binding** for Singularity (ReservedPorts vs DynamicPorts)
- **Port detection** works for both Docker and Singularity allocations
**CLI Usage:**
```bash
python -m atropos.envs.swe_smith_oracle_env process \
--env.driver singularity \
--env.singularity_image /path/to/atropos-sandbox.sif
```
**Created Files:**
- `nomad-singularity.hcl` - Nomad config with raw_exec enabled
- `atropos/atropos-sandbox.sif` - 80MB Singularity image
- `test_singularity_job.py` - Integration test script
**Modified Files:**
- `atropos/nomad/client.py` - driver support + ReservedPorts detection
- `atropos/slots/pool.py` - driver config fields
- `atropos/backends/nomad_backend.py` - driver config fields
- `atropos/envs/agent_env.py` - CLI arguments for driver selection
### ✅ Memory Bank Initialized (Feb 5, 2026)
Set up project documentation structure for context persistence.
## In Progress
None currently.
## Known Issues
- `bwrap_available: false` in Singularity containers - bubblewrap sandboxing not available inside the container (kernel namespaces already in use)
- Health check timing - may need longer wait for container startup on slower systems
## What's Left to Build
### HPC Deployment
- [ ] Test on actual HPC cluster with Slurm/PBS integration
- [ ] Document cluster-specific deployment procedures
- [ ] Add support for shared filesystem workspace binding
### Enhanced Sandboxing
- [ ] Investigate alternative sandboxing inside Singularity (seccomp, etc.)
- [ ] Add network isolation options for Singularity
### Documentation
- [ ] Add Singularity deployment to README
- [ ] Create HPC deployment skill in skills/mlops/
## Evolution of Decisions
### Container Runtime Selection
- **Initial**: Docker-only via Nomad docker driver
- **Problem**: HPC clusters don't allow Docker without sudo
- **Solution**: Added Singularity/Apptainer support via raw_exec driver
- **Result**: Both runtimes now supported with same API

View File

@@ -1,44 +0,0 @@
# Project Brief: Hermes-Agent
## Overview
Hermes-Agent is an AI agent harness for LLMs with advanced tool-calling capabilities, featuring a flexible toolsets system for organizing and managing tools. Named after Hermes, the Greek messenger god, it serves as a bridge between human intent and AI-powered task execution.
## Core Requirements
### Primary Goals
1. **Interactive CLI Experience** - Beautiful terminal interface with animated feedback, personalities, and session management
2. **Flexible Tool System** - Modular tools organized into logical toolsets for different use cases
3. **Batch Processing** - Process multiple prompts in parallel with checkpointing and statistics
4. **Multi-Backend Support** - Support for local, Docker, Singularity, Modal, and SSH terminal backends
5. **Training Data Generation** - Save conversation trajectories in formats suitable for LLM fine-tuning
### Target Users
- AI researchers generating training data
- Developers needing an AI assistant with tool access
- MLOps practitioners automating workflows
- Anyone needing a powerful CLI-based AI agent
## Scope
### In Scope
- Interactive CLI with rich formatting and kawaii-style feedback
- Web tools (search, extract, crawl via Firecrawl)
- Terminal tools (command execution across multiple backends)
- Browser automation (via agent-browser + Browserbase)
- Vision tools (image analysis)
- Image generation (FLUX via FAL.ai)
- Mixture-of-Agents reasoning
- Skills system for on-demand knowledge
- Batch processing with parallel workers
- Trajectory compression for training
### Out of Scope (Current)
- Proactive suggestions (agent only runs on request)
- Clipboard integration (no local system access)
- Real-time streaming of thinking/reasoning (deferred)
## Success Metrics
- Clean, maintainable tool architecture
- Reliable tool execution with proper error handling
- Efficient context management for long conversations
- High-quality trajectory data for training

View File

@@ -1,149 +0,0 @@
# System Patterns: Hermes-Agent
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                          CLI (cli.py)                           │
│  - Rich welcome banner with caduceus                            │
│  - prompt_toolkit for input with history                        │
│  - Kawaii-style feedback and personalities                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                     AIAgent (run_agent.py)                      │
│  - Conversation loop with tool calling                          │
│  - KawaiiSpinner for animated feedback                          │
│  - Retry logic with exponential backoff                         │
│  - Session logging to logs/ directory                           │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Tool Routing (model_tools.py)                  │
│  - get_tool_definitions() - returns tools for API calls         │
│  - handle_function_call() - dispatches to tool handlers         │
│  - Toolset filtering (enabled/disabled)                         │
└────────────────────────────┬────────────────────────────────────┘
                             │
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
     ┌───────────┐     ┌───────────┐     ┌───────────┐
     │ Web Tools │     │ Terminal  │     │  Browser  │
     │(Firecrawl)│     │ (mini-swe)│     │(agent-brw)│
     └───────────┘     └───────────┘     └───────────┘
           │                 │                 │
           └─────────────────┼─────────────────┘
                             │
                             ▼
                     ┌───────────────┐
                     │   Toolsets    │
                     │ (toolsets.py) │
                     │  Composition  │
                     └───────────────┘
```
## Key Design Patterns
### 1. Toolset Composition Pattern
Toolsets can include other toolsets, allowing flexible composition:
```python
TOOLSETS = {
"web": {"tools": ["web_search", "web_extract"], "includes": []},
"debugging": {"tools": ["terminal"], "includes": ["web"]},
"full_stack": {"tools": [], "includes": ["web", "terminal", "vision", "browser"]}
}
```
Resolution is recursive with cycle detection.
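A minimal sketch of that resolution, assuming the `TOOLSETS` shape above (not the exact toolsets.py implementation):
```python
def resolve_toolset(name: str, seen: set[str] | None = None) -> set[str]:
    """Recursively collect tools, guarding against include cycles."""
    seen = seen if seen is not None else set()
    if name in seen:
        return set()  # cycle detected: stop recursing
    seen.add(name)
    spec = TOOLSETS[name]
    tools = set(spec["tools"])
    for included in spec["includes"]:
        tools |= resolve_toolset(included, seen)
    return tools
```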
### 2. Graceful Degradation Pattern
Each tool module has a `check_*_requirements()` function:
- Tools are only loaded if requirements are met
- Missing API keys disable tools, not crash the system
- Import errors are caught and tools marked unavailable
```python
try:
from tools.web_tools import web_search_tool, check_firecrawl_api_key
except ModuleNotFoundError:
web_search_tool = None
def check_firecrawl_api_key(): return False
```
### 3. Session Isolation Pattern (task_id)
Stateful tools (terminal, browser) use `task_id` to isolate concurrent sessions:
- Each batch worker gets unique task_id
- VMs and browser sessions are tracked per task_id
- Cleanup functions release resources: `cleanup_vm(task_id)`, `cleanup_browser(task_id)`
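In sketch form, the isolation amounts to a per-task registry (illustrative names, not the real terminal/browser tool internals):
```python
# Illustrative per-task session registry; names are hypothetical.
_sessions: dict[str, object] = {}

def get_session(task_id: str) -> object:
    """Create or reuse the stateful backend handle for this task."""
    if task_id not in _sessions:
        _sessions[task_id] = object()  # stand-in for a VM/browser handle
    return _sessions[task_id]

def cleanup(task_id: str) -> None:
    """Release the resources owned by one batch worker."""
    _sessions.pop(task_id, None)
```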
### 4. Trajectory Format Pattern
Conversations are saved in ShareGPT format for training:
```json
{"from": "system", "value": "System prompt with <tools>...</tools>"}
{"from": "human", "value": "User message"}
{"from": "gpt", "value": "<think>reasoning</think>\n<tool_call>{...}</tool_call>"}
{"from": "tool", "value": "<tool_response>{...}</tool_response>"}
{"from": "gpt", "value": "Final response"}
```
### 5. Ephemeral System Prompt Pattern
Guide model behavior during data collection without saving to trajectories:
- `ephemeral_system_prompt` influences execution
- Only standard tool-calling system prompt saved to trajectories
- Keeps training data clean
### 6. Retry with Validation Pattern
The agent validates responses before accepting:
- Check tool names against `valid_tool_names` set
- Validate JSON arguments can be parsed
- Check for content after `<think>` blocks
- Roll back to last valid state on persistent failures
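Condensed into code, the first two checks look roughly like this (hypothetical helper; `valid_tool_names` is assumed to be a set of registered tool names):
```python
import json

def tool_calls_valid(tool_calls: list[dict], valid_tool_names: set[str]) -> bool:
    """True only if every call names a known tool with parseable JSON args."""
    for call in tool_calls:
        if call["name"] not in valid_tool_names:
            return False
        try:
            json.loads(call["arguments"])
        except (json.JSONDecodeError, TypeError):
            return False
    return True
```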
## Component Relationships
### AIAgent Class
- Central orchestrator for conversations
- Manages conversation history
- Calls OpenAI-compatible API
- Routes tool calls to handlers
- Provides animated feedback (KawaiiSpinner)
### Tool Modules (tools/*.py)
- Self-contained tool implementations
- Export: handler function + check function + schema
- Return JSON strings (never raw dicts)
- Accept optional `task_id` for stateful tools
### Toolsets System (toolsets.py)
- Defines logical groupings of tools
- Supports composition via `includes`
- `resolve_toolset()` recursively resolves all tools
- `validate_toolset()` checks if name is valid
### Model Tools (model_tools.py)
- Aggregates all tool definitions
- Routes function calls to correct handlers
- Filters tools based on enabled/disabled toolsets
- Bridge between agent and tool implementations
## Critical Implementation Paths
### Tool Execution Flow
1. AIAgent receives tool_calls from API response
2. Validates tool names against `valid_tool_names`
3. Validates JSON arguments can be parsed
4. Calls `handle_function_call()` with tool name, args, task_id
5. `handle_function_call()` routes to appropriate handler
6. Tool executes, returns JSON string
7. Result added to conversation as tool message
8. Loop continues until natural language response
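Steps 4-7, sketched with a stub standing in for the real model_tools.py dispatcher:
```python
import json

def handle_function_call(name: str, args: dict, task_id: str | None = None) -> str:
    """Stub standing in for the real model_tools.py dispatcher."""
    return json.dumps({"tool": name, "ok": True})

def execute_tool_calls(messages: list[dict], tool_calls: list[dict], task_id: str) -> None:
    # Each result re-enters the conversation as a tool message; the outer
    # loop continues until the model answers in natural language.
    for call in tool_calls:
        args = json.loads(call["arguments"])
        result = handle_function_call(call["name"], args, task_id)
        messages.append({"role": "tool", "content": result})
```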
### Configuration Loading Flow
1. `cli.py` calls `load_cli_config()`
2. Loads `cli-config.yaml`, merges with defaults
3. Sets environment variables for terminal config
4. `AIAgent` reads env vars when initializing terminal tool
5. Terminal tool creates appropriate backend based on `TERMINAL_ENV`
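Sketched in code (illustrative defaults; the real `load_cli_config` handles many more keys):
```python
import os

import yaml

DEFAULTS = {"terminal": {"env_type": "local", "cwd": "."}}

def load_cli_config(path: str = "cli-config.yaml") -> dict:
    """Merge the YAML config over defaults and export terminal settings."""
    config = dict(DEFAULTS)
    try:
        with open(path) as f:
            config.update(yaml.safe_load(f) or {})
    except FileNotFoundError:
        pass  # no config file: fall back to defaults
    terminal = config.get("terminal", {})
    # Terminal settings travel to the tool layer via environment variables.
    os.environ.setdefault("TERMINAL_ENV", terminal.get("env_type", "local"))
    return config
```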

View File

@@ -1,113 +0,0 @@
# Technical Context: Hermes-Agent
## Technologies Used
### Core Stack
- **Python 3.11+** - Primary language
- **OpenAI SDK** - For LLM API interactions (OpenAI-compatible)
- **OpenRouter** - Default LLM provider (supports multiple models)
- **Rich** - Terminal formatting and panels
- **prompt_toolkit** - Interactive input with history
- **Fire** - CLI argument parsing
- **PyYAML** - Configuration files
- **python-dotenv** - Environment variable management
### Tool Dependencies
- **Firecrawl** - Web search and extraction (`FIRECRAWL_API_KEY`)
- **mini-swe-agent** - Terminal tool backend (local/docker/singularity/modal/ssh)
- **agent-browser** - Browser automation (npm package)
- **Browserbase** - Cloud browser execution (`BROWSERBASE_API_KEY`)
- **FAL.ai** - Image generation with FLUX (`FAL_KEY`)
- **Nous API** - Vision and MoA tools (`NOUS_API_KEY`)
### Optional Dependencies
- **Modal** - Cloud compute for sandboxed environments
- **Singularity/Apptainer** - Rootless containers (HPC environments)
- **Docker** - Container isolation
## Development Setup
### Quick Start
```bash
# Clone with submodules
git clone --recurse-submodules https://github.com/NousResearch/Hermes-Agent.git
cd Hermes-Agent
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
pip install -e ./mini-swe-agent
# Install browser tools (optional)
npm install
# Configure environment
cp .env.example .env
# Edit .env with your API keys
```
### Key Configuration Files
- `.env` - API keys and secrets
- `cli-config.yaml` - CLI configuration (model, terminal, toolsets, personalities)
- `configs/` - Batch run scripts and configuration
### Environment Variables
**Required for Full Functionality:**
- `OPENROUTER_API_KEY` - Primary LLM access
- `FIRECRAWL_API_KEY` - Web tools
- `NOUS_API_KEY` - Vision and reasoning tools
- `FAL_KEY` - Image generation
**Terminal Backend:**
- `TERMINAL_ENV` - Backend type: `local`, `docker`, `singularity`, `modal`, `ssh`
- `TERMINAL_CWD` - Working directory
- `TERMINAL_DOCKER_IMAGE` / `TERMINAL_SINGULARITY_IMAGE` - Container images
- `TERMINAL_SSH_HOST/USER/KEY` - SSH backend config
- `SUDO_PASSWORD` - Optional sudo support
**Browser:**
- `BROWSERBASE_API_KEY` - Browser automation
- `BROWSERBASE_PROJECT_ID` - Browserbase project
## Technical Constraints
1. **Context Window Limits** - Long tool outputs can exhaust context; trajectory compression helps
2. **API Rate Limits** - OpenRouter and tool APIs have rate limits; exponential backoff implemented
3. **Tool Availability** - Tools gracefully degrade if dependencies/keys missing
4. **Async Compatibility** - Some tools are async, handled via `asyncio.run()` in sync contexts (see the sketch below)
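A minimal sketch of that bridging, with a stand-in coroutine in place of a real async tool:
```python
import asyncio

async def fetch_tool_result() -> str:
    # Stand-in for an async tool coroutine (e.g. web extraction).
    await asyncio.sleep(0)
    return "{}"

def sync_handler() -> str:
    """Bridge an async tool into a synchronous call site."""
    return asyncio.run(fetch_tool_result())
```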
## Dependency Graph
```
tools/*.py → tools/__init__.py → model_tools.py → toolsets.py → toolset_distributions.py
run_agent.py ──────────────────────────┘
cli.py → run_agent.py (uses AIAgent with quiet_mode=True)
batch_runner.py → run_agent.py + toolset_distributions.py
```
## Tool Usage Patterns
### Adding a New Tool
1. Create `tools/your_tool.py` with handler + requirements check
2. Export in `tools/__init__.py`
3. Register in `model_tools.py` (definitions + handler routing)
4. Add to toolset in `toolsets.py`
5. Optionally add to `toolset_distributions.py` for batch processing
### Tool Handler Pattern
```python
import json
from typing import Optional

def your_tool(param: str, task_id: Optional[str] = None) -> str:
    """Execute the tool and return a JSON string result."""
    try:
        result = {"success": True, "data": "..."}
        return json.dumps(result, ensure_ascii=False)
    except Exception as e:
        return json.dumps({"error": str(e)}, ensure_ascii=False)
```
All tool handlers MUST return a JSON string, never raw dicts.

Submodule mini-swe-agent deleted from 07aa6a7385

View File

@@ -1,708 +0,0 @@
#!/usr/bin/env python3
"""
Mini-SWE-Agent Runner with Hermes Trajectory Format
This module provides a runner that uses mini-swe-agent's execution environments
(local, docker, modal) but outputs trajectories in the Hermes-Agent format
compatible with batch_runner.py and trajectory_compressor.py.
Features:
- Uses mini-swe-agent's Docker, Modal, or Local environments for command execution
- Outputs trajectories in Hermes format (from/value pairs with <tool_call>/<tool_response> XML)
- Compatible with the trajectory compression pipeline
- Supports batch processing from JSONL prompt files
Usage:
# Run a single task with local environment
python mini_swe_runner.py --task "Create a hello world Python script" --env local
# Run with Docker
python mini_swe_runner.py --task "List files in /tmp" --env docker --image python:3.11-slim
# Run with Modal (cloud)
python mini_swe_runner.py --task "Install numpy and test it" --env modal --image python:3.11-slim
# Batch mode from JSONL file
python mini_swe_runner.py --prompts_file prompts.jsonl --output_file trajectories.jsonl --env docker
"""
import json
import logging
import os
import sys
import time
import uuid
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Any, Optional, Literal
import fire
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Add mini-swe-agent to path if not installed
mini_swe_path = Path(__file__).parent / "mini-swe-agent" / "src"
if mini_swe_path.exists():
sys.path.insert(0, str(mini_swe_path))
# ============================================================================
# Terminal Tool Definition (matches Hermes-Agent format)
# ============================================================================
TERMINAL_TOOL_DEFINITION = {
"type": "function",
"function": {
"name": "terminal",
"description": """Execute bash commands in a sandboxed environment.
**Environment:**
- Isolated execution environment (local, Docker, or Modal cloud)
- Filesystem persists between tool calls within the same task
- Internet access available
**Command Execution:**
- Provide the command to execute via the 'command' parameter
- Optional 'timeout' parameter in seconds (default: 60)
**Examples:**
- Run command: `{"command": "ls -la"}`
- With timeout: `{"command": "long_task.sh", "timeout": 300}`
**Best Practices:**
- Use non-interactive commands (avoid vim, nano, interactive python)
- Pipe to cat if output might be large
- Install tools with apt-get or pip as needed
**Completion:**
- When task is complete, output: echo "MINI_SWE_AGENT_FINAL_OUTPUT" followed by your result
""",
"parameters": {
"type": "object",
"properties": {
"command": {
"type": "string",
"description": "The bash command to execute"
},
"timeout": {
"type": "integer",
"description": "Command timeout in seconds (default: 60)"
}
},
"required": ["command"]
}
}
}
# ============================================================================
# Environment Factory
# ============================================================================
def create_environment(
env_type: str = "local",
image: str = "python:3.11-slim",
cwd: str = "/tmp",
timeout: int = 60,
**kwargs
):
"""
Create an execution environment from mini-swe-agent.
Args:
env_type: One of "local", "docker", "modal"
image: Docker/Modal image name (ignored for local)
cwd: Working directory
timeout: Default command timeout
**kwargs: Additional environment-specific options
Returns:
Environment instance with execute() method
"""
if env_type == "local":
from minisweagent.environments.local import LocalEnvironment
return LocalEnvironment(cwd=cwd, timeout=timeout)
elif env_type == "docker":
from minisweagent.environments.docker import DockerEnvironment
return DockerEnvironment(image=image, cwd=cwd, timeout=timeout, **kwargs)
elif env_type == "modal":
from minisweagent.environments.extra.swerex_modal import SwerexModalEnvironment
return SwerexModalEnvironment(image=image, cwd=cwd, timeout=timeout, **kwargs)
else:
raise ValueError(f"Unknown environment type: {env_type}. Use 'local', 'docker', or 'modal'")
# ============================================================================
# Mini-SWE Runner with Hermes Trajectory Format
# ============================================================================
class MiniSWERunner:
"""
Agent runner that uses mini-swe-agent environments but outputs
trajectories in Hermes-Agent format.
"""
def __init__(
self,
model: str = "anthropic/claude-sonnet-4-20250514",
base_url: str = None,
api_key: str = None,
env_type: str = "local",
image: str = "python:3.11-slim",
cwd: str = "/tmp",
max_iterations: int = 15,
command_timeout: int = 60,
verbose: bool = False,
):
"""
Initialize the Mini-SWE Runner.
Args:
model: Model name for OpenAI-compatible API
base_url: API base URL (optional, uses env vars if not provided)
api_key: API key (optional, uses env vars if not provided)
env_type: Environment type - "local", "docker", or "modal"
image: Docker/Modal image (ignored for local)
cwd: Working directory for commands
max_iterations: Maximum tool-calling iterations
command_timeout: Default timeout for commands
verbose: Enable verbose logging
"""
self.model = model
self.max_iterations = max_iterations
self.command_timeout = command_timeout
self.verbose = verbose
self.env_type = env_type
self.image = image
self.cwd = cwd
# Setup logging
logging.basicConfig(
level=logging.DEBUG if verbose else logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s',
datefmt='%H:%M:%S'
)
self.logger = logging.getLogger(__name__)
# Initialize OpenAI client - defaults to OpenRouter
from openai import OpenAI
client_kwargs = {}
# Default to OpenRouter if no base_url provided
if base_url:
client_kwargs["base_url"] = base_url
else:
client_kwargs["base_url"] = "https://openrouter.ai/api/v1"
# Handle API key - OpenRouter is the primary provider
if api_key:
client_kwargs["api_key"] = api_key
else:
client_kwargs["api_key"] = os.getenv(
"OPENROUTER_API_KEY",
os.getenv("ANTHROPIC_API_KEY", os.getenv("OPENAI_API_KEY", ""))
)
self.client = OpenAI(**client_kwargs)
# Environment will be created per-task
self.env = None
# Tool definition
self.tools = [TERMINAL_TOOL_DEFINITION]
print(f"🤖 Mini-SWE Runner initialized")
print(f" Model: {self.model}")
print(f" Environment: {self.env_type}")
if self.env_type != "local":
print(f" Image: {self.image}")
print(f" Max iterations: {self.max_iterations}")
def _create_env(self):
"""Create the execution environment."""
print(f"🔧 Creating {self.env_type} environment...")
self.env = create_environment(
env_type=self.env_type,
image=self.image,
cwd=self.cwd,
timeout=self.command_timeout
)
print(f"✅ Environment ready")
def _cleanup_env(self):
"""Cleanup the execution environment."""
if self.env is not None:
if hasattr(self.env, 'cleanup'):
self.env.cleanup()
elif hasattr(self.env, 'stop'):
self.env.stop()
self.env = None
def _execute_command(self, command: str, timeout: int = None) -> Dict[str, Any]:
"""
Execute a command in the environment.
Args:
command: Bash command to execute
timeout: Optional timeout override
Returns:
Dict with 'output' and 'returncode'
"""
if self.env is None:
self._create_env()
try:
result = self.env.execute(command, timeout=timeout or self.command_timeout)
return {
"output": result.get("output", ""),
"exit_code": result.get("returncode", 0),
"error": None
}
except Exception as e:
return {
"output": "",
"exit_code": -1,
"error": str(e)
}
def _format_tools_for_system_message(self) -> str:
"""Format tool definitions for the system message."""
formatted_tools = []
for tool in self.tools:
func = tool["function"]
formatted_tools.append({
"name": func["name"],
"description": func.get("description", ""),
"parameters": func.get("parameters", {}),
"required": None
})
return json.dumps(formatted_tools, ensure_ascii=False)
def _convert_to_hermes_format(
self,
messages: List[Dict[str, Any]],
user_query: str,
completed: bool
) -> List[Dict[str, Any]]:
"""
Convert internal message format to Hermes trajectory format.
This produces the exact format used by batch_runner.py.
"""
trajectory = []
# System message with tool definitions
system_msg = (
"You are a function calling AI model. You are provided with function signatures within <tools> </tools> XML tags. "
"You may call one or more functions to assist with the user query. If available tools are not relevant in assisting "
"with user query, just respond in natural conversational language. Don't make assumptions about what values to plug "
"into functions. After calling & executing the functions, you will be provided with function results within "
"<tool_response> </tool_response> XML tags. Here are the available tools:\n"
f"<tools>\n{self._format_tools_for_system_message()}\n</tools>\n"
"For each function call return a JSON object, with the following pydantic model json schema for each:\n"
"{'title': 'FunctionCall', 'type': 'object', 'properties': {'name': {'title': 'Name', 'type': 'string'}, "
"'arguments': {'title': 'Arguments', 'type': 'object'}}, 'required': ['name', 'arguments']}\n"
"Each function call should be enclosed within <tool_call> </tool_call> XML tags.\n"
"Example:\n<tool_call>\n{'name': <function-name>,'arguments': <args-dict>}\n</tool_call>"
)
trajectory.append({"from": "system", "value": system_msg})
trajectory.append({"from": "human", "value": user_query})
# Process messages (skip first user message as we already added it)
i = 1
while i < len(messages):
msg = messages[i]
if msg["role"] == "assistant":
if "tool_calls" in msg and msg["tool_calls"]:
# Assistant message with tool calls
content = ""
# Add reasoning if present
if msg.get("reasoning"):
content = f"<think>{msg['reasoning']}</think>"
if msg.get("content"):
content += msg["content"] + "\n"
# Add tool calls in XML format
for tool_call in msg["tool_calls"]:
try:
arguments = json.loads(tool_call["function"]["arguments"]) \
if isinstance(tool_call["function"]["arguments"], str) \
else tool_call["function"]["arguments"]
except json.JSONDecodeError:
arguments = {}
tool_call_json = {
"name": tool_call["function"]["name"],
"arguments": arguments
}
content += f"<tool_call>\n{json.dumps(tool_call_json, ensure_ascii=False)}\n</tool_call>\n"
trajectory.append({"from": "gpt", "value": content.rstrip()})
# Collect subsequent tool responses
tool_responses = []
j = i + 1
while j < len(messages) and messages[j]["role"] == "tool":
tool_msg = messages[j]
tool_content = tool_msg["content"]
# Try to parse as JSON
try:
if tool_content.strip().startswith(("{", "[")):
tool_content = json.loads(tool_content)
except (json.JSONDecodeError, AttributeError):
pass
tool_response = f"<tool_response>\n"
tool_response += json.dumps({
"tool_call_id": tool_msg.get("tool_call_id", ""),
"name": msg["tool_calls"][len(tool_responses)]["function"]["name"] \
if len(tool_responses) < len(msg["tool_calls"]) else "unknown",
"content": tool_content
}, ensure_ascii=False)
tool_response += "\n</tool_response>"
tool_responses.append(tool_response)
j += 1
if tool_responses:
trajectory.append({"from": "tool", "value": "\n".join(tool_responses)})
i = j - 1
else:
# Regular assistant message (no tool calls)
content = ""
if msg.get("reasoning"):
content = f"<think>{msg['reasoning']}</think>"
content += msg.get("content") or ""
trajectory.append({"from": "gpt", "value": content})
elif msg["role"] == "user":
trajectory.append({"from": "human", "value": msg["content"]})
i += 1
return trajectory
def run_task(self, task: str) -> Dict[str, Any]:
"""
Run a single task and return the result with trajectory.
Args:
task: The task/prompt to execute
Returns:
Dict with trajectory, completion status, and metadata
"""
print(f"\n{'='*60}")
print(f"📝 Task: {task[:80]}{'...' if len(task) > 80 else ''}")
print(f"{'='*60}")
# Initialize environment
self._create_env()
# Message history
messages = [{"role": "user", "content": task}]
# System prompt for the LLM (ephemeral - not saved to trajectory)
system_prompt = """You are an AI agent that can execute bash commands to complete tasks.
When you need to run commands, use the 'terminal' tool with your bash command.
**Important:**
- When you have completed the task successfully, run: echo "MINI_SWE_AGENT_FINAL_OUTPUT" followed by a summary
- Be concise and efficient in your approach
- Install any needed tools with apt-get or pip
- Avoid interactive commands (no vim, nano, less, etc.)
Complete the user's task step by step."""
api_call_count = 0
completed = False
final_response = None
try:
while api_call_count < self.max_iterations:
api_call_count += 1
print(f"\n🔄 API call #{api_call_count}/{self.max_iterations}")
# Prepare API messages
api_messages = [{"role": "system", "content": system_prompt}] + messages
# Make API call
try:
response = self.client.chat.completions.create(
model=self.model,
messages=api_messages,
tools=self.tools,
timeout=300.0
)
except Exception as e:
self.logger.error(f"API call failed: {e}")
break
assistant_message = response.choices[0].message
# Log assistant response
if assistant_message.content:
print(f"🤖 Assistant: {assistant_message.content[:100]}...")
# Check for tool calls
if assistant_message.tool_calls:
print(f"🔧 Tool calls: {len(assistant_message.tool_calls)}")
# Add assistant message with tool calls
messages.append({
"role": "assistant",
"content": assistant_message.content,
"tool_calls": [
{
"id": tc.id,
"type": tc.type,
"function": {
"name": tc.function.name,
"arguments": tc.function.arguments
}
}
for tc in assistant_message.tool_calls
]
})
# Execute each tool call
for tc in assistant_message.tool_calls:
try:
args = json.loads(tc.function.arguments)
except json.JSONDecodeError:
args = {}
command = args.get("command", "echo 'No command provided'")
timeout = args.get("timeout", self.command_timeout)
print(f" 📞 terminal: {command[:60]}...")
# Execute command
result = self._execute_command(command, timeout)
# Format result
result_json = json.dumps({
"content": {
"output": result["output"],
"exit_code": result["exit_code"],
"error": result["error"]
}
}, ensure_ascii=False)
# Check for task completion signal
if "MINI_SWE_AGENT_FINAL_OUTPUT" in result["output"]:
print(f" ✅ Task completion signal detected!")
completed = True
# Add tool response
messages.append({
"role": "tool",
"content": result_json,
"tool_call_id": tc.id
})
print(f" ✅ exit_code={result['exit_code']}, output={len(result['output'])} chars")
# If task completed, we can stop
if completed:
final_response = assistant_message.content
break
else:
# No tool calls - final response
final_response = assistant_message.content or ""
messages.append({
"role": "assistant",
"content": final_response
})
completed = True
print(f"🎉 Agent finished (no more tool calls)")
break
if api_call_count >= self.max_iterations:
print(f"⚠️ Reached max iterations ({self.max_iterations})")
finally:
# Cleanup environment
self._cleanup_env()
# Convert to Hermes trajectory format
trajectory = self._convert_to_hermes_format(messages, task, completed)
return {
"conversations": trajectory,
"completed": completed,
"api_calls": api_call_count,
"metadata": {
"model": self.model,
"env_type": self.env_type,
"timestamp": datetime.now().isoformat()
}
}
def run_batch(
self,
prompts: List[str],
output_file: str
) -> List[Dict[str, Any]]:
"""
Run multiple tasks and save trajectories to a JSONL file.
Args:
prompts: List of task prompts
output_file: Output JSONL file path
Returns:
List of results
"""
results = []
print(f"\n📦 Running batch of {len(prompts)} tasks")
print(f"📁 Output: {output_file}")
with open(output_file, 'w', encoding='utf-8') as f:
for i, prompt in enumerate(prompts, 1):
print(f"\n{'='*60}")
print(f"📋 Task {i}/{len(prompts)}")
print(f"{'='*60}")
try:
result = self.run_task(prompt)
results.append(result)
# Write to file immediately
f.write(json.dumps(result, ensure_ascii=False) + "\n")
f.flush()
print(f"✅ Task {i} completed (api_calls={result['api_calls']})")
except Exception as e:
self.logger.error(f"Error on task {i}: {e}")
error_result = {
"conversations": [],
"completed": False,
"api_calls": 0,
"error": str(e),
"metadata": {"timestamp": datetime.now().isoformat()}
}
results.append(error_result)
f.write(json.dumps(error_result, ensure_ascii=False) + "\n")
f.flush()
print(f"\n✅ Batch complete! {len(results)} trajectories saved to {output_file}")
return results
# ============================================================================
# CLI Interface
# ============================================================================
def main(
task: str = None,
prompts_file: str = None,
output_file: str = "mini-swe-agent-test1.jsonl",
model: str = "claude-sonnet-4-20250514",
base_url: str = None,
api_key: str = None,
env: str = "local",
image: str = "python:3.11-slim",
cwd: str = "/tmp",
max_iterations: int = 15,
timeout: int = 60,
verbose: bool = False,
):
"""
Run mini-swe-agent tasks with Hermes trajectory format output.
Args:
task: Single task to run (use this OR prompts_file)
prompts_file: JSONL file with prompts (each line: {"prompt": "..."})
output_file: Output JSONL file for trajectories
model: Model name (default: claude-sonnet-4-20250514)
base_url: API base URL (optional)
api_key: API key (optional, uses env vars)
env: Environment type - "local", "docker", or "modal"
image: Docker/Modal image (default: python:3.11-slim)
cwd: Working directory (default: /tmp)
max_iterations: Maximum tool-calling iterations (default: 15)
timeout: Command timeout in seconds (default: 60)
verbose: Enable verbose logging
Examples:
# Single task with local environment
python mini_swe_runner.py --task "Create hello.py that prints Hello World"
# Single task with Docker
python mini_swe_runner.py --task "List files" --env docker
# Batch from file
python mini_swe_runner.py --prompts_file tasks.jsonl --output_file results.jsonl
"""
print("🚀 Mini-SWE Runner with Hermes Trajectory Format")
print("=" * 60)
# Initialize runner
runner = MiniSWERunner(
model=model,
base_url=base_url,
api_key=api_key,
env_type=env,
image=image,
cwd=cwd,
max_iterations=max_iterations,
command_timeout=timeout,
verbose=verbose,
)
if task:
# Single task mode
result = runner.run_task(task)
# Save to file
with open(output_file, 'w', encoding='utf-8') as f:
f.write(json.dumps(result, ensure_ascii=False) + "\n")
print(f"\n📁 Trajectory saved to: {output_file}")
print(f"✅ Completed: {result['completed']}")
print(f"📞 API calls: {result['api_calls']}")
print(f"💬 Turns: {len(result['conversations'])}")
elif prompts_file:
# Batch mode
prompts = []
with open(prompts_file, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if line:
try:
entry = json.loads(line)
prompts.append(entry.get("prompt", entry.get("task", "")))
except json.JSONDecodeError:
prompts.append(line)
if not prompts:
print(f"❌ No prompts found in {prompts_file}")
return
runner.run_batch(prompts, output_file)
else:
print("❌ Please provide either --task or --prompts_file")
print(" Example: python mini_swe_runner.py --task 'Create a hello world script'")
if __name__ == "__main__":
fire.Fire(main)

View File

@@ -1,134 +0,0 @@
# Modal Sandbox Profiles Configuration
# =====================================
# This file defines different sandbox profiles for heterogeneous workloads.
# Copy to modal_profiles.yaml and customize as needed.
#
# Usage:
# terminal_tool("python train.py", profile="pytorch-gpu")
# terminal_tool("npm test", profile="node")
#
# Each profile can specify:
# - image: Docker image to use
# - gpu: GPU type (null, "T4", "A10G", "A100", "H100")
# - cpu: CPU cores (float)
# - memory: Memory in MB
# - min_pool: Minimum warm sandboxes (cost vs latency tradeoff)
# - max_pool: Maximum sandboxes (hard cost cap)
# - idle_timeout: Server-side auto-cleanup in seconds
# - max_lifetime: Maximum sandbox lifetime in seconds
# - scale_down_idle: Client-side scale-down threshold in seconds
# - workdir: Working directory inside container
# - secrets: List of Modal Secret names to inject (created via dashboard/CLI)
# - env_vars: Dict of environment variables to pass directly
# - use_dotenv: If true, loads local .env file into sandbox
#
# SECRETS SETUP:
# Create secrets via Modal dashboard or CLI:
# modal secret create huggingface-token HF_TOKEN=hf_xxx
# modal secret create openai-key OPENAI_API_KEY=sk-xxx
# Then reference by name in profile's secrets list.
# Default profile used when no profile specified
default_profile: default
profiles:
# Default Python environment - good for most tasks
default:
image: python:3.11
gpu: null
cpu: 1.0
memory: 2048
min_pool: 1 # Keep 1 warm for fast response
max_pool: 5
idle_timeout: 120 # Modal terminates if idle 2 min
max_lifetime: 3600 # Max 1 hour
scale_down_idle: 180
workdir: /workspace
secrets: [] # Add secret names here: ["my-api-keys"]
env_vars: {} # Add env vars here: {DEBUG: "1"}
use_dotenv: false # Set to true to load local .env
# PyTorch with GPU for ML training/inference
pytorch-gpu:
image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
gpu: T4 # Options: T4, A10G, A100, H100
cpu: 4.0
memory: 16384 # 16GB
min_pool: 0 # Don't keep GPU sandboxes warm (expensive!)
max_pool: 2
idle_timeout: 60 # Shorter idle timeout for GPU (cost)
max_lifetime: 1800 # 30 min max for GPU tasks
scale_down_idle: 60
workdir: /workspace
# ML-specific secrets
secrets:
- huggingface-token # HF_TOKEN env var
- wandb-key # WANDB_API_KEY env var
env_vars:
CUDA_VISIBLE_DEVICES: "0"
PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
# High-end GPU for large models
pytorch-a100:
image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
gpu: A100
cpu: 8.0
memory: 65536 # 64GB
min_pool: 0
max_pool: 1 # Only 1 at a time (very expensive)
idle_timeout: 30
max_lifetime: 3600
scale_down_idle: 30
workdir: /workspace
# Node.js for JavaScript/TypeScript tasks
node:
image: node:18
gpu: null
cpu: 1.0
memory: 2048
min_pool: 0 # Create on-demand
max_pool: 3
idle_timeout: 120
max_lifetime: 3600
scale_down_idle: 180
workdir: /workspace
# High memory for data processing
high-memory:
image: python:3.11
gpu: null
cpu: 4.0
memory: 32768 # 32GB
min_pool: 0
max_pool: 2
idle_timeout: 120
max_lifetime: 3600
scale_down_idle: 180
workdir: /workspace
# Rust development environment
rust:
image: rust:1.75
gpu: null
cpu: 2.0
memory: 4096
min_pool: 0
max_pool: 2
idle_timeout: 120
max_lifetime: 3600
scale_down_idle: 180
workdir: /workspace
# Go development environment
golang:
image: golang:1.21
gpu: null
cpu: 2.0
memory: 4096
min_pool: 0
max_pool: 2
idle_timeout: 120
max_lifetime: 3600
scale_down_idle: 180
workdir: /workspace

View File

@@ -8,7 +8,7 @@ for defining tools and executing function calls.
Currently supports:
- Web tools (search, extract, crawl) from web_tools.py
- Terminal tools (simple command execution, no session persistence) from simple_terminal_tool.py
- Terminal tools (command execution with interactive sessions) from terminal_tool.py
- Vision tools (image analysis) from vision_tools.py
- Mixture of Agents tools (collaborative multi-model reasoning) from mixture_of_agents_tool.py
- Image generation tools (text-to-image with upscaling) from image_generation_tool.py
@@ -23,125 +23,20 @@ Usage:
web_tools = get_tool_definitions(enabled_toolsets=['web_tools'])
# Handle function calls from model
result = handle_function_call("web_search", {"query": "Python"})
result = await handle_function_call("web_search", {"query": "Python"})
"""
import json
import asyncio
from typing import Dict, Any, List, Optional
from tools.terminal_tool import TERMINAL_TOOL_DESCRIPTION, cleanup_vm, check_terminal_requirements, terminal_tool
# Optional toolsets: keep Hermes importable even when some deps aren't installed.
try:
from tools.web_tools import check_firecrawl_api_key, web_crawl_tool, web_extract_tool, web_search_tool
except ModuleNotFoundError:
web_search_tool = None # type: ignore[assignment]
web_extract_tool = None # type: ignore[assignment]
web_crawl_tool = None # type: ignore[assignment]
def check_firecrawl_api_key() -> bool: # type: ignore[no-redef]
return False
try:
# Hecate/MorphCloud terminal tool (cloud VMs) - available as alternative backend
from tools.terminal_hecate import TERMINAL_HECATE_DESCRIPTION, check_hecate_requirements, terminal_hecate_tool
except ModuleNotFoundError:
terminal_hecate_tool = None # type: ignore[assignment]
TERMINAL_HECATE_DESCRIPTION = ""
def check_hecate_requirements() -> bool: # type: ignore[no-redef]
return False
try:
from tools.vision_tools import check_vision_requirements, vision_analyze_tool
except ModuleNotFoundError:
vision_analyze_tool = None # type: ignore[assignment]
def check_vision_requirements() -> bool: # type: ignore[no-redef]
return False
try:
from tools.mixture_of_agents_tool import check_moa_requirements, mixture_of_agents_tool
except ModuleNotFoundError:
mixture_of_agents_tool = None # type: ignore[assignment]
def check_moa_requirements() -> bool: # type: ignore[no-redef]
return False
try:
from tools.image_generation_tool import check_image_generation_requirements, image_generate_tool
except ModuleNotFoundError:
image_generate_tool = None # type: ignore[assignment]
def check_image_generation_requirements() -> bool: # type: ignore[no-redef]
return False
try:
from tools.skills_tool import (
SKILLS_TOOL_DESCRIPTION,
check_skills_requirements,
skill_view,
skills_categories,
skills_list,
)
except ModuleNotFoundError:
SKILLS_TOOL_DESCRIPTION = ""
def check_skills_requirements() -> bool: # type: ignore[no-redef]
return False
def skills_categories() -> str: # type: ignore[no-redef]
return json.dumps({"error": "Skills toolset is unavailable (missing dependencies)."}, ensure_ascii=False)
def skills_list(category: Optional[str] = None) -> str: # type: ignore[no-redef]
_ = category
return json.dumps({"error": "Skills toolset is unavailable (missing dependencies)."}, ensure_ascii=False)
def skill_view(name: str, file_path: Optional[str] = None) -> str: # type: ignore[no-redef]
_ = (name, file_path)
return json.dumps({"error": "Skills toolset is unavailable (missing dependencies)."}, ensure_ascii=False)
try:
# Browser automation tools (agent-browser + Browserbase)
from tools.browser_tool import (
BROWSER_TOOL_SCHEMAS,
browser_back,
browser_click,
browser_close,
browser_get_images,
browser_navigate,
browser_press,
browser_scroll,
browser_snapshot,
browser_type,
browser_vision,
check_browser_requirements,
cleanup_browser,
)
except ModuleNotFoundError:
BROWSER_TOOL_SCHEMAS: List[Dict[str, Any]] = []
def check_browser_requirements() -> bool: # type: ignore[no-redef]
return False
def cleanup_browser(task_id: Optional[str] = None) -> None: # type: ignore[no-redef]
_ = task_id
return None
def _browser_unavailable(*_args: Any, **_kwargs: Any) -> str:
return json.dumps({"error": "Browser toolset is unavailable (missing dependencies)."}, ensure_ascii=False)
browser_navigate = _browser_unavailable # type: ignore[assignment]
browser_snapshot = _browser_unavailable # type: ignore[assignment]
browser_click = _browser_unavailable # type: ignore[assignment]
browser_type = _browser_unavailable # type: ignore[assignment]
browser_scroll = _browser_unavailable # type: ignore[assignment]
browser_back = _browser_unavailable # type: ignore[assignment]
browser_press = _browser_unavailable # type: ignore[assignment]
browser_close = _browser_unavailable # type: ignore[assignment]
browser_get_images = _browser_unavailable # type: ignore[assignment]
browser_vision = _browser_unavailable # type: ignore[assignment]
from tools.web_tools import web_search_tool, web_extract_tool, web_crawl_tool, check_firecrawl_api_key
from tools.simple_terminal_tool import simple_terminal_tool, check_requirements as check_simple_terminal_requirements, SIMPLE_TERMINAL_TOOL_DESCRIPTION
# Keep old terminal tool for backwards compatibility if needed
# from tools.terminal_tool import terminal_tool, check_hecate_requirements, TERMINAL_TOOL_DESCRIPTION
from tools.vision_tools import vision_analyze_tool, check_vision_requirements
from tools.mixture_of_agents_tool import mixture_of_agents_tool, check_moa_requirements
from tools.image_generation_tool import image_generate_tool, check_image_generation_requirements
from toolsets import (
get_toolset, resolve_toolset, resolve_multiple_toolsets,
get_all_toolsets, get_toolset_names, validate_toolset,
@@ -160,7 +55,7 @@ def get_web_tool_definitions() -> List[Dict[str, Any]]:
"type": "function",
"function": {
"name": "web_search",
"description": "Search the web for information on any topic. Returns up to 5 relevant results with titles and URLs. Uses advanced search depth for comprehensive results. PREFERRED over browser tools for finding information - faster and more cost-effective. Use browser tools only when you need to interact with pages (click, fill forms, handle dynamic content).",
"description": "Search the web for information on any topic. Returns up to 5 relevant results with titles and URLs. Uses advanced search depth for comprehensive results.",
"parameters": {
"type": "object",
"properties": {
@@ -177,7 +72,7 @@ def get_web_tool_definitions() -> List[Dict[str, Any]]:
"type": "function",
"function": {
"name": "web_extract",
"description": "Extract and read the full content from specific web page URLs. Useful for getting detailed information from webpages found through search. The content returned will be excerpts and key points summarized with an LLM to reduce impact on the context window. PREFERRED over browser tools for reading page content - faster and more cost-effective. Use browser tools only when pages require interaction or have dynamic content.",
"description": "Extract and read the full content from specific web page URLs. Useful for getting detailed information from webpages found through search. The content returned will be excerpts and key points summarized with an LLM to reduce impact on the context window.",
"parameters": {
"type": "object",
"properties": {
@@ -192,13 +87,32 @@ def get_web_tool_definitions() -> List[Dict[str, Any]]:
}
}
},
{
"type": "function",
"function": {
"name": "web_crawl",
"description": "Crawl a website with specific instructions to find and extract targeted content. Uses AI to intelligently navigate and extract relevant information from across the site. The content returned will be excerpts and key points summarized with an LLM to reduce impact on the context window.",
"parameters": {
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The base URL to crawl (can include or exclude https://)"
},
"instructions": {
"type": "string",
"description": "Specific instructions for what to crawl/extract using AI intelligence (e.g., 'Find pricing information', 'Get documentation pages', 'Extract contact details')"
}
},
"required": ["url"]
}
}
}
]
def get_terminal_tool_definitions() -> List[Dict[str, Any]]:
"""
Get tool definitions for terminal tools in OpenAI's expected format.
Uses mini-swe-agent backend (local/docker/modal) by default.
Returns:
List[Dict]: List of terminal tool definitions compatible with OpenAI API
@@ -208,7 +122,7 @@ def get_terminal_tool_definitions() -> List[Dict[str, Any]]:
"type": "function",
"function": {
"name": "terminal",
"description": TERMINAL_TOOL_DESCRIPTION,
"description": SIMPLE_TERMINAL_TOOL_DESCRIPTION,
"parameters": {
"type": "object",
"properties": {
@@ -306,7 +220,7 @@ def get_image_tool_definitions() -> List[Dict[str, Any]]:
"type": "function",
"function": {
"name": "image_generate",
"description": "Generate high-quality images from text prompts using FLUX 2 Pro model with automatic 2x upscaling. Creates detailed, artistic images that are automatically upscaled for hi-rez results. Returns a single upscaled image URL that can be displayed using <img src=\"{URL}\"></img> tags.",
"description": "Generate high-quality images from text prompts using FLUX Krea model with automatic 2x upscaling. Creates detailed, artistic images that are automatically enhanced for superior quality. Returns a single upscaled image URL that can be displayed using <img src=\"{URL}\"></img> tags.",
"parameters": {
"type": "object",
"properties": {
@@ -314,11 +228,11 @@ def get_image_tool_definitions() -> List[Dict[str, Any]]:
"type": "string",
"description": "The text prompt describing the desired image. Be detailed and descriptive."
},
"aspect_ratio": {
"image_size": {
"type": "string",
"enum": ["landscape", "square", "portrait"],
"description": "The aspect ratio of the generated image. 'landscape' is 16:9 wide, 'portrait' is 16:9 tall, 'square' is 1:1.",
"default": "landscape"
"enum": ["square","portrait_16_9", "landscape_16_9"],
"description": "The size/aspect ratio of the generated image (default: landscape_4_3)",
"default": "landscape_16_9"
}
},
"required": ["prompt"]
@@ -328,79 +242,6 @@ def get_image_tool_definitions() -> List[Dict[str, Any]]:
]
def get_skills_tool_definitions() -> List[Dict[str, Any]]:
"""
Get tool definitions for skills tools in OpenAI's expected format.
Returns:
List[Dict]: List of skills tool definitions compatible with OpenAI API
"""
return [
{
"type": "function",
"function": {
"name": "skills_list",
"description": "List available skills (name + description). Use skill_view(name) to load full content.",
"parameters": {
"type": "object",
"properties": {
"category": {
"type": "string",
"description": "Optional category filter (from skills_categories)"
}
},
"required": []
}
}
},
{
"type": "function",
"function": {
"name": "skills_categories",
"description": "List available skill categories. Call first if you want to discover categories, then use skills_list(category) to filter, or call skills_list if unsure.",
"parameters": {
"type": "object",
"properties": {},
"required": []
}
}
},
{
"type": "function",
"function": {
"name": "skill_view",
"description": "Skills allow for loading information about specific tasks and workflows, as well as scripts and templates. Load a skill's full content or access its linked files (references, templates, scripts). First call returns SKILL.md content plus a 'linked_files' dict showing available references/templates/scripts. To access those, call again with file_path parameter.",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The skill name (use skills_list to see available skills)"
},
"file_path": {
"type": "string",
"description": "OPTIONAL: Path to a linked file within the skill (e.g., 'references/api.md', 'templates/config.yaml', 'scripts/validate.py'). Omit to get the main SKILL.md content."
}
},
"required": ["name"]
}
}
}
]
def get_browser_tool_definitions() -> List[Dict[str, Any]]:
"""
Get tool definitions for browser automation tools in OpenAI's expected format.
Uses agent-browser CLI with Browserbase cloud execution.
Returns:
List[Dict]: List of browser tool definitions compatible with OpenAI API
"""
return [{"type": "function", "function": schema} for schema in BROWSER_TOOL_SCHEMAS]
def get_all_tool_names() -> List[str]:
"""
Get the names of all available tools across all toolsets.
@@ -412,10 +253,10 @@ def get_all_tool_names() -> List[str]:
# Web tools
if check_firecrawl_api_key():
tool_names.extend(["web_search", "web_extract"])
tool_names.extend(["web_search", "web_extract", "web_crawl"])
# Terminal tools (mini-swe-agent backend)
if check_terminal_requirements():
# Terminal tools
if check_simple_terminal_requirements():
tool_names.extend(["terminal"])
# Vision tools
@@ -430,19 +271,6 @@ def get_all_tool_names() -> List[str]:
if check_image_generation_requirements():
tool_names.extend(["image_generate"])
# Skills tools
if check_skills_requirements():
tool_names.extend(["skills_categories", "skills_list", "skill_view"])
# Browser automation tools
if check_browser_requirements():
tool_names.extend([
"browser_navigate", "browser_snapshot", "browser_click",
"browser_type", "browser_scroll", "browser_back",
"browser_press", "browser_close", "browser_get_images",
"browser_vision"
])
return tool_names
@@ -458,26 +286,12 @@ def get_toolset_for_tool(tool_name: str) -> str:
"""
toolset_mapping = {
"web_search": "web_tools",
"web_extract": "web_tools",
"web_extract": "web_tools",
"web_crawl": "web_tools",
"terminal": "terminal_tools",
"vision_analyze": "vision_tools",
"mixture_of_agents": "moa_tools",
"image_generate": "image_tools",
# Skills tools
"skills_categories": "skills_tools",
"skills_list": "skills_tools",
"skill_view": "skills_tools",
# Browser automation tools
"browser_navigate": "browser_tools",
"browser_snapshot": "browser_tools",
"browser_click": "browser_tools",
"browser_type": "browser_tools",
"browser_scroll": "browser_tools",
"browser_back": "browser_tools",
"browser_press": "browser_tools",
"browser_close": "browser_tools",
"browser_get_images": "browser_tools",
"browser_vision": "browser_tools"
"image_generate": "image_tools"
}
return toolset_mapping.get(tool_name, "unknown")
@@ -485,8 +299,7 @@ def get_toolset_for_tool(tool_name: str) -> str:
def get_tool_definitions(
enabled_toolsets: List[str] = None,
disabled_toolsets: List[str] = None,
quiet_mode: bool = False,
disabled_toolsets: List[str] = None
) -> List[Dict[str, Any]]:
"""
Get tool definitions for model API calls with toolset-based filtering.
@@ -526,7 +339,7 @@ def get_tool_definitions(
for tool in get_web_tool_definitions():
all_available_tools_map[tool["function"]["name"]] = tool
if check_terminal_requirements():
if check_simple_terminal_requirements():
for tool in get_terminal_tool_definitions():
all_available_tools_map[tool["function"]["name"]] = tool
@@ -542,14 +355,6 @@ def get_tool_definitions(
for tool in get_image_tool_definitions():
all_available_tools_map[tool["function"]["name"]] = tool
if check_skills_requirements():
for tool in get_skills_tool_definitions():
all_available_tools_map[tool["function"]["name"]] = tool
if check_browser_requirements():
for tool in get_browser_tool_definitions():
all_available_tools_map[tool["function"]["name"]] = tool
# Determine which tools to include based on toolsets
tools_to_include = set()
@@ -562,21 +367,14 @@ def get_tool_definitions(
print(f"✅ Enabled toolset '{toolset_name}': {', '.join(resolved_tools) if resolved_tools else 'no tools'}")
else:
# Try legacy compatibility
if toolset_name in ["web_tools", "terminal_tools", "vision_tools", "moa_tools", "image_tools", "skills_tools", "browser_tools"]:
if toolset_name in ["web_tools", "terminal_tools", "vision_tools", "moa_tools", "image_tools"]:
# Map legacy names to new system
legacy_map = {
"web_tools": ["web_search", "web_extract"],
"web_tools": ["web_search", "web_extract", "web_crawl"],
"terminal_tools": ["terminal"],
"vision_tools": ["vision_analyze"],
"moa_tools": ["mixture_of_agents"],
"image_tools": ["image_generate"],
"skills_tools": ["skills_categories", "skills_list", "skill_view"],
"browser_tools": [
"browser_navigate", "browser_snapshot", "browser_click",
"browser_type", "browser_scroll", "browser_back",
"browser_press", "browser_close", "browser_get_images",
"browser_vision"
]
"image_tools": ["image_generate"]
}
legacy_tools = legacy_map.get(toolset_name, [])
tools_to_include.update(legacy_tools)
@@ -604,20 +402,13 @@ def get_tool_definitions(
print(f"🚫 Disabled toolset '{toolset_name}': {', '.join(resolved_tools) if resolved_tools else 'no tools'}")
else:
# Try legacy compatibility
if toolset_name in ["web_tools", "terminal_tools", "vision_tools", "moa_tools", "image_tools", "skills_tools", "browser_tools"]:
if toolset_name in ["web_tools", "terminal_tools", "vision_tools", "moa_tools", "image_tools"]:
legacy_map = {
"web_tools": ["web_search", "web_extract"],
"web_tools": ["web_search", "web_extract", "web_crawl"],
"terminal_tools": ["terminal"],
"vision_tools": ["vision_analyze"],
"moa_tools": ["mixture_of_agents"],
"image_tools": ["image_generate"],
"skills_tools": ["skills_categories", "skills_list", "skill_view"],
"browser_tools": [
"browser_navigate", "browser_snapshot", "browser_click",
"browser_type", "browser_scroll", "browser_back",
"browser_press", "browser_close", "browser_get_images",
"browser_vision"
]
"image_tools": ["image_generate"]
}
legacy_tools = legacy_map.get(toolset_name, [])
tools_to_include.difference_update(legacy_tools)
@@ -640,16 +431,15 @@ def get_tool_definitions(
# Sort tools for consistent ordering
filtered_tools.sort(key=lambda t: t["function"]["name"])
if not quiet_mode:
if filtered_tools:
tool_names = [t["function"]["name"] for t in filtered_tools]
print(f"🛠️ Final tool selection ({len(filtered_tools)} tools): {', '.join(tool_names)}")
else:
print("🛠️ No tools selected (all filtered out or unavailable)")
if filtered_tools:
tool_names = [t["function"]["name"] for t in filtered_tools]
print(f"🛠️ Final tool selection ({len(filtered_tools)} tools): {', '.join(tool_names)}")
else:
print("🛠️ No tools selected (all filtered out or unavailable)")
return filtered_tools
def handle_web_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
async def handle_web_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
"""
Handle function calls for web tools.
@@ -660,43 +450,36 @@ def handle_web_function_call(function_name: str, function_args: Dict[str, Any])
Returns:
str: Function result as JSON string
"""
if web_search_tool is None or web_extract_tool is None or web_crawl_tool is None:
return json.dumps(
{
"error": (
"Web toolset is unavailable (missing dependencies and/or FIRECRAWL_API_KEY). "
"Install web tool deps and set FIRECRAWL_API_KEY to enable."
)
},
ensure_ascii=False,
)
if function_name == "web_search":
query = function_args.get("query", "")
# Always use fixed limit of 5
limit = 5
return web_search_tool(query, limit)
return await web_search_tool(query, limit)
elif function_name == "web_extract":
urls = function_args.get("urls", [])
# Limit URLs to prevent abuse
urls = urls[:5] if isinstance(urls, list) else []
# Run async function in event loop
return asyncio.run(web_extract_tool(urls, "markdown"))
# Run async function
return await web_extract_tool(urls, "markdown")
elif function_name == "web_crawl":
url = function_args.get("url", "")
instructions = function_args.get("instructions")
# Run async function
return await web_crawl_tool(url, instructions, "basic")
else:
return json.dumps({"error": f"Unknown web function: {function_name}"}, ensure_ascii=False)
def handle_terminal_function_call(function_name: str, function_args: Dict[str, Any], task_id: Optional[str] = None) -> str:
async def handle_terminal_function_call(function_name: str, function_args: Dict[str, Any], task_id: Optional[str] = None) -> str:
"""
Handle function calls for terminal tools.
Uses mini-swe-agent backend (local/docker/modal) by default.
Args:
function_name (str): Name of the terminal function to call
function_args (Dict): Arguments for the function
task_id (str): Unique identifier for this task to isolate environments between concurrent tasks (optional)
task_id (str): Unique identifier for this task to isolate VMs between concurrent tasks (optional)
Returns:
str: Function result as JSON string
@@ -706,13 +489,20 @@ def handle_terminal_function_call(function_name: str, function_args: Dict[str, A
background = function_args.get("background", False)
timeout = function_args.get("timeout")
return terminal_tool(command=command, background=background, timeout=timeout, task_id=task_id)
# Run sync terminal tool in a thread to avoid blocking
return await asyncio.to_thread(
simple_terminal_tool,
command=command,
background=background,
timeout=timeout,
task_id=task_id
)
else:
return json.dumps({"error": f"Unknown terminal function: {function_name}"}, ensure_ascii=False)
def handle_vision_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
async def handle_vision_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
"""
Handle function calls for vision tools.
@@ -723,31 +513,20 @@ def handle_vision_function_call(function_name: str, function_args: Dict[str, Any
Returns:
str: Function result as JSON string
"""
if vision_analyze_tool is None:
return json.dumps(
{
"error": (
"Vision toolset is unavailable (missing dependencies and/or NOUS_API_KEY). "
"Install vision deps and set NOUS_API_KEY to enable."
)
},
ensure_ascii=False,
)
if function_name == "vision_analyze":
image_url = function_args.get("image_url", "")
question = function_args.get("question", "")
full_prompt = f"Fully describe and explain everything about this image, then answer the following question:\n\n{question}"
# Run async function in event loop
return asyncio.run(vision_analyze_tool(image_url, full_prompt, "google/gemini-3-flash-preview"))
# Run async function
return await vision_analyze_tool(image_url, full_prompt, "gemini-2.5-flash")
else:
return json.dumps({"error": f"Unknown vision function: {function_name}"}, ensure_ascii=False)
def handle_moa_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
async def handle_moa_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
"""
Handle function calls for Mixture-of-Agents tools.
@@ -758,31 +537,20 @@ def handle_moa_function_call(function_name: str, function_args: Dict[str, Any])
Returns:
str: Function result as JSON string
"""
if mixture_of_agents_tool is None:
return json.dumps(
{
"error": (
"Mixture-of-Agents toolset is unavailable (missing dependencies and/or NOUS_API_KEY). "
"Install MoA deps and set NOUS_API_KEY to enable."
)
},
ensure_ascii=False,
)
if function_name == "mixture_of_agents":
user_prompt = function_args.get("user_prompt", "")
if not user_prompt:
return json.dumps({"error": "user_prompt is required for MoA processing"}, ensure_ascii=False)
# Run async function in event loop
return asyncio.run(mixture_of_agents_tool(user_prompt=user_prompt))
# Run async function
return await mixture_of_agents_tool(user_prompt=user_prompt)
else:
return json.dumps({"error": f"Unknown MoA function: {function_name}"}, ensure_ascii=False)
def handle_image_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
async def handle_image_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
"""
Handle function calls for image generation tools.
@@ -793,143 +561,43 @@ def handle_image_function_call(function_name: str, function_args: Dict[str, Any]
Returns:
str: Function result as JSON string
"""
if image_generate_tool is None:
return json.dumps(
{
"error": (
"Image generation toolset is unavailable (missing dependencies and/or FAL_KEY). "
"Install image deps and set FAL_KEY to enable."
)
},
ensure_ascii=False,
)
if function_name == "image_generate":
prompt = function_args.get("prompt", "")
if not prompt:
return json.dumps({"success": False, "image": None}, ensure_ascii=False)
aspect_ratio = function_args.get("aspect_ratio", "landscape")
image_size = function_args.get("image_size", "landscape_16_9")
# Use fixed internal defaults for all other parameters (not exposed to model)
num_inference_steps = 50
guidance_scale = 4.5
num_images = 1
enable_safety_checker = True
output_format = "png"
acceleration = "none"
allow_nsfw_images = True
seed = None
# Run async function in event loop with proper handling for multiprocessing
try:
# Try to get existing event loop
loop = asyncio.get_event_loop()
if loop.is_closed():
# If closed, create a new one
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
except RuntimeError:
# No event loop in current thread, create one
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
# Run the coroutine in the event loop
result = loop.run_until_complete(image_generate_tool(
# Run async function
return await image_generate_tool(
prompt=prompt,
aspect_ratio=aspect_ratio,
image_size=image_size,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
num_images=num_images,
enable_safety_checker=enable_safety_checker,
output_format=output_format,
acceleration=acceleration,
allow_nsfw_images=allow_nsfw_images,
seed=seed
))
return result
)
else:
return json.dumps({"error": f"Unknown image generation function: {function_name}"}, ensure_ascii=False)
def handle_skills_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
"""
Handle function calls for skills tools.
Args:
function_name (str): Name of the skills function to call
function_args (Dict): Arguments for the function
Returns:
str: Function result as JSON string
"""
if function_name == "skills_categories":
return skills_categories()
elif function_name == "skills_list":
category = function_args.get("category")
return skills_list(category=category)
elif function_name == "skill_view":
name = function_args.get("name", "")
if not name:
return json.dumps({"error": "Skill name is required"}, ensure_ascii=False)
file_path = function_args.get("file_path")
return skill_view(name, file_path=file_path)
else:
return json.dumps({"error": f"Unknown skills function: {function_name}"}, ensure_ascii=False)
# Browser tool handlers mapping
BROWSER_HANDLERS = {
"browser_navigate": browser_navigate,
"browser_click": browser_click,
"browser_type": browser_type,
"browser_scroll": browser_scroll,
"browser_back": browser_back,
"browser_press": browser_press,
"browser_close": browser_close,
"browser_get_images": browser_get_images,
"browser_vision": browser_vision,
}
def handle_browser_function_call(
function_name: str,
function_args: Dict[str, Any],
task_id: Optional[str] = None,
user_task: Optional[str] = None
) -> str:
"""
Handle function calls for browser automation tools.
Args:
function_name (str): Name of the browser function to call
function_args (Dict): Arguments for the function
task_id (str): Task identifier for session isolation
user_task (str): User's current task (for task-aware extraction in snapshots)
Returns:
str: Function result as JSON string
"""
# Special handling for browser_snapshot which needs user_task for extraction
if function_name == "browser_snapshot":
full = function_args.get("full", False)
return browser_snapshot(full=full, task_id=task_id, user_task=user_task)
# Handle other browser tools
if function_name in BROWSER_HANDLERS:
handler = BROWSER_HANDLERS[function_name]
# Add task_id to args
return handler(**function_args, task_id=task_id)
return json.dumps({"error": f"Unknown browser function: {function_name}"}, ensure_ascii=False)
def handle_function_call(
function_name: str,
function_args: Dict[str, Any],
task_id: Optional[str] = None,
user_task: Optional[str] = None
) -> str:
async def handle_function_call(function_name: str, function_args: Dict[str, Any], task_id: Optional[str] = None) -> str:
"""
Main function call dispatcher that routes calls to appropriate toolsets.
@@ -940,8 +608,7 @@ def handle_function_call(
Args:
function_name (str): Name of the function to call
function_args (Dict): Arguments for the function
task_id (str): Unique identifier for this task to isolate VMs/sessions between concurrent tasks (optional)
user_task (str): The user's original task/query (used for task-aware content extraction) (optional)
task_id (str): Unique identifier for this task to isolate VMs between concurrent tasks (optional)
Returns:
str: Function result as JSON string
@@ -951,37 +618,24 @@ def handle_function_call(
"""
try:
# Route web tools
if function_name in ["web_search", "web_extract"]:
return handle_web_function_call(function_name, function_args)
if function_name in ["web_search", "web_extract", "web_crawl"]:
return await handle_web_function_call(function_name, function_args)
# Route terminal tools
elif function_name in ["terminal"]:
return handle_terminal_function_call(function_name, function_args, task_id)
return await handle_terminal_function_call(function_name, function_args, task_id)
# Route vision tools
elif function_name in ["vision_analyze"]:
return handle_vision_function_call(function_name, function_args)
return await handle_vision_function_call(function_name, function_args)
# Route MoA tools
elif function_name in ["mixture_of_agents"]:
return handle_moa_function_call(function_name, function_args)
return await handle_moa_function_call(function_name, function_args)
# Route image generation tools
elif function_name in ["image_generate"]:
return handle_image_function_call(function_name, function_args)
# Route skills tools
elif function_name in ["skills_categories", "skills_list", "skill_view"]:
return handle_skills_function_call(function_name, function_args)
# Route browser automation tools
elif function_name in [
"browser_navigate", "browser_snapshot", "browser_click",
"browser_type", "browser_scroll", "browser_back",
"browser_press", "browser_close", "browser_get_images",
"browser_vision"
]:
return handle_browser_function_call(function_name, function_args, task_id, user_task)
return await handle_image_function_call(function_name, function_args)
else:
error_msg = f"Unknown function: {function_name}"
@@ -1004,15 +658,15 @@ def get_available_toolsets() -> Dict[str, Dict[str, Any]]:
toolsets = {
"web_tools": {
"available": check_firecrawl_api_key(),
"tools": ["web_search_tool", "web_extract_tool"],
"description": "Web search and content extraction tools",
"tools": ["web_search_tool", "web_extract_tool", "web_crawl_tool"],
"description": "Web search, content extraction, and website crawling tools",
"requirements": ["FIRECRAWL_API_KEY environment variable"]
},
"terminal_tools": {
"available": check_terminal_requirements(),
"tools": ["terminal_tool"],
"description": "Execute commands using mini-swe-agent (local/docker/modal)",
"requirements": ["mini-swe-agent package, TERMINAL_ENV to select backend"]
"available": check_simple_terminal_requirements(),
"tools": ["simple_terminal_tool"],
"description": "Execute commands on secure Linux VMs without session persistence",
"requirements": ["MORPH_API_KEY environment variable"]
},
"vision_tools": {
"available": check_vision_requirements(),
@@ -1031,23 +685,6 @@ def get_available_toolsets() -> Dict[str, Dict[str, Any]]:
"tools": ["image_generate_tool"],
"description": "Generate high-quality images from text prompts using FAL.ai's FLUX.1 Krea model with automatic 2x upscaling for enhanced quality",
"requirements": ["FAL_KEY environment variable", "fal-client package"]
},
"skills_tools": {
"available": check_skills_requirements(),
"tools": ["skills_categories", "skills_list", "skill_view"],
"description": "Access skill documents that provide specialized instructions, guidelines, or knowledge the agent can load on demand",
"requirements": ["skills/ directory in repo root"]
},
"browser_tools": {
"available": check_browser_requirements(),
"tools": [
"browser_navigate", "browser_snapshot", "browser_click",
"browser_type", "browser_scroll", "browser_back",
"browser_press", "browser_close", "browser_get_images",
"browser_vision"
],
"description": "Browser automation for web interaction using agent-browser CLI with Browserbase cloud execution",
"requirements": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID", "agent-browser npm package"]
}
}
@@ -1062,12 +699,10 @@ def check_toolset_requirements() -> Dict[str, bool]:
"""
return {
"web_tools": check_firecrawl_api_key(),
"terminal_tools": check_terminal_requirements(),
"terminal_tools": check_simple_terminal_requirements(),
"vision_tools": check_vision_requirements(),
"moa_tools": check_moa_requirements(),
"image_tools": check_image_generation_requirements(),
"skills_tools": check_skills_requirements(),
"browser_tools": check_browser_requirements()
"image_tools": check_image_generation_requirements()
}
if __name__ == "__main__":
@@ -1130,4 +765,4 @@ if __name__ == "__main__":
if "terminal" in all_tool_names:
no_terminal = get_tool_definitions(disabled_tools=["terminal"])
print(f" All except terminal: {len(no_terminal)} tools")
print(f" All except terminal: {len(no_terminal)} tools")


@@ -1,37 +0,0 @@
# Nomad Development Configuration (Hermes-Agent)
# Run with: nomad agent -dev -config=nomad-dev.hcl
#
# This is intended for local development only.
client {
enabled = true
options {
# Enable Docker volume mounts for persistent slot workspaces
"docker.volumes.enabled" = "true"
}
}
# Docker driver plugin configuration
plugin "docker" {
config {
# CRITICAL: Enable volume mounts
volumes {
enabled = true
}
# Allow privileged containers if needed
allow_privileged = false
# Garbage collection settings
gc {
image = true
# NOTE: For local dev we often rely on locally built images like `atropos-sandbox:local`.
# A short image GC delay can delete these between runs, causing confusing "Failed to pull"
# crash loops. Keep this comfortably long; tighten it for CI/production if needed.
image_delay = "24h"
container = true
}
}
}


@@ -1,31 +0,0 @@
# Nomad Configuration for Singularity/Apptainer Sandbox
# Run with: nomad agent -dev -config=nomad-singularity.hcl
#
# This uses the raw_exec driver to run Apptainer containers.
# Suitable for HPC environments where Docker cannot run without sudo.
client {
enabled = true
options {
# Enable raw_exec driver for Singularity/Apptainer
"driver.raw_exec.enable" = "1"
}
}
# raw_exec driver plugin configuration
plugin "raw_exec" {
config {
enabled = true
}
}
# Optional: If you have the nomad-driver-singularity plugin installed,
# uncomment the following instead of using raw_exec:
# plugin "singularity" {
# config {
# enabled = true
# # Allow bind mounts
# bind_paths = ["/tmp", "/var/tmp"]
# }
# }

package-lock.json (generated, 77 lines)

@@ -1,77 +0,0 @@
{
"name": "hermes-agent",
"version": "1.0.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "hermes-agent",
"version": "1.0.0",
"hasInstallScript": true,
"license": "MIT",
"dependencies": {
"agent-browser": "^0.7.6"
},
"engines": {
"node": ">=18.0.0"
}
},
"node_modules/agent-browser": {
"version": "0.7.6",
"resolved": "https://registry.npmjs.org/agent-browser/-/agent-browser-0.7.6.tgz",
"integrity": "sha512-BDmzFlTM0siqn5P8LSBxgOBUNGv02Vo7RYztvXXjNOwQ+8rFJILWfBPxmw+57l/PcMst61AscjIe8uZ5sWrRZQ==",
"hasInstallScript": true,
"license": "Apache-2.0",
"dependencies": {
"playwright-core": "^1.57.0",
"ws": "^8.19.0",
"zod": "^3.22.4"
},
"bin": {
"agent-browser": "bin/agent-browser"
}
},
"node_modules/playwright-core": {
"version": "1.58.0",
"resolved": "https://registry.npmjs.org/playwright-core/-/playwright-core-1.58.0.tgz",
"integrity": "sha512-aaoB1RWrdNi3//rOeKuMiS65UCcgOVljU46At6eFcOFPFHWtd2weHRRow6z/n+Lec0Lvu0k9ZPKJSjPugikirw==",
"license": "Apache-2.0",
"bin": {
"playwright-core": "cli.js"
},
"engines": {
"node": ">=18"
}
},
"node_modules/ws": {
"version": "8.19.0",
"resolved": "https://registry.npmjs.org/ws/-/ws-8.19.0.tgz",
"integrity": "sha512-blAT2mjOEIi0ZzruJfIhb3nps74PRWTCz1IjglWEEpQl5XS/UNama6u2/rjFkDDouqr4L67ry+1aGIALViWjDg==",
"license": "MIT",
"engines": {
"node": ">=10.0.0"
},
"peerDependencies": {
"bufferutil": "^4.0.1",
"utf-8-validate": ">=5.0.2"
},
"peerDependenciesMeta": {
"bufferutil": {
"optional": true
},
"utf-8-validate": {
"optional": true
}
}
},
"node_modules/zod": {
"version": "3.25.76",
"resolved": "https://registry.npmjs.org/zod/-/zod-3.25.76.tgz",
"integrity": "sha512-gzUt/qt81nXsFGKIFcC3YnfEAx5NkunCfnDlvuBSSFS02bcXu4Lmea0AFIUwbLWxWPx3d9p8S5QoaujKcNQxcQ==",
"license": "MIT",
"funding": {
"url": "https://github.com/sponsors/colinhacks"
}
}
}
}


@@ -1,24 +0,0 @@
{
"name": "hermes-agent",
"version": "1.0.0",
"description": "An AI agent with advanced tool-calling capabilities, featuring a flexible toolsets system for organizing and managing tools.",
"private": true,
"scripts": {
"postinstall": "echo '✅ Browser tools ready. Run: python run_agent.py --help'"
},
"repository": {
"type": "git",
"url": "git+https://github.com/NousResearch/Hermes-Agent.git"
},
"license": "MIT",
"bugs": {
"url": "https://github.com/NousResearch/Hermes-Agent/issues"
},
"homepage": "https://github.com/NousResearch/Hermes-Agent#readme",
"dependencies": {
"agent-browser": "^0.7.6"
},
"engines": {
"node": ">=18.0.0"
}
}

profiling.py (new file, 381 lines)

@@ -0,0 +1,381 @@
"""
Profiling module for tracking timing statistics of tools and LLM API calls.
This module provides a centralized way to track timing information for various
operations in the agent system, including:
- Individual tool executions
- OpenAI API calls
- Aggregate statistics (min, max, median, mean, total)
"""
import time
from typing import Dict, List, Optional
from dataclasses import dataclass, field
from collections import defaultdict
import statistics
@dataclass
class ProfilingStats:
"""Statistics for a particular operation type."""
call_count: int = 0
total_time: float = 0.0
min_time: float = float('inf')
max_time: float = 0.0
times: List[float] = field(default_factory=list)
def add_timing(self, duration: float):
"""Add a timing measurement."""
self.call_count += 1
self.total_time += duration
self.min_time = min(self.min_time, duration)
self.max_time = max(self.max_time, duration)
self.times.append(duration)
@property
def mean_time(self) -> float:
"""Calculate mean time."""
return self.total_time / self.call_count if self.call_count > 0 else 0.0
@property
def median_time(self) -> float:
"""Calculate median time."""
return statistics.median(self.times) if self.times else 0.0
def to_dict(self) -> Dict:
"""Convert to dictionary for serialization."""
return {
"call_count": self.call_count,
"total_time": self.total_time,
"min_time": self.min_time if self.min_time != float('inf') else 0.0,
"max_time": self.max_time,
"mean_time": self.mean_time,
"median_time": self.median_time
}
class Profiler:
"""
Global profiler for tracking timing statistics across tools and API calls.
Usage:
profiler = Profiler()
# Time a tool execution
with profiler.time_tool("web_search"):
# ... tool execution code ...
pass
# Time an API call
with profiler.time_api_call():
# ... API call code ...
pass
# Get statistics
stats = profiler.get_statistics()
"""
def __init__(self):
"""Initialize the profiler."""
self.tool_stats: Dict[str, ProfilingStats] = defaultdict(ProfilingStats)
self.api_stats: ProfilingStats = ProfilingStats()
self._enabled = True
def enable(self):
"""Enable profiling."""
self._enabled = True
def disable(self):
"""Disable profiling."""
self._enabled = False
def reset(self):
"""Reset all profiling data."""
self.tool_stats.clear()
self.api_stats = ProfilingStats()
def record_tool_timing(self, tool_name: str, duration: float):
"""Record timing for a tool execution."""
if self._enabled:
self.tool_stats[tool_name].add_timing(duration)
def record_api_timing(self, duration: float):
"""Record timing for an API call."""
if self._enabled:
self.api_stats.add_timing(duration)
def get_statistics(self) -> Dict:
"""
Get all profiling statistics.
Returns:
Dictionary containing tool and API statistics
"""
return {
"tools": {
tool_name: stats.to_dict()
for tool_name, stats in sorted(self.tool_stats.items())
},
"api_calls": self.api_stats.to_dict()
}
def print_statistics(self, detailed: bool = True):
"""
Print profiling statistics in a readable format.
Args:
detailed: If True, show per-tool breakdown. If False, show summary only.
"""
print("\n" + "="*80)
print("📊 PROFILING STATISTICS")
print("="*80)
# API Call Statistics
print("\n🔷 OpenAI API Calls:")
if self.api_stats.call_count > 0:
api_dict = self.api_stats.to_dict()
print(f" Total Calls: {api_dict['call_count']}")
print(f" Total Time: {api_dict['total_time']:.2f}s")
print(f" Min Time: {api_dict['min_time']:.2f}s")
print(f" Max Time: {api_dict['max_time']:.2f}s")
print(f" Mean Time: {api_dict['mean_time']:.2f}s")
print(f" Median Time: {api_dict['median_time']:.2f}s")
else:
print(" No API calls recorded")
# Tool Statistics
print("\n🔧 Tool Executions:")
if self.tool_stats:
if detailed:
for tool_name in sorted(self.tool_stats.keys()):
stats_dict = self.tool_stats[tool_name].to_dict()
print(f"\n 📌 {tool_name}:")
print(f" Total Calls: {stats_dict['call_count']}")
print(f" Total Time: {stats_dict['total_time']:.2f}s")
print(f" Min Time: {stats_dict['min_time']:.2f}s")
print(f" Max Time: {stats_dict['max_time']:.2f}s")
print(f" Mean Time: {stats_dict['mean_time']:.2f}s")
print(f" Median Time: {stats_dict['median_time']:.2f}s")
# Summary
total_tool_calls = sum(s.call_count for s in self.tool_stats.values())
total_tool_time = sum(s.total_time for s in self.tool_stats.values())
print(f"\n 📊 Summary:")
print(f" Total Tool Calls: {total_tool_calls}")
print(f" Total Tool Time: {total_tool_time:.2f}s")
print(f" Unique Tools Used: {len(self.tool_stats)}")
else:
print(" No tool executions recorded")
# Overall Summary
total_api_time = self.api_stats.total_time
total_tool_time = sum(s.total_time for s in self.tool_stats.values())
print(f"\n📈 Overall Summary:")
print(f" Total API Time: {total_api_time:.2f}s")
print(f" Total Tool Time: {total_tool_time:.2f}s")
print(f" Total Time: {total_api_time + total_tool_time:.2f}s")
print("="*80 + "\n")
def export_to_json(self) -> str:
"""Export statistics as JSON string."""
import json
return json.dumps(self.get_statistics(), indent=2)
def export_to_file(self, filepath: str):
"""
Export statistics to a JSON file.
Args:
filepath: Path to output file
"""
import json
with open(filepath, 'w') as f:
json.dump(self.get_statistics(), f, indent=2)
print(f"📁 Profiling statistics exported to: {filepath}")
# Global profiler instance
_global_profiler: Optional[Profiler] = None
def get_profiler() -> Profiler:
"""Get or create the global profiler instance."""
global _global_profiler
if _global_profiler is None:
_global_profiler = Profiler()
return _global_profiler
def reset_profiler():
"""Reset the global profiler."""
global _global_profiler
if _global_profiler is not None:
_global_profiler.reset()
class TimingContext:
"""Context manager for timing operations."""
def __init__(self, profiler: Profiler, operation_type: str, operation_name: Optional[str] = None):
"""
Initialize timing context.
Args:
profiler: Profiler instance to record timing
operation_type: 'tool' or 'api'
operation_name: Name of the operation (required for tools)
"""
self.profiler = profiler
self.operation_type = operation_type
self.operation_name = operation_name
self.start_time = None
def __enter__(self):
"""Start timing."""
self.start_time = time.time()
return self
def __exit__(self, exc_type, exc_val, exc_tb):
"""Stop timing and record."""
duration = time.time() - self.start_time
if self.operation_type == 'tool':
self.profiler.record_tool_timing(self.operation_name, duration)
elif self.operation_type == 'api':
self.profiler.record_api_timing(duration)
return False # Don't suppress exceptions
def aggregate_profiling_stats(stats_list: List[Dict]) -> Dict:
"""
Aggregate multiple profiling statistics dictionaries into one.
This is useful for batch processing where each worker process has its own
profiler instance that needs to be combined.
Args:
stats_list: List of statistics dictionaries from get_statistics()
Returns:
Dict: Aggregated statistics with combined tool and API call data
"""
aggregated = {
"tools": defaultdict(lambda: {"times": []}),
"api_calls": {"times": []}
}
# Aggregate tool statistics
for stats in stats_list:
# Aggregate tool timings
for tool_name, tool_stats in stats.get("tools", {}).items():
# Reconstruct individual timings from aggregated stats
# Since we have mean_time and call_count, we approximate
aggregated["tools"][tool_name]["times"].extend(
[tool_stats.get("mean_time", 0.0)] * tool_stats.get("call_count", 0)
)
# Aggregate API call timings
api_stats = stats.get("api_calls", {})
if api_stats.get("call_count", 0) > 0:
aggregated["api_calls"]["times"].extend(
[api_stats.get("mean_time", 0.0)] * api_stats.get("call_count", 0)
)
# Calculate final statistics for tools
final_stats = {"tools": {}, "api_calls": {}}
for tool_name, data in aggregated["tools"].items():
times = data["times"]
if times:
final_stats["tools"][tool_name] = {
"call_count": len(times),
"total_time": sum(times),
"min_time": min(times),
"max_time": max(times),
"mean_time": statistics.mean(times),
"median_time": statistics.median(times)
}
# Calculate final statistics for API calls
api_times = aggregated["api_calls"]["times"]
if api_times:
final_stats["api_calls"] = {
"call_count": len(api_times),
"total_time": sum(api_times),
"min_time": min(api_times),
"max_time": max(api_times),
"mean_time": statistics.mean(api_times),
"median_time": statistics.median(api_times)
}
else:
final_stats["api_calls"] = {
"call_count": 0,
"total_time": 0.0,
"min_time": 0.0,
"max_time": 0.0,
"mean_time": 0.0,
"median_time": 0.0
}
return final_stats
def print_aggregated_statistics(stats: Dict, detailed: bool = True):
"""
Print aggregated profiling statistics in a readable format.
Args:
stats: Aggregated statistics dictionary from aggregate_profiling_stats()
detailed: If True, show per-tool breakdown. If False, show summary only.
"""
print("\n" + "="*80)
print("📊 AGGREGATED PROFILING STATISTICS")
print("="*80)
# API Call Statistics
print("\n🔷 OpenAI API Calls:")
api_stats = stats.get("api_calls", {})
if api_stats.get("call_count", 0) > 0:
print(f" Total Calls: {api_stats['call_count']}")
print(f" Total Time: {api_stats['total_time']:.2f}s")
print(f" Min Time: {api_stats['min_time']:.2f}s")
print(f" Max Time: {api_stats['max_time']:.2f}s")
print(f" Mean Time: {api_stats['mean_time']:.2f}s")
print(f" Median Time: {api_stats['median_time']:.2f}s")
else:
print(" No API calls recorded")
# Tool Statistics
print("\n🔧 Tool Executions:")
tool_stats = stats.get("tools", {})
if tool_stats:
if detailed:
for tool_name in sorted(tool_stats.keys()):
stats_dict = tool_stats[tool_name]
print(f"\n 📌 {tool_name}:")
print(f" Total Calls: {stats_dict['call_count']}")
print(f" Total Time: {stats_dict['total_time']:.2f}s")
print(f" Min Time: {stats_dict['min_time']:.2f}s")
print(f" Max Time: {stats_dict['max_time']:.2f}s")
print(f" Mean Time: {stats_dict['mean_time']:.2f}s")
print(f" Median Time: {stats_dict['median_time']:.2f}s")
# Summary
total_tool_calls = sum(s["call_count"] for s in tool_stats.values())
total_tool_time = sum(s["total_time"] for s in tool_stats.values())
print(f"\n 📊 Summary:")
print(f" Total Tool Calls: {total_tool_calls}")
print(f" Total Tool Time: {total_tool_time:.2f}s")
print(f" Unique Tools Used: {len(tool_stats)}")
else:
print(" No tool executions recorded")
# Overall Summary
total_api_time = api_stats.get("total_time", 0.0)
total_tool_time = sum(s["total_time"] for s in tool_stats.values())
print(f"\n📈 Overall Summary:")
print(f" Total API Time: {total_api_time:.2f}s")
print(f" Total Tool Time: {total_tool_time:.2f}s")
print(f" Total Time: {total_api_time + total_tool_time:.2f}s")
print("="*80 + "\n")


@@ -8,59 +8,21 @@ version = "0.1.0"
description = "AI agent with advanced tool-calling and toolsets"
readme = "README.md"
requires-python = ">=3.10"
authors = [{ name = "Nous Research" }]
authors = [{ name = "Hermes Agent" }]
license = { text = "MIT" }
dependencies = [
# Core
"openai",
"python-dotenv",
"fire",
"httpx",
"rich",
"tenacity",
"pyyaml",
"prompt_toolkit",
"requests",
"jinja2",
"pydantic>=2.0",
# Tools
"firecrawl-py",
"openai",
"fal-client",
# mini-swe-agent deps (terminal tool)
"litellm>=1.75.5",
"typer",
"platformdirs",
]
[project.optional-dependencies]
modal = ["modal", "boto3"]
dev = ["pytest", "pytest-asyncio"]
# Install Atropos from source (PyPI is often stale for this internal dependency).
atropos = [
"atroposlib @ git+https://github.com/NousResearch/atropos.git",
# Atropos integration runtime deps (kept optional for Hermes-only users)
"aiohttp",
"fastapi",
"uvicorn",
"pyte",
"python-dotenv",
"fire"
]
[project.scripts]
hermes-agent = "run_agent:main"
hermes-atropos-sandbox-smoke = "atropos.envs.sandbox_terminal_smoke_env:SandboxTerminalSmokeEnv.cli"
hermes-atropos-toolserver-smoke = "atropos.envs.toolserver_smoke_env:ToolServerSmokeEnv.cli"
[tool.setuptools]
py-modules = [
"run_agent",
"model_tools",
"toolsets",
"batch_runner",
"trajectory_compressor",
"toolset_distributions",
"atropos_compatible_agent",
"local_server",
]
py-modules = ["run_agent", "model_tools", "toolsets"]
[tool.setuptools.packages.find]
include = ["tools", "atropos", "atropos.*"]
include = ["tools"]


@@ -1,34 +1,6 @@
# Core dependencies
firecrawl-py
openai
fal-client
python-dotenv
fire
httpx
rich
tenacity
prompt_toolkit
# Web tools
firecrawl-py
# Image generation
fal-client
# mini-swe-agent dependencies (for terminal tool)
# Note: Install mini-swe-agent itself with: pip install -e ./mini-swe-agent
pyyaml
requests
jinja2
pydantic>=2.0
litellm>=1.75.5
typer
platformdirs
# Optional: For Docker backend (recommended)
# Requires Docker installed and user in 'docker' group
# Optional: For Modal backend (cloud execution)
# modal
# boto3
# Optional: Legacy Hecate terminal backend
# git+ssh://git@github.com/NousResearch/hecate.git
httpx

File diff suppressed because it is too large.

run_datagen_images.sh (new file, 12 lines)

@@ -0,0 +1,12 @@
python batch_runner.py \
--dataset_file="hermes-agent-imagen-data/hermes_agent_imagen_eval.jsonl" \
--batch_size=10 \
--run_name="imagen_eval_gpt5" \
--distribution="image_gen" \
--model="gpt-5" \
--base_url="https://api.openai.com/v1" \
--api_key="${OPENAI_API_KEY}" \
--num_workers=4 \
--max_turns=5 \
--verbose \
--ephemeral_system_prompt="When generating an image for the user view the image by using the vision_analyze tool to ensure it is what the user wanted. If it isn't feel free to retry a few times. If none are perfect, choose the best option that is the closest match, and explain its imperfections. If the image generation tool fails, try again a few times. If the vision analyze tool fails, provide the image to the user and explain it is your best effort attempt."


@@ -0,0 +1,12 @@
python batch_runner.py \
--dataset_file="hermes-agent-megascience-data/hermes_agent_megascience_eval.jsonl" \
--batch_size=10 \
--run_name="megascience_eval_glm4-6-fixedterminal-2" \
--distribution="science" \
--model="z-ai/glm-4.6" \
--base_url="https://openrouter.ai/api/v1" \
--api_key="${OPENROUTER_API_KEY}" \
--num_workers=5 \
--max_turns=30 \
--verbose \
--ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build off of the results of prior tools you've used results. Always use a tool if it can provide additional context, verify formulas, double check concepts and recent studies and understanding, doing all calculations, etc. You should only be confident in your own reasoning, knowledge, or calculations if you've exhaustively used all tools available to you to that can help you verify or validate your work. Always pip install any packages you need to use the python scripts you want to run."

safe_print.py (new file, 20 lines)
View File

@@ -0,0 +1,20 @@
#!/usr/bin/env python3
"""Simple safe print that tries rich, falls back to regular print."""
try:
from rich import print as rich_print
RICH_AVAILABLE = True
except ImportError:
RICH_AVAILABLE = False
def safe_print(*args, **kwargs):
"""Try rich.print, fall back to regular print if it fails."""
if RICH_AVAILABLE:
try:
rich_print(*args, **kwargs)
return
except Exception:
pass
# Fallback to regular print
print(*args, **kwargs)
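A short usage example; the rich markup string is illustrative.

from safe_print import safe_print

# Renders with rich markup when rich is installed; otherwise the string
# (markup tags included) is printed verbatim by the built-in print.
safe_print("[bold green]done[/bold green]")
safe_print("plain text works either way")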


@@ -1,62 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# Launch a local llama.cpp OpenAI-compatible server running GLM-4.7-Flash (GGUF).
#
# Requires:
# - `llama-server` installed (e.g. `brew install llama.cpp`)
#
# Default settings are chosen to avoid clashing with Atropos sandbox_server
# (which commonly uses port 8080 in local dev).
#
# Usage:
# Hermes-Agent/scripts/launch_llama_cpp_glm47_flash.sh
#
# Override defaults:
# LLAMA_CPP_HOST=127.0.0.1 LLAMA_CPP_PORT=8082 \
# LLAMA_CPP_HF_REPO=ggml-org/GLM-4.7-Flash-GGUF \
# LLAMA_CPP_HF_FILE=GLM-4.7-Flash-Q4_K.gguf \
# Hermes-Agent/scripts/launch_llama_cpp_glm47_flash.sh
HOST="${LLAMA_CPP_HOST:-127.0.0.1}"
PORT="${LLAMA_CPP_PORT:-8080}"
HF_REPO="${LLAMA_CPP_HF_REPO:-ggml-org/GLM-4.7-Flash-GGUF}"
HF_FILE="${LLAMA_CPP_HF_FILE:-GLM-4.7-Flash-Q4_K.gguf}"
ALIAS="${LLAMA_CPP_ALIAS:-glm-4.7-flash}"
if ! command -v llama-server >/dev/null 2>&1; then
echo "Error: llama-server not found in PATH."
echo "Install via Homebrew: brew install llama.cpp"
exit 1
fi
echo "Launching llama.cpp server..."
echo " host: $HOST"
echo " port: $PORT"
echo " repo: $HF_REPO"
echo " file: $HF_FILE"
echo " alias: $ALIAS"
echo
echo "Suggested env vars for Hermes/Atropos integration:"
echo " export ATROPOS_SERVER_BASE_URL=http://${HOST}:${PORT}"
echo " export ATROPOS_SERVER_MODEL=${ALIAS}"
echo " export ATROPOS_SERVER_API_KEY=local"
echo
if command -v lsof >/dev/null 2>&1; then
if lsof -nP -iTCP:"$PORT" -sTCP:LISTEN >/dev/null 2>&1; then
echo "Error: port $PORT is already in use."
echo "Pick a different port, e.g.:"
echo " LLAMA_CPP_PORT=8082 Hermes-Agent/scripts/launch_llama_cpp_glm47_flash.sh"
exit 1
fi
fi
exec llama-server \
--host "$HOST" \
--port "$PORT" \
--hf-repo "$HF_REPO" \
--hf-file "$HF_FILE" \
--alias "$ALIAS" \
-c 32768 \
-n -1


@@ -1,70 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# Launch a local llama.cpp OpenAI-compatible server running Hermes 4.3 36B (GGUF).
#
# Requires:
# - `llama-server` installed (e.g. `brew install llama.cpp`)
#
# Note: Port choice can conflict with other local dev servers. If 8080 is already
# in use, override via `LLAMA_CPP_PORT=...`.
#
# Usage:
# Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
#
# Override defaults:
# LLAMA_CPP_HOST=127.0.0.1 LLAMA_CPP_PORT=8082 \
# LLAMA_CPP_HF_REPO=NousResearch/Hermes-4.3-36B-GGUF \
# LLAMA_CPP_HF_FILE=hermes-4_3_36b-Q4_K_M.gguf \
# LLAMA_CPP_ALIAS=hermes-4-36b \
# LLAMA_CPP_PARALLEL=4 LLAMA_CPP_THREADS_HTTP=4 \
# Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
HOST="${LLAMA_CPP_HOST:-127.0.0.1}"
PORT="${LLAMA_CPP_PORT:-8080}"
HF_REPO="${LLAMA_CPP_HF_REPO:-NousResearch/Hermes-4.3-36B-GGUF}"
HF_FILE="${LLAMA_CPP_HF_FILE:-hermes-4_3_36b-Q4_K_M.gguf}"
ALIAS="${LLAMA_CPP_ALIAS:-hermes-4-36b}"
PARALLEL="${LLAMA_CPP_PARALLEL:-4}"
THREADS_HTTP="${LLAMA_CPP_THREADS_HTTP:-4}"
if ! command -v llama-server >/dev/null 2>&1; then
echo "Error: llama-server not found in PATH."
echo "Install via Homebrew: brew install llama.cpp"
exit 1
fi
echo "Launching llama.cpp server..."
echo " host: $HOST"
echo " port: $PORT"
echo " repo: $HF_REPO"
echo " file: $HF_FILE"
echo " alias: $ALIAS"
echo " slots: $PARALLEL"
echo
echo "Suggested env vars for Hermes/Atropos integration:"
echo " export ATROPOS_SERVER_BASE_URL=http://${HOST}:${PORT}"
echo " export ATROPOS_SERVER_MODEL=${ALIAS}"
echo " export ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B"
echo " export ATROPOS_SERVER_API_KEY=local"
echo
if command -v lsof >/dev/null 2>&1; then
if lsof -nP -iTCP:"$PORT" -sTCP:LISTEN >/dev/null 2>&1; then
echo "Error: port $PORT is already in use."
echo "Pick a different port, e.g.:"
echo " LLAMA_CPP_PORT=8082 Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh"
exit 1
fi
fi
exec llama-server \
--host "$HOST" \
--port "$PORT" \
--hf-repo "$HF_REPO" \
--hf-file "$HF_FILE" \
--alias "$ALIAS" \
--parallel "$PARALLEL" \
--threads-http "$THREADS_HTTP" \
-c 32768 \
-n -1
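With either launch script running, a minimal client sketch against the OpenAI-compatible endpoint; the host, port, and alias below are the script defaults and must match any overrides.

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")

resp = client.chat.completions.create(
    model="hermes-4-36b",  # must match the --alias passed to llama-server
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)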


@@ -1,411 +0,0 @@
#!/usr/bin/env python3
"""
Sample and Compress HuggingFace Datasets
Downloads trajectories from multiple HuggingFace datasets, randomly samples them,
and runs trajectory compression to fit within a target token budget.
Usage:
python scripts/sample_and_compress.py
# Custom sample size
python scripts/sample_and_compress.py --total_samples=5000
# Custom output name
python scripts/sample_and_compress.py --output_name=compressed_16k
"""
import json
import random
import os
from pathlib import Path
from typing import List, Dict, Any, Tuple
import fire
# Load environment variables
from dotenv import load_dotenv
load_dotenv()
# Default datasets to sample from
DEFAULT_DATASETS = [
"NousResearch/swe-terminus-agent-glm-kimi-minimax",
"NousResearch/hermes-agent-megascience-sft1",
"NousResearch/Hermes-Agent-Thinking-GLM-4.7-SFT2",
"NousResearch/Hermes-Agent-Thinking-GLM-4.7-SFT1",
"NousResearch/terminal-tasks-glm-hermes-agent"
]
def load_dataset_from_hf(dataset_name: str) -> List[Dict[str, Any]]:
"""
Load a dataset from HuggingFace.
Args:
dataset_name: HuggingFace dataset name (e.g., "NousResearch/dataset-name")
Returns:
List of trajectory entries
"""
from datasets import load_dataset
print(f" Loading {dataset_name}...")
try:
# Try loading with default config
ds = load_dataset(dataset_name, split="train")
except Exception as e:
print(f" ⚠️ Error loading {dataset_name}: {e}")
return []
# Convert to list of dicts
entries = []
for item in ds:
# Handle different possible formats
if "conversations" in item:
entries.append({"conversations": item["conversations"]})
elif "messages" in item:
# Convert messages format to conversations format if needed
entries.append({"conversations": item["messages"]})
else:
# Assume the whole item is the entry
entries.append(dict(item))
print(f" ✅ Loaded {len(entries):,} entries from {dataset_name}")
return entries
# Global tokenizer for multiprocessing (set in worker init)
_TOKENIZER = None
def _init_tokenizer_worker(tokenizer_name: str):
"""Initialize tokenizer in worker process."""
global _TOKENIZER
from transformers import AutoTokenizer
_TOKENIZER = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
def _count_tokens_for_entry(entry: Dict) -> Tuple[Dict, int]:
"""
Count tokens for a single entry (used in parallel processing).
Args:
entry: Trajectory entry with 'conversations' field
Returns:
Tuple of (entry, token_count)
"""
global _TOKENIZER
conversations = entry.get("conversations", [])
if not conversations:
return entry, 0
total = 0
for turn in conversations:
value = turn.get("value", "")
if value:
try:
total += len(_TOKENIZER.encode(value))
except Exception:
# Fallback to character estimate
total += len(value) // 4
return entry, total
def sample_from_datasets(
datasets: List[str],
total_samples: int,
min_tokens: int = 16000,
tokenizer_name: str = "moonshotai/Kimi-K2-Thinking",
seed: int = 42,
num_proc: int = 8
) -> List[Dict[str, Any]]:
"""
Load all datasets, filter by token count, then randomly sample from combined pool.
Args:
datasets: List of HuggingFace dataset names
total_samples: Total number of samples to collect
min_tokens: Minimum token count to include (only sample trajectories >= this)
tokenizer_name: HuggingFace tokenizer for counting tokens
seed: Random seed for reproducibility
num_proc: Number of parallel processes for tokenization
Returns:
List of sampled trajectory entries
"""
from multiprocessing import Pool
from functools import partial
random.seed(seed)
print(f"\n📥 Loading {len(datasets)} datasets...")
print(f" Minimum tokens: {min_tokens:,} (filtering smaller trajectories)")
print(f" Parallel workers: {num_proc}")
print()
# Load ALL entries from all datasets into one pool
all_entries = []
for dataset_name in datasets:
entries = load_dataset_from_hf(dataset_name)
if not entries:
print(f" ⚠️ Skipping {dataset_name} (no entries loaded)")
continue
# Add source metadata to each entry
for entry in entries:
entry["_source_dataset"] = dataset_name
all_entries.extend(entries)
print(f"\n📊 Total entries loaded: {len(all_entries):,}")
# Filter by token count using parallel processing
print(f"\n🔍 Filtering trajectories with >= {min_tokens:,} tokens (using {num_proc} workers)...")
filtered_entries = []
token_counts = []
# Use multiprocessing for token counting
with Pool(
processes=num_proc,
initializer=_init_tokenizer_worker,
initargs=(tokenizer_name,)
) as pool:
# Process in chunks and show progress
chunk_size = 1000
processed = 0
for result in pool.imap_unordered(_count_tokens_for_entry, all_entries, chunksize=100):
entry, token_count = result
processed += 1
if processed % chunk_size == 0:
print(f" Processed {processed:,}/{len(all_entries):,}...", end="\r")
if token_count >= min_tokens:
entry["_original_tokens"] = token_count
filtered_entries.append(entry)
token_counts.append(token_count)
print(f"\n ✅ Found {len(filtered_entries):,} trajectories >= {min_tokens:,} tokens")
if token_counts:
avg_tokens = sum(token_counts) / len(token_counts)
print(f" 📈 Token stats: min={min(token_counts):,}, max={max(token_counts):,}, avg={avg_tokens:,.0f}")
# Random sample from the filtered pool
if len(filtered_entries) <= total_samples:
print(f"\n⚠️ Only {len(filtered_entries):,} trajectories available, using all of them")
sampled = filtered_entries
else:
sampled = random.sample(filtered_entries, total_samples)
print(f"\n✅ Randomly sampled {len(sampled):,} trajectories from pool of {len(filtered_entries):,}")
# Show source distribution
source_counts = {}
for entry in sampled:
source = entry.get("_source_dataset", "unknown").split("/")[-1]
source_counts[source] = source_counts.get(source, 0) + 1
print(f"\n📌 Sample distribution by source:")
for source, count in sorted(source_counts.items()):
print(f" {source}: {count:,}")
# Shuffle
random.shuffle(sampled)
return sampled
def save_samples_for_compression(
samples: List[Dict[str, Any]],
output_dir: Path,
batch_size: int = 100
):
"""
Save samples to JSONL files for trajectory compression.
Args:
samples: List of trajectory entries
output_dir: Directory to save JSONL files
batch_size: Number of entries per file
"""
output_dir.mkdir(parents=True, exist_ok=True)
# Split into batches
num_batches = (len(samples) + batch_size - 1) // batch_size
print(f"\n💾 Saving {len(samples)} samples to {output_dir}")
print(f" Batch size: {batch_size}, Total batches: {num_batches}")
for i in range(num_batches):
start_idx = i * batch_size
end_idx = min((i + 1) * batch_size, len(samples))
batch = samples[start_idx:end_idx]
output_file = output_dir / f"batch_{i}.jsonl"
with open(output_file, 'w', encoding='utf-8') as f:
for entry in batch:
f.write(json.dumps(entry, ensure_ascii=False) + '\n')
print(f" ✅ Saved {num_batches} batch files")
def run_compression(input_dir: Path, output_dir: Path, config_path: str):
"""
Run trajectory compression on the sampled data.
Args:
input_dir: Directory containing JSONL files to compress
output_dir: Directory for compressed output
config_path: Path to compression config YAML
"""
# Import the compressor
import sys
sys.path.insert(0, str(Path(__file__).parent.parent))
from trajectory_compressor import TrajectoryCompressor, CompressionConfig
print(f"\n🗜️ Running trajectory compression...")
print(f" Input: {input_dir}")
print(f" Output: {output_dir}")
print(f" Config: {config_path}")
# Load config
config = CompressionConfig.from_yaml(config_path)
# Initialize compressor
compressor = TrajectoryCompressor(config)
# Run compression
compressor.process_directory(input_dir, output_dir)
def merge_output_to_single_jsonl(input_dir: Path, output_file: Path):
"""
Merge all JSONL files in a directory into a single JSONL file.
Args:
input_dir: Directory containing JSONL files
output_file: Output JSONL file path
"""
print(f"\n📦 Merging output files into {output_file.name}...")
all_entries = []
for jsonl_file in sorted(input_dir.glob("*.jsonl")):
if jsonl_file.name == output_file.name:
continue
with open(jsonl_file, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if line:
all_entries.append(json.loads(line))
# Write merged file
with open(output_file, 'w', encoding='utf-8') as f:
for entry in all_entries:
f.write(json.dumps(entry, ensure_ascii=False) + '\n')
print(f" ✅ Merged {len(all_entries):,} entries into {output_file.name}")
return output_file
def main(
total_samples: int = 2500,
output_name: str = "compressed_agentic",
datasets: str = None,
config: str = "configs/trajectory_compression.yaml",
seed: int = 42,
batch_size: int = 100,
min_tokens: int = 16000,
num_proc: int = 8,
skip_download: bool = False,
):
"""
Sample trajectories from HuggingFace datasets and run compression.
Args:
total_samples: Total number of samples to collect (default: 2500)
output_name: Name for output directory/file (default: "compressed_agentic")
datasets: Comma-separated list of dataset names (uses defaults if not provided)
config: Path to compression config YAML
seed: Random seed for reproducibility
batch_size: Number of entries per JSONL file during processing
min_tokens: Minimum token count to filter trajectories (default: 16000)
num_proc: Number of parallel workers for tokenization (default: 8)
skip_download: Skip download and use existing sampled data
"""
print("=" * 70)
print("📊 TRAJECTORY SAMPLING AND COMPRESSION")
print("=" * 70)
# Parse datasets
if datasets:
dataset_list = [d.strip() for d in datasets.split(",")]
else:
dataset_list = DEFAULT_DATASETS
print(f"\n📋 Configuration:")
print(f" Total samples: {total_samples:,}")
print(f" Min tokens filter: {min_tokens:,}")
print(f" Parallel workers: {num_proc}")
print(f" Datasets: {len(dataset_list)}")
for ds in dataset_list:
print(f" - {ds}")
print(f" Output name: {output_name}")
print(f" Config: {config}")
print(f" Seed: {seed}")
# Setup paths
base_dir = Path(__file__).parent.parent
sampled_dir = base_dir / "data" / f"{output_name}_raw"
compressed_dir = base_dir / "data" / f"{output_name}_batches"
final_output = base_dir / "data" / f"{output_name}.jsonl"
if not skip_download:
# Step 1: Download, filter by token count, and sample from combined pool
samples = sample_from_datasets(
dataset_list,
total_samples,
min_tokens=min_tokens,
seed=seed,
num_proc=num_proc
)
if not samples:
print("❌ No samples collected. Exiting.")
return
# Step 2: Save to JSONL files
save_samples_for_compression(samples, sampled_dir, batch_size)
else:
print(f"\n⏭️ Skipping download, using existing data in {sampled_dir}")
# Step 3: Run compression
config_path = base_dir / config
if not config_path.exists():
print(f"❌ Config not found: {config_path}")
return
run_compression(sampled_dir, compressed_dir, str(config_path))
# Step 4: Merge into single JSONL file
merge_output_to_single_jsonl(compressed_dir, final_output)
print("\n" + "=" * 70)
print("✅ COMPLETE!")
print("=" * 70)
print(f"\n📁 Raw samples: {sampled_dir}")
print(f"📁 Compressed batches: {compressed_dir}")
print(f"📁 Final output: {final_output}")
print(f"\nTo upload to HuggingFace:")
print(f" huggingface-cli upload NousResearch/{output_name} {final_output}")
if __name__ == "__main__":
fire.Fire(main)
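The worker-initializer tokenization pattern above is worth isolating: the tokenizer is loaded once per process rather than once per entry. A self-contained sketch, with an illustrative tokenizer name:

from multiprocessing import Pool

from transformers import AutoTokenizer

_TOK = None

def _init(name: str):
    global _TOK
    _TOK = AutoTokenizer.from_pretrained(name, trust_remote_code=True)

def _count(text: str) -> int:
    return len(_TOK.encode(text))

if __name__ == "__main__":
    texts = ["hello world"] * 8
    with Pool(processes=2, initializer=_init,
              initargs=("moonshotai/Kimi-K2-Thinking",)) as pool:
        print(sum(pool.imap_unordered(_count, texts, chunksize=4)))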


@@ -1,149 +0,0 @@
#!/bin/bash
# Hermes Agent Setup Script
# Automated setup for all dependencies and configuration
set -e
echo "========================================="
echo "Hermes Agent Setup"
echo "========================================="
echo ""
# Change to hermes-agent directory
cd /home/teknium/hermes-agent
# Check Python version
echo "[1/10] Checking Python version..."
python_version=$(python3 --version | cut -d' ' -f2 | cut -d'.' -f1,2)
echo "✓ Python $python_version detected"
echo ""
# Install uv
echo "[2/10] Installing uv (fast Python package installer)..."
if ! command -v uv &> /dev/null; then
echo "Installing uv..."
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.cargo/bin:$PATH"
echo "✓ uv installed"
else
echo "✓ uv already installed: $(uv --version)"
fi
echo ""
# Install Node.js 20 using NodeSource
echo "[3/10] Installing Node.js 20..."
if ! command -v node &> /dev/null || [[ $(node --version | cut -d'v' -f2 | cut -d'.' -f1) -lt 20 ]]; then
echo "Installing Node.js 20 LTS..."
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs
echo "✓ Node.js installed"
else
echo "✓ Node.js 20+ already installed: $(node --version)"
fi
echo ""
# Initialize git submodules
echo "[4/10] Initializing git submodules..."
git submodule update --init --recursive
echo "✓ Submodules initialized"
echo ""
# Create Python virtual environment with uv
echo "[5/10] Creating Python virtual environment with uv..."
if [ -d "venv" ]; then
echo "Virtual environment already exists, skipping..."
else
uv venv venv
echo "✓ Virtual environment created with uv"
fi
echo ""
# Activate virtual environment and install Python packages with uv
echo "[6/10] Installing Python dependencies with uv..."
source venv/bin/activate
uv pip install -r requirements.txt
echo "✓ Python packages installed"
echo ""
# Install mini-swe-agent with uv
echo "[7/10] Installing mini-swe-agent..."
uv pip install -e ./mini-swe-agent
echo "✓ mini-swe-agent installed"
echo ""
# Install Node.js dependencies
echo "[8/10] Installing Node.js dependencies..."
npm install
echo "✓ Node.js packages installed"
echo ""
# Set up environment file
echo "[9/10] Setting up environment configuration..."
if [ -f ".env" ]; then
echo ".env file already exists, creating backup..."
cp .env .env.backup.$(date +%Y%m%d_%H%M%S)
fi
cp .env.example .env
echo "✓ .env file created from .env.example"
echo ""
# Set up CLI config
echo "[10/10] Setting up CLI configuration..."
if [ ! -f "cli-config.yaml" ]; then
cp cli-config.yaml.example cli-config.yaml
echo "✓ cli-config.yaml created from example"
else
echo "cli-config.yaml already exists, skipping..."
fi
echo ""
# Show Node.js and Python versions
echo "========================================="
echo "Setup Complete!"
echo "========================================="
echo ""
echo "Installed versions:"
echo " Node.js: $(node --version)"
echo " npm: $(npm --version)"
echo " Python: $(python3 --version)"
echo " uv: $(uv --version)"
echo ""
echo "========================================="
echo "Next Steps:"
echo "========================================="
echo ""
echo "1. Configure API Keys in .env file:"
echo " nano .env"
echo ""
echo " Required API keys:"
echo " - OPENROUTER_API_KEY (https://openrouter.ai/keys)"
echo " - FIRECRAWL_API_KEY (https://firecrawl.dev/)"
echo " - NOUS_API_KEY (https://inference-api.nousresearch.com/)"
echo " - FAL_KEY (https://fal.ai/)"
echo ""
echo " Optional API keys:"
echo " - BROWSERBASE_API_KEY (https://browserbase.com/)"
echo " - BROWSERBASE_PROJECT_ID"
echo ""
echo "2. Activate the virtual environment:"
echo " source venv/bin/activate"
echo ""
echo "3. Run the CLI:"
echo " ./hermes"
echo ""
echo "4. Or run a single query:"
echo " python run_agent.py --query \"your question here\""
echo ""
echo "5. List available tools:"
echo " python run_agent.py --list_tools"
echo ""
echo "========================================="
echo "Configuration Files:"
echo "========================================="
echo " .env - API keys and environment variables"
echo " cli-config.yaml - CLI settings and preferences"
echo ""
echo "For more information, see README.md"
echo ""

Some files were not shown because too many files have changed in this diff.