mirror of https://github.com/NousResearch/hermes-agent.git
synced 2026-04-28 23:11:37 +08:00

Compare commits: opencode-p ... endless-te (34 commits)

Commit SHA1s:
735723803f, 1472cc302d, 9c200abdb1, 9dc27880cd, 3b9c53e6db, 05dd31131f, 36ea883d45, 6be8cdeeca, 0bc914b00c, 411e7f8ff4, eb2e6b73fe, 664acf7426, fd1c3da305, 4d619bcd21, beac2ee06a, 487487406d, 87464821d8, 661d8f4d6c, bf13a848ef, 88286f6da3, 5b82190460, ea7aa0b0d4, 7130fa50cb, 5a9c98a771, 6cb4fe948a, 30221d8c20, b5b1fef20a, 16fb41f9cc, 4939130485, 8dccd6569e, db348dc467, 88722e230d, 68fb0efe0e, e38c274f8d
115 .clinerules Normal file
@@ -0,0 +1,115 @@
# Cline's Memory Bank

I am Cline, an expert software engineer with a unique characteristic: my memory resets completely between sessions. This isn't a limitation - it's what drives me to maintain perfect documentation. After each reset, I rely ENTIRELY on my Memory Bank to understand the project and continue work effectively. I MUST read ALL memory bank files at the start of EVERY task - this is not optional.

## Memory Bank Structure

The Memory Bank consists of core files and optional context files, all in Markdown format. Files build upon each other in a clear hierarchy:

```mermaid
flowchart TD
    PB[projectbrief.md] --> PC[productContext.md]
    PB --> SP[systemPatterns.md]
    PB --> TC[techContext.md]

    PC --> AC[activeContext.md]
    SP --> AC
    TC --> AC

    AC --> P[progress.md]
```

### Core Files (Required)

1. `projectbrief.md`
   - Foundation document that shapes all other files
   - Created at project start if it doesn't exist
   - Defines core requirements and goals
   - Source of truth for project scope

2. `productContext.md`
   - Why this project exists
   - Problems it solves
   - How it should work
   - User experience goals

3. `activeContext.md`
   - Current work focus
   - Recent changes
   - Next steps
   - Active decisions and considerations
   - Important patterns and preferences
   - Learnings and project insights

4. `systemPatterns.md`
   - System architecture
   - Key technical decisions
   - Design patterns in use
   - Component relationships
   - Critical implementation paths

5. `techContext.md`
   - Technologies used
   - Development setup
   - Technical constraints
   - Dependencies
   - Tool usage patterns

6. `progress.md`
   - What works
   - What's left to build
   - Current status
   - Known issues
   - Evolution of project decisions

### Additional Context

Create additional files/folders within memory-bank/ when they help organize:

- Complex feature documentation
- Integration specifications
- API documentation
- Testing strategies
- Deployment procedures

## Core Workflows

### Plan Mode

```mermaid
flowchart TD
    Start[Start] --> ReadFiles[Read Memory Bank]
    ReadFiles --> CheckFiles{Files Complete?}

    CheckFiles -->|No| Plan[Create Plan]
    Plan --> Document[Document in Chat]

    CheckFiles -->|Yes| Verify[Verify Context]
    Verify --> Strategy[Develop Strategy]
    Strategy --> Present[Present Approach]
```

### Act Mode

```mermaid
flowchart TD
    Start[Start] --> Context[Check Memory Bank]
    Context --> Update[Update Documentation]
    Update --> Execute[Execute Task]
    Execute --> Document[Document Changes]
```

## Documentation Updates

Memory Bank updates occur when:

1. Discovering new project patterns
2. After implementing significant changes
3. When user requests with **update memory bank** (MUST review ALL files)
4. When context needs clarification

```mermaid
flowchart TD
    Start[Update Process]

    subgraph Process
        P1[Review ALL Files]
        P2[Document Current State]
        P3[Clarify Next Steps]
        P4[Document Insights & Patterns]

        P1 --> P2 --> P3 --> P4
    end

    Start --> Process
```

Note: When triggered by **update memory bank**, I MUST review every memory bank file, even if some don't require updates. Focus particularly on activeContext.md and progress.md as they track current state.

REMEMBER: After every memory reset, I begin completely fresh. The Memory Bank is my only link to previous work. It must be maintained with precision and clarity, as my effectiveness depends entirely on its accuracy.
141 .env.example
@@ -1,12 +1,68 @@
# Hermes Agent Environment Configuration
# Copy this file to .env and fill in your API keys

# =============================================================================
# CORE SETTINGS
# =============================================================================
# Agent backend:
# - openai  : default Hermes-Agent loop (OpenAI function-calling via OpenAI SDK)
# - atropos : Atroposlib ServerManager/ManagedServer-backed loop (training/env integration)
HERMES_BACKEND=openai


# =============================================================================
# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
# =============================================================================
# For local development (matches the Atropos test env defaults):
# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
# ATROPOS_SERVER_MODEL=hermes-4-36b
# For hosted inference (Nous Research inference API):
ATROPOS_SERVER_BASE_URL=
ATROPOS_SERVER_MODEL=
ATROPOS_TOKENIZER_NAME=
# Set this to your Nous API key (Bearer token).
ATROPOS_SERVER_API_KEY=

# Debugging (prints to stdout; use with care)
# HERMES_DEBUG_ATROPOS_REQUEST=1
# HERMES_DEBUG_ATROPOS_RESPONSE=1
# HERMES_DEBUG_OPENAI_REQUEST=1
# HERMES_DEBUG_OPENAI_RESPONSE=1

# =============================================================================
# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
# =============================================================================
# If you set ATROPOS_SERVER_BASE_URL or OPENAI_BASE_URL, Hermes will use it instead
# of OpenRouter.
#
# Local server convenience (base URL without /v1):
# llama.cpp example (see `Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh`):
# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
# ATROPOS_SERVER_MODEL=hermes-4-36b
# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
# ATROPOS_SERVER_API_KEY=local
#
# Hosted Nous inference API:
# ATROPOS_SERVER_BASE_URL=https://inference-api.nousresearch.com
# ATROPOS_SERVER_MODEL=Hermes-4.3-36B
# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
# ATROPOS_SERVER_API_KEY=sk-... (Bearer token)
#
# If you plan to run GRPO-style group sampling (e.g. `--env.group_size 4`) against
# llama.cpp, start the server with at least that many slots, e.g.:
# LLAMA_CPP_PARALLEL=4 Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
#
# Generic OpenAI-compatible (base URL should include /v1):
# OPENAI_BASE_URL=http://127.0.0.1:8080/v1
# OPENAI_API_KEY=local

# =============================================================================
# LLM PROVIDER (OpenRouter)
# =============================================================================
# OpenRouter provides access to many models through one API
# All LLM calls go through OpenRouter - no direct provider keys needed
# Get your key at: https://openrouter.ai/keys
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENROUTER_API_KEY=

# Default model to use (OpenRouter format: provider/model)
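The comments above describe an endpoint precedence: a self-hosted ATROPOS_SERVER_BASE_URL (given without `/v1`) or a generic OPENAI_BASE_URL (given with `/v1`) is used instead of the OpenRouter default. A minimal sketch of that resolution logic; the helper name and exact rules are assumptions for illustration, not Hermes internals:

```python
# Hypothetical helper sketching the base-URL precedence described above.
def resolve_base_url(env: dict) -> str:
    atropos = env.get("ATROPOS_SERVER_BASE_URL")
    if atropos:
        # Local-server convention above: base URL is given WITHOUT /v1.
        return atropos.rstrip("/") + "/v1"
    generic = env.get("OPENAI_BASE_URL")
    if generic:
        # Generic convention above: base URL already includes /v1.
        return generic
    # Fall back to the OpenRouter default.
    return env.get("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")

print(resolve_base_url({"ATROPOS_SERVER_BASE_URL": "http://127.0.0.1:8080"}))
# → http://127.0.0.1:8080/v1
```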
@@ -92,12 +148,87 @@ TERMINAL_LIFETIME_SECONDS=300
# SUDO_PASSWORD=your_password_here

# =============================================================================
# MODAL CLOUD BACKEND (for TERMINAL_ENV=modal)
# =============================================================================
# Modal provides cloud sandboxes with per-second billing and auto-scaling.
# This implementation uses a warm pool of sandboxes for cost efficiency.
#
# SETUP:
#   pip install modal && modal setup
#   (Authenticates via browser, stores credentials locally)
#
# FEATURES:
# - Auto-scaling warm sandbox pool (no cold start after first use)
# - Named sandbox recovery (reconnects after restart)
# - Profile-based heterogeneous environments (CPU, GPU, different images)
# - Server-side idle_timeout protection against orphaned sandboxes

# Modal app name (groups all sandboxes, used for recovery)
TERMINAL_MODAL_APP_NAME=hermes-sandbox

# Default profile when none specified
TERMINAL_MODAL_DEFAULT_PROFILE=default

# Profile config file (optional - YAML format, see modal_profiles.yaml)
# TERMINAL_MODAL_PROFILES_FILE=modal_profiles.yaml

# --- Default Profile Settings (used if no YAML file) ---
# These apply when no profile is specified or for the "default" profile
TERMINAL_MODAL_IMAGE=python:3.11
TERMINAL_MODAL_MIN_POOL=1
TERMINAL_MODAL_MAX_POOL=5
TERMINAL_MODAL_IDLE_TIMEOUT=120
TERMINAL_MODAL_MAX_LIFETIME=3600
TERMINAL_MODAL_SCALE_DOWN_IDLE=180

# --- Custom Profile Example: pytorch-gpu ---
# Uncomment to enable a GPU profile for ML tasks
# Usage: terminal_tool("python train.py", profile="pytorch-gpu")
#
# TERMINAL_MODAL_PROFILE_pytorch_gpu_IMAGE=pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
# TERMINAL_MODAL_PROFILE_pytorch_gpu_GPU=T4
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MEMORY=16384
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MIN_POOL=0
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MAX_POOL=2
# TERMINAL_MODAL_PROFILE_pytorch_gpu_IDLE_TIMEOUT=60

# --- Custom Profile Example: node ---
# Uncomment to enable a Node.js profile
# Usage: terminal_tool("npm test", profile="node")
#
# TERMINAL_MODAL_PROFILE_node_IMAGE=node:18
# TERMINAL_MODAL_PROFILE_node_MIN_POOL=0
# TERMINAL_MODAL_PROFILE_node_MAX_POOL=3

# =============================================================================
# MODAL SECRETS (Secure credential injection)
# =============================================================================
# Modal Secrets allow you to securely pass API keys, passwords, and other
# sensitive data to your sandboxes without exposing them in code or logs.
#
# SETUP SECRETS:
# 1. Via Dashboard: https://modal.com/secrets
# 2. Via CLI: modal secret create my-secret KEY1=value1 KEY2=value2
# 3. Via CLI with env: modal secret create my-secret API_KEY="$API_KEY"
#
# LIST SECRETS:
#   modal secret list
#
# DELETE SECRETS:
#   modal secret delete my-secret

# Global secrets applied to ALL profiles (comma-separated secret names)
# These secrets must be created on Modal dashboard or via CLI first
# TERMINAL_MODAL_SECRETS=my-api-keys,database-creds

# Per-profile secrets (comma-separated secret names)
# TERMINAL_MODAL_PROFILE_pytorch_gpu_SECRETS=huggingface-token,wandb-key

# Per-profile environment variables (semicolon-separated KEY=VALUE pairs)
# TERMINAL_MODAL_PROFILE_default_ENV_VARS=DEBUG=1;LOG_LEVEL=info

# Load local .env file into sandbox (useful for development)
# TERMINAL_MODAL_PROFILE_default_USE_DOTENV=true
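The per-profile ENV_VARS setting above packs multiple variables into one value as semicolon-separated KEY=VALUE pairs. A small sketch of how such a value can be parsed; the helper name is illustrative, not the repo's actual code:

```python
# Hypothetical parser for the semicolon-separated KEY=VALUE format used by
# TERMINAL_MODAL_PROFILE_<name>_ENV_VARS (e.g. "DEBUG=1;LOG_LEVEL=info").
def parse_env_vars(spec: str) -> dict:
    pairs = {}
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate trailing/duplicate semicolons
        key, _, value = item.partition("=")  # keep any '=' inside the value
        pairs[key] = value
    return pairs

print(parse_env_vars("DEBUG=1;LOG_LEVEL=info"))
# → {'DEBUG': '1', 'LOG_LEVEL': 'info'}
```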
# =============================================================================
# BROWSER TOOL CONFIGURATION (agent-browser + Browserbase)
20 .gitignore vendored
@@ -46,3 +46,23 @@ testlogs
# CLI config (may contain sensitive SSH paths)
cli-config.yaml

.DS_Store

# artifacts
*.jsonl
*.html
*.json
*.log
*.csv

# Singularity/Apptainer images (large binary files)
*.sif

# Test files
test_singularity_*.py
test_*.py
!tests/test_*.py

# Nomad data
/tmp/NomadClient*/
131 README.md
@@ -995,6 +995,137 @@ All variables go in `~/.hermes/.env`. Run `hermes config set VAR value` to set t
---

## RL Training with Tinker

Hermes-Agent includes an RL training integration with [Tinker](https://thinkingmachines.ai/tinker/) (Thinking Machines) and [Atropos](https://github.com/NousResearch/atropos) for training language models with reinforcement learning from agent trajectories.

### Prerequisites

1. **Install with Atropos extras** (includes Tinker SDK, atroposlib, torch, wandb):
   ```bash
   pip install -e ".[atropos]"
   ```

2. **Initialize the tinker-atropos submodule**:
   ```bash
   git submodule update --init
   pip install -e ./tinker-atropos
   ```

3. **Get API keys**:
   - `TINKER_API_KEY` from [Tinker Console](https://tinker-console.thinkingmachines.ai/keys) (requires billing setup)
   - `WANDB_API_KEY` from [Weights & Biases](https://wandb.ai/settings) (for metrics tracking)

4. **Add keys to your `.env` file**:
   ```bash
   # Add to .env or ~/.hermes/.env
   TINKER_API_KEY=your_tinker_key
   WANDB_API_KEY=your_wandb_key
   ```

### Architecture

The RL training pipeline uses three processes that communicate over HTTP:

```
┌──────────────────────┐     ┌─────────────────────┐     ┌────────────────────────┐
│ Atropos Rollout API  │     │ Tinker Trainer      │     │ Environment            │
│ (port 8000)          │◄────│ (port 8001)         │◄────│ (worker)               │
│                      │     │                     │     │                        │
│ • Collects batches   │     │ • LoRA training     │     │ • Generates prompts    │
│ • Coordinates env    │     │ • Inference server  │     │ • Calls inference API  │
│   and trainer        │     │ • Weight updates    │     │ • Scores responses     │
│                      │     │ • WandB logging     │     │ • Sends scored batches │
└──────────────────────┘     └─────────────────────┘     └────────────────────────┘
```

### Quick Start: GSM8k Agent Training

This example trains a model on math problems using a Python REPL tool — the model learns to write and execute Python code to solve math:

```bash
# Terminal 1: Start Atropos Rollout API
cd tinker-atropos
source ../.venv/bin/activate
set -a && source ../.env && set +a
run-api

# Terminal 2: Start Tinker Trainer + Inference Server
cd tinker-atropos
source ../.venv/bin/activate
set -a && source ../.env && set +a
python launch_training.py --config configs/gsm8k_agent.yaml

# Terminal 3: Start GSM8k Agent Environment
cd tinker-atropos
source ../.venv/bin/activate
set -a && source ../.env && set +a
python tinker_atropos/environments/gsm8k_agent.py serve --config configs/gsm8k_agent.yaml
```

### Available Environments

| Environment | File | Description |
|------------|------|-------------|
| `gsm8k` | `gsm8k_tinker.py` | Standard GSM8k math (no tools) |
| `gsm8k_agent` | `gsm8k_agent.py` | GSM8k with Python REPL tool calling |

### Configuration

Configs are YAML files in `tinker-atropos/configs/` with three sections:

```yaml
env:                              # Atropos environment settings
  group_size: 4                   # Parallel rollouts per problem
  batch_size: 16                  # Training batch size
  tokenizer_name: "Qwen/Qwen3-4B-Instruct-2507"
  max_token_length: 2048          # Max generation length
  total_steps: 20                 # Training steps

openai:                           # Inference server (served by Tinker trainer)
  - model_name: "Qwen/Qwen3-4B-Instruct-2507"
    base_url: "http://localhost:8001/v1"

tinker:                           # Tinker training parameters
  lora_rank: 16                   # LoRA rank (lower = faster, less capacity)
  learning_rate: 0.00005          # Learning rate
  max_token_trainer_length: 4096  # Max tokens for training
  wandb_project: "hermes-agent-rl"
```
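The `group_size` setting controls how many parallel rollouts are scored per problem. A hedged sketch of the GRPO-style group-relative advantage normalization this implies, where each rollout's reward is normalized against its own group; this is illustrative only, not the actual Tinker/Atropos computation:

```python
import statistics

# Illustrative GRPO-style advantages: normalize each reward against the
# mean/std of its group of `group_size` rollouts for the same problem.
def group_advantages(rewards, group_size=4, eps=1e-6):
    advantages = []
    for i in range(0, len(rewards), group_size):
        group = rewards[i:i + group_size]
        mean = sum(group) / len(group)
        std = statistics.pstdev(group)  # population std; 0 for uniform groups
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages

# Two correct and two incorrect rollouts for one problem:
advs = group_advantages([1.0, 0.0, 0.0, 1.0], group_size=4)
print(advs)
```

With binary rewards the correct rollouts get advantage close to +1 and the incorrect ones close to -1, so the group's advantages sum to zero.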
### RL CLI (Agent-Driven Training)

For interactive training management via the Hermes agent:

```bash
# Interactive mode - let the agent manage training
python rl_cli.py --interactive

# List available environments
python rl_cli.py --list-environments

# Direct task
python rl_cli.py "Train a model on GSM8k with tool use"
```

### Sandbox Backends for Agent Environments

For agent environments that need isolated tool execution (e.g., SWE tasks), Hermes-Agent supports multiple sandbox backends:

| Backend | Use Case | Command |
|---------|----------|---------|
| **Nomad + Docker** | Default, local development | `--env.tool_pool_mode nomad` |
| **Nomad + Singularity** | HPC clusters without Docker | `--env.tool_pool_mode nomad --env.driver singularity` |
| **Modal** | Cloud-based, auto-scaling | `--env.tool_pool_mode modal` |

See [docs/MODAL_BACKEND.md](docs/MODAL_BACKEND.md) for Modal backend details.

### Cost

Check the [Tinker Rate Card](https://tinker-console.thinkingmachines.ai/rate-card) for available models and pricing.

---

## Troubleshooting
41 atropos/Dockerfile Normal file
@@ -0,0 +1,41 @@
# Dockerfile for atropos-agent sandbox server
# Runs inside Nomad containers to handle tool execution
# Includes bubblewrap for namespace-based slot isolation

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    # Bubblewrap for namespace isolation
    bubblewrap \
    # `script` for PTY allocation (used for stable tmux+asciinema startup)
    util-linux \
    # Git for SWE-style tasks (cloning repos)
    git \
    # tmux for stateful terminal sessions (Phase 4.7+)
    tmux \
    # Common tools agents might need
    curl \
    wget \
    jq \
    # Cleanup
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies (sandbox server + optional terminal recording)
RUN pip install --no-cache-dir aiohttp asciinema

# Copy the sandbox server
COPY sandbox_server.py /app/sandbox_server.py

WORKDIR /app

# Create data directory for slot workspaces
RUN mkdir -p /data

# Verify bubblewrap is installed and working
RUN bwrap --version

EXPOSE 8080

# Default command - can be overridden by Nomad job spec
CMD ["python", "sandbox_server.py", "--port", "8080", "--slots", "10", "--data-dir", "/data"]
46 atropos/__init__.py Normal file
@@ -0,0 +1,46 @@
"""
Atropos integration for Hermes-Agent.

This package is intentionally optional: Hermes-Agent should work without Atropos.
If you import anything from `atropos.*` without having `atroposlib` installed,
we raise a clear error with install instructions.

Install (recommended, from repo checkout):
    uv sync --extra atropos

Or (pip / editable):
    pip install -e '.[atropos]'
"""

from __future__ import annotations


def _require_atroposlib() -> None:
    try:
        import atroposlib  # noqa: F401
    except ModuleNotFoundError as exc:  # pragma: no cover
        raise ModuleNotFoundError(
            "Hermes-Agent Atropos integration requires `atroposlib`, but it is not installed.\n"
            "Install it with:\n"
            "  uv sync --extra atropos\n"
            "or:\n"
            "  pip install -e '.[atropos]'\n"
        ) from exc


_require_atroposlib()

# Re-export the most commonly used pieces for convenience.
from .agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData  # noqa: E402
from .envs import AgentEnv, AgentEnvConfig  # noqa: E402

__all__ = [
    "AtroposAgent",
    "AgentConfig",
    "AgentResult",
    "AgentStep",
    "SequenceData",
    "AgentEnv",
    "AgentEnvConfig",
]
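The guard pattern in `atropos/__init__.py` probes for an optional dependency at import time and fails with install instructions instead of a bare ImportError deep in a call stack. A generic sketch of the same pattern; `require` is an illustrative name, not the repo's API:

```python
import importlib

# Hypothetical optional-dependency guard, mirroring _require_atroposlib above:
# probe the module and, if missing, raise a clear error with an install hint.
def require(module: str, install_hint: str) -> None:
    try:
        importlib.import_module(module)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"`{module}` is required but not installed.\n"
            f"Install it with: {install_hint}"
        ) from exc

require("json", "(stdlib, always present)")  # present: returns silently

try:
    # A deliberately nonexistent module to show the failure mode.
    require("atroposlib_not_installed_example", "pip install -e '.[atropos]'")
    guard_fired = False
except ModuleNotFoundError as err:
    guard_fired = True
    print(err)
```

Running the guard at module import time (as the package does) means users see the install hint on `import atropos` rather than on first use.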
15 atropos/agent/__init__.py Normal file
@@ -0,0 +1,15 @@
"""
Agent abstractions for atropos-agent.

Provides the core AtroposAgent class for running ReACT-style agent loops.
"""

from .atropos_agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData

__all__ = [
    "AtroposAgent",
    "AgentConfig",
    "AgentResult",
    "AgentStep",
    "SequenceData",
]
850 atropos/agent/atropos_agent.py Normal file
@@ -0,0 +1,850 @@
"""
ReACT-style agent implementation for atropos-agent.

This module provides the core AtroposAgent class that implements a basic
Reason-Act-Observe loop with tool calling capabilities.

Uses ManagedServer from atroposlib for automatic token/logprob tracking,
making trajectories ready for RL training.

The agent uses Hermes-style XML tags for tool calls:
- <think>...</think> for reasoning
- <tool_call>{"name": "...", "arguments": {...}}</tool_call> for actions
- <tool_response>...</tool_response> for observations
"""

import asyncio
import json
import os
import time
from contextlib import asynccontextmanager
from dataclasses import dataclass, field
from typing import Any, AsyncGenerator, Awaitable, Callable, Dict, List, Optional, Union
from uuid import uuid4

import httpx
from dotenv import load_dotenv

from atroposlib.envs.server_handling.managed_server import ManagedServer

from ..tools import ToolCall, ToolRegistry, ToolResult

load_dotenv()


# Default system prompt with tool calling instructions.
AGENT_SYSTEM_PROMPT = """You are a deep thinking AI. You MUST enclose your internal reasoning inside <think>...</think> tags.

You are a function calling AI model.

You are provided with function signatures within <tools></tools> XML tags.
You must call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
You can ONLY respond without a tool call if you are totally certain you have the final answer to the user's question or task.
After calling & executing a function, you will be provided with function results within <tool_response></tool_response> XML tags.

Here are the available tools:
<tools>
{tools_json}
</tools>

Use the following JSON schema for each tool call you will make:
{"title": "FunctionCall", "type": "object", "properties": {"name": {"title": "Name", "type": "string"}, "arguments": {"title": "Arguments", "type": "object"}}, "required": ["name", "arguments"]}

## REQUIRED TOOL FORMAT

When you decide to call a tool, your assistant message MUST be:
1) exactly one <think>...</think> block, followed by
2) one or more <tool_call>...</tool_call> blocks,
and NOTHING else in that message.

If you need to explain anything, put it inside <think>. Do NOT write natural language outside <think> or <tool_call>.

For each function call return a JSON object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{"name": "<function-name>", "arguments": {"arg1": "value1"}}
</tool_call>

Each <tool_call> must be on its own and contain ONLY the JSON object (no extra text).
The JSON inside <tool_call> MUST be valid JSON with double quotes.

Do NOT output <tool_response> in an assistant message.

After you receive tool results, you may either call more tools (same required format) or provide the final answer.
When providing the final answer, do NOT include any <tool_call> blocks.

## TERMINAL TOOL NOTES

- Commands execute under POSIX `/bin/sh` (not bash).
- Each tool call runs in a fresh shell: environment changes (like `cd` or venv activation) do not persist across tool calls.
- Avoid bash-only features like `source`, `[[ ... ]]`, or process substitution.
- Prefer explicit venv usage:
  - `python -m venv .venv && . .venv/bin/activate && python -m pip install -e .` (POSIX `.` activation), or
  - `.venv/bin/python -m pip install -e .` (no activation required).

## ICL (examples)

User: Show the current directory.
Assistant:
<think>I should run pwd.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "pwd"}}
</tool_call>
User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
Assistant: /tmp

User: List files, then count them.
Assistant:
<think>I should count files.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "ls -1 | wc -l"}}
</tool_call>
User: <tool_response>{"success": true, "output": "3\\n"}</tool_response>
Assistant: 3

User: Run pwd, then print ok (two tool calls).
Assistant:
<think>I should run two commands.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "pwd"}}
</tool_call>
<tool_call>
{"name": "terminal", "arguments": {"command": "echo ok"}}
</tool_call>
User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
User: <tool_response>{"success": true, "output": "ok\\n"}</tool_response>
Assistant: ok
"""
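The prompt above requires each tool call to be a standalone JSON object inside `<tool_call>` tags. A small sketch of how such blocks can be extracted from an assistant message; the real AtroposAgent parser may differ:

```python
import json
import re

# Illustrative extractor for Hermes-style <tool_call> blocks. The prompt
# guarantees each block contains ONLY a valid JSON object with double quotes.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(message: str):
    """Return (name, arguments) pairs for every <tool_call> block."""
    calls = []
    for raw in TOOL_CALL_RE.findall(message):
        obj = json.loads(raw)
        calls.append((obj["name"], obj["arguments"]))
    return calls

msg = (
    "<think>I should run pwd.</think>\n"
    "<tool_call>\n"
    '{"name": "terminal", "arguments": {"command": "pwd"}}\n'
    "</tool_call>"
)
calls = parse_tool_calls(msg)
print(calls)  # → [('terminal', {'command': 'pwd'})]
```

A message with no `<tool_call>` blocks parses to an empty list, which is the "final answer" case in the loop the prompt describes.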
@dataclass
class AgentConfig:
    """Configuration for the AtroposAgent."""

    # Generation parameters
    temperature: Optional[float] = 0.7
    # Default to "let the backend decide" (important for tool-tag completions that may be longer).
    max_tokens: Optional[int] = None

    # Agent behavior
    max_steps: int = 50
    system_prompt: Optional[str] = None
    tool_delay_s: float = 0.0

    # Working directory for tools
    working_dir: Optional[str] = None


@dataclass
class SequenceData:
    """Token/logprob data from a single completion."""

    full_text: str
    tokens: List[int]
    masked_tokens: List[int]  # -100 for prompt, actual IDs for completion
    logprobs: List[float]  # 1.0 for prompt, actual values for completion
    metadata: Optional[Dict[str, Any]] = None

    @classmethod
    def from_sequence_node(cls, node) -> "SequenceData":
        """Create from a ManagedServer SequenceNode."""
        return cls(
            full_text=node.full_text,
            tokens=node.tokens,
            masked_tokens=node.masked_tokens,
            logprobs=node.logprobs,
            metadata=getattr(node, "metadata", None),
        )
|
||||||
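The `masked_tokens`/`logprobs` convention noted in the field comments above can be sketched with toy data (all token IDs and logprobs here are fabricated):

```python
# Toy illustration of the SequenceData masking convention described above:
# prompt positions get -100 in masked_tokens and a dummy logprob of 1.0;
# completion positions keep their real token IDs and logprobs.
prompt_tokens = [101, 2054, 2003]       # fabricated prompt token IDs
completion_tokens = [102, 2009]         # fabricated completion token IDs
completion_logprobs = [-0.12, -0.48]    # fabricated per-token logprobs

tokens = prompt_tokens + completion_tokens
masked_tokens = [-100] * len(prompt_tokens) + completion_tokens
logprobs = [1.0] * len(prompt_tokens) + completion_logprobs

# All three lists stay index-aligned; only unmasked positions train.
assert len(tokens) == len(masked_tokens) == len(logprobs)
trainable = [t for t, m in zip(tokens, masked_tokens) if m != -100]
print(trainable)  # [102, 2009]
```

Keeping the three lists the same length is what lets a trainer zip tokens, masks, and logprobs position by position.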


@dataclass
class AgentStep:
    """A single step in the agent's trajectory."""

    step_number: int
    assistant_message: str
    tool_calls: List[ToolCall] = field(default_factory=list)
    tool_results: List[ToolResult] = field(default_factory=list)
    sequence_data: Optional[SequenceData] = None  # Token data from this step

    @property
    def has_tool_calls(self) -> bool:
        return len(self.tool_calls) > 0


@dataclass
class AgentResult:
    """Result of running an agent trajectory."""

    success: bool
    final_response: str
    steps: List[AgentStep] = field(default_factory=list)
    total_tokens: int = 0
    error: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    # Full trajectory token data for RL training
    trajectory_data: Optional[SequenceData] = None

    @property
    def num_steps(self) -> int:
        return len(self.steps)

    @property
    def total_tool_calls(self) -> int:
        return sum(len(step.tool_calls) for step in self.steps)

    def to_messages(self) -> List[Dict[str, str]]:
        """Convert trajectory to messages format for logging."""
        messages = []
        for step in self.steps:
            messages.append({"role": "assistant", "content": step.assistant_message})
            if step.tool_results:
                # Combine all tool responses
                responses = "\n".join(r.to_xml() for r in step.tool_results)
                messages.append({"role": "user", "content": responses})
        return messages

    def to_scored_data(self, score: float) -> Optional[Dict[str, Any]]:
        """
        Convert to format suitable for ScoredDataGroup.

        Args:
            score: The score for this trajectory

        Returns:
            Dict with tokens, masks, scores suitable for training, or None if no data
        """
        if self.trajectory_data is None:
            return None

        return {
            "tokens": self.trajectory_data.tokens,
            "masks": self.trajectory_data.masked_tokens,
            "scores": score,
            "logprobs": self.trajectory_data.logprobs,
        }
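`to_messages` above flattens each step into an assistant message plus, when tools ran, one combined user message. A standalone sketch of that flattening using plain dicts in place of `AgentStep`/`ToolResult` (the step contents are fabricated):

```python
# Standalone sketch of the to_messages() flattening above, with plain dicts
# standing in for AgentStep and ToolResult. Step contents are fabricated.
steps = [
    {
        "assistant": '<tool_call>{"name": "terminal"}</tool_call>',
        "tool_xml": ['<tool_response>{"success": true}</tool_response>'],
    },
    {"assistant": "done", "tool_xml": []},
]

messages = []
for step in steps:
    messages.append({"role": "assistant", "content": step["assistant"]})
    if step["tool_xml"]:
        # All tool responses for a step are joined into one user message.
        messages.append({"role": "user", "content": "\n".join(step["tool_xml"])})

print([m["role"] for m in messages])  # ['assistant', 'user', 'assistant']
```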


class AtroposAgent:
    """
    A ReACT-style agent that uses LLMs with tool calling.

    This implementation wraps ManagedServer for automatic token/logprob tracking,
    making trajectories ready for RL training.

    Example:
        # `server` may be an Atropos `ServerManager` (recommended) or a single `APIServer`.
        # In practice, environments usually construct this via `BaseEnv`.
        server = ...
        tools = ToolRegistry()
        tools.register(BashTool())

        agent = AtroposAgent(server=server, tools=tools)
        result = await agent.run("List the files in the current directory")

        # Access token data for training
        if result.trajectory_data:
            print(f"Tokens: {result.trajectory_data.tokens}")
            print(f"Masked: {result.trajectory_data.masked_tokens}")
    """

    def __init__(
        self,
        server,  # ServerManager or APIServer
        tools: Optional[ToolRegistry] = None,
        config: Optional[AgentConfig] = None,
        tokenizer: Optional[Any] = None,
        execute_tool: Optional[Callable[[ToolCall], Awaitable[ToolResult]]] = None,
    ):
        self.server = server
        self.tools = tools or ToolRegistry()
        self.config = config or AgentConfig()
        self.tokenizer = tokenizer or getattr(server, "tokenizer", None)
        self.execute_tool = execute_tool or self.tools.execute

    @asynccontextmanager
    async def _managed(self) -> AsyncGenerator[Any, None]:
        """
        Yield a ManagedServer-like object.

        - If `self.server` is a ServerManager, use its `managed_server()` context manager.
        - If `self.server` is a single APIServer, wrap it in `ManagedServer` directly.
        """
        if os.getenv("ATROPOS_BYPASS_MANAGED_SERVER") == "1":
            yield _DirectChatCompletionClient(server=self.server)
            return
        if hasattr(self.server, "managed_server"):
            async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
                yield managed
        else:
            managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
            try:
                yield managed
            finally:
                managed.reset()

    def _build_system_prompt(self) -> str:
        """Build the system prompt with tool descriptions."""
        if self.config.system_prompt:
            return self.config.system_prompt

        tools_json = self.tools.get_prompt_tool_definitions_json()
        # Avoid `str.format()` here because the prompt contains many literal `{}` braces
        # in JSON examples; we only want to substitute the single `{tools_json}` token.
        return AGENT_SYSTEM_PROMPT.replace("{tools_json}", tools_json)
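The comment in `_build_system_prompt` about avoiding `str.format()` can be demonstrated in isolation; the template below is a made-up stand-in for `AGENT_SYSTEM_PROMPT`:

```python
# Why a targeted str.replace is used instead of str.format: templates full of
# literal JSON braces make str.format treat them as replacement fields and fail.
template = 'Available tools: {tools_json}\nExample: {"name": "terminal"}'

try:
    template.format(tools_json="[]")
except (KeyError, ValueError, IndexError) as exc:
    print("str.format failed:", type(exc).__name__)

# Substituting only the single {tools_json} token is safe:
rendered = template.replace("{tools_json}", "[]")
print(rendered)
```

The trade-off is that `str.replace` performs no escaping at all, which is exactly what a brace-heavy prompt template needs.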

    def _infer_server_model_for_debug(self) -> Optional[str]:
        """
        Best-effort inference of the configured model name for debug payload saving.

        ManagedServer/server_manager typically injects `model` internally, so `chat_kwargs`
        may not contain it. For replaying saved payloads via curl, it's useful to persist it.
        """
        servers = getattr(self.server, "servers", None)
        if isinstance(servers, list) and servers:
            s0 = servers[0]
            cfg = getattr(s0, "config", None)
            model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
            if isinstance(model, str) and model:
                return model
        model = getattr(self.server, "model_name", None) or getattr(self.server, "model", None)
        if isinstance(model, str) and model:
            return model
        return None

    def _infer_server_base_url_for_debug(self) -> Optional[str]:
        """
        Best-effort inference of the configured base_url for debug logging.

        This is helpful when diagnosing hangs / retries at the transport layer.
        """
        servers = getattr(self.server, "servers", None)
        if isinstance(servers, list) and servers:
            s0 = servers[0]
            cfg = getattr(s0, "config", None)
            base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
            if isinstance(base_url, str) and base_url:
                return base_url
        base_url = getattr(self.server, "base_url", None)
        if isinstance(base_url, str) and base_url:
            return base_url
        return None

    def _extract_response_metadata(self, response: Any) -> Dict[str, Any]:
        """
        Extract lightweight, JSON-serializable metadata from an OpenAI-style response.

        This is useful for debugging training runs, especially when ManagedServer state
        tracking is unavailable (e.g. OpenAI-compatible chat endpoints).
        """
        meta: Dict[str, Any] = {}
        try:
            rid = getattr(response, "id", None)
            if isinstance(rid, str) and rid:
                meta["id"] = rid
            model = getattr(response, "model", None)
            if isinstance(model, str) and model:
                meta["model"] = model
            created = getattr(response, "created", None)
            if isinstance(created, int):
                meta["created"] = created
            system_fingerprint = getattr(response, "system_fingerprint", None)
            if isinstance(system_fingerprint, str) and system_fingerprint:
                meta["system_fingerprint"] = system_fingerprint

            choices = getattr(response, "choices", None)
            if isinstance(choices, list) and choices:
                fr = getattr(choices[0], "finish_reason", None)
                if isinstance(fr, str) and fr:
                    meta["finish_reason"] = fr

            usage = getattr(response, "usage", None)
            if usage is not None:
                if hasattr(usage, "model_dump"):
                    meta["usage"] = usage.model_dump()
                elif isinstance(usage, dict):
                    meta["usage"] = usage
        except Exception:
            pass
        return meta

    def _debug_dump_request(self, *, step_num: int, chat_kwargs: Dict[str, Any]) -> None:
        if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST") != "1":
            return
        try:
            # Avoid dumping megabytes by default; messages can be huge.
            meta = {
                "step": step_num,
                "base_url": self._infer_server_base_url_for_debug(),
                "model": chat_kwargs.get("model") or self._infer_server_model_for_debug(),
                "chat_kwargs_keys": sorted(list(chat_kwargs.keys())),
                "n": chat_kwargs.get("n"),
                "max_tokens": chat_kwargs.get("max_tokens"),
                "temperature": chat_kwargs.get("temperature"),
                "num_messages": len(chat_kwargs.get("messages") or []),
            }
            print("\n=== ATROPOS_DEBUG_AGENT_REQUEST ===", flush=True)
            print(meta, flush=True)

            if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_FULL") == "1":
                payload = dict(chat_kwargs)
                # Make the payload more legible and less huge.
                try:
                    dumped = json.dumps(payload, ensure_ascii=False, indent=2)
                except Exception:
                    dumped = repr(payload)
                print("\n=== ATROPOS_DEBUG_AGENT_REQUEST_FULL ===", flush=True)
                print(dumped[:200_000], flush=True)

            # Optional: save the FULL request payload to disk (no truncation).
            save_dir = os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_SAVE_DIR")
            if save_dir:
                os.makedirs(save_dir, exist_ok=True)
                payload: Dict[str, Any] = dict(chat_kwargs)
                if "model" not in payload:
                    model = self._infer_server_model_for_debug()
                    if model:
                        payload["model"] = model
                # Use a unique filename so parallel trajectories don't clobber each other.
                fname = os.path.join(
                    save_dir,
                    f"atropos_agent_request_step{step_num}_{int(time.time()*1000)}_{os.getpid()}_{uuid4().hex}.json",
                )
                with open(fname, "w", encoding="utf-8") as f:
                    json.dump(payload, f, ensure_ascii=False, indent=2)
                print(f"[AtroposAgent] saved request payload: {fname}", flush=True)
        except Exception:
            return

    def _debug_dump_response(self, *, step_num: int, response: Any) -> None:
        if os.getenv("ATROPOS_DEBUG_AGENT_RESPONSE") != "1":
            return
        print("\n=== ATROPOS_DEBUG_AGENT_RESPONSE ===", flush=True)
        print({"step": step_num, "type": type(response).__name__}, flush=True)
        try:
            dumped = response.model_dump()  # openai pydantic model
        except Exception:
            dumped = getattr(response, "__dict__", {"repr": repr(response)})
        # Keep the dump bounded; we only need enough to see the assistant message content.
        text = str(dumped)
        print(text[:200_000], flush=True)

    async def _chat_completion_with_debug(
        self, *, managed: Any, step_num: int, chat_kwargs: Dict[str, Any]
    ) -> Any:
        """
        Call `managed.chat_completion()` with a timeout + richer failure logging.

        Debug env vars:
        - `ATROPOS_AGENT_CHAT_TIMEOUT_S`: overrides the default 240s timeout (capped at 240s).
        - `ATROPOS_DEBUG_AGENT_WAIT_EVERY_S`: if set, prints a heartbeat while waiting.
        """
        # Hard guardrail: never allow a single chat completion to block for too long.
        # This is essential for RL data-gen stability; long hangs should be treated as failures (score=0).
        timeout_s_raw = os.getenv("ATROPOS_AGENT_CHAT_TIMEOUT_S")
        timeout_s_default = 240.0
        timeout_s = float(timeout_s_raw) if timeout_s_raw else timeout_s_default
        timeout_s = min(timeout_s, 240.0)

        wait_every_raw = os.getenv("ATROPOS_DEBUG_AGENT_WAIT_EVERY_S")
        wait_every_s = float(wait_every_raw) if wait_every_raw else None

        async def _await_call() -> Any:
            if not wait_every_s or wait_every_s <= 0:
                return await managed.chat_completion(**chat_kwargs)

            # Heartbeat mode: wait in chunks without cancelling the underlying request.
            # NOTE: do NOT use `asyncio.wait_for(task, timeout=...)` here, because a timeout
            # will cancel the task and surface as `CancelledError` on the next loop.
            task = asyncio.create_task(managed.chat_completion(**chat_kwargs))
            t0 = time.perf_counter()
            try:
                while True:
                    done, _pending = await asyncio.wait({task}, timeout=wait_every_s)
                    if task in done:
                        return task.result()

                    waited = time.perf_counter() - t0
                    print(
                        f"[AtroposAgent] step={step_num} still waiting for chat_completion... ({waited:.1f}s)",
                        flush=True,
                    )
            except asyncio.CancelledError:
                task.cancel()
                raise

        try:
            return await asyncio.wait_for(_await_call(), timeout=timeout_s)
        except asyncio.TimeoutError as e:
            print("\n=== ATROPOS_DEBUG_AGENT_CHAT_TIMEOUT ===", flush=True)
            print({"step": step_num, "timeout_s": timeout_s}, flush=True)
            raise RuntimeError(f"chat_completion timed out after {timeout_s:.1f}s") from e
        except asyncio.CancelledError:
            # Treat cancellation as a hard failure rather than crashing the whole env run.
            # (Atropos/BaseEnv may cancel tasks during shutdown or retries.)
            raise RuntimeError("chat_completion cancelled") from None
        except Exception as e:
            detail: Dict[str, Any] = {
                "step": step_num,
                "exc_type": type(e).__name__,
                "exc_str": str(e),
            }
            if isinstance(e, httpx.HTTPStatusError):
                try:
                    detail["status_code"] = e.response.status_code
                    detail["response_text"] = e.response.text[:20_000]
                except Exception:
                    pass
            elif isinstance(e, httpx.RequestError):
                detail["request"] = repr(getattr(e, "request", None))

            print("\n=== ATROPOS_DEBUG_AGENT_CHAT_FAILURE ===", flush=True)
            print(detail, flush=True)
            raise

    async def run(
        self,
        task: str,
        initial_messages: Optional[List[Dict[str, str]]] = None,
    ) -> AgentResult:
        """
        Run the agent on a task using ManagedServer for token tracking.

        Args:
            task: The task/prompt for the agent
            initial_messages: Optional additional context messages

        Returns:
            AgentResult with the trajectory, final response, and token data
        """
        messages = [
            {"role": "system", "content": self._build_system_prompt()},
        ]

        if initial_messages:
            messages.extend(initial_messages)

        messages.append({"role": "user", "content": task})

        steps = []
        final_response = ""
        final_node = None
        final_prompt_messages: Optional[List[Dict[str, str]]] = None
        last_node = None
        last_prompt_messages: Optional[List[Dict[str, str]]] = None
        last_response_text: str = ""

        # Use ManagedServer for automatic token tracking
        async with self._managed() as managed:
            for step_num in range(self.config.max_steps):
                # ReACT loop iteration here, just call -> tools -> observe until done (no tools called)
                try:
                    # Keep a copy of the prompt messages used for this completion.
                    # Useful for reconstructing tokens/masks when state tracking is unavailable.
                    prompt_messages = list(messages)
                    chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
                    if self.config.max_tokens is not None:
                        chat_kwargs["max_tokens"] = self.config.max_tokens
                    if self.config.temperature is not None:
                        chat_kwargs["temperature"] = self.config.temperature

                    t_req = time.perf_counter()
                    print(
                        f"[AtroposAgent] step={step_num+1} chat_completion start "
                        f"(messages={len(messages)}, max_tokens={self.config.max_tokens}, temp={self.config.temperature})",
                        flush=True,
                    )
                    self._debug_dump_request(step_num=step_num + 1, chat_kwargs=chat_kwargs)
                    response = await self._chat_completion_with_debug(
                        managed=managed, step_num=step_num + 1, chat_kwargs=chat_kwargs
                    )
                    self._debug_dump_response(step_num=step_num + 1, response=response)
                    response_meta = self._extract_response_metadata(response)
                    print(
                        f"[AtroposAgent] step={step_num+1} chat_completion done in {time.perf_counter() - t_req:.2f}s",
                        flush=True,
                    )

                    current_node = None
                    if hasattr(managed, "get_state"):
                        state = managed.get_state()
                        nodes = state.get("nodes", [])
                        current_node = nodes[-1] if nodes else None

                except Exception as e:
                    return AgentResult(
                        success=False,
                        final_response="",
                        steps=steps,
                        error=f"Generation error: {str(e)}",
                    )

                msg = response.choices[0].message
                # Some OpenAI-compatible servers populate `message.reasoning` and leave `content=""`.
                response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
                tool_calls = ToolCall.parse_from_text(response_text)
                last_node = current_node
                last_prompt_messages = prompt_messages
                last_response_text = response_text

                step_sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None
                if step_sequence_data is None:
                    if response_meta:
                        # We still want metadata for debugging even if token/logprob state tracking is unavailable.
                        step_sequence_data = SequenceData(
                            full_text=response_text,
                            tokens=[],
                            masked_tokens=[],
                            logprobs=[],
                            metadata=response_meta,
                        )
                else:
                    merged = dict(response_meta)
                    node_meta = step_sequence_data.metadata
                    if isinstance(node_meta, dict):
                        merged.update(node_meta)
                    step_sequence_data.metadata = merged or step_sequence_data.metadata

                step = AgentStep(
                    step_number=step_num + 1,
                    assistant_message=response_text,
                    tool_calls=tool_calls,
                    sequence_data=step_sequence_data,
                )

                if not tool_calls:
                    steps.append(step)
                    final_response = response_text
                    final_node = current_node
                    final_prompt_messages = prompt_messages
                    break

                messages.append({"role": "assistant", "content": response_text})

                tool_responses = []
                for call in tool_calls:
                    result = await self.execute_tool(call)
                    step.tool_results.append(result)
                    tool_responses.append(result.to_xml())
                    if self.config.tool_delay_s > 0:
                        await asyncio.sleep(self.config.tool_delay_s)

                steps.append(step)

                responses_text = "\n".join(tool_responses)
                # Tool observations are represented as user content with Hermes-style tags.
                # This is compatible with most OpenAI-compatible chat APIs and ensures
                # tokenizers/chat templates include tool outputs during training.
                messages.append({"role": "user", "content": responses_text})

            else:
                # Reached max steps without completing.
                # Return a failure result but include the last observed completion so callers can
                # record the trajectory (score=0) without triggering retries.
                final_response = last_response_text or final_response
                final_node = last_node
                final_prompt_messages = last_prompt_messages
                trajectory_data = None
                if final_node:
                    trajectory_data = SequenceData.from_sequence_node(final_node)
                elif final_prompt_messages is not None and self.tokenizer is not None:
                    if hasattr(self.tokenizer, "apply_chat_template"):
                        prompt_text = self.tokenizer.apply_chat_template(
                            final_prompt_messages, tokenize=False, add_generation_prompt=True
                        )
                        prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
                    else:
                        prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
                        prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
                    output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
                    tokens = prompt_tokens + output_tokens
                    masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
                    logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
                    trajectory_data = SequenceData(
                        full_text=f"{prompt_text}{final_response}",
                        tokens=tokens,
                        masked_tokens=masked_tokens,
                        logprobs=logprobs,
                    )
                # Preserve response metadata (if any) even on failure trajectories.
                try:
                    if trajectory_data is not None and steps:
                        last_step = steps[-1]
                        if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
                            trajectory_data.metadata = dict(last_step.sequence_data.metadata)
                except Exception:
                    pass
                return AgentResult(
                    success=False,
                    final_response=final_response,
                    steps=steps,
                    error=f"Reached maximum steps ({self.config.max_steps})",
                    trajectory_data=trajectory_data,
                )

        # Build result with trajectory data
        trajectory_data = None
        if final_node:
            trajectory_data = SequenceData.from_sequence_node(final_node)
        elif final_prompt_messages is not None and self.tokenizer is not None:
            if hasattr(self.tokenizer, "apply_chat_template"):
                prompt_text = self.tokenizer.apply_chat_template(
                    final_prompt_messages, tokenize=False, add_generation_prompt=True
                )
                prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
            else:
                prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
                prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
            output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
            tokens = prompt_tokens + output_tokens
            masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
            logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
            trajectory_data = SequenceData(
                full_text=f"{prompt_text}{final_response}",
                tokens=tokens,
                masked_tokens=masked_tokens,
                logprobs=logprobs,
            )

        # Ensure trajectory_data carries the most recent metadata we observed (if any).
        try:
            if trajectory_data is not None and steps:
                last_step = steps[-1]
                if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
                    trajectory_data.metadata = dict(last_step.sequence_data.metadata)
        except Exception:
            pass

        return AgentResult(
            success=True,
            final_response=final_response,
            steps=steps,
            trajectory_data=trajectory_data,
        )

    async def run_single_turn(
        self,
        messages: List[Dict[str, str]],
        execute_tools: bool = True,
    ) -> tuple[str, List[ToolResult], Optional[SequenceData]]:
        """
        Run a single turn of the agent (one LLM call + tool execution).

        This is useful for integration with BaseEnv where you want more
        control over the loop.

        Args:
            messages: The conversation history
            execute_tools: Whether to execute parsed tool calls

        Returns:
            Tuple of (response_text, tool_results, sequence_data)
        """
        async with self._managed() as managed:
            chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
            if self.config.max_tokens is not None:
                chat_kwargs["max_tokens"] = self.config.max_tokens
            if self.config.temperature is not None:
                chat_kwargs["temperature"] = self.config.temperature

            self._debug_dump_request(step_num=1, chat_kwargs=chat_kwargs)
            response = await self._chat_completion_with_debug(managed=managed, step_num=1, chat_kwargs=chat_kwargs)
            self._debug_dump_response(step_num=1, response=response)

            current_node = None
            if hasattr(managed, "get_state"):
                state = managed.get_state()
                nodes = state.get("nodes", [])
                current_node = nodes[-1] if nodes else None

            msg = response.choices[0].message
            response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
            tool_results = []

            if execute_tools:
                tool_calls = ToolCall.parse_from_text(response_text)
                for call in tool_calls:
                    result = await self.execute_tool(call)
                    tool_results.append(result)

            sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None

            return response_text, tool_results, sequence_data
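The heartbeat wait in `_chat_completion_with_debug` relies on `asyncio.wait` leaving the task running on timeout, unlike `asyncio.wait_for`, which cancels it. A self-contained sketch of that pattern with a stand-in coroutine:

```python
import asyncio
import time


async def slow_call() -> str:
    # Stand-in for managed.chat_completion(...).
    await asyncio.sleep(0.25)
    return "done"


async def wait_with_heartbeat(coro, every_s: float) -> str:
    # asyncio.wait with a timeout does NOT cancel the task, so we can keep
    # polling and printing progress without aborting the in-flight request.
    task = asyncio.create_task(coro)
    t0 = time.perf_counter()
    while True:
        done, _pending = await asyncio.wait({task}, timeout=every_s)
        if task in done:
            return task.result()
        print(f"still waiting... ({time.perf_counter() - t0:.1f}s)", flush=True)


result = asyncio.run(wait_with_heartbeat(slow_call(), every_s=0.1))
print(result)  # done
```

Swapping `asyncio.wait` for `asyncio.wait_for` here would cancel `task` on the first timeout, which is exactly the failure mode the NOTE comment in `_await_call` warns about.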


class _DirectChatCompletionClient:
    """
    Minimal stand-in for ManagedServer that calls the OpenAI-compatible endpoint directly.

    This is for isolating issues where `ManagedServer.chat_completion()` hangs or misbehaves.
    It intentionally does NOT do token/logprob tracking.
    """

    def __init__(self, server: Any):
        self._server = server

    def _server_config(self) -> tuple[str, str, str]:
        # ServerManager case: first configured server.
        servers = getattr(self._server, "servers", None)
        if isinstance(servers, list) and servers:
            s0 = servers[0]
            cfg = getattr(s0, "config", None)
            base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
            api_key = getattr(cfg, "api_key", None) or getattr(s0, "api_key", None)
            model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
            if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
                return base_url.rstrip("/"), api_key, model

        # APIServer-like fallback.
        base_url = getattr(self._server, "base_url", None)
        api_key = getattr(self._server, "api_key", None)
        model = getattr(self._server, "model_name", None) or getattr(self._server, "model", None)
        if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
            return base_url.rstrip("/"), api_key, model

        raise RuntimeError("Unable to resolve server base_url/api_key/model for direct chat completion")

    async def chat_completion(self, *, messages: List[Dict[str, str]], n: int = 1, **kwargs: Any) -> Any:
        base_url, api_key, model = self._server_config()
        url = f"{base_url}/chat/completions"

        payload: Dict[str, Any] = {
            "model": model,
            "messages": messages,
            "n": n,
        }
        # Pass through common generation kwargs.
        for k in ("max_tokens", "temperature", "top_p", "presence_penalty", "frequency_penalty", "stop"):
            if k in kwargs and kwargs[k] is not None:
                payload[k] = kwargs[k]

        timeout_s = float(os.getenv("ATROPOS_DIRECT_REQUEST_TIMEOUT_S") or "120")
        print(f"[AtroposAgent] DIRECT chat_completion POST {url} (timeout={timeout_s}s)", flush=True)
        async with httpx.AsyncClient(timeout=timeout_s) as client:
            resp = await client.post(
                url,
                headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
                json=payload,
            )
            resp.raise_for_status()
            data = resp.json()

        # Return a very small object compatible with the code paths that read
        # `response.choices[0].message.content`.
        class _Msg:
            def __init__(self, d: Dict[str, Any]):
                self.content = d.get("content")
                self.reasoning = d.get("reasoning")

        class _Choice:
            def __init__(self, d: Dict[str, Any]):
                self.message = _Msg(d.get("message") or {})

        class _Resp:
            def __init__(self, d: Dict[str, Any]):
                self._d = d
                self.choices = [_Choice(c) for c in (d.get("choices") or [])]

            def model_dump(self) -> Dict[str, Any]:
                return self._d

        return _Resp(data)
6
atropos/api/__init__.py
Normal file
6
atropos/api/__init__.py
Normal file
@@ -0,0 +1,6 @@
|
|||||||
|
"""
|
||||||
|
FastAPI services for atropos-agent.
|
||||||
|
|
||||||
|
- tool_executor_server: queued/batched sandbox tool execution (Phase 4)
|
||||||
|
"""
|
||||||
|
|
||||||
atropos/api/tool_executor_server.py (254 lines, new file)
@@ -0,0 +1,254 @@
"""
Tool Executor API (Phase 4)

This service provides a queued, batched execution layer on top of a ToolBackend.
It mirrors the stateful FastAPI + app.state pattern used in:
    atropos/atroposlib/api/server.py

Run (dev):
    uv run uvicorn atropos_agent.api.tool_executor_server:app --host 0.0.0.0 --port 9001
"""

from __future__ import annotations

import os
from pathlib import Path
from typing import Any, Dict, Optional

from fastapi import FastAPI, Header, HTTPException, status
from pydantic import BaseModel, Field

from ..backends.nomad_backend import NomadBackendConfig, NomadToolBackend
from ..tools import ToolRegistry, build_tool_registry
from ..tools.base import (
    ArtifactArchiveRequestPayload,
    ArtifactArchiveResponsePayload,
    ArtifactListRequestPayload,
    ArtifactListResponsePayload,
    ArtifactReadRequestPayload,
    ArtifactReadResponsePayload,
    ToolExecutorExecuteRequest,
    ToolExecutorReleaseRequest,
    ToolResultPayload,
)
from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig


class ToolExecutorServerConfig(BaseModel):
    nomad_address: str = Field(default="http://localhost:4646")
    job_id: str = Field(default="atropos-sandbox-tool-executor")
    image: str = Field(default="atropos-sandbox:local")
    slots_per_container: int = Field(default=10)
    min_containers: int = Field(default=1)
    max_containers: int = Field(default=10)
    privileged: bool = Field(default=False)
    acquire_timeout_s: float = Field(default=30.0)

    batch_window_ms: int = Field(default=20)
    max_batch_size: int = Field(default=200)
    allow_network: bool = Field(default=True)

    tool_server_url: Optional[str] = Field(default=None)
    tool_server_token: Optional[str] = Field(default=None)

    token: Optional[str] = Field(default=None, description="Bearer token required for requests (optional in dev).")

    purge_job_on_shutdown: bool = Field(default=True)

    @classmethod
    def from_env(cls) -> "ToolExecutorServerConfig":
        # In dev, prefer loading secrets/config from the repo-local `.env` (not committed).
        try:
            from dotenv import load_dotenv  # type: ignore
        except Exception:  # pragma: no cover
            load_dotenv = None  # type: ignore[assignment]
        if load_dotenv is not None:
            env_path = Path(__file__).resolve().parents[2] / ".env"
            if env_path.exists():
                load_dotenv(dotenv_path=env_path)

        def _get_bool(name: str, default: bool) -> bool:
            raw = os.getenv(name)
            if raw is None:
                return default
            return raw.strip().lower() in {"1", "true", "yes", "y", "on"}

        return cls(
            nomad_address=os.getenv("TOOL_EXECUTOR_NOMAD_ADDRESS", "http://localhost:4646"),
            job_id=os.getenv("TOOL_EXECUTOR_JOB_ID", "atropos-sandbox-tool-executor"),
            image=os.getenv("TOOL_EXECUTOR_IMAGE", "atropos-sandbox:local"),
            slots_per_container=int(os.getenv("TOOL_EXECUTOR_SLOTS", "10")),
            min_containers=int(os.getenv("TOOL_EXECUTOR_MIN_CONTAINERS", "1")),
            max_containers=int(os.getenv("TOOL_EXECUTOR_MAX_CONTAINERS", "10")),
            privileged=_get_bool("TOOL_EXECUTOR_PRIVILEGED", False),
            acquire_timeout_s=float(os.getenv("TOOL_EXECUTOR_ACQUIRE_TIMEOUT_S", "30.0")),
            batch_window_ms=int(os.getenv("TOOL_EXECUTOR_BATCH_WINDOW_MS", "20")),
            max_batch_size=int(os.getenv("TOOL_EXECUTOR_MAX_BATCH_SIZE", "200")),
            allow_network=_get_bool("TOOL_EXECUTOR_ALLOW_NETWORK", True),
            tool_server_url=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_URL") or None,
            tool_server_token=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_TOKEN") or None,
            token=os.getenv("TOOL_EXECUTOR_TOKEN") or None,
            purge_job_on_shutdown=_get_bool("TOOL_EXECUTOR_PURGE_JOB_ON_SHUTDOWN", True),
        )
app = FastAPI(title="Atropos-Agent Tool Executor")


@app.get("/")
async def root() -> Dict[str, str]:
    return {"message": "Atropos-Agent Tool Executor"}


def _check_auth(cfg: ToolExecutorServerConfig, authorization: Optional[str]) -> None:
    if not cfg.token:
        return
    if not authorization:
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
    if not authorization.lower().startswith("bearer "):
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
    token = authorization.split(" ", 1)[1].strip()
    if token != cfg.token:
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")


@app.on_event("startup")
async def _startup() -> None:
    cfg = ToolExecutorServerConfig.from_env()

    # Default to Atropos "full" tool surface: sandbox + external (if tool_server_url provided).
    tools: ToolRegistry = build_tool_registry(
        enabled_toolsets=["full"],
        disabled_toolsets=None,
        tool_server_url=cfg.tool_server_url,
    )

    backend = NomadToolBackend(
        NomadBackendConfig(
            nomad_address=cfg.nomad_address,
            sandbox_job_id=cfg.job_id,
            sandbox_image=cfg.image,
            slots_per_container=cfg.slots_per_container,
            min_containers=cfg.min_containers,
            max_containers=cfg.max_containers,
            privileged=cfg.privileged,
            acquire_timeout_s=cfg.acquire_timeout_s,
            purge_job_on_start=False,
        )
    )
    await backend.start()

    executor = ToolExecutor(
        backend=backend,
        tools=tools,
        config=ToolExecutorConfig(
            batch_window_ms=cfg.batch_window_ms,
            max_batch_size=cfg.max_batch_size,
            allow_network=cfg.allow_network,
            tool_server_url=cfg.tool_server_url,
            tool_server_token=cfg.tool_server_token,
        ),
    )
    await executor.start()

    app.state.cfg = cfg
    app.state.backend = backend
    app.state.executor = executor


@app.on_event("shutdown")
async def _shutdown() -> None:
    executor: Optional[ToolExecutor] = getattr(app.state, "executor", None)
    backend: Optional[NomadToolBackend] = getattr(app.state, "backend", None)
    cfg: Optional[ToolExecutorServerConfig] = getattr(app.state, "cfg", None)

    if executor is not None:
        await executor.close()

    if backend is not None:
        await backend.stop(purge=bool(cfg.purge_job_on_shutdown) if cfg else False)


@app.get("/health")
async def health() -> Dict[str, Any]:
    return {"status": "ok"}


@app.get("/status")
async def status_endpoint() -> Dict[str, Any]:
    executor: ToolExecutor = app.state.executor
    backend: NomadToolBackend = app.state.backend

    return {
        "queue_size": executor.queue_size(),
        "total_requests": executor.total_requests,
        "total_errors": executor.total_errors,
        "pool": backend.get_stats(),
    }


@app.post("/execute", response_model=ToolResultPayload)
async def execute_tool(
    req: ToolExecutorExecuteRequest,
    authorization: Optional[str] = Header(default=None),
    status_code: int = status.HTTP_200_OK,  # noqa: B008
) -> ToolResultPayload:
    cfg: ToolExecutorServerConfig = app.state.cfg
    _check_auth(cfg, authorization)

    executor: ToolExecutor = app.state.executor
    result = await executor.execute(
        trajectory_id=req.trajectory_id,
        call=req.tool.to_tool_call(),
        timeout_s=req.timeout_s,
    )
    return ToolResultPayload.from_tool_result(result)


@app.post("/release")
async def release_trajectory(
    req: ToolExecutorReleaseRequest,
    authorization: Optional[str] = Header(default=None),
) -> Dict[str, Any]:
    cfg: ToolExecutorServerConfig = app.state.cfg
    _check_auth(cfg, authorization)

    executor: ToolExecutor = app.state.executor
    await executor.release_trajectory(req.trajectory_id, reset_workspace=req.reset_workspace)
    return {"status": "ok"}


@app.post("/artifacts/read", response_model=ArtifactReadResponsePayload)
async def artifacts_read(
    req: ArtifactReadRequestPayload,
    authorization: Optional[str] = Header(default=None),
) -> ArtifactReadResponsePayload:
    cfg: ToolExecutorServerConfig = app.state.cfg
    _check_auth(cfg, authorization)

    executor: ToolExecutor = app.state.executor
    return await executor.read_artifact(req)


@app.post("/artifacts/list", response_model=ArtifactListResponsePayload)
async def artifacts_list(
    req: ArtifactListRequestPayload,
    authorization: Optional[str] = Header(default=None),
) -> ArtifactListResponsePayload:
    cfg: ToolExecutorServerConfig = app.state.cfg
    _check_auth(cfg, authorization)

    executor: ToolExecutor = app.state.executor
    return await executor.list_artifacts(req)


@app.post("/artifacts/archive", response_model=ArtifactArchiveResponsePayload)
async def artifacts_archive(
    req: ArtifactArchiveRequestPayload,
    authorization: Optional[str] = Header(default=None),
) -> ArtifactArchiveResponsePayload:
    cfg: ToolExecutorServerConfig = app.state.cfg
    _check_auth(cfg, authorization)

    executor: ToolExecutor = app.state.executor
    return await executor.archive_artifacts(req)
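The bearer-token logic in `_check_auth` can be exercised in isolation. The sketch below (names hypothetical) returns status strings instead of raising `fastapi.HTTPException`, so it runs without a server, but the branch structure is the same:

```python
from typing import Optional

def check_auth(required_token: Optional[str], authorization: Optional[str]) -> str:
    # No configured token -> auth is disabled entirely (dev mode).
    if not required_token:
        return "ok"
    if not authorization:
        return "401 missing header"
    # Scheme check is case-insensitive ("Bearer", "bearer", ...).
    if not authorization.lower().startswith("bearer "):
        return "401 malformed header"
    # Token is everything after the first space, whitespace-stripped.
    token = authorization.split(" ", 1)[1].strip()
    return "ok" if token == required_token else "403 invalid token"

print(check_auth(None, None))                   # -> ok (auth disabled)
print(check_auth("s3cret", "Bearer s3cret"))    # -> ok
print(check_auth("s3cret", "bearer  s3cret "))  # -> ok (scheme case-insensitive, token stripped)
print(check_auth("s3cret", "Bearer wrong"))     # -> 403 invalid token
```

One consequence worth noting: a missing or malformed header maps to 401, while a well-formed header with the wrong token maps to 403, matching the server's split between `HTTP_401_UNAUTHORIZED` and `HTTP_403_FORBIDDEN`.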
atropos/api/tool_server.py (140 lines, new file)
@@ -0,0 +1,140 @@
"""
External ToolServer (Phase 4.5+).

This server executes tools that must NOT run inside the sandbox, typically
because they require credentials or access to external services.

Run (dev):
    uv run uvicorn atropos_agent.api.tool_server:app --host 0.0.0.0 --port 9002
"""

from __future__ import annotations

import asyncio
import inspect
import os
from pathlib import Path
from typing import Any, Dict, List, Optional

from fastapi import FastAPI, Header, HTTPException, status
from pydantic import BaseModel, Field

from ..tools import ToolRegistry, build_tool_registry
from ..tools.base import ToolResultPayload, ToolServerExecuteRequest


class ToolServerConfig(BaseModel):
    token: Optional[str] = Field(
        default=None,
        description="Bearer token required for requests (optional in dev).",
    )
    max_concurrency: int = Field(default=16, ge=1, description="Max concurrent tool executions.")

    @classmethod
    def from_env(cls) -> "ToolServerConfig":
        # In dev, prefer loading secrets from the repo-local `.env` (not committed).
        try:
            from dotenv import load_dotenv  # type: ignore
        except Exception:  # pragma: no cover
            load_dotenv = None  # type: ignore[assignment]
        if load_dotenv is not None:
            env_path = Path(__file__).resolve().parents[2] / ".env"
            if env_path.exists():
                load_dotenv(dotenv_path=env_path)

        token = os.getenv("TOOL_SERVER_TOKEN") or None
        max_concurrency = int(os.getenv("TOOL_SERVER_MAX_CONCURRENCY", "16"))
        return cls(token=token, max_concurrency=max_concurrency)


app = FastAPI(title="Atropos-Agent Tool Server")


@app.get("/")
async def root() -> Dict[str, str]:
    return {"message": "Atropos-Agent Tool Server"}


@app.on_event("startup")
async def _startup() -> None:
    cfg = ToolServerConfig.from_env()

    # External-only registry. It will only include tools that are enabled by toolsets and
    # whose Hermes requirements/keys are satisfied in this process.
    tools: ToolRegistry = build_tool_registry(
        enabled_toolsets=["all"],
        disabled_toolsets=["terminal", "sandbox", "filesystem", "terminal_stateful", "default"],
        tool_server_url="enabled",
    )

    app.state.cfg = cfg
    app.state.tools = tools
    app.state.semaphore = asyncio.Semaphore(cfg.max_concurrency)


@app.get("/health")
async def health() -> Dict[str, Any]:
    return {"status": "ok"}


@app.get("/tools")
async def list_tools() -> Dict[str, Any]:
    tools: ToolRegistry = app.state.tools
    return {"tools": [s.to_dict() for s in tools.get_schemas()]}


def _check_auth(cfg: ToolServerConfig, authorization: Optional[str]) -> None:
    if not cfg.token:
        return
    if not authorization:
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
    if not authorization.lower().startswith("bearer "):
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
    token = authorization.split(" ", 1)[1].strip()
    if token != cfg.token:
        raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")


@app.post("/execute", response_model=ToolResultPayload)
async def execute_tool(
    req: ToolServerExecuteRequest,
    authorization: Optional[str] = Header(default=None),
) -> ToolResultPayload:
    cfg: ToolServerConfig = app.state.cfg
    _check_auth(cfg, authorization)

    tools: ToolRegistry = app.state.tools
    sem: asyncio.Semaphore = app.state.semaphore

    tool = tools.get(req.tool.name)
    if tool is None:
        return ToolResultPayload(
            success=False,
            error=f"Unknown tool: {req.tool.name}",
            uniq_id=req.tool.uniq_id,
        )

    async with sem:
        try:
            kwargs = dict(req.tool.arguments)
            sig = inspect.signature(tool.execute).parameters
            # Some tools can benefit from extra context.
            if req.trajectory_id and "trajectory_id" in sig:
                kwargs["trajectory_id"] = req.trajectory_id
            if req.slot_id and "slot_id" in sig:
                kwargs["slot_id"] = req.slot_id
            if req.container_addr and "container_addr" in sig:
                kwargs["container_addr"] = req.container_addr
            if "task_id" in sig:
                kwargs["task_id"] = req.trajectory_id
            result = await tool.execute(**kwargs)
        except Exception as e:
            return ToolResultPayload(
                success=False,
                error=f"Tool execution error: {e}",
                uniq_id=req.tool.uniq_id,
            )

    if result.uniq_id is None:
        result.uniq_id = req.tool.uniq_id
    return ToolResultPayload.from_tool_result(result)
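The `/execute` handler only forwards context kwargs (`trajectory_id`, `slot_id`, `container_addr`) that a tool's `execute` signature actually declares. A minimal sketch of that pattern, with a hypothetical tool and helper:

```python
import asyncio
import inspect
from typing import Optional

# Hypothetical tool: declares `trajectory_id` but not `slot_id` or `container_addr`.
async def execute(command: str, trajectory_id: Optional[str] = None):
    return {"command": command, "trajectory_id": trajectory_id}

def build_kwargs(fn, base: dict, context: dict) -> dict:
    # Forward only the context keys present in the tool's signature,
    # so tools that don't care about a key are never passed it.
    params = inspect.signature(fn).parameters
    extra = {k: v for k, v in context.items() if k in params}
    return {**base, **extra}

kwargs = build_kwargs(
    execute,
    {"command": "ls"},
    {"trajectory_id": "traj-1", "slot_id": "slot-9", "container_addr": "10.0.0.5"},
)
result = asyncio.run(execute(**kwargs))
print(result)  # -> {'command': 'ls', 'trajectory_id': 'traj-1'}
```

Passing `slot_id` to a tool that doesn't declare it would raise `TypeError: unexpected keyword argument`; filtering against the signature avoids forcing every tool to accept `**kwargs`.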
atropos/backends/__init__.py (27 lines, new file)
@@ -0,0 +1,27 @@
from __future__ import annotations

from typing import Any

from .base import ToolBackend
from .modal_backend import ModalSandboxConfig, ModalToolBackend
from .nomad_backend import NomadBackendConfig, NomadToolBackend


def create_tool_backend(cfg: Any) -> ToolBackend:
    mode = str(getattr(cfg, "tool_pool_mode", "nomad")).strip().lower()
    if mode == "nomad":
        return NomadToolBackend(NomadBackendConfig.from_agent_env_config(cfg))
    if mode == "modal":
        return ModalToolBackend(ModalSandboxConfig.from_agent_env_config(cfg))
    raise ValueError(f"Unknown tool_pool_mode: {mode}")


__all__ = [
    "ToolBackend",
    "create_tool_backend",
    "NomadBackendConfig",
    "NomadToolBackend",
    "ModalSandboxConfig",
    "ModalToolBackend",
]
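The factory normalizes `tool_pool_mode` (strip + lowercase, defaulting to `"nomad"`) before dispatching, and fails loudly on unknown modes. The same shape, with fake backend classes standing in for the real ones:

```python
class FakeNomadBackend:
    def __init__(self, address: str):
        self.address = address

class FakeModalBackend:
    def __init__(self, app_name: str):
        self.app_name = app_name

def create_backend(cfg) -> object:
    # Normalize-then-dispatch, as in `create_tool_backend` above.
    mode = str(getattr(cfg, "tool_pool_mode", "nomad")).strip().lower()
    if mode == "nomad":
        return FakeNomadBackend(getattr(cfg, "nomad_address", "http://localhost:4646"))
    if mode == "modal":
        return FakeModalBackend(getattr(cfg, "modal_app_name", "atropos-sandbox"))
    raise ValueError(f"Unknown tool_pool_mode: {mode}")

class Cfg:
    tool_pool_mode = "  Nomad "  # whitespace and case are normalized away

backend = create_backend(Cfg())
print(type(backend).__name__)  # -> FakeNomadBackend
```

Because the lookup uses `getattr(..., "nomad")`, a config object with no `tool_pool_mode` attribute at all still gets the Nomad backend.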
atropos/backends/base.py (89 lines, new file)
@@ -0,0 +1,89 @@
"""
Backend interfaces for AgentEnv tool execution.

The goal of this module is to decouple ToolExecutor / AgentEnv from any single
execution backend (Nomad/Docker today; Modal later).
"""

from __future__ import annotations

from typing import Any, Dict, List, Optional, Protocol, Tuple

from ..slots.executor import ExecutionResult
from ..slots.slot import Slot


class ToolBackend(Protocol):
    """
    Minimal interface required by ToolExecutor.

    Backends provide:
    - lifecycle (start/stop)
    - slot acquisition/release (workspace affinity)
    - batched tool execution across slots
    - optional artifact helpers (for env verification / demos)
    """

    @property
    def default_timeout_s(self) -> Optional[float]:
        """Default sandbox execution timeout in seconds (if any)."""

    async def start(self) -> None:
        """Start the backend (provision workers/containers, health checks, etc.)."""

    async def stop(self, *, purge: bool = False) -> None:
        """Stop the backend and optionally purge remote resources."""

    async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
        """Acquire a slot for a trajectory (workspace affinity)."""

    async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
        """Release a slot back to the pool."""

    async def execute_batch(
        self,
        requests: List[Tuple[Slot, str, Dict[str, Any]]],
        *,
        timeout_s: Optional[float] = None,
    ) -> List[ExecutionResult]:
        """Execute a batch of sandbox tool calls and return results in order."""

    # ---------------------------------------------------------------------
    # Optional artifact helpers (supported by the Nomad sandbox-server today)
    # ---------------------------------------------------------------------

    async def read_artifact(
        self,
        slot: Slot,
        path: str,
        *,
        encoding: str = "text",
        max_bytes: Optional[int] = None,
        include_sha256: bool = False,
        timeout_s: Optional[float] = None,
    ) -> Dict[str, Any]:
        raise NotImplementedError

    async def list_artifacts(
        self,
        slot: Slot,
        path: str = ".",
        *,
        recursive: bool = False,
        max_entries: Optional[int] = None,
        timeout_s: Optional[float] = None,
    ) -> Dict[str, Any]:
        raise NotImplementedError

    async def archive_artifacts(
        self,
        slot: Slot,
        path: str = ".",
        *,
        archive_format: str = "tar.gz",
        max_bytes: Optional[int] = None,
        max_entries: Optional[int] = None,
        timeout_s: Optional[float] = None,
    ) -> Dict[str, Any]:
        raise NotImplementedError
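Because `ToolBackend` is a `typing.Protocol`, conformance is structural: any class with the required surface works, with no inheritance needed. A throwaway in-memory backend of this shape (the `FakeSlot` class and dict results below are stand-ins for the real `Slot` / `ExecutionResult` types) is enough for unit tests:

```python
import asyncio
from typing import Any, Dict, List, Optional

class FakeSlot:
    def __init__(self, slot_id: str):
        self.slot_id = slot_id

class InMemoryBackend:
    """Structurally matches ToolBackend's required methods; no network, no containers."""

    @property
    def default_timeout_s(self) -> Optional[float]:
        return 30.0

    async def start(self) -> None:
        self._free = [FakeSlot("s0"), FakeSlot("s1")]

    async def stop(self, *, purge: bool = False) -> None:
        self._free = []

    async def acquire(self, trajectory_id: Optional[str] = None) -> FakeSlot:
        return self._free.pop()

    async def release(self, slot: FakeSlot, *, reset_workspace: bool = False) -> None:
        self._free.append(slot)

    async def execute_batch(self, requests, *, timeout_s: Optional[float] = None) -> List[Dict[str, Any]]:
        # Echo back one result per request, preserving order.
        return [{"slot": s.slot_id, "tool": name, "args": args} for s, name, args in requests]

async def demo():
    b = InMemoryBackend()
    await b.start()
    slot = await b.acquire("traj-1")
    out = await b.execute_batch([(slot, "bash", {"command": "echo hi"})])
    await b.release(slot)
    await b.stop()
    return out

results = asyncio.run(demo())
print(results[0]["tool"])  # -> bash
```

The artifact helpers are optional in the Protocol (default bodies raise `NotImplementedError`), so a test backend like this can simply omit them.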
atropos/backends/modal_backend.py (1179 lines, new file)
(File diff suppressed because it is too large.)
atropos/backends/nomad_backend.py (156 lines, new file)
@@ -0,0 +1,156 @@
"""
Nomad/Docker tool backend.

This backend is the current default for AgentEnv: it provisions a Nomad job
running `sandbox_server.py` and multiplexes stateless slots inside each container.
"""

from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

from ..slots import Slot, SlotPool, SlotPoolConfig
from ..slots.executor import ExecutionResult
from .base import ToolBackend


@dataclass(frozen=True)
class NomadBackendConfig:
    nomad_address: str
    sandbox_job_id: str
    sandbox_image: str
    slots_per_container: int
    min_containers: int
    max_containers: int
    privileged: bool
    acquire_timeout_s: float
    purge_job_on_start: bool
    # Driver selection: "docker" or "singularity"
    driver: str = "docker"
    # Path to .sif file for singularity driver (required if driver="singularity")
    singularity_image: Optional[str] = None

    @classmethod
    def from_agent_env_config(cls, cfg: Any) -> "NomadBackendConfig":
        return cls(
            nomad_address=str(getattr(cfg, "nomad_address")),
            sandbox_job_id=str(getattr(cfg, "sandbox_job_id")),
            sandbox_image=str(getattr(cfg, "sandbox_image")),
            slots_per_container=int(getattr(cfg, "slots_per_container")),
            min_containers=int(getattr(cfg, "min_containers")),
            max_containers=int(getattr(cfg, "max_containers")),
            privileged=bool(getattr(cfg, "privileged")),
            acquire_timeout_s=float(getattr(cfg, "acquire_timeout_s")),
            purge_job_on_start=bool(getattr(cfg, "purge_job_on_start", False)),
            driver=str(getattr(cfg, "driver", "docker")),
            singularity_image=getattr(cfg, "singularity_image", None),
        )


class NomadToolBackend(ToolBackend):
    def __init__(self, config: NomadBackendConfig):
        self.config = config
        self.pool = SlotPool(
            SlotPoolConfig(
                nomad_address=config.nomad_address,
                job_id=config.sandbox_job_id,
                image=config.sandbox_image,
                slots_per_container=config.slots_per_container,
                min_containers=config.min_containers,
                max_containers=config.max_containers,
                privileged=config.privileged,
                acquire_timeout=config.acquire_timeout_s,
                purge_job_on_start=bool(config.purge_job_on_start),
                driver=config.driver,
                singularity_image=config.singularity_image,
            )
        )

    @property
    def default_timeout_s(self) -> Optional[float]:
        t = getattr(self.pool.executor, "timeout", None)
        total = getattr(t, "total", None)
        try:
            return float(total) if total is not None else None
        except Exception:
            return None

    async def start(self) -> None:
        await self.pool.start()

    async def stop(self, *, purge: bool = False) -> None:
        await self.pool.stop(purge_job=purge)

    async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
        return await self.pool.acquire(trajectory_id)

    async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
        await self.pool.release(slot, reset_workspace=reset_workspace)

    async def execute_batch(
        self,
        requests: List[Tuple[Slot, str, Dict[str, Any]]],
        *,
        timeout_s: Optional[float] = None,
    ) -> List[ExecutionResult]:
        return await self.pool.execute_batch(requests, timeout=timeout_s)

    async def read_artifact(
        self,
        slot: Slot,
        path: str,
        *,
        encoding: str = "text",
        max_bytes: Optional[int] = None,
        include_sha256: bool = False,
        timeout_s: Optional[float] = None,
    ) -> Dict[str, Any]:
        return await self.pool.executor.read_artifact(
            slot,
            path,
            encoding=encoding,
            max_bytes=max_bytes,
            include_sha256=include_sha256,
            timeout=timeout_s,
        )

    async def list_artifacts(
        self,
        slot: Slot,
        path: str = ".",
        *,
        recursive: bool = False,
        max_entries: Optional[int] = None,
        timeout_s: Optional[float] = None,
    ) -> Dict[str, Any]:
        return await self.pool.executor.list_artifacts(
            slot,
            path,
            recursive=recursive,
            max_entries=max_entries,
            timeout=timeout_s,
        )

    async def archive_artifacts(
        self,
        slot: Slot,
        path: str = ".",
        *,
        archive_format: str = "tar.gz",
        max_bytes: Optional[int] = None,
        max_entries: Optional[int] = None,
        timeout_s: Optional[float] = None,
    ) -> Dict[str, Any]:
        return await self.pool.executor.archive_artifacts(
            slot,
            path,
            archive_format=archive_format,
            max_bytes=max_bytes,
            max_entries=max_entries,
            timeout=timeout_s,
        )

    def get_stats(self) -> Dict[str, Any]:
        return self.pool.get_stats()
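`default_timeout_s` pulls `.total` off the pool executor's timeout object defensively, returning `None` whenever the attribute is absent or not coercible to `float`. A self-contained sketch (the `FakeClientTimeout` class is a stand-in shaped like a timeout object exposing `.total`):

```python
from typing import Optional

class FakeClientTimeout:
    # Stand-in shaped like an object exposing `.total` (e.g. a client timeout).
    def __init__(self, total):
        self.total = total

def default_timeout_s(timeout_obj) -> Optional[float]:
    # Same defensive extraction as NomadToolBackend.default_timeout_s:
    # missing object, missing attribute, None, or un-floatable value -> None.
    total = getattr(timeout_obj, "total", None)
    try:
        return float(total) if total is not None else None
    except Exception:
        return None

print(default_timeout_s(FakeClientTimeout(30)))      # -> 30.0
print(default_timeout_s(FakeClientTimeout(None)))    # -> None
print(default_timeout_s(None))                       # -> None
print(default_timeout_s(FakeClientTimeout("oops")))  # -> None (float() raises)
```

This keeps the property total-function: callers never see an exception from a misconfigured timeout, only "no default".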
atropos/envs/__init__.py (10 lines, new file)
@@ -0,0 +1,10 @@
"""
Environment implementations for atropos-agent.
"""

from .agent_env import AgentEnv, AgentEnvConfig

# NOTE: Additional example envs exist as modules (e.g. `test_env`, `swe_smith_oracle_env`),
# but are intentionally not imported here to avoid pulling heavy optional deps at import time.

__all__ = ["AgentEnv", "AgentEnvConfig"]
atropos/envs/agent_env.py (537 lines, new file)
@@ -0,0 +1,537 @@
"""
AgentEnv - Atropos BaseEnv extension for agent/tool-call workloads.

AgentEnv is responsible for starting the sandbox tool execution backend and
providing helpers for running agent trajectories with queued/batched tool calls.
"""

from __future__ import annotations

import asyncio
import os
import time
import uuid
from abc import ABC, abstractmethod
from typing import Any, Awaitable, Callable, Dict, Generic, List, Optional, Tuple, TypeVar

from pydantic import Field

from atroposlib.envs.base import APIServerConfig, BaseEnv, BaseEnvConfig, Item, ScoredDataGroup, ScoredDataItem
from atroposlib.envs.server_handling.server_baseline import AsyncSemWithAdaptiveWeight

from ..agent import AgentConfig, AgentResult, AtroposAgent
from ..backends import ToolBackend, create_tool_backend
from ..tools import ToolRegistry, build_tool_registry
from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig


# Main BaseEnv child classes. Subclass THESE to get agent + tooling functionality easily.

class AgentEnvConfig(BaseEnvConfig):
    tool_pool_mode: str = Field(default="nomad", description="Tool execution backend ('nomad' or 'modal')")

    allow_network: bool = Field(
        default=True,
        description="Whether sandbox bash commands may access the network (env policy).",
    )
    require_sandbox: bool = Field(
        default=False,
        description="Fail closed if bubblewrap sandboxing is unavailable/unusable for stateless sandbox tools.",
    )
    require_stateful_sandbox: bool = Field(
        default=False,
        description="Fail closed if bubblewrap/PID isolation is unavailable for stateful terminal tools (tmux).",
    )
    tool_batch_window_ms: int = Field(default=20, description="ToolExecutor batching window (ms)")
    tool_max_batch_size: int = Field(default=200, description="ToolExecutor maximum batch size")

    # Nomad mode settings. TODO: add Modal support, split this into its own config.
    nomad_address: str = Field(default="http://localhost:4646", description="Nomad API address")
    sandbox_job_id: str = Field(default="atropos-sandbox-agent-env", description="Nomad job id for sandbox containers")
    sandbox_image: str = Field(default="atropos-sandbox:local", description="Docker image for sandbox containers")
    slots_per_container: int = Field(default=10, description="Nomad mode: slots per container")
    min_containers: int = Field(default=1, description="Nomad mode: minimum containers")
    max_containers: int = Field(default=10, description="Nomad mode: maximum containers")
    privileged: bool = Field(default=False, description="Nomad mode: run container privileged")
    acquire_timeout_s: float = Field(default=30.0, description="Slot acquisition timeout (seconds)")
    purge_job_on_start: bool = Field(
        default=False,
        description=(
            "Nomad mode: stop/purge the sandbox job on startup. This is helpful in local dev and training runs "
            "to recover from previous crashes that leave the job in a restart backoff state."
        ),
    )
    purge_job_on_shutdown: bool = Field(default=True, description="Nomad mode: stop/purge job on shutdown")

    # Nomad driver selection (docker or singularity)
    driver: str = Field(
        default="docker",
        description="Nomad task driver: 'docker' (default) or 'singularity' (for HPC without sudo Docker)",
    )
    singularity_image: Optional[str] = Field(
        default=None,
        description="Path to .sif file for Singularity driver (required if driver='singularity')",
    )

    # Modal mode settings
    modal_app_name: str = Field(default="atropos-sandbox", description="Modal app name prefix")
    modal_image: str = Field(default="python:3.11", description="Modal: container image")
    modal_gpu: Optional[str] = Field(default=None, description="Modal: GPU type (None, 'T4', 'A10G', 'A100', 'H100')")
    modal_cpu: float = Field(default=1.0, description="Modal: CPU cores")
    modal_memory: int = Field(default=2048, description="Modal: memory in MB")
    modal_slots_per_sandbox: int = Field(default=10, description="Modal: slots per sandbox")
    modal_min_sandboxes: int = Field(default=1, description="Modal: minimum sandboxes")
    modal_max_sandboxes: int = Field(default=5, description="Modal: maximum sandboxes")
    modal_idle_timeout: int = Field(default=120, description="Modal: server-side idle timeout (seconds)")
    modal_max_lifetime: int = Field(default=3600, description="Modal: max sandbox lifetime (seconds)")
    modal_acquire_timeout: float = Field(default=60.0, description="Modal: slot acquisition timeout (seconds)")
    modal_execution_timeout: float = Field(default=30.0, description="Modal: default command execution timeout (seconds)")
    modal_secrets: str = Field(default="", description="Modal: comma-separated list of Modal Secret names")
    modal_env_vars: str = Field(default="", description="Modal: semicolon-separated KEY=VALUE pairs for env vars")
    modal_workspace_base: str = Field(default="/data", description="Modal: workspace base directory in sandbox")

    # Basic agent defaults
    agent_max_steps: int = Field(default=50, description="Max ReACT steps per trajectory")
    agent_temperature: float = Field(default=0.7, description="Sampling temperature")
    agent_max_tokens: Optional[int] = Field(
        default=None,
        description="Max tokens per model response (default: let backend decide)",
    )
    agent_tool_delay_s: float = Field(default=0.0, description="Delay between tool calls (seconds)")
|
||||||
|
|
||||||
|
# tool selection
|
||||||
|
enabled_toolsets: List[str] = Field(
|
||||||
|
default_factory=lambda: ["default"],
|
||||||
|
description="Toolsets to enable (Hermes-style grouping).",
|
||||||
|
)
|
||||||
|
disabled_toolsets: List[str] = Field(
|
||||||
|
default_factory=list,
|
||||||
|
description="Toolsets to disable (applied after enabled_toolsets).",
|
||||||
|
)
|
||||||
|
|
||||||
|
# external ToolServer routing (Phase 4.5+)
|
||||||
|
tool_server_url: Optional[str] = Field(
|
||||||
|
default=None,
|
||||||
|
description="Base URL for external ToolServer (enables external tools).",
|
||||||
|
)
|
||||||
|
tool_server_token: Optional[str] = Field(
|
||||||
|
default=None,
|
||||||
|
description="Bearer token for ToolServer auth (optional in dev).",
|
||||||
|
)
|
||||||
|
|
||||||
|
AgentEnvConfigT = TypeVar("AgentEnvConfigT", bound="AgentEnvConfig")
|
||||||
|
|
||||||
|
|
||||||
|
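The `modal_env_vars` field above is declared as "semicolon-separated KEY=VALUE pairs". A plausible parser for that format looks like the sketch below; the helper name `parse_env_pairs` is hypothetical and not the parser this repo actually uses:

```python
def parse_env_pairs(raw: str) -> dict:
    """Parse 'A=1;B=2' into {'A': '1', 'B': '2'}; empty or malformed chunks are skipped."""
    pairs = {}
    for chunk in raw.split(";"):
        chunk = chunk.strip()
        if not chunk or "=" not in chunk:
            continue  # ignore blanks and chunks without a KEY=VALUE shape
        key, value = chunk.split("=", 1)  # split on the first '=' only
        pairs[key.strip()] = value.strip()
    return pairs

result = parse_env_pairs("HF_TOKEN=abc; DEBUG=1;")
```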
class AgentEnv(BaseEnv, ABC, Generic[AgentEnvConfigT]):
    env_config_cls = AgentEnvConfig

    def __init__(
        self,
        config: AgentEnvConfigT,
        server_configs: List[APIServerConfig],
        slurm: bool = False,
        testing: bool = False,
    ):
        super().__init__(config, server_configs, slurm, testing)
        self.config: AgentEnvConfigT = config

        self.tools: ToolRegistry = self.build_tools()

        self._backend: Optional[ToolBackend] = None
        self._tool_executor: Optional[ToolExecutor] = None
        self._tool_server_inprocess: bool = False
        self._trajectory_workspace_meta: Dict[str, Dict[str, Any]] = {}

    def build_tools(self) -> ToolRegistry:
        """Wrap the original Hermes-Agent ToolRegistry for atropos AgentEnv use.

        See the Hermes-Agent docs for the available toolsets and tools.
        """
        return build_tool_registry(
            enabled_toolsets=self.config.enabled_toolsets or ["default"],
            disabled_toolsets=self.config.disabled_toolsets or None,
            tool_server_url=self.config.tool_server_url,
        )

    @abstractmethod
    def build_task(self, item: Item) -> str:
        """Return the user-facing task string for the agent."""

    @abstractmethod
    async def score_trajectory(self, item: Item, final_response: str) -> float:
        """Return a scalar score for this trajectory."""

    async def setup_trajectory_workspace(
        self,
        item: Item,
        *,
        trajectory_id: str,
        exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
    ) -> Dict[str, Any]:
        """
        Optional hook: prepare the sandbox workspace before the agent starts.

        Examples:
        - clone a repo and check out a commit
        - write fixture files (e.g. images) for external-tool demos
        - pre-install dependencies

        Default: no-op.
        """
        _ = (item, trajectory_id, exec_tool)
        return {}

    async def verify_and_score_trajectory(
        self,
        item: Item,
        final_response: str,
        *,
        trajectory_id: str,
        exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
        agent_result: Optional[AgentResult] = None,
        workspace_meta: Optional[Dict[str, Any]] = None,
    ) -> tuple[float, Dict[str, Any]]:
        """
        Optional hook: run in-sandbox verification before scoring.

        Many agent envs need to execute verification inside the same trajectory
        workspace (e.g. pytest) before releasing/resetting the slot.

        Default: calls `score_trajectory()` and returns empty metadata.
        """
        _ = (trajectory_id, exec_tool, agent_result, workspace_meta)  # default ignores in-workspace verification
        score = await self.score_trajectory(item, final_response)
        return score, {}

    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
        return AgentConfig(
            max_steps=self.config.agent_max_steps,
            temperature=self.config.agent_temperature,
            max_tokens=self.config.agent_max_tokens,
            tool_delay_s=self.config.agent_tool_delay_s,
        )
    async def setup(self) -> None:
        print(f"[AgentEnv] setup(): starting tool backend ({self.config.tool_pool_mode})", flush=True)
        await self._start_tool_backend()
        print("[AgentEnv] setup(): configuring server concurrency", flush=True)
        self._configure_server_concurrency()
        print("[AgentEnv] setup(): running env-specific setup_agent_env()", flush=True)
        await self.setup_agent_env()
        print("[AgentEnv] setup(): done", flush=True)

    def _configure_server_concurrency(self) -> None:
        """
        Ensure the LLM server concurrency isn't accidentally capped below `group_size`.

        In `BaseEnv process` mode, groups are collected concurrently; if the underlying
        ServerManager/OpenAIServer semaphore is left at 1, inference is serialized even
        when `--env.group_size` is > 1.
        """
        desired = int(getattr(self.config, "group_size", 1) or 1)
        if desired <= 1:
            return

        servers = getattr(self.server, "servers", None)
        if not isinstance(servers, list) or not servers:
            return

        for s in servers:
            sem = getattr(s, "sem", None)
            eval_sem = getattr(s, "eval_sem", None)
            # Only increase; never shrink.
            if sem is not None and getattr(sem, "max_val", 0) < desired:
                s.sem = AsyncSemWithAdaptiveWeight(desired)
                if hasattr(s, "config") and hasattr(s.config, "num_max_requests_at_once"):
                    s.config.num_max_requests_at_once = desired
            if eval_sem is not None and getattr(eval_sem, "max_val", 0) < desired:
                s.eval_sem = AsyncSemWithAdaptiveWeight(desired)
                if hasattr(s, "config") and hasattr(s.config, "num_requests_for_eval"):
                    s.config.num_requests_for_eval = desired

    @abstractmethod
    async def setup_agent_env(self) -> None:
        """Subclass hook for env-specific setup."""

    async def evaluate(self, *args, **kwargs):  # noqa: ARG002
        """
        Default eval hook (no-op).

        Atropos BaseEnv requires an `evaluate()` implementation. Many agent envs
        won't have a meaningful evaluation path during early PoC work; they can
        override this when needed.
        """
        return {}

    async def env_manager(self):
        try:
            return await super().env_manager()
        finally:
            await self.shutdown_tool_backend()

    async def process_manager(self):
        try:
            return await super().process_manager()
        finally:
            await self.shutdown_tool_backend()
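The concurrency fix in `_configure_server_concurrency` above rests on a basic asyncio fact: a semaphore initialized at 1 serializes otherwise-concurrent requests, while one sized to the group width lets them overlap. A self-contained sketch of that behavior (toy request function, not the real ServerManager):

```python
import asyncio

async def fake_request(sem: asyncio.Semaphore, log: list, i: int) -> None:
    async with sem:
        log.append(("start", i))
        await asyncio.sleep(0)  # yield control, simulating I/O
        log.append(("end", i))

async def run(width: int) -> list:
    sem = asyncio.Semaphore(width)
    log: list = []
    await asyncio.gather(*(fake_request(sem, log, i) for i in range(3)))
    return log

serialized = asyncio.run(run(1))  # start/end pairs never interleave
parallel = asyncio.run(run(3))    # all three requests start before any finishes
```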
    async def _start_tool_backend(self) -> None:
        if self._tool_executor is not None:
            return

        tool_server_url = self.config.tool_server_url
        tool_server_client = None
        if tool_server_url == "inprocess":
            import httpx
            from ..api.tool_server import app as tool_server_app

            await tool_server_app.router.startup()
            tool_server_client = httpx.AsyncClient(
                transport=httpx.ASGITransport(app=tool_server_app),
                base_url="http://toolserver",
            )
            tool_server_url = "http://toolserver"
            self._tool_server_inprocess = True

        backend = create_tool_backend(self.config)
        await backend.start()

        executor = ToolExecutor(
            backend=backend,
            tools=self.tools,
            config=ToolExecutorConfig(
                batch_window_ms=self.config.tool_batch_window_ms,
                max_batch_size=self.config.tool_max_batch_size,
                allow_network=self.config.allow_network,
                require_sandbox=self.config.require_sandbox,
                require_stateful_sandbox=self.config.require_stateful_sandbox,
                tool_server_url=tool_server_url,
                tool_server_token=self.config.tool_server_token,
            ),
        )
        await executor.start()
        if tool_server_client is not None:
            executor._tool_server_client = tool_server_client  # type: ignore[attr-defined]

        self._backend = backend
        self._tool_executor = executor

    async def shutdown_tool_backend(self) -> None:
        executor = self._tool_executor
        backend = self._backend
        inprocess_tool_server = self._tool_server_inprocess
        self._tool_executor = None
        self._backend = None
        self._tool_server_inprocess = False

        if executor is not None:
            await executor.close()
        if backend is not None:
            await backend.stop(purge=bool(self.config.purge_job_on_shutdown))
        if inprocess_tool_server:
            from ..api.tool_server import app as tool_server_app

            await tool_server_app.router.shutdown()
    async def collect_trajectory(
        self, item: Item
    ) -> Tuple[Optional[ScoredDataItem], List[Item]]:
        if self._tool_executor is None:
            raise RuntimeError("Tool backend not started")

        trajectory_id = str(uuid.uuid4())
        t0 = time.perf_counter()
        print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} start", flush=True)
        task = self.build_task(item)
        agent_config = self.build_agent_config(item)
        if os.getenv("ATROPOS_DEBUG_PRINT_TASK") == "1":
            print(f"Starting trajectory {trajectory_id} with task: {task}", flush=True)
        else:
            # Avoid printing the full task prompt by default (can be huge/noisy).
            one_line = " ".join(str(task).splitlines()).strip()
            preview = one_line[:240] + ("…" if len(one_line) > 240 else "")
            print(f"Starting trajectory {trajectory_id} (task preview): {preview}", flush=True)

        async def _exec(call):
            return await self._tool_executor.execute(trajectory_id, call)

        agent = AtroposAgent(
            server=self.server,
            tokenizer=self.tokenizer,
            tools=self.tools,
            config=agent_config,
            execute_tool=_exec,
        )

        try:
            print(f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() start", flush=True)
            workspace_meta = await self.setup_trajectory_workspace(item, trajectory_id=trajectory_id, exec_tool=_exec)
            if not isinstance(workspace_meta, dict):
                workspace_meta = {}
            self._trajectory_workspace_meta[trajectory_id] = workspace_meta
            print(
                f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() done in {time.perf_counter() - t0:.2f}s",
                flush=True,
            )

            print(f"[AgentEnv] tid={trajectory_id} agent.run() start", flush=True)
            result = await agent.run(task)
            print(
                f"[AgentEnv] tid={trajectory_id} agent.run() done in {time.perf_counter() - t0:.2f}s "
                f"success={result.success} tool_calls={result.total_tool_calls}",
                flush=True,
            )
            if not result.success or result.trajectory_data is None:
                # Do not trigger BaseEnv retries for agent failures.
                # Record the trajectory with score 0.0 so training/eval can see the failure mode.
                messages = [{"role": "system", "content": agent._build_system_prompt()}]  # noqa: SLF001
                messages.append({"role": "user", "content": task})
                for step in result.steps:
                    messages.append({"role": "assistant", "content": step.assistant_message})
                    if step.tool_results:
                        tool_text = "\n".join(r.to_xml() for r in step.tool_results)
                        messages.append({"role": "user", "content": tool_text})

                scored: ScoredDataItem = {
                    "tokens": (result.trajectory_data.tokens if result.trajectory_data else []),
                    "masks": (result.trajectory_data.masked_tokens if result.trajectory_data else []),
                    "scores": 0.0,
                }
                if result.trajectory_data is not None:
                    scored["inference_logprobs"] = result.trajectory_data.logprobs  # type: ignore[typeddict-unknown-key]
                    if getattr(result.trajectory_data, "metadata", None):
                        scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
                if self.config.include_messages:
                    # Record a final failure marker as a user-side tool_response-like block so it survives templates.
                    import json

                    err = result.error or "agent_failed"
                    messages.append(
                        {
                            "role": "user",
                            "content": f"<tool_response>{json.dumps({'success': False, 'error': err})}</tool_response>",
                        }
                    )
                    scored["messages"] = messages
                return scored, []

            print(f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() start", flush=True)
            score, score_metadata = await self.verify_and_score_trajectory(
                item,
                result.final_response,
                trajectory_id=trajectory_id,
                exec_tool=_exec,
                agent_result=result,
                workspace_meta=workspace_meta,
            )
            print(
                f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() done in {time.perf_counter() - t0:.2f}s "
                f"score={score}",
                flush=True,
            )

            messages = [{"role": "system", "content": agent._build_system_prompt()}]  # noqa: SLF001
            messages.append({"role": "user", "content": task})
            for step in result.steps:
                messages.append({"role": "assistant", "content": step.assistant_message})
                if step.tool_results:
                    tool_text = "\n".join(r.to_xml() for r in step.tool_results)
                    messages.append({"role": "user", "content": tool_text})

            # Optional: allow env verification to attach additional messages (e.g. install logs).
            if self.config.include_messages and isinstance(score_metadata, dict):
                extra = score_metadata.get("verification_messages")
                if isinstance(extra, list):
                    for m in extra:
                        if isinstance(m, dict) and isinstance(m.get("role"), str) and isinstance(m.get("content"), str):
                            messages.append({"role": m["role"], "content": m["content"]})

            scored: ScoredDataItem = {
                "tokens": result.trajectory_data.tokens,
                "masks": result.trajectory_data.masked_tokens,
                "scores": score,
            }
            # Atroposlib expects policy logprobs at the *group* level under `inference_logprobs`.
            # We stash per-item values here and lift them into the group in `collect_trajectories()`.
            scored["inference_logprobs"] = result.trajectory_data.logprobs  # type: ignore[typeddict-unknown-key]
            if getattr(result.trajectory_data, "metadata", None):
                scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
            if self.config.include_messages:
                scored["messages"] = messages

            return scored, []
        finally:
            self._trajectory_workspace_meta.pop(trajectory_id, None)
            print(f"[AgentEnv] tid={trajectory_id} release_trajectory(reset_workspace=True)", flush=True)
            await self._tool_executor.release_trajectory(trajectory_id, reset_workspace=True)
            print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} done in {time.perf_counter() - t0:.2f}s", flush=True)
    async def collect_trajectories(
        self, item: Item
    ) -> Tuple[Optional[ScoredDataGroup], List[Item]]:
        tasks = [self.collect_trajectory(item) for _ in range(self.config.group_size)]
        results = await asyncio.gather(*tasks)

        backlog: List[Item] = []
        items: List[ScoredDataItem] = []
        for scored, b in results:
            backlog.extend(b)
            if scored is not None:
                items.append(scored)

        if len(items) != self.config.group_size:
            return None, backlog

        group: ScoredDataGroup = ScoredDataGroup(
            tokens=[],
            masks=[],
            scores=[],
            advantages=[],
            ref_logprobs=[],
            messages=[] if self.config.include_messages else None,
            inference_logprobs=[],
            group_overrides={},
            overrides=[],
            images=[],
            generation_params=None,
        )

        for it in items:
            group["tokens"].append(it["tokens"])
            group["masks"].append(it["masks"])
            group["scores"].append(it["scores"])
            # policy logprobs (for PPO/GRPO training) if present
            lp = it.get("inference_logprobs")  # type: ignore[typeddict-item]
            if lp is not None:
                group["inference_logprobs"].append(lp)
            group["overrides"].append(it.get("overrides") or {})  # type: ignore[typeddict-item]
            if group.get("messages") is not None and it.get("messages") is not None:
                group["messages"].append(it["messages"])

        return group, backlog
    async def run_agent(self, task: str, *, trajectory_id: Optional[str] = None) -> Tuple[str, Dict[str, Any]]:
        """
        Run the AtroposAgent on a single task and return (final_response, debug).

        This is a helper intended for simple environments and tests.
        """
        if self._tool_executor is None:
            raise RuntimeError("Tool backend not started")

        tid = trajectory_id or str(uuid.uuid4())

        async def _exec(call):
            return await self._tool_executor.execute(tid, call)

        agent = AtroposAgent(
            server=self.server,
            tokenizer=self.tokenizer,
            tools=self.tools,
            config=AgentConfig(
                max_steps=self.config.agent_max_steps,
                temperature=self.config.agent_temperature,
                max_tokens=self.config.agent_max_tokens,
            ),
            execute_tool=_exec,
        )
        result = await agent.run(task)
        await self._tool_executor.release_trajectory(tid, reset_workspace=True)
        return result.final_response, {"success": result.success, "error": result.error, "tool_calls": result.total_tool_calls}
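The AgentEnv contract above boils down to two abstract hooks: `build_task` turns a dataset item into a prompt, and `score_trajectory` turns the agent's final response into a scalar reward. A minimal self-contained sketch of that shape, using a stand-in base class and a hypothetical `EchoEnv` (neither is part of this diff; the real class also wires up tool backends, tokenizers, and servers):

```python
import asyncio
from abc import ABC, abstractmethod

class MiniAgentEnv(ABC):
    """Stand-in for the AgentEnv hook contract (illustration only)."""

    @abstractmethod
    def build_task(self, item: dict) -> str: ...

    @abstractmethod
    async def score_trajectory(self, item: dict, final_response: str) -> float: ...

class EchoEnv(MiniAgentEnv):
    def build_task(self, item: dict) -> str:
        return f"Repeat exactly: {item['text']}"

    async def score_trajectory(self, item: dict, final_response: str) -> float:
        # Binary reward: 1.0 iff the agent echoed the target string.
        return 1.0 if item["text"] in final_response else 0.0

env = EchoEnv()
task = env.build_task({"text": "hello"})
score = asyncio.run(env.score_trajectory({"text": "hello"}, "hello"))
```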
atropos/envs/endless_terminals_env.py (new file, 873 lines)
@@ -0,0 +1,873 @@
"""
|
||||||
|
Endless Terminals Environment for Hermes-Agent + Atropos RL.
|
||||||
|
|
||||||
|
Runs terminal tasks from the Endless Terminals dataset.
|
||||||
|
Supports three modes:
|
||||||
|
1. Local directory: tasks from a local folder of task_* dirs (default)
|
||||||
|
2. HuggingFace dataset: tasks from a HF dataset
|
||||||
|
3. Procedural: generate tasks on-the-fly via LLM (requires vLLM)
|
||||||
|
|
||||||
|
Each task provides a Dockerfile that defines the initial environment.
|
||||||
|
The agent solves the task using terminal commands inside a Docker container.
|
||||||
|
Scoring is done by running pytest on `test_final_state.py` in the container.
|
||||||
|
|
||||||
|
Run (standalone process mode):
|
||||||
|
python -m atropos.envs.endless_terminals_env process \
|
||||||
|
--env.use_wandb false \
|
||||||
|
--env.total_steps 100 \
|
||||||
|
--env.group_size 4
|
||||||
|
|
||||||
|
Run (Tinker serve mode):
|
||||||
|
# Terminal 1: run-api
|
||||||
|
# Terminal 2: python launch_training.py --config configs/endless_terminals.yaml
|
||||||
|
# Terminal 3:
|
||||||
|
TINKER_CONFIG=configs/endless_terminals.yaml \
|
||||||
|
ENDLESS_TERMINALS_DIR=/path/to/endless-terminals \
|
||||||
|
python -m atropos.envs.endless_terminals_env serve
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import base64
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
import shutil
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
import tempfile
|
||||||
|
import uuid
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
from pydantic import Field
|
||||||
|
|
||||||
|
from atroposlib.envs.base import APIServerConfig, Item
|
||||||
|
|
||||||
|
from ..agent import AgentConfig
|
||||||
|
from ..backends.docker_direct_backend import (
|
||||||
|
DockerDirectBackend,
|
||||||
|
build_docker_image,
|
||||||
|
docker_image_exists,
|
||||||
|
)
|
||||||
|
from ..tools import ToolCall
|
||||||
|
from .agent_env import AgentEnv, AgentEnvConfig
|
||||||
|
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
# Tinker integration
# ---------------------------------------------------------------------------
# When TINKER_CONFIG is set, we load model/training params from the Tinker YAML.
# Custom env fields (ENDLESS_TERMINALS_DIR, etc.) are always read from env vars.
TINKER_CONFIG = os.getenv("TINKER_CONFIG", "")


def _load_tinker_config():
    """Load TinkerAtroposConfig if available, else return None."""
    if not TINKER_CONFIG:
        return None
    config_path = Path(TINKER_CONFIG)
    if not config_path.exists():
        print(f"[EndlessTerminalsEnv] TINKER_CONFIG={TINKER_CONFIG} not found, ignoring", flush=True)
        return None
    try:
        from tinker_atropos.config import TinkerAtroposConfig

        config = TinkerAtroposConfig.from_yaml(config_path)
        print(f"[EndlessTerminalsEnv] Loaded Tinker config from {config_path}", flush=True)
        return config
    except ImportError:
        print("[EndlessTerminalsEnv] tinker_atropos not installed, ignoring TINKER_CONFIG", flush=True)
        return None
    except Exception as e:
        print(f"[EndlessTerminalsEnv] Error loading Tinker config: {e}", flush=True)
        return None
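`_load_tinker_config` above follows the optional-dependency pattern: bail out early when the path is unset or missing, and swallow `ImportError` so the env still runs without the extra package. The same shape in isolation (the function and the `some_optional_library` module name are hypothetical, for illustration only):

```python
from pathlib import Path

def load_optional_config(path_str: str):
    """Return a parsed config, or None when the path or the library is missing."""
    if not path_str:
        return None  # feature not configured
    path = Path(path_str)
    if not path.exists():
        return None  # configured but file absent: warn-and-ignore semantics
    try:
        import some_optional_library  # hypothetical optional dependency
    except ImportError:
        return None  # library not installed: degrade gracefully
    return some_optional_library.parse(path)

missing = load_optional_config("")                 # no path configured
absent = load_optional_config("/no/such/file.yaml")  # path configured, file absent
```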
# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------


class EndlessTerminalsEnvConfig(AgentEnvConfig):
    """Configuration for Endless Terminals environment."""

    # ---- Local directory mode (primary) ----
    use_local_dir: bool = Field(
        default=True,
        description="Load tasks from a local directory of task_* folders.",
    )
    local_tasks_dir: str = Field(
        default="",
        description="Path to directory containing task_* folders. Required if use_local_dir=True.",
    )
    prebuild_images: bool = Field(
        default=False,
        description="Pre-build ALL Docker images during setup (slow but avoids build-during-training).",
    )
    max_concurrent_builds: int = Field(
        default=4,
        description="Max parallel Docker image builds during pre-build.",
    )

    # ---- HuggingFace dataset mode ----
    use_dataset: bool = Field(
        default=False,
        description="Load tasks from HuggingFace dataset.",
    )
    dataset_name: str = Field(
        default="obiwan96/endless-terminals-train",
        description="HuggingFace dataset name (if use_dataset=True)",
    )
    dataset_split: str = Field(default="train")
    dataset_cache_dir: str = Field(default="~/.cache/huggingface/datasets")
    tasks_base_dir: str = Field(
        default="",
        description="Base directory containing task_* folders (for dataset mode path resolution).",
    )

    # ---- Procedural generation mode ----
    task_gen_model: str = Field(default="Qwen/Qwen3-32B")
    task_gen_temperature: float = Field(default=1.0)
    task_gen_max_tokens: int = Field(default=2048)

    # ---- Container / scoring ----
    container_build_timeout_s: float = Field(default=600.0, description="Docker build timeout")
    test_timeout_s: int = Field(default=120, description="Test execution timeout (seconds)")
    keep_failed_tasks: bool = Field(default=False)

    # ---- Agent defaults ----
    agent_max_steps: int = Field(default=32)
    agent_temperature: float = Field(default=0.7)

    # ---- Docker image prefix ----
    docker_image_prefix: str = Field(
        default="endless-terminals",
        description="Docker image name prefix for built task images.",
    )

    # ---- Server defaults ----
    server_base_url: str = Field(default="http://127.0.0.1:8080")
    server_model: str = Field(default="hermes-4-36b")
    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B")
# ---------------------------------------------------------------------------
# Env
# ---------------------------------------------------------------------------


class EndlessTerminalsEnv(AgentEnv[EndlessTerminalsEnvConfig]):
    """
    Endless Terminals environment.

    Each task:
    1. Has a Dockerfile defining the initial container state
    2. Has an instruction.md describing what the agent should do
    3. Has tests/test_final_state.py to verify completion

    Flow per trajectory:
    1. get_next_item() → picks a task
    2. setup_trajectory_workspace() → builds Docker image, registers with backend
    3. Agent solves task via terminal commands (docker exec in the container)
    4. verify_and_score_trajectory() → runs pytest in container, returns binary reward
    """

    name = "endless_terminals_env"
    env_config_cls = EndlessTerminalsEnvConfig

    def __init__(
        self,
        config: EndlessTerminalsEnvConfig,
        server_configs: List[APIServerConfig],
        slurm: bool = False,
        testing: bool = False,
    ):
        super().__init__(config, server_configs, slurm, testing)
        self._iteration = 0

        # Local dir mode
        self._local_tasks: List[Dict[str, Any]] = []
        self._local_task_indices: List[int] = []
        self._local_current_index = 0

        # Eval split (held-out tasks)
        self._eval_tasks: List[Dict[str, Any]] = []

        # Training metrics
        self._train_scores_buffer: List[float] = []
        self._eval_metrics: List[tuple] = []

        # HF dataset mode
        self._dataset = None
        self._dataset_indices: List[int] = []
        self._dataset_current_index = 0

        # Docker image cache: task_name -> image_tag
        self._image_cache: Dict[str, str] = {}
        self._build_lock = asyncio.Lock()

    # ---- Config init (CLI) ----

    @classmethod
    def config_init(cls) -> Tuple[EndlessTerminalsEnvConfig, List[APIServerConfig]]:
        """
        Initialize config.

        Two modes:
        1. Tinker mode: TINKER_CONFIG env var points to a Tinker YAML.
           Model, training params, and server config come from the YAML.
        2. Standalone mode: Everything from env vars (ATROPOS_SERVER_*, etc.)

        In both modes, Endless Terminals-specific fields (ENDLESS_TERMINALS_DIR,
        PREBUILD_IMAGES, etc.) are always read from env vars.
        """
        tinker_cfg = _load_tinker_config()

        # ── Endless Terminals-specific fields (always from env vars) ──
        local_tasks_dir = os.getenv("ENDLESS_TERMINALS_DIR", "")
        use_local_dir = bool(local_tasks_dir)

        if tinker_cfg is not None:
            # ── Tinker mode ─────────────────────────────────────────
            print("[EndlessTerminalsEnv] Using Tinker config", flush=True)

            env_config = EndlessTerminalsEnvConfig(
                # Standard Atropos fields from Tinker YAML
                tokenizer_name=tinker_cfg.base_model,
                group_size=tinker_cfg.group_size,
                use_wandb=tinker_cfg.use_wandb,
                rollout_server_url=tinker_cfg.atropos_api_url,
                total_steps=tinker_cfg.num_steps,
                batch_size=tinker_cfg.batch_size,
                steps_per_eval=tinker_cfg.steps_per_eval,
                max_token_length=tinker_cfg.max_token_env_length,
                max_num_workers=tinker_cfg.max_num_workers,
                max_batches_offpolicy=tinker_cfg.max_batches_offpolicy,
                ensure_scores_are_not_same=tinker_cfg.ensure_scores_are_not_same,
                wandb_name=f"{tinker_cfg.wandb_run_name}-env",
                include_messages=True,
                # Tooling: terminal only
                enabled_toolsets=["terminal"],
                disabled_toolsets=[],
                # Agent config
                agent_max_steps=int(os.getenv("AGENT_MAX_STEPS", "32")),
                agent_temperature=float(os.getenv("AGENT_TEMPERATURE", "0.7")),
                # Docker-direct backend (no Nomad needed)
                tool_pool_mode="docker_direct",
                sandbox_image="ubuntu:22.04",
                purge_job_on_start=False,
                purge_job_on_shutdown=False,
                # Endless Terminals fields
                use_local_dir=use_local_dir,
                local_tasks_dir=local_tasks_dir,
                prebuild_images=os.getenv("PREBUILD_IMAGES", "false").lower() == "true",
                use_dataset=os.getenv("USE_DATASET", "false").lower() == "true",
                dataset_name=os.getenv("ENDLESS_DATASET", "obiwan96/endless-terminals-train"),
                container_build_timeout_s=float(os.getenv("CONTAINER_BUILD_TIMEOUT", "600")),
                test_timeout_s=int(os.getenv("TEST_TIMEOUT", "120")),
            )

            server_configs = [
                APIServerConfig(
                    model_name=tinker_cfg.base_model,
                    base_url=tinker_cfg.inference_api_url + "/v1",
                    api_key="x",
                    server_type="sglang",
                    num_requests_for_eval=tinker_cfg.num_requests_for_eval,
                    timeout=600,  # Longer timeout for multi-step agent trajectories
                ),
            ]
            return env_config, server_configs

        else:
            # ── Standalone mode (env vars) ──────────────────────────
            base_url = (
                os.getenv("ATROPOS_SERVER_BASE_URL")
                or os.getenv("OPENAI_BASE_URL")
                or os.getenv("LLM_BASE_URL")
|
||||||
|
or "http://127.0.0.1:8080"
|
||||||
|
)
|
||||||
|
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
|
||||||
|
api_key = (
|
||||||
|
os.getenv("ATROPOS_SERVER_API_KEY")
|
||||||
|
or os.getenv("NOUS_API_KEY")
|
||||||
|
or os.getenv("OPENAI_API_KEY")
|
||||||
|
or "local"
|
||||||
|
)
|
||||||
|
|
||||||
|
env_config = EndlessTerminalsEnvConfig(
|
||||||
|
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
|
||||||
|
group_size=int(os.getenv("ATROPOS_GROUP_SIZE", "4")),
|
||||||
|
use_wandb=os.getenv("USE_WANDB", "false").lower() == "true",
|
||||||
|
include_messages=True,
|
||||||
|
total_steps=int(os.getenv("ATROPOS_TOTAL_STEPS", "1000")),
|
||||||
|
batch_size=int(os.getenv("ATROPOS_BATCH_SIZE", "32")),
|
||||||
|
server_base_url=base_url,
|
||||||
|
server_model=model,
|
||||||
|
|
||||||
|
# Tooling
|
||||||
|
enabled_toolsets=["terminal"],
|
||||||
|
disabled_toolsets=[],
|
||||||
|
|
||||||
|
# Agent
|
||||||
|
agent_max_steps=int(os.getenv("AGENT_MAX_STEPS", "32")),
|
||||||
|
agent_temperature=float(os.getenv("AGENT_TEMPERATURE", "0.7")),
|
||||||
|
|
||||||
|
# Docker-direct backend
|
||||||
|
tool_pool_mode="docker_direct",
|
||||||
|
sandbox_image="ubuntu:22.04",
|
||||||
|
purge_job_on_start=False,
|
||||||
|
purge_job_on_shutdown=False,
|
||||||
|
|
||||||
|
# Endless Terminals fields
|
||||||
|
use_local_dir=use_local_dir,
|
||||||
|
local_tasks_dir=local_tasks_dir,
|
||||||
|
prebuild_images=os.getenv("PREBUILD_IMAGES", "false").lower() == "true",
|
||||||
|
use_dataset=os.getenv("USE_DATASET", "false").lower() == "true",
|
||||||
|
dataset_name=os.getenv("ENDLESS_DATASET", "obiwan96/endless-terminals-train"),
|
||||||
|
task_gen_model=os.getenv("TASK_GEN_MODEL", "Qwen/Qwen3-32B"),
|
||||||
|
container_build_timeout_s=float(os.getenv("CONTAINER_BUILD_TIMEOUT", "600")),
|
||||||
|
test_timeout_s=int(os.getenv("TEST_TIMEOUT", "120")),
|
||||||
|
)
|
||||||
|
|
||||||
|
server_configs = [
|
||||||
|
APIServerConfig(
|
||||||
|
model_name=model,
|
||||||
|
base_url=f"{base_url.rstrip('/')}/v1",
|
||||||
|
api_key=api_key,
|
||||||
|
num_max_requests_at_once=int(os.getenv("MAX_CONCURRENT_REQUESTS", "4")),
|
||||||
|
num_requests_for_eval=int(os.getenv("MAX_EVAL_REQUESTS", "4")),
|
||||||
|
timeout=300,
|
||||||
|
)
|
||||||
|
]
|
||||||
|
return env_config, server_configs
|
||||||
|
|
||||||
|
# ---- Setup ----
|
||||||
|
|
||||||
|
async def setup_agent_env(self) -> None:
|
||||||
|
"""Env-specific setup: scan tasks and optionally pre-build images."""
|
||||||
|
if self.config.use_local_dir:
|
||||||
|
await self._setup_local_dir()
|
||||||
|
elif self.config.use_dataset:
|
||||||
|
await self._setup_hf_dataset()
|
||||||
|
else:
|
||||||
|
print("[EndlessTerminalsEnv] Using procedural task generation", flush=True)
|
||||||
|
|
||||||
|
async def _setup_local_dir(self) -> None:
|
||||||
|
"""Scan local directory for task_* folders."""
|
||||||
|
tasks_dir = Path(self.config.local_tasks_dir).expanduser().resolve()
|
||||||
|
if not tasks_dir.is_dir():
|
||||||
|
raise RuntimeError(f"local_tasks_dir does not exist: {tasks_dir}")
|
||||||
|
|
||||||
|
print(f"[EndlessTerminalsEnv] Scanning {tasks_dir} for tasks...", flush=True)
|
||||||
|
|
||||||
|
tasks = []
|
||||||
|
for entry in sorted(tasks_dir.iterdir()):
|
||||||
|
if not entry.is_dir() or not entry.name.startswith("task_"):
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Validate required files
|
||||||
|
dockerfile = entry / "environment" / "Dockerfile"
|
||||||
|
instruction = entry / "instruction.md"
|
||||||
|
test_final = entry / "tests" / "test_final_state.py"
|
||||||
|
|
||||||
|
if not dockerfile.exists():
|
||||||
|
continue
|
||||||
|
if not instruction.exists():
|
||||||
|
continue
|
||||||
|
if not test_final.exists():
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Read task metadata
|
||||||
|
task_json_path = entry / "environment" / "task.json"
|
||||||
|
description = instruction.read_text(encoding="utf-8").strip()
|
||||||
|
|
||||||
|
truth = ""
|
||||||
|
if task_json_path.exists():
|
||||||
|
try:
|
||||||
|
task_json = json.loads(task_json_path.read_text(encoding="utf-8"))
|
||||||
|
# task.json may have a richer description; prefer instruction.md
|
||||||
|
truth = task_json.get("truth", "")
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
tasks.append({
|
||||||
|
"task_name": entry.name,
|
||||||
|
"task_dir": str(entry),
|
||||||
|
"dockerfile": str(dockerfile),
|
||||||
|
"description": description,
|
||||||
|
"truth": truth,
|
||||||
|
"test_final": str(test_final),
|
||||||
|
})
|
||||||
|
|
||||||
|
if not tasks:
|
||||||
|
raise RuntimeError(f"No valid task_* directories found in {tasks_dir}")
|
||||||
|
|
||||||
|
# Split into train and eval (hold out ~5% for eval, min 10, max 50)
|
||||||
|
random.shuffle(tasks)
|
||||||
|
eval_count = max(10, min(50, len(tasks) // 20))
|
||||||
|
eval_count = min(eval_count, len(tasks) // 2) # Never more than half
|
||||||
|
|
||||||
|
self._eval_tasks = tasks[:eval_count]
|
||||||
|
self._local_tasks = tasks[eval_count:]
|
||||||
|
self._local_task_indices = list(range(len(self._local_tasks)))
|
||||||
|
random.shuffle(self._local_task_indices)
|
||||||
|
self._local_current_index = 0
|
||||||
|
|
||||||
|
print(
|
||||||
|
f"[EndlessTerminalsEnv] Found {len(tasks)} valid tasks "
|
||||||
|
f"({len(self._local_tasks)} train, {len(self._eval_tasks)} eval)",
|
||||||
|
flush=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Optionally pre-build all Docker images
|
||||||
|
if self.config.prebuild_images:
|
||||||
|
await self._prebuild_images()
|
||||||
|
|
||||||
|
async def _prebuild_images(self) -> None:
|
||||||
|
"""Pre-build Docker images for all tasks."""
|
||||||
|
print(f"[EndlessTerminalsEnv] Pre-building Docker images...", flush=True)
|
||||||
|
sem = asyncio.Semaphore(self.config.max_concurrent_builds)
|
||||||
|
built = 0
|
||||||
|
skipped = 0
|
||||||
|
failed = 0
|
||||||
|
|
||||||
|
async def _build_one(task: Dict[str, Any]) -> None:
|
||||||
|
nonlocal built, skipped, failed
|
||||||
|
image_tag = self._image_tag_for_task(task["task_name"])
|
||||||
|
|
||||||
|
if docker_image_exists(image_tag):
|
||||||
|
self._image_cache[task["task_name"]] = image_tag
|
||||||
|
skipped += 1
|
||||||
|
return
|
||||||
|
|
||||||
|
async with sem:
|
||||||
|
ok = await build_docker_image(
|
||||||
|
task["dockerfile"], image_tag,
|
||||||
|
timeout_s=self.config.container_build_timeout_s,
|
||||||
|
)
|
||||||
|
if ok:
|
||||||
|
self._image_cache[task["task_name"]] = image_tag
|
||||||
|
built += 1
|
||||||
|
else:
|
||||||
|
failed += 1
|
||||||
|
|
||||||
|
await asyncio.gather(*[_build_one(t) for t in self._local_tasks])
|
||||||
|
print(
|
||||||
|
f"[EndlessTerminalsEnv] Pre-build: {built} built, {skipped} cached, {failed} failed",
|
||||||
|
flush=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
async def _setup_hf_dataset(self) -> None:
|
||||||
|
"""Load HuggingFace dataset."""
|
||||||
|
print(f"[EndlessTerminalsEnv] Loading dataset: {self.config.dataset_name}", flush=True)
|
||||||
|
try:
|
||||||
|
from datasets import load_dataset
|
||||||
|
|
||||||
|
loop = asyncio.get_event_loop()
|
||||||
|
self._dataset = await loop.run_in_executor(
|
||||||
|
None,
|
||||||
|
lambda: load_dataset(
|
||||||
|
self.config.dataset_name,
|
||||||
|
split=self.config.dataset_split,
|
||||||
|
cache_dir=os.path.expanduser(self.config.dataset_cache_dir),
|
||||||
|
),
|
||||||
|
)
|
||||||
|
self._dataset_indices = list(range(len(self._dataset)))
|
||||||
|
random.shuffle(self._dataset_indices)
|
||||||
|
self._dataset_current_index = 0
|
||||||
|
print(f"[EndlessTerminalsEnv] Loaded {len(self._dataset)} tasks from dataset", flush=True)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[EndlessTerminalsEnv] ERROR loading dataset: {e}", flush=True)
|
||||||
|
raise
|
||||||
|
|
||||||
|
# ---- Image helpers ----
|
||||||
|
|
||||||
|
def _image_tag_for_task(self, task_name: str) -> str:
|
||||||
|
return f"{self.config.docker_image_prefix}:{task_name}"
|
||||||
|
|
||||||
|
async def _ensure_image(self, task: Dict[str, Any]) -> str:
|
||||||
|
"""Ensure the Docker image for a task is built. Returns image tag."""
|
||||||
|
task_name = task["task_name"]
|
||||||
|
image_tag = self._image_tag_for_task(task_name)
|
||||||
|
|
||||||
|
# Fast path: already cached
|
||||||
|
if task_name in self._image_cache:
|
||||||
|
return self._image_cache[task_name]
|
||||||
|
|
||||||
|
async with self._build_lock:
|
||||||
|
# Double-check after acquiring lock
|
||||||
|
if task_name in self._image_cache:
|
||||||
|
return self._image_cache[task_name]
|
||||||
|
|
||||||
|
# Check if image exists in Docker
|
||||||
|
if docker_image_exists(image_tag):
|
||||||
|
self._image_cache[task_name] = image_tag
|
||||||
|
return image_tag
|
||||||
|
|
||||||
|
# Build it
|
||||||
|
print(f"[EndlessTerminalsEnv] Building image {image_tag}...", flush=True)
|
||||||
|
ok = await build_docker_image(
|
||||||
|
task["dockerfile"], image_tag,
|
||||||
|
timeout_s=self.config.container_build_timeout_s,
|
||||||
|
)
|
||||||
|
if not ok:
|
||||||
|
raise RuntimeError(f"Failed to build Docker image for {task_name}")
|
||||||
|
|
||||||
|
self._image_cache[task_name] = image_tag
|
||||||
|
return image_tag
|
||||||
|
|
||||||
|
# ---- Item generation ----
|
||||||
|
|
||||||
|
async def get_next_item(self) -> Item:
|
||||||
|
self._iteration += 1
|
||||||
|
|
||||||
|
if self.config.use_local_dir and self._local_tasks:
|
||||||
|
return self._get_next_local_item()
|
||||||
|
elif self.config.use_dataset and self._dataset is not None:
|
||||||
|
return self._get_next_dataset_item()
|
||||||
|
else:
|
||||||
|
return self._get_fallback_item()
|
||||||
|
|
||||||
|
def _get_next_local_item(self) -> Item:
|
||||||
|
"""Pick the next task from local directories."""
|
||||||
|
idx = self._local_task_indices[self._local_current_index]
|
||||||
|
task = self._local_tasks[idx]
|
||||||
|
|
||||||
|
self._local_current_index += 1
|
||||||
|
if self._local_current_index >= len(self._local_task_indices):
|
||||||
|
random.shuffle(self._local_task_indices)
|
||||||
|
self._local_current_index = 0
|
||||||
|
print("[EndlessTerminalsEnv] Reshuffled local tasks (epoch complete)", flush=True)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"task_id": f"local_{self._iteration:06d}_{task['task_name']}",
|
||||||
|
"task_name": task["task_name"],
|
||||||
|
"description": task["description"],
|
||||||
|
"truth": task.get("truth", ""),
|
||||||
|
"task_dir": task["task_dir"],
|
||||||
|
"dockerfile": task["dockerfile"],
|
||||||
|
"test_final": task["test_final"],
|
||||||
|
"from_local_dir": True,
|
||||||
|
}
|
||||||
|
|
||||||
|
def _get_next_dataset_item(self) -> Item:
|
||||||
|
"""Pick the next task from HuggingFace dataset."""
|
||||||
|
idx = self._dataset_indices[self._dataset_current_index]
|
||||||
|
task = self._dataset[idx]
|
||||||
|
|
||||||
|
self._dataset_current_index += 1
|
||||||
|
if self._dataset_current_index >= len(self._dataset_indices):
|
||||||
|
random.shuffle(self._dataset_indices)
|
||||||
|
self._dataset_current_index = 0
|
||||||
|
print("[EndlessTerminalsEnv] Reshuffled dataset (epoch complete)", flush=True)
|
||||||
|
|
||||||
|
# Resolve task directory
|
||||||
|
task_dir = task.get("extra_info", {}).get("task_dir") or task.get("reward_spec", {}).get("ground_truth", "")
|
||||||
|
if self.config.tasks_base_dir:
|
||||||
|
task_name = Path(task_dir).name
|
||||||
|
task_dir = str(Path(self.config.tasks_base_dir) / task_name)
|
||||||
|
|
||||||
|
task_dir_path = Path(task_dir)
|
||||||
|
return {
|
||||||
|
"task_id": f"dataset_{self._iteration:06d}_{task_dir_path.name}",
|
||||||
|
"task_name": task_dir_path.name,
|
||||||
|
"description": task.get("description", ""),
|
||||||
|
"task_dir": task_dir,
|
||||||
|
"dockerfile": str(task_dir_path / "environment" / "Dockerfile"),
|
||||||
|
"test_final": str(task_dir_path / "tests" / "test_final_state.py"),
|
||||||
|
"from_dataset": True,
|
||||||
|
}
|
||||||
|
|
||||||
|
def _get_fallback_item(self) -> Item:
|
||||||
|
return {
|
||||||
|
"task_id": f"fallback_{self._iteration:06d}",
|
||||||
|
"task_name": "fallback",
|
||||||
|
"description": (
|
||||||
|
"Create a file named 'hello.txt' in /home/user/ containing "
|
||||||
|
"the text 'Hello, World!' on a single line."
|
||||||
|
),
|
||||||
|
"task_dir": "",
|
||||||
|
"dockerfile": "",
|
||||||
|
"test_final": "",
|
||||||
|
}
|
||||||
|
|
||||||
|
# ---- AgentEnv hooks ----
|
||||||
|
|
||||||
|
def build_task(self, item: Item) -> str:
|
||||||
|
"""Return the task prompt for the agent."""
|
||||||
|
return str(item.get("description", ""))
|
||||||
|
|
||||||
|
def build_agent_config(self, item: Item) -> AgentConfig:
|
||||||
|
return AgentConfig(
|
||||||
|
max_steps=self.config.agent_max_steps,
|
||||||
|
temperature=self.config.agent_temperature,
|
||||||
|
max_tokens=self.config.agent_max_tokens,
|
||||||
|
tool_delay_s=self.config.agent_tool_delay_s,
|
||||||
|
)
|
||||||
|
|
||||||
|
async def setup_trajectory_workspace(
|
||||||
|
self,
|
||||||
|
item: Item,
|
||||||
|
*,
|
||||||
|
trajectory_id: str,
|
||||||
|
exec_tool,
|
||||||
|
) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Build the Docker image for this task and register it with the backend.
|
||||||
|
|
||||||
|
The DockerDirectBackend will start a container from this image when the
|
||||||
|
agent makes its first tool call (lazy acquisition via ToolExecutor).
|
||||||
|
"""
|
||||||
|
task_name = item.get("task_name", "unknown")
|
||||||
|
dockerfile = item.get("dockerfile", "")
|
||||||
|
|
||||||
|
if not dockerfile or not Path(dockerfile).exists():
|
||||||
|
print(f"[EndlessTerminalsEnv] WARNING: No Dockerfile for {task_name}", flush=True)
|
||||||
|
return {"image": "ubuntu:22.04"}
|
||||||
|
|
||||||
|
# Build/get Docker image
|
||||||
|
image_tag = await self._ensure_image({
|
||||||
|
"task_name": task_name,
|
||||||
|
"dockerfile": dockerfile,
|
||||||
|
})
|
||||||
|
|
||||||
|
# Register image with the DockerDirect backend
|
||||||
|
if isinstance(self._backend, DockerDirectBackend):
|
||||||
|
self._backend.register_image(trajectory_id, image_tag)
|
||||||
|
|
||||||
|
return {"image": image_tag, "task_name": task_name}
|
||||||
|
|
||||||
|
async def score_trajectory(self, item: Item, final_response: str) -> float:
|
||||||
|
"""Not used — scoring happens in verify_and_score_trajectory."""
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
async def verify_and_score_trajectory(
|
||||||
|
self,
|
||||||
|
item: Item,
|
||||||
|
final_response: str,
|
||||||
|
*,
|
||||||
|
trajectory_id: str,
|
||||||
|
exec_tool,
|
||||||
|
agent_result=None,
|
||||||
|
workspace_meta=None,
|
||||||
|
) -> tuple[float, Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
Run test_final_state.py inside the container and return binary reward.
|
||||||
|
"""
|
||||||
|
task_id = item.get("task_id", "unknown")
|
||||||
|
test_final = item.get("test_final", "")
|
||||||
|
|
||||||
|
if not test_final or not Path(test_final).exists():
|
||||||
|
print(f"[EndlessTerminalsEnv] No test file for {task_id}", flush=True)
|
||||||
|
return 0.0, {"error": "No test file"}
|
||||||
|
|
||||||
|
print(f"[EndlessTerminalsEnv] Scoring {task_id}...", flush=True)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Read the test file and base64-encode it for safe transfer
|
||||||
|
test_content = Path(test_final).read_text(encoding="utf-8")
|
||||||
|
encoded = base64.b64encode(test_content.encode("utf-8")).decode("ascii")
|
||||||
|
|
||||||
|
# Write test file into the container and run pytest
|
||||||
|
# We write to /tmp to avoid interfering with the agent's workspace
|
||||||
|
# Use printf + heredoc to avoid quoting issues with single quotes in base64
|
||||||
|
verify_cmd = (
|
||||||
|
f"printf '%s' '{encoded}' | base64 -d > /tmp/_test_final_state.py && "
|
||||||
|
f"cd /home/user && "
|
||||||
|
f"python3 -m pytest /tmp/_test_final_state.py -v --tb=short 2>&1; "
|
||||||
|
f"echo \"EXIT_CODE=$?\""
|
||||||
|
)
|
||||||
|
|
||||||
|
result = await exec_tool(ToolCall(
|
||||||
|
name="terminal",
|
||||||
|
arguments={"command": verify_cmd},
|
||||||
|
))
|
||||||
|
|
||||||
|
output = result.output if hasattr(result, "output") else str(result)
|
||||||
|
|
||||||
|
# Check if pytest passed
|
||||||
|
# Look for EXIT_CODE=0 at the end (most reliable)
|
||||||
|
success = "EXIT_CODE=0" in output
|
||||||
|
|
||||||
|
score = 1.0 if success else 0.0
|
||||||
|
|
||||||
|
metadata = {
|
||||||
|
"task_id": task_id,
|
||||||
|
"success": success,
|
||||||
|
"test_output": output[-2000:] if len(output) > 2000 else output,
|
||||||
|
"total_tool_calls": agent_result.total_tool_calls if agent_result else 0,
|
||||||
|
}
|
||||||
|
|
||||||
|
self._train_scores_buffer.append(score)
|
||||||
|
print(f"[EndlessTerminalsEnv] {task_id} → score={score}", flush=True)
|
||||||
|
return score, metadata
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[EndlessTerminalsEnv] Error scoring {task_id}: {e}", flush=True)
|
||||||
|
return 0.0, {"error": str(e)}
|
||||||
|
|
||||||
|
# ---- WandB logging ----
|
||||||
|
|
||||||
|
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
|
||||||
|
"""Log training metrics to wandb."""
|
||||||
|
if wandb_metrics is None:
|
||||||
|
wandb_metrics = {}
|
||||||
|
|
||||||
|
# Training pass rate since last log
|
||||||
|
if self._train_scores_buffer:
|
||||||
|
wandb_metrics["train/percent_correct"] = (
|
||||||
|
sum(self._train_scores_buffer) / len(self._train_scores_buffer)
|
||||||
|
)
|
||||||
|
wandb_metrics["train/num_trajectories"] = len(self._train_scores_buffer)
|
||||||
|
self._train_scores_buffer = []
|
||||||
|
|
||||||
|
# Eval metrics (populated by evaluate())
|
||||||
|
for key, value in self._eval_metrics:
|
||||||
|
wandb_metrics[key] = value
|
||||||
|
self._eval_metrics = []
|
||||||
|
|
||||||
|
await super().wandb_log(wandb_metrics)
|
||||||
|
|
||||||
|
# ---- Evaluation ----
|
||||||
|
|
||||||
|
async def evaluate(self, *args, **kwargs):
|
||||||
|
"""
|
||||||
|
Run the agent on held-out eval tasks and report pass rate.
|
||||||
|
|
||||||
|
Each eval task: build Docker container → run agent (temp=0) → pytest → score.
|
||||||
|
This is expensive (full agent trajectories), so we only eval a subset.
|
||||||
|
"""
|
||||||
|
import time as _time
|
||||||
|
|
||||||
|
if not self._eval_tasks:
|
||||||
|
return {}
|
||||||
|
|
||||||
|
start_time = _time.time()
|
||||||
|
eval_sample_size = min(len(self._eval_tasks), 20)
|
||||||
|
eval_subset = random.sample(self._eval_tasks, eval_sample_size)
|
||||||
|
|
||||||
|
print(
|
||||||
|
f"[EndlessTerminalsEnv] Running evaluation on {eval_sample_size} tasks...",
|
||||||
|
flush=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
scores = []
|
||||||
|
samples = []
|
||||||
|
|
||||||
|
for task_info in eval_subset:
|
||||||
|
task_name = task_info["task_name"]
|
||||||
|
description = task_info["description"]
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Build Docker image
|
||||||
|
image_tag = await self._ensure_image(task_info)
|
||||||
|
|
||||||
|
# Run agent with temp=0 for deterministic eval
|
||||||
|
eval_tid = f"eval_{uuid.uuid4().hex[:8]}"
|
||||||
|
|
||||||
|
# Register image with backend
|
||||||
|
if isinstance(self._backend, DockerDirectBackend):
|
||||||
|
self._backend.register_image(eval_tid, image_tag)
|
||||||
|
|
||||||
|
async def _exec(call, _tid=eval_tid):
|
||||||
|
return await self._tool_executor.execute(_tid, call)
|
||||||
|
|
||||||
|
from ..agent import AtroposAgent as _AtroposAgent
|
||||||
|
|
||||||
|
agent = _AtroposAgent(
|
||||||
|
server=self.server,
|
||||||
|
tokenizer=self.tokenizer,
|
||||||
|
tools=self.tools,
|
||||||
|
config=AgentConfig(
|
||||||
|
max_steps=self.config.agent_max_steps,
|
||||||
|
temperature=0.0, # Deterministic for eval
|
||||||
|
max_tokens=self.config.agent_max_tokens,
|
||||||
|
),
|
||||||
|
execute_tool=_exec,
|
||||||
|
)
|
||||||
|
|
||||||
|
result = await agent.run(description)
|
||||||
|
|
||||||
|
# Score: run pytest in the container
|
||||||
|
score = 0.0
|
||||||
|
test_final = task_info.get("test_final", "")
|
||||||
|
if result.success and test_final and Path(test_final).exists():
|
||||||
|
test_content = Path(test_final).read_text(encoding="utf-8")
|
||||||
|
encoded = base64.b64encode(test_content.encode("utf-8")).decode("ascii")
|
||||||
|
verify_cmd = (
|
||||||
|
f"printf '%s' '{encoded}' | base64 -d > /tmp/_test_final_state.py && "
|
||||||
|
f"cd /home/user && "
|
||||||
|
f"python3 -m pytest /tmp/_test_final_state.py -v --tb=short 2>&1; "
|
||||||
|
f'echo "EXIT_CODE=$?"'
|
||||||
|
)
|
||||||
|
test_result = await _exec(ToolCall(
|
||||||
|
name="terminal",
|
||||||
|
arguments={"command": verify_cmd},
|
||||||
|
))
|
||||||
|
test_output = test_result.output if hasattr(test_result, "output") else ""
|
||||||
|
if "EXIT_CODE=0" in test_output:
|
||||||
|
score = 1.0
|
||||||
|
|
||||||
|
scores.append(score)
|
||||||
|
samples.append({
|
||||||
|
"task": task_name,
|
||||||
|
"score": score,
|
||||||
|
"tool_calls": result.total_tool_calls,
|
||||||
|
"success": result.success,
|
||||||
|
})
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
await self._tool_executor.release_trajectory(eval_tid, reset_workspace=True)
|
||||||
|
|
||||||
|
print(f" [eval] {task_name} → {score}", flush=True)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f" [eval] {task_name} → ERROR: {e}", flush=True)
|
||||||
|
scores.append(0.0)
|
||||||
|
samples.append({"task": task_name, "score": 0.0, "error": str(e)})
|
||||||
|
|
||||||
|
end_time = _time.time()
|
||||||
|
|
||||||
|
percent_correct = sum(scores) / len(scores) if scores else 0.0
|
||||||
|
|
||||||
|
print(
|
||||||
|
f"[EndlessTerminalsEnv] Eval: {percent_correct:.1%} pass rate "
|
||||||
|
f"({sum(scores):.0f}/{len(scores)}) in {end_time - start_time:.0f}s",
|
||||||
|
flush=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Store for wandb_log to pick up
|
||||||
|
self._eval_metrics.append(("eval/percent_correct", percent_correct))
|
||||||
|
self._eval_metrics.append(("eval/num_tasks", len(scores)))
|
||||||
|
self._eval_metrics.append(("eval/duration_s", end_time - start_time))
|
||||||
|
|
||||||
|
# Log via atroposlib
|
||||||
|
eval_metrics = {
|
||||||
|
"eval/percent_correct": percent_correct,
|
||||||
|
"eval/num_tasks": len(scores),
|
||||||
|
}
|
||||||
|
await self.evaluate_log(
|
||||||
|
metrics=eval_metrics,
|
||||||
|
samples=samples,
|
||||||
|
start_time=start_time,
|
||||||
|
end_time=end_time,
|
||||||
|
generation_parameters={
|
||||||
|
"temperature": 0.0,
|
||||||
|
"max_tokens": self.config.agent_max_tokens,
|
||||||
|
},
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
EndlessTerminalsEnv.cli()
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
atropos/envs/hermes_compat_test_env.py (new file, 171 lines)
@@ -0,0 +1,171 @@
|
|||||||
|
"""
|
||||||
|
Hermes-Agent + Atropos (Nomad sandbox) compatibility smoke environment.
|
||||||
|
|
||||||
|
This environment is intended to validate, end-to-end:
|
||||||
|
BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server
|
||||||
|
|
||||||
|
It forces the model to use a sandbox tool by asking it to run a command that
|
||||||
|
generates a high-entropy token inside the sandbox, then repeat it exactly.
|
||||||
|
|
||||||
|
Run (process mode):
|
||||||
|
uv run python -m atropos.envs.hermes_compat_test_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import os
|
||||||
|
from typing import Any, Dict, List, Tuple
|
||||||
|
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
from pydantic import Field
|
||||||
|
|
||||||
|
from atroposlib.envs.base import APIServerConfig, Item
|
||||||
|
|
||||||
|
from ..agent import AgentConfig, AgentResult
|
||||||
|
from ..tools import ToolCall
|
||||||
|
from .agent_env import AgentEnv, AgentEnvConfig
|
||||||
|
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
|
||||||
|
def _forced_tool_item() -> Item:
|
||||||
|
# Use double quotes in the shell command and show JSON escaping explicitly.
|
||||||
|
# This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
|
||||||
|
cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
|
||||||
|
return {
|
||||||
|
"command": cmd,
|
||||||
|
"prompt": (
|
||||||
|
"You are acting as an agent inside a sandboxed environment.\n"
|
||||||
|
"You MUST use the terminal tool to execute commands.\n"
|
||||||
|
"Run this exact command:\n"
|
||||||
|
f"{cmd}\n"
|
||||||
|
"When you call the tool, use valid JSON inside <tool_call>. Example:\n"
|
||||||
|
'<tool_call>{"name": "terminal", "arguments": {"command": '
|
||||||
|
'"python -c \\\\"import secrets; print(secrets.token_hex(16))\\\\""}}'
|
||||||
|
"</tool_call>\n"
|
||||||
|
"Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
|
||||||
|
"Do not guess. Do not explain."
|
||||||
|
),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class HermesCompatTestEnvConfig(AgentEnvConfig):
|
||||||
|
server_base_url: str = Field(
|
||||||
|
default="http://127.0.0.1:8080",
|
||||||
|
description="Base URL for an OpenAI-compatible chat server (without /v1).",
|
||||||
|
)
|
||||||
|
server_model: str = Field(default="hermes-4-36b", description="Model name")
|
||||||
|
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
|
||||||
|
|
||||||
|
|
||||||
|
class HermesCompatTestEnv(AgentEnv[HermesCompatTestEnvConfig]):
|
||||||
|
name = "hermes_compat_test_env"
|
||||||
|
env_config_cls = HermesCompatTestEnvConfig
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
config: HermesCompatTestEnvConfig,
|
||||||
|
server_configs: List[APIServerConfig],
|
||||||
|
slurm: bool = False,
|
||||||
|
testing: bool = False,
|
||||||
|
):
|
||||||
|
super().__init__(config, server_configs, slurm, testing)
|
||||||
|
self._iter = 0
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def config_init(cls) -> Tuple[HermesCompatTestEnvConfig, List[APIServerConfig]]:
|
||||||
|
base_url = (
|
||||||
|
os.getenv("ATROPOS_SERVER_BASE_URL")
|
||||||
|
or os.getenv("OPENAI_BASE_URL")
|
||||||
|
or os.getenv("LLM_BASE_URL")
|
||||||
|
or "http://127.0.0.1:8080"
|
||||||
|
)
|
||||||
|
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
|
||||||
|
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
|
||||||
|
|
||||||
|
env_config = HermesCompatTestEnvConfig(
|
||||||
|
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
|
||||||
|
group_size=1,
|
||||||
|
use_wandb=False,
|
||||||
|
include_messages=True,
|
||||||
|
ensure_scores_are_not_same=False,
|
||||||
|
total_steps=2,
|
||||||
|
batch_size=1,
|
||||||
|
server_base_url=base_url,
|
||||||
|
server_model=model,
|
||||||
|
# Tooling: sandbox-only terminal.
|
||||||
|
enabled_toolsets=["terminal"],
|
||||||
|
disabled_toolsets=[],
|
||||||
|
# Default to Nomad sandboxing; users can override via --env.* args.
|
||||||
|
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
|
||||||
|
# In local dev it's common for a previous crash to leave the job in backoff.
|
||||||
|
purge_job_on_start=True,
|
||||||
|
purge_job_on_shutdown=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
server_configs = [
|
||||||
|
APIServerConfig(
|
||||||
|
model_name=model,
|
||||||
|
base_url=f"{base_url.rstrip('/')}/v1",
|
||||||
|
api_key=api_key,
|
||||||
|
num_max_requests_at_once=1,
|
||||||
|
num_requests_for_eval=1,
|
||||||
|
timeout=120,
|
||||||
|
)
|
||||||
|
]
|
||||||
|
        return env_config, server_configs

    async def setup_agent_env(self) -> None:
        return None

    async def get_next_item(self) -> Item:
        self._iter += 1
        return _forced_tool_item()

    def build_task(self, item: Item) -> str:
        return str(item.get("prompt") or "")

    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
        # Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
        return AgentConfig(
            max_steps=min(8, int(self.config.agent_max_steps)),
            temperature=0.2,
            max_tokens=None,
        )

    async def score_trajectory(self, item: Item, final_response: str) -> float:
        # Scoring happens in verify_and_score_trajectory so we can inspect tool results.
        _ = (item, final_response)
        return 0.0

    async def verify_and_score_trajectory(
        self,
        item: Item,
        final_response: str,
        *,
        trajectory_id: str,  # noqa: ARG002
        exec_tool,  # noqa: ARG002
        agent_result: AgentResult | None = None,
        workspace_meta: Dict[str, Any] | None = None,  # noqa: ARG002
    ) -> tuple[float, Dict[str, Any]]:
        if agent_result is None:
            return 0.0, {"error": "Missing agent_result"}

        observed: str = ""
        tool_ok = False
        for step in agent_result.steps:
            for res in step.tool_results:
                if not res.success:
                    return 0.0, {"error": res.error, "output": res.output}
                out = (res.output or "").strip()
                if out:
                    observed = out.splitlines()[-1].strip()
                    tool_ok = True

        final = (final_response or "").strip()
        score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
        return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}


if __name__ == "__main__":
    HermesCompatTestEnv.cli()
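The verifier above keeps only the last non-empty line of each successful tool output and requires the final response to equal it exactly. A minimal standalone sketch of that strict-match rule (function name and inputs are illustrative, not part of the env's API):

```python
def score_exact_echo(tool_outputs: list[str], final_response: str) -> float:
    """Return 1.0 only when the final response exactly equals the last
    observed line of tool output; mirrors the env's strict-match scoring."""
    observed = ""
    for out in tool_outputs:
        out = (out or "").strip()
        if out:
            # Keep only the last line, e.g. the printed hex token.
            observed = out.splitlines()[-1].strip()
    final = (final_response or "").strip()
    return 1.0 if observed and final == observed else 0.0


print(score_exact_echo(["$ running...\nabc123"], "abc123"))        # 1.0
print(score_exact_echo(["abc123"], "The token is abc123"))         # 0.0
```

Any extra wording around the token scores zero, which is the point: the reward only fires when the model actually ran the tool and echoed its output verbatim.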
172  atropos/envs/sandbox_terminal_smoke_env.py  Normal file
@@ -0,0 +1,172 @@
"""
Nomad sandbox terminal smoke environment (training-oriented).

Validates, end-to-end:
BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server

It forces the model to use a sandbox tool by asking it to run a command that
generates a high-entropy token inside the sandbox, then repeat it exactly.

Run (process mode):
uv run python -m atropos.envs.sandbox_terminal_smoke_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
"""

from __future__ import annotations

import os
from typing import Any, Dict, List, Tuple

from dotenv import load_dotenv
from pydantic import Field

from atroposlib.envs.base import APIServerConfig, Item

from ..agent import AgentConfig, AgentResult
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig

load_dotenv()

STRICT_TOOLCALL_SYSTEM_PROMPT = None


def _forced_tool_item() -> Item:
    # Use double quotes in the shell command and show JSON escaping explicitly.
    # This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
    cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
    return {
        "command": cmd,
        "prompt": (
            "You MUST use the terminal tool.\n"
            "Run this exact command:\n"
            f"{cmd}\n"
            "When you call the tool, use valid JSON inside <tool_call>. Example:\n"
            '<tool_call>{"name": "terminal", "arguments": {"command": '
            '"python -c \\\\"import secrets; print(secrets.token_hex(16))\\\\""}}'
            "</tool_call>\n"
            "Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
            "Do not guess. Do not explain."
        ),
    }


class SandboxTerminalSmokeEnvConfig(AgentEnvConfig):
    server_base_url: str = Field(
        default="http://127.0.0.1:8080",
        description="Base URL for an OpenAI-compatible chat server (without /v1).",
    )
    server_model: str = Field(default="hermes-4-36b", description="Model name")
    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")


class SandboxTerminalSmokeEnv(AgentEnv[SandboxTerminalSmokeEnvConfig]):
    name = "sandbox_terminal_smoke_env"
    env_config_cls = SandboxTerminalSmokeEnvConfig

    def __init__(
        self,
        config: SandboxTerminalSmokeEnvConfig,
        server_configs: List[APIServerConfig],
        slurm: bool = False,
        testing: bool = False,
    ):
        super().__init__(config, server_configs, slurm, testing)
        self._iter = 0

    @classmethod
    def config_init(cls) -> Tuple[SandboxTerminalSmokeEnvConfig, List[APIServerConfig]]:
        base_url = (
            os.getenv("ATROPOS_SERVER_BASE_URL")
            or os.getenv("OPENAI_BASE_URL")
            or os.getenv("LLM_BASE_URL")
            or "http://127.0.0.1:8080"
        )
        model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
        api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"

        env_config = SandboxTerminalSmokeEnvConfig(
            tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
            group_size=1,
            use_wandb=False,
            include_messages=True,
            ensure_scores_are_not_same=False,
            total_steps=2,
            batch_size=1,
            server_base_url=base_url,
            server_model=model,
            # Tooling: sandbox-only terminal.
            enabled_toolsets=["terminal"],
            disabled_toolsets=[],
            # Default to Nomad sandboxing; users can override via --env.* args.
            sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
            purge_job_on_start=True,
            purge_job_on_shutdown=True,
        )

        server_configs = [
            APIServerConfig(
                model_name=model,
                base_url=f"{base_url.rstrip('/')}/v1",
                api_key=api_key,
                num_max_requests_at_once=1,
                num_requests_for_eval=1,
                timeout=120,
            )
        ]
        return env_config, server_configs

    async def setup_agent_env(self) -> None:
        return None

    async def get_next_item(self) -> Item:
        self._iter += 1
        return _forced_tool_item()

    def build_task(self, item: Item) -> str:
        return str(item.get("prompt") or "")

    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
        # Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
        return AgentConfig(
            max_steps=min(8, int(self.config.agent_max_steps)),
            temperature=0.2,
            max_tokens=None,
            system_prompt=STRICT_TOOLCALL_SYSTEM_PROMPT,
        )

    async def score_trajectory(self, item: Item, final_response: str) -> float:
        # Scoring happens in verify_and_score_trajectory so we can inspect tool results.
        _ = (item, final_response)
        return 0.0

    async def verify_and_score_trajectory(
        self,
        item: Item,
        final_response: str,
        *,
        trajectory_id: str,  # noqa: ARG002
        exec_tool,  # noqa: ARG002
        agent_result: AgentResult | None = None,
        workspace_meta: Dict[str, Any] | None = None,  # noqa: ARG002
    ) -> tuple[float, Dict[str, Any]]:
        if agent_result is None:
            return 0.0, {"error": "Missing agent_result"}

        observed: str = ""
        tool_ok = False
        for step in agent_result.steps:
            for res in step.tool_results:
                if not res.success:
                    return 0.0, {"error": res.error, "output": res.output}
                out = (res.output or "").strip()
                if out:
                    observed = out.splitlines()[-1].strip()
                    tool_ok = True

        final = (final_response or "").strip()
        score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
        return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}


if __name__ == "__main__":
    SandboxTerminalSmokeEnv.cli()
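The forced-tool prompt above spells out the JSON escaping because a shell command that itself contains double quotes must escape them as `\"` inside the `<tool_call>` JSON. A quick check that the exact payload shape shown to the model parses as valid JSON (the payload here is an illustrative copy, not produced by the env):

```python
import json

# The JSON text the model is expected to emit inside <tool_call>...</tool_call>.
# Inner double quotes of the shell command are escaped as \" per JSON string rules.
payload = (
    '{"name": "terminal", "arguments": {"command": '
    '"python -c \\"import secrets; print(secrets.token_hex(16))\\""}}'
)

call = json.loads(payload)
print(call["name"])                  # terminal
print(call["arguments"]["command"])  # python -c "import secrets; print(secrets.token_hex(16))"
```

An escape like `\'` would make `json.loads` raise instead, which is exactly the failure mode the prompt's comment warns about.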
418  atropos/envs/swe_smith_oracle_env.py  Normal file
@@ -0,0 +1,418 @@
"""
SWE-smith-oracle environment.

This environment is intentionally minimal:
- prepares a sandbox workspace by cloning a public GitHub repo at `base_commit`
- runs an AtroposAgent tool loop to apply a fix
- verifies by running pytest nodeids from the dataset (reward = pass/fail)
- Python only (no multi-language support currently; the sandbox images still need to be properly built and published)
- TODO: get the other non-Python sandboxes up and running, add a config knob to switch between them per row, and push the images to Docker Hub

Dataset: NousResearch/SWE-smith-oracle (train; does NOT use SWE-bench eval set).
"""

from __future__ import annotations

import os
import random
import time
from typing import Any, Dict, List, Optional, Tuple

from pydantic import Field

from atroposlib.envs.base import APIServerConfig, Item

from ..agent import AgentConfig
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig


class SweSmithOracleEnvConfig(AgentEnvConfig):
    dataset_name: str = Field(default="NousResearch/SWE-smith-oracle")
    dataset_split: str = Field(default="train")
    max_items: int = Field(default=0, description="0 = no limit")
    shuffle: bool = Field(default=True)
    seed: int = Field(default=0)

    python_only: bool = Field(default=True, description="Filter to Python-evaluable rows")
    score_include_fail_to_pass: bool = Field(
        default=True,
        description=(
            "If true (default), score tests on PASS_TO_PASS ∪ FAIL_TO_PASS. "
            "Disable to only run PASS_TO_PASS (faster but weaker signal)."
        ),
    )

    prompt_mode: str = Field(
        default="problem_statement",
        description="Task prompt content: 'problem_statement' (fast) or 'problem_statement+text' (slower, includes dataset 'text').",
    )

    repo_base_url: str = Field(default="https://github.com", description="Base URL for repo cloning")
    install_timeout_s: float = Field(default=600.0)
    test_timeout_s: float = Field(default=600.0)

    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")


class SweSmithOracleEnv(AgentEnv[SweSmithOracleEnvConfig]):
    """
    SWE-smith-oracle AgentEnv.

    This is designed for benchmarking multiplexed slot execution vs naive container-per-trajectory.
    """

    name = "swe_smith_oracle_env"
    env_config_cls = SweSmithOracleEnvConfig

    def __init__(
        self,
        config: SweSmithOracleEnvConfig,
        server_configs: List[APIServerConfig],
        slurm: bool = False,
        testing: bool = False,
    ):
        super().__init__(config, server_configs, slurm, testing)
        self._dataset = None
        self._indices: List[int] = []
        self._cursor = 0

    @classmethod
    def config_init(cls) -> Tuple[SweSmithOracleEnvConfig, List[APIServerConfig]]:
        # Defaults for running the env via CLI in offline `process` mode.
        # Override via env vars or `--env.*` flags as needed.
        base_url_raw = (
            os.getenv("ATROPOS_SERVER_BASE_URL")
            or os.getenv("OPENAI_BASE_URL")
            or os.getenv("LLM_BASE_URL")
            or "http://127.0.0.1:8080"
        )
        base_url = base_url_raw.rstrip("/")
        if not base_url.endswith("/v1"):
            base_url = f"{base_url}/v1"
        model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
        api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"

        env_config = SweSmithOracleEnvConfig(
            tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
            group_size=1,
            use_wandb=False,
            rollout_server_url="http://localhost:8000",
            total_steps=1,
            batch_size=1,
            steps_per_eval=1,
            max_token_length=8192,
            inference_weight=1.0,
            wandb_name="swe_smith_oracle",
            enabled_toolsets=["terminal"],
            disabled_toolsets=[],
            sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
            purge_job_on_start=True,
            purge_job_on_shutdown=True,
        )

        server_configs = [
            APIServerConfig(
                model_name=model,
                base_url=base_url,
                api_key=api_key,
                num_max_requests_at_once=1,
                num_requests_for_eval=1,
                timeout=int(os.getenv("ATROPOS_SERVER_TIMEOUT_S") or "300"),
            ),
        ]

        return env_config, server_configs

    async def setup_agent_env(self) -> None:
        from datasets import load_dataset

        t0 = time.perf_counter()
        print(
            f"[SweSmithOracleEnv] loading dataset {self.config.dataset_name}:{self.config.dataset_split} "
            f"(python_only={self.config.python_only}, max_items={self.config.max_items or 'all'})",
            flush=True,
        )
        ds = load_dataset(self.config.dataset_name, split=self.config.dataset_split)
        self._dataset = ds

        indices: List[int] = []
        for idx in range(len(ds)):
            row = ds[idx]
            if self.config.python_only and not self._is_python_row(row):
                continue
            indices.append(idx)

        if self.config.shuffle:
            rnd = random.Random(self.config.seed)
            rnd.shuffle(indices)

        if self.config.max_items and self.config.max_items > 0:
            indices = indices[: self.config.max_items]

        self._indices = indices
        self._cursor = 0

        print(
            f"[SweSmithOracleEnv] loaded {len(self._indices)} items from {self.config.dataset_name}:{self.config.dataset_split} "
            f"in {time.perf_counter() - t0:.2f}s",
            flush=True,
        )

    def _is_python_row(self, row: Dict[str, Any]) -> bool:
        nodeids = row.get("PASS_TO_PASS")
        if not isinstance(nodeids, list) or not nodeids:
            return False
        for nid in nodeids:
            if not isinstance(nid, str) or ".py::" not in nid:
                return False
        return True

    async def get_next_item(self) -> Item:
        print(f"[SweSmithOracleEnv] get_next_item() cursor={self._cursor}/{len(self._indices)}", flush=True)
        if not self._dataset or not self._indices:
            raise RuntimeError("Dataset not initialized (did setup() run?)")
        if self._cursor >= len(self._indices):
            self._cursor = 0
        idx = self._indices[self._cursor]
        self._cursor += 1
        return dict(self._dataset[idx])

    def _repo_name(self, item: Item) -> str:
        repo = item.get("repo") or ""
        if isinstance(repo, str) and "/" in repo:
            return repo.split("/")[-1]
        return "repo"

    def build_task(self, item: Item) -> str:
        repo = item.get("repo") or ""
        base_commit = item.get("base_commit") or ""
        problem = str(item.get("problem_statement") or "")
        context = str(item.get("text") or "")

        nodeids = self._tests_for_item(item)
        tests_list = "\n".join(f"- {t}" for t in nodeids)

        repo_dir = self._repo_name(item)

        tests_block = (
            "Run these tests to verify:\n"
            f"{tests_list}\n\n"
            "When done, briefly describe what you changed and confirm tests pass."
        )

        prompt_mode = (self.config.prompt_mode or "problem_statement").strip().lower()
        if prompt_mode not in {"problem_statement", "problem_statement+text"}:
            raise ValueError(
                f"Invalid prompt_mode={self.config.prompt_mode!r}. "
                "Expected 'problem_statement' or 'problem_statement+text'."
            )

        context_block = ""
        if prompt_mode == "problem_statement+text" and context:
            # Note: We intentionally do NOT truncate/cap here. This mode is for debugging / richer prompts and can be slow.
            context_block = f"\nAdditional context:\n{context}\n"

        return (
            "You are a senior software engineer. Fix the repository so the specified tests pass.\n\n"
            f"Repository: {repo} (checked out at base_commit={base_commit})\n"
            f"Workspace path: ./{repo_dir}\n\n"
            "Constraints:\n"
            "- You MUST use the terminal tool to inspect, edit, and verify the repository. Do not respond with a patch file.\n"
            f"- Start by inspecting the repo (e.g. `ls`, `cd ./{repo_dir}`, `git status`).\n"
            "- Use a workspace-local virtualenv (e.g. inside the repo at ./.venv) to avoid cross-run contamination.\n"
            "- Use non-interactive commands only.\n\n"
            "- Terminal commands run under POSIX /bin/sh and each tool call runs in a fresh shell (no persisted env vars).\n"
            "  Avoid bash-only `source`; prefer `. .venv/bin/activate` or `.venv/bin/python ...`.\n\n"
            "Problem statement:\n"
            f"{problem}\n\n"
            f"{context_block}\n"
            f"{tests_block}"
        )

    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
        # SWE tasks are longer than the simple test env.
        return AgentConfig(
            max_steps=self.config.agent_max_steps,
            temperature=self.config.agent_temperature,
            max_tokens=self.config.agent_max_tokens,
            tool_delay_s=self.config.agent_tool_delay_s,
        )

    async def setup_trajectory_workspace(self, item: Item, *, trajectory_id: str, exec_tool) -> Dict[str, Any]:
        t0 = time.perf_counter()
        repo = item.get("repo")
        base_commit = item.get("base_commit")
        instance_id = item.get("instance_id") or item.get("id") or item.get("problem_id")
        if not isinstance(repo, str) or not isinstance(base_commit, str):
            raise RuntimeError("Invalid dataset row: missing repo/base_commit")

        repo_dir = self._repo_name(item)
        clone_url = f"{self.config.repo_base_url.rstrip('/')}/{repo}.git"
        print(
            f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): "
            f"repo={repo} base_commit={base_commit} instance_id={instance_id} dir=./{repo_dir}",
            flush=True,
        )

        # Repo setup strategy:
        # - Maintain a shared, per-container bare repo cache under /data/repo_cache
        # - For each trajectory, create an isolated git worktree under the slot workspace
        # This avoids cloning/fetching full repos per trajectory and is crucial for multiplexing.

        def _repo_cache_slug(repo_name: str) -> str:
            return repo_name.replace("/", "__")

        repo_slug = _repo_cache_slug(repo)
        cache_root = "/data/repo_cache"
        bare_repo = f"{cache_root}/{repo_slug}.git"
        lock_file = f"{cache_root}/.locks/{repo_slug}.lock"

        # Use flock to serialize operations that mutate the shared bare repo (fetch/worktree).
        # util-linux (flock) is included in the sandbox image.
        worktree_cmd = (
            "set -e; "
            f"rm -rf {repo_dir}; "
            f"mkdir -p {cache_root}/.locks; "
            f": > {lock_file}; "
            f"flock -x {lock_file} sh -lc '"
            f"set -e; "
            "export GIT_TERMINAL_PROMPT=0; "
            "export GIT_LFS_SKIP_SMUDGE=1; "
            f"if [ ! -d \"{bare_repo}\" ]; then "
            f"  git init --bare \"{bare_repo}\"; "
            f"  git -C \"{bare_repo}\" remote add origin \"{clone_url}\"; "
            "fi; "
            f"git -C \"{bare_repo}\" remote set-url origin \"{clone_url}\"; "
            f"git -C \"{bare_repo}\" worktree prune || true; "
            f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
            f"  git -C \"{bare_repo}\" fetch --depth 1 origin \"{base_commit}\" || true; "
            "fi; "
            f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
            f"  git -C \"{bare_repo}\" fetch --prune origin; "
            "fi; "
            f"git --git-dir=\"{bare_repo}\" worktree add --detach \"{repo_dir}\" \"{base_commit}\"; "
            "'"
        )

        print(f"[SweSmithOracleEnv] tid={trajectory_id} preparing worktree from repo cache", flush=True)
        res = await exec_tool(
            ToolCall(
                name="terminal",
                arguments={"command": worktree_cmd, "timeout": self.config.install_timeout_s},
            )
        )
        if not res.success:
            raise RuntimeError(
                "git worktree setup failed "
                f"(repo={repo}, base_commit={base_commit}, instance_id={instance_id}): {res.error}\n{res.output}"
            )

        print(
            f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): worktree ready in {time.perf_counter() - t0:.2f}s",
            flush=True,
        )
        return {"repo_dir": repo_dir, "base_commit": base_commit}

    def _tests_for_item(self, item: Item) -> List[str]:
        tests: List[str] = []
        if self.config.score_include_fail_to_pass:
            for key in ("PASS_TO_PASS", "FAIL_TO_PASS"):
                nodeids = item.get(key)
                if isinstance(nodeids, list):
                    tests.extend([n for n in nodeids if isinstance(n, str)])
        else:
            nodeids = item.get("PASS_TO_PASS")
            if isinstance(nodeids, list):
                tests.extend([n for n in nodeids if isinstance(n, str)])
        # Stable order for reproducibility.
        return sorted(dict.fromkeys(tests))

    def _chunk_nodeids(self, nodeids: List[str], max_per_chunk: int = 50) -> List[List[str]]:
        chunks: List[List[str]] = []
        for i in range(0, len(nodeids), max_per_chunk):
            chunks.append(nodeids[i : i + max_per_chunk])
        return chunks

    async def verify_and_score_trajectory(
        self,
        item: Item,
        final_response: str,  # noqa: ARG002
        *,
        trajectory_id: str,
        exec_tool,
        agent_result=None,
        workspace_meta: Optional[Dict[str, Any]] = None,
    ) -> tuple[float, Dict[str, Any]]:
        _ = trajectory_id
        repo_dir = self._repo_name(item)

        # Training correctness: do not reward trajectories that never actually used tools.
        if agent_result is not None and getattr(agent_result, "total_tool_calls", 0) <= 0:
            print(
                f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): no tool calls; score=0.0",
                flush=True,
            )
            return 0.0, {
                "verification_mode": "dataset_tests",
                "error": "No tool calls were made by the agent",
            }

        nodeids = self._tests_for_item(item)
        if not nodeids:
            return 0.0, {"error": "No tests provided"}

        print(f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): ensuring venv + deps", flush=True)
        setup_cmd = (
            f"cd {repo_dir} && "
            "python -m venv .venv && "
            ". .venv/bin/activate && "
            "python -m pip install -U pip setuptools wheel && "
            "python -m pip install -e . && "
            "python -m pip install pytest"
        )
        setup_res = await exec_tool(
            ToolCall(name="terminal", arguments={"command": setup_cmd, "timeout": self.config.install_timeout_s})
        )
        verification_messages = [{"role": "user", "content": setup_res.to_xml()}]
        if not setup_res.success:
            return 0.0, {
                "verification_mode": "dataset_tests",
                "phase": "install",
                "error": setup_res.error,
                "output": setup_res.output,
                "verification_messages": verification_messages,
            }

        chunks = self._chunk_nodeids(nodeids, max_per_chunk=50)
        for chunk_idx, chunk in enumerate(chunks):
            joined = " ".join(chunk)
            cmd = f"cd {repo_dir} && . .venv/bin/activate && python -m pytest -q {joined}"
            res = await exec_tool(
                ToolCall(
                    name="terminal",
                    arguments={"command": cmd, "timeout": self.config.test_timeout_s},
                )
            )
            verification_messages.append({"role": "user", "content": res.to_xml()})
            if not res.success:
                return 0.0, {
                    "verification_mode": "dataset_tests",
                    "phase": "pytest",
                    "failed_chunk": chunk_idx,
                    "error": res.error,
                    "output": res.output,
                    "verification_messages": verification_messages,
                }

        return 1.0, {"verification_mode": "dataset_tests", "passed": True, "verification_messages": verification_messages}

    async def score_trajectory(self, item: Item, final_response: str) -> float:
        # Not used; scoring happens in verify_and_score_trajectory.
        _ = (item, final_response)
        return 0.0


if __name__ == "__main__":
    SweSmithOracleEnv.cli()
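Verification above batches pytest nodeids into chunks of at most 50 per invocation, bounding both command-line length and per-call timeout exposure. A standalone sketch of that chunking and the command each chunk produces (the repo path and nodeids here are illustrative):

```python
def chunk_nodeids(nodeids: list[str], max_per_chunk: int = 50) -> list[list[str]]:
    # Split the nodeid list into fixed-size batches, preserving order.
    return [nodeids[i : i + max_per_chunk] for i in range(0, len(nodeids), max_per_chunk)]


nodeids = [f"tests/test_mod.py::test_{i}" for i in range(120)]
chunks = chunk_nodeids(nodeids)
print([len(c) for c in chunks])  # [50, 50, 20]

# Each chunk becomes one pytest invocation inside the trajectory's worktree,
# mirroring the env's `cd <repo> && . .venv/bin/activate && python -m pytest -q ...` shape.
cmd = "cd repo && . .venv/bin/activate && python -m pytest -q " + " ".join(chunks[0])
```

Any failing chunk short-circuits to a 0.0 reward, so later chunks are only run when everything before them passed.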
217  atropos/envs/test_env.py  Normal file
@@ -0,0 +1,217 @@
"""
Simple test environment for validating the atropos-agent setup.

This environment uses a local OpenAI-compatible server for LLM testing to verify:
- BaseEnv extension works correctly
- API communication via OpenAI-compatible endpoint
- Basic trajectory collection

This is a minimal environment for testing, not production use.
"""

import os
from typing import Dict, List, Optional, Tuple

from dotenv import load_dotenv
from pydantic import Field

from atroposlib.envs.base import (
    APIServerConfig,
    Item,
)

from ..agent import AgentConfig
from .agent_env import AgentEnv, AgentEnvConfig

# Load environment variables from .env file
load_dotenv()


# Simple test prompts for validation
TEST_PROMPTS = [
    {
        "prompt": "What is 2 + 2? Answer with just the number.",
        "expected": "4",
    },
    {
        "prompt": "What is the capital of France? Answer with just the city name.",
        "expected": "Paris",
    },
    {
        "prompt": "What color is the sky on a clear day? Answer with just the color.",
        "expected": "Blue",
    },
    {
        "prompt": "How many days are in a week? Answer with just the number.",
        "expected": "7",
    },
    {
        "prompt": "What is 10 * 5? Answer with just the number.",
        "expected": "50",
    },
]

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer questions concisely and directly. "
    "When asked for a simple answer, provide just that answer without explanation."
)


class SimpleTestEnvConfig(AgentEnvConfig):
    """Configuration for the simple test environment."""

    server_base_url: str = Field(
        default="http://127.0.0.1:8080",
        description="Base URL for an OpenAI-compatible server (without /v1)",
    )
    server_model: str = Field(
        default="hermes-4-36b",
        description="Model name",
    )
    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")


class SimpleTestEnv(AgentEnv[SimpleTestEnvConfig]):
    """
    A simple test environment to validate the atropos-agent setup.

    Uses a local OpenAI-compatible LLM endpoint with basic question-answering tasks.
    Scoring is based on whether the response contains the expected answer.
    """

    name = "simple_test_env"
    env_config_cls = SimpleTestEnvConfig

    def __init__(
        self,
        config: SimpleTestEnvConfig,
        server_configs: List[APIServerConfig],
        slurm: bool = False,
        testing: bool = False,
    ):
        super().__init__(config, server_configs, slurm, testing)
        self.iter = 0
        self.test_prompts = TEST_PROMPTS
        self.percent_correct_buffer: List[float] = []

    @classmethod
    def config_init(cls) -> Tuple[SimpleTestEnvConfig, List[APIServerConfig]]:
        """
        Initialize configuration with local server settings from environment variables.
        """
        base_url = (
            os.getenv("ATROPOS_SERVER_BASE_URL")
            or os.getenv("OPENAI_BASE_URL")
            or os.getenv("LLM_BASE_URL")
            or "http://127.0.0.1:8080"
        )
        model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
        api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"

        env_config = SimpleTestEnvConfig(
            tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
            group_size=4,
            use_wandb=False,  # Disable wandb for simple testing
            rollout_server_url="http://localhost:8000",
            total_steps=10,
            batch_size=16,
            steps_per_eval=5,
            max_token_length=2048,
            inference_weight=1.0,
            wandb_name="simple_test",
            server_base_url=base_url,
            server_model=model,
        )

        # OpenAI-compatible servers typically expose chat completions at /v1.
        server_configs = [
            APIServerConfig(
                model_name=model,
                base_url=f"{base_url}/v1",
                api_key=api_key,
                num_max_requests_at_once=4,
                num_requests_for_eval=8,
                timeout=120,  # Local models may be slower
            ),
        ]

        return env_config, server_configs

    async def setup_agent_env(self):
        """Setup the environment - load test data."""
        print(f"SimpleTestEnv setup complete. {len(self.test_prompts)} test prompts loaded.")
        print(f"Using server at: {self.config.server_base_url}")
        print(f"Model: {self.config.server_model}")

    async def get_next_item(self) -> Item:
        """Get the next test prompt."""
        item = self.test_prompts[self.iter % len(self.test_prompts)]
        self.iter += 1
        return item

    def build_task(self, item: Item) -> str:
        return item["prompt"]

    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
        return AgentConfig(
            max_steps=5,
            temperature=0.7,
            max_tokens=256,
            system_prompt=SYSTEM_PROMPT,
        )

    async def score_trajectory(self, item: Item, final_response: str) -> float:
        expected = item["expected"].lower()
        response_lower = (final_response or "").lower()
        score = 1.0 if expected in response_lower else 0.0
        self.percent_correct_buffer.append(score)
        return score

    async def evaluate(self, *args, **kwargs):
        """
        Simple evaluation - run through all test prompts once.
        """
        correct = 0
        total = len(self.test_prompts)

        for item in self.test_prompts:
            messages = [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": item["prompt"]},
            ]

            response = await self.server.chat_completion(
                messages=messages,
                n=1,
                max_tokens=256,
                temperature=0.0,  # Greedy for eval
                split="eval",
            )

            response_text = response.choices[0].message.content or ""
            expected = item["expected"].lower()

            if expected in response_text.lower():
                correct += 1

        accuracy = correct / total
        print(f"Evaluation: {correct}/{total} = {accuracy:.2%} accuracy")
        return {"eval_accuracy": accuracy}

    async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
        """Log metrics (simplified for testing)."""
        if wandb_metrics is None:
            wandb_metrics = {}

        if self.percent_correct_buffer:
            avg_correct = sum(self.percent_correct_buffer) / len(self.percent_correct_buffer)
            wandb_metrics["train/percent_correct"] = avg_correct
            print(f"Train accuracy: {avg_correct:.2%}")
            self.percent_correct_buffer = []
|
||||||
|
|
||||||
|
await super().wandb_log(wandb_metrics)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
# Allow running as CLI
|
||||||
|
SimpleTestEnv.cli()
|
||||||
atropos/envs/toolserver_smoke_env.py (new file, 165 lines)
@@ -0,0 +1,165 @@
"""
ToolServer routing smoke environment.

Validates that:
- sandbox tools run through Nomad SlotPool (terminal -> bash in sandbox)
- external tools run through ToolServer (skills_list)

This env uses ToolServer in-process by default (`tool_server_url="inprocess"`),
so it is self-contained for local testing.

Run:
    uv run python -m atropos.envs.toolserver_smoke_env process --env.use_wandb false --env.total_steps 1 --env.group_size 1
"""

from __future__ import annotations

import os
from typing import Any, Dict, List, Tuple

from dotenv import load_dotenv
from pydantic import Field

from atroposlib.envs.base import APIServerConfig, Item

from ..agent import AgentConfig, AgentResult
from .agent_env import AgentEnv, AgentEnvConfig

load_dotenv()


class ToolServerSmokeEnvConfig(AgentEnvConfig):
    server_base_url: str = Field(
        default="http://127.0.0.1:8080",
        description="Base URL for an OpenAI-compatible chat server (without /v1).",
    )
    server_model: str = Field(default="hermes-4-36b", description="Model name")
    tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")


class ToolServerSmokeEnv(AgentEnv[ToolServerSmokeEnvConfig]):
    name = "toolserver_smoke_env"
    env_config_cls = ToolServerSmokeEnvConfig

    def __init__(
        self,
        config: ToolServerSmokeEnvConfig,
        server_configs: List[APIServerConfig],
        slurm: bool = False,
        testing: bool = False,
    ):
        super().__init__(config, server_configs, slurm, testing)
        self._iter = 0

    @classmethod
    def config_init(cls) -> Tuple[ToolServerSmokeEnvConfig, List[APIServerConfig]]:
        base_url = (
            os.getenv("ATROPOS_SERVER_BASE_URL")
            or os.getenv("OPENAI_BASE_URL")
            or os.getenv("LLM_BASE_URL")
            or "http://127.0.0.1:8080"
        )
        model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
        api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"

        env_config = ToolServerSmokeEnvConfig(
            tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
            group_size=1,
            use_wandb=False,
            include_messages=True,
            ensure_scores_are_not_same=False,
            total_steps=1,
            batch_size=1,
            server_base_url=base_url,
            server_model=model,
            enabled_toolsets=["terminal", "skills"],
            disabled_toolsets=[],
            # Self-contained ToolServer for local smoke.
            tool_server_url="inprocess",
            sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
            purge_job_on_start=True,
            purge_job_on_shutdown=True,
        )

        server_configs = [
            APIServerConfig(
                model_name=model,
                base_url=f"{base_url.rstrip('/')}/v1",
                api_key=api_key,
                num_max_requests_at_once=1,
                num_requests_for_eval=1,
                timeout=120,
            )
        ]
        return env_config, server_configs

    async def setup_agent_env(self) -> None:
        return None

    async def get_next_item(self) -> Item:
        self._iter += 1
        return {
            "prompt": (
                "You MUST call exactly one tool per assistant message.\n"
                "\n"
                "Step 1) Call the skills_list tool (no arguments), then stop.\n"
                "Step 2) After you receive the tool response, call the terminal tool to run:\n"
                "python -c \"print('ok')\"\n"
                "Step 3) After you receive the terminal tool response, answer with just: ok\n"
                "\n"
                "Tool call format requirements:\n"
                "- Every tool call MUST be a complete XML block with a closing tag.\n"
                "- Do NOT emit a second <tool_call> in the same assistant message.\n"
                "\n"
                "Example:\n"
                "<tool_call>{\"name\": \"skills_list\", \"arguments\": {}}</tool_call>\n"
                "Do not include anything else in your final answer."
            )
        }

    def build_task(self, item: Item) -> str:
        return str(item.get("prompt") or "")

    def build_agent_config(self, item: Item) -> AgentConfig:  # noqa: ARG002
        return AgentConfig(
            max_steps=min(10, int(self.config.agent_max_steps)),
            temperature=0.2,
            max_tokens=None,
        )

    async def score_trajectory(self, item: Item, final_response: str) -> float:
        _ = (item, final_response)
        return 0.0

    async def verify_and_score_trajectory(
        self,
        item: Item,
        final_response: str,
        *,
        trajectory_id: str,  # noqa: ARG002
        exec_tool,  # noqa: ARG002
        agent_result: AgentResult | None = None,
        workspace_meta: Dict[str, Any] | None = None,  # noqa: ARG002
    ) -> tuple[float, Dict[str, Any]]:
        if agent_result is None:
            return 0.0, {"error": "Missing agent_result"}

        called = {c.name for s in agent_result.steps for c in s.tool_calls}
        need = {"skills_list", "terminal"}
        if not need.issubset(called):
            return 0.0, {"error": f"Missing tool calls: {sorted(need - called)}", "called": sorted(called)}

        terminal_ok = False
        for step in agent_result.steps:
            for call, res in zip(step.tool_calls, step.tool_results):
                if call.name != "terminal":
                    continue
                # Guard against empty output: splitlines() on "" returns [],
                # so indexing [-1] directly would raise IndexError.
                lines = (res.output or "").strip().splitlines()
                if res.success and lines and lines[-1].strip() == "ok":
                    terminal_ok = True

        score = 1.0 if terminal_ok and (final_response or "").strip() == "ok" else 0.0
        return score, {"called": sorted(called), "final": (final_response or "").strip()}


if __name__ == "__main__":
    ToolServerSmokeEnv.cli()
atropos/nomad/__init__.py (new file, 11 lines)
@@ -0,0 +1,11 @@
"""
Nomad integration for atropos-agent.

Provides:
- NomadClient: Client for Nomad HTTP API
- Job templates for sandbox containers
"""

from .client import NomadClient

__all__ = ["NomadClient"]
atropos/nomad/client.py (new file, 500 lines)
@@ -0,0 +1,500 @@
"""
Nomad API Client for atropos-agent.

Provides a simple async client for interacting with the Nomad HTTP API:
- Submit/stop jobs
- Query allocations
- Get allocation addresses
- Scale jobs up/down
"""

import asyncio
import json
import os
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional

import aiohttp


class AllocationStatus(Enum):
    """Nomad allocation status."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"
    LOST = "lost"


@dataclass
class Allocation:
    """Information about a Nomad allocation."""
    id: str
    job_id: str
    task_group: str
    node_id: str
    status: AllocationStatus
    # Network info for reaching the allocation
    address: Optional[str] = None
    port: Optional[int] = None

    @property
    def http_address(self) -> Optional[str]:
        """Get full HTTP address for the allocation."""
        if self.address and self.port:
            return f"http://{self.address}:{self.port}"
        return None


@dataclass
class JobStatus:
    """Status of a Nomad job."""
    id: str
    name: str
    status: str
    allocations: List[Allocation] = field(default_factory=list)
    count: int = 0  # Number of task groups


class NomadClient:
    """
    Async client for Nomad HTTP API.

    Usage:
        client = NomadClient(address="http://localhost:4646")

        # Submit a job
        await client.submit_job(job_spec)

        # Get allocations
        allocs = await client.get_job_allocations("sandbox-python")

        # Scale job
        await client.scale_job("sandbox-python", count=5)
    """

    def __init__(
        self,
        address: str = "http://localhost:4646",
        token: Optional[str] = None,
        timeout: float = 30.0,
    ):
        self.address = address.rstrip("/")
        self.token = token or os.environ.get("NOMAD_TOKEN")
        self.timeout = aiohttp.ClientTimeout(total=timeout)
        self._session: Optional[aiohttp.ClientSession] = None

    async def _get_session(self) -> aiohttp.ClientSession:
        """Get or create HTTP session."""
        if self._session is None or self._session.closed:
            headers = {}
            if self.token:
                headers["X-Nomad-Token"] = self.token
            self._session = aiohttp.ClientSession(
                timeout=self.timeout,
                headers=headers,
            )
        return self._session

    async def close(self):
        """Close the HTTP session."""
        if self._session and not self._session.closed:
            await self._session.close()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.close()

    async def _request(
        self,
        method: str,
        path: str,
        data: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        """Make an HTTP request to Nomad API."""
        session = await self._get_session()
        url = f"{self.address}{path}"

        try:
            async with session.request(method, url, json=data) as response:
                if response.status == 404:
                    return {"error": "not_found", "status": 404}

                text = await response.text()
                if not text:
                    return {"status": response.status}

                try:
                    result = json.loads(text)
                except json.JSONDecodeError:
                    return {"text": text, "status": response.status}

                if response.status >= 400:
                    return {"error": result, "status": response.status}

                return result if isinstance(result, dict) else {"data": result, "status": response.status}

        except aiohttp.ClientError as e:
            return {"error": str(e), "status": 0}

    # Job Operations

    async def submit_job(self, job_spec: Dict[str, Any]) -> Dict[str, Any]:
        """
        Submit a job to Nomad.

        Args:
            job_spec: Job specification dict (HCL converted to JSON)

        Returns:
            Response with EvalID if successful
        """
        return await self._request("POST", "/v1/jobs", {"Job": job_spec})

    async def stop_job(self, job_id: str, purge: bool = False) -> Dict[str, Any]:
        """
        Stop (and optionally purge) a job.

        Args:
            job_id: Job identifier
            purge: If True, completely remove the job
        """
        path = f"/v1/job/{job_id}"
        if purge:
            path += "?purge=true"
        return await self._request("DELETE", path)

    async def get_job(self, job_id: str) -> Optional[Dict[str, Any]]:
        """Get job details."""
        result = await self._request("GET", f"/v1/job/{job_id}")
        if "error" in result and result.get("status") == 404:
            return None
        return result

    async def get_job_status(self, job_id: str) -> Optional[JobStatus]:
        """Get job status with allocations."""
        job = await self.get_job(job_id)
        if not job:
            return None

        allocs = await self.get_job_allocations(job_id)

        # Get count from task groups
        count = 0
        task_groups = job.get("TaskGroups", [])
        for tg in task_groups:
            count += tg.get("Count", 1)

        return JobStatus(
            id=job_id,
            name=job.get("Name", job_id),
            status=job.get("Status", "unknown"),
            allocations=allocs,
            count=count,
        )

    # Allocation Operations

    async def get_job_allocations(self, job_id: str) -> List[Allocation]:
        """Get all allocations for a job."""
        result = await self._request("GET", f"/v1/job/{job_id}/allocations")

        if "error" in result:
            return []

        allocs_data = result.get("data", result) if isinstance(result, dict) else result
        if not isinstance(allocs_data, list):
            return []

        allocations = []
        for alloc_data in allocs_data:
            # Parse allocation info
            alloc_id = alloc_data.get("ID", "")
            status_str = alloc_data.get("ClientStatus", "unknown")

            try:
                status = AllocationStatus(status_str)
            except ValueError:
                status = AllocationStatus.PENDING

            # Get network info - need to fetch detailed allocation for this
            address = None
            port = None

            # First try the summary data
            resources = alloc_data.get("AllocatedResources") or {}
            shared = resources.get("Shared") or {}
            networks = shared.get("Networks") or []

            # If no networks in summary, fetch detailed allocation
            if not networks and alloc_id:
                detailed = await self.get_allocation(alloc_id)
                if detailed:
                    resources = detailed.get("AllocatedResources") or {}
                    shared = resources.get("Shared") or {}
                    networks = shared.get("Networks") or []

            if networks:
                network = networks[0]
                address = network.get("IP")
                # Look for dynamic ports OR reserved ports (Singularity/raw_exec uses reserved)
                dyn_ports = network.get("DynamicPorts") or []
                reserved_ports = network.get("ReservedPorts") or []
                for dp in dyn_ports + reserved_ports:
                    if dp.get("Label") == "http":
                        port = dp.get("Value")
                        break

            allocations.append(Allocation(
                id=alloc_id,
                job_id=job_id,
                task_group=alloc_data.get("TaskGroup", ""),
                node_id=alloc_data.get("NodeID", ""),
                status=status,
                address=address,
                port=port,
            ))

        return allocations

    async def get_allocation(self, alloc_id: str) -> Optional[Dict[str, Any]]:
        """Get detailed allocation info."""
        result = await self._request("GET", f"/v1/allocation/{alloc_id}")
        if "error" in result and result.get("status") == 404:
            return None
        return result

    # Scaling Operations

    async def scale_job(self, job_id: str, count: int, task_group: str = "sandbox") -> Dict[str, Any]:
        """
        Scale a job's task group to specified count.

        Args:
            job_id: Job identifier
            count: Desired number of allocations
            task_group: Name of task group to scale
        """
        payload = {
            "Count": count,
            "Target": {
                "Group": task_group,
            },
        }
        return await self._request("POST", f"/v1/job/{job_id}/scale", payload)

    async def get_job_scale_status(self, job_id: str) -> Dict[str, int]:
        """
        Get current scale status for a job.

        Returns:
            Dict mapping task group name to count
        """
        result = await self._request("GET", f"/v1/job/{job_id}/scale")

        if "error" in result:
            return {}

        task_groups = result.get("TaskGroups", {})
        return {
            name: info.get("Running", 0)
            for name, info in task_groups.items()
        }

    # Health Check

    async def is_healthy(self) -> bool:
        """Check if Nomad is reachable and healthy."""
        try:
            result = await self._request("GET", "/v1/status/leader")
            return "error" not in result
        except Exception:
            return False

    async def get_leader(self) -> Optional[str]:
        """Get current Nomad leader address."""
        result = await self._request("GET", "/v1/status/leader")
        if isinstance(result, dict) and "data" in result:
            return result["data"]
        return None


def load_job_template(
    template_name: str = "sandbox",
    **kwargs,
) -> Dict[str, Any]:
    """
    Load and configure a job template.

    Args:
        template_name: Name of template (e.g., "sandbox")
        **kwargs: Template variables to substitute

    Returns:
        Job specification dict ready for Nomad API
    """
    # Default job template for sandbox container
    if template_name == "sandbox":
        return create_sandbox_job(**kwargs)
    else:
        raise ValueError(f"Unknown template: {template_name}")


def create_sandbox_job(
    job_id: str = "atropos-sandbox",
    image: str = "atropos-sandbox:local",  # Use :local tag to avoid registry pull
    count: int = 1,
    slots_per_container: int = 10,
    privileged: bool = False,
    cpu: int = 500,
    memory: int = 512,
    port: int = 8080,
    datacenter: str = "dc1",
    driver: str = "docker",  # "docker" or "singularity"
    singularity_image: Optional[str] = None,  # Path to .sif file for singularity driver
) -> Dict[str, Any]:
    """
    Create a sandbox job specification.

    This job runs the sandbox_server.py inside a container,
    with the specified number of slots for agent workspaces.

    Args:
        job_id: Unique job identifier
        image: Docker image to use (for docker driver)
        count: Number of container instances
        slots_per_container: Number of slots per container
        privileged: Run container in privileged mode (recommended for bubblewrap)
        cpu: CPU allocation in MHz
        memory: Memory allocation in MB
        port: HTTP port for sandbox server
        datacenter: Nomad datacenter
        driver: Container driver - "docker" or "singularity"
        singularity_image: Path to .sif file (required if driver="singularity")

    Returns:
        Job specification dict
    """
    # Build task config based on driver
    if driver == "singularity":
        if not singularity_image:
            raise ValueError("singularity_image path required when driver='singularity'")

        # Use raw_exec driver to run apptainer via shell for variable expansion.
        # The container binds the allocation directory for workspace persistence.
        # For raw_exec, we use a static port since Nomad's dynamic port mapping doesn't
        # work the same as Docker - the process runs directly on the host.
        shell_cmd = (
            f'apptainer run '
            f'--bind "$NOMAD_ALLOC_DIR/data:/data" '
            f'--pwd /app '
            f'--env PYTHONUNBUFFERED=1 '
            f'{singularity_image} '
            f'python sandbox_server.py '
            f'--port {port} '
            f'--slots {slots_per_container} '
            f'--data-dir /data'
        )
        task_config = {
            "command": "/bin/sh",
            "args": ["-c", shell_cmd],
        }
        task_driver = "raw_exec"
    else:
        # Docker driver (default)
        task_config = {
            "image": image,
            "force_pull": False,  # Use local image, don't try to pull
            "ports": ["http"],
            "privileged": privileged,
            "command": "python",
            "args": [
                "sandbox_server.py",
                "--port", str(port),
                "--slots", str(slots_per_container),
                "--data-dir", "/data",
            ],
            # Note: On Linux, you can mount persistent storage:
            # "volumes": ["${NOMAD_ALLOC_DIR}/data:/data"],
            # On macOS/Docker Desktop, skip volumes for PoC
            # (container /data is ephemeral but works for testing)
        }
        task_driver = "docker"

    # For Singularity/raw_exec, use static ports since the process runs directly on host.
    # For Docker, use dynamic ports with port mapping.
    if driver == "singularity":
        network_config = {
            "Mode": "host",
            "ReservedPorts": [
                {
                    "Label": "http",
                    "Value": port,
                }
            ],
        }
    else:
        network_config = {
            "Mode": "host",
            "DynamicPorts": [
                {
                    "Label": "http",
                    "To": port,
                }
            ],
        }

    return {
        "ID": job_id,
        "Name": job_id,
        "Type": "service",
        "Datacenters": [datacenter],
        "TaskGroups": [
            {
                "Name": "sandbox",
                "Count": count,
                # Speed up deployments and avoid Consul checks. Without this, Nomad may
                # keep an "active deployment" around for the default MinHealthyTime,
                # which blocks immediate scaling under load.
                "Update": {
                    "HealthCheck": "task_states",
                    "MinHealthyTime": 0,
                },
                "Networks": [network_config],
                "Tasks": [
                    {
                        "Name": "sandbox-server",
                        "Driver": task_driver,
                        "Config": task_config,
                        "Env": {
                            "PYTHONUNBUFFERED": "1",
                            "NOMAD_ALLOC_DIR": "${NOMAD_ALLOC_DIR}",
                        },
                        "Resources": {
                            "CPU": cpu,
                            "MemoryMB": memory,
                        },
                        # Note: Services with Checks require Consul, which we skip for the PoC
                    }
                ],
                "RestartPolicy": {
                    "Attempts": 3,
                    "Interval": 300_000_000_000,  # 5 minutes
                    "Delay": 10_000_000_000,  # 10 seconds
                    "Mode": "delay",
                },
                "ReschedulePolicy": {
                    "Attempts": 5,
                    "Interval": 3600_000_000_000,  # 1 hour
                    "Delay": 30_000_000_000,  # 30 seconds
                    "DelayFunction": "exponential",
                    "MaxDelay": 300_000_000_000,  # 5 minutes
                    "Unlimited": False,
                },
            }
        ],
    }
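The large integer literals in RestartPolicy and ReschedulePolicy are Go durations serialized as integer nanoseconds, which is what Nomad's JSON jobs API expects. A tiny helper (hypothetical, not part of client.py) makes those values readable and checkable:

```python
NS_PER_SEC = 1_000_000_000  # Nomad JSON job specs encode durations as int nanoseconds


def ns(seconds: int) -> int:
    """Convert seconds to nanoseconds for Nomad duration fields."""
    return seconds * NS_PER_SEC


print(ns(300))   # 300000000000  -> 5 minutes, as in RestartPolicy["Interval"]
print(ns(3600))  # 3600000000000 -> 1 hour, as in ReschedulePolicy["Interval"]
```

Using such a helper (or named constants) in the job builder would avoid counting zeros by eye when tuning the restart and reschedule windows.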
atropos/sandbox_server.py (new file, 1912 lines)
File diff suppressed because it is too large
atropos/slots/__init__.py (new file, 20 lines)
@@ -0,0 +1,20 @@
"""
Slot-based multiplexing for atropos-agent.

Provides:
- Slot: Isolated workspace for a single trajectory
- SlotPool: Manages slots across Nomad allocations
- SandboxExecutor: Executes tools in sandbox containers
"""

from .executor import SandboxExecutor
from .pool import SlotPool, SlotPoolConfig
from .slot import Slot, SlotState

__all__ = [
    "Slot",
    "SlotState",
    "SlotPool",
    "SlotPoolConfig",
    "SandboxExecutor",
]
atropos/slots/executor.py (new file, 457 lines)
@@ -0,0 +1,457 @@
|
|||||||
|
"""
|
||||||
|
SandboxExecutor - HTTP client for sandbox container communication.
|
||||||
|
|
||||||
|
Sends tool execution requests to sandbox_server.py running inside Nomad containers.
|
||||||
|
Supports single and batch execution for efficiency.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import uuid
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
|
||||||
|
import aiohttp
|
||||||
|
|
||||||
|
from .slot import Slot, SlotState
|
||||||
|
from ..tools.base import ToolCall, ToolResult
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ExecutionRequest:
|
||||||
|
"""Request to execute a tool in a slot."""
|
||||||
|
slot: Slot
|
||||||
|
tool_name: str
|
||||||
|
args: Dict[str, Any]
|
||||||
|
execution_id: str = field(default_factory=lambda: str(uuid.uuid4()))
|
||||||
|
timeout: float = 30.0
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ExecutionResult:
|
||||||
|
"""Result from sandbox execution."""
|
||||||
|
success: bool
|
||||||
|
output: str = ""
|
||||||
|
error: str = ""
|
||||||
|
execution_id: str = ""
|
||||||
|
slot_id: str = ""
|
||||||
|
metadata: Dict[str, Any] = field(default_factory=dict)
|
||||||
|
|
||||||
|
def to_tool_result(self) -> ToolResult:
|
||||||
|
"""Convert to ToolResult for agent consumption."""
|
||||||
|
return ToolResult(
|
||||||
|
success=self.success,
|
||||||
|
output=self.output,
|
||||||
|
error=self.error,
|
||||||
|
metadata=self.metadata,
|
||||||
|
uniq_id=self.execution_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class SandboxExecutor:
|
||||||
|
"""
|
||||||
|
HTTP client for executing tools in sandbox containers.
|
||||||
|
|
||||||
|
Communicates with sandbox_server.py running inside Nomad allocations.
|
||||||
|
Supports both single execution and batched parallel execution.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
executor = SandboxExecutor()
|
||||||
|
|
||||||
|
# Single execution
|
||||||
|
result = await executor.execute(slot, "bash", {"command": "ls"})
|
||||||
|
|
||||||
|
# Batch execution
|
||||||
|
results = await executor.execute_batch([
|
||||||
|
(slot1, "bash", {"command": "ls"}),
|
||||||
|
(slot2, "write_file", {"path": "test.txt", "content": "hello"}),
|
||||||
|
])
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
timeout: float = 30.0,
|
||||||
|
max_retries: int = 3,
|
||||||
|
retry_delay: float = 1.0,
|
||||||
|
):
|
||||||
|
self.timeout = aiohttp.ClientTimeout(total=timeout)
|
||||||
|
self.max_retries = max_retries
|
||||||
|
self.retry_delay = retry_delay
|
||||||
|
self._session: Optional[aiohttp.ClientSession] = None
|
||||||
|
|
||||||
|
async def _get_session(self) -> aiohttp.ClientSession:
|
||||||
|
"""Get or create HTTP session."""
|
||||||
|
if self._session is None or self._session.closed:
|
||||||
|
self._session = aiohttp.ClientSession(timeout=self.timeout)
|
||||||
|
return self._session
|
||||||
|
|
||||||
|
async def close(self):
|
||||||
|
"""Close HTTP session."""
|
||||||
|
if self._session and not self._session.closed:
|
||||||
|
await self._session.close()
|
||||||
|
|
||||||
|
async def __aenter__(self):
|
||||||
|
return self
|
||||||
|
|
||||||
|
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||||
|
await self.close()
|
||||||
|
|
||||||
|
    async def execute(
        self,
        slot: Slot,
        tool_name: str,
        args: Dict[str, Any],
        timeout: Optional[float] = None,
    ) -> ExecutionResult:
        """
        Execute a tool in a slot's workspace.

        Args:
            slot: Slot to execute in
            tool_name: Name of tool (bash, read_file, write_file)
            args: Tool arguments
            timeout: Optional timeout override

        Returns:
            ExecutionResult with output or error
        """
        execution_id = str(uuid.uuid4())
        exec_timeout = timeout or self.timeout.total or 30.0

        # Mark slot as executing
        original_state = slot.state
        try:
            if slot.state == SlotState.ACQUIRED:
                slot.start_execution(execution_id)

            result = await self._send_execute_request(
                container_addr=slot.container_addr,
                slot_id=slot.slot_id,
                tool_name=tool_name,
                args=args,
                execution_id=execution_id,
                timeout=exec_timeout,
            )
            result.slot_id = slot.slot_id
            return result

        finally:
            # Restore slot state
            if slot.state == SlotState.EXECUTING:
                slot.end_execution()
    async def _send_execute_request(
        self,
        container_addr: str,
        slot_id: str,
        tool_name: str,
        args: Dict[str, Any],
        execution_id: str,
        timeout: float,
    ) -> ExecutionResult:
        """Send execution request to sandbox server with retry logic."""
        session = await self._get_session()
        url = f"{container_addr}/execute"

        payload = {
            "slot_id": slot_id,
            "tool": tool_name,
            "args": args,
            "execution_id": execution_id,
            "timeout": timeout,
        }

        last_error = None
        for attempt in range(self.max_retries):
            try:
                async with session.post(url, json=payload) as response:
                    data = await response.json()

                    return ExecutionResult(
                        success=data.get("success", False),
                        output=data.get("output", ""),
                        error=data.get("error", ""),
                        execution_id=data.get("execution_id", execution_id),
                        metadata=data.get("metadata", {}),
                    )

            except aiohttp.ClientError as e:
                last_error = str(e)
                if attempt < self.max_retries - 1:
                    await asyncio.sleep(self.retry_delay * (attempt + 1))
                    continue
            except asyncio.TimeoutError:
                last_error = f"Request timed out after {timeout}s"
                break
            except Exception as e:
                last_error = str(e)
                break

        return ExecutionResult(
            success=False,
            error=f"Failed after {self.max_retries} attempts: {last_error}",
            execution_id=execution_id,
        )
    async def execute_batch(
        self,
        requests: List[Tuple[Slot, str, Dict[str, Any]]],
        timeout: Optional[float] = None,
    ) -> List[ExecutionResult]:
        """
        Execute multiple tools in parallel across slots.

        This is the key optimization - we batch tool calls to maximize
        container utilization while agents are waiting for LLM responses.

        Args:
            requests: List of (slot, tool_name, args) tuples
            timeout: Optional timeout override

        Returns:
            List of ExecutionResults in same order as requests
        """
        if not requests:
            return []

        # Group requests by container address for batch API
        by_container: Dict[str, List[Tuple[int, Slot, str, Dict[str, Any], str]]] = {}

        for idx, (slot, tool_name, args) in enumerate(requests):
            execution_id = str(uuid.uuid4())
            container = slot.container_addr

            if container not in by_container:
                by_container[container] = []
            by_container[container].append((idx, slot, tool_name, args, execution_id))

            # Mark slots as executing
            if slot.state == SlotState.ACQUIRED:
                slot.start_execution(execution_id)

        # Execute batches in parallel
        exec_timeout = timeout or self.timeout.total or 30.0
        batch_tasks = []

        for container_addr, batch_requests in by_container.items():
            task = self._send_batch_request(
                container_addr=container_addr,
                batch_requests=batch_requests,
                timeout=exec_timeout,
            )
            batch_tasks.append(task)

        # Gather all batch results
        batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)

        # Collect results in original order
        results: List[Optional[ExecutionResult]] = [None] * len(requests)

        for batch_result in batch_results:
            if isinstance(batch_result, Exception):
                # Mark all in this batch as failed
                continue

            for idx, result in batch_result:
                results[idx] = result

        # Fill in any missing results
        for idx, result in enumerate(results):
            if result is None:
                slot, tool_name, args = requests[idx]
                results[idx] = ExecutionResult(
                    success=False,
                    error="Batch execution failed",
                    slot_id=slot.slot_id,
                )

        # End execution on all slots
        for slot, _, _ in requests:
            if slot.state == SlotState.EXECUTING:
                slot.end_execution()

        return results  # type: ignore
    async def _send_batch_request(
        self,
        container_addr: str,
        batch_requests: List[Tuple[int, Slot, str, Dict[str, Any], str]],
        timeout: float,
    ) -> List[Tuple[int, ExecutionResult]]:
        """Send batch execution request to a single container."""
        session = await self._get_session()
        url = f"{container_addr}/batch"

        # Build batch payload
        payload = [
            {
                "slot_id": slot.slot_id,
                "tool": tool_name,
                "args": args,
                "execution_id": execution_id,
                "timeout": timeout,
            }
            for _, slot, tool_name, args, execution_id in batch_requests
        ]

        try:
            async with session.post(url, json=payload) as response:
                data = await response.json()

                if not isinstance(data, list):
                    raise ValueError(f"Expected list response, got {type(data)}")

                results = []
                for i, (idx, slot, _, _, execution_id) in enumerate(batch_requests):
                    if i < len(data):
                        item = data[i]
                        result = ExecutionResult(
                            success=item.get("success", False),
                            output=item.get("output", ""),
                            error=item.get("error", ""),
                            execution_id=item.get("execution_id", execution_id),
                            slot_id=slot.slot_id,
                            metadata=item.get("metadata", {}),
                        )
                    else:
                        result = ExecutionResult(
                            success=False,
                            error="Missing result in batch response",
                            execution_id=execution_id,
                            slot_id=slot.slot_id,
                        )
                    results.append((idx, result))

                return results

        except Exception as e:
            # Return error for all requests in batch
            return [
                (idx, ExecutionResult(
                    success=False,
                    error=str(e),
                    execution_id=execution_id,
                    slot_id=slot.slot_id,
                ))
                for idx, slot, _, _, execution_id in batch_requests
            ]
    async def reset_slot(self, slot: Slot) -> ExecutionResult:
        """
        Reset a slot's workspace (delete all files).

        Useful when reusing a slot for a new trajectory.
        """
        session = await self._get_session()
        url = f"{slot.container_addr}/reset"

        try:
            async with session.post(url, json={"slot_id": slot.slot_id}) as response:
                data = await response.json()
                return ExecutionResult(
                    success=data.get("success", False),
                    output=data.get("output", ""),
                    error=data.get("error", ""),
                    slot_id=slot.slot_id,
                )
        except Exception as e:
            return ExecutionResult(
                success=False,
                error=str(e),
                slot_id=slot.slot_id,
            )

    async def health_check(self, container_addr: str) -> bool:
        """Check if a sandbox container is healthy."""
        session = await self._get_session()
        url = f"{container_addr}/health"

        try:
            async with session.get(url) as response:
                data = await response.json()
                return data.get("status") == "ok"
        except Exception:
            return False

    async def get_container_status(
        self,
        container_addr: str
    ) -> Optional[Dict[str, Any]]:
        """Get status info from a sandbox container."""
        session = await self._get_session()
        url = f"{container_addr}/health"

        try:
            async with session.get(url) as response:
                return await response.json()
        except Exception:
            return None

    # -------------------------------------------------------------------------
    # Artifact helpers (optional)
    # -------------------------------------------------------------------------
    async def _post_json(
        self,
        url: str,
        payload: Dict[str, Any],
        timeout: Optional[float] = None,
    ) -> Dict[str, Any]:
        session = await self._get_session()
        try:
            async with session.post(url, json=payload, timeout=timeout) as response:
                data = await response.json()
                if isinstance(data, dict):
                    data.setdefault("http_status", response.status)
                    return data
                return {"success": False, "error": f"Unexpected response type: {type(data)}", "http_status": response.status}
        except Exception as e:
            return {"success": False, "error": str(e)}

    async def read_artifact(
        self,
        slot: Slot,
        path: str,
        *,
        encoding: str = "text",
        max_bytes: Optional[int] = None,
        include_sha256: bool = False,
        timeout: Optional[float] = None,
    ) -> Dict[str, Any]:
        url = f"{slot.container_addr}/artifacts/read"
        payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "encoding": encoding, "include_sha256": include_sha256}
        if max_bytes is not None:
            payload["max_bytes"] = max_bytes
        return await self._post_json(url, payload, timeout=timeout)

    async def list_artifacts(
        self,
        slot: Slot,
        path: str = ".",
        *,
        recursive: bool = False,
        max_entries: Optional[int] = None,
        timeout: Optional[float] = None,
    ) -> Dict[str, Any]:
        url = f"{slot.container_addr}/artifacts/list"
        payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "recursive": recursive}
        if max_entries is not None:
            payload["max_entries"] = max_entries
        return await self._post_json(url, payload, timeout=timeout)

    async def archive_artifacts(
        self,
        slot: Slot,
        path: str = ".",
        *,
        archive_format: str = "tar.gz",
        max_bytes: Optional[int] = None,
        max_entries: Optional[int] = None,
        timeout: Optional[float] = None,
    ) -> Dict[str, Any]:
        url = f"{slot.container_addr}/artifacts/archive"
        payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "format": archive_format}
        if max_bytes is not None:
            payload["max_bytes"] = max_bytes
        if max_entries is not None:
            payload["max_entries"] = max_entries
        return await self._post_json(url, payload, timeout=timeout)
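The retry loop in `_send_execute_request` uses linear backoff (`retry_delay * (attempt + 1)` between attempts). A minimal standalone sketch of that pattern, with illustrative names (`call_with_retries`, `flaky` are not part of this module):

```python
import asyncio

calls = {"n": 0}

async def call_with_retries(fn, max_retries: int = 3, retry_delay: float = 0.01):
    # Linear backoff: sleep retry_delay * (attempt + 1) between attempts,
    # mirroring the executor's retry loop for transient network errors.
    last_error = None
    for attempt in range(max_retries):
        try:
            return await fn()
        except ConnectionError as e:
            last_error = e
            if attempt < max_retries - 1:
                await asyncio.sleep(retry_delay * (attempt + 1))
    raise RuntimeError(f"Failed after {max_retries} attempts: {last_error}")

async def flaky():
    # Fails twice, then succeeds -- stands in for a transient network error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = asyncio.run(call_with_retries(flaky))
```

With the defaults above, the third attempt succeeds, so the caller sees `"ok"` after two short sleeps; a permanent failure would surface as a `RuntimeError` carrying the last error, like the executor's final `ExecutionResult`.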
atropos/slots/pool.py
Normal file, 659 lines
@@ -0,0 +1,659 @@
"""
SlotPool - Manages slots across Nomad allocations.

The SlotPool is the core abstraction for slot-based multiplexing:
- Tracks available/acquired slots across containers
- Handles slot acquisition and release
- Auto-scales Nomad job count based on demand
- Provides batched tool execution
"""

import asyncio
import logging
import os
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

from ..nomad.client import (
    Allocation,
    AllocationStatus,
    NomadClient,
    create_sandbox_job,
)
from .executor import ExecutionResult, SandboxExecutor
from .slot import Slot, SlotState, create_slots_for_allocation

logger = logging.getLogger(__name__)
@dataclass
class SlotPoolConfig:
    """Configuration for SlotPool."""

    # Nomad settings
    nomad_address: str = "http://localhost:4646"
    job_id: str = "atropos-sandbox"
    datacenter: str = "dc1"

    # Container settings
    image: str = "atropos-sandbox:local"  # Use :local tag to avoid registry pull
    slots_per_container: int = 10
    privileged: bool = False
    cpu: int = 500  # MHz
    memory: int = 512  # MB

    # Driver selection: "docker" or "singularity"
    driver: str = "docker"
    # Path to .sif file for singularity driver (required if driver="singularity")
    singularity_image: Optional[str] = None

    # Scaling settings
    min_containers: int = 1
    max_containers: int = 10

    # Timeouts
    acquire_timeout: float = 30.0  # Seconds between acquire polls (also triggers scale-up attempts)
    health_check_interval: float = 30.0  # Seconds between health checks
    scale_cooldown: float = 60.0  # Seconds between scale operations

    # Job lifecycle
    purge_job_on_start: bool = False  # Purge any pre-existing job before starting (local dev/training friendly)

    # Local Docker image convenience (macOS/Nomad dev mode)
    auto_build_local_image: bool = True  # If image ends with :local and is missing, build it from the bundled Dockerfile.
    dockerfile_path: Optional[str] = None  # Override Dockerfile path (default: Hermes-Agent/atropos/Dockerfile).
    docker_build_context: Optional[str] = None  # Override build context (default: Hermes-Agent/atropos).
class SlotPool:
    """
    Manages a pool of slots across Nomad allocations.

    The SlotPool:
    - Deploys sandbox containers to Nomad
    - Tracks slots across all running containers
    - Handles slot acquisition/release
    - Auto-scales based on demand
    - Provides batched execution via SandboxExecutor

    Usage:
        config = SlotPoolConfig(
            nomad_address="http://localhost:4646",
            job_id="my-sandbox",
            slots_per_container=10,
        )

        pool = SlotPool(config)
        await pool.start()

        # Acquire a slot
        slot = await pool.acquire()

        # Execute tool
        result = await pool.execute(slot, "bash", {"command": "ls"})

        # Release slot
        await pool.release(slot)

        # Shutdown
        await pool.stop()
    """

    def __init__(self, config: Optional[SlotPoolConfig] = None):
        self.config = config or SlotPoolConfig()

        # Nomad client
        self.nomad = NomadClient(address=self.config.nomad_address)

        # Sandbox executor for tool execution
        self.executor = SandboxExecutor()

        # Slot tracking
        self._slots: Dict[str, Slot] = {}  # slot_key -> Slot
        self._available_queue: asyncio.Queue[str] = asyncio.Queue()
        self._lock = asyncio.Lock()
        self._scale_lock = asyncio.Lock()

        # State
        self._started = False
        self._health_task: Optional[asyncio.Task] = None
        self._scale_task: Optional[asyncio.Task] = None
        self._last_scale_time = 0.0

    def _default_dockerfile_path(self) -> Path:
        # Hermes-Agent/atropos/Dockerfile lives next to this module in source checkouts.
        return Path(__file__).resolve().parents[1] / "Dockerfile"

    def _default_build_context(self) -> Path:
        return Path(__file__).resolve().parents[1]
    def _docker_image_exists(self, image: str) -> bool:
        try:
            proc = subprocess.run(
                ["docker", "image", "inspect", image],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                check=False,
                env={**os.environ, "DOCKER_CLI_HINTS": "false"},
            )
            return proc.returncode == 0
        except FileNotFoundError:
            return False

    def _try_build_local_image(self, image: str) -> None:
        dockerfile = Path(self.config.dockerfile_path) if self.config.dockerfile_path else self._default_dockerfile_path()
        context = Path(self.config.docker_build_context) if self.config.docker_build_context else self._default_build_context()

        if not dockerfile.exists():
            raise RuntimeError(
                f"Sandbox Dockerfile not found at {dockerfile}. "
                "Build the sandbox image manually or set --env.purge_job_on_start false and provide a non-local image."
            )
        if not context.exists():
            raise RuntimeError(f"Docker build context not found at {context}")

        # Prefer buildx+--load to ensure the image ends up in the local daemon (required by Nomad's docker driver).
        buildx_cmd = [
            "docker",
            "buildx",
            "build",
            "--load",
            "-t",
            image,
            "-f",
            str(dockerfile),
            str(context),
        ]
        proc = subprocess.run(buildx_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
        if proc.returncode == 0:
            return

        # Fallback to classic docker build if buildx isn't available.
        build_cmd = ["docker", "build", "-t", image, "-f", str(dockerfile), str(context)]
        proc2 = subprocess.run(build_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
        if proc2.returncode != 0:
            raise RuntimeError(
                f"Failed to build local sandbox image {image}. "
                f"Tried: {' '.join(buildx_cmd)} and {' '.join(build_cmd)}"
            )

    def _ensure_local_image(self) -> None:
        image = (self.config.image or "").strip()
        if not image.endswith(":local"):
            return
        if not self.config.auto_build_local_image:
            return

        if self._docker_image_exists(image):
            return

        logger.info(f"Local sandbox image {image} not found; building it now...")
        self._try_build_local_image(image)
    def _slot_key(self, alloc_id: str, slot_id: str) -> str:
        """Generate unique key for a slot."""
        return f"{alloc_id}:{slot_id}"

    @property
    def total_slots(self) -> int:
        """Total number of slots in pool."""
        return len(self._slots)

    @property
    def available_slots(self) -> int:
        """Number of available slots."""
        return sum(1 for s in self._slots.values() if s.is_available)

    @property
    def acquired_slots(self) -> int:
        """Number of acquired slots."""
        return sum(1 for s in self._slots.values() if s.is_acquired)
    async def start(self) -> None:
        """
        Start the slot pool.

        - Checks if Nomad is healthy
        - Deploys sandbox job if not running
        - Discovers existing allocations
        - Starts health check background task
        """
        if self._started:
            return

        logger.info(f"Starting SlotPool (job_id={self.config.job_id})")

        try:
            # Make sure local sandbox images exist before Nomad tries to pull them.
            # This is a common footgun in macOS dev mode with :local tags.
            self._ensure_local_image()

            # Check Nomad health
            if not await self.nomad.is_healthy():
                raise RuntimeError(f"Nomad is not reachable at {self.config.nomad_address}")

            if self.config.purge_job_on_start:
                logger.info(f"Purging any existing Nomad job: {self.config.job_id}")
                await self.nomad.stop_job(self.config.job_id, purge=True)

            # Check if job exists (after optional purge)
            job = await self.nomad.get_job(self.config.job_id)

            if job is None:
                # Deploy new job
                logger.info(f"Deploying sandbox job: {self.config.job_id} (driver={self.config.driver})")
                job_spec = create_sandbox_job(
                    job_id=self.config.job_id,
                    image=self.config.image,
                    count=self.config.min_containers,
                    slots_per_container=self.config.slots_per_container,
                    privileged=self.config.privileged,
                    cpu=self.config.cpu,
                    memory=self.config.memory,
                    datacenter=self.config.datacenter,
                    driver=self.config.driver,
                    singularity_image=self.config.singularity_image,
                )
                result = await self.nomad.submit_job(job_spec)
                if "error" in result:
                    raise RuntimeError(f"Failed to submit job: {result}")

            # Wait for allocations to be running (even if the job already existed).
            await self._wait_for_healthy_allocations(self.config.min_containers)

            # Discover existing allocations and slots
            await self._refresh_slots()

            # Start health check task
            self._health_task = asyncio.create_task(self._health_check_loop())

            self._started = True
            logger.info(f"SlotPool started: {self.total_slots} slots available")
        except Exception:
            # Ensure aiohttp sessions are not leaked if we fail to start.
            await self.stop(purge_job=False)
            raise
    async def stop(self, purge_job: bool = False) -> None:
        """
        Stop the slot pool.

        Args:
            purge_job: If True, also stop the Nomad job
        """
        logger.info("Stopping SlotPool")

        # Cancel health check task
        if self._health_task:
            self._health_task.cancel()
            try:
                await self._health_task
            except asyncio.CancelledError:
                pass
            finally:
                self._health_task = None

        if self._scale_task:
            self._scale_task.cancel()
            try:
                await self._scale_task
            except asyncio.CancelledError:
                pass
            finally:
                self._scale_task = None

        # Optionally stop the job (do this even if start() never completed).
        if purge_job:
            logger.info(f"Stopping Nomad job: {self.config.job_id}")
            await self.nomad.stop_job(self.config.job_id, purge=True)

        # Close connections
        await self.executor.close()
        await self.nomad.close()

        self._started = False
        self._slots.clear()

        # Clear the queue
        while not self._available_queue.empty():
            try:
                self._available_queue.get_nowait()
            except asyncio.QueueEmpty:
                break
    async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
        """
        Acquire an available slot.

        If no slots are available, waits up to acquire_timeout seconds.
        If still no slots, attempts to scale up.

        Args:
            trajectory_id: Optional ID of trajectory acquiring the slot

        Returns:
            Acquired Slot

        Raises:
            asyncio.TimeoutError: If no slot becomes available
        """
        if not self._started:
            raise RuntimeError("SlotPool not started")

        while True:
            try:
                # Try to get an available slot
                slot_key = await asyncio.wait_for(
                    self._available_queue.get(),
                    timeout=self.config.acquire_timeout,
                )
            except asyncio.TimeoutError:
                # Try to scale up, but keep waiting even if scaling isn't possible.
                # In practice, slots may become available shortly (e.g. contention),
                # and scaling may be temporarily blocked by Nomad deployments.
                await self._try_scale_up()
                continue

            slot = self._slots.get(slot_key)
            if slot is None:
                # Slot was removed; discard stale queue entry and retry.
                continue

            try:
                slot.acquire(trajectory_id)
            except RuntimeError:
                # Slot isn't actually available (e.g. duplicate queue entry); retry.
                continue

            logger.debug(f"Acquired slot {slot.slot_id} (alloc={slot.alloc_id[:8]})")
            return slot
    async def release(self, slot: Slot, reset_workspace: bool = False) -> None:
        """
        Release a slot back to the pool.

        Args:
            slot: Slot to release
            reset_workspace: If True, clear the workspace files
        """
        slot_key = self._slot_key(slot.alloc_id, slot.slot_id)

        if slot_key not in self._slots:
            logger.warning(f"Releasing unknown slot: {slot_key}")
            return

        # Optionally reset workspace
        if reset_workspace:
            await self.executor.reset_slot(slot)

        slot.release()
        await self._available_queue.put(slot_key)

        logger.debug(f"Released slot {slot.slot_id}")
    async def execute(
        self,
        slot: Slot,
        tool_name: str,
        args: Dict[str, Any],
        timeout: Optional[float] = None,
    ) -> ExecutionResult:
        """
        Execute a tool in a slot's workspace.

        Args:
            slot: Slot to execute in
            tool_name: Name of tool (bash, read_file, write_file)
            args: Tool arguments
            timeout: Optional timeout override

        Returns:
            ExecutionResult
        """
        return await self.executor.execute(slot, tool_name, args, timeout)

    async def execute_batch(
        self,
        requests: List[Tuple[Slot, str, Dict[str, Any]]],
        timeout: Optional[float] = None,
    ) -> List[ExecutionResult]:
        """
        Execute multiple tools in parallel.

        This is the key optimization - batch execution across multiple slots
        maximizes container utilization.

        Args:
            requests: List of (slot, tool_name, args) tuples
            timeout: Optional timeout override

        Returns:
            List of ExecutionResults in same order
        """
        return await self.executor.execute_batch(requests, timeout)
    async def _refresh_slots(self) -> None:
        """Refresh slot inventory from Nomad allocations."""
        async with self._lock:
            allocs = await self.nomad.get_job_allocations(self.config.job_id)

            # Track which slots we've seen
            seen_keys = set()

            for alloc in allocs:
                if alloc.status != AllocationStatus.RUNNING:
                    continue

                if not alloc.http_address:
                    continue

                # Check container health
                healthy = await self.executor.health_check(alloc.http_address)
                if not healthy:
                    continue

                # Create slots for this allocation
                for i in range(self.config.slots_per_container):
                    slot_id = f"slot_{i}"
                    slot_key = self._slot_key(alloc.id, slot_id)
                    seen_keys.add(slot_key)

                    if slot_key not in self._slots:
                        # New slot
                        slot = Slot(
                            slot_id=slot_id,
                            alloc_id=alloc.id,
                            container_addr=alloc.http_address,
                        )
                        self._slots[slot_key] = slot
                        await self._available_queue.put(slot_key)
                        logger.debug(f"Added slot: {slot_key}")

            # Remove slots from dead allocations
            for slot_key in list(self._slots.keys()):
                if slot_key not in seen_keys:
                    slot = self._slots.pop(slot_key)
                    logger.debug(f"Removed slot: {slot_key}")
    async def _wait_for_healthy_allocations(
        self,
        min_count: int,
        timeout: float = 120.0
    ) -> None:
        """Wait for allocations to become healthy."""
        import time
        start = time.time()

        def _summarize_alloc_detail(detail: Dict[str, Any]) -> str:
            task_states = detail.get("TaskStates") or {}
            parts: List[str] = []
            if isinstance(task_states, dict):
                for task_name, st in task_states.items():
                    events = (st or {}).get("Events") or []
                    if isinstance(events, list) and events:
                        # Include a few recent events; the latest can be a generic restart message
                        # while the true root cause is slightly earlier (e.g. image pull failure).
                        recent = events[-3:]
                        msgs: List[str] = []
                        for ev in recent:
                            desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
                            if desc:
                                msgs.append(desc)
                        if msgs:
                            parts.append(f"{task_name}: " + " | ".join(msgs))
            return "; ".join(parts)

        def _alloc_events_lower(detail: Dict[str, Any]) -> str:
            task_states = detail.get("TaskStates") or {}
            texts: List[str] = []
            if isinstance(task_states, dict):
                for _task_name, st in task_states.items():
                    events = (st or {}).get("Events") or []
                    if isinstance(events, list):
                        for ev in events[-10:]:
                            desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
                            if desc:
                                texts.append(desc)
            return " ".join(texts).lower()

        while time.time() - start < timeout:
            allocs = await self.nomad.get_job_allocations(self.config.job_id)

            healthy_count = 0
            for alloc in allocs:
                if alloc.status == AllocationStatus.RUNNING and alloc.http_address:
                    if await self.executor.health_check(alloc.http_address):
                        healthy_count += 1

                # Fast-fail on obvious driver/image errors to avoid waiting out the full timeout.
                if alloc.id:
                    detail = await self.nomad.get_allocation(alloc.id)
                    if isinstance(detail, dict):
                        summary = _summarize_alloc_detail(detail)
                        lowered = _alloc_events_lower(detail) or summary.lower()
                        if "failed to pull" in lowered or "pull access denied" in lowered:
                            raise RuntimeError(
                                "Nomad allocation failed to start due to a Docker image pull error. "
                                f"Allocation {alloc.id[:8]}: {summary}\n"
                                "If you're using a local image tag (e.g. `atropos-sandbox:local`) on macOS, "
                                "make sure the image is loaded into Docker, e.g.:\n"
                                "  docker buildx build --load -t atropos-sandbox:local -f Hermes-Agent/atropos/Dockerfile Hermes-Agent/atropos"
                            )
                        if "exceeded allowed attempts" in lowered:
                            raise RuntimeError(
                                "Nomad allocation is crash-looping and has entered restart backoff. "
                                f"Allocation {alloc.id[:8]}: {summary}\n"
                                "Inspect logs with:\n"
                                f"  nomad alloc logs -stderr -task sandbox-server {alloc.id}\n"
                                "Common causes include: missing local Docker image tag, container entrypoint error, "
                                "or sandbox-server startup failure."
                            )

            if healthy_count >= min_count:
                return

            await asyncio.sleep(2.0)

        # Timed out: include allocation status detail to help debugging.
        allocs = await self.nomad.get_job_allocations(self.config.job_id)
|
||||||
|
alloc_lines: List[str] = []
|
||||||
|
for alloc in allocs[:10]:
|
||||||
|
addr = alloc.http_address or "-"
|
||||||
|
line = f"{alloc.id[:8]} status={alloc.status.value} http={addr}"
|
||||||
|
detail = await self.nomad.get_allocation(alloc.id)
|
||||||
|
if isinstance(detail, dict):
|
||||||
|
summary = _summarize_alloc_detail(detail)
|
||||||
|
if summary:
|
||||||
|
line += f" detail={summary}"
|
||||||
|
alloc_lines.append(line)
|
||||||
|
|
||||||
|
hint = (
|
||||||
|
"Timed out waiting for healthy sandbox allocations.\n"
|
||||||
|
f"Job: {self.config.job_id}, desired_healthy: {min_count}\n"
|
||||||
|
"Allocations:\n - " + "\n - ".join(alloc_lines)
|
||||||
|
)
|
||||||
|
raise RuntimeError(hint)
|
||||||
|
|
||||||
|
async def _try_scale_up(self) -> bool:
|
||||||
|
"""Attempt to scale up the job."""
|
||||||
|
import time
|
||||||
|
|
||||||
|
async with self._scale_lock:
|
||||||
|
# Check cooldown
|
||||||
|
if time.time() - self._last_scale_time < self.config.scale_cooldown:
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Check max containers
|
||||||
|
status = await self.nomad.get_job_status(self.config.job_id)
|
||||||
|
if status is None:
|
||||||
|
return False
|
||||||
|
|
||||||
|
current_count = status.count
|
||||||
|
if current_count >= self.config.max_containers:
|
||||||
|
logger.warning(f"Cannot scale up: already at max ({self.config.max_containers})")
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Scale up
|
||||||
|
new_count = min(current_count + 1, self.config.max_containers)
|
||||||
|
logger.info(f"Scaling up from {current_count} to {new_count} containers")
|
||||||
|
|
||||||
|
scale_resp = await self.nomad.scale_job(
|
||||||
|
self.config.job_id,
|
||||||
|
count=new_count,
|
||||||
|
task_group="sandbox",
|
||||||
|
)
|
||||||
|
|
||||||
|
# Nomad may return non-JSON errors (e.g. plain text) with a status field.
|
||||||
|
if isinstance(scale_resp, dict) and scale_resp.get("status", 200) >= 400:
|
||||||
|
logger.warning(f"Scale request rejected: {scale_resp}")
|
||||||
|
self._last_scale_time = time.time()
|
||||||
|
return False
|
||||||
|
|
||||||
|
self._last_scale_time = time.time()
|
||||||
|
|
||||||
|
# Wait for new allocation in the background so contended acquires can still
|
||||||
|
# make progress (e.g. by grabbing slots released by other trajectories).
|
||||||
|
if self._scale_task is None or self._scale_task.done():
|
||||||
|
self._scale_task = asyncio.create_task(self._wait_for_scale(new_count))
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
async def _wait_for_scale(self, desired_count: int) -> None:
|
||||||
|
try:
|
||||||
|
await self._wait_for_healthy_allocations(desired_count, timeout=60.0)
|
||||||
|
await self._refresh_slots()
|
||||||
|
except asyncio.CancelledError:
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to scale up: {e}")
|
||||||
|
|
||||||
|
async def _health_check_loop(self) -> None:
|
||||||
|
"""Background task to monitor container health."""
|
||||||
|
while True:
|
||||||
|
try:
|
||||||
|
await asyncio.sleep(self.config.health_check_interval)
|
||||||
|
await self._refresh_slots()
|
||||||
|
except asyncio.CancelledError:
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Health check error: {e}")
|
||||||
|
|
||||||
|
def get_stats(self) -> Dict[str, Any]:
|
||||||
|
"""Get pool statistics."""
|
||||||
|
slots_by_state = {}
|
||||||
|
for slot in self._slots.values():
|
||||||
|
state = slot.state.value
|
||||||
|
slots_by_state[state] = slots_by_state.get(state, 0) + 1
|
||||||
|
|
||||||
|
container_count = len({s.alloc_id for s in self._slots.values()}) if self._slots else 0
|
||||||
|
|
||||||
|
return {
|
||||||
|
"total_slots": self.total_slots,
|
||||||
|
"available_slots": self.available_slots,
|
||||||
|
"acquired_slots": self.acquired_slots,
|
||||||
|
"containers": container_count,
|
||||||
|
"slots_by_state": slots_by_state,
|
||||||
|
"started": self._started,
|
||||||
|
}
|
||||||
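The fast-fail branch above scans recent Nomad task events for known failure strings. Below is a minimal standalone sketch of that event-summarization logic, run against a fabricated allocation payload (field names like `TaskStates` and `DisplayMessage` follow Nomad's allocation API; the payload itself is made up for illustration):

```python
# Standalone sketch of the event scan used by _wait_for_healthy_allocations,
# applied to a fabricated Nomad allocation detail dict.

def summarize_alloc_detail(detail: dict) -> str:
    """Mirror of _summarize_alloc_detail: join the last 3 events per task."""
    parts = []
    for task_name, st in (detail.get("TaskStates") or {}).items():
        events = (st or {}).get("Events") or []
        msgs = [
            ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
            for ev in events[-3:]
        ]
        msgs = [m for m in msgs if m]
        if msgs:
            parts.append(f"{task_name}: " + " | ".join(msgs))
    return "; ".join(parts)


detail = {
    "TaskStates": {
        "sandbox-server": {
            "Events": [
                {"Type": "Received"},
                {"DisplayMessage": "Failed to pull `atropos-sandbox:local`"},
                {"DisplayMessage": "Restarting task"},
            ]
        }
    }
}

summary = summarize_alloc_detail(detail)
print(summary)
# The lowered summary contains "failed to pull", so the pool would raise
# the image-pull RuntimeError instead of waiting out the full timeout.
print("failed to pull" in summary.lower())
```

Note how the latest event ("Restarting task") is generic; keeping the last three events is what surfaces the earlier image-pull root cause.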
atropos/slots/slot.py (new file, 159 lines)
@@ -0,0 +1,159 @@
"""
Slot abstraction for atropos-agent.

A Slot represents an isolated workspace for a single agent trajectory.
Slots are hosted on Nomad allocations and provide workspace isolation
via filesystem directories.
"""

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional
import uuid


class SlotState(Enum):
    """State of a slot in the pool."""
    AVAILABLE = "available"    # Ready to be acquired
    ACQUIRED = "acquired"      # Assigned to a trajectory
    EXECUTING = "executing"    # Currently executing a tool
    RELEASING = "releasing"    # Being released back to pool
    ERROR = "error"            # In error state


@dataclass
class Slot:
    """
    An isolated workspace for a single agent trajectory.

    Slots are the unit of scheduling - each trajectory runs in its own slot,
    with an isolated workspace directory. Multiple slots share a container.

    Attributes:
        slot_id: Unique identifier for this slot (e.g., "slot_0")
        alloc_id: Nomad allocation ID hosting this slot
        container_addr: HTTP address of the sandbox server (e.g., "http://10.0.0.1:8080")
        workspace_dir: Path to workspace in container (e.g., "/data/slot_0")
        state: Current state of the slot
        trajectory_id: ID of trajectory currently using this slot (if acquired)
        metadata: Additional metadata
    """
    slot_id: str
    alloc_id: str
    container_addr: str
    workspace_dir: str = ""
    state: SlotState = SlotState.AVAILABLE
    trajectory_id: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        """Set default workspace_dir if not provided."""
        if not self.workspace_dir:
            self.workspace_dir = f"/data/{self.slot_id}"

    @property
    def is_available(self) -> bool:
        """Check if slot is available for acquisition."""
        return self.state == SlotState.AVAILABLE

    @property
    def is_acquired(self) -> bool:
        """Check if slot is currently acquired."""
        return self.state in (SlotState.ACQUIRED, SlotState.EXECUTING)

    def acquire(self, trajectory_id: Optional[str] = None) -> None:
        """
        Mark slot as acquired by a trajectory.

        Args:
            trajectory_id: Optional ID of acquiring trajectory
        """
        if not self.is_available:
            raise RuntimeError(f"Cannot acquire slot {self.slot_id}: state is {self.state}")

        self.state = SlotState.ACQUIRED
        self.trajectory_id = trajectory_id or str(uuid.uuid4())

    def start_execution(self, execution_id: Optional[str] = None) -> None:
        """Mark slot as executing."""
        if self.state != SlotState.ACQUIRED:
            raise RuntimeError(f"Cannot start execution on slot {self.slot_id}: state is {self.state}")

        self.state = SlotState.EXECUTING
        if execution_id:
            self.metadata["current_execution_id"] = execution_id

    def end_execution(self) -> None:
        """Mark execution as complete, return to acquired state."""
        if self.state != SlotState.EXECUTING:
            raise RuntimeError(f"Cannot end execution on slot {self.slot_id}: state is {self.state}")

        self.state = SlotState.ACQUIRED
        self.metadata.pop("current_execution_id", None)

    def release(self) -> None:
        """Release slot back to available state."""
        self.state = SlotState.AVAILABLE
        self.trajectory_id = None
        self.metadata.pop("current_execution_id", None)

    def mark_error(self, error: str) -> None:
        """Mark slot as in error state."""
        self.state = SlotState.ERROR
        self.metadata["error"] = error

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for serialization."""
        return {
            "slot_id": self.slot_id,
            "alloc_id": self.alloc_id,
            "container_addr": self.container_addr,
            "workspace_dir": self.workspace_dir,
            "state": self.state.value,
            "trajectory_id": self.trajectory_id,
            "metadata": self.metadata,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Slot":
        """Create from dictionary."""
        return cls(
            slot_id=data["slot_id"],
            alloc_id=data["alloc_id"],
            container_addr=data["container_addr"],
            workspace_dir=data.get("workspace_dir", ""),
            state=SlotState(data.get("state", "available")),
            trajectory_id=data.get("trajectory_id"),
            metadata=data.get("metadata", {}),
        )

    def __repr__(self) -> str:
        return f"Slot({self.slot_id}, state={self.state.value}, alloc={self.alloc_id[:8]}...)"


def create_slots_for_allocation(
    alloc_id: str,
    container_addr: str,
    num_slots: int = 10,
) -> list["Slot"]:
    """
    Create slots for a Nomad allocation.

    Args:
        alloc_id: Nomad allocation ID
        container_addr: HTTP address of sandbox server
        num_slots: Number of slots to create

    Returns:
        List of Slot objects
    """
    slots = []
    for i in range(num_slots):
        slot_id = f"slot_{i}"
        slots.append(Slot(
            slot_id=slot_id,
            alloc_id=alloc_id,
            container_addr=container_addr,
            workspace_dir=f"/data/{slot_id}",
        ))
    return slots
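The `acquire`/`start_execution`/`end_execution`/`release` methods above implement a small state machine. A condensed, self-contained sketch of the intended happy-path transitions, using the same state names (note that the real `release()` is unconditional, while this table only models the normal path):

```python
from enum import Enum


class SlotState(Enum):
    AVAILABLE = "available"
    ACQUIRED = "acquired"
    EXECUTING = "executing"


# Legal happy-path transitions enforced by the Slot methods.
TRANSITIONS = {
    ("available", "acquire"): "acquired",
    ("acquired", "start_execution"): "executing",
    ("executing", "end_execution"): "acquired",
    ("acquired", "release"): "available",
}


def step(state: SlotState, action: str) -> SlotState:
    key = (state.value, action)
    if key not in TRANSITIONS:
        # Mirrors the RuntimeError raised by the Slot methods on bad transitions.
        raise RuntimeError(f"Cannot {action} in state {state.value}")
    return SlotState(TRANSITIONS[key])


# One full trajectory: acquire, run a tool, finish, release.
s = SlotState.AVAILABLE
for action in ("acquire", "start_execution", "end_execution", "release"):
    s = step(s, action)
print(s.value)  # "available" again after the full cycle
```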
atropos/terminal/__init__.py (new file, 2 lines)
@@ -0,0 +1,2 @@
"""Terminal helpers for stateful sandbox interactions."""
atropos/terminal/asciinema_stream.py (new file, 115 lines)
@@ -0,0 +1,115 @@
from __future__ import annotations

import json
from typing import Any

import pyte


class AsciinemaStreamDecoder:
    def __init__(self, *, default_width: int = 80, default_height: int = 24) -> None:
        self._default_width = max(1, int(default_width))
        self._default_height = max(1, int(default_height))
        self._buffer = ""
        self._has_header = False
        self.width = self._default_width
        self.height = self._default_height
        self._screen = pyte.Screen(self.width, self.height)
        self._stream = pyte.Stream(self._screen)

    def reset(self) -> None:
        self._buffer = ""
        self._has_header = False
        self.width = self._default_width
        self.height = self._default_height
        self._screen = pyte.Screen(self.width, self.height)
        self._stream = pyte.Stream(self._screen)

    def feed(self, chunk: str | bytes) -> None:
        if not chunk:
            return
        if isinstance(chunk, bytes):
            chunk = chunk.decode("utf-8", errors="replace")
        self._buffer += chunk
        while True:
            line, sep, rest = self._buffer.partition("\n")
            if not sep:
                break
            self._buffer = rest
            line = line.strip()
            if not line:
                continue
            parsed = self._parse_json_line(line)
            if parsed is None:
                continue
            if not self._has_header:
                if isinstance(parsed, dict):
                    self._init_from_header(parsed)
                    continue
                if isinstance(parsed, list):
                    self._has_header = True
                    self._apply_event(parsed)
                    continue
                continue
            if isinstance(parsed, list):
                self._apply_event(parsed)

    def render(self) -> str:
        return "\n".join(self._screen.display)

    def _parse_json_line(self, line: str) -> Any | None:
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            return None

    def _init_from_header(self, header: dict[str, Any]) -> None:
        width = _coerce_int(
            header.get("width") or header.get("columns") or header.get("cols"),
            self._default_width,
        )
        height = _coerce_int(
            header.get("height") or header.get("rows") or header.get("lines"),
            self._default_height,
        )
        self.width = max(1, width)
        self.height = max(1, height)
        self._screen = pyte.Screen(self.width, self.height)
        self._stream = pyte.Stream(self._screen)
        self._has_header = True

    def _apply_event(self, event: list[Any]) -> None:
        if len(event) < 2:
            return
        event_type = event[1]
        payload = event[2] if len(event) > 2 else ""
        if event_type == "o":
            if isinstance(payload, str):
                self._stream.feed(payload)
        elif event_type == "r":
            width, height = _parse_resize(payload)
            if width and height:
                self.width = width
                self.height = height
                # pyte's Screen.resize takes (lines, columns), so pass by keyword
                # to avoid swapping the dimensions.
                self._screen.resize(lines=height, columns=width)


def _coerce_int(value: Any, default: int) -> int:
    try:
        return int(value)
    except (TypeError, ValueError):
        return int(default)


def _parse_resize(payload: Any) -> tuple[int, int]:
    if isinstance(payload, str) and "x" in payload:
        left, right = payload.lower().split("x", 1)
        return _coerce_int(left, 0), _coerce_int(right, 0)
    if isinstance(payload, dict):
        width = _coerce_int(payload.get("width") or payload.get("columns") or payload.get("cols"), 0)
        height = _coerce_int(payload.get("height") or payload.get("rows") or payload.get("lines"), 0)
        return width, height
    if isinstance(payload, list) and len(payload) >= 2:
        return _coerce_int(payload[0], 0), _coerce_int(payload[1], 0)
    return 0, 0
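The decoder above consumes the asciinema v2 cast format: one JSON object header line, then one `[time, type, data]` JSON array per event, with `"o"` for terminal output and `"r"` for resizes. A stdlib-only sketch of that line protocol (skipping the pyte rendering step), on a fabricated cast:

```python
import json

# A tiny fabricated asciinema v2 cast: header line, two output events, one resize.
cast = "\n".join([
    '{"version": 2, "width": 10, "height": 3}',
    '[0.1, "o", "hello"]',
    '[0.2, "o", " world\\r\\n"]',
    '[0.3, "r", "12x4"]',
])

header = None
events = []
for line in cast.splitlines():
    parsed = json.loads(line)
    if header is None and isinstance(parsed, dict):
        header = parsed          # the first dict line is the header
    elif isinstance(parsed, list):
        events.append(parsed)    # [timestamp, event_type, payload]

# Concatenate only "o" (output) payloads, as the decoder feeds them to pyte.
output = "".join(payload for _, etype, payload in events if etype == "o")
print(header["width"], header["height"])  # 10 3
print(repr(output))                       # 'hello world\r\n'
```

The `"12x4"` resize payload is one of several shapes `_parse_resize` accepts (string `"WxH"`, dict, or two-element list).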
atropos/tools/__init__.py (new file, 26 lines)
@@ -0,0 +1,26 @@
"""
Tool abstractions for atropos-agent.

Provides base Tool class and common tool implementations.
"""

from .base import Tool, ToolCall, ToolRegistry, ToolResult, ToolSchema
from .build_registry import build_tool_registry
from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
from .terminal_stateful_tool import TerminalStatefulTool
from .tmux_tool import TmuxTool

__all__ = [
    "Tool",
    "ToolCall",
    "ToolRegistry",
    "ToolResult",
    "ToolSchema",
    "BashTool",
    "ReadFileTool",
    "WriteFileTool",
    "TerminalTool",
    "TerminalStatefulTool",
    "TmuxTool",
    "build_tool_registry",
]
atropos/tools/base.py (new file, 423 lines)
@@ -0,0 +1,423 @@
"""
Base Tool abstraction for atropos-agent.

Tools follow a simple pattern:
1. Define schema (name, description, parameters)
2. Implement execute() method
3. Return ToolResult with output/error

Tool calls use Hermes-style XML tags:
<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>
"""

import json
import re
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Literal, Optional

from pydantic import BaseModel, Field


@dataclass
class ToolSchema:
    """JSON Schema for a tool's parameters."""

    name: str
    description: str
    parameters: Dict[str, Any] = field(default_factory=dict)
    required: List[str] = field(default_factory=list)
    # Whether the tool must be executed via an external ToolServer (secret proxy)
    # and not inside the sandbox.
    external: bool = False

    def to_dict(self) -> Dict[str, Any]:
        """Convert to OpenAI-compatible function schema."""
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description,
                "parameters": {
                    "type": "object",
                    "properties": self.parameters,
                    "required": self.required,
                },
            },
        }

    def to_prompt_description(self) -> str:
        """Convert to human-readable description for system prompt."""
        params_desc = []
        for name, spec in self.parameters.items():
            req = "(required)" if name in self.required else "(optional)"
            desc = spec.get("description", "")
            param_type = spec.get("type", "string")
            params_desc.append(f"  - {name} ({param_type}) {req}: {desc}")

        params_str = "\n".join(params_desc) if params_desc else "  (no parameters)"
        return f"**{self.name}**: {self.description}\nParameters:\n{params_str}"


@dataclass
class ToolCall:
    """A parsed tool call from model output."""

    name: str
    arguments: Dict[str, Any]
    raw_text: str = ""  # Original XML/JSON text
    # Unique tool-call id for traceability/reconstruction.
    uniq_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    @classmethod
    def parse_from_text(cls, text: str) -> List["ToolCall"]:
        """
        Extract tool calls from text using Hermes-style XML tags.

        Supported formats (STRICT: requires well-formed closing tags):
        - Hermes JSON wrapper:
            <tool_call>{"name": "...", "arguments": {...}}</tool_call>
        - GLM/llama.cpp style:
            <tool_call>terminal{"command":"ls -la"}</tool_call>
        """
        calls: List["ToolCall"] = []

        if not text:
            return calls

        def _append_from_payload(*, name: str, arguments: Dict[str, Any], raw: str, uniq_id: Optional[str] = None) -> None:
            if not isinstance(name, str) or not name:
                return
            if not isinstance(arguments, dict):
                return
            calls.append(
                cls(
                    name=name,
                    arguments=arguments,
                    raw_text=raw,
                    uniq_id=uniq_id or str(uuid.uuid4()),
                )
            )

        # STRICT parsing: only accept well-formed <tool_call>...</tool_call> blocks.
        pattern = r"<tool_call>\s*(.*?)\s*</tool_call>"
        for inner in re.findall(pattern, text, re.DOTALL):
            cleaned = (inner or "").strip()
            if not cleaned:
                continue

            # Hermes JSON wrapper.
            if cleaned.startswith("{"):
                try:
                    data = json.loads(cleaned)
                except json.JSONDecodeError:
                    continue
                uniq_id = data.get("uniq_id") or data.get("id") or None
                _append_from_payload(
                    name=data.get("name", ""),
                    arguments=data.get("arguments", {}),
                    raw=inner,
                    uniq_id=uniq_id,
                )
                continue

            # GLM/llama.cpp style: terminal{...}
            m = re.match(r"^\s*([A-Za-z0-9_.:\\-]+)\s*(\{.*\})\s*$", cleaned, re.DOTALL)
            if not m:
                continue
            name = m.group(1)
            args_text = m.group(2)
            try:
                args = json.loads(args_text)
            except json.JSONDecodeError:
                continue
            _append_from_payload(name=name, arguments=args, raw=inner)

        return calls

    @classmethod
    def has_tool_call(cls, text: str) -> bool:
        """Check if text contains any tool calls."""
        return bool(re.search(r"<tool_call>", text))
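The strict extraction in `parse_from_text` boils down to one non-greedy regex over `<tool_call>...</tool_call>` blocks plus two payload shapes. A standalone sketch of just that parsing core (without the dataclass wrapper or uniq_id bookkeeping):

```python
import json
import re

text = (
    "Let me check.\n"
    '<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>\n'
    '<tool_call>terminal{"command": "pwd"}</tool_call>'
)

calls = []
for inner in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL):
    if inner.startswith("{"):
        # Hermes JSON wrapper: {"name": ..., "arguments": {...}}
        data = json.loads(inner)
        calls.append((data["name"], data["arguments"]))
    else:
        # GLM/llama.cpp style: name{...}
        m = re.match(r"^\s*([A-Za-z0-9_.:\-]+)\s*(\{.*\})\s*$", inner, re.DOTALL)
        if m:
            calls.append((m.group(1), json.loads(m.group(2))))

print(calls)
# [('bash', {'command': 'ls'}), ('terminal', {'command': 'pwd'})]
```

Because the pattern requires a well-formed closing tag, a truncated `<tool_call>` with no `</tool_call>` simply yields no match rather than a partial parse.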
@dataclass
class ToolResult:
    """Result from executing a tool."""

    success: bool
    output: str = ""
    error: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)
    uniq_id: Optional[str] = None  # Should match ToolCall.uniq_id for async execution tracking.

    def to_xml(self) -> str:
        """Format as XML for including in conversation."""
        data = {
            "success": self.success,
            "output": self.output,
        }
        if self.uniq_id:
            data["uniq_id"] = self.uniq_id
        if self.error:
            data["error"] = self.error
        if self.metadata:
            data["metadata"] = self.metadata
        return f"<tool_response>{json.dumps(data)}</tool_response>"

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary."""
        return {
            "success": self.success,
            "output": self.output,
            "error": self.error,
            "metadata": self.metadata,
            "uniq_id": self.uniq_id,
        }


class Tool(ABC):
    """
    Abstract base class for tools.

    Subclasses must implement:
    - schema: ToolSchema describing the tool
    - execute(): async method that performs the tool action
    """

    @property
    @abstractmethod
    def schema(self) -> ToolSchema:
        """Return the tool's schema."""
        pass

    @property
    def name(self) -> str:
        """Tool name (from schema)."""
        return self.schema.name

    @abstractmethod
    async def execute(self, **kwargs) -> ToolResult:
        """
        Execute the tool with given arguments.

        Args:
            **kwargs: Tool-specific arguments

        Returns:
            ToolResult with success/failure and output
        """
        pass

    def is_available(self) -> tuple[bool, str | None]:
        """
        Return whether this tool should be exposed/executable in the current process.

        Tools that depend on optional binaries/services/env vars can override this
        to avoid advertising a tool that will fail at runtime.
        """
        return True, None

    async def __call__(self, **kwargs) -> ToolResult:
        """Allow calling tool instance directly."""
        return await self.execute(**kwargs)


# Note: this registry only wraps tool declarations, both for tools executed
# via the external ToolServer (out-of-process) and for tools preinstalled in envs.
class ToolRegistry:
    """Registry of available tools."""

    def __init__(self):
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        """Register a tool."""
        self._tools[tool.name] = tool

    def get(self, name: str) -> Optional[Tool]:
        """Get a tool by name."""
        return self._tools.get(name)

    def list_tools(self) -> List[Tool]:
        """List all registered tools."""
        return list(self._tools.values())

    def get_schemas(self) -> List[ToolSchema]:
        """Get schemas for all registered tools."""
        return [tool.schema for tool in self._tools.values()]

    def get_prompt_description(self) -> str:
        """Generate tool descriptions for system prompt."""
        descriptions = [tool.schema.to_prompt_description() for tool in self._tools.values()]
        return "\n\n".join(descriptions)

    def get_prompt_tool_definitions_json(self) -> str:
        """
        Return a Hermes-style JSON list of tool definitions for use inside a `<tools>...</tools>` block.

        Hermes trajectories historically use a simplified schema list:
        [{"name": ..., "description": ..., "parameters": {...}, "required": null}, ...]
        """
        formatted: List[Dict[str, Any]] = []
        for tool in self._tools.values():
            fn = tool.schema.to_dict().get("function", {})
            formatted.append(
                {
                    "name": fn.get("name", tool.name),
                    "description": fn.get("description", ""),
                    "parameters": fn.get("parameters", {}),
                    # Keep parity with Hermes saved trajectories (required is typically null there).
                    "required": None,
                }
            )
        return json.dumps(formatted, ensure_ascii=False)

    async def execute(self, call: ToolCall) -> ToolResult:
        """Execute a tool call."""
        tool = self.get(call.name)
        if tool is None:
            return ToolResult(
                success=False,
                error=f"Unknown tool: {call.name}",
                uniq_id=call.uniq_id,
            )

        try:
            result = await tool.execute(**call.arguments)
            if result.uniq_id is None:
                result.uniq_id = call.uniq_id
            return result
        except Exception as e:
            return ToolResult(
                success=False,
                error=f"Tool execution error: {str(e)}",
                uniq_id=call.uniq_id,
            )
# =============================================================================
|
||||||
|
# FastAPI / transport models
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
class ToolCallPayload(BaseModel):
|
||||||
|
name: str
|
||||||
|
arguments: Dict[str, Any] = Field(default_factory=dict)
|
||||||
|
uniq_id: str
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_tool_call(cls, call: ToolCall) -> "ToolCallPayload":
|
||||||
|
return cls(name=call.name, arguments=call.arguments, uniq_id=call.uniq_id)
|
||||||
|
|
||||||
|
def to_tool_call(self) -> ToolCall:
|
||||||
|
return ToolCall(name=self.name, arguments=self.arguments, uniq_id=self.uniq_id)
|
||||||
|
|
||||||
|
|
||||||
|
class ToolResultPayload(BaseModel):
|
||||||
|
success: bool
|
||||||
|
output: str = ""
|
||||||
|
error: str = ""
|
||||||
|
metadata: Dict[str, Any] = Field(default_factory=dict)
|
||||||
|
uniq_id: Optional[str] = None
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_tool_result(cls, result: ToolResult) -> "ToolResultPayload":
|
||||||
|
return cls(
|
||||||
|
            success=result.success,
            output=result.output,
            error=result.error,
            metadata=result.metadata,
            uniq_id=result.uniq_id,
        )

    def to_tool_result(self) -> ToolResult:
        return ToolResult(
            success=self.success,
            output=self.output,
            error=self.error,
            metadata=self.metadata,
            uniq_id=self.uniq_id,
        )


class ToolExecutorExecuteRequest(BaseModel):
    trajectory_id: str
    tool: ToolCallPayload
    timeout_s: Optional[float] = None


class ToolExecutorReleaseRequest(BaseModel):
    trajectory_id: str
    reset_workspace: bool = False


class ToolServerExecuteRequest(BaseModel):
    trajectory_id: Optional[str] = None
    tool: ToolCallPayload
    timeout_s: Optional[float] = None
    # Optional sandbox context for tools that need workspace artifacts.
    # This is set by ToolExecutor and is NOT model-controlled.
    slot_id: Optional[str] = None
    container_addr: Optional[str] = None


# =============================================================================
# Artifact transport models
# =============================================================================


class ArtifactReadRequestPayload(BaseModel):
    trajectory_id: str
    path: str
    encoding: Literal["text", "base64"] = "text"
    max_bytes: Optional[int] = None
    include_sha256: bool = False


class ArtifactReadResponsePayload(BaseModel):
    success: bool
    content: str = ""
    error: str = ""
    encoding: str = "text"
    truncated: bool = False
    bytes: int = 0
    file_size: Optional[int] = None
    path: str = ""
    mime: Optional[str] = None
    sha256: Optional[str] = None


class ArtifactListRequestPayload(BaseModel):
    trajectory_id: str
    path: str = "."
    recursive: bool = False
    max_entries: Optional[int] = None


class ArtifactListEntryPayload(BaseModel):
    path: str
    is_dir: bool
    size: int
    mtime: float


class ArtifactListResponsePayload(BaseModel):
    success: bool
    entries: List[ArtifactListEntryPayload] = Field(default_factory=list)
    truncated: bool = False
    error: str = ""


class ArtifactArchiveRequestPayload(BaseModel):
    trajectory_id: str
    path: str = "."
    format: Literal["tar.gz", "tgz"] = "tar.gz"
    max_bytes: Optional[int] = None
    max_entries: Optional[int] = None


class ArtifactArchiveResponsePayload(BaseModel):
    success: bool
    content: str = ""
    error: str = ""
    encoding: str = "base64"
    format: str = "tar.gz"
    bytes: int = 0
    entry_count: int = 0
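The payload models above exist to move `ToolResult` objects over HTTP and back without losing fields. A minimal stdlib sketch of that round-trip, using plain dataclasses as stand-ins for the real Pydantic `ToolResult`/`ToolResultPayload` (field names match the models above; everything else here is illustrative):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToolResult:  # stand-in for the real ToolResult
    success: bool
    output: str = ""
    error: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)
    uniq_id: Optional[str] = None


@dataclass
class ToolResultPayload:  # stand-in for the real transport model
    success: bool
    output: str = ""
    error: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)
    uniq_id: Optional[str] = None

    @classmethod
    def from_tool_result(cls, result: ToolResult) -> "ToolResultPayload":
        # Field-by-field copy, mirroring the conversion methods above.
        return cls(
            success=result.success,
            output=result.output,
            error=result.error,
            metadata=result.metadata,
            uniq_id=result.uniq_id,
        )

    def to_tool_result(self) -> ToolResult:
        return ToolResult(
            success=self.success,
            output=self.output,
            error=self.error,
            metadata=self.metadata,
            uniq_id=self.uniq_id,
        )


original = ToolResult(success=True, output="hi", uniq_id="call-1")
roundtripped = ToolResultPayload.from_tool_result(original).to_tool_result()
print(roundtripped == original)  # → True (dataclass field equality)
```

Because the payload carries `uniq_id`, the executor can match a result back to the tool call that produced it even when a batch returns out of order.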
64 atropos/tools/build_registry.py Normal file
@@ -0,0 +1,64 @@
"""
Unified tool registry builder for Hermes-Agent Atropos integration.

This composes:
- sandbox tool stubs (terminal/bash/read_file/write_file + stateful terminal/tmux)
- Hermes external tools (web/vision/image/moa/skills/browser), executed via ToolServer

ToolExecutor only needs the schema + `external` routing bit; ToolServer executes
the external tools via Hermes' existing implementations.
"""

from __future__ import annotations

from typing import List, Optional

from .base import ToolRegistry
from .hermes_external_tools import build_external_tools
from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
from .terminal_stateful_tool import TerminalStatefulTool
from .tmux_tool import TmuxTool
from .toolset_resolver import resolve_multiple_toolsets


def build_tool_registry(
    *,
    enabled_toolsets: Optional[List[str]] = None,
    disabled_toolsets: Optional[List[str]] = None,
    tool_server_url: Optional[str] = None,
) -> ToolRegistry:
    """
    Build a ToolRegistry for AgentEnv / ToolExecutor / ToolServer.

    If `tool_server_url` is not provided, external tools will be omitted so we do
    not advertise tools that cannot execute.
    """
    enabled_toolsets = enabled_toolsets or ["default"]

    # Resolve tool names using Hermes toolsets plus Atropos additions.
    selected = set(resolve_multiple_toolsets(enabled_toolsets))
    if disabled_toolsets:
        selected -= set(resolve_multiple_toolsets(disabled_toolsets))

    reg = ToolRegistry()

    # Always register sandbox tools if selected.
    sandbox_by_name = {
        "terminal": TerminalTool(),
        "bash": BashTool(),
        "read_file": ReadFileTool(),
        "write_file": WriteFileTool(),
        "terminal_stateful": TerminalStatefulTool(),
        "tmux": TmuxTool(),
    }
    for name, tool in sandbox_by_name.items():
        if name in selected:
            reg.register(tool)

    # External tools: only include when ToolServer is configured.
    if tool_server_url:
        for tool in build_external_tools(selected_tool_names=selected):
            if tool.name in selected:
                reg.register(tool)

    return reg
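The selection logic in `build_tool_registry` is plain set arithmetic over resolved tool names: enabled toolsets expand to tool names, then disabled toolsets subtract from that set. A self-contained sketch with a hypothetical toolset mapping (the real mapping lives in `toolset_resolver` and differs from this):

```python
# Hypothetical stand-in for resolve_multiple_toolsets: maps toolset names
# to concrete tool names. Illustrative only.
TOOLSETS = {
    "default": ["terminal", "read_file", "write_file", "web_search"],
    "sandbox": ["terminal", "bash", "read_file", "write_file"],
    "web": ["web_search"],
}


def resolve_multiple_toolsets(names):
    resolved = []
    for name in names:
        resolved.extend(TOOLSETS.get(name, []))
    return resolved


# Same shape as build_tool_registry: enabled minus disabled.
selected = set(resolve_multiple_toolsets(["default", "sandbox"]))
selected -= set(resolve_multiple_toolsets(["web"]))
print(sorted(selected))  # → ['bash', 'read_file', 'terminal', 'write_file']
```

Because resolution happens before registration, a tool disabled by any toolset never reaches the registry, regardless of how many enabled toolsets include it.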
90 atropos/tools/hermes_external_tools.py Normal file
@@ -0,0 +1,90 @@
"""
Hermes external tool adapter for Atropos ToolServer.

These tools reuse Hermes-Agent's existing tool runner (`model_tools.handle_function_call`)
so we don't duplicate external tool implementations.

Important:
- These are marked `external=True` and should be executed ONLY by ToolServer.
- We run `handle_function_call` in a worker thread because the Hermes implementation
  uses `asyncio.run()` internally for some async tools (web_extract, vision, MoA, etc).
"""

from __future__ import annotations

import asyncio
import json
from typing import Any, Dict, List, Optional

import model_tools

from .base import Tool, ToolResult, ToolSchema


def _schema_from_openai_tool_dict(tool: Dict[str, Any], *, external: bool) -> ToolSchema:
    fn = tool.get("function") or {}
    name = str(fn.get("name") or "")
    description = str(fn.get("description") or "")
    params = fn.get("parameters") or {}
    properties = params.get("properties") or {}
    required = params.get("required") or []
    if not isinstance(required, list):
        required = []
    return ToolSchema(
        name=name,
        description=description,
        parameters=dict(properties),
        required=[str(x) for x in required if isinstance(x, (str, int))],
        external=external,
    )


class HermesExternalTool(Tool):
    def __init__(self, schema: ToolSchema):
        self._schema = schema

    @property
    def schema(self) -> ToolSchema:
        return self._schema

    async def execute(self, task_id: Optional[str] = None, **kwargs: Any) -> ToolResult:
        # `model_tools.handle_function_call` returns a JSON string (success or error).
        # Run in a thread because some Hermes tool handlers call `asyncio.run()`.
        raw = await asyncio.to_thread(model_tools.handle_function_call, self.name, kwargs, task_id)

        try:
            parsed = json.loads(raw)
        except Exception:
            # Keep as plain string.
            return ToolResult(success=True, output=str(raw))

        if isinstance(parsed, dict) and parsed.get("error"):
            return ToolResult(success=False, error=str(parsed.get("error")), output="")

        return ToolResult(success=True, output=json.dumps(parsed, ensure_ascii=False))


def build_external_tools(
    *,
    selected_tool_names: Optional[set[str]] = None,
) -> List[HermesExternalTool]:
    """
    Build external tool wrappers from Hermes tool declarations.

    Filters out sandbox-oriented tools (e.g. `terminal`) since those should run
    inside the sandbox via ToolExecutor.
    """
    # IMPORTANT: Hermes' `model_tools.get_tool_definitions()` only understands Hermes toolsets.
    # Atropos envs add extra toolsets (filesystem/sandbox/stateful). To avoid noisy "Unknown toolset"
    # prints and accidental filtering, we fetch ALL Hermes tool definitions here and filter by name.
    tools = model_tools.get_tool_definitions(enabled_toolsets=None, disabled_toolsets=None, quiet_mode=True)

    wrappers: List[HermesExternalTool] = []
    for t in tools:
        schema = _schema_from_openai_tool_dict(t, external=True)
        if schema.name in {"terminal"}:
            continue
        if selected_tool_names is not None and schema.name not in selected_tool_names:
            continue
        wrappers.append(HermesExternalTool(schema))
    return wrappers
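`_schema_from_openai_tool_dict` defensively flattens an OpenAI-style function-tool dict, where any level may be missing or malformed. A self-contained sketch of the same extraction, using a plain dataclass as a stand-in for the real `ToolSchema`:

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToolSchema:  # minimal stand-in for the real .base.ToolSchema
    name: str
    description: str
    parameters: Dict[str, Any]
    required: List[str]
    external: bool


def schema_from_openai_tool_dict(tool: Dict[str, Any], *, external: bool) -> ToolSchema:
    # Every `.get(...) or default` guards against both missing keys and None values.
    fn = tool.get("function") or {}
    params = fn.get("parameters") or {}
    required = params.get("required") or []
    if not isinstance(required, list):
        required = []
    return ToolSchema(
        name=str(fn.get("name") or ""),
        description=str(fn.get("description") or ""),
        parameters=dict(params.get("properties") or {}),
        required=[str(x) for x in required if isinstance(x, (str, int))],
        external=external,
    )


openai_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
schema = schema_from_openai_tool_dict(openai_tool, external=True)
print(schema.name, schema.required)  # → web_search ['query']
```

Note that only the `properties` sub-dict is kept as `parameters`; the wrapping `{"type": "object", ...}` envelope is dropped, since the registry reconstructs it when advertising tools.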
99 atropos/tools/sandbox_stubs.py Normal file
@@ -0,0 +1,99 @@
"""
Sandbox tool stubs for Atropos ToolExecutor.

These tools are executed inside the sandbox containers via:
    ToolExecutor -> SlotPool -> sandbox_server.py

They intentionally do NOT execute anything on the host process. If they are
called directly (outside ToolExecutor), they return a clear error.
"""

from __future__ import annotations

from typing import Optional

from .base import Tool, ToolResult, ToolSchema


class TerminalTool(Tool):
    @property
    def schema(self) -> ToolSchema:
        return ToolSchema(
            name="terminal",
            description=(
                "Execute a command inside the sandbox slot workspace and return stdout/stderr. "
                "Filesystem persists within a trajectory slot. Background processes are not supported "
                "in stateless mode. Commands run under POSIX /bin/sh and each tool call runs in a fresh "
                "shell (no persisted env vars). Avoid bash-only syntax like `source`; prefer `. .venv/bin/activate` "
                "or invoke `.venv/bin/python ...` directly."
            ),
            parameters={
                "command": {"type": "string", "description": "The command to execute"},
                "timeout": {
                    "type": "integer",
                    "description": "Command timeout in seconds (optional).",
                    "minimum": 1,
                },
                "background": {
                    "type": "boolean",
                    "description": "Not supported in sandbox terminal (always false).",
                    "default": False,
                },
            },
            required=["command"],
            external=False,
        )

    async def execute(self, **_kwargs) -> ToolResult:
        return ToolResult(
            success=False,
            error="terminal must be executed via ToolExecutor inside the sandbox",
        )


class BashTool(Tool):
    @property
    def schema(self) -> ToolSchema:
        return ToolSchema(
            name="bash",
            description="Execute a bash command inside the sandbox slot workspace.",
            parameters={"command": {"type": "string", "description": "The bash command to execute"}},
            required=["command"],
            external=False,
        )

    async def execute(self, **_kwargs) -> ToolResult:
        return ToolResult(success=False, error="bash must be executed via ToolExecutor inside the sandbox")


class ReadFileTool(Tool):
    @property
    def schema(self) -> ToolSchema:
        return ToolSchema(
            name="read_file",
            description="Read a file from the sandbox slot workspace.",
            parameters={"path": {"type": "string", "description": "Path to the file"}},
            required=["path"],
            external=False,
        )

    async def execute(self, **_kwargs) -> ToolResult:
        return ToolResult(success=False, error="read_file must be executed via ToolExecutor inside the sandbox")


class WriteFileTool(Tool):
    @property
    def schema(self) -> ToolSchema:
        return ToolSchema(
            name="write_file",
            description="Write a file into the sandbox slot workspace.",
            parameters={
                "path": {"type": "string", "description": "Path to the file"},
                "content": {"type": "string", "description": "File content"},
            },
            required=["path", "content"],
            external=False,
        )

    async def execute(self, **_kwargs) -> ToolResult:
        return ToolResult(success=False, error="write_file must be executed via ToolExecutor inside the sandbox")
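Each stub above follows the same pattern: advertise a full schema to the model, but refuse host-side execution so the only working path is through ToolExecutor into the sandbox. A self-contained sketch of that fail-closed pattern (class and field names here are minimal stand-ins, not the real base classes):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToolResult:  # minimal stand-in
    success: bool
    output: str = ""
    error: str = ""


class StubTool:
    """Host-side stub: advertises a schema elsewhere, refuses direct execution."""

    name = "read_file"

    async def execute(self, **_kwargs) -> ToolResult:
        # Fail closed: never touch the host filesystem or shell.
        return ToolResult(
            success=False,
            error=f"{self.name} must be executed via ToolExecutor inside the sandbox",
        )


result = asyncio.run(StubTool().execute(path="/etc/hosts"))
print(result.success)  # → False
```

The design choice is deliberate: a misrouted call degrades into a visible error result rather than silently executing model-controlled commands on the host.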
45 atropos/tools/terminal_stateful_tool.py Normal file
@@ -0,0 +1,45 @@
"""
Stateful terminal tool schema.

This is a sandbox tool that routes to the sandbox server as `bash_stateful`
via ToolExecutor mapping. It exists to expose an explicit, opt-in terminal
primitive suitable for stateful workflows (e.g. tmux sessions / TUIs).
"""

from __future__ import annotations

from typing import Optional

from .base import Tool, ToolResult, ToolSchema


class TerminalStatefulTool(Tool):
    @property
    def schema(self) -> ToolSchema:
        return ToolSchema(
            name="terminal_stateful",
            description=(
                "Execute a command in the sandbox, allowing stateful/background processes to persist "
                "across tool calls within the same trajectory slot (e.g. tmux sessions). "
                "Use sparingly; output is still non-interactive."
            ),
            parameters={
                "command": {"type": "string", "description": "The command to execute"},
                "timeout": {
                    "type": "integer",
                    "description": "Command timeout in seconds (optional).",
                    "minimum": 1,
                },
            },
            required=["command"],
        )

    def is_available(self) -> tuple[bool, str | None]:
        return True, None

    async def execute(self, command: str, timeout: Optional[int] = None) -> ToolResult:
        _ = (command, timeout)
        return ToolResult(
            success=False,
            error="terminal_stateful must be executed via ToolExecutor inside the sandbox",
        )
89 atropos/tools/tmux_tool.py Normal file
@@ -0,0 +1,89 @@
"""
tmux tool schema (sandbox).

This is a sandbox tool that provides basic tmux session control suitable for
TUI-style terminal interactions:
- send keys (arrow keys, enter, etc.)
- capture the current screen buffer

Execution is routed by ToolExecutor to the sandbox server's `tmux` backend.
"""

from __future__ import annotations

from typing import Any, Dict, Optional

from .base import Tool, ToolResult, ToolSchema


class TmuxTool(Tool):
    @property
    def schema(self) -> ToolSchema:
        return ToolSchema(
            name="tmux",
            description=(
                "Control a per-trajectory tmux session inside the sandbox (stateful terminal). "
                "Use this for TUI-style interactions: send keys and capture the current screen."
            ),
            parameters={
                "action": {
                    "type": "string",
                    "description": "Action to perform: start | send_keys | stream | stop.",
                    "enum": ["start", "send_keys", "stream", "stop", "capture"],
                },
                "keys": {
                    "description": "Keys to send (string or list of strings) when action=send_keys.",
                },
                "block": {
                    "type": "boolean",
                    "description": "If true, wait for shell command completion (only valid at a shell prompt).",
                    "default": False,
                },
                "min_wait_s": {
                    "type": "number",
                    "description": "For non-blocking send_keys, sleep this long after sending keys (seconds).",
                    "default": 0.0,
                },
                "max_wait_s": {
                    "type": "number",
                    "description": "For blocking send_keys, max time to wait for completion (seconds).",
                },
                "capture_entire": {
                    "type": "boolean",
                    "description": "Deprecated. Streaming is preferred.",
                    "default": False,
                },
                "max_bytes": {
                    "type": "integer",
                    "description": "Max bytes to return per stream call.",
                },
                "reset": {
                    "type": "boolean",
                    "description": "If true, reset stream offset to the beginning of the asciinema recording.",
                    "default": False,
                },
                "pane_width": {
                    "type": "integer",
                    "description": "Pane width for action=start (columns).",
                    "minimum": 20,
                },
                "pane_height": {
                    "type": "integer",
                    "description": "Pane height for action=start (rows).",
                    "minimum": 10,
                },
            },
            required=["action"],
        )

    def is_available(self) -> tuple[bool, str | None]:
        return True, None

    async def execute(self, **kwargs: Dict[str, Any]) -> ToolResult:
        # This tool is intended to be executed via ToolExecutor -> sandbox server.
        # We keep a safe fallback for non-sandbox contexts.
        action = str(kwargs.get("action") or "").strip()
        return ToolResult(
            success=False,
            error=f"tmux tool must be executed in the sandbox (got action={action!r})",
        )
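The ToolExecutor that follows clamps model-suggested timeouts: a `timeout` argument in the tool call may suggest a value, but it is bounded to the range [1, 600] seconds so the model can never request an unbounded wait. A standalone sketch of that clamp (the function name here is illustrative; the logic mirrors `ToolExecutor.execute`):

```python
from typing import Any, Dict, Optional


def clamp_timeout(timeout_s: Optional[float], tool_args: Dict[str, Any]) -> Optional[float]:
    # If the caller did not pick a timeout, let the tool args suggest one.
    if timeout_s is None:
        raw = tool_args.get("timeout")
        if isinstance(raw, (int, float)):
            timeout_s = float(raw)
    # Never allow "infinite" or sub-second timeouts from the model.
    if timeout_s is not None:
        timeout_s = max(1.0, min(float(timeout_s), 600.0))
    return timeout_s


print(clamp_timeout(None, {"timeout": 5}))      # → 5.0
print(clamp_timeout(None, {"timeout": 99999}))  # → 600.0
print(clamp_timeout(None, {}))                  # → None (backend default applies later)
```

A `None` result is intentional: it defers to `backend.default_timeout_s` (falling back to 30 s) during batch grouping, rather than inventing a value here.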
500 atropos/tools/tool_executor.py Normal file
@@ -0,0 +1,500 @@
"""
ToolExecutor - queued, batched tool dispatch for multiplexed agent trajectories.

This component is responsible for:
- Maintaining trajectory -> Slot affinity (workspace continuity)
- Batching sandbox tool calls across trajectories to maximize container utilization
- Routing external tools (ToolSchema.external=True) to a ToolServer (Phase 4.5)

For now, only sandbox tools are executed:
- bash
- read_file
- write_file
"""

from __future__ import annotations

import asyncio
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import httpx

from .base import (
    ArtifactArchiveRequestPayload,
    ArtifactArchiveResponsePayload,
    ArtifactListRequestPayload,
    ArtifactListResponsePayload,
    ArtifactReadRequestPayload,
    ArtifactReadResponsePayload,
    ToolCall,
    ToolCallPayload,
    ToolRegistry,
    ToolResult,
    ToolResultPayload,
    ToolServerExecuteRequest,
)
from ..backends.base import ToolBackend
from ..slots import Slot


@dataclass
class ToolExecutorConfig:
    batch_window_ms: int = 20
    max_batch_size: int = 200
    allow_network: bool = True
    require_sandbox: bool = False
    require_stateful_sandbox: bool = False
    tool_server_url: Optional[str] = None
    tool_server_token: Optional[str] = None


@dataclass
class _QueuedToolRequest:
    trajectory_id: str
    call: ToolCall
    timeout_s: Optional[float]
    future: asyncio.Future


class ToolExecutor:
    def __init__(
        self,
        backend: ToolBackend,
        tools: ToolRegistry,
        config: Optional[ToolExecutorConfig] = None,
    ) -> None:
        self.backend = backend
        self.tools = tools
        self.config = config or ToolExecutorConfig()

        self._queue: asyncio.Queue[Optional[_QueuedToolRequest]] = asyncio.Queue()
        self._task: Optional[asyncio.Task] = None
        self._stopping = asyncio.Event()

        self._slots_lock = asyncio.Lock()
        self._slot_by_trajectory: Dict[str, Slot] = {}

        self._tool_server_client: Optional[httpx.AsyncClient] = None
        self._tool_server_lock = asyncio.Lock()

        # lightweight stats for status endpoints
        self.total_requests: int = 0
        self.total_errors: int = 0
        self.latencies_s: List[float] = []

    async def start(self) -> None:
        if self._task is None:
            self._task = asyncio.create_task(self._run_loop())

    def queue_size(self) -> int:
        return self._queue.qsize()

    async def close(self) -> None:
        self._stopping.set()
        await self._queue.put(None)
        if self._task:
            await self._task
            self._task = None

        client = self._tool_server_client
        self._tool_server_client = None
        if client is not None:
            await client.aclose()

        # Best-effort release any remaining slots.
        async with self._slots_lock:
            slots = list(self._slot_by_trajectory.items())
            self._slot_by_trajectory.clear()

        for _, slot in slots:
            try:
                await self.backend.release(slot, reset_workspace=False)
            except Exception:
                pass

    async def execute(
        self,
        trajectory_id: str,
        call: ToolCall,
        timeout_s: Optional[float] = None,
    ) -> ToolResult:
        if self._task is None:
            raise RuntimeError("ToolExecutor not started (call start() first)")

        # Allow tool args to suggest a timeout (Hermes-compatible terminal tool),
        # but never let the model choose "infinite" timeouts.
        if timeout_s is None:
            raw_timeout = call.arguments.get("timeout")
            if isinstance(raw_timeout, (int, float)):
                timeout_s = float(raw_timeout)
        if timeout_s is not None:
            timeout_s = max(1.0, min(float(timeout_s), 600.0))

        loop = asyncio.get_running_loop()
        fut: asyncio.Future = loop.create_future()
        started = time.perf_counter()
        await self._queue.put(_QueuedToolRequest(trajectory_id=trajectory_id, call=call, timeout_s=timeout_s, future=fut))
        try:
            result: ToolResult = await fut
            return result
        finally:
            self.latencies_s.append(time.perf_counter() - started)

    async def release_trajectory(self, trajectory_id: str, reset_workspace: bool = False) -> None:
        async with self._slots_lock:
            slot = self._slot_by_trajectory.pop(trajectory_id, None)

        if slot is not None:
            await self.backend.release(slot, reset_workspace=reset_workspace)

    async def _get_slot_if_present(self, trajectory_id: str) -> Optional[Slot]:
        async with self._slots_lock:
            return self._slot_by_trajectory.get(trajectory_id)

    # ---------------------------------------------------------------------
    # Artifact helpers (optional)
    # ---------------------------------------------------------------------

    async def read_artifact(self, req: ArtifactReadRequestPayload) -> ArtifactReadResponsePayload:
        slot = await self._get_slot_if_present(req.trajectory_id)
        if slot is None:
            return ArtifactReadResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
        data = await self.backend.read_artifact(
            slot,
            req.path,
            encoding=req.encoding,
            max_bytes=req.max_bytes,
            include_sha256=req.include_sha256,
        )
        if isinstance(data, dict):
            data = dict(data)
            data.pop("http_status", None)
        try:
            return ArtifactReadResponsePayload(**(data or {}))
        except Exception as e:
            return ArtifactReadResponsePayload(success=False, error=f"Invalid artifact read response: {e}")

    async def list_artifacts(self, req: ArtifactListRequestPayload) -> ArtifactListResponsePayload:
        slot = await self._get_slot_if_present(req.trajectory_id)
        if slot is None:
            return ArtifactListResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
        data = await self.backend.list_artifacts(
            slot,
            req.path,
            recursive=req.recursive,
            max_entries=req.max_entries,
        )
        if isinstance(data, dict):
            data = dict(data)
            data.pop("http_status", None)
        try:
            return ArtifactListResponsePayload(**(data or {}))
        except Exception as e:
            return ArtifactListResponsePayload(success=False, error=f"Invalid artifact list response: {e}")

    async def archive_artifacts(self, req: ArtifactArchiveRequestPayload) -> ArtifactArchiveResponsePayload:
        slot = await self._get_slot_if_present(req.trajectory_id)
        if slot is None:
            return ArtifactArchiveResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
        data = await self.backend.archive_artifacts(
            slot,
            req.path,
            archive_format=req.format,
            max_bytes=req.max_bytes,
            max_entries=req.max_entries,
        )
        if isinstance(data, dict):
            data = dict(data)
            data.pop("http_status", None)
        try:
            return ArtifactArchiveResponsePayload(**(data or {}))
        except Exception as e:
            return ArtifactArchiveResponsePayload(success=False, error=f"Invalid artifact archive response: {e}")

    async def _get_or_acquire_slot(self, trajectory_id: str) -> Slot:
        async with self._slots_lock:
            existing = self._slot_by_trajectory.get(trajectory_id)
            if existing is not None:
                return existing

        slot = await self.backend.acquire(trajectory_id)

        async with self._slots_lock:
            existing = self._slot_by_trajectory.get(trajectory_id)
            if existing is not None:
                # Another coroutine won the race; return its slot.
                await self.backend.release(slot, reset_workspace=False)
                return existing
            self._slot_by_trajectory[trajectory_id] = slot
            return slot

    async def _run_loop(self) -> None:
        pending: List[_QueuedToolRequest] = []
        deadline: Optional[float] = None

        batch_window_s = max(0.0, self.config.batch_window_ms / 1000.0)
        max_batch = max(1, self.config.max_batch_size)

        while True:
            if self._stopping.is_set() and self._queue.empty() and not pending:
                break

            timeout = None
            if pending and deadline is not None:
                timeout = max(0.0, deadline - time.perf_counter())

            try:
                item = await asyncio.wait_for(self._queue.get(), timeout=timeout)
                if item is None:
                    continue
                pending.append(item)
                if len(pending) == 1:
                    deadline = time.perf_counter() + batch_window_s
                if len(pending) < max_batch:
                    continue
            except asyncio.TimeoutError:
                # batch window elapsed
                pass

            if not pending:
                deadline = None
                continue

            batch = pending
            pending = []
            deadline = None

            await self._execute_batch(batch)

    async def _get_tool_server_client(self) -> httpx.AsyncClient:
        url = self.config.tool_server_url
        if not url:
            raise RuntimeError("ToolServer not configured")

        if self._tool_server_client is not None:
            return self._tool_server_client

        async with self._tool_server_lock:
            if self._tool_server_client is None:
                self._tool_server_client = httpx.AsyncClient(base_url=url.rstrip("/"))
            return self._tool_server_client

    def _tool_server_headers(self) -> Dict[str, str]:
        token = self.config.tool_server_token
        if not token:
            return {}
        return {"Authorization": f"Bearer {token}"}

    async def _execute_external(self, req: _QueuedToolRequest) -> ToolResult:
        client = await self._get_tool_server_client()
        slot_id: Optional[str] = None
        container_addr: Optional[str] = None
        slot = await self._get_slot_if_present(req.trajectory_id)
        if slot is not None:
            slot_id = slot.slot_id
            container_addr = slot.container_addr

        payload = ToolServerExecuteRequest(
            trajectory_id=req.trajectory_id,
            tool=ToolCallPayload.from_tool_call(req.call),
            timeout_s=req.timeout_s,
            slot_id=slot_id,
            container_addr=container_addr,
        )

        try:
            resp = await client.post(
                "/execute",
                json=payload.model_dump(),
                headers=self._tool_server_headers(),
                timeout=req.timeout_s,
            )
            resp.raise_for_status()
            data = resp.json()
            parsed = ToolResultPayload(**data)
            result = parsed.to_tool_result()
            if result.uniq_id is None:
                result.uniq_id = req.call.uniq_id
            return result
        except Exception as e:
            return ToolResult(
                success=False,
                error=f"External tool failed: {e}",
                uniq_id=req.call.uniq_id,
            )

    async def _execute_batch(self, batch: List[_QueuedToolRequest]) -> None:
        # Resolve tool schemas once per request and separate sandbox/external/unknown.
        sandbox_items: List[_QueuedToolRequest] = []
        external_items: List[_QueuedToolRequest] = []
        unknown_items: List[_QueuedToolRequest] = []

        for it in batch:
            tool = self.tools.get(it.call.name)
            if tool is None:
                unknown_items.append(it)
                continue

            schema = tool.schema
            if not schema.external:
                sandbox_items.append(it)
            else:
                external_items.append(it)

        for it in unknown_items:
            self.total_requests += 1
            self.total_errors += 1
            if not it.future.done():
                it.future.set_result(
                    ToolResult(
                        success=False,
                        error=f"Unknown tool: {it.call.name}",
                        uniq_id=it.call.uniq_id,
                    )
                )

        if external_items:
            if not self.config.tool_server_url:
                for it in external_items:
                    self.total_requests += 1
                    self.total_errors += 1
                    if not it.future.done():
                        it.future.set_result(
                            ToolResult(
                                success=False,
                                error=f"External tool not available (ToolServer not configured): {it.call.name}",
                                uniq_id=it.call.uniq_id,
                            )
                        )
            else:
                results = await asyncio.gather(*[self._execute_external(it) for it in external_items])
                for it, res in zip(external_items, results):
                    self.total_requests += 1
                    if not getattr(res, "success", False):
                        self.total_errors += 1
                    if not it.future.done():
                        it.future.set_result(res)

        if not sandbox_items:
            return

        # Acquire slots for the distinct trajectories in this batch.
        try:
            traj_ids = list({it.trajectory_id for it in sandbox_items})
            slots = await asyncio.gather(*[self._get_or_acquire_slot(tid) for tid in traj_ids])
            slot_by_traj = dict(zip(traj_ids, slots))
        except Exception as e:
            for it in sandbox_items:
                self.total_requests += 1
                self.total_errors += 1
                if not it.future.done():
                    it.future.set_result(
                        ToolResult(
                            success=False,
                            error=f"Failed to acquire slot: {e}",
                            uniq_id=it.call.uniq_id,
                        )
                    )
            return

        # Group by timeout so we don't accidentally make short timeouts wait on long ones.
        by_timeout: Dict[float, List[_QueuedToolRequest]] = {}
        default_timeout = self.backend.default_timeout_s

        for it in sandbox_items:
            t = it.timeout_s
            if t is None:
                t = default_timeout
            if t is None:
                t = 30.0
            by_timeout.setdefault(float(t), []).append(it)

        for timeout_s, items in by_timeout.items():
            requests = []
            dispatched: List[_QueuedToolRequest] = []
            for it in items:
                slot = slot_by_traj[it.trajectory_id]
                tool_name = it.call.name
                args = dict(it.call.arguments)

                # Hermes compatibility: treat `terminal` as an alias of sandbox `bash`.
                if tool_name == "terminal":
                    if args.get("background"):
                        self.total_requests += 1
                        self.total_errors += 1
                        if not it.future.done():
|
||||||
|
it.future.set_result(
|
||||||
|
ToolResult(
|
||||||
|
success=False,
|
||||||
|
error="terminal background execution is not supported in sandbox",
|
||||||
|
uniq_id=it.call.uniq_id,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
tool_name = "bash"
|
||||||
|
# `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
|
||||||
|
args.pop("timeout", None)
|
||||||
|
elif tool_name == "terminal_stateful":
|
||||||
|
tool_name = "bash_stateful"
|
||||||
|
args.pop("timeout", None)
|
||||||
|
elif tool_name == "tmux":
|
||||||
|
# `tmux` is a sandbox tool backed by the stateful session manager.
|
||||||
|
# Network policy is env-controlled.
|
||||||
|
args.pop("allow_network", None)
|
||||||
|
|
||||||
|
if tool_name == "bash":
|
||||||
|
# Network policy is set by the environment/executor, not by the model.
|
||||||
|
args.pop("allow_network", None)
|
||||||
|
args.pop("require_sandbox", None)
|
||||||
|
args["allow_network"] = bool(self.config.allow_network)
|
||||||
|
args["require_sandbox"] = bool(self.config.require_sandbox)
|
||||||
|
# `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
|
||||||
|
args.pop("timeout", None)
|
||||||
|
elif tool_name == "bash_stateful":
|
||||||
|
# Network policy is set by the environment/executor, not by the model.
|
||||||
|
args.pop("allow_network", None)
|
||||||
|
args.pop("require_sandbox", None)
|
||||||
|
args.pop("require_stateful_sandbox", None)
|
||||||
|
args["allow_network"] = bool(self.config.allow_network)
|
||||||
|
args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
|
||||||
|
args.pop("timeout", None)
|
||||||
|
elif tool_name == "tmux":
|
||||||
|
# Network policy applies to the underlying stateful session.
|
||||||
|
args.pop("allow_network", None)
|
||||||
|
args.pop("require_sandbox", None)
|
||||||
|
args.pop("require_stateful_sandbox", None)
|
||||||
|
args["allow_network"] = bool(self.config.allow_network)
|
||||||
|
args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
|
||||||
|
|
||||||
|
requests.append((slot, tool_name, args))
|
||||||
|
dispatched.append(it)
|
||||||
|
|
||||||
|
results = None
|
||||||
|
try:
|
||||||
|
if not dispatched:
|
||||||
|
continue
|
||||||
|
results = await self.backend.execute_batch(requests, timeout_s=timeout_s)
|
||||||
|
except Exception as e:
|
||||||
|
for it in items:
|
||||||
|
self.total_requests += 1
|
||||||
|
self.total_errors += 1
|
||||||
|
if not it.future.done():
|
||||||
|
it.future.set_result(
|
||||||
|
ToolResult(
|
||||||
|
success=False,
|
||||||
|
error=f"Batch execution failed: {e}",
|
||||||
|
uniq_id=it.call.uniq_id,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
|
||||||
|
for it, res in zip(dispatched, results):
|
||||||
|
self.total_requests += 1
|
||||||
|
if not getattr(res, "success", False):
|
||||||
|
self.total_errors += 1
|
||||||
|
tool_result = res.to_tool_result()
|
||||||
|
tool_result.uniq_id = it.call.uniq_id
|
||||||
|
if not it.future.done():
|
||||||
|
it.future.set_result(tool_result)
|
||||||
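The timeout-grouping step in `_execute_batch` above can be sketched standalone; `group_by_timeout` below is a hypothetical helper for illustration, not part of the executor's API:

```python
from typing import Dict, List, Optional


def group_by_timeout(
    timeouts: List[Optional[float]],
    default_timeout: Optional[float] = None,
    fallback: float = 30.0,
) -> Dict[float, List[int]]:
    # Requests with the same effective timeout are batched together so a
    # short call never waits on a long-running batch; None falls back to
    # the backend default, then to a hard fallback.
    groups: Dict[float, List[int]] = {}
    for idx, t in enumerate(timeouts):
        if t is None:
            t = default_timeout
        if t is None:
            t = fallback
        groups.setdefault(float(t), []).append(idx)
    return groups


print(group_by_timeout([5.0, None, 300.0, 5.0], default_timeout=60.0))
# {5.0: [0, 3], 60.0: [1], 300.0: [2]}
```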
atropos/tools/toolset_resolver.py (new file, 88 lines)
@@ -0,0 +1,88 @@
"""
Toolset resolution for Hermes-Agent Atropos integration.

We primarily reuse Hermes-Agent toolsets (`toolsets.py`), but Atropos training/envs
need a few extra sandbox-oriented toolsets that Hermes doesn't expose by default
(e.g. filesystem + stateful terminal).
"""

from __future__ import annotations

from typing import Any, Dict, List, Optional, Set

import toolsets as hermes_toolsets


ATROPOS_TOOLSETS: Dict[str, Dict[str, Any]] = {
    "filesystem": {
        "description": "Read/write files in the sandbox workspace.",
        "tools": ["read_file", "write_file"],
        "includes": [],
    },
    "terminal_stateful": {
        "description": "Stateful terminal execution (tmux/TUI support) inside the sandbox.",
        "tools": ["terminal_stateful", "tmux"],
        "includes": [],
    },
    "sandbox": {
        "description": "Sandbox tools (terminal + filesystem).",
        "tools": [],
        "includes": ["terminal", "filesystem"],
    },
    "default": {
        "description": "Default toolset for Atropos AgentEnv tasks.",
        "tools": [],
        "includes": ["sandbox"],
    },
    "full": {
        "description": "All Hermes tools plus Atropos sandbox additions.",
        "tools": [],
        "includes": ["all", "filesystem", "sandbox", "terminal_stateful"],
    },
}


def validate_toolset(name: str) -> bool:
    if name in {"all", "*"}:
        return True
    return hermes_toolsets.validate_toolset(name) or name in ATROPOS_TOOLSETS


def resolve_toolset(name: str, visited: Optional[Set[str]] = None) -> List[str]:
    if visited is None:
        visited = set()

    if name in {"all", "*"}:
        # Union Hermes + Atropos toolsets.
        all_tools: Set[str] = set()
        for tname in hermes_toolsets.get_toolset_names():
            all_tools.update(resolve_toolset(tname, visited=set()))
        for tname, spec in ATROPOS_TOOLSETS.items():
            # Avoid recursion: some Atropos toolsets (e.g. "full") include "all".
            if tname == "full" or "all" in (spec.get("includes") or []):
                continue
            all_tools.update(resolve_toolset(tname, visited=set()))
        return sorted(all_tools)

    if name in ATROPOS_TOOLSETS:
        if name in visited:
            return []
        visited.add(name)
        spec = ATROPOS_TOOLSETS[name]
        tools: Set[str] = set(spec.get("tools", []))
        for inc in spec.get("includes", []):
            tools.update(resolve_toolset(inc, visited=set(visited)))
        return sorted(tools)

    # Fall back to Hermes toolsets.
    # IMPORTANT: do not pre-add `name` to `visited` here; Hermes' resolver uses
    # `visited` for its own cycle detection and will treat the presence of `name`
    # as a circular dependency.
    return sorted(hermes_toolsets.resolve_toolset(name, visited=set(visited)))


def resolve_multiple_toolsets(names: List[str]) -> List[str]:
    tools: Set[str] = set()
    for name in names:
        tools.update(resolve_toolset(name, visited=set()))
    return sorted(tools)
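The include-expansion logic above can be illustrated with a standalone sketch; the `TOOLSETS` registry and `resolve` helper below are hypothetical stand-ins with no Hermes dependency:

```python
from typing import Dict, List, Optional, Set

# Hypothetical minimal registry mirroring the ATROPOS_TOOLSETS shape above.
TOOLSETS: Dict[str, Dict[str, list]] = {
    "filesystem": {"tools": ["read_file", "write_file"], "includes": []},
    "terminal": {"tools": ["terminal"], "includes": []},
    "sandbox": {"tools": [], "includes": ["terminal", "filesystem"]},
    "default": {"tools": [], "includes": ["sandbox"]},
}


def resolve(name: str, visited: Optional[Set[str]] = None) -> List[str]:
    # Union the toolset's direct tools with everything its includes expand to,
    # guarding against include cycles with `visited`.
    visited = visited or set()
    if name in visited:
        return []
    visited.add(name)
    spec = TOOLSETS[name]
    tools: Set[str] = set(spec["tools"])
    for inc in spec["includes"]:
        tools.update(resolve(inc, set(visited)))
    return sorted(tools)


print(resolve("default"))  # ['read_file', 'terminal', 'write_file']
```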
atropos_compatible_agent.py (new file, 415 lines)
@@ -0,0 +1,415 @@
#!/usr/bin/env python3
"""
Atropos-compatible Hermes agent runner.

This is a minimal subclass of Hermes-Agent's `AIAgent` that swaps the OpenAI
function-calling backend for Atroposlib's `ManagedServer`/`ServerManager` backend
and uses Hermes-style XML tool tags:

- <tool_call>{"name": "...", "arguments": {...}}</tool_call>
- <tool_response>{...}</tool_response>

Tool observations are appended as `role="user"` messages containing one or more
`<tool_response>` blocks so they survive common chat templates during tokenization.
"""

from __future__ import annotations

import asyncio
import json
import os
import re
import time
import warnings
from contextlib import asynccontextmanager
from typing import Any, AsyncGenerator, Dict, List, Optional, Tuple

from model_tools import cleanup_vm, handle_function_call
from run_agent import AIAgent

_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


ATROPOS_TOOL_SYSTEM_PROMPT = """You are a helpful AI assistant with access to tools.

## Available Tools
<tools>
{tool_descriptions}
</tools>

## How to Use Tools
To call a tool, output:
<tool_call>{{"name": "tool_name", "arguments": {{"arg1": "value1"}}}}</tool_call>

You may include optional reasoning in <think>...</think> before tool calls.

After each tool call, you will receive tool results as:
<tool_response>{{...}}</tool_response>

Continue until finished, then provide a final response with no <tool_call> blocks.
"""


class AtroposAIAgent(AIAgent):
    """
    Hermes `AIAgent` variant that uses Atroposlib ServerManager/ManagedServer.

    Notes:
    - The default Hermes `AIAgent` remains unchanged; this class is opt-in.
    - The underlying server must expose `managed_server(tokenizer=...)` OR be a single
      APIServer-compatible object usable by Atroposlib's `ManagedServer`.
    """

    def __init__(
        self,
        *,
        server: Any,
        tokenizer: Any = None,
        model: str = "local",
        max_iterations: int = 10,
        tool_delay: float = 0.0,
        enabled_toolsets: Optional[List[str]] = None,
        disabled_toolsets: Optional[List[str]] = None,
        save_trajectories: bool = False,
        verbose_logging: bool = False,
        quiet_mode: bool = False,
        ephemeral_system_prompt: Optional[str] = None,
        log_prefix_chars: int = 100,
        log_prefix: str = "",
        session_id: Optional[str] = None,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
    ):
        # Call parent init mainly to reuse tool selection + trajectory saving utilities.
        super().__init__(
            base_url="http://unused",
            api_key="dummy-key",
            model=model,
            max_iterations=max_iterations,
            tool_delay=tool_delay,
            enabled_toolsets=enabled_toolsets,
            disabled_toolsets=disabled_toolsets,
            save_trajectories=save_trajectories,
            verbose_logging=verbose_logging,
            quiet_mode=quiet_mode,
            ephemeral_system_prompt=ephemeral_system_prompt,
            log_prefix_chars=log_prefix_chars,
            log_prefix=log_prefix,
            session_id=session_id,
        )

        self.server = server
        self.tokenizer = tokenizer
        self.temperature = temperature
        self.max_tokens = max_tokens

    @asynccontextmanager
    async def _managed(self) -> AsyncGenerator[Any, None]:
        if hasattr(self.server, "managed_server"):
            with warnings.catch_warnings():
                warnings.filterwarnings(
                    "ignore",
                    message=r"Using OpenAIServer with managed_server does not allow for state tracking",
                    category=UserWarning,
                )
                async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
                    yield managed
            return

        # Fall back to directly wrapping a single server object.
        from atroposlib.envs.server_handling.managed_server import ManagedServer

        managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
        try:
            yield managed
        finally:
            managed.reset()

    def _tool_descriptions_text(self) -> str:
        if not self.tools:
            return "(no tools available)"

        parts: List[str] = []
        for tool in self.tools:
            fn = (tool or {}).get("function", {})
            name = fn.get("name", "")
            desc = (fn.get("description") or "").strip()
            if not name:
                continue
            if desc:
                parts.append(f"- {name}: {desc}")
            else:
                parts.append(f"- {name}")
        return "\n".join(parts) if parts else "(no tools available)"

    def _build_system_prompt(self, system_message: Optional[str]) -> Optional[str]:
        tool_prompt = ATROPOS_TOOL_SYSTEM_PROMPT.format(
            tool_descriptions=self._tool_descriptions_text()
        )

        parts: List[str] = []
        if system_message:
            parts.append(system_message)
        if self.ephemeral_system_prompt:
            parts.append(self.ephemeral_system_prompt)
        parts.append(tool_prompt)

        return "\n\n".join(parts)

    def _parse_tool_calls(self, content: str) -> Tuple[List[Tuple[str, Dict[str, Any]]], List[str]]:
        """
        Returns:
            (calls, errors)
        """
        calls: List[Tuple[str, Dict[str, Any]]] = []
        errors: List[str] = []

        for raw in _TOOL_CALL_RE.findall(content or ""):
            try:
                payload = json.loads(raw)
            except json.JSONDecodeError as exc:
                errors.append(f"Invalid JSON inside <tool_call>: {exc}")
                continue

            name = payload.get("name")
            args = payload.get("arguments", {})
            if not isinstance(name, str) or not name:
                errors.append("Tool call missing 'name' string")
                continue
            if not isinstance(args, dict):
                errors.append("Tool call 'arguments' must be an object")
                continue

            calls.append((name, args))

        return calls, errors

    async def run_conversation_async(
        self,
        user_message: str,
        system_message: Optional[str] = None,
        conversation_history: Optional[List[Dict[str, Any]]] = None,
        task_id: Optional[str] = None,
    ) -> Dict[str, Any]:
        import uuid

        effective_task_id = task_id or str(uuid.uuid4())

        messages: List[Dict[str, Any]] = conversation_history.copy() if conversation_history else []
        messages.append({"role": "user", "content": user_message})

        active_system_prompt = self._build_system_prompt(system_message)

        api_call_count = 0
        final_response: Optional[str] = None
        managed_state: Optional[Dict[str, Any]] = None
        completed = False

        try:
            async with self._managed() as managed:
                while api_call_count < self.max_iterations:
                    api_call_count += 1

                    api_messages = messages.copy()
                    if active_system_prompt:
                        api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages

                    chat_kwargs: Dict[str, Any] = {"messages": api_messages, "n": 1}
                    if self.max_tokens is not None:
                        chat_kwargs["max_tokens"] = self.max_tokens
                    if self.temperature is not None:
                        chat_kwargs["temperature"] = self.temperature

                    # Prefer OpenAI tool calling when supported by the backend:
                    # - Many providers normalize Hermes-style <tool_call> tags into tool_calls when `tools` is provided.
                    # - ManagedServer (atroposlib) does prompt->completion conversion and does not support `tools`.
                    #   Only pass `tools` when we're calling an OpenAI-compatible chat endpoint directly.
                    tool_schemas = self.tools if self.tools else None
                    managed_cls = type(managed).__name__
                    if tool_schemas and managed_cls != "ManagedServer":
                        chat_kwargs["tools"] = tool_schemas

                    if os.getenv("HERMES_DEBUG_ATROPOS_REQUEST") == "1":
                        meta = {
                            "managed_type": managed_cls,
                            "model": getattr(getattr(managed, "config", None), "model_name", self.model),
                            "base_url": getattr(getattr(managed, "config", None), "base_url", None),
                            "kwargs": chat_kwargs,
                        }
                        # Avoid dumping megabytes of data accidentally.
                        # (Messages can be large; this is still "full" but bounded.)
                        print("\n=== HERMES_DEBUG_ATROPOS_REQUEST ===", flush=True)
                        print(json.dumps(meta, ensure_ascii=False, indent=2)[:200_000], flush=True)

                    response = await managed.chat_completion(**chat_kwargs)

                    if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
                        try:
                            dumped = response.model_dump()  # openai pydantic model
                        except Exception:
                            dumped = getattr(response, "__dict__", {"repr": repr(response)})
                        print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: ChatCompletion (raw) ===", flush=True)
                        print(json.dumps(dumped, ensure_ascii=False, indent=2), flush=True)

                    if hasattr(managed, "get_state"):
                        managed_state = managed.get_state()

                    msg = response.choices[0].message
                    assistant_content = (msg.content or "")
                    msg_reasoning = getattr(msg, "reasoning", None)

                    # Use tool_calls if the backend provides them (preferred).
                    structured_tool_calls = getattr(msg, "tool_calls", None)

                    # If the backend emits content="" but includes useful text in reasoning,
                    # use it for parsing *only if needed* (e.g. tool tags).
                    if assistant_content == "" and isinstance(msg_reasoning, str) and msg_reasoning:
                        if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
                            print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: message.reasoning present (content empty) ===", flush=True)
                            print(msg_reasoning, flush=True)

                    assistant_msg: Dict[str, Any] = {"role": "assistant", "content": assistant_content}
                    if structured_tool_calls:
                        # Preserve tool_calls so the next request is consistent with OpenAI protocol.
                        try:
                            assistant_msg["tool_calls"] = [
                                {
                                    "id": tc.id,
                                    "type": tc.type,
                                    "function": {"name": tc.function.name, "arguments": tc.function.arguments},
                                }
                                for tc in structured_tool_calls
                            ]
                        except Exception:
                            # Best-effort; keep conversation moving.
                            pass
                    messages.append(assistant_msg)

                    # Mode A: OpenAI tool calling (preferred when supported)
                    if structured_tool_calls:
                        for tc in structured_tool_calls:
                            tool_start = time.time()
                            try:
                                tool_args = json.loads(tc.function.arguments or "{}")
                            except Exception:
                                tool_args = {}
                            tool_result = handle_function_call(tc.function.name, tool_args, effective_task_id)
                            tool_duration = time.time() - tool_start

                            # Keep the raw tool result as tool content (OpenAI protocol expects role=tool).
                            messages.append(
                                {
                                    "role": "tool",
                                    "tool_call_id": tc.id,
                                    "content": tool_result,
                                }
                            )

                            if self.tool_delay and self.tool_delay > 0:
                                await asyncio.sleep(self.tool_delay)

                        # Continue loop after tool execution.
                        continue

                    # Mode B: Hermes XML tool tags in assistant text (fallback).
                    parse_source = assistant_content or (msg_reasoning or "")
                    tool_calls, parse_errors = self._parse_tool_calls(parse_source)

                    if parse_errors and not tool_calls:
                        # Ask the model to retry with valid tool JSON.
                        err_text = "; ".join(parse_errors[:3])
                        messages.append(
                            {
                                "role": "user",
                                "content": (
                                    f"<tool_response>{json.dumps({'error': err_text}, ensure_ascii=False)}</tool_response>\n"
                                    "The previous <tool_call> blocks were invalid. Please output valid JSON inside <tool_call>."
                                ),
                            }
                        )
                        continue

                    if not tool_calls:
                        # No tool calls: treat as final answer.
                        final_response = (assistant_content or "").strip()
                        completed = True
                        break

                    tool_responses: List[str] = []
                    for tool_name, tool_args in tool_calls:
                        tool_start = time.time()
                        tool_result = handle_function_call(tool_name, tool_args, effective_task_id)
                        tool_duration = time.time() - tool_start

                        try:
                            parsed = json.loads(tool_result)
                            payload: Any = parsed
                        except Exception:
                            payload = tool_result

                        tool_payload = {
                            "name": tool_name,
                            "duration_s": round(tool_duration, 3),
                            "result": payload,
                        }
                        tool_responses.append(
                            f"<tool_response>{json.dumps(tool_payload, ensure_ascii=False)}</tool_response>"
                        )

                        if self.tool_delay and self.tool_delay > 0:
                            await asyncio.sleep(self.tool_delay)

                    messages.append({"role": "user", "content": "\n".join(tool_responses)})

            if final_response is None:
                final_response = "I've reached the maximum number of iterations."

        finally:
            try:
                cleanup_vm(effective_task_id)
            except Exception:
                pass

        # Save trajectory using Hermes formatting (optional).
        self._save_trajectory(messages, user_message, completed=completed)

        return {
            "final_response": final_response,
            "messages": messages,
            "api_calls": api_call_count,
            "completed": completed,
            "managed_state": managed_state,
            "system_prompt": active_system_prompt,
            "task_id": effective_task_id,
        }

    def run_conversation(self, *args: Any, **kwargs: Any) -> Dict[str, Any]:
        """
        Sync wrapper for convenience.

        If called from within a running event loop (e.g. prompt_toolkit), this
        runs the async conversation in a dedicated thread to avoid nested loops.
        """
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return asyncio.run(self.run_conversation_async(*args, **kwargs))

        import queue
        import threading

        out: "queue.Queue[object]" = queue.Queue(maxsize=1)

        def runner() -> None:
            try:
                out.put(asyncio.run(self.run_conversation_async(*args, **kwargs)))
            except BaseException as exc:  # noqa: BLE001
                out.put(exc)

        thread = threading.Thread(target=runner, daemon=True)
        thread.start()

        result = out.get()
        if isinstance(result, BaseException):
            raise result
        return result  # type: ignore[return-value]
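The Hermes-style `<tool_call>` extraction described in the module docstring can be exercised standalone; the sample assistant text below is illustrative only:

```python
import json
import re

# Same tag-matching pattern as used for Hermes XML tool calls:
# non-greedy body, DOTALL so the JSON may span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

text = (
    "<think>need to list files</think>\n"
    '<tool_call>{"name": "terminal", "arguments": {"command": "ls"}}</tool_call>'
)
calls = [json.loads(raw) for raw in TOOL_CALL_RE.findall(text)]
print(calls[0]["name"])  # terminal
```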
configs/endless_terminals.yaml (new file, 83 lines)
@@ -0,0 +1,83 @@
# Endless Terminals Environment Configuration
#
# Two modes:
#   1. Dataset mode (default): Load pre-generated tasks from HuggingFace
#   2. Procedural mode: Generate tasks on-demand via LLM
#
# Usage:
#   python -m atropos.envs.endless_terminals_env process \
#     --config configs/endless_terminals.yaml

# Environment settings
env:
  # Dataset mode (primary - recommended)
  use_dataset: true  # Load from HuggingFace (fast, no vLLM needed)
  dataset_name: "obiwan96/endless-terminals-train"
  dataset_split: "train"
  dataset_cache_dir: "~/.cache/huggingface/datasets"
  tasks_base_dir: ""  # Set to dir containing task_* folders if not using default paths
                      # Example: "/path/to/endless-terminals-train"

  # Task generation (fallback if use_dataset=false)
  task_gen_model: "Qwen/Qwen3-32B"  # Only needed if use_dataset=false
  task_gen_temperature: 1.0
  task_gen_max_tokens: 2048

  # Container settings
  base_container_image: "ubuntu:22.04"
  container_timeout_s: 180
  test_timeout_s: 60

  # Workspace
  workspace_dir: "/tmp/endless_terminals_workspace"
  keep_failed_tasks: false  # Set true to debug failed tasks

  # Agent config (increased for long traces)
  agent_max_steps: 32
  agent_temperature: 0.7
  agent_max_tokens: null  # Let backend decide

  # Tooling: terminal only
  enabled_toolsets: ["terminal"]
  disabled_toolsets: []

  # Training settings
  group_size: 4  # Parallel trajectory collection
  batch_size: 32
  total_steps: 1000  # Total training episodes
  use_wandb: false  # Enable for experiment tracking
  include_messages: true

  # Tool execution backend (nomad or modal)
  tool_pool_mode: "nomad"

  # Nomad settings (if using nomad)
  nomad_address: "http://localhost:4646"
  sandbox_job_id: "atropos-sandbox-endless"
  sandbox_image: "atropos-sandbox:local"
  slots_per_container: 10
  min_containers: 1
  max_containers: 10
  privileged: false
  acquire_timeout_s: 30.0
  purge_job_on_start: true
  purge_job_on_shutdown: true

  # Modal settings (if using modal instead)
  # modal_app_name: "atropos-endless"
  # modal_image: "python:3.11"
  # modal_slots_per_sandbox: 10
  # modal_min_sandboxes: 1
  # modal_max_sandboxes: 5

  # Server config
  server_base_url: "http://127.0.0.1:8080"
  server_model: "hermes-4-36b"
  tokenizer_name: "NousResearch/Hermes-4.3-36B"

# Server configs are auto-generated from env vars and env.server_* settings
# Override via environment variables:
#   ATROPOS_SERVER_BASE_URL
#   ATROPOS_SERVER_MODEL
#   ATROPOS_SERVER_API_KEY
#   ATROPOS_TOKENIZER_NAME
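The env-var override precedence noted at the end of the config can be sketched as follows; `server_setting` is a hypothetical helper for illustration, not part of the codebase:

```python
import os


def server_setting(env_var: str, file_value: str) -> str:
    # An ATROPOS_* environment variable, if set, wins over the YAML file value.
    return os.environ.get(env_var, file_value)


base_url = server_setting("ATROPOS_SERVER_BASE_URL", "http://127.0.0.1:8080")
print(base_url)
```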
docs/MODAL_BACKEND.md (new file, 224 lines)
@@ -0,0 +1,224 @@
|
|||||||
|
# Modal Backend
|
||||||
|
|
||||||
|
Hermes Agent uses [Modal](https://modal.com) for scalable, isolated cloud execution environments. There are two Modal integrations:
|
||||||
|
|
||||||
|
1. **Terminal Tool** (`tools/terminal_tool.py`) - For CLI/agent command execution
|
||||||
|
2. **Atropos Backend** (`atropos/backends/modal_backend.py`) - For batch RL training workloads
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Terminal Tool (CLI/Agent)
|
||||||
|
|
||||||
|
The terminal tool provides a simple interface for executing commands in Modal sandboxes.
|
||||||
|
|
||||||
|
### Configuration
|
||||||
|
|
||||||
|
Set environment variables:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export TERMINAL_ENV=modal
|
||||||
|
export TERMINAL_MODAL_IMAGE=python:3.11
|
||||||
|
export TERMINAL_MODAL_APP_NAME=hermes-sandbox
|
||||||
|
```
|
||||||
|
|
||||||
|
Or use a YAML config file (`modal_profiles.yaml`):
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
profiles:
|
||||||
|
default:
|
||||||
|
image: python:3.11
|
||||||
|
cpu: 1.0
|
||||||
|
memory: 2048
|
||||||
|
min_pool: 1
|
||||||
|
max_pool: 5
|
||||||
|
idle_timeout: 120
|
||||||
|
|
||||||
|
gpu:
|
||||||
|
image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
|
||||||
|
gpu: T4
|
||||||
|
memory: 16384
|
||||||
|
min_pool: 0
|
||||||
|
max_pool: 2
|
||||||
|
```
|
||||||
|
|
||||||
|
### Features
|
||||||
|
|
||||||
|
| Feature | Description |
|
||||||
|
|---------|-------------|
|
||||||
|
| **Sandbox Pool** | Pre-warmed sandboxes for low latency |
|
||||||
|
| **Auto-scaling** | Grows/shrinks pool based on demand |
|
||||||
|
| **Idle Timeout** | Sandboxes auto-terminate when unused |
|
||||||
|
| **Profile Selection** | Different configs for different workloads |
|
||||||
|
| **Credential Injection** | `modal.Secret` integration |
|
||||||
|
|
||||||
|
### Usage
|
||||||
|
|
||||||
|
```python
|
||||||
|
from tools.terminal_tool import terminal_tool
|
||||||
|
|
||||||
|
# Simple command
|
||||||
|
output = terminal_tool("echo hello", task_id="my-task")
|
||||||
|
|
||||||
|
# With profile selection
|
||||||
|
output = terminal_tool("python train.py", task_id="training", profile="gpu")
|
||||||
|
|
||||||
|
# Cleanup when done
|
||||||
|
from tools.terminal_tool import cleanup_vm
|
||||||
|
cleanup_vm("my-task")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
_ModalPoolManager (singleton)
|
||||||
|
├── "default" pool → [sandbox-0, sandbox-1, ...]
|
||||||
|
└── "gpu" pool → [sandbox-0, ...]
|
||||||
|
|
||||||
|
Each pool:
|
||||||
|
- Maintains min_pool warm sandboxes
|
||||||
|
- Scales up to max_pool on demand
|
||||||
|
- Background thread scales down idle sandboxes
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Atropos Backend (RL Training)

The Atropos backend is designed for high-throughput batch execution during reinforcement learning training.

### Key Concept: Slot-based Multiplexing

Instead of one sandbox per trajectory, multiple trajectories share sandboxes via **slots**:

```
Sandbox (1 container)
├── Slot 0 → Trajectory A (workspace: /data/slot_0)
├── Slot 1 → Trajectory B (workspace: /data/slot_1)
└── Slot 2 → Trajectory C (workspace: /data/slot_2)
```

**Benefits**:
- Fewer containers = lower cost
- Shared warm-up time
- Better GPU utilization

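The trajectory-to-slot mapping can be illustrated with a tiny allocator that hands each trajectory a free slot and its isolated workspace directory. `SlotAllocator` is a hypothetical sketch of the idea, not the actual backend code.

```python
class SlotAllocator:
    """Minimal illustration of slot-based multiplexing: each sandbox
    exposes N slots, and each slot owns an isolated workspace dir."""

    def __init__(self, slots_per_sandbox: int):
        self.slots_per_sandbox = slots_per_sandbox
        self.assignments: dict[str, int] = {}  # trajectory id -> global slot index

    def acquire(self, traj_id: str) -> tuple[int, int, str]:
        """Return (sandbox index, slot index, workspace path) for a trajectory."""
        if traj_id not in self.assignments:
            # Reuse the lowest free global slot index.
            used = set(self.assignments.values())
            idx = next(i for i in range(len(used) + 1) if i not in used)
            self.assignments[traj_id] = idx
        idx = self.assignments[traj_id]
        sandbox, slot = divmod(idx, self.slots_per_sandbox)
        return sandbox, slot, f"/data/slot_{slot}"

    def release(self, traj_id: str) -> None:
        self.assignments.pop(traj_id, None)
```

With 3 slots per sandbox, the fourth concurrent trajectory spills into a second sandbox, and releasing a trajectory frees its slot for reuse.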
### Configuration

```python
from atropos.backends.modal_backend import ModalSandboxConfig, ModalToolBackend

config = ModalSandboxConfig(
    name="default",
    image="python:3.11",
    cpu=1.0,
    memory=2048,
    slots_per_sandbox=10,  # 10 trajectories per container
    min_sandboxes=1,
    max_sandboxes=5,
)

backend = ModalToolBackend(config.with_app_name("my-training"))
```

### Multi-Profile Support

Different trajectory types can request different resources:

```python
backend = ModalToolBackend.with_profiles(
    app_name="rl-training",
    profiles={
        "default": ModalSandboxConfig(
            name="default",
            cpu=1.0,
            memory=2048,
        ),
        "pytorch-gpu": ModalSandboxConfig(
            name="pytorch-gpu",
            image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
            gpu="T4",
            memory=16384,
        ),
    }
)

# CPU task
slot1 = await backend.acquire("traj-1", profile="default")

# GPU task
slot2 = await backend.acquire("traj-2", profile="pytorch-gpu")
```

### Batched Execution

The key optimization is executing many commands in parallel:

```python
# Acquire slots for multiple trajectories
slots = [await backend.acquire(f"traj-{i}") for i in range(50)]

# Execute batch across all slots in parallel
results = await backend.execute_batch([
    (slot, "bash", {"command": "python step.py"})
    for slot in slots
])

# Release slots
for slot in slots:
    await backend.release(slot)
```

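A batch call like this is essentially a concurrent fan-out over slots. A minimal sketch of that pattern with `asyncio.gather` follows; `run_in_slot` is a stand-in coroutine, not the real `execute_batch` implementation.

```python
import asyncio


async def run_in_slot(slot: int, command: str) -> str:
    """Stand-in for a per-slot execution call (the real backend would
    forward the command to the slot's sandbox)."""
    await asyncio.sleep(0)  # simulate I/O
    return f"slot {slot}: ran {command!r}"


async def execute_batch_sketch(calls: list[tuple[int, str]]) -> list[str]:
    # Fan all calls out concurrently; gather preserves the input order.
    return list(await asyncio.gather(
        *(run_in_slot(slot, cmd) for slot, cmd in calls)
    ))


results = asyncio.run(execute_batch_sketch([(i, "python step.py") for i in range(3)]))
```

Because `asyncio.gather` preserves ordering, result `i` always corresponds to slot `i` even though the commands run concurrently.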
### Architecture

```
ModalToolBackend
└── _ModalMultiProfileManager
    ├── "default" → _ModalSandboxPool
    │   ├── Sandbox 0 (slots 0-9)
    │   └── Sandbox 1 (slots 0-9)
    │
    └── "pytorch-gpu" → _ModalSandboxPool
        └── Sandbox 0 (slots 0-9)
```

---

## Credentials

Inject secrets securely using Modal's secret management:

```bash
# Create secret in Modal dashboard or CLI
modal secret create my-api-key API_KEY=sk-xxx
```

```python
# Reference in config
config = ModalSandboxConfig(
    secrets=["my-api-key"],   # Modal secret names
    env_vars={"DEBUG": "1"},  # Additional env vars
)
```

## Troubleshooting

### "Modal package not installed"
```bash
pip install modal
modal token new  # Authenticate
```

### "Sandbox creation failed"
- Check the Modal dashboard for quota limits
- Verify the image exists and is accessible
- Check that secret names are correct

### Shutdown errors
These are harmless warnings during Python interpreter shutdown:
```
[Modal] Error terminating ...: cannot schedule new futures after interpreter shutdown
```

The sandboxes will auto-terminate via Modal's idle_timeout anyway.

`hermes` (34 lines changed, hunk `@@ -7,6 +7,40 @@ Usage: ./hermes [options]`):

```python
"""
Usage: ./hermes [options]
"""

if __name__ == "__main__":
    """
    Fire (google/python-fire) does not support POSIX-style short flags like `-p`.
    We translate the most common shorthands to their long equivalents so wrapper
    scripts can reliably use:
    - `-p "..."` -> `--prompt "..."` (no TUI/banner; print result and exit)
    - `-q "..."` -> `--query "..."` (single-shot with banner UX)
    """

    import sys

    def _rewrite_short_flags(argv: list[str]) -> list[str]:
        rewritten: list[str] = []
        i = 0
        while i < len(argv):
            arg = argv[i]
            if arg == "-p":
                rewritten.append("--prompt")
                if i + 1 < len(argv):
                    rewritten.append(argv[i + 1])
                i += 2
                continue
            if arg == "-q":
                rewritten.append("--query")
                if i + 1 < len(argv):
                    rewritten.append(argv[i + 1])
                i += 2
                continue
            rewritten.append(arg)
            i += 1
        return rewritten

    sys.argv = [sys.argv[0]] + _rewrite_short_flags(sys.argv[1:])

    from cli import main
    import fire

    fire.Fire(main)
```
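The flag translation is a pure argv transformation, so it is easy to check in isolation. Below is a self-contained, condensed restatement of the same logic (the two branches merged into one) with a few checks:

```python
def rewrite_short_flags(argv: list[str]) -> list[str]:
    """Translate -p/-q shorthands into the --prompt/--query long flags
    that python-fire understands (mirrors the hermes wrapper logic)."""
    rewritten: list[str] = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in ("-p", "-q"):
            rewritten.append("--prompt" if arg == "-p" else "--query")
            if i + 1 < len(argv):
                rewritten.append(argv[i + 1])
            i += 2
            continue
        rewritten.append(arg)
        i += 1
    return rewritten


assert rewrite_short_flags(["-p", "hello"]) == ["--prompt", "hello"]
assert rewrite_short_flags(["-q", "hi", "--model", "m"]) == ["--query", "hi", "--model", "m"]
assert rewrite_short_flags(["--prompt", "x"]) == ["--prompt", "x"]  # long flags pass through
```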
`hermes_agent.egg-info/PKG-INFO` (new file, 659 lines):

```
Metadata-Version: 2.4
Name: hermes-agent
Version: 0.1.0
Summary: AI agent with advanced tool-calling and toolsets
Author: Nous Research
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: openai
Requires-Dist: python-dotenv
Requires-Dist: fire
Requires-Dist: httpx
Requires-Dist: rich
Requires-Dist: tenacity
Requires-Dist: pyyaml
Requires-Dist: prompt_toolkit
Requires-Dist: requests
Requires-Dist: jinja2
Requires-Dist: pydantic>=2.0
Requires-Dist: firecrawl-py
Requires-Dist: fal-client
Requires-Dist: litellm>=1.75.5
Requires-Dist: typer
Requires-Dist: platformdirs
Provides-Extra: modal
Requires-Dist: modal; extra == "modal"
Requires-Dist: boto3; extra == "modal"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Provides-Extra: atropos
Requires-Dist: atroposlib @ git+https://github.com/NousResearch/atropos.git ; extra == "atropos"
Requires-Dist: aiohttp; extra == "atropos"
Requires-Dist: fastapi; extra == "atropos"
Requires-Dist: uvicorn; extra == "atropos"
Requires-Dist: pyte; extra == "atropos"
```

# Hermes Agent

An AI agent with advanced tool-calling capabilities, featuring a flexible toolsets system for organizing and managing tools.

## Features

- **Interactive CLI**: Beautiful terminal interface with animated feedback, personalities, and session management
- **Web Tools**: Search, extract content, and crawl websites
- **Terminal Tools**: Execute commands via local, Docker, Singularity, Modal, or SSH backends
- **Browser Tools**: Automate web browsers to navigate, click, type, and extract content
- **Vision Tools**: Analyze images from URLs
- **Reasoning Tools**: Advanced multi-model reasoning (Mixture of Agents)
- **Creative Tools**: Generate images from text prompts
- **Skills Tools**: On-demand knowledge documents with progressive disclosure
- **Toolsets System**: Organize tools into logical groups for different scenarios
- **Batch Processing**: Process datasets in parallel with checkpointing and statistics tracking
- **Ephemeral System Prompts**: Guide model behavior without polluting training datasets

## Quick Start (CLI)

```bash
# After setup (see below), just run:
./hermes

# Or with options:
./hermes --model "anthropic/claude-sonnet-4" --toolsets "web,terminal"
```

The CLI provides:
- Animated spinners during thinking and tool execution
- Kawaii-style feedback messages
- `/commands` for configuration, history, and session management
- Customizable personalities (`/personality kawaii`, `/personality pirate`, etc.)
- Persistent configuration via `cli-config.yaml`

## Setup

### 1. Clone the Repository
```bash
# Clone with submodules (recommended)
git clone --recurse-submodules https://github.com/NousResearch/Hermes-Agent.git
cd Hermes-Agent

# Or if already cloned without submodules:
git submodule update --init --recursive
```

### 2. Install Dependencies
```bash
# Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Python packages
pip install -r requirements.txt

# Install mini-swe-agent for terminal tools
pip install -e ./mini-swe-agent

# Install Node.js dependencies for browser tools (requires Node.js)
npm install
```

### 3. Configure Environment Variables
```bash
# Copy the example environment file
cp .env.example .env

# Edit .env and add your API keys
nano .env  # or use your preferred editor
```

**Required API Keys:**
- `OPENROUTER_API_KEY` - LLM access via OpenRouter (get at: https://openrouter.ai/keys)
- `FIRECRAWL_API_KEY` - Web tools (get at: https://firecrawl.dev/)
- `NOUS_API_KEY` - Vision & reasoning tools (get at: https://inference-api.nousresearch.com/)
- `FAL_KEY` - Image generation (get at: https://fal.ai/)

**Optional API Keys (for specific features):**
- `BROWSERBASE_API_KEY` - Browser automation (get at: https://browserbase.com/)
- `BROWSERBASE_PROJECT_ID` - From Browserbase dashboard
- `MORPH_API_KEY` - For legacy Hecate terminal backend (get at: https://morph.so/)

### 4. Configure Terminal Backend

The terminal tool uses **mini-swe-agent** environments. Configure in `.env` or `cli-config.yaml`:

```bash
# Backend: "local", "docker", "singularity", "modal", or "ssh"
TERMINAL_ENV=local        # Default: runs on host machine (no isolation)
TERMINAL_ENV=ssh          # Remote execution via SSH (agent code stays local)
TERMINAL_ENV=singularity  # Recommended for HPC: Apptainer/Singularity containers
TERMINAL_ENV=docker       # Isolated Docker containers
TERMINAL_ENV=modal        # Cloud execution via Modal

# Container image (for docker/singularity/modal backends)
TERMINAL_DOCKER_IMAGE=python:3.11-slim
TERMINAL_SINGULARITY_IMAGE=docker://python:3.11-slim
TERMINAL_TIMEOUT=60

# SSH backend (for ssh)
TERMINAL_SSH_HOST=my-server.example.com
TERMINAL_SSH_USER=myuser
TERMINAL_SSH_KEY=~/.ssh/id_rsa  # Optional, uses ssh-agent if not set
```

**Backend Requirements:**
- **local**: No extra setup (runs directly on your machine, no isolation)
- **ssh**: SSH access to a remote machine (great for sandboxing - the agent can't touch its own code)
- **singularity**: Requires Apptainer or Singularity installed (common on HPC clusters, no root needed)
- **docker**: Requires Docker installed and user in the `docker` group
- **modal**: Requires a Modal account (see setup below)

### Singularity/Apptainer Setup (Recommended for HPC)

Singularity/Apptainer provides rootless container execution, ideal for HPC clusters:

```bash
# 1. Verify Apptainer is installed
apptainer --version  # or: singularity --version

# 2. Set up cache directories (important for parallel workers)
# Use /scratch if available (HPC), otherwise /tmp
export APPTAINER_CACHEDIR=/scratch/$USER/.apptainer
export APPTAINER_TMPDIR=/scratch/$USER/.apptainer/tmp
mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"

# 3. Pre-build SIF image (recommended for parallel batch processing)
# This avoids race conditions when multiple workers start simultaneously
apptainer build $APPTAINER_CACHEDIR/python-nodejs.sif docker://nikolaik/python-nodejs:python3.11-nodejs20

# 4. Configure .env to use the local SIF
TERMINAL_ENV=singularity
TERMINAL_SINGULARITY_IMAGE=/scratch/$USER/.apptainer/python-nodejs.sif
```

**Tip:** The batch scripts in `configs/` automatically handle SIF pre-building if `/scratch` is available.

### Modal Cloud Backend Setup

[Modal](https://modal.com) provides serverless cloud compute for running sandboxed environments at scale.

```bash
# 1. Install Modal and dependencies
pip install modal boto3

# 2. Authenticate with Modal (opens browser)
modal setup

# 3. Set terminal backend to modal in .env
TERMINAL_ENV=modal
```

Modal uses CLI-based authentication (stored in `~/.modal/`), so no API key is needed in `.env`. After running `modal setup`, commands will automatically execute in Modal's cloud sandboxes.

### Browser Tools Setup

Browser tools enable the agent to navigate websites, fill forms, click buttons, and extract content. They use the [agent-browser](https://github.com/vercel-labs/agent-browser) CLI with [Browserbase](https://browserbase.com) cloud execution.

```bash
# 1. Install Node.js (if not already installed)
# Use nvm (recommended) or your package manager

# 2. Install agent-browser CLI (choose one option):
npm install -g agent-browser  # Option A: Global install (recommended)
npm install                   # Option B: Local install (uses npx fallback)

# 3. Get Browserbase credentials
# Sign up at https://browserbase.com/ and get your:
# - API Key (from Settings → API Keys)
# - Project ID (from your project dashboard)

# 4. Add to your .env file:
BROWSERBASE_API_KEY=your_api_key_here
BROWSERBASE_PROJECT_ID=your_project_id_here
```

**Available Browser Tools:**

| Tool | Description |
|------|-------------|
| `browser_navigate` | Navigate to a URL |
| `browser_snapshot` | Get text-based page snapshot with element refs |
| `browser_click` | Click an element by ref (e.g., `@e5`) |
| `browser_type` | Type text into an input field |
| `browser_scroll` | Scroll up or down |
| `browser_back` | Go back in browser history |
| `browser_press` | Press a keyboard key (Enter, Tab, etc.) |
| `browser_close` | Close the browser session |
| `browser_get_images` | Get list of images on the page |

**Example Usage:**
```bash
# Use browser tools with web search and vision
python run_agent.py \
  --query "Go to amazon.com and find the price of the latest Kindle" \
  --enabled_toolsets=browser,web,vision

# Use browser-focused distribution
python batch_runner.py \
  --dataset_file=browser_tasks.jsonl \
  --distribution=browser_use \
  --run_name=browser_run
```

See `.env.example` for all available configuration options including debug settings.

### Skills Tools

Skills are on-demand knowledge documents the agent can load when needed. They follow a **progressive disclosure** pattern to minimize token usage:

```
skills/
├── mlops/                  # Category folder
│   ├── axolotl/            # Skill folder
│   │   ├── SKILL.md        # Main instructions (required)
│   │   ├── references/     # Additional docs, API specs
│   │   └── templates/      # Output formats, configs
│   └── vllm/
│       └── SKILL.md
```

**Available Skills Tools:**

| Tool | Description |
|------|-------------|
| `skills_categories` | List available skill categories (~50 tokens) |
| `skills_list` | List skills with name + description (~3k tokens for 40 skills) |
| `skill_view` | Load full skill content, tags, and linked files |

**Example Usage:**
```bash
# Use skills tools
python run_agent.py \
  --query "What skills do you have for fine-tuning? Show me the axolotl skill." \
  --enabled_toolsets=skills
```

**Creating Skills:**

Skills use YAML frontmatter for metadata:
```yaml
---
name: my-skill
description: Brief description shown in skills_list
tags: [tag1, tag2]
related_skills: [other-skill]
version: 1.0.0
---
# Skill Content

Instructions, examples, and guidelines here...
```

Skills can include:
- `references/` - Additional documentation, API specs, examples
- `templates/` - Output formats, config files, boilerplate code
- `scripts/` - Executable helpers (Python, shell scripts)

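The frontmatter format above is plain YAML delimited by `---` markers. A minimal parser sketch of how a `SKILL.md` file splits into metadata and body; this treats frontmatter as flat `key: value` lines for illustration, whereas the real skills loader presumably uses a full YAML parser (pyyaml is a project dependency):

```python
def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md file into (frontmatter fields, markdown body).

    Illustrative sketch only: flat `key: value` frontmatter, no nesting.
    """
    if not text.startswith("---"):
        return {}, text
    # Frontmatter is everything between the first two '---' markers.
    _, frontmatter, body = text.split("---", 2)
    meta: dict[str, str] = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")


meta, body = parse_skill(
    "---\nname: my-skill\ndescription: Brief description\n---\n# Skill Content\n"
)
```

This is the mechanism that lets `skills_list` surface only names and descriptions while `skill_view` returns the full body.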
## Session Logging

Every conversation is automatically logged to `logs/` for debugging and inspection:

```
logs/
├── session_20260201_143052_a1b2c3.json
├── session_20260201_150217_d4e5f6.json
└── ...
```

**Log Format:**
```json
{
  "session_id": "20260201_143052_a1b2c3",
  "model": "anthropic/claude-sonnet-4",
  "session_start": "2026-02-01T14:30:52.123456",
  "last_updated": "2026-02-01T14:35:12.789012",
  "message_count": 8,
  "conversations": [
    {"from": "system", "value": "..."},
    {"from": "human", "value": "..."},
    {"from": "gpt", "value": "..."},
    {"from": "tool", "value": "..."}
  ]
}
```

- **Automatic**: Logs are created and updated automatically after each conversation turn
- **Session ID in Banner**: The CLI displays the session ID in the welcome banner
- **Trajectory Format**: Uses the same format as batch processing for consistency
- **Git Ignored**: `logs/` is in `.gitignore` so logs aren't committed

## Interactive CLI

The CLI provides a rich interactive experience for working with the agent.

### Running the CLI

```bash
# Basic usage
./hermes

# With specific model
./hermes --model "anthropic/claude-sonnet-4"

# With specific toolsets
./hermes --toolsets "web,terminal,skills"
```

### CLI Commands

| Command | Description |
|---------|-------------|
| `/help` | Show available commands |
| `/tools` | List available tools by toolset |
| `/toolsets` | List available toolsets |
| `/model [name]` | Show or change the current model |
| `/prompt [text]` | View/set custom system prompt |
| `/personality [name]` | Set a predefined personality |
| `/clear` | Clear screen and reset conversation |
| `/reset` | Reset conversation only |
| `/history` | Show conversation history |
| `/save` | Save current conversation to file |
| `/config` | Show current configuration |
| `/quit` | Exit the CLI |

### Configuration

Copy `cli-config.yaml.example` to `cli-config.yaml` and customize:

```yaml
# Model settings
model:
  default: "anthropic/claude-sonnet-4"

# Terminal backend (local, docker, singularity, modal, or ssh)
terminal:
  env_type: "local"
  cwd: "."  # Use current directory

# Or use SSH for remote execution (keeps agent code isolated)
# terminal:
#   env_type: "ssh"
#   ssh_host: "my-server.example.com"
#   ssh_user: "myuser"
#   ssh_key: "~/.ssh/id_rsa"
#   cwd: "/home/myuser/project"

# Enable specific toolsets
toolsets:
  - all  # or: web, terminal, browser, vision, etc.

# Custom personalities (use with /personality command)
agent:
  personalities:
    helpful: "You are a helpful assistant."
    kawaii: "You are a kawaii assistant! Use cute expressions..."
```

### Personalities

Built-in personalities available via `/personality`:
- `helpful`, `concise`, `technical`, `creative`, `teacher`
- `kawaii`, `catgirl`, `pirate`, `shakespeare`, `surfer`
- `noir`, `uwu`, `philosopher`, `hype`

## Toolsets System

The agent uses a toolsets system for organizing and managing tools. All tools must be part of a toolset to be accessible - individual tool selection is not supported. This ensures consistent and logical grouping of capabilities.

### Key Concepts

- **Toolsets**: Logical groups of tools for specific use cases (e.g., "research", "development", "debugging")
- **Composition**: Toolsets can include other toolsets for powerful combinations
- **Custom Toolsets**: Create your own toolsets at runtime or by editing `toolsets.py`
- **Toolset-Only Access**: Tools are only accessible through toolsets, not individually

### Available Toolsets

See `toolsets.py` for the complete list of predefined toolsets including:
- Basic toolsets (web, terminal, vision, creative, reasoning)
- Composite toolsets (research, development, analysis, etc.)
- Scenario-specific toolsets (debugging, documentation, API testing, etc.)
- Special toolsets (safe mode without terminal, minimal, offline)

### Using Toolsets

```bash
# Use a predefined toolset
python run_agent.py --enabled_toolsets=research --query "Find latest AI papers"

# Combine multiple toolsets
python run_agent.py --enabled_toolsets=web,vision --query "Analyze this website"

# Enable all toolsets explicitly (same as omitting the flag)
python run_agent.py --enabled_toolsets=all --query "Do web research and run commands if helpful"

# Safe mode (no terminal access)
python run_agent.py --enabled_toolsets=safe --query "Help without running commands"

# List all available toolsets and tools
python run_agent.py --list_tools
```

See `toolsets.py` for how to create custom toolsets.

## Basic Usage

### Default (all tools enabled)
```bash
# Uses OpenRouter by default - just set OPENROUTER_API_KEY in .env
python run_agent.py \
  --query "search up the latest docs on jit in python 3.13 and write me basic example that's not in their docs. profile its perf" \
  --max_turns 20 \
  --model anthropic/claude-sonnet-4-20250514
```

### With specific toolset
```bash
python run_agent.py \
  --query "Debug this Python error" \
  --enabled_toolsets=debugging \
  --model anthropic/claude-sonnet-4-20250514
```

### Python API
```python
from run_agent import AIAgent

# Uses OpenRouter by default (reads OPENROUTER_API_KEY from .env)
agent = AIAgent(
    model="anthropic/claude-sonnet-4-20250514",
    enabled_toolsets=["research"]
)
response = agent.chat("Find information about quantum computing")

# Create custom toolset at runtime
from toolsets import create_custom_toolset

create_custom_toolset(
    name="my_tools",
    description="My custom toolkit",
    tools=["web_search"],
    includes=["terminal", "vision"]
)

agent = AIAgent(enabled_toolsets=["my_tools"])
```

## Batch Processing

Process multiple prompts from a dataset in parallel with automatic checkpointing and statistics tracking:

```bash
# Basic batch processing
python batch_runner.py \
  --dataset_file=prompts.jsonl \
  --batch_size=20 \
  --run_name=my_run

# With specific distribution
python batch_runner.py \
  --dataset_file=prompts.jsonl \
  --batch_size=20 \
  --run_name=image_run \
  --distribution=image_gen \
  --num_workers=4
```

**Key Features:**
- Parallel processing with configurable workers
- Toolset distributions for varied data generation
- Automatic checkpointing and resume capability
- Combined output in `data/<run_name>/trajectories.jsonl`
- Tool usage statistics and success rates

Use `--list_distributions` to see available toolset distributions for varied data generation.

### Trajectory Compression

Post-process trajectories to fit within token budgets for training:

```bash
# Compress a directory of JSONL files
python trajectory_compressor.py --input=data/my_run

# Compress a single JSONL file
python trajectory_compressor.py --input=data/trajectories.jsonl

# Compress a 15% sample (useful for creating smaller training sets)
python trajectory_compressor.py --input=data/trajectories.jsonl --sample_percent=15

# Custom output and token target
python trajectory_compressor.py \
  --input=data/trajectories.jsonl \
  --output=data/compressed.jsonl \
  --target_max_tokens=16000
```

**Features:**
- Protects first turns (system, human, first GPT response, first tool call)
- Protects last N turns (configurable)
- Summarizes middle turns using an LLM to fit the target token budget
- Supports both directory and single file input
- Optional random sampling with `--sample_percent`
- Configurable via `configs/trajectory_compression.yaml`

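The turn-protection rules above reduce to a simple index selection: only the window between the protected head and tail is eligible for summarization. An illustrative sketch; the parameter names are hypothetical, not the compressor's actual API:

```python
def turns_to_summarize(num_turns: int,
                       protect_first: int = 4,
                       protect_last: int = 2) -> list[int]:
    """Indices of the middle turns eligible for LLM summarization.

    The first turns (system, human, first GPT response, first tool call)
    and the last N turns are always kept verbatim.
    """
    if num_turns <= protect_first + protect_last:
        return []  # nothing in the middle to compress
    return list(range(protect_first, num_turns - protect_last))
```

For a 10-turn trajectory with the defaults, only turns 4 through 7 would be candidates for summarization.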
### Ephemeral System Prompts

The ephemeral system prompt feature allows you to guide the model's behavior during batch processing **without** saving that prompt to the training dataset trajectories. This is useful for:

- Guiding model behavior during data collection
- Adding task-specific instructions
- Keeping saved trajectories clean and focused on tool-calling format

**Example:**
```bash
python batch_runner.py \
  --dataset_file=prompts.jsonl \
  --batch_size=10 \
  --run_name=my_run \
  --ephemeral_system_prompt="You are a helpful assistant focused on image generation."
```

The ephemeral prompt influences the model's behavior during execution, but **only the standard tool-calling system prompt** is saved in the trajectory files.

## Command Line Arguments

**Single Agent (`run_agent.py`):**
- `--query`: The question or task for the agent
- `--model`: Model to use (default: claude-opus-4-20250514)
- `--api_key`: API key for authentication
- `--base_url`: API endpoint URL
- `--max_turns`: Maximum number of tool-calling iterations
- `--enabled_toolsets`: Comma-separated list of toolsets to enable. Use `all` (or `*`) to enable everything. If omitted, all toolsets are enabled by default.
- `--disabled_toolsets`: Comma-separated list of toolsets to disable
- `--list_tools`: List all available toolsets and tools
- `--save_trajectories`: Save conversation trajectories to JSONL files

**Batch Processing (`batch_runner.py`):**
- `--dataset_file`: Path to JSONL file with prompts
- `--batch_size`: Number of prompts per batch
- `--run_name`: Name for this run (for output/checkpointing)
- `--distribution`: Toolset distribution to use (default: "default")
- `--num_workers`: Number of parallel workers (default: 4)
- `--resume`: Resume from checkpoint if interrupted
- `--ephemeral_system_prompt`: System prompt used during execution but NOT saved to trajectories
- `--list_distributions`: List available toolset distributions

## Environment Variables

All environment variables can be configured in the `.env` file (copy from `.env.example`).

**LLM Provider (OpenRouter):**

- `OPENROUTER_API_KEY`: Primary LLM access via OpenRouter (supports Claude, GPT-4, Gemini, etc.)
- `LLM_MODEL`: Default model (e.g., `anthropic/claude-sonnet-4`, `openai/gpt-4o`)
**Tool API Keys:**

- `FIRECRAWL_API_KEY`: Web tools (search, extract, crawl)
- `NOUS_API_KEY`: Vision and reasoning tools
- `FAL_KEY`: Image generation tools
**Terminal Tool Configuration (mini-swe-agent backend):**

- `TERMINAL_ENV`: Backend type - `local`, `docker`, `singularity`, `modal`, or `ssh` (default: `local`)
- `TERMINAL_DOCKER_IMAGE`: Docker image for the docker backend (default: `python:3.11-slim`)
- `TERMINAL_SINGULARITY_IMAGE`: Singularity/Apptainer image (can be a `docker://...` URL or a local `.sif` path)
- `TERMINAL_TIMEOUT`: Command timeout in seconds (default: `60`)
- `TERMINAL_LIFETIME_SECONDS`: Clean up inactive environments after this many seconds (default: `300`)
- `TERMINAL_CWD`: Working directory inside containers (default: `/tmp`)
- `TERMINAL_SCRATCH_DIR`: Custom scratch directory for sandbox storage (optional; auto-detects `/scratch`)
- `SUDO_PASSWORD`: Enable sudo commands by piping the password via `sudo -S` (works with all backends)
  - If unset in CLI mode, you'll be prompted interactively when sudo is needed (45s timeout)
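The defaults above amount to a plain `os.getenv` chain. A minimal sketch — `terminal_config` is a hypothetical helper for illustration, not the repo's actual parsing code:

```python
import os

def terminal_config():
    """Read terminal backend settings with the documented defaults.

    Illustrative sketch only -- not the actual parsing code in the repo.
    """
    return {
        "env": os.getenv("TERMINAL_ENV", "local"),
        "docker_image": os.getenv("TERMINAL_DOCKER_IMAGE", "python:3.11-slim"),
        "timeout": int(os.getenv("TERMINAL_TIMEOUT", "60")),
        "lifetime": int(os.getenv("TERMINAL_LIFETIME_SECONDS", "300")),
        "cwd": os.getenv("TERMINAL_CWD", "/tmp"),
    }
```

Setting `TERMINAL_ENV=docker` in `.env` (or the shell) switches the backend without any code change.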
**SSH Backend Configuration (for remote execution):**

- `TERMINAL_SSH_HOST`: Remote server hostname or IP
- `TERMINAL_SSH_USER`: SSH username
- `TERMINAL_SSH_PORT`: SSH port (default: `22`)
- `TERMINAL_SSH_KEY`: Path to the SSH private key (optional; uses ssh-agent if not set)
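The key fallback works because when `TERMINAL_SSH_KEY` is unset, no `-i` flag is passed and `ssh` falls back to ssh-agent or default keys. A sketch of assembling the command line — `build_ssh_argv` is a hypothetical helper, not the actual backend code:

```python
import os

def build_ssh_argv(command):
    """Assemble an ssh command line from the TERMINAL_SSH_* variables.

    Illustrative sketch: the real backend may pass additional options.
    Without TERMINAL_SSH_KEY, no -i flag is added, so ssh uses
    ssh-agent / default identity files.
    """
    host = os.environ["TERMINAL_SSH_HOST"]
    user = os.getenv("TERMINAL_SSH_USER", "root")
    port = os.getenv("TERMINAL_SSH_PORT", "22")
    key = os.getenv("TERMINAL_SSH_KEY")

    argv = ["ssh", "-p", port]
    if key:
        argv += ["-i", key]
    argv += [f"{user}@{host}", command]
    return argv
```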
**Browser Tool Configuration (agent-browser + Browserbase):**

- `BROWSERBASE_API_KEY`: Browserbase API key for cloud browser execution
- `BROWSERBASE_PROJECT_ID`: Browserbase project ID
- `BROWSER_SESSION_TIMEOUT`: Session timeout in seconds (default: `300`)
**Legacy Hecate Terminal Backend (optional):**

- `MORPH_API_KEY`: For the Hecate/MorphCloud terminal backend
- `HECATE_VM_LIFETIME_SECONDS`: VM lifetime in seconds (default: `300`)
- `HECATE_DEFAULT_SNAPSHOT_ID`: Default snapshot (default: `snapshot_p5294qxt`)
**Debug Options:**

- `WEB_TOOLS_DEBUG`, `VISION_TOOLS_DEBUG`, `MOA_TOOLS_DEBUG`, `IMAGE_TOOLS_DEBUG`: Enable debug logging
## Key Files

| File | Purpose |
|------|---------|
| `hermes` | CLI launcher script (run with `./hermes`) |
| `cli.py` | Interactive CLI implementation |
| `cli-config.yaml` | CLI configuration (copy from `.example`) |
| `run_agent.py` | Main agent runner - single query execution |
| `batch_runner.py` | Parallel batch processing with checkpointing |
| `model_tools.py` | Core tool definitions and handlers |
| `toolsets.py` | Toolset definitions and composition |
| `toolset_distributions.py` | Probability distributions for data generation |
| `trajectory_compressor.py` | Post-process trajectories for training |
| `tools/` | Individual tool implementations |
| `tools/skills_tool.py` | Skills system with progressive disclosure |
| `skills/` | On-demand knowledge documents |
| `docs/` | Documentation |
| `configs/` | Example batch run scripts |
# Atropos Integrations & RL Training

## Nomad Setup

Follow the official deployment guide: https://developer.hashicorp.com/nomad/docs/deploy

## Atropos dependencies

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[atropos]'
```
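To sanity-check that an optional extra actually installed, you can probe for its modules with `importlib`. This is a hypothetical helper, not part of the repo; for the atropos extra you would probe names like `atroposlib`, `fastapi`, and `pyte`:

```python
import importlib.util

def extra_available(*module_names):
    """Return True when every listed module can be imported.

    Hypothetical sanity check after `pip install -e '.[atropos]'`.
    """
    return all(
        importlib.util.find_spec(name) is not None for name in module_names
    )
```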
hermes_agent.egg-info/SOURCES.txt (new file, 70 lines)
@@ -0,0 +1,70 @@
README.md
atropos_compatible_agent.py
batch_runner.py
local_server.py
model_tools.py
pyproject.toml
run_agent.py
toolset_distributions.py
toolsets.py
trajectory_compressor.py
atropos/__init__.py
atropos/sandbox_server.py
atropos/agent/__init__.py
atropos/agent/atropos_agent.py
atropos/api/__init__.py
atropos/api/tool_executor_server.py
atropos/api/tool_server.py
atropos/backends/__init__.py
atropos/backends/base.py
atropos/backends/modal_backend.py
atropos/backends/nomad_backend.py
atropos/envs/__init__.py
atropos/envs/agent_env.py
atropos/envs/hermes_compat_test_env.py
atropos/envs/sandbox_terminal_smoke_env.py
atropos/envs/swe_smith_oracle_env.py
atropos/envs/test_env.py
atropos/envs/toolserver_smoke_env.py
atropos/nomad/__init__.py
atropos/nomad/client.py
atropos/slots/__init__.py
atropos/slots/executor.py
atropos/slots/pool.py
atropos/slots/slot.py
atropos/terminal/__init__.py
atropos/terminal/asciinema_stream.py
atropos/tools/__init__.py
atropos/tools/base.py
atropos/tools/build_registry.py
atropos/tools/hermes_external_tools.py
atropos/tools/sandbox_stubs.py
atropos/tools/terminal_stateful_tool.py
atropos/tools/tmux_tool.py
atropos/tools/tool_executor.py
atropos/tools/toolset_resolver.py
hermes_agent.egg-info/PKG-INFO
hermes_agent.egg-info/SOURCES.txt
hermes_agent.egg-info/dependency_links.txt
hermes_agent.egg-info/entry_points.txt
hermes_agent.egg-info/requires.txt
hermes_agent.egg-info/top_level.txt
tests/test_batch_runner.py
tests/test_checkpoint_resumption.py
tests/test_modal_integration.py
tests/test_modal_stress.py
tests/test_modal_terminal.py
tests/test_nous_api_limits.py
tests/test_nous_api_pattern.py
tests/test_temperature_fix.py
tests/test_tool_call_parsing.py
tests/test_web_tools.py
tools/__init__.py
tools/browser_tool.py
tools/image_generation_tool.py
tools/mixture_of_agents_tool.py
tools/skills_tool.py
tools/terminal_hecate.py
tools/terminal_tool.py
tools/vision_tools.py
tools/web_tools.py
hermes_agent.egg-info/dependency_links.txt (new file, 1 blank line)
@@ -0,0 +1 @@
hermes_agent.egg-info/entry_points.txt (new file, 4 lines)
@@ -0,0 +1,4 @@
[console_scripts]
hermes-agent = run_agent:main
hermes-atropos-sandbox-smoke = atropos.envs.sandbox_terminal_smoke_env:SandboxTerminalSmokeEnv.cli
hermes-atropos-toolserver-smoke = atropos.envs.toolserver_smoke_env:ToolServerSmokeEnv.cli
hermes_agent.egg-info/requires.txt (new file, 31 lines)
@@ -0,0 +1,31 @@
openai
python-dotenv
fire
httpx
rich
tenacity
pyyaml
prompt_toolkit
requests
jinja2
pydantic>=2.0
firecrawl-py
fal-client
litellm>=1.75.5
typer
platformdirs

[atropos]
atroposlib @ git+https://github.com/NousResearch/atropos.git
aiohttp
fastapi
uvicorn
pyte

[dev]
pytest
pytest-asyncio

[modal]
modal
boto3
hermes_agent.egg-info/top_level.txt (new file, 10 lines)
@@ -0,0 +1,10 @@
atropos
atropos_compatible_agent
batch_runner
local_server
model_tools
run_agent
tools
toolset_distributions
toolsets
trajectory_compressor
local_server.py (new file, 353 lines)
@@ -0,0 +1,353 @@
"""
Local OpenAI-compatible server implementation for Hermes-Agent (Atropos integration).

Extends the Atropos APIServer to work with local OpenAI-compatible APIs (e.g. vLLM, SGLang),
providing tokens_and_logprobs_completion support via client-side tokenization.
"""

import asyncio
import os
import warnings
from typing import Any, List, Optional

import openai
from openai.types.chat.chat_completion import ChatCompletion
from openai.types.completion import Completion

from atroposlib.envs.server_handling.server_baseline import (
    APIServer,
    APIServerConfig,
    ReasoningConfig,
)


class LocalServer(APIServer):
    """
    OpenAI-compatible local server with tokens_and_logprobs support.

    Uses an OpenAI-compatible API (typically at a /v1 endpoint) and handles
    token extraction via client-side tokenization.

    Note: Many local servers don't return per-token logprobs in the standard API,
    so this implementation uses placeholder logprobs (0.0) for PoC purposes.
    For production training, use vLLM/SGLang servers that return real logprobs.
    """

    def __init__(
        self,
        config: APIServerConfig,
        tokenizer: Optional[Any] = None,
        tokenizer_name: str = "gpt2",
        reasoning_config: Optional[ReasoningConfig] = None,
    ):
        """
        Initialize the local server.

        Args:
            config: Server configuration
            tokenizer: Pre-initialized tokenizer (optional)
            tokenizer_name: Name of tokenizer to load if tokenizer not provided
            reasoning_config: Optional reasoning configuration
        """
        # Build the OpenAI client pointing to the server's /v1 endpoint
        base_url = config.base_url
        if base_url and not base_url.endswith("/v1"):
            base_url = f"{base_url.rstrip('/')}/v1"

        self.openai = openai.AsyncClient(
            api_key=config.api_key or "local",  # Local servers often ignore auth
            base_url=base_url,
            timeout=config.timeout,
        )

        # Initialize tokenizer
        if tokenizer is not None:
            self.tokenizer = tokenizer
        else:
            try:
                from transformers import AutoTokenizer  # type: ignore
            except ModuleNotFoundError as exc:
                raise ModuleNotFoundError(
                    "Missing optional dependency 'transformers'. Pass a tokenizer instance to LocalServer, "
                    "or install transformers to enable `tokenizer_name` auto-loading."
                ) from exc
            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

        # Add a simple chat template if the tokenizer doesn't have one.
        # This is needed for ManagedServer's chat_completion to work.
        if not hasattr(self.tokenizer, "chat_template") or self.tokenizer.chat_template is None:
            # Simple ChatML-style template
            self.tokenizer.chat_template = (
                "{% for message in messages %}"
                "{% if message['role'] == 'system' %}<|im_start|>system\n{{ message['content'] }}<|im_end|>\n"
                "{% elif message['role'] == 'user' %}<|im_start|>user\n{{ message['content'] }}<|im_end|>\n"
                "{% elif message['role'] == 'assistant' %}<|im_start|>assistant\n{{ message['content'] }}<|im_end|>\n"
                "{% endif %}"
                "{% endfor %}"
                "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
            )

        super().__init__(config, reasoning_config=reasoning_config)
        # Local servers are treated as always-healthy unless a status task is enabled.
        self.server_healthy = True
    @classmethod
    def from_env(
        cls,
        base_url: Optional[str] = None,
        model: Optional[str] = None,
        api_key: Optional[str] = None,
        tokenizer_name: str = "gpt2",
        **kwargs,
    ) -> "LocalServer":
        """
        Create a LocalServer from environment variables (or explicit overrides).

        Env vars (checked in order):
        - base URL: ATROPOS_SERVER_BASE_URL, OPENAI_BASE_URL, LOCAL_LLM_BASE_URL, LLM_BASE_URL
        - model: ATROPOS_SERVER_MODEL, LLM_MODEL, LOCAL_LLM_MODEL
        - api key: ATROPOS_SERVER_API_KEY, OPENAI_API_KEY, LOCAL_LLM_API_KEY, LLM_API_KEY
        """
        from dotenv import load_dotenv
        load_dotenv()

        base_url = (
            base_url
            or os.getenv("ATROPOS_SERVER_BASE_URL")
            or os.getenv("OPENAI_BASE_URL")
            or os.getenv("LOCAL_LLM_BASE_URL")
            or os.getenv("LLM_BASE_URL")
            or "http://localhost:11434"
        )
        model = (
            model
            or os.getenv("ATROPOS_SERVER_MODEL")
            or os.getenv("LLM_MODEL")
            or os.getenv("LOCAL_LLM_MODEL")
            or "hermes3:8b"
        )
        api_key = (
            api_key
            or os.getenv("ATROPOS_SERVER_API_KEY")
            or os.getenv("OPENAI_API_KEY")
            or os.getenv("LOCAL_LLM_API_KEY")
            or os.getenv("LLM_API_KEY")
        )

        config = APIServerConfig(
            model_name=model,
            base_url=base_url,
            api_key=api_key or "local",
            timeout=kwargs.get("timeout", 120),
            num_max_requests_at_once=kwargs.get("num_max_requests_at_once", 4),
            num_requests_for_eval=kwargs.get("num_requests_for_eval", 4),
            health_check=False,  # Local dev servers often lack /health
        )

        return cls(config, tokenizer_name=tokenizer_name)
    async def check_server_status_task(self, chat_completion: bool = True):
        """
        Check if the server is healthy.

        For local development, we generally assume the server is healthy.
        """
        while True:
            try:
                # Simple health check via a minimal completion
                if chat_completion:
                    await self.openai.chat.completions.create(
                        model=self.config.model_name,
                        messages=[{"role": "user", "content": "hi"}],
                        max_tokens=1,
                    )
                else:
                    await self.openai.completions.create(
                        model=self.config.model_name,
                        prompt="hi",
                        max_tokens=1,
                    )
                self.server_healthy = True
            except Exception:
                self.server_healthy = False
            await asyncio.sleep(5)
    async def _chat_completion_wrapper(self, **kwargs) -> ChatCompletion:
        """
        Wrapper for chat completion using an OpenAI-compatible API.
        """
        assert kwargs.get("model") is not None, "Model is required!"
        assert kwargs.get("messages") is not None, "Messages are required!"

        n = kwargs.get("n", 1)

        # Some OpenAI-compatible servers don't support n > 1, so we make multiple requests.
        if n > 1:
            completion_list = await asyncio.gather(
                *[self.openai.chat.completions.create(**{**kwargs, "n": 1}) for _ in range(n)]
            )
            # Merge completions
            completions = completion_list[0]
            for c in completion_list[1:]:
                for choice in c.choices:
                    choice.index = len(completions.choices)
                    completions.choices.append(choice)
            return completions
        else:
            return await self.openai.chat.completions.create(**kwargs)

    async def _completion_wrapper(self, **kwargs) -> Completion:
        """
        Wrapper for completion using an OpenAI-compatible API.
        """
        assert kwargs.get("model") is not None, "Model is required!"
        assert kwargs.get("prompt") is not None, "Prompt is required!"

        n = kwargs.get("n", 1)

        # Some OpenAI-compatible servers don't support n > 1.
        if n > 1:
            completion_list = await asyncio.gather(
                *[self.openai.completions.create(**{**kwargs, "n": 1}) for _ in range(n)]
            )
            completions = completion_list[0]
            for c in completion_list[1:]:
                for choice in c.choices:
                    choice.index = len(completions.choices)
                    completions.choices.append(choice)
            return completions
        else:
            return await self.openai.completions.create(**kwargs)
    async def _tokens_and_logprobs_completion_wrapper(
        self, **kwargs
    ) -> tuple[List[int], List[List[int]], List[List[float]], List[str]]:
        """
        Wrapper for tokens and logprobs completion.

        Returns:
            Tuple of (prompt_tokens, output_tokens_list, output_logprobs_list, finish_reasons)

        Note: Many OpenAI-compatible local servers don't return per-token logprobs,
        so we use placeholder logprobs (0.0). For real training, use vLLM/SGLang.
        """
        model = kwargs.get("model")
        assert model is not None, "Model is required!"

        # Handle input_ids (from ManagedServer) or prompt
        if "input_ids" in kwargs:
            prompt_tokens = kwargs.pop("input_ids")
            prompt = self.tokenizer.decode(prompt_tokens)
            kwargs.pop("prompt", None)
        else:
            prompt = kwargs.pop("prompt", "")
            prompt_tokens = self.tokenizer.encode(prompt, add_special_tokens=True)

        n = kwargs.pop("n", 1)
        max_tokens = kwargs.pop("max_tokens", 256)
        temperature = kwargs.pop("temperature", 0.7)
        stop = kwargs.pop("stop", None)

        # Make completion requests
        completions = []
        for _ in range(n):
            try:
                response = await self.openai.completions.create(
                    model=model,
                    prompt=prompt,
                    max_tokens=max_tokens,
                    temperature=temperature,
                    stop=stop,
                )
                completions.append(response)
            except Exception as e:
                # Fall back to chat completion if the completion endpoint is not supported
                warnings.warn(f"Completion API failed, trying chat: {e}")
                response = await self.openai.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=max_tokens,
                    temperature=temperature,
                    stop=stop,
                )
                # Treated as a completion-like response below
                completions.append(response)

        output_tokens_list = []
        output_logprobs_list = []
        finish_reasons = []

        for completion in completions:
            # Extract text from the response
            if hasattr(completion.choices[0], "text"):
                # Completion API response
                text = completion.choices[0].text
                finish_reason = completion.choices[0].finish_reason or "stop"
            else:
                # Chat completion API response
                text = completion.choices[0].message.content or ""
                finish_reason = completion.choices[0].finish_reason or "stop"

            # Tokenize the output
            output_tokens = self.tokenizer.encode(text, add_special_tokens=False)

            # Placeholder logprobs (many local servers don't provide per-token logprobs).
            # In production, use vLLM/SGLang which return real logprobs.
            output_logprobs = [0.0] * len(output_tokens)

            output_tokens_list.append(output_tokens)
            output_logprobs_list.append(output_logprobs)
            finish_reasons.append(finish_reason)

        return prompt_tokens, output_tokens_list, output_logprobs_list, finish_reasons
    def managed_server(self, tokenizer=None, track_tree: bool = False):
        """
        Create a ManagedServer context manager for this server.

        Args:
            tokenizer: Optional tokenizer override
            track_tree: Whether to maintain tree structure for multi-turn

        Returns:
            ManagedServer context manager
        """
        return ManagedServerContext(
            self,
            tokenizer=tokenizer or self.tokenizer,
            track_tree=track_tree,
        )


class ManagedServerContext:
    """
    Context manager wrapper for ManagedServer.

    Usage:
        async with server.managed_server(tokenizer=tokenizer) as managed:
            response = await managed.chat_completion(...)
            state = managed.get_state()
    """

    def __init__(self, server: LocalServer, tokenizer, track_tree: bool = False):
        self.server = server
        self.tokenizer = tokenizer
        self.track_tree = track_tree
        self.managed = None

    async def __aenter__(self):
        from atroposlib.envs.server_handling.managed_server import ManagedServer

        self.managed = ManagedServer(
            self.server,
            tokenizer=self.tokenizer,
            track_tree=self.track_tree,
        )
        return self.managed

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.managed:
            self.managed.reset()
        return False
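The `n > 1` fan-out in `LocalServer._chat_completion_wrapper` merges several single-sample responses by appending each extra choice and reindexing it. The pattern can be sketched with plain dataclasses (mock types, not the real OpenAI response objects):

```python
from dataclasses import dataclass, field

@dataclass
class MockChoice:
    index: int
    text: str

@dataclass
class MockCompletion:
    choices: list = field(default_factory=list)

def merge_completions(completion_list):
    """Fold n single-choice responses into one response whose choices
    are reindexed 0..n-1, mirroring the wrapper's merge loop."""
    completions = completion_list[0]
    for c in completion_list[1:]:
        for choice in c.choices:
            choice.index = len(completions.choices)
            completions.choices.append(choice)
    return completions

merged = merge_completions([
    MockCompletion([MockChoice(0, "a")]),
    MockCompletion([MockChoice(0, "b")]),
    MockCompletion([MockChoice(0, "c")]),
])
```

This keeps the first response object as the container, so its metadata (id, model, usage) survives the merge.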
memory-bank/activeContext.md (new file, 61 lines)
@@ -0,0 +1,61 @@
# Active Context

## Current Focus
Tinker RL training integration - pipeline fully wired up, waiting on Tinker billing to test.

## Recently Completed (Feb 9, 2026)

### Tinker RL Training Integration
Created a complete agent training pipeline using Tinker (Thinking Machines) + Atropos:

**New Files Created:**
1. `tinker-atropos/tinker_atropos/environments/gsm8k_agent.py` - Agent GSM8k environment with:
   - Python REPL tool calling (Hermes-style `<tool_call>` format)
   - Multi-step agent loop within `collect_trajectories()`
   - Math answer verification via `math_verify`
   - Subprocess-based Python execution
   - WandB metrics (percent_correct, tool_use_rate)
2. `tinker-atropos/configs/gsm8k_agent.yaml` - Config for Qwen3-4B-Instruct training
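Extracting a Hermes-style `<tool_call>` block from a completion comes down to a small regex plus `json.loads`. A minimal sketch — the `name`/`arguments` payload shape is an assumption for illustration, not taken from the environment code:

```python
import json
import re

# Match a JSON object wrapped in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract JSON payloads from Hermes-style <tool_call> tags.

    Sketch only: assumes each tag wraps one JSON object with
    hypothetical "name"/"arguments" keys.
    """
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

reply = (
    "Let me compute that.\n"
    '<tool_call>{"name": "python", "arguments": {"code": "print(2 + 2)"}}</tool_call>'
)
calls = parse_tool_calls(reply)
```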
**Dependencies Updated:**
- `pyproject.toml` `[atropos]` extra now includes: tinker SDK, torch, wandb, math-verify
- Installed: tinker 0.12.0, tinker-atropos 0.1.0, torch (CPU)

**README Updated:**
- Added comprehensive "RL Training with Tinker" section with architecture diagram, quick start, config docs
- Added TINKER_API_KEY and WANDB_API_KEY to optional keys table

**Verified Working:**
- Tinker SDK connection ✅
- All imports (tinker, tinker_atropos, trainer, environment) ✅
- Python REPL execution + tool call parsing ✅
- Math verification ✅
- Atropos run-api (port 8000) ✅
- Tinker trainer starts, loads config, creates inference server (port 8001) ✅

**Blocked:** Tinker billing (402 error) - user's payment didn't process (possibly regional card issue)

### Main Branch Merge (Feb 9, 2026)
Merged `origin/main` into `atropos-integrations` - 22,560 lines, 79 files, 5 conflicts resolved.

### Modal Backend (Feb 8, 2026)
Merged the modal-integration branch; working with Modal Sandboxes.

### Singularity/Apptainer (Feb 6, 2026)
Completed and tested.

## Architecture: Training Pipeline

```
Terminal 1: run-api (port 8000) - Atropos Rollout API
Terminal 2: launch_training.py (port 8001) - Tinker Trainer + FastAPI inference
Terminal 3: gsm8k_agent.py serve - Environment (generates trajectories)
```

The agent env gets math problems → model calls Python REPL tool → scores answer → sends to Atropos → Tinker does LoRA training → updates sampling weights → repeat.

## Next Steps
- [ ] Resolve Tinker billing to test the full training loop
- [ ] Run GSM8k agent training for ~20 steps (proof of concept)
- [ ] Monitor WandB for reward improvement
- [ ] Graduate to more complex agent envs (SWE tasks with the Modal backend)
memory-bank/productContext.md (new file, 55 lines)
@@ -0,0 +1,55 @@
# Product Context: Hermes-Agent

## Why This Project Exists

Hermes-Agent addresses several key challenges in the AI agent space:

1. **Unified Tool Interface** - Provides a clean, consistent interface for LLMs to use various tools (web, terminal, browser, vision, etc.) without requiring custom integration for each model provider.
2. **Training Data Generation** - Enables efficient generation of high-quality tool-calling trajectories for fine-tuning LLMs, with features like batch processing, checkpointing, and trajectory compression.
3. **Flexible Deployment** - Supports multiple execution environments (local, Docker, Singularity, Modal, SSH) to accommodate different security and isolation requirements.
4. **Developer Experience** - Offers a beautiful, interactive CLI with kawaii-style feedback that makes working with AI agents enjoyable.

## Problems It Solves

### For AI Researchers
- **Data Generation at Scale**: Parallel batch processing with content-based checkpointing for fault tolerance
- **Clean Trajectories**: Trajectory compression to fit token budgets while preserving important information
- **Toolset Distributions**: Probability-based tool selection for varied training data

### For Developers
- **Tool Orchestration**: Logical grouping of tools into toolsets (research, development, debugging, etc.)
- **Session Persistence**: Conversation history and session logging for debugging
- **Multi-Model Support**: Works with any OpenAI-compatible API (OpenRouter, local models, etc.)

### For MLOps
- **Skills System**: On-demand knowledge documents for specific tools/frameworks (Axolotl, vLLM, TRL, etc.)
- **Sandboxed Execution**: Terminal commands can run in isolated environments (Docker, Singularity, Modal)
- **Configurable Backends**: Easy switching between local and cloud execution

## How It Should Work

### User Flow (CLI)
1. User launches `./hermes`
2. Beautiful welcome banner displays with caduceus logo, model info, and available tools
3. User types a natural language request
4. Agent processes the request, potentially calling tools with animated feedback
5. Agent responds with results; the conversation continues
6. Session is automatically logged for debugging

### User Flow (Batch Processing)
1. User prepares a JSONL file with prompts
2. Runs `batch_runner.py` with a distribution and worker count
3. System processes prompts in parallel, saving checkpoints
4. Completed trajectories are saved to `data/<run_name>/trajectories.jsonl`
5. Optional: compress trajectories with `trajectory_compressor.py`

## User Experience Goals

- **Delightful Interaction**: Kawaii ASCII faces, animated spinners, cute messages
- **Informative Feedback**: Clear progress indication during tool execution
- **Configurable Personalities**: From "helpful" to "pirate" to "Shakespeare"
- **Easy Configuration**: YAML config file + environment variables + CLI flags
- **Graceful Degradation**: Missing tools/APIs don't break the system, just disable features
96
memory-bank/progress.md
Normal file
96
memory-bank/progress.md
Normal file
@@ -0,0 +1,96 @@
|
|||||||
|
# Progress
|
||||||
|
|
||||||
|
## Completed Features
|
||||||
|
|
||||||
|
### ✅ Modal Backend Integration (Feb 8, 2026 - MERGED & TESTED)
|
||||||
|
Merged the `modal-integration` branch and fixed integration issues.
|
||||||
|
|
||||||
|
**What Works:**
|
||||||
|
- `ModalToolBackend` implements full `ToolBackend` interface (start, stop, acquire, release, execute_batch)
|
||||||
|
- Modal Sandboxes used for long-lived containers (not Functions)
|
||||||
|
- `sandbox.exec()` for direct command execution (no HTTP server needed)
|
||||||
|
- Slot-based multiplexing matching Nomad pattern
|
||||||
|
- Multi-profile support (`ModalSandboxConfig`, `_ModalMultiProfileManager`)
|
||||||
|
- YAML profile loading (`modal_profiles.yaml`)
|
||||||
|
- `AgentEnvConfig` fields for all Modal settings (`--env.modal_*`)
|
||||||
|
- `create_tool_backend()` supports `tool_pool_mode="modal"`
|
||||||
|
- Terminal tool (`tools/terminal_tool.py`) native Modal integration with pool management
|
||||||
|
- Named sandbox recovery via `Sandbox.from_name()`
|
||||||
|
- Auto-scaling sandbox pool per profile
|
||||||
|
- Artifact helpers (read, list, archive)
|
||||||
|
|
||||||
|
**CLI Usage:**
|
||||||
|
```bash
|
||||||
|
# Atropos backend
|
||||||
|
python -m atropos.envs.swe_smith_oracle_env process \
|
||||||
|
--env.tool_pool_mode modal \
|
||||||
|
--env.modal_image python:3.11
|
||||||
|
|
||||||
|
# Terminal tool
|
||||||
|
TERMINAL_ENV=modal ./hermes
|
||||||
|
```
|
||||||
|
|
||||||
|
**Files Modified/Created:**
|
||||||
|
- `atropos/backends/modal_backend.py` - Full implementation (~1200 lines)
|
||||||
|
- `atropos/backends/__init__.py` - `create_tool_backend()` updated
|
||||||
|
- `atropos/envs/agent_env.py` - 15 Modal config fields added
|
||||||
|
- `tools/terminal_tool.py` - Native Modal sandbox pool
|
||||||
|
- `docs/MODAL_BACKEND.md` - Documentation
|
||||||
|
- `modal_profiles.yaml.example` - Example profiles
|
||||||
|
- `tests/test_modal_integration.py` - Integration tests
|
||||||
|
- `tests/test_modal_stress.py` - Stress tests
|
||||||
|
- `tests/test_modal_terminal.py` - Terminal tool tests
|
||||||
|
|
||||||
|
### ✅ Singularity/Apptainer Sandbox Integration (Feb 6, 2026 - FULLY TESTED)

Adapted the Atropos sandbox environment from Docker to Singularity/Apptainer for HPC clusters.

**What Works:**

- `create_sandbox_job()` supports both `driver="docker"` and `driver="singularity"`
- `SlotPoolConfig` and `NomadBackendConfig` propagate driver settings
- Singularity container runs `sandbox_server.py` via Nomad's raw_exec driver
- All sandbox operations work: bash execution, file read/write
- **CLI arguments** `--env.driver` and `--env.singularity_image` for `AgentEnvConfig`
- **Static port binding** for Singularity (ReservedPorts vs DynamicPorts)

### ✅ Memory Bank Initialized (Feb 5, 2026)

Set up project documentation structure for context persistence.

## In Progress

None currently.

## Known Issues

- Modal backend not yet live-tested with actual Modal cloud credentials
- `bwrap_available: false` in Singularity containers
- Health check timing - may need a longer wait for container startup on slower systems

## What's Left to Build

### Modal Backend

- [ ] Live test with Modal credentials on actual cloud
- [ ] Test multi-profile GPU workflows
- [ ] Test sandbox recovery after restart
- [ ] Integrate with SWE-smith-oracle env for GRPO training loop
- [ ] Performance benchmarking vs Nomad backend

### HPC Deployment

- [ ] Test on actual HPC cluster with Slurm/PBS integration
- [ ] Document cluster-specific deployment procedures

### Documentation

- [ ] Add Singularity deployment to README
- [ ] Create HPC deployment skill in skills/mlops/

## Evolution of Decisions

### Container Runtime Selection

- **Initial**: Docker-only via Nomad docker driver
- **Problem**: HPC clusters don't allow Docker without sudo
- **Solution**: Added Singularity/Apptainer support via raw_exec driver
- **Result**: Both runtimes now supported with the same API

### Modal Backend Architecture

- **Initial**: Stub placeholder raising RuntimeError
- **Investigation**: Modal Sandboxes vs Functions - chose Sandboxes for long-lived containers
- **Design**: Direct `sandbox.exec()` instead of HTTP/sandbox_server.py (simpler, no networking needed)
- **Implementation**: Merged from `modal-integration` branch, fixed agent_env.py config fields
- **Result**: Three backends now supported: Nomad/Docker, Nomad/Singularity, Modal
44 memory-bank/projectbrief.md Normal file
@@ -0,0 +1,44 @@
# Project Brief: Hermes-Agent

## Overview

Hermes-Agent is an AI agent harness for LLMs with advanced tool-calling capabilities, featuring a flexible toolsets system for organizing and managing tools. Named after Hermes, the Greek messenger god, it serves as a bridge between human intent and AI-powered task execution.

## Core Requirements

### Primary Goals

1. **Interactive CLI Experience** - Beautiful terminal interface with animated feedback, personalities, and session management
2. **Flexible Tool System** - Modular tools organized into logical toolsets for different use cases
3. **Batch Processing** - Process multiple prompts in parallel with checkpointing and statistics
4. **Multi-Backend Support** - Support for local, Docker, Singularity, Modal, and SSH terminal backends
5. **Training Data Generation** - Save conversation trajectories in formats suitable for LLM fine-tuning

### Target Users

- AI researchers generating training data
- Developers needing an AI assistant with tool access
- MLOps practitioners automating workflows
- Anyone needing a powerful CLI-based AI agent

## Scope

### In Scope

- Interactive CLI with rich formatting and kawaii-style feedback
- Web tools (search, extract, crawl via Firecrawl)
- Terminal tools (command execution across multiple backends)
- Browser automation (via agent-browser + Browserbase)
- Vision tools (image analysis)
- Image generation (FLUX via FAL.ai)
- Mixture-of-Agents reasoning
- Skills system for on-demand knowledge
- Batch processing with parallel workers
- Trajectory compression for training

### Out of Scope (Current)

- Proactive suggestions (agent only runs on request)
- Clipboard integration (no local system access)
- Real-time streaming of thinking/reasoning (deferred)

## Success Metrics

- Clean, maintainable tool architecture
- Reliable tool execution with proper error handling
- Efficient context management for long conversations
- High-quality trajectory data for training
191 memory-bank/systemPatterns.md Normal file
@@ -0,0 +1,191 @@
# System Patterns: Hermes-Agent

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                         CLI (cli.py)                            │
│  - Rich welcome banner with caduceus                            │
│  - prompt_toolkit for input with history                        │
│  - Kawaii-style feedback and personalities                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                    AIAgent (run_agent.py)                       │
│  - Conversation loop with tool calling                          │
│  - KawaiiSpinner for animated feedback                          │
│  - Retry logic with exponential backoff                         │
│  - Session logging to logs/ directory                           │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                 Tool Routing (model_tools.py)                   │
│  - get_tool_definitions() - returns tools for API calls         │
│  - handle_function_call() - dispatches to tool handlers         │
│  - Toolset filtering (enabled/disabled)                         │
└────────────────────────────┬────────────────────────────────────┘
                             │
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
    ┌───────────┐      ┌───────────┐     ┌───────────┐
    │ Web Tools │      │ Terminal  │     │  Browser  │
    │(Firecrawl)│      │ (mini-swe)│     │(agent-brw)│
    └───────────┘      └───────────┘     └───────────┘
          │                  │                 │
          └──────────────────┼─────────────────┘
                             ▼
                     ┌───────────────┐
                     │   Toolsets    │
                     │ (toolsets.py) │
                     │  Composition  │
                     └───────────────┘
```

## Key Design Patterns

### 1. Toolset Composition Pattern

Toolsets can include other toolsets, allowing flexible composition:

```python
TOOLSETS = {
    "web": {"tools": ["web_search", "web_extract"], "includes": []},
    "debugging": {"tools": ["terminal"], "includes": ["web"]},
    "full_stack": {"tools": [], "includes": ["web", "terminal", "vision", "browser"]}
}
```

Resolution is recursive with cycle detection.
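
A minimal sketch of what recursive resolution with cycle detection can look like; the function body here is illustrative, not the actual `toolsets.py` implementation (only two toolsets are included so the sketch is self-contained):

```python
# Illustrative toolset table matching the shape shown above.
TOOLSETS = {
    "web": {"tools": ["web_search", "web_extract"], "includes": []},
    "debugging": {"tools": ["terminal"], "includes": ["web"]},
}

def resolve_toolset(name, _seen=None):
    """Return the full tool list for a toolset, following `includes`."""
    seen = _seen if _seen is not None else set()
    if name in seen:
        return []  # cycle detected: stop recursing
    seen.add(name)
    spec = TOOLSETS[name]
    tools = list(spec["tools"])
    for included in spec["includes"]:
        tools.extend(resolve_toolset(included, seen))
    # de-duplicate while preserving first-seen order
    return list(dict.fromkeys(tools))
```

Sharing the `seen` set across recursive calls is what turns mutual includes into a no-op instead of infinite recursion.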

### 2. Graceful Degradation Pattern

Each tool module has a `check_*_requirements()` function:

- Tools are only loaded if requirements are met
- Missing API keys disable tools, not crash the system
- Import errors are caught and tools marked unavailable

```python
try:
    from tools.web_tools import web_search_tool, check_firecrawl_api_key
except ModuleNotFoundError:
    web_search_tool = None
    def check_firecrawl_api_key(): return False
```

### 3. Session Isolation Pattern (task_id)

Stateful tools (terminal, browser) use `task_id` to isolate concurrent sessions:

- Each batch worker gets a unique task_id
- VMs and browser sessions are tracked per task_id
- Cleanup functions release resources: `cleanup_vm(task_id)`, `cleanup_browser(task_id)`

### 4. Trajectory Format Pattern

Conversations are saved in ShareGPT format for training:

```json
{"from": "system", "value": "System prompt with <tools>...</tools>"}
{"from": "human", "value": "User message"}
{"from": "gpt", "value": "<think>reasoning</think>\n<tool_call>{...}</tool_call>"}
{"from": "tool", "value": "<tool_response>{...}</tool_response>"}
{"from": "gpt", "value": "Final response"}
```
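
A hedged sketch of producing lines in this shape from an OpenAI-style message list; the `"from"`/`"value"` fields follow the example above, while the role mapping and helper name are assumptions rather than the actual exporter:

```python
import json

# assistant -> "gpt", user -> "human", per the ShareGPT example above
ROLE_MAP = {"system": "system", "user": "human", "assistant": "gpt", "tool": "tool"}

def to_sharegpt(messages):
    """Map OpenAI-style {role, content} dicts to ShareGPT {from, value} dicts."""
    return [{"from": ROLE_MAP[m["role"]], "value": m["content"]} for m in messages]

convo = [
    {"role": "system", "content": "System prompt with <tools>...</tools>"},
    {"role": "user", "content": "User message"},
    {"role": "assistant", "content": "Final response"},
]
lines = [json.dumps(turn, ensure_ascii=False) for turn in to_sharegpt(convo)]
```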

### 5. Ephemeral System Prompt Pattern

Guide model behavior during data collection without saving to trajectories:

- `ephemeral_system_prompt` influences execution
- Only the standard tool-calling system prompt is saved to trajectories
- Keeps training data clean

### 6. Retry with Validation Pattern

The agent validates responses before accepting:

- Check tool names against the `valid_tool_names` set
- Validate JSON arguments can be parsed
- Check for content after `<think>` blocks
- Roll back to last valid state on persistent failures
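
A minimal sketch of the first two checks; `valid_tool_names` comes from the text, the function itself is illustrative:

```python
import json

valid_tool_names = {"web_search", "terminal"}

def validate_tool_call(name, raw_args):
    """Return (ok, reason); the agent retries and eventually rolls back on failure."""
    if name not in valid_tool_names:
        return False, f"unknown tool: {name}"
    try:
        json.loads(raw_args)  # arguments must be parseable JSON
    except json.JSONDecodeError as e:
        return False, f"bad JSON arguments: {e}"
    return True, "ok"
```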

## Component Relationships

### AIAgent Class

- Central orchestrator for conversations
- Manages conversation history
- Calls OpenAI-compatible API
- Routes tool calls to handlers
- Provides animated feedback (KawaiiSpinner)

### Tool Modules (tools/*.py)

- Self-contained tool implementations
- Export: handler function + check function + schema
- Return JSON strings (never raw dicts)
- Accept optional `task_id` for stateful tools

### Toolsets System (toolsets.py)

- Defines logical groupings of tools
- Supports composition via `includes`
- `resolve_toolset()` recursively resolves all tools
- `validate_toolset()` checks if a name is valid

### Model Tools (model_tools.py)

- Aggregates all tool definitions
- Routes function calls to correct handlers
- Filters tools based on enabled/disabled toolsets
- Bridge between agent and tool implementations
## Critical Implementation Paths

### Tool Execution Flow

1. AIAgent receives tool_calls from API response
2. Validates tool names against `valid_tool_names`
3. Validates JSON arguments can be parsed
4. Calls `handle_function_call()` with tool name, args, task_id
5. `handle_function_call()` routes to the appropriate handler
6. Tool executes, returns a JSON string
7. Result added to conversation as a tool message
8. Loop continues until a natural-language response
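
The steps above can be sketched with a stubbed dispatcher; everything except the `handle_function_call` name (from the text) is hypothetical:

```python
import json

def handle_function_call(name, args, task_id):
    # stand-in for the real router in model_tools.py
    return json.dumps({"echo": args})

def run_turn(tool_calls, valid_tool_names, task_id="t1"):
    messages = []
    for call in tool_calls:
        name, raw = call["name"], call["arguments"]
        if name not in valid_tool_names:      # step 2: validate name
            continue
        try:
            args = json.loads(raw)            # step 3: validate JSON args
        except json.JSONDecodeError:
            continue
        result = handle_function_call(name, args, task_id)   # steps 4-6
        messages.append({"role": "tool", "content": result}) # step 7
    return messages

out = run_turn([{"name": "bash", "arguments": '{"cmd": "ls"}'}], {"bash"})
```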

### Configuration Loading Flow

1. `cli.py` calls `load_cli_config()`
2. Loads `cli-config.yaml`, merges with defaults
3. Sets environment variables for terminal config
4. `AIAgent` reads env vars when initializing the terminal tool
5. Terminal tool creates the appropriate backend based on `TERMINAL_ENV`
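
A hedged sketch of the handoff in steps 3-5: the loader exports terminal settings as env vars, and the terminal tool later reads them back. The `TERMINAL_ENV`/`TERMINAL_CWD` names come from the docs; the function names are illustrative:

```python
import os

def apply_terminal_config(cfg):
    """Step 3: export terminal settings from the merged config as env vars."""
    os.environ["TERMINAL_ENV"] = cfg.get("env", "local")
    os.environ["TERMINAL_CWD"] = cfg.get("cwd", ".")

def pick_backend():
    """Steps 4-5: the terminal tool reads TERMINAL_ENV to choose a backend."""
    return os.getenv("TERMINAL_ENV", "local")  # local/docker/singularity/modal/ssh

apply_terminal_config({"env": "docker", "cwd": "/tmp"})
backend = pick_backend()
```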

## Atropos Backend Architecture

### Backend Hierarchy

```
ToolBackend (Protocol - base.py)
├── NomadToolBackend → SlotPool → NomadClient + SandboxExecutor (HTTP)
│     ├── Docker driver (default)
│     └── Singularity driver (HPC)
└── ModalToolBackend → _ModalSandboxPool → modal.Sandbox.exec() (direct)
      └── _ModalMultiProfileManager (multi-profile support)
```

### Slot-Based Multiplexing Pattern

All backends share the same slot multiplexing concept:

- **Sandbox/Container**: Long-lived compute unit
- **Slot**: Isolated workspace directory within a sandbox (e.g., `/data/slot_0`)
- **Trajectory**: One agent task using one slot
- Multiple trajectories share a sandbox via different slots
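
An illustrative sketch of the slot lease cycle: one sandbox hosts N workspace directories and each trajectory leases one. This class is a simplification for explanation, not the actual `SlotPool` implementation:

```python
class SlotPool:
    """Lease slot directories of one sandbox to trajectories."""

    def __init__(self, sandbox_id, num_slots):
        self.sandbox_id = sandbox_id
        self.free = [f"/data/slot_{i}" for i in range(num_slots)]
        self.leased = {}  # trajectory_id -> slot path

    def acquire(self, trajectory_id):
        slot = self.free.pop(0)  # raises IndexError if exhausted
        self.leased[trajectory_id] = slot
        return slot

    def release(self, trajectory_id):
        self.free.append(self.leased.pop(trajectory_id))

pool = SlotPool("sandbox-a", num_slots=2)
w0 = pool.acquire("traj-0")
w1 = pool.acquire("traj-1")
pool.release("traj-0")  # slot_0 is reusable by the next trajectory
```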

### Nomad Backend (HTTP-based)

- Deploys `sandbox_server.py` inside containers (Docker or Singularity)
- Uses `SandboxExecutor` for HTTP communication (POST /execute, POST /batch)
- Nomad manages container lifecycle (scaling, health checks)
- Tools: bash, bash_stateful, read_file, write_file, tmux

### Modal Backend (exec-based)

- Creates `modal.Sandbox` instances (long-lived containers)
- Uses `sandbox.exec("bash", "-c", command)` directly (no HTTP server)
- Modal manages container lifecycle (idle_timeout, max_lifetime)
- Multi-profile support: different resource configs (CPU, GPU, memory)
- Named sandboxes for recovery: `Sandbox.from_name(app_name, sandbox_name)`
- YAML config via `modal_profiles.yaml`

### Backend Selection

```python
# In agent_env.py / create_tool_backend()
if mode == "nomad":
    return NomadToolBackend(NomadBackendConfig.from_agent_env_config(cfg))
if mode == "modal":
    return ModalToolBackend(ModalSandboxConfig.from_agent_env_config(cfg))
```
113 memory-bank/techContext.md Normal file
@@ -0,0 +1,113 @@
# Technical Context: Hermes-Agent

## Technologies Used

### Core Stack

- **Python 3.11+** - Primary language
- **OpenAI SDK** - For LLM API interactions (OpenAI-compatible)
- **OpenRouter** - Default LLM provider (supports multiple models)
- **Rich** - Terminal formatting and panels
- **prompt_toolkit** - Interactive input with history
- **Fire** - CLI argument parsing
- **PyYAML** - Configuration files
- **python-dotenv** - Environment variable management

### Tool Dependencies

- **Firecrawl** - Web search and extraction (`FIRECRAWL_API_KEY`)
- **mini-swe-agent** - Terminal tool backend (local/docker/singularity/modal/ssh)
- **agent-browser** - Browser automation (npm package)
- **Browserbase** - Cloud browser execution (`BROWSERBASE_API_KEY`)
- **FAL.ai** - Image generation with FLUX (`FAL_KEY`)
- **Nous API** - Vision and MoA tools (`NOUS_API_KEY`)

### Optional Dependencies

- **Modal** - Cloud compute for sandboxed environments
- **Singularity/Apptainer** - Rootless containers (HPC environments)
- **Docker** - Container isolation

## Development Setup

### Quick Start

```bash
# Clone with submodules
git clone --recurse-submodules https://github.com/NousResearch/Hermes-Agent.git
cd Hermes-Agent

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
pip install -e ./mini-swe-agent

# Install browser tools (optional)
npm install

# Configure environment
cp .env.example .env
# Edit .env with your API keys
```

### Key Configuration Files

- `.env` - API keys and secrets
- `cli-config.yaml` - CLI configuration (model, terminal, toolsets, personalities)
- `configs/` - Batch run scripts and configuration

### Environment Variables

**Required for Full Functionality:**

- `OPENROUTER_API_KEY` - Primary LLM access
- `FIRECRAWL_API_KEY` - Web tools
- `NOUS_API_KEY` - Vision and reasoning tools
- `FAL_KEY` - Image generation

**Terminal Backend:**

- `TERMINAL_ENV` - Backend type: `local`, `docker`, `singularity`, `modal`, `ssh`
- `TERMINAL_CWD` - Working directory
- `TERMINAL_DOCKER_IMAGE` / `TERMINAL_SINGULARITY_IMAGE` - Container images
- `TERMINAL_SSH_HOST/USER/KEY` - SSH backend config
- `SUDO_PASSWORD` - Optional sudo support

**Browser:**

- `BROWSERBASE_API_KEY` - Browser automation
- `BROWSERBASE_PROJECT_ID` - Browserbase project

## Technical Constraints

1. **Context Window Limits** - Long tool outputs can exhaust context; trajectory compression helps
2. **API Rate Limits** - OpenRouter and tool APIs have rate limits; exponential backoff implemented
3. **Tool Availability** - Tools gracefully degrade if dependencies/keys are missing
4. **Async Compatibility** - Some tools are async, handled via `asyncio.run()` in sync context

## Dependency Graph

```
tools/*.py → tools/__init__.py → model_tools.py → toolsets.py → toolset_distributions.py
                                        ↑
run_agent.py ───────────────────────────┘
cli.py → run_agent.py (uses AIAgent with quiet_mode=True)
batch_runner.py → run_agent.py + toolset_distributions.py
```

## Tool Usage Patterns

### Adding a New Tool

1. Create `tools/your_tool.py` with handler + requirements check
2. Export in `tools/__init__.py`
3. Register in `model_tools.py` (definitions + handler routing)
4. Add to a toolset in `toolsets.py`
5. Optionally add to `toolset_distributions.py` for batch processing

### Tool Handler Pattern

```python
def your_tool(param: str, task_id: str = None) -> str:
    """Execute tool and return JSON string result."""
    try:
        result = {"success": True, "data": "..."}
        return json.dumps(result, ensure_ascii=False)
    except Exception as e:
        return json.dumps({"error": str(e)}, ensure_ascii=False)
```

All tool handlers MUST return a JSON string, never raw dicts.

Submodule mini-swe-agent updated: 07aa6a7385...9ddd61b62d
134 modal_profiles.yaml.example Normal file
@@ -0,0 +1,134 @@
# Modal Sandbox Profiles Configuration
# =====================================
# This file defines different sandbox profiles for heterogeneous workloads.
# Copy to modal_profiles.yaml and customize as needed.
#
# Usage:
#   terminal_tool("python train.py", profile="pytorch-gpu")
#   terminal_tool("npm test", profile="node")
#
# Each profile can specify:
#   - image: Docker image to use
#   - gpu: GPU type (null, "T4", "A10G", "A100", "H100")
#   - cpu: CPU cores (float)
#   - memory: Memory in MB
#   - min_pool: Minimum warm sandboxes (cost vs latency tradeoff)
#   - max_pool: Maximum sandboxes (hard cost cap)
#   - idle_timeout: Server-side auto-cleanup in seconds
#   - max_lifetime: Maximum sandbox lifetime in seconds
#   - scale_down_idle: Client-side scale-down threshold in seconds
#   - workdir: Working directory inside container
#   - secrets: List of Modal Secret names to inject (created via dashboard/CLI)
#   - env_vars: Dict of environment variables to pass directly
#   - use_dotenv: If true, loads local .env file into sandbox
#
# SECRETS SETUP:
# Create secrets via the Modal dashboard or CLI:
#   modal secret create huggingface-token HF_TOKEN=hf_xxx
#   modal secret create openai-key OPENAI_API_KEY=sk-xxx
# Then reference by name in a profile's secrets list.

# Default profile used when no profile is specified
default_profile: default

profiles:
  # Default Python environment - good for most tasks
  default:
    image: python:3.11
    gpu: null
    cpu: 1.0
    memory: 2048
    min_pool: 1          # Keep 1 warm for fast response
    max_pool: 5
    idle_timeout: 120    # Modal terminates if idle 2 min
    max_lifetime: 3600   # Max 1 hour
    scale_down_idle: 180
    workdir: /workspace
    secrets: []          # Add secret names here: ["my-api-keys"]
    env_vars: {}         # Add env vars here: {DEBUG: "1"}
    use_dotenv: false    # Set to true to load local .env

  # PyTorch with GPU for ML training/inference
  pytorch-gpu:
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    gpu: T4              # Options: T4, A10G, A100, H100
    cpu: 4.0
    memory: 16384        # 16GB
    min_pool: 0          # Don't keep GPU sandboxes warm (expensive!)
    max_pool: 2
    idle_timeout: 60     # Shorter idle timeout for GPU (cost)
    max_lifetime: 1800   # 30 min max for GPU tasks
    scale_down_idle: 60
    workdir: /workspace
    # ML-specific secrets
    secrets:
      - huggingface-token   # HF_TOKEN env var
      - wandb-key           # WANDB_API_KEY env var
    env_vars:
      CUDA_VISIBLE_DEVICES: "0"
      PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"

  # High-end GPU for large models
  pytorch-a100:
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    gpu: A100
    cpu: 8.0
    memory: 65536        # 64GB
    min_pool: 0
    max_pool: 1          # Only 1 at a time (very expensive)
    idle_timeout: 30
    max_lifetime: 3600
    scale_down_idle: 30
    workdir: /workspace

  # Node.js for JavaScript/TypeScript tasks
  node:
    image: node:18
    gpu: null
    cpu: 1.0
    memory: 2048
    min_pool: 0          # Create on-demand
    max_pool: 3
    idle_timeout: 120
    max_lifetime: 3600
    scale_down_idle: 180
    workdir: /workspace

  # High memory for data processing
  high-memory:
    image: python:3.11
    gpu: null
    cpu: 4.0
    memory: 32768        # 32GB
    min_pool: 0
    max_pool: 2
    idle_timeout: 120
    max_lifetime: 3600
    scale_down_idle: 180
    workdir: /workspace

  # Rust development environment
  rust:
    image: rust:1.75
    gpu: null
    cpu: 2.0
    memory: 4096
    min_pool: 0
    max_pool: 2
    idle_timeout: 120
    max_lifetime: 3600
    scale_down_idle: 180
    workdir: /workspace

  # Go development environment
  golang:
    image: golang:1.21
    gpu: null
    cpu: 2.0
    memory: 4096
    min_pool: 0
    max_pool: 2
    idle_timeout: 120
    max_lifetime: 3600
    scale_down_idle: 180
    workdir: /workspace
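
A hedged sketch of how a parsed profiles mapping like the one above might be resolved into a concrete profile with defaults; the `ProfileConfig` fields mirror keys documented in the header, but this loader and its names are illustrative, not the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProfileConfig:
    # Defaults mirror the documented keys; unspecified keys fall back here.
    image: str = "python:3.11"
    gpu: Optional[str] = None
    cpu: float = 1.0
    memory: int = 2048
    min_pool: int = 0
    max_pool: int = 2
    secrets: list = field(default_factory=list)

def resolve_profile(config, name=None):
    """Pick a profile by name (or default_profile) and fill in defaults."""
    name = name or config.get("default_profile", "default")
    raw = config["profiles"][name]
    known = set(ProfileConfig.__dataclass_fields__)
    return ProfileConfig(**{k: v for k, v in raw.items() if k in known})

# A small already-parsed config (what yaml.safe_load would return).
cfg = {
    "default_profile": "default",
    "profiles": {
        "default": {"image": "python:3.11", "cpu": 1.0, "min_pool": 1},
        "pytorch-gpu": {"image": "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
                        "gpu": "T4", "cpu": 4.0, "memory": 16384},
    },
}
p = resolve_profile(cfg, "pytorch-gpu")
```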
37 nomad-dev.hcl Normal file
@@ -0,0 +1,37 @@
# Nomad Development Configuration (Hermes-Agent)
# Run with: nomad agent -dev -config=nomad-dev.hcl
#
# This is intended for local development only.

client {
  enabled = true

  options {
    # Enable Docker volume mounts for persistent slot workspaces
    "docker.volumes.enabled" = "true"
  }
}

# Docker driver plugin configuration
plugin "docker" {
  config {
    # CRITICAL: Enable volume mounts
    volumes {
      enabled = true
    }

    # Allow privileged containers if needed
    allow_privileged = false

    # Garbage collection settings
    gc {
      image = true
      # NOTE: For local dev we often rely on locally built images like `atropos-sandbox:local`.
      # A short image GC delay can delete these between runs, causing confusing "Failed to pull"
      # crash loops. Keep this comfortably long; tighten it for CI/production if needed.
      image_delay = "24h"
      container = true
    }
  }
}
31 nomad-singularity.hcl Normal file
@@ -0,0 +1,31 @@
# Nomad Configuration for Singularity/Apptainer Sandbox
# Run with: nomad agent -dev -config=nomad-singularity.hcl
#
# This uses the raw_exec driver to run Apptainer containers.
# Suitable for HPC environments where Docker cannot run without sudo.

client {
  enabled = true

  options {
    # Enable raw_exec driver for Singularity/Apptainer
    "driver.raw_exec.enable" = "1"
  }
}

# raw_exec driver plugin configuration
plugin "raw_exec" {
  config {
    enabled = true
  }
}

# Optional: If you have the nomad-driver-singularity plugin installed,
# uncomment the following instead of using raw_exec:
# plugin "singularity" {
#   config {
#     enabled = true
#     # Allow bind mounts
#     bind_paths = ["/tmp", "/var/tmp"]
#   }
# }
@@ -19,6 +19,7 @@ dependencies = [
     "rich",
     "tenacity",
     "pyyaml",
+    "prompt_toolkit",
     "requests",
     "jinja2",
     "pydantic>=2.0",
@@ -39,6 +40,19 @@ dev = ["pytest", "pytest-asyncio"]
 messaging = ["python-telegram-bot>=20.0", "discord.py>=2.0", "aiohttp>=3.9.0"]
 cron = ["croniter"]
 cli = ["simple-term-menu"]
+# Install Atropos + Tinker training integration from source.
+atropos = [
+    "atroposlib @ git+https://github.com/NousResearch/atropos.git",
+    "tinker @ git+https://github.com/thinking-machines-lab/tinker.git",
+    # Atropos integration runtime deps (kept optional for Hermes-only users)
+    "aiohttp",
+    "fastapi",
+    "uvicorn",
+    "pyte",
+    "torch",
+    "wandb",
+    "math-verify",
+]
 all = [
     "hermes-agent[modal]",
     "hermes-agent[messaging]",
@@ -50,9 +64,21 @@ all = [
 [project.scripts]
 hermes = "hermes_cli.main:main"
 hermes-agent = "run_agent:main"
+hermes-atropos-sandbox-smoke = "atropos.envs.sandbox_terminal_smoke_env:SandboxTerminalSmokeEnv.cli"
+hermes-atropos-toolserver-smoke = "atropos.envs.toolserver_smoke_env:ToolServerSmokeEnv.cli"
 
 [tool.setuptools]
-py-modules = ["run_agent", "model_tools", "toolsets", "batch_runner", "trajectory_compressor", "toolset_distributions", "cli"]
+py-modules = [
+    "run_agent",
+    "model_tools",
+    "toolsets",
+    "batch_runner",
+    "trajectory_compressor",
+    "toolset_distributions",
+    "atropos_compatible_agent",
+    "local_server",
+    "cli",
+]
 
 [tool.setuptools.packages.find]
-include = ["tools", "hermes_cli", "gateway", "cron"]
+include = ["tools", "hermes_cli", "gateway", "cron", "atropos", "atropos.*"]
44 run_agent.py
@@ -30,7 +30,6 @@ import threading
|
|||||||
import uuid
|
import uuid
|
||||||
from typing import List, Dict, Any, Optional
|
from typing import List, Dict, Any, Optional
|
||||||
from openai import OpenAI
|
from openai import OpenAI
|
||||||
import fire
|
|
||||||
from datetime import datetime
|
from datetime import datetime
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
@@ -1581,6 +1580,16 @@ class AIAgent:
         if active_system_prompt:
             # Insert system message at the beginning
             api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages
 
+        if os.getenv("HERMES_DEBUG_OPENAI_REQUEST") == "1":
+            meta = {
+                "model": self.model,
+                "base_url": self.base_url,
+                "messages": api_messages,
+                "tools": self.tools if self.tools else None,
+            }
+            print("\n=== HERMES_DEBUG_OPENAI_REQUEST ===", flush=True)
+            print(json.dumps(meta, ensure_ascii=False, indent=2)[:200_000], flush=True)
+
         # Calculate approximate request size for logging
         total_chars = sum(len(str(msg)) for msg in api_messages)
@@ -1594,12 +1603,13 @@ class AIAgent:
             print(f"{self.log_prefix} 📊 Request size: {len(api_messages)} messages, ~{approx_tokens:,} tokens (~{total_chars:,} chars)")
             print(f"{self.log_prefix} 🔧 Available tools: {len(self.tools) if self.tools else 0}")
         else:
-            # Animated thinking spinner in quiet mode
-            face = random.choice(KawaiiSpinner.KAWAII_THINKING)
-            verb = random.choice(KawaiiSpinner.THINKING_VERBS)
-            spinner_type = random.choice(['brain', 'sparkle', 'pulse', 'moon', 'star'])
-            thinking_spinner = KawaiiSpinner(f"{face} {verb}...", spinner_type=spinner_type)
-            thinking_spinner.start()
+            # Animated thinking spinner in quiet mode (disable for wrappers/non-TTY usage)
+            if os.getenv("HERMES_DISABLE_SPINNER") != "1":
+                face = random.choice(KawaiiSpinner.KAWAII_THINKING)
+                verb = random.choice(KawaiiSpinner.THINKING_VERBS)
+                spinner_type = random.choice(['brain', 'sparkle', 'pulse', 'moon', 'star'])
+                thinking_spinner = KawaiiSpinner(f"{face} {verb}...", spinner_type=spinner_type)
+                thinking_spinner.start()
 
         # Log request details if verbose
         if self.verbose_logging:
@@ -1659,6 +1669,14 @@ class AIAgent:
             api_kwargs["extra_body"] = extra_body
 
         response = self.client.chat.completions.create(**api_kwargs)
 
+        if os.getenv("HERMES_DEBUG_OPENAI_RESPONSE") == "1":
+            try:
+                dumped = response.model_dump()
+            except Exception:
+                dumped = getattr(response, "__dict__", {"repr": repr(response)})
+            print("\n=== HERMES_DEBUG_OPENAI_RESPONSE: ChatCompletion (raw) ===", flush=True)
+            print(json.dumps(dumped, ensure_ascii=False, indent=2), flush=True)
+
         api_duration = time.time() - api_start_time
 
@@ -2137,7 +2155,7 @@ class AIAgent:
|
|||||||
tool_start_time = time.time()
|
tool_start_time = time.time()
|
||||||
|
|
||||||
# Execute the tool - with animated spinner in quiet mode
|
# Execute the tool - with animated spinner in quiet mode
|
||||||
if self.quiet_mode:
|
if self.quiet_mode and os.getenv("HERMES_DISABLE_SPINNER") != "1":
|
||||||
# Tool-specific spinner animations
|
# Tool-specific spinner animations
|
||||||
tool_spinners = {
|
tool_spinners = {
|
||||||
'web_search': ('arrows', ['🔍', '🌐', '📡', '🔎']),
|
'web_search': ('arrows', ['🔍', '🌐', '📡', '🔎']),
|
||||||
@@ -2167,6 +2185,9 @@ class AIAgent:
|
|||||||
tool_duration = time.time() - tool_start_time
|
tool_duration = time.time() - tool_start_time
|
||||||
cute_msg = self._get_cute_tool_message(function_name, function_args, tool_duration)
|
cute_msg = self._get_cute_tool_message(function_name, function_args, tool_duration)
|
||||||
spinner.stop(cute_msg)
|
spinner.stop(cute_msg)
|
||||||
|
elif self.quiet_mode:
|
||||||
|
function_result = handle_function_call(function_name, function_args, effective_task_id)
|
||||||
|
tool_duration = time.time() - tool_start_time
|
||||||
else:
|
else:
|
||||||
function_result = handle_function_call(function_name, function_args, effective_task_id)
|
function_result = handle_function_call(function_name, function_args, effective_task_id)
|
||||||
tool_duration = time.time() - tool_start_time
|
tool_duration = time.time() - tool_start_time
|
||||||
@@ -2635,4 +2656,11 @@ def main(
|
|||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
try:
|
||||||
|
import fire # type: ignore
|
||||||
|
except ModuleNotFoundError as exc:
|
||||||
|
raise SystemExit(
|
||||||
|
"Missing optional dependency 'fire'. Install hermes-agent with its CLI extras or add `fire` "
|
||||||
|
f"to your environment. Original error: {exc}"
|
||||||
|
) from exc
|
||||||
fire.Fire(main)
|
fire.Fire(main)
|
||||||
|
|||||||
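The `HERMES_DEBUG_OPENAI_RESPONSE` hunk above uses a defensive serialization pattern: prefer the SDK object's `model_dump()`, fall back to its `__dict__`, and finally to `repr()`. A minimal standalone sketch of that pattern (the two response classes here are illustrative stand-ins, not types from the repo):

```python
import json


class DumpableResponse:
    """Stand-in for a pydantic-style SDK object that supports model_dump()."""

    def model_dump(self):
        return {"role": "assistant", "content": "hi"}


class PlainResponse:
    """Stand-in without model_dump(), to exercise the fallback path."""

    def __init__(self, content):
        self.content = content


def dump_response(response) -> str:
    """Serialize defensively: model_dump() first, then __dict__, then repr()."""
    try:
        dumped = response.model_dump()
    except Exception:
        dumped = getattr(response, "__dict__", {"repr": repr(response)})
    return json.dumps(dumped, ensure_ascii=False, indent=2)
```

The broad `except Exception` mirrors the diff: anything without a working `model_dump()` drops to the attribute-dict fallback instead of crashing a debug path.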
### scripts/launch_llama_cpp_glm47_flash.sh (new executable file, 62 lines)
```bash
#!/usr/bin/env bash
set -euo pipefail

# Launch a local llama.cpp OpenAI-compatible server running GLM-4.7-Flash (GGUF).
#
# Requires:
#   - `llama-server` installed (e.g. `brew install llama.cpp`)
#
# Default settings are chosen to avoid clashing with Atropos sandbox_server
# (which commonly uses port 8080 in local dev).
#
# Usage:
#   Hermes-Agent/scripts/launch_llama_cpp_glm47_flash.sh
#
# Override defaults:
#   LLAMA_CPP_HOST=127.0.0.1 LLAMA_CPP_PORT=8082 \
#   LLAMA_CPP_HF_REPO=ggml-org/GLM-4.7-Flash-GGUF \
#   LLAMA_CPP_HF_FILE=GLM-4.7-Flash-Q4_K.gguf \
#   Hermes-Agent/scripts/launch_llama_cpp_glm47_flash.sh

HOST="${LLAMA_CPP_HOST:-127.0.0.1}"
PORT="${LLAMA_CPP_PORT:-8080}"
HF_REPO="${LLAMA_CPP_HF_REPO:-ggml-org/GLM-4.7-Flash-GGUF}"
HF_FILE="${LLAMA_CPP_HF_FILE:-GLM-4.7-Flash-Q4_K.gguf}"
ALIAS="${LLAMA_CPP_ALIAS:-glm-4.7-flash}"

if ! command -v llama-server >/dev/null 2>&1; then
  echo "Error: llama-server not found in PATH."
  echo "Install via Homebrew: brew install llama.cpp"
  exit 1
fi

echo "Launching llama.cpp server..."
echo "  host:  $HOST"
echo "  port:  $PORT"
echo "  repo:  $HF_REPO"
echo "  file:  $HF_FILE"
echo "  alias: $ALIAS"
echo
echo "Suggested env vars for Hermes/Atropos integration:"
echo "  export ATROPOS_SERVER_BASE_URL=http://${HOST}:${PORT}"
echo "  export ATROPOS_SERVER_MODEL=${ALIAS}"
echo "  export ATROPOS_SERVER_API_KEY=local"
echo

if command -v lsof >/dev/null 2>&1; then
  if lsof -nP -iTCP:"$PORT" -sTCP:LISTEN >/dev/null 2>&1; then
    echo "Error: port $PORT is already in use."
    echo "Pick a different port, e.g.:"
    echo "  LLAMA_CPP_PORT=8082 Hermes-Agent/scripts/launch_llama_cpp_glm47_flash.sh"
    exit 1
  fi
fi

exec llama-server \
  --host "$HOST" \
  --port "$PORT" \
  --hf-repo "$HF_REPO" \
  --hf-file "$HF_FILE" \
  --alias "$ALIAS" \
  -c 32768 \
  -n -1
```
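The `lsof` guard in the script has a rough cross-platform analogue: a TCP connect probe. A self-contained sketch of that check (not part of the repo; `port_in_use` is an illustrative helper):

```python
import socket


def port_in_use(host: str, port: int) -> bool:
    """Return True if something is listening on host:port (TCP connect probe)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on a successful connection, an errno otherwise.
        return sock.connect_ex((host, port)) == 0


# Demo: bind a listener on an OS-assigned free port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
busy_port = listener.getsockname()[1]
```

Unlike `lsof -sTCP:LISTEN`, a connect probe also reports remote-reachable ports, which is usually what a "can I launch here?" check actually wants.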
### scripts/launch_llama_cpp_hermes_4_36b.sh (new executable file, 70 lines)
```bash
#!/usr/bin/env bash
set -euo pipefail

# Launch a local llama.cpp OpenAI-compatible server running Hermes 4.3 36B (GGUF).
#
# Requires:
#   - `llama-server` installed (e.g. `brew install llama.cpp`)
#
# Note: Port choice can conflict with other local dev servers. If 8080 is already
# in use, override via `LLAMA_CPP_PORT=...`.
#
# Usage:
#   Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
#
# Override defaults:
#   LLAMA_CPP_HOST=127.0.0.1 LLAMA_CPP_PORT=8082 \
#   LLAMA_CPP_HF_REPO=NousResearch/Hermes-4.3-36B-GGUF \
#   LLAMA_CPP_HF_FILE=hermes-4_3_36b-Q4_K_M.gguf \
#   LLAMA_CPP_ALIAS=hermes-4-36b \
#   LLAMA_CPP_PARALLEL=4 LLAMA_CPP_THREADS_HTTP=4 \
#   Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh

HOST="${LLAMA_CPP_HOST:-127.0.0.1}"
PORT="${LLAMA_CPP_PORT:-8080}"
HF_REPO="${LLAMA_CPP_HF_REPO:-NousResearch/Hermes-4.3-36B-GGUF}"
HF_FILE="${LLAMA_CPP_HF_FILE:-hermes-4_3_36b-Q4_K_M.gguf}"
ALIAS="${LLAMA_CPP_ALIAS:-hermes-4-36b}"
PARALLEL="${LLAMA_CPP_PARALLEL:-4}"
THREADS_HTTP="${LLAMA_CPP_THREADS_HTTP:-4}"

if ! command -v llama-server >/dev/null 2>&1; then
  echo "Error: llama-server not found in PATH."
  echo "Install via Homebrew: brew install llama.cpp"
  exit 1
fi

echo "Launching llama.cpp server..."
echo "  host:  $HOST"
echo "  port:  $PORT"
echo "  repo:  $HF_REPO"
echo "  file:  $HF_FILE"
echo "  alias: $ALIAS"
echo "  slots: $PARALLEL"
echo
echo "Suggested env vars for Hermes/Atropos integration:"
echo "  export ATROPOS_SERVER_BASE_URL=http://${HOST}:${PORT}"
echo "  export ATROPOS_SERVER_MODEL=${ALIAS}"
echo "  export ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B"
echo "  export ATROPOS_SERVER_API_KEY=local"
echo

if command -v lsof >/dev/null 2>&1; then
  if lsof -nP -iTCP:"$PORT" -sTCP:LISTEN >/dev/null 2>&1; then
    echo "Error: port $PORT is already in use."
    echo "Pick a different port, e.g.:"
    echo "  LLAMA_CPP_PORT=8082 Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh"
    exit 1
  fi
fi

exec llama-server \
  --host "$HOST" \
  --port "$PORT" \
  --hf-repo "$HF_REPO" \
  --hf-file "$HF_FILE" \
  --alias "$ALIAS" \
  --parallel "$PARALLEL" \
  --threads-http "$THREADS_HTTP" \
  -c 32768 \
  -n -1
```
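Both launch scripts lean on bash's `${VAR:-default}` expansion for configuration. A small illustrative Python sketch of the same rule (note that `:-` also substitutes when the variable is set but empty, which `os.environ.get(name, default)` alone would not):

```python
import os


def env_default(name: str, default: str) -> str:
    """Mimic bash's "${NAME:-default}": fall back when unset *or* empty."""
    value = os.environ.get(name)
    return value if value else default


os.environ.pop("LLAMA_CPP_PORT", None)
port_unset = env_default("LLAMA_CPP_PORT", "8080")   # unset -> default

os.environ["LLAMA_CPP_PORT"] = ""
port_empty = env_default("LLAMA_CPP_PORT", "8080")   # empty -> default

os.environ["LLAMA_CPP_PORT"] = "8082"
port_set = env_default("LLAMA_CPP_PORT", "8080")     # set -> override wins
```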
### tests/test_data/checkpoint_test_dataset.jsonl (new file, 15 lines)
{"prompt": "Test prompt 0: What is 2+2? Just answer briefly.", "test_id": 0}
|
||||||
|
{"prompt": "Test prompt 1: What is 2+2? Just answer briefly.", "test_id": 1}
|
||||||
|
{"prompt": "Test prompt 2: What is 2+2? Just answer briefly.", "test_id": 2}
|
||||||
|
{"prompt": "Test prompt 3: What is 2+2? Just answer briefly.", "test_id": 3}
|
||||||
|
{"prompt": "Test prompt 4: What is 2+2? Just answer briefly.", "test_id": 4}
|
||||||
|
{"prompt": "Test prompt 5: What is 2+2? Just answer briefly.", "test_id": 5}
|
||||||
|
{"prompt": "Test prompt 6: What is 2+2? Just answer briefly.", "test_id": 6}
|
||||||
|
{"prompt": "Test prompt 7: What is 2+2? Just answer briefly.", "test_id": 7}
|
||||||
|
{"prompt": "Test prompt 8: What is 2+2? Just answer briefly.", "test_id": 8}
|
||||||
|
{"prompt": "Test prompt 9: What is 2+2? Just answer briefly.", "test_id": 9}
|
||||||
|
{"prompt": "Test prompt 10: What is 2+2? Just answer briefly.", "test_id": 10}
|
||||||
|
{"prompt": "Test prompt 11: What is 2+2? Just answer briefly.", "test_id": 11}
|
||||||
|
{"prompt": "Test prompt 12: What is 2+2? Just answer briefly.", "test_id": 12}
|
||||||
|
{"prompt": "Test prompt 13: What is 2+2? Just answer briefly.", "test_id": 13}
|
||||||
|
{"prompt": "Test prompt 14: What is 2+2? Just answer briefly.", "test_id": 14}
|
||||||
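The fixture is uniform enough to regenerate programmatically; a sketch of the record shape and the JSONL round trip (`make_dataset` is an illustrative helper, not a function from the repo):

```python
import json


def make_dataset(n: int):
    """Regenerate n checkpoint-test records with the shape shown above."""
    return [
        {"prompt": f"Test prompt {i}: What is 2+2? Just answer briefly.", "test_id": i}
        for i in range(n)
    ]


# JSONL is one JSON object per line: serialize each record independently.
lines = [json.dumps(rec) for rec in make_dataset(15)]
parsed = [json.loads(line) for line in lines]
```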
### tests/test_data/checkpoint_test_resume_partial.jsonl (new file, 5 lines)
{"prompt": "Test prompt 0: What is 2+2? Just answer briefly.", "test_id": 0}
|
||||||
|
{"prompt": "Test prompt 1: What is 2+2? Just answer briefly.", "test_id": 1}
|
||||||
|
{"prompt": "Test prompt 2: What is 2+2? Just answer briefly.", "test_id": 2}
|
||||||
|
{"prompt": "Test prompt 3: What is 2+2? Just answer briefly.", "test_id": 3}
|
||||||
|
{"prompt": "Test prompt 4: What is 2+2? Just answer briefly.", "test_id": 4}
|
||||||
### tests/test_modal_integration.py (new file, 1082 lines)

File diff suppressed because it is too large.
### tests/test_modal_stress.py (new file, 923 lines)
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Modal Integration Stress Tests & Full Integration Tests
|
||||||
|
|
||||||
|
This test suite includes:
|
||||||
|
1. Stress tests for Modal sandbox pools (concurrent load, scaling)
|
||||||
|
2. Atropos backend tests (requires atroposlib)
|
||||||
|
3. mini-swe-agent integration tests
|
||||||
|
|
||||||
|
Prerequisites:
|
||||||
|
# Install dev dependencies
|
||||||
|
pip install -e '.[dev,modal]'
|
||||||
|
|
||||||
|
# Install atroposlib for Atropos tests
|
||||||
|
pip install -e '.[atropos]'
|
||||||
|
|
||||||
|
# Clone mini-swe-agent (if not present)
|
||||||
|
git clone https://github.com/anthropics/mini-swe-agent.git mini-swe-agent
|
||||||
|
# Or as submodule:
|
||||||
|
git submodule add https://github.com/anthropics/mini-swe-agent.git mini-swe-agent
|
||||||
|
|
||||||
|
Run with:
|
||||||
|
# All tests
|
||||||
|
python tests/test_modal_stress.py
|
||||||
|
|
||||||
|
# Stress tests only
|
||||||
|
python tests/test_modal_stress.py --category stress
|
||||||
|
|
||||||
|
# Atropos tests only
|
||||||
|
python tests/test_modal_stress.py --category atropos
|
||||||
|
|
||||||
|
# Mini-swe-agent tests only
|
||||||
|
python tests/test_modal_stress.py --category miniswe
|
||||||
|
|
||||||
|
# Dry run (no Modal calls)
|
||||||
|
python tests/test_modal_stress.py --dry-run
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
import random
|
||||||
|
import traceback
|
||||||
|
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Dict, Any, List, Optional, Tuple
|
||||||
|
from dataclasses import dataclass
|
||||||
|
|
||||||
|
# Add parent to path for imports
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent))
|
||||||
|
|
||||||
|
|
||||||
|
# =============================================================================
|
||||||
|
# Test Configuration
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class StressTestConfig:
|
||||||
|
dry_run: bool = False
|
||||||
|
verbose: bool = True
|
||||||
|
category: Optional[str] = None
|
||||||
|
# Stress test parameters (reduced defaults for faster first-run)
|
||||||
|
concurrent_tasks: int = 3 # Start small - Modal cold starts are slow
|
||||||
|
total_operations: int = 10
|
||||||
|
max_sandboxes: int = 3
|
||||||
|
slots_per_sandbox: int = 3
|
||||||
|
|
||||||
|
|
||||||
|
# =============================================================================
|
||||||
|
# Test Results Tracking
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
class TestResults:
|
||||||
|
def __init__(self):
|
||||||
|
self.passed: List[str] = []
|
||||||
|
self.failed: List[Tuple[str, str]] = []
|
||||||
|
self.skipped: List[Tuple[str, str]] = []
|
||||||
|
self.metrics: Dict[str, Any] = {}
|
||||||
|
|
||||||
|
def record_pass(self, name: str, metrics: Optional[Dict] = None):
|
||||||
|
self.passed.append(name)
|
||||||
|
if metrics:
|
||||||
|
self.metrics[name] = metrics
|
||||||
|
print(f" ✅ {name}")
|
||||||
|
if metrics:
|
||||||
|
for k, v in metrics.items():
|
||||||
|
print(f" 📊 {k}: {v}")
|
||||||
|
|
||||||
|
def record_fail(self, name: str, error: str):
|
||||||
|
self.failed.append((name, error))
|
||||||
|
print(f" ❌ {name}: {error}")
|
||||||
|
|
||||||
|
def record_skip(self, name: str, reason: str):
|
||||||
|
self.skipped.append((name, reason))
|
||||||
|
print(f" ⏭️ {name}: {reason}")
|
||||||
|
|
||||||
|
def summary(self):
|
||||||
|
total = len(self.passed) + len(self.failed) + len(self.skipped)
|
||||||
|
print(f"\n{'='*70}")
|
||||||
|
print(f"STRESS TEST RESULTS: {len(self.passed)}/{total} passed")
|
||||||
|
print(f" Passed: {len(self.passed)}")
|
||||||
|
print(f" Failed: {len(self.failed)}")
|
||||||
|
print(f" Skipped: {len(self.skipped)}")
|
||||||
|
|
||||||
|
if self.failed:
|
||||||
|
print(f"\nFailed tests:")
|
||||||
|
for name, error in self.failed:
|
||||||
|
print(f" - {name}: {error}")
|
||||||
|
|
||||||
|
if self.metrics:
|
||||||
|
print(f"\nPerformance Metrics:")
|
||||||
|
for test, metrics in self.metrics.items():
|
||||||
|
print(f" {test}:")
|
||||||
|
for k, v in metrics.items():
|
||||||
|
print(f" - {k}: {v}")
|
||||||
|
|
||||||
|
return len(self.failed) == 0
|
||||||
|
|
||||||
|
|
||||||
|
results = TestResults()
|
||||||
|
|
||||||
|
|
||||||
|
# =============================================================================
|
||||||
|
# Helper: Atropos Import
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
def try_import_atropos():
|
||||||
|
"""Try importing Atropos backend components."""
|
||||||
|
try:
|
||||||
|
from atropos.backends.modal_backend import (
|
||||||
|
ModalToolBackend, ModalSandboxConfig,
|
||||||
|
_ModalMultiProfileManager
|
||||||
|
)
|
||||||
|
from atropos.slots.slot import Slot, SlotState
|
||||||
|
return ModalToolBackend, ModalSandboxConfig, Slot, SlotState
|
||||||
|
except (ImportError, ModuleNotFoundError) as e:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def try_import_miniswe():
|
||||||
|
"""Try importing mini-swe-agent components."""
|
||||||
|
try:
|
||||||
|
# Check if mini-swe-agent path exists and has content
|
||||||
|
mini_swe_path = Path(__file__).parent.parent / "mini-swe-agent" / "src"
|
||||||
|
if mini_swe_path.exists() and list(mini_swe_path.iterdir()):
|
||||||
|
sys.path.insert(0, str(mini_swe_path))
|
||||||
|
import minisweagent
|
||||||
|
return minisweagent
|
||||||
|
return None
|
||||||
|
except (ImportError, ModuleNotFoundError) as e:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
# =============================================================================
|
||||||
|
# CATEGORY 1: Stress Tests (Terminal Tool)
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
def test_stress_concurrent_tasks(config: StressTestConfig):
|
||||||
|
"""Stress test: Multiple concurrent task_ids hitting the pool."""
|
||||||
|
if config.dry_run:
|
||||||
|
results.record_skip("test_stress_concurrent_tasks", "Dry run mode")
|
||||||
|
return
|
||||||
|
|
||||||
|
from tools.terminal_tool import terminal_tool, cleanup_vm
|
||||||
|
|
||||||
|
original_env = os.environ.get("TERMINAL_ENV")
|
||||||
|
os.environ["TERMINAL_ENV"] = "modal"
|
||||||
|
|
||||||
|
try:
|
||||||
|
num_tasks = config.concurrent_tasks
|
||||||
|
task_ids = [f"stress-concurrent-{i}-{int(time.time())}" for i in range(num_tasks)]
|
||||||
|
|
||||||
|
start_time = time.time()
|
||||||
|
errors = []
|
||||||
|
successes = 0
|
||||||
|
|
||||||
|
def run_task(task_id: str) -> Tuple[bool, str]:
|
||||||
|
try:
|
||||||
|
result = json.loads(terminal_tool(
|
||||||
|
f"echo 'Hello from {task_id}' && sleep 0.5",
|
||||||
|
task_id=task_id,
|
||||||
|
))
|
||||||
|
success = result["exit_code"] == 0
|
||||||
|
|
||||||
|
# IMPORTANT: Clean up immediately after task completes
|
||||||
|
# This releases the sandbox back to the pool for other tasks
|
||||||
|
try:
|
||||||
|
cleanup_vm(task_id)
|
||||||
|
except:
|
||||||
|
pass
|
||||||
|
|
||||||
|
if success:
|
||||||
|
return True, ""
|
||||||
|
# Include more details for debugging
|
||||||
|
error_detail = result.get("error", "no error message")
|
||||||
|
output = result.get("output", "")[:100] # First 100 chars
|
||||||
|
return False, f"Exit code: {result['exit_code']}, error: {error_detail}, output: {output}"
|
||||||
|
except Exception as e:
|
||||||
|
# Clean up even on failure
|
||||||
|
try:
|
||||||
|
cleanup_vm(task_id)
|
||||||
|
except:
|
||||||
|
pass
|
||||||
|
import traceback
|
||||||
|
return False, f"Exception: {str(e)}\n{traceback.format_exc()}"
|
||||||
|
|
||||||
|
# Run all tasks concurrently using threads
|
||||||
|
with ThreadPoolExecutor(max_workers=num_tasks) as executor:
|
||||||
|
futures = {executor.submit(run_task, tid): tid for tid in task_ids}
|
||||||
|
|
||||||
|
for future in as_completed(futures):
|
||||||
|
task_id = futures[future]
|
||||||
|
try:
|
||||||
|
success, error = future.result(timeout=60)
|
||||||
|
if success:
|
||||||
|
successes += 1
|
||||||
|
else:
|
||||||
|
errors.append(f"{task_id}: {error}")
|
||||||
|
except Exception as e:
|
||||||
|
errors.append(f"{task_id}: {str(e)}")
|
||||||
|
|
||||||
|
elapsed = time.time() - start_time
|
||||||
|
|
||||||
|
# No need for cleanup here - each task cleans up immediately
|
||||||
|
|
||||||
|
# Report
|
||||||
|
success_rate = successes / num_tasks * 100
|
||||||
|
|
||||||
|
if success_rate >= 90: # Allow 10% failure rate for stress test
|
||||||
|
results.record_pass("test_stress_concurrent_tasks", {
|
||||||
|
"concurrent_tasks": num_tasks,
|
||||||
|
"successes": successes,
|
||||||
|
"failures": len(errors),
|
||||||
|
"success_rate": f"{success_rate:.1f}%",
|
||||||
|
"total_time": f"{elapsed:.2f}s",
|
||||||
|
"avg_time_per_task": f"{elapsed/num_tasks:.2f}s",
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
results.record_fail(
|
||||||
|
"test_stress_concurrent_tasks",
|
||||||
|
f"Success rate {success_rate:.1f}% < 90%. Errors: {errors[:3]}"
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
results.record_fail("test_stress_concurrent_tasks", str(e))
|
||||||
|
finally:
|
||||||
|
if original_env:
|
||||||
|
os.environ["TERMINAL_ENV"] = original_env
|
||||||
|
elif "TERMINAL_ENV" in os.environ:
|
||||||
|
del os.environ["TERMINAL_ENV"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_stress_rapid_fire(config: StressTestConfig):
|
||||||
|
"""Stress test: Rapid sequential commands to same task_id."""
|
||||||
|
if config.dry_run:
|
||||||
|
results.record_skip("test_stress_rapid_fire", "Dry run mode")
|
||||||
|
return
|
||||||
|
|
||||||
|
from tools.terminal_tool import terminal_tool, cleanup_vm
|
||||||
|
|
||||||
|
original_env = os.environ.get("TERMINAL_ENV")
|
||||||
|
os.environ["TERMINAL_ENV"] = "modal"
|
||||||
|
|
||||||
|
try:
|
||||||
|
task_id = f"stress-rapid-{int(time.time())}"
|
||||||
|
num_commands = config.total_operations
|
||||||
|
|
||||||
|
start_time = time.time()
|
||||||
|
successes = 0
|
||||||
|
errors = []
|
||||||
|
|
||||||
|
for i in range(num_commands):
|
||||||
|
try:
|
||||||
|
result = json.loads(terminal_tool(f"echo {i}", task_id=task_id))
|
||||||
|
if result["exit_code"] == 0 and str(i) in result["output"]:
|
||||||
|
successes += 1
|
||||||
|
else:
|
||||||
|
errors.append(f"Command {i}: unexpected result")
|
||||||
|
except Exception as e:
|
||||||
|
errors.append(f"Command {i}: {str(e)}")
|
||||||
|
|
||||||
|
elapsed = time.time() - start_time
|
||||||
|
cleanup_vm(task_id)
|
||||||
|
|
||||||
|
success_rate = successes / num_commands * 100
|
||||||
|
commands_per_second = num_commands / elapsed
|
||||||
|
|
||||||
|
if success_rate >= 95:
|
||||||
|
results.record_pass("test_stress_rapid_fire", {
|
||||||
|
"total_commands": num_commands,
|
||||||
|
"successes": successes,
|
||||||
|
"success_rate": f"{success_rate:.1f}%",
|
||||||
|
"total_time": f"{elapsed:.2f}s",
|
||||||
|
"commands_per_second": f"{commands_per_second:.1f}",
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
results.record_fail(
|
||||||
|
"test_stress_rapid_fire",
|
||||||
|
f"Success rate {success_rate:.1f}% < 95%"
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
results.record_fail("test_stress_rapid_fire", str(e))
|
||||||
|
finally:
|
||||||
|
if original_env:
|
||||||
|
os.environ["TERMINAL_ENV"] = original_env
|
||||||
|
elif "TERMINAL_ENV" in os.environ:
|
||||||
|
del os.environ["TERMINAL_ENV"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_stress_pool_scaling(config: StressTestConfig):
|
||||||
|
"""Stress test: Force pool to scale up and down by running tasks in batches."""
|
||||||
|
if config.dry_run:
|
||||||
|
results.record_skip("test_stress_pool_scaling", "Dry run mode")
|
||||||
|
return
|
||||||
|
|
||||||
|
from tools.terminal_tool import terminal_tool, cleanup_vm, _ModalPoolManager
|
||||||
|
|
||||||
|
original_env = os.environ.get("TERMINAL_ENV")
|
||||||
|
os.environ["TERMINAL_ENV"] = "modal"
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Run tasks in batches matching max_sandboxes to test pool reuse
|
||||||
|
# This verifies sandboxes can be acquired, used, released, and reused
|
||||||
|
batch_size = config.max_sandboxes
|
||||||
|
num_batches = 3
|
||||||
|
total_tasks = batch_size * num_batches
|
||||||
|
|
||||||
|
start_time = time.time()
|
||||||
|
successes = 0
|
||||||
|
|
||||||
|
for batch in range(num_batches):
|
||||||
|
task_ids = [f"stress-scale-{batch}-{i}-{int(time.time())}" for i in range(batch_size)]
|
||||||
|
|
||||||
|
def run_task(task_id: str):
|
||||||
|
try:
|
||||||
|
result = json.loads(terminal_tool(
|
||||||
|
"echo done", # Fast command to test scaling
|
||||||
|
task_id=task_id,
|
||||||
|
))
|
||||||
|
success = result["exit_code"] == 0
|
||||||
|
try:
|
||||||
|
cleanup_vm(task_id)
|
||||||
|
except:
|
||||||
|
pass
|
||||||
|
return success
|
||||||
|
except:
|
||||||
|
try:
|
||||||
|
cleanup_vm(task_id)
|
||||||
|
except:
|
||||||
|
pass
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Run batch concurrently
|
||||||
|
with ThreadPoolExecutor(max_workers=batch_size) as executor:
|
||||||
|
batch_results = list(executor.map(run_task, task_ids))
|
||||||
|
successes += sum(batch_results)
|
||||||
|
|
||||||
|
elapsed = time.time() - start_time
|
||||||
|
|
||||||
|
# Check pool status
|
||||||
|
try:
|
||||||
|
manager = _ModalPoolManager.get_instance()
|
||||||
|
pool_status = manager.get_status() if hasattr(manager, 'get_status') else {}
|
||||||
|
except:
|
||||||
|
pool_status = {}
|
||||||
|
|
||||||
|
success_rate = successes / total_tasks * 100
|
||||||
|
|
||||||
|
if success_rate >= 80: # Allow some tolerance
|
||||||
|
results.record_pass("test_stress_pool_scaling", {
|
||||||
|
"total_tasks": total_tasks,
|
||||||
|
"num_batches": num_batches,
|
||||||
|
"batch_size": batch_size,
|
||||||
|
"successes": successes,
|
||||||
|
"success_rate": f"{success_rate:.1f}%",
|
||||||
|
"total_time": f"{elapsed:.2f}s",
|
||||||
|
"pool_status": pool_status,
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
results.record_fail(
|
||||||
|
"test_stress_pool_scaling",
|
||||||
|
f"Success rate {success_rate:.1f}% < 80%"
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
results.record_fail("test_stress_pool_scaling", str(e))
|
||||||
|
finally:
|
||||||
|
if original_env:
|
||||||
|
os.environ["TERMINAL_ENV"] = original_env
|
||||||
|
elif "TERMINAL_ENV" in os.environ:
|
||||||
|
del os.environ["TERMINAL_ENV"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_stress_large_output(config: StressTestConfig):
|
||||||
|
"""Stress test: Commands producing large output."""
|
||||||
|
if config.dry_run:
|
||||||
|
results.record_skip("test_stress_large_output", "Dry run mode")
|
||||||
|
return
|
||||||
|
|
||||||
|
from tools.terminal_tool import terminal_tool, cleanup_vm
|
||||||
|
|
||||||
|
original_env = os.environ.get("TERMINAL_ENV")
|
||||||
|
os.environ["TERMINAL_ENV"] = "modal"
|
||||||
|
|
||||||
|
try:
|
||||||
|
task_id = f"stress-large-{int(time.time())}"
|
||||||
|
|
||||||
|
# First verify basic connectivity with simple command
|
||||||
|
warmup = json.loads(terminal_tool("echo warmup", task_id=task_id))
|
||||||
|
if warmup["exit_code"] != 0:
|
||||||
|
results.record_fail(
|
||||||
|
"test_stress_large_output",
|
||||||
|
f"Warmup failed: {warmup.get('error', 'unknown')}"
|
||||||
|
)
|
||||||
|
return
|
||||||
|
|
||||||
|
# Generate output - use seq which is more portable
|
||||||
|
start_time = time.time()
|
||||||
|
result = json.loads(terminal_tool(
|
||||||
|
'seq 1 500 | while read i; do echo "Line $i: This is test content for large output"; done',
|
||||||
|
task_id=task_id,
|
||||||
|
timeout=60,
|
||||||
|
))
|
||||||
|
elapsed = time.time() - start_time
|
||||||
|
|
||||||
|
cleanup_vm(task_id)
|
||||||
|
|
||||||
|
output_size = len(result.get("output", ""))
|
||||||
|
error_msg = result.get("error", "")
|
||||||
|
|
||||||
|
if result["exit_code"] == 0 and output_size > 5000:
|
||||||
|
results.record_pass("test_stress_large_output", {
|
||||||
|
"output_size": f"{output_size:,} bytes",
|
||||||
|
"time": f"{elapsed:.2f}s",
|
||||||
|
"throughput": f"{output_size/elapsed/1024:.1f} KB/s" if elapsed > 0 else "N/A",
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
results.record_fail(
|
||||||
|
"test_stress_large_output",
|
||||||
|
f"Exit code: {result['exit_code']}, output size: {output_size}, error: {error_msg}"
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
import traceback
|
||||||
|
results.record_fail("test_stress_large_output", f"{str(e)}\n{traceback.format_exc()}")
|
||||||
|
finally:
|
||||||
|
try:
|
||||||
|
cleanup_vm(task_id)
|
||||||
|
except:
|
||||||
|
pass
|
||||||
|
if original_env:
|
||||||
|
os.environ["TERMINAL_ENV"] = original_env
|
||||||
|
elif "TERMINAL_ENV" in os.environ:
|
||||||
|
del os.environ["TERMINAL_ENV"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_stress_error_recovery(config: StressTestConfig):
|
||||||
|
"""Stress test: Commands that fail and verify sandbox continues working."""
|
||||||
|
if config.dry_run:
|
||||||
|
results.record_skip("test_stress_error_recovery", "Dry run mode")
|
||||||
|
return
|
||||||
|
|
||||||
|
from tools.terminal_tool import terminal_tool, cleanup_vm
|
||||||
|
|
||||||
|
original_env = os.environ.get("TERMINAL_ENV")
|
||||||
|
os.environ["TERMINAL_ENV"] = "modal"
|
||||||
|
|
||||||
|
try:
|
||||||
|
task_id = f"stress-error-{int(time.time())}"
|
||||||
|
|
||||||
|
# Run some failing commands
|
||||||
|
failing_commands = [
|
||||||
|
"exit 1",
|
||||||
|
"false",
|
||||||
|
"cat /nonexistent/file",
|
||||||
|
"command_that_does_not_exist",
|
||||||
|
]
|
||||||
|
|
||||||
|
for cmd in failing_commands:
|
||||||
|
result = json.loads(terminal_tool(cmd, task_id=task_id))
|
||||||
|
# These should fail but not crash
|
||||||
|
assert result["exit_code"] != 0 or result.get("error"), f"Expected failure for: {cmd}"
|
||||||
|
|
||||||
|
# Now run a command that should succeed
|
||||||
|
result = json.loads(terminal_tool("echo 'recovery success'", task_id=task_id))
|
||||||
|
|
||||||
|
cleanup_vm(task_id)
|
||||||
|
|
||||||
|
if result["exit_code"] == 0 and "recovery success" in result["output"]:
|
||||||
|
results.record_pass("test_stress_error_recovery", {
|
||||||
|
"failed_commands": len(failing_commands),
|
||||||
|
"recovery": "success",
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
results.record_fail(
|
||||||
|
"test_stress_error_recovery",
|
||||||
|
f"Recovery failed: {result}"
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
results.record_fail("test_stress_error_recovery", str(e))
|
||||||
|
finally:
|
||||||
|
if original_env:
|
||||||
|
os.environ["TERMINAL_ENV"] = original_env
|
||||||
|
elif "TERMINAL_ENV" in os.environ:
|
||||||
|
del os.environ["TERMINAL_ENV"]
|
||||||
|
|
||||||
|
|
||||||
|
# =============================================================================
|
||||||
|
# CATEGORY 2: Atropos Backend Stress Tests
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
async def test_atropos_stress_slot_churn(config: StressTestConfig):
    """Atropos stress test: Rapid slot acquire/release cycles."""
    if config.dry_run:
        results.record_skip("test_atropos_stress_slot_churn", "Dry run mode")
        return

    imports = try_import_atropos()
    if imports is None:
        results.record_skip("test_atropos_stress_slot_churn", "Requires atroposlib")
        return

    ModalToolBackend, ModalSandboxConfig, _, _ = imports

    try:
        backend_config = ModalSandboxConfig(
            app_name=f"stress-churn-{int(time.time())}",
            min_sandboxes=1,
            max_sandboxes=3,
            slots_per_sandbox=5,
        )

        backend = ModalToolBackend(backend_config)
        await backend.start()

        try:
            num_cycles = config.total_operations
            start_time = time.time()
            successes = 0

            for i in range(num_cycles):
                try:
                    slot = await backend.acquire(f"churn-{i}")

                    # Quick command
                    results_list = await backend.execute_batch([
                        (slot, "bash", {"command": f"echo {i}"})
                    ])

                    if results_list[0].success:
                        successes += 1

                    await backend.release(slot, reset_workspace=(i % 5 == 0))
                except Exception:
                    pass  # Count as failure

            elapsed = time.time() - start_time
            success_rate = successes / num_cycles * 100

            if success_rate >= 90:
                results.record_pass("test_atropos_stress_slot_churn", {
                    "cycles": num_cycles,
                    "successes": successes,
                    "success_rate": f"{success_rate:.1f}%",
                    "total_time": f"{elapsed:.2f}s",
                    "cycles_per_second": f"{num_cycles/elapsed:.1f}",
                })
            else:
                results.record_fail(
                    "test_atropos_stress_slot_churn",
                    f"Success rate {success_rate:.1f}% < 90%"
                )

        finally:
            await backend.stop(purge=True)

    except Exception as e:
        results.record_fail("test_atropos_stress_slot_churn", str(e))


async def test_atropos_stress_parallel_batches(config: StressTestConfig):
    """Atropos stress test: Multiple parallel batch executions."""
    if config.dry_run:
        results.record_skip("test_atropos_stress_parallel_batches", "Dry run mode")
        return

    imports = try_import_atropos()
    if imports is None:
        results.record_skip("test_atropos_stress_parallel_batches", "Requires atroposlib")
        return

    ModalToolBackend, ModalSandboxConfig, _, _ = imports

    try:
        backend_config = ModalSandboxConfig(
            app_name=f"stress-batch-{int(time.time())}",
            min_sandboxes=2,
            max_sandboxes=4,
            slots_per_sandbox=5,
        )

        backend = ModalToolBackend(backend_config)
        await backend.start()

        try:
            num_slots = 10
            slots = []

            # Acquire multiple slots
            for i in range(num_slots):
                slot = await backend.acquire(f"batch-{i}")
                slots.append(slot)

            # Run multiple batches in parallel
            start_time = time.time()
            num_batches = 5

            async def run_batch(batch_id: int):
                requests = [
                    (slot, "bash", {"command": f"echo 'batch{batch_id}-slot{i}'"})
                    for i, slot in enumerate(slots)
                ]
                return await backend.execute_batch(requests)

            batch_tasks = [run_batch(i) for i in range(num_batches)]
            all_results = await asyncio.gather(*batch_tasks)

            elapsed = time.time() - start_time

            # Count successes
            total_commands = num_batches * num_slots
            successes = sum(
                1 for batch_result in all_results
                for r in batch_result
                if r.success
            )

            # Release slots
            for slot in slots:
                await backend.release(slot)

            success_rate = successes / total_commands * 100

            if success_rate >= 90:
                results.record_pass("test_atropos_stress_parallel_batches", {
                    "batches": num_batches,
                    "slots": num_slots,
                    "total_commands": total_commands,
                    "successes": successes,
                    "success_rate": f"{success_rate:.1f}%",
                    "total_time": f"{elapsed:.2f}s",
                    "commands_per_second": f"{total_commands/elapsed:.1f}",
                })
            else:
                results.record_fail(
                    "test_atropos_stress_parallel_batches",
                    f"Success rate {success_rate:.1f}% < 90%"
                )

        finally:
            await backend.stop(purge=True)

    except Exception as e:
        results.record_fail("test_atropos_stress_parallel_batches", str(e))


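# ---------------------------------------------------------------------------
# The parallel-batch test above fans out several coroutines with
# asyncio.gather and then aggregates their results. A dependency-free sketch
# of the same fan-out/aggregate shape (illustrative only; the real work is
# backend.execute_batch, stubbed here with asyncio.sleep):
import asyncio


async def _demo_run_batch(batch_id: int, n_slots: int) -> list:
    await asyncio.sleep(0)  # stand-in for `await backend.execute_batch(...)`
    return [f"batch{batch_id}-slot{i}" for i in range(n_slots)]


async def _demo_fan_out(num_batches: int, n_slots: int) -> int:
    # Schedule all batches concurrently, then count total commands executed.
    batches = await asyncio.gather(
        *(_demo_run_batch(b, n_slots) for b in range(num_batches))
    )
    return sum(len(b) for b in batches)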
async def test_atropos_stress_multi_profile_load(config: StressTestConfig):
    """Atropos stress test: Load across multiple profiles."""
    if config.dry_run:
        results.record_skip("test_atropos_stress_multi_profile_load", "Dry run mode")
        return

    imports = try_import_atropos()
    if imports is None:
        results.record_skip("test_atropos_stress_multi_profile_load", "Requires atroposlib")
        return

    ModalToolBackend, ModalSandboxConfig, _, _ = imports

    try:
        backend = ModalToolBackend.with_profiles(
            app_name=f"stress-multiprofile-{int(time.time())}",
            profiles={
                "cpu-light": ModalSandboxConfig(
                    name="cpu-light",
                    cpu=0.5,
                    memory=1024,
                    min_sandboxes=1,
                    max_sandboxes=2,
                    slots_per_sandbox=5,
                ),
                "cpu-heavy": ModalSandboxConfig(
                    name="cpu-heavy",
                    cpu=2.0,
                    memory=4096,
                    min_sandboxes=0,
                    max_sandboxes=2,
                    slots_per_sandbox=3,
                ),
            }
        )

        await backend.start(profiles_to_start=["cpu-light", "cpu-heavy"])

        try:
            num_tasks_per_profile = 5
            slots = []

            # Acquire from both profiles
            for i in range(num_tasks_per_profile):
                light_slot = await backend.acquire(f"light-{i}", profile="cpu-light")
                heavy_slot = await backend.acquire(f"heavy-{i}", profile="cpu-heavy")
                slots.append((light_slot, "cpu-light"))
                slots.append((heavy_slot, "cpu-heavy"))

            # Execute batch across all profiles
            start_time = time.time()

            requests = [
                (slot, "bash", {"command": f"echo 'profile={profile}'"})
                for slot, profile in slots
            ]

            batch_results = await backend.execute_batch(requests)
            elapsed = time.time() - start_time

            successes = sum(1 for r in batch_results if r.success)

            # Release all
            for slot, _ in slots:
                await backend.release(slot)

            status = backend.get_status()

            success_rate = successes / len(slots) * 100

            if success_rate >= 90:
                results.record_pass("test_atropos_stress_multi_profile_load", {
                    "profiles": 2,
                    "tasks_per_profile": num_tasks_per_profile,
                    "total_tasks": len(slots),
                    "successes": successes,
                    "success_rate": f"{success_rate:.1f}%",
                    "time": f"{elapsed:.2f}s",
                    "status": status,
                })
            else:
                results.record_fail(
                    "test_atropos_stress_multi_profile_load",
                    f"Success rate {success_rate:.1f}% < 90%"
                )

        finally:
            await backend.stop(purge=True)

    except Exception as e:
        results.record_fail("test_atropos_stress_multi_profile_load", str(e))


# =============================================================================
# CATEGORY 3: Mini-SWE-Agent Integration Tests
# =============================================================================


def test_miniswe_environment_available():
    """Check if mini-swe-agent is properly set up."""
    mini_swe_path = Path(__file__).parent.parent / "mini-swe-agent" / "src"

    if not mini_swe_path.exists():
        results.record_skip(
            "test_miniswe_environment_available",
            "mini-swe-agent not found. Run: git clone https://github.com/anthropics/mini-swe-agent.git mini-swe-agent"
        )
        return

    if not list(mini_swe_path.iterdir()):
        results.record_skip(
            "test_miniswe_environment_available",
            "mini-swe-agent directory is empty. Run: git submodule update --init"
        )
        return

    miniswe = try_import_miniswe()
    if miniswe is None:
        results.record_fail(
            "test_miniswe_environment_available",
            "Failed to import minisweagent module"
        )
        return

    results.record_pass("test_miniswe_environment_available", {
        "path": str(mini_swe_path),
        "module": miniswe.__name__,
    })


def test_miniswe_modal_backend(config: StressTestConfig):
    """Test mini-swe-agent with Modal backend."""
    if config.dry_run:
        results.record_skip("test_miniswe_modal_backend", "Dry run mode")
        return

    miniswe = try_import_miniswe()
    if miniswe is None:
        results.record_skip(
            "test_miniswe_modal_backend",
            "mini-swe-agent not available"
        )
        return

    try:
        # Check if ModalEnvironment exists in minisweagent
        if not hasattr(miniswe, 'ModalEnvironment'):
            results.record_skip(
                "test_miniswe_modal_backend",
                "minisweagent.ModalEnvironment not found"
            )
            return

        # Create Modal environment
        env = miniswe.ModalEnvironment(
            image="python:3.11",
            timeout=60,
        )

        # Execute a command
        result = env.execute("echo 'Hello from mini-swe-agent Modal'")

        env.cleanup()

        if "Hello from mini-swe-agent Modal" in str(result):
            results.record_pass("test_miniswe_modal_backend")
        else:
            results.record_fail(
                "test_miniswe_modal_backend",
                f"Unexpected result: {result}"
            )

    except Exception as e:
        results.record_fail("test_miniswe_modal_backend", str(e))


# =============================================================================
# Test Runner
# =============================================================================


def run_sync_tests(config: StressTestConfig):
    """Run synchronous tests."""
    if config.category in (None, "stress"):
        print("\n" + "="*70)
        print("STRESS TESTS (Terminal Tool)")
        print("="*70)

        test_stress_concurrent_tasks(config)
        test_stress_rapid_fire(config)
        test_stress_pool_scaling(config)
        test_stress_large_output(config)
        test_stress_error_recovery(config)

    if config.category in (None, "miniswe"):
        print("\n" + "="*70)
        print("MINI-SWE-AGENT INTEGRATION TESTS")
        print("="*70)

        test_miniswe_environment_available()
        test_miniswe_modal_backend(config)


async def run_async_tests(config: StressTestConfig):
    """Run asynchronous tests."""
    if config.category in (None, "atropos"):
        print("\n" + "="*70)
        print("ATROPOS BACKEND STRESS TESTS")
        print("="*70)

        await test_atropos_stress_slot_churn(config)
        await test_atropos_stress_parallel_batches(config)
        await test_atropos_stress_multi_profile_load(config)


def main():
    import argparse

    parser = argparse.ArgumentParser(description="Modal Stress Test Suite")
    parser.add_argument("--dry-run", action="store_true", help="Skip tests requiring Modal")
    parser.add_argument("--category", choices=["stress", "atropos", "miniswe"], help="Run specific category")
    parser.add_argument("--concurrent", type=int, default=10, help="Number of concurrent tasks")
    parser.add_argument("--operations", type=int, default=50, help="Total operations for stress tests")
    parser.add_argument("--verbose", action="store_true", default=True)
    args = parser.parse_args()

    config = StressTestConfig(
        dry_run=args.dry_run,
        verbose=args.verbose,
        category=args.category,
        concurrent_tasks=args.concurrent,
        total_operations=args.operations,
    )

    print("="*70)
    print("MODAL STRESS & INTEGRATION TEST SUITE")
    print("="*70)
    print(f"Mode: {'DRY RUN' if config.dry_run else 'LIVE'}")
    print(f"Category: {config.category or 'ALL'}")
    print(f"Concurrent tasks: {config.concurrent_tasks}")
    print(f"Total operations: {config.total_operations}")

    # Run sync tests
    run_sync_tests(config)

    # Run async tests
    asyncio.run(run_async_tests(config))

    # Summary
    success = results.summary()
    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()
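# ---------------------------------------------------------------------------
# Every stress test above gates pass/fail on a 90% success-rate threshold.
# A minimal standalone sketch of that gating logic (illustrative only; the
# real tests also attach timing metadata via results.record_pass/record_fail):
def _meets_threshold(successes: int, total: int, threshold: float = 90.0) -> bool:
    """Return True when successes/total reaches `threshold` percent."""
    if total <= 0:
        return False  # avoid division by zero; an empty run never passes
    return successes / total * 100 >= threshold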
@@ -236,6 +236,63 @@ def test_environment_isolation():
     return isolated
 
 
+def test_pool_status():
+    """Test that the Modal pool manager reports status correctly."""
+    print("\n" + "=" * 60)
+    print("TEST 7: Pool Status")
+    print("=" * 60)
+
+    try:
+        # Import pool manager
+        _ModalPoolManager = terminal_module._ModalPoolManager
+
+        # Get pool manager instance
+        manager = _ModalPoolManager.get_instance()
+        status = manager.get_status()
+
+        print(f"\nPool Manager Status:")
+        print(f" App name: {manager.app_name}")
+        print(f" Default profile: {manager.default_profile}")
+        print(f" Available profiles: {list(manager.profiles.keys())}")
+        print(f" Active pools: {list(status.keys())}")
+
+        for pool_name, pool_status in status.items():
+            print(f"\n Pool '{pool_name}':")
+            print(f" Size: {pool_status['pool_size']}/{pool_status['max_pool']}")
+            print(f" In use: {pool_status['in_use']}")
+            print(f" Min pool: {pool_status['min_pool']}")
+
+        print(f"\nTest: ✅ Passed")
+        return True
+    except Exception as e:
+        print(f"\nError: {e}")
+        print(f"\nTest: ❌ Failed")
+        return False
+
+
+def test_profile_selection():
+    """Test that profile parameter is accepted (even if profile doesn't exist)."""
+    print("\n" + "=" * 60)
+    print("TEST 8: Profile Selection")
+    print("=" * 60)
+
+    test_task_id = "modal_test_profile"
+
+    # Test with default profile (no profile specified)
+    print("Testing with default profile...")
+    result = terminal_tool("echo 'default profile'", task_id=test_task_id)
+    result_json = json.loads(result)
+
+    success = result_json.get('exit_code') == 0
+    print(f" Default profile: {'✅' if success else '❌'} (exit code: {result_json.get('exit_code')})")
+
+    # Cleanup
+    cleanup_vm(test_task_id)
+
+    print(f"\nTest: {'✅ Passed' if success else '❌ Failed'}")
+    return success
+
+
 def main():
     """Run all Modal terminal tests."""
     print("🧪 Modal Terminal Tool Test Suite")
@@ -247,6 +304,8 @@ def main():
     print(f" TERMINAL_ENV: {config['env_type']}")
     print(f" TERMINAL_MODAL_IMAGE: {config['modal_image']}")
     print(f" TERMINAL_TIMEOUT: {config['timeout']}s")
+    print(f" TERMINAL_MODAL_APP_NAME: {os.getenv('TERMINAL_MODAL_APP_NAME', 'hermes-sandbox')}")
+    print(f" TERMINAL_MODAL_DEFAULT_PROFILE: {os.getenv('TERMINAL_MODAL_DEFAULT_PROFILE', 'default')}")
 
     if config['env_type'] != 'modal':
         print(f"\n⚠️ WARNING: TERMINAL_ENV is set to '{config['env_type']}', not 'modal'")
@@ -270,6 +329,8 @@ def main():
     results['pip_install'] = test_pip_install()
     results['filesystem_persistence'] = test_filesystem_persistence()
     results['environment_isolation'] = test_environment_isolation()
+    results['pool_status'] = test_pool_status()
+    results['profile_selection'] = test_profile_selection()
 
     # Summary
     print("\n" + "=" * 60)
tests/test_tool_call_parsing.py (Normal file, 31 lines)
@@ -0,0 +1,31 @@
+from __future__ import annotations
+
+from atropos.tools.base import ToolCall
+
+
+def test_parse_tool_call_json_wrapper() -> None:
+    text = '<tool_call>{"name":"terminal","arguments":{"command":"pwd"}}</tool_call>'
+    calls = ToolCall.parse_from_text(text)
+    assert len(calls) == 1
+    assert calls[0].name == "terminal"
+    assert calls[0].arguments == {"command": "pwd"}
+
+
+def test_parse_tool_call_glm_style() -> None:
+    text = '<tool_call>terminal{"command":"ls -la"}</tool_call>'
+    calls = ToolCall.parse_from_text(text)
+    assert len(calls) == 1
+    assert calls[0].name == "terminal"
+    assert calls[0].arguments == {"command": "ls -la"}
+
+
+def test_parse_tool_call_missing_close_tag() -> None:
+    text = '<tool_call>terminal{"command":"echo hi"}'
+    calls = ToolCall.parse_from_text(text)
+    assert calls == []
+
+
+def test_parse_tool_call_strips_accidental_xml() -> None:
+    text = '<tool_call>terminal{"command":"ls -la"}</arg_value></tool_call>'
+    calls = ToolCall.parse_from_text(text)
+    assert calls == []
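# ---------------------------------------------------------------------------
# The tests above exercise two <tool_call> wire formats. A minimal reference
# parser for just the JSON-wrapper form (a sketch; the real
# ToolCall.parse_from_text in atropos.tools.base also handles the GLM-style
# name{json} form and the rejection cases shown above):
import json
import re


def _parse_json_wrapper(text: str) -> list:
    calls = []
    for body in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue  # GLM-style bodies are not bare JSON and are skipped here
        if isinstance(obj, dict) and "name" in obj:
            calls.append(obj)
    return calls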
@@ -16,14 +16,6 @@ The tools are imported into model_tools.py which provides a unified interface
 for the AI agent to access all capabilities.
 """
 
-# Export all tools for easy importing
-from .web_tools import (
-    web_search_tool,
-    web_extract_tool,
-    web_crawl_tool,
-    check_firecrawl_api_key
-)
-
 # Primary terminal tool (mini-swe-agent backend: local/docker/singularity/modal)
 from .terminal_tool import (
     terminal_tool,
@@ -34,54 +26,106 @@ from .terminal_tool import (
     TERMINAL_TOOL_DESCRIPTION
 )
 
-# Alternative terminal tool (Hecate/MorphCloud cloud VMs)
-from .terminal_hecate import (
-    terminal_hecate_tool,
-    check_hecate_requirements,
-    TERMINAL_HECATE_DESCRIPTION
-)
-
-from .vision_tools import (
-    vision_analyze_tool,
-    check_vision_requirements
-)
-
-from .mixture_of_agents_tool import (
-    mixture_of_agents_tool,
-    check_moa_requirements
-)
-
-from .image_generation_tool import (
-    image_generate_tool,
-    check_image_generation_requirements
-)
-
-from .skills_tool import (
-    skills_categories,
-    skills_list,
-    skill_view,
-    check_skills_requirements,
-    SKILLS_TOOL_DESCRIPTION
-)
-
-# Browser automation tools (agent-browser + Browserbase)
-from .browser_tool import (
-    browser_navigate,
-    browser_snapshot,
-    browser_click,
-    browser_type,
-    browser_scroll,
-    browser_back,
-    browser_press,
-    browser_close,
-    browser_get_images,
-    browser_vision,
-    cleanup_browser,
-    cleanup_all_browsers,
-    get_active_browser_sessions,
-    check_browser_requirements,
-    BROWSER_TOOL_SCHEMAS
-)
+# Optional toolsets: keep imports soft so users can run subsets of tools without
+# installing every dependency (requirements gating lives in model_tools.py).
+try:
+    from .web_tools import check_firecrawl_api_key, web_crawl_tool, web_extract_tool, web_search_tool
+except ModuleNotFoundError:  # pragma: no cover
+    web_search_tool = None  # type: ignore[assignment]
+    web_extract_tool = None  # type: ignore[assignment]
+    web_crawl_tool = None  # type: ignore[assignment]
+
+    def check_firecrawl_api_key() -> bool:  # type: ignore[no-redef]
+        return False
+
+
+try:
+    # Alternative terminal tool (Hecate/MorphCloud cloud VMs)
+    from .terminal_hecate import TERMINAL_HECATE_DESCRIPTION, check_hecate_requirements, terminal_hecate_tool
+except ModuleNotFoundError:  # pragma: no cover
+    terminal_hecate_tool = None  # type: ignore[assignment]
+    TERMINAL_HECATE_DESCRIPTION = ""
+
+    def check_hecate_requirements() -> bool:  # type: ignore[no-redef]
+        return False
+
+
+try:
+    from .vision_tools import check_vision_requirements, vision_analyze_tool
+except ModuleNotFoundError:  # pragma: no cover
+    vision_analyze_tool = None  # type: ignore[assignment]
+
+    def check_vision_requirements() -> bool:  # type: ignore[no-redef]
+        return False
+
+
+try:
+    from .mixture_of_agents_tool import check_moa_requirements, mixture_of_agents_tool
+except ModuleNotFoundError:  # pragma: no cover
+    mixture_of_agents_tool = None  # type: ignore[assignment]
+
+    def check_moa_requirements() -> bool:  # type: ignore[no-redef]
+        return False
+
+
+try:
+    from .image_generation_tool import check_image_generation_requirements, image_generate_tool
+except ModuleNotFoundError:  # pragma: no cover
+    image_generate_tool = None  # type: ignore[assignment]
+
+    def check_image_generation_requirements() -> bool:  # type: ignore[no-redef]
+        return False
+
+
+try:
+    from .skills_tool import (
+        SKILLS_TOOL_DESCRIPTION,
+        check_skills_requirements,
+        skill_view,
+        skills_categories,
+        skills_list,
+    )
+except ModuleNotFoundError:  # pragma: no cover
+    skills_categories = None  # type: ignore[assignment]
+    skills_list = None  # type: ignore[assignment]
+    skill_view = None  # type: ignore[assignment]
+    SKILLS_TOOL_DESCRIPTION = ""
+
+    def check_skills_requirements() -> bool:  # type: ignore[no-redef]
+        return False
+
+
+try:
+    # Browser automation tools (agent-browser + Browserbase)
+    from .browser_tool import (
+        BROWSER_TOOL_SCHEMAS,
+        browser_back,
+        browser_click,
+        browser_close,
+        browser_get_images,
+        browser_navigate,
+        browser_press,
+        browser_scroll,
+        browser_snapshot,
+        browser_type,
+        browser_vision,
+        check_browser_requirements,
+        cleanup_all_browsers,
+        cleanup_browser,
+        get_active_browser_sessions,
+    )
+except ModuleNotFoundError:  # pragma: no cover
+    browser_navigate = None  # type: ignore[assignment]
+    browser_snapshot = None  # type: ignore[assignment]
+    browser_click = None  # type: ignore[assignment]
+    browser_type = None  # type: ignore[assignment]
+    browser_scroll = None  # type: ignore[assignment]
+    browser_back = None  # type: ignore[assignment]
+    browser_press = None  # type: ignore[assignment]
+    browser_close = None  # type: ignore[assignment]
+    browser_get_images = None  # type: ignore[assignment]
+    browser_vision = None  # type: ignore[assignment]
+    cleanup_browser = None  # type: ignore[assignment]
+    cleanup_all_browsers = None  # type: ignore[assignment]
+    get_active_browser_sessions = None  # type: ignore[assignment]
+    BROWSER_TOOL_SCHEMAS = []
+
+    def check_browser_requirements() -> bool:  # type: ignore[no-redef]
+        return False
 
 # Cronjob management tools (CLI-only, hermes-cli toolset)
 from .cronjob_tools import (
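# ---------------------------------------------------------------------------
# The hunk above converts hard imports into soft ones: a missing optional
# dependency leaves the tool symbols as None and its check_* probe returning
# False instead of crashing the package import. Generic sketch of the same
# pattern (the module name below is deliberately nonexistent, so the fallback
# branch runs):
try:
    import _hermes_missing_optional_dep  # hypothetical optional dependency
except ModuleNotFoundError:
    _hermes_missing_optional_dep = None


def _check_optional_dep() -> bool:
    # Callers can probe availability instead of handling ImportError themselves.
    return _hermes_missing_optional_dep is not None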
@@ -206,4 +250,3 @@ __all__ = [
     'clear_file_ops_cache',
     'check_file_requirements',
 ]
-
@@ -36,8 +36,14 @@ import shutil
 import subprocess
 import tempfile
 import uuid
+from dataclasses import dataclass, field
 from pathlib import Path
-from typing import Optional, Dict, Any
+from typing import Optional, Dict, Any, ClassVar, List
+
+try:
+    import yaml
+except ImportError:
+    yaml = None
 
 # Add mini-swe-agent to path if not installed
 mini_swe_path = Path(__file__).parent.parent / "mini-swe-agent" / "src"
@@ -951,38 +957,585 @@ class _DockerEnvironment:
|
|||||||
pass
|
pass
|
||||||
|
|
||||||
|
|
||||||
class _ModalEnvironment:
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ModalProfile:
|
||||||
"""
|
"""
|
||||||
Modal cloud execution environment wrapper with sudo support.
|
Configuration for a Modal sandbox profile.
|
||||||
|
|
||||||
Wraps mini-swe-agent's SwerexModalEnvironment but adds:
|
Each profile defines the container image, resources, and pool scaling behavior.
|
||||||
- SUDO_PASSWORD support via _transform_sudo_command
|
Different profiles can be used for different workloads
|
||||||
|
|
||||||
Note: stdin handling is not needed for Modal since it uses remote async execution.
|
Secrets:
|
||||||
|
secrets: List of Modal Secret names to inject into the sandbox.
|
||||||
|
These secrets must be created on Modal dashboard or via CLI.
|
||||||
|
|
||||||
|
env_vars: Dict of environment variables to pass directly to sandbox.
|
||||||
|
Use for non-sensitive configuration.
|
||||||
|
Example: {"DEBUG": "1", "LOG_LEVEL": "info"}
|
||||||
|
|
||||||
|
use_dotenv: loads local dotenv
|
||||||
|
"""
|
||||||
|
name: str
|
||||||
|
image: str = "python:3.11"
|
||||||
|
gpu: Optional[str] = None # None, "T4", "A10G", "A100", "H100"
|
||||||
|
cpu: float = 1.0
|
||||||
|
memory: int = 2048 # MB
|
||||||
|
min_pool: int = 1
|
||||||
|
max_pool: int = 5
|
||||||
|
idle_timeout: int = 120 # Modal server-side auto-cleanup (seconds)
|
||||||
|
max_lifetime: int = 3600 # Max sandbox lifetime (seconds)
|
||||||
|
scale_down_idle: int = 180 # Client-side scale down threshold (seconds)
|
||||||
|
workdir: str = "/workspace"
|
||||||
|
# Secrets and environment variables
|
||||||
|
secrets: List[str] = field(default_factory=list) # Modal Secret names
|
||||||
|
env_vars: Dict[str, str] = field(default_factory=dict) # Direct env vars
|
||||||
|
use_dotenv: bool = False # Load .env file and pass to sandbox
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_env(cls, profile_name: str) -> "ModalProfile":
|
||||||
|
"""Load profile configuration from environment variables."""
|
||||||
|
prefix = f"TERMINAL_MODAL_PROFILE_{profile_name}_"
|
||||||
|
|
||||||
|
# Parse secrets list from comma-separated string
|
||||||
|
secrets_str = os.getenv(f"{prefix}SECRETS", "")
|
||||||
|
secrets = [s.strip() for s in secrets_str.split(",") if s.strip()]
|
||||||
|
|
||||||
|
# Parse env_vars from KEY=VALUE pairs separated by semicolons
|
||||||
|
env_vars_str = os.getenv(f"{prefix}ENV_VARS", "")
|
||||||
|
env_vars = {}
|
||||||
|
if env_vars_str:
|
||||||
|
for pair in env_vars_str.split(";"):
|
||||||
|
if "=" in pair:
|
||||||
|
k, v = pair.split("=", 1)
|
||||||
|
env_vars[k.strip()] = v.strip()
|
||||||
|
|
||||||
|
return cls(
|
||||||
|
name=profile_name,
|
||||||
|
image=os.getenv(f"{prefix}IMAGE", "python:3.11"),
|
||||||
|
gpu=os.getenv(f"{prefix}GPU"),
|
||||||
|
cpu=float(os.getenv(f"{prefix}CPU", "1.0")),
|
||||||
|
memory=int(os.getenv(f"{prefix}MEMORY", "2048")),
|
||||||
|
min_pool=int(os.getenv(f"{prefix}MIN_POOL", "1")),
|
||||||
|
max_pool=int(os.getenv(f"{prefix}MAX_POOL", "5")),
|
||||||
|
idle_timeout=int(os.getenv(f"{prefix}IDLE_TIMEOUT", "120")),
|
||||||
|
max_lifetime=int(os.getenv(f"{prefix}MAX_LIFETIME", "3600")),
|
||||||
|
scale_down_idle=int(os.getenv(f"{prefix}SCALE_DOWN_IDLE", "180")),
|
||||||
|
workdir=os.getenv(f"{prefix}WORKDIR", "/workspace"),
|
||||||
|
secrets=secrets,
|
||||||
|
env_vars=env_vars,
|
||||||
|
use_dotenv=os.getenv(f"{prefix}USE_DOTENV", "").lower() in ("true", "1", "yes"),
|
||||||
|
)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def load_profiles(cls, config_file: Optional[str] = None) -> Dict[str, "ModalProfile"]:
|
||||||
|
"""
|
||||||
|
Load all profiles from YAML file or environment variables.
|
||||||
|
|
||||||
|
Priority:
|
||||||
|
1. YAML file specified by config_file or TERMINAL_MODAL_PROFILES_FILE
|
||||||
|
2. Environment variables with TERMINAL_MODAL_PROFILE_<name>_* pattern
|
||||||
|
3. Default profile with basic settings
|
||||||
|
"""
|
||||||
|
profiles = {}
|
||||||
|
|
||||||
|
# Try YAML file first
|
||||||
|
yaml_path = config_file or os.getenv("TERMINAL_MODAL_PROFILES_FILE", "modal_profiles.yaml")
|
||||||
|
if Path(yaml_path).exists():
|
||||||
|
try:
|
||||||
|
with open(yaml_path) as f:
|
||||||
|
config = yaml.safe_load(f)
|
||||||
|
for name, cfg in config.get("profiles", {}).items():
|
||||||
|
profiles[name] = cls(name=name, **cfg)
|
||||||
|
if not os.getenv("HERMES_QUIET"):
|
||||||
|
print(f"[Modal] Loaded {len(profiles)} profiles from {yaml_path}")
|
||||||
|
return profiles
|
||||||
|
except Exception as e:
|
||||||
|
if not os.getenv("HERMES_QUIET"):
|
||||||
|
print(f"[Modal] Warning: Failed to load {yaml_path}: {e}")
|
||||||
|
|
||||||
|
# Check for environment variable profiles
|
||||||
|
# Look for any env vars starting with TERMINAL_MODAL_PROFILE_
|
||||||
|
profile_names = set()
|
||||||
|
for key in os.environ:
|
||||||
|
if key.startswith("TERMINAL_MODAL_PROFILE_") and "_IMAGE" in key:
|
||||||
|
# Extract profile name: TERMINAL_MODAL_PROFILE_<name>_IMAGE
|
||||||
|
parts = key.replace("TERMINAL_MODAL_PROFILE_", "").rsplit("_IMAGE", 1)
|
||||||
|
if parts[0]:
|
||||||
|
profile_names.add(parts[0])
|
||||||
|
|
||||||
|
for name in profile_names:
|
||||||
|
profiles[name] = cls.from_env(name)
|
||||||
|
|
||||||
|
# If no profiles found, create a default one
|
||||||
|
if not profiles:
|
||||||
|
default_name = os.getenv("TERMINAL_MODAL_DEFAULT_PROFILE", "default")
|
||||||
|
profiles[default_name] = cls(
|
||||||
|
name=default_name,
|
||||||
|
image=os.getenv("TERMINAL_MODAL_IMAGE", "python:3.11"),
|
||||||
|
min_pool=int(os.getenv("TERMINAL_MODAL_MIN_POOL", "1")),
|
||||||
|
max_pool=int(os.getenv("TERMINAL_MODAL_MAX_POOL", "5")),
|
||||||
|
idle_timeout=int(os.getenv("TERMINAL_MODAL_IDLE_TIMEOUT", "120")),
|
||||||
|
max_lifetime=int(os.getenv("TERMINAL_MODAL_MAX_LIFETIME", "3600")),
|
||||||
|
scale_down_idle=int(os.getenv("TERMINAL_MODAL_SCALE_DOWN_IDLE", "180")),
|
||||||
|
)
|
||||||
|
|
||||||
|
return profiles
|
||||||
|
|
||||||
|
|
||||||
|
class _ModalSandboxPool:
    """
    Auto-scaling pool of warm Modal sandboxes for a single profile.

    Features:
    - Named sandboxes for recovery after restart
    - Reactive scale-up when demand exceeds capacity
    - Background scale-down when sandboxes are idle
    - Server-side idle_timeout for orphan protection
    """

    def __init__(self, profile: ModalProfile, app_name: str):
        self.profile = profile
        self.app_name = app_name
        self._app = None
        self._modal_image = None
        self._pool: Dict[str, Any] = {}  # sandbox_name -> modal.Sandbox
        self._in_use: Dict[str, str] = {}  # task_id -> sandbox_name
        self._last_used: Dict[str, float] = {}  # sandbox_name -> timestamp
        self._lock = threading.Lock()
        self._running = True
        self._next_index = 0

        # Start scale-down monitor if min_pool > 0 (worth keeping warm)
        self._monitor_thread = None
        if profile.min_pool > 0 or profile.max_pool > 0:
            self._monitor_thread = threading.Thread(
                target=self._scale_down_monitor,
                daemon=True,
                name=f"modal-pool-{profile.name}"
            )
            self._monitor_thread.start()

    def _get_sandbox_name(self, index: int) -> str:
        """Generate a unique sandbox name for this profile."""
        return f"hermes-{self.profile.name}-{index}"

    def _ensure_app(self):
        """Lazy initialization of Modal app and image."""
        if self._app is None:
            try:
                import modal
                self._app = modal.App.lookup(self.app_name, create_if_missing=True)
                self._modal_image = modal.Image.from_registry(self.profile.image)
            except ImportError:
                raise ImportError("Modal package not installed. Run: pip install modal")

    def _recover_or_create_sandbox(self, name: str) -> Any:
        """
        Try to recover an existing named sandbox, or create a new one.

        Uses Modal's named sandbox feature for recovery after Hermes restart.
        Supports Modal Secrets for secure credential injection.
        """
        import modal

        # Try to recover existing sandbox
        try:
            sb = modal.Sandbox.from_name(self.app_name, name)
            if sb.poll() is None:  # Still running
                # Health check - verify sandbox is responsive
                try:
                    sb.exec("echo", "ok", timeout=10)
                    if not os.getenv("HERMES_QUIET"):
                        print(f"[Modal] Recovered existing sandbox: {name}")
                    return sb
                except Exception:
                    # Sandbox is not healthy, will create new
                    pass
        except modal.exception.NotFoundError:
            pass
        except Exception as e:
            if not os.getenv("HERMES_QUIET"):
                print(f"[Modal] Could not recover sandbox {name}: {e}")

        # Build create kwargs based on profile
        create_kwargs = {
            "app": self._app,
            "name": name,
            "image": self._modal_image,
            "timeout": self.profile.max_lifetime,
            "idle_timeout": self.profile.idle_timeout,
            "workdir": self.profile.workdir,
        }

        # Add resource specs
        if self.profile.cpu != 1.0:
            create_kwargs["cpu"] = self.profile.cpu
        if self.profile.memory != 2048:
            create_kwargs["memory"] = self.profile.memory

        # Add GPU if specified
        if self.profile.gpu:
            create_kwargs["gpu"] = self.profile.gpu

        # Build secrets list
        secrets_list = []

        # Add named secrets from Modal dashboard/CLI
        for secret_name in self.profile.secrets:
            try:
                secrets_list.append(modal.Secret.from_name(secret_name))
                if not os.getenv("HERMES_QUIET"):
                    print(f"[Modal] Adding secret: {secret_name}")
            except Exception as e:
                if not os.getenv("HERMES_QUIET"):
                    print(f"[Modal] Warning: Could not load secret '{secret_name}': {e}")

        # Add direct environment variables
        if self.profile.env_vars:
            secrets_list.append(modal.Secret.from_dict(self.profile.env_vars))

        # Add .env file if requested
        if self.profile.use_dotenv:
            try:
                secrets_list.append(modal.Secret.from_dotenv())
                if not os.getenv("HERMES_QUIET"):
                    print("[Modal] Loading .env file into sandbox")
            except Exception as e:
                if not os.getenv("HERMES_QUIET"):
                    print(f"[Modal] Warning: Could not load .env file: {e}")

        # Add global secrets from environment variable
        global_secrets_str = os.getenv("TERMINAL_MODAL_SECRETS", "")
        if global_secrets_str:
            for secret_name in global_secrets_str.split(","):
                secret_name = secret_name.strip()
                if secret_name and secret_name not in self.profile.secrets:
                    try:
                        secrets_list.append(modal.Secret.from_name(secret_name))
                    except Exception as e:
                        if not os.getenv("HERMES_QUIET"):
                            print(f"[Modal] Warning: Could not load global secret '{secret_name}': {e}")

        if secrets_list:
            create_kwargs["secrets"] = secrets_list

        if not os.getenv("HERMES_QUIET"):
            gpu_str = f" with GPU={self.profile.gpu}" if self.profile.gpu else ""
            secrets_str = f" with {len(secrets_list)} secret(s)" if secrets_list else ""
            print(f"[Modal] Creating sandbox: {name}{gpu_str}{secrets_str}")

        return modal.Sandbox.create(**create_kwargs)

    def _find_available_slot(self) -> Optional[str]:
        """Find an available sandbox in the pool (not currently in use)."""
        in_use_names = set(self._in_use.values())
        # Iterate over a snapshot so dead sandboxes can be removed mid-loop
        for name in list(self._pool):
            if name not in in_use_names:
                # Verify sandbox is still running
                try:
                    if self._pool[name].poll() is None:
                        return name
                    else:
                        # Sandbox died, remove it
                        del self._pool[name]
                        self._last_used.pop(name, None)
                except Exception:
                    pass
        return None

    def _current_size(self) -> int:
        """Get current pool size."""
        return len(self._pool)

    def acquire(self, task_id: str, timeout: float = 60.0) -> Any:
        """
        Acquire a sandbox for a task.

        - Returns existing sandbox if task already has one
        - Finds available sandbox in pool if any
        - Scales up if under max_pool and all busy
        - Waits if at max_pool and all busy
        """
        deadline = time.time() + timeout

        while True:
            with self._lock:
                # Task already has a sandbox?
                if task_id in self._in_use:
                    name = self._in_use[task_id]
                    self._last_used[name] = time.time()
                    return self._pool[name]

                self._ensure_app()

                # Find available slot in pool
                available = self._find_available_slot()
                if available:
                    self._in_use[task_id] = available
                    self._last_used[available] = time.time()
                    return self._pool[available]

                # Scale up if under max
                if self._current_size() < self.profile.max_pool:
                    name = self._get_sandbox_name(self._next_index)
                    self._next_index += 1
                    try:
                        sb = self._recover_or_create_sandbox(name)
                        self._pool[name] = sb
                        self._in_use[task_id] = name
                        self._last_used[name] = time.time()
                        return sb
                    except Exception as e:
                        if not os.getenv("HERMES_QUIET"):
                            print(f"[Modal] Failed to create sandbox: {e}")
                        raise

            # At capacity - wait and retry
            if time.time() > deadline:
                raise TimeoutError(
                    f"No Modal sandbox available for profile '{self.profile.name}' "
                    f"within {timeout}s (pool size: {self._current_size()}/{self.profile.max_pool})"
                )
            time.sleep(0.5)

    def release(self, task_id: str, terminate: bool = False):
        """
        Release a sandbox back to the pool.

        If terminate=False, sandbox stays warm for reuse.
        If terminate=True, sandbox is terminated immediately.
        """
        with self._lock:
            if task_id not in self._in_use:
                return

            name = self._in_use.pop(task_id)
            self._last_used[name] = time.time()

            if terminate:
                self._terminate_sandbox(name)

    def _terminate_sandbox(self, name: str, during_shutdown: bool = False):
        """Terminate and remove a sandbox from the pool."""
        if name in self._pool:
            try:
                self._pool[name].terminate()
                if not os.getenv("HERMES_QUIET"):
                    print(f"[Modal] Terminated sandbox: {name}")
            except Exception as e:
                if not during_shutdown and not os.getenv("HERMES_QUIET"):
                    print(f"[Modal] Error terminating {name}: {e}")
            del self._pool[name]
            self._last_used.pop(name, None)

    def _scale_down_monitor(self):
        """Background thread: terminate idle sandboxes above min_pool size."""
        while self._running:
            time.sleep(30)  # Check every 30 seconds

            with self._lock:
                if self._current_size() <= self.profile.min_pool:
                    continue

                now = time.time()
                in_use_names = set(self._in_use.values())

                # Find idle sandboxes to terminate
                to_terminate = []
                for name, last_used in list(self._last_used.items()):
                    if name in in_use_names:
                        continue
                    if now - last_used > self.profile.scale_down_idle:
                        # Don't go below min_pool
                        if self._current_size() - len(to_terminate) > self.profile.min_pool:
                            to_terminate.append(name)

                for name in to_terminate:
                    if not os.getenv("HERMES_QUIET"):
                        print(f"[Modal] Scaling down idle sandbox: {name}")
                    self._terminate_sandbox(name)

    def shutdown(self, during_shutdown: bool = False):
        """Stop monitor thread and terminate all sandboxes."""
        self._running = False
        with self._lock:
            for name in list(self._pool.keys()):
                self._terminate_sandbox(name, during_shutdown=during_shutdown)

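The selection policy in `_scale_down_monitor` reduces to a pure function over timestamps; a minimal standalone sketch (a hypothetical helper, not part of the diff — the real method additionally holds the pool lock and terminates the chosen sandboxes):

```python
def select_idle_to_terminate(last_used, in_use, now, scale_down_idle, min_pool):
    """Return idle sandbox names to terminate, oldest first, never shrinking below min_pool."""
    pool_size = len(last_used)
    to_terminate = []
    # Consider the longest-idle sandboxes first
    for name, ts in sorted(last_used.items(), key=lambda kv: kv[1]):
        if name in in_use:
            continue  # a task still holds this sandbox
        if now - ts > scale_down_idle and pool_size - len(to_terminate) > min_pool:
            to_terminate.append(name)
    return to_terminate
```

In-use sandboxes are never candidates, and the `pool_size - len(to_terminate) > min_pool` guard keeps the warm floor intact even when every sandbox is idle.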
class _ModalPoolManager:
    """
    Manages multiple sandbox pools, one per profile.

    Singleton pattern - shared across all _ModalSandboxEnvironment instances.
    Each profile has its own pool with independent scaling.
    """

    _instance: ClassVar[Optional["_ModalPoolManager"]] = None
    _init_lock: ClassVar[threading.Lock] = threading.Lock()

    @classmethod
    def get_instance(cls) -> "_ModalPoolManager":
        """Get or create the singleton instance."""
        with cls._init_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    @classmethod
    def reset_instance(cls):
        """Reset the singleton (for testing)."""
        with cls._init_lock:
            if cls._instance is not None:
                cls._instance.shutdown()
            cls._instance = None

    def __init__(self):
        self.app_name = os.getenv("TERMINAL_MODAL_APP_NAME", "hermes-sandbox")
        self.profiles = ModalProfile.load_profiles()
        self.default_profile = os.getenv("TERMINAL_MODAL_DEFAULT_PROFILE", "default")

        # Fall back to first profile if default not found
        if self.default_profile not in self.profiles and self.profiles:
            self.default_profile = next(iter(self.profiles.keys()))

        self._pools: Dict[str, _ModalSandboxPool] = {}
        self._pools_lock = threading.Lock()

        if not os.getenv("HERMES_QUIET"):
            print(f"[Modal] Pool manager initialized with profiles: {list(self.profiles.keys())}")
            print(f"[Modal] Default profile: {self.default_profile}")

    def _get_pool(self, profile_name: str) -> _ModalSandboxPool:
        """Get or create a pool for a profile."""
        with self._pools_lock:
            if profile_name not in self._pools:
                if profile_name not in self.profiles:
                    available = list(self.profiles.keys())
                    raise ValueError(
                        f"Unknown Modal profile: '{profile_name}'. "
                        f"Available profiles: {available}"
                    )
                profile = self.profiles[profile_name]
                self._pools[profile_name] = _ModalSandboxPool(profile, self.app_name)
            return self._pools[profile_name]

    def acquire(self, task_id: str, profile: Optional[str] = None, timeout: float = 60.0) -> Any:
        """Acquire a sandbox from the appropriate profile's pool."""
        profile_name = profile or self.default_profile
        return self._get_pool(profile_name).acquire(task_id, timeout=timeout)

    def release(self, task_id: str, profile: Optional[str] = None, terminate: bool = False):
        """Release a sandbox back to its pool."""
        profile_name = profile or self.default_profile
        if profile_name in self._pools:
            self._pools[profile_name].release(task_id, terminate=terminate)

    def get_status(self) -> Dict[str, Any]:
        """Get status of all pools."""
        status = {}
        with self._pools_lock:
            for name, pool in self._pools.items():
                with pool._lock:
                    status[name] = {
                        "pool_size": pool._current_size(),
                        "in_use": len(pool._in_use),
                        "max_pool": pool.profile.max_pool,
                        "min_pool": pool.profile.min_pool,
                    }
        return status

    def shutdown(self, during_shutdown: bool = False):
        """Shutdown all pools."""
        with self._pools_lock:
            for pool in self._pools.values():
                pool.shutdown(during_shutdown=during_shutdown)
            self._pools.clear()

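The lock-guarded singleton that `_ModalPoolManager.get_instance`/`reset_instance` implement follows this standard pattern (a minimal standalone sketch with the pool machinery stripped out):

```python
import threading

class PoolManagerSingleton:
    """Minimal sketch of the get_instance()/reset_instance() pattern."""
    _instance = None
    _init_lock = threading.Lock()

    @classmethod
    def get_instance(cls):
        # Holding the lock makes concurrent first calls race-free:
        # only one thread can construct the instance
        with cls._init_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    @classmethod
    def reset_instance(cls):
        # For testing: drop the cached instance so the next call rebuilds it
        with cls._init_lock:
            cls._instance = None
```

Every caller that goes through `get_instance` shares one object, which is what lets all `_ModalSandboxEnvironment` instances draw from the same per-profile pools.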
class _ModalSandboxEnvironment:
    """
    Modal Sandbox environment with profile-based pool management.

    Features:
    - Profile selection for heterogeneous workloads
    - Auto-scaling warm sandbox pool
    - Named sandbox recovery
    - SUDO_PASSWORD support
    """

    def __init__(
        self,
        image: str,  # Used only if no profile config
        cwd: str = "/workspace",
        timeout: int = 60,
        task_id: str = "",
        profile: Optional[str] = None,  # Profile name (e.g., "pytorch-gpu")
    ):
        self.cwd = cwd
        self.timeout = timeout
        self.task_id = task_id or str(uuid.uuid4())
        self.profile = profile
        self._released = False

        # Acquire sandbox from pool
        manager = _ModalPoolManager.get_instance()
        self._sandbox = manager.acquire(self.task_id, profile=profile)

    def execute(self, command: str, cwd: str = "", *, timeout: int | None = None) -> dict:
        """Execute a command in the Modal sandbox."""
        # Transform sudo commands if SUDO_PASSWORD is available
        exec_command = _transform_sudo_command(command)
        work_dir = cwd or self.cwd

        try:
            # Run command via bash with proper working directory
            process = self._sandbox.exec(
                "bash", "-c", f"cd {work_dir} && {exec_command}",
                timeout=timeout or self.timeout
            )

            # Read output
            stdout = process.stdout.read()
            stderr = process.stderr.read()
            process.wait()

            # Combine stdout and stderr
            output = stdout
            if stderr:
                output = output + stderr if output else stderr

            return {"output": output, "returncode": process.returncode}

        except Exception as e:
            error_msg = str(e)
            if "timeout" in error_msg.lower():
                return {"output": f"Command timed out after {timeout or self.timeout}s", "returncode": 124}
            return {"output": f"Modal execution error: {error_msg}", "returncode": 1}

    def cleanup(self):
        """Release sandbox back to pool (stays warm for reuse)."""
        if not self._released:
            self._released = True
            _ModalPoolManager.get_instance().release(
                self.task_id,
                profile=self.profile,
                terminate=False
            )

    def stop(self):
        """Terminate this sandbox explicitly."""
        if not self._released:
            self._released = True
            _ModalPoolManager.get_instance().release(
                self.task_id,
                profile=self.profile,
                terminate=True
            )

    def __del__(self):
        """Cleanup on destruction."""
@@ -1090,8 +1643,14 @@ def _create_environment(env_type: str, image: str, cwd: str, timeout: int, ssh_c
         return _SingularityEnvironment(image=image, cwd=cwd, timeout=timeout)
 
     elif env_type == "modal":
-        # Use custom Modal wrapper with sudo support
-        return _ModalEnvironment(image=image, cwd=cwd, timeout=timeout)
+        # Use native Modal Sandbox with auto-scaling pool and profile support
+        return _ModalSandboxEnvironment(
+            image=image,
+            cwd=cwd,
+            timeout=timeout,
+            task_id=task_id,
+            profile=profile,
+        )
 
     elif env_type == "ssh":
         if not ssh_config or not ssh_config.get("host") or not ssh_config.get("user"):
@@ -1286,13 +1845,24 @@ def cleanup_vm(task_id: str):
 
 atexit.register(_stop_cleanup_thread)
 
 
+def _shutdown_modal_pools():
+    """Shutdown Modal pool manager on exit (silently, as the interpreter is shutting down)."""
+    try:
+        if _ModalPoolManager._instance is not None:
+            _ModalPoolManager._instance.shutdown(during_shutdown=True)
+    except Exception:
+        pass  # Ignore all errors during interpreter shutdown
+
+
+atexit.register(_shutdown_modal_pools)
+
+
 def terminal_tool(
     command: str,
     background: bool = False,
     timeout: Optional[int] = None,
     task_id: Optional[str] = None,
-    force: bool = False
+    force: bool = False,
+    profile: Optional[str] = None,
 ) -> str:
     """
     Execute a command using mini-swe-agent's execution environments.
wandb/latest-run (new symbolic link, 1 line)
@@ -0,0 +1 @@
run-20260206_003827-82b0oahi

wandb/run-20260206_003827-82b0oahi/files/config.yaml (new file, 180 lines)
@@ -0,0 +1,180 @@
_wandb:
  value:
    cli_version: 0.24.2
    e:
      2gw7xuffca69jbm2b60l3w5ymo5pb5lf:
        args:
        - process
        - --env.driver
        - singularity
        - --env.singularity_image
        - /root/Hermes-Agent/atropos/atropos-sandbox.sif
        email: shannon@nousresearch.com
        executable: /root/Hermes-Agent/.venv/bin/python
        git:
          commit: 4d619bcd21feedc9eed36c53c038585d97e7295e
          remote: https://github.com/NousResearch/Hermes-Agent.git
        host: vultr
        os: Linux-6.8.0-90-generic-x86_64-with-glibc2.39
        program: -m atropos.envs.swe_smith_oracle_env
        python: CPython 3.12.3
        root: /root/Hermes-Agent
        startedAt: "2026-02-06T00:38:27.351013Z"
        writerId: 2gw7xuffca69jbm2b60l3w5ymo5pb5lf
    m: []
    python_version: 3.12.3
    t:
      "1":
      - 11
      - 49
      - 51
      - 95
      "3":
      - 13
      - 16
      "4": 3.12.3
      "5": 0.24.2
      "6": 5.0.0
      "12": 0.24.2
      "13": linux-x86_64
acquire_timeout_s:
  value: 30
agent_max_steps:
  value: 50
agent_max_tokens:
  value: null
agent_temperature:
  value: 0.7
agent_tool_delay_s:
  value: 0
allow_network:
  value: true
batch_size:
  value: 1
custom_thinking_prompt:
  value: null
data_dir_to_save_evals:
  value: null
data_path_to_save_groups:
  value: data/swe_smith_oracle_env_2.jsonl
dataset_name:
  value: NousResearch/SWE-smith-oracle
dataset_split:
  value: train
disabled_toolsets:
  value: []
driver:
  value: singularity
enabled_toolsets:
  value:
  - terminal
ensure_scores_are_not_same:
  value: false
eval_handling:
  value: STOP_TRAIN
eval_limit_ratio:
  value: 0.5
group_size:
  value: 1
include_messages:
  value: true
inference_weight:
  value: 1
install_timeout_s:
  value: 600
max_batches_offpolicy:
  value: 3
max_containers:
  value: 10
max_eval_workers:
  value: 16
max_items:
  value: 0
max_num_workers:
  value: -1
max_num_workers_per_node:
  value: 8
max_reasoning_tokens:
  value: null
max_token_length:
  value: 8192
min_batch_allocation:
  value: null
min_containers:
  value: 1
min_items_sent_before_logging:
  value: 2
modal_app_name:
  value: atropos-sandbox
modal_function_name:
  value: sandbox_server
modal_volume_mount_path:
  value: /data
modal_volume_name:
  value: null
nomad_address:
  value: http://localhost:4646
num_rollouts_per_group_for_logging:
  value: 1
num_rollouts_to_keep:
  value: 32
privileged:
  value: false
prompt_mode:
  value: problem_statement
purge_job_on_shutdown:
  value: true
purge_job_on_start:
  value: true
python_only:
  value: true
reasoning_effort:
  value: null
repo_base_url:
  value: https://github.com
require_sandbox:
  value: false
require_stateful_sandbox:
  value: false
rollout_server_url:
  value: http://localhost:8000
sandbox_image:
  value: atropos-sandbox:local
sandbox_job_id:
  value: atropos-sandbox-agent-env
score_include_fail_to_pass:
  value: true
seed:
  value: 0
shuffle:
  value: true
singularity_image:
  value: /root/Hermes-Agent/atropos/atropos-sandbox.sif
slots_per_container:
  value: 10
steps_per_eval:
  value: 1
test_timeout_s:
  value: 600
thinking_mode:
  value: false
tokenizer_name:
  value: NousResearch/Hermes-4.3-36B
tool_batch_window_ms:
  value: 20
tool_max_batch_size:
  value: 200
tool_pool_mode:
  value: nomad
tool_server_token:
  value: null
tool_server_url:
  value: null
total_steps:
  value: 1
use_wandb:
  value: true
wandb_name:
  value: swe_smith_oracle
worker_timeout:
  value: 600

wandb/run-20260206_003827-82b0oahi/run-82b0oahi.wandb (new binary file, not shown)