Compare commits

...

34 Commits

Author SHA1 Message Date
Sam Herring
735723803f Adding finalized endless terminal 2026-02-13 10:52:13 -07:00
Sam Herring
1472cc302d Adding full environment and config file 2026-02-12 13:58:30 -07:00
Sam Herring
9c200abdb1 Initial commit for endless terminal integrations 2026-02-09 14:30:35 -08:00
Shannon Sands
9dc27880cd adding tinker but need api key 2026-02-09 02:37:39 +00:00
Shannon Sands
3b9c53e6db Add Tinker RL training integration and documentation
- pyproject.toml: Added tinker SDK, torch, wandb, math-verify to [atropos] extras
- README.md: Added comprehensive RL Training with Tinker section including:
  - Architecture diagram (3-process pipeline)
  - Quick start guide for GSM8k agent training
  - Configuration documentation
  - RL CLI usage
  - Sandbox backend options (Nomad, Singularity, Modal)

New files in tinker-atropos submodule (committed there):
- tinker_atropos/environments/gsm8k_agent.py: Agent GSM8k env with Python REPL tool
- configs/gsm8k_agent.yaml: Config for Qwen3-4B training
2026-02-09 01:36:20 +00:00
Shannon Sands
05dd31131f merged main 2026-02-09 00:17:07 +00:00
Shannon Sands
36ea883d45 Merge origin/main into atropos-integrations
Merged main's latest changes including:
- New hermes_cli/ unified CLI commands
- File operations tools, fuzzy match, patch parser
- RL training tools and tinker-atropos submodule
- Enhanced batch_runner and run_agent
- Gateway improvements (Telegram, Discord)
- Cron job management
- Installation scripts

Preserved our branch-specific features:
- Modal backend (atropos/backends/modal_backend.py)
- Modal terminal tool integration (ModalProfile, _ModalSandboxPool, etc.)
- Singularity/Apptainer support
- Atropos AgentEnv Modal config fields
- Combined pyproject.toml extras (atropos + messaging + cron + cli)

Conflict resolution:
- cli.py, model_tools.py, README.md: accepted main (newer features)
- pyproject.toml: combined both extras and package lists
- tools/terminal_tool.py: accepted main's base + re-inserted Modal integration
2026-02-09 00:08:25 +00:00
Shannon Sands
6be8cdeeca modal backend working ok, merged in modal-integrations 2026-02-08 23:48:01 +00:00
Jai Suphavadeeprasit
0bc914b00c readme edit 2026-02-06 04:24:39 -05:00
Jai Suphavadeeprasit
411e7f8ff4 readme edit 2026-02-06 04:24:12 -05:00
Jai Suphavadeeprasit
eb2e6b73fe integration 2026-02-06 04:15:56 -05:00
Shannon Sands
664acf7426 fixed gitignore 2026-02-06 02:27:47 +00:00
Shannon Sands
fd1c3da305 singularity working 2026-02-06 01:03:59 +00:00
Shannon Sands
4d619bcd21 moved nomand config 2026-02-05 15:45:46 +10:00
Shannon Sands
beac2ee06a increasing per-chat timeout (re api issues ergh), and tweaked logging 2026-02-05 14:54:34 +10:00
Shannon Sands
487487406d adjusted prompt again to make things more reliable, having api issues 2026-02-05 14:42:10 +10:00
Shannon Sands
87464821d8 added metadata capture 2026-02-05 12:00:31 +10:00
Shannon Sands
661d8f4d6c logprobs 2026-02-05 11:42:58 +10:00
Shannon Sands
bf13a848ef endpoint issue (can reproduce with curl calls) 2026-02-05 11:27:18 +10:00
Shannon Sands
88286f6da3 slow completions over group_size 4, debugging added 2026-02-05 10:57:13 +10:00
Shannon Sands
5b82190460 adding some more debugging, hitting endpoint errors or some other slowdown 2026-02-05 08:59:14 +10:00
Shannon Sands
ea7aa0b0d4 Modal backend stubs 2026-02-04 15:20:37 +10:00
Shannon Sands
7130fa50cb fixed infinite loop on agent errors 2026-02-04 14:25:08 +10:00
Shannon Sands
5a9c98a771 swe-smith-oracle runs 1 step process. llama server was just breaking again locally idk, works through Hermes endpoint & ManagedServer fine 2026-02-04 11:22:45 +10:00
Shannon Sands
6cb4fe948a group size 1 works, some timeouts but could be just local server 2026-02-03 16:24:47 +10:00
Shannon Sands
30221d8c20 get tokenizer from .env 2026-02-03 14:50:37 +10:00
Shannon Sands
b5b1fef20a successful loop with Hermes-36b, adding docker lib to hermes-agent to manage env sandbox builds 2026-02-03 14:24:20 +10:00
Shannon Sands
16fb41f9cc smokes working, fixing up toolserver. switched to llama.cpp, ollama sucks too much 2026-02-03 11:41:34 +10:00
Shannon Sands
4939130485 tool dedup 2026-02-02 15:28:10 +10:00
Shannon Sands
8dccd6569e moved in main atropos agent files to Hermes-Agent, updated paths, gated on optional package install 2026-02-02 15:12:27 +10:00
Shannon Sands
db348dc467 ds store 2026-02-02 14:07:20 +10:00
Shannon Sands
88722e230d backed in tui works for basic toolset 2026-02-02 14:06:07 +10:00
Shannon Sands
68fb0efe0e added atropos as dependency, and extra flag, adding atropos as optional backend to agent 2026-02-02 11:56:08 +10:00
Shannon Sands
e38c274f8d Added AtroposAIAgent to ovveride standard runner with ManagedServer integration 2026-02-02 10:24:28 +10:00
78 changed files with 20698 additions and 88 deletions

115
.clinerules Normal file
View File

@@ -0,0 +1,115 @@
# Cline's Memory Bank
I am Cline, an expert software engineer with a unique characteristic: my memory resets completely between sessions. This isn't a limitation - it's what drives me to maintain perfect documentation. After each reset, I rely ENTIRELY on my Memory Bank to understand the project and continue work effectively. I MUST read ALL memory bank files at the start of EVERY task - this is not optional.
## Memory Bank Structure
The Memory Bank consists of core files and optional context files, all in Markdown format. Files build upon each other in a clear hierarchy:
flowchart TD
PB[projectbrief.md] --> PC[productContext.md]
PB --> SP[systemPatterns.md]
PB --> TC[techContext.md]
PC --> AC[activeContext.md]
SP --> AC
TC --> AC
AC --> P[progress.md]
### Core Files (Required)
1. `projectbrief.md`
- Foundation document that shapes all other files
- Created at project start if it doesn't exist
- Defines core requirements and goals
- Source of truth for project scope
2. `productContext.md`
- Why this project exists
- Problems it solves
- How it should work
- User experience goals
3. `activeContext.md`
- Current work focus
- Recent changes
- Next steps
- Active decisions and considerations
- Important patterns and preferences
- Learnings and project insights
4. `systemPatterns.md`
- System architecture
- Key technical decisions
- Design patterns in use
- Component relationships
- Critical implementation paths
5. `techContext.md`
- Technologies used
- Development setup
- Technical constraints
- Dependencies
- Tool usage patterns
6. `progress.md`
- What works
- What's left to build
- Current status
- Known issues
- Evolution of project decisions
### Additional Context
Create additional files/folders within memory-bank/ when they help organize:
- Complex feature documentation
- Integration specifications
- API documentation
- Testing strategies
- Deployment procedures
## Core Workflows
### Plan Mode
flowchart TD
Start[Start] --> ReadFiles[Read Memory Bank]
ReadFiles --> CheckFiles{Files Complete?}
CheckFiles -->|No| Plan[Create Plan]
Plan --> Document[Document in Chat]
CheckFiles -->|Yes| Verify[Verify Context]
Verify --> Strategy[Develop Strategy]
Strategy --> Present[Present Approach]
### Act Mode
flowchart TD
Start[Start] --> Context[Check Memory Bank]
Context --> Update[Update Documentation]
Update --> Execute[Execute Task]
Execute --> Document[Document Changes]
## Documentation Updates
Memory Bank updates occur when:
1. Discovering new project patterns
2. After implementing significant changes
3. When user requests with **update memory bank** (MUST review ALL files)
4. When context needs clarification
flowchart TD
Start[Update Process]
subgraph Process
P1[Review ALL Files]
P2[Document Current State]
P3[Clarify Next Steps]
P4[Document Insights & Patterns]
P1 --> P2 --> P3 --> P4
end
Start --> Process
Note: When triggered by **update memory bank**, I MUST review every memory bank file, even if some don't require updates. Focus particularly on activeContext.md and progress.md as they track current state.
REMEMBER: After every memory reset, I begin completely fresh. The Memory Bank is my only link to previous work. It must be maintained with precision and clarity, as my effectiveness depends entirely on its accuracy.

View File

@@ -1,12 +1,68 @@
# Hermes Agent Environment Configuration # Hermes Agent Environment Configuration
# Copy this file to .env and fill in your API keys # Copy this file to .env and fill in your API keys
# =============================================================================
# CORE SETTINGS
# =============================================================================
# Agent backend:
# - openai : default Hermes-Agent loop (OpenAI function-calling via OpenAI SDK)
# - atropos : Atroposlib ServerManager/ManagedServer-backed loop (training/env integration)
HERMES_BACKEND=openai
# =============================================================================
# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
# =============================================================================
# For local development (matches the Atropos test env defaults):
# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
# ATROPOS_SERVER_MODEL=hermes-4-36b
# For hosted inference (Nous Research inference API):
ATROPOS_SERVER_BASE_URL=
ATROPOS_SERVER_MODEL=
ATROPOS_TOKENIZER_NAME=
# Set this to your Nous API key (Bearer token).
ATROPOS_SERVER_API_KEY=
# Debugging (prints to stdout; use with care)
# HERMES_DEBUG_ATROPOS_REQUEST=1
# HERMES_DEBUG_ATROPOS_RESPONSE=1
# HERMES_DEBUG_OPENAI_REQUEST=1
# HERMES_DEBUG_OPENAI_RESPONSE=1
# =============================================================================
# LOCAL / SELF-HOSTED OPENAI-COMPATIBLE ENDPOINTS (vLLM, SGLang, llama.cpp, etc.)
# =============================================================================
# If you set ATROPOS_SERVER_BASE_URL or OPENAI_BASE_URL, Hermes will use it instead
# of OpenRouter.
#
# Local server convenience (base URL without /v1):
# llama.cpp example (see `Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh`):
# ATROPOS_SERVER_BASE_URL=http://127.0.0.1:8080
# ATROPOS_SERVER_MODEL=hermes-4-36b
# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
# ATROPOS_SERVER_API_KEY=local
#
# Hosted Nous inference API:
# ATROPOS_SERVER_BASE_URL=https://inference-api.nousresearch.com
# ATROPOS_SERVER_MODEL=Hermes-4.3-36B
# ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B
# ATROPOS_SERVER_API_KEY=sk-... (Bearer token)
#
# If you plan to run GRPO-style group sampling (e.g. `--env.group_size 4`) against
# llama.cpp, start the server with at least that many slots, e.g.:
# LLAMA_CPP_PARALLEL=4 Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
#
# Generic OpenAI-compatible (base URL should include /v1):
# OPENAI_BASE_URL=http://127.0.0.1:8080/v1
# OPENAI_API_KEY=local
# ============================================================================= # =============================================================================
# LLM PROVIDER (OpenRouter) # LLM PROVIDER (OpenRouter)
# ============================================================================= # =============================================================================
# OpenRouter provides access to many models through one API # OpenRouter provides access to many models through one API
# All LLM calls go through OpenRouter - no direct provider keys needed # All LLM calls go through OpenRouter - no direct provider keys needed
# Get your key at: https://openrouter.ai/keys # Get your key at: https://openrouter.ai/keys
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENROUTER_API_KEY= OPENROUTER_API_KEY=
# Default model to use (OpenRouter format: provider/model) # Default model to use (OpenRouter format: provider/model)
@@ -92,12 +148,87 @@ TERMINAL_LIFETIME_SECONDS=300
# SUDO_PASSWORD=your_password_here # SUDO_PASSWORD=your_password_here
# ============================================================================= # =============================================================================
# MODAL CLOUD BACKEND (Optional - for TERMINAL_ENV=modal) # MODAL CLOUD BACKEND (for TERMINAL_ENV=modal)
# ============================================================================= # =============================================================================
# Modal uses CLI authentication, not environment variables. # Modal provides cloud sandboxes with per-second billing and auto-scaling.
# Run: pip install modal && modal setup # This implementation uses a warm pool of sandboxes for cost efficiency.
# This will authenticate via browser and store credentials locally. #
# No API key needed in .env - Modal handles auth automatically. # SETUP:
# pip install modal && modal setup
# (Authenticates via browser, stores credentials locally)
#
# FEATURES:
# - Auto-scaling warm sandbox pool (no cold start after first use)
# - Named sandbox recovery (reconnects after restart)
# - Profile-based heterogeneous environments (CPU, GPU, different images)
# - Server-side idle_timeout protection against orphaned sandboxes
# Modal app name (groups all sandboxes, used for recovery)
TERMINAL_MODAL_APP_NAME=hermes-sandbox
# Default profile when none specified
TERMINAL_MODAL_DEFAULT_PROFILE=default
# Profile config file (optional - YAML format, see modal_profiles.yaml)
# TERMINAL_MODAL_PROFILES_FILE=modal_profiles.yaml
# --- Default Profile Settings (used if no YAML file) ---
# These apply when no profile is specified or for the "default" profile
TERMINAL_MODAL_IMAGE=python:3.11
TERMINAL_MODAL_MIN_POOL=1
TERMINAL_MODAL_MAX_POOL=5
TERMINAL_MODAL_IDLE_TIMEOUT=120
TERMINAL_MODAL_MAX_LIFETIME=3600
TERMINAL_MODAL_SCALE_DOWN_IDLE=180
# --- Custom Profile Example: pytorch-gpu ---
# Uncomment to enable a GPU profile for ML tasks
# Usage: terminal_tool("python train.py", profile="pytorch-gpu")
#
# TERMINAL_MODAL_PROFILE_pytorch_gpu_IMAGE=pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
# TERMINAL_MODAL_PROFILE_pytorch_gpu_GPU=T4
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MEMORY=16384
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MIN_POOL=0
# TERMINAL_MODAL_PROFILE_pytorch_gpu_MAX_POOL=2
# TERMINAL_MODAL_PROFILE_pytorch_gpu_IDLE_TIMEOUT=60
# --- Custom Profile Example: node ---
# Uncomment to enable a Node.js profile
# Usage: terminal_tool("npm test", profile="node")
#
# TERMINAL_MODAL_PROFILE_node_IMAGE=node:18
# TERMINAL_MODAL_PROFILE_node_MIN_POOL=0
# TERMINAL_MODAL_PROFILE_node_MAX_POOL=3
# =============================================================================
# MODAL SECRETS (Secure credential injection)
# =============================================================================
# Modal Secrets allow you to securely pass API keys, passwords, and other
# sensitive data to your sandboxes without exposing them in code or logs.
#
# SETUP SECRETS:
# 1. Via Dashboard: https://modal.com/secrets
# 2. Via CLI: modal secret create my-secret KEY1=value1 KEY2=value2
# 3. Via CLI with env: modal secret create my-secret API_KEY="$API_KEY"
#
# LIST SECRETS:
# modal secret list
#
# DELETE SECRETS:
# modal secret delete my-secret
# Global secrets applied to ALL profiles (comma-separated secret names)
# These secrets must be created on Modal dashboard or via CLI first
# TERMINAL_MODAL_SECRETS=my-api-keys,database-creds
# Per-profile secrets (comma-separated secret names)
# TERMINAL_MODAL_PROFILE_pytorch_gpu_SECRETS=huggingface-token,wandb-key
# Per-profile environment variables (semicolon-separated KEY=VALUE pairs)
# TERMINAL_MODAL_PROFILE_default_ENV_VARS=DEBUG=1;LOG_LEVEL=info
# Load local .env file into sandbox (useful for development)
# TERMINAL_MODAL_PROFILE_default_USE_DOTENV=true
# ============================================================================= # =============================================================================
# BROWSER TOOL CONFIGURATION (agent-browser + Browserbase) # BROWSER TOOL CONFIGURATION (agent-browser + Browserbase)

20
.gitignore vendored
View File

@@ -46,3 +46,23 @@ testlogs
# CLI config (may contain sensitive SSH paths) # CLI config (may contain sensitive SSH paths)
cli-config.yaml cli-config.yaml
.DS_Store
# artifacts
*.jsonl
*.html
*.json
*.log
*.csv
# Singularity/Apptainer images (large binary files)
*.sif
# Test files
test_singularity_*.py
test_*.py
!tests/test_*.py
# Nomad data
/tmp/NomadClient*/

131
README.md
View File

@@ -995,6 +995,137 @@ All variables go in `~/.hermes/.env`. Run `hermes config set VAR value` to set t
--- ---
## RL Training with Tinker
Hermes-Agent includes an RL training integration with [Tinker](https://thinkingmachines.ai/tinker/) (Thinking Machines) and [Atropos](https://github.com/NousResearch/atropos) for training language models with reinforcement learning from agent trajectories.
### Prerequisites
1. **Install with Atropos extras** (includes Tinker SDK, atroposlib, torch, wandb):
```bash
pip install -e ".[atropos]"
```
2. **Initialize the tinker-atropos submodule**:
```bash
git submodule update --init
pip install -e ./tinker-atropos
```
3. **Get API keys**:
- `TINKER_API_KEY` from [Tinker Console](https://tinker-console.thinkingmachines.ai/keys) (requires billing setup)
- `WANDB_API_KEY` from [Weights & Biases](https://wandb.ai/settings) (for metrics tracking)
4. **Add keys to your `.env` file**:
```bash
# Add to .env or ~/.hermes/.env
TINKER_API_KEY=your_tinker_key
WANDB_API_KEY=your_wandb_key
```
### Architecture
The RL training pipeline uses three processes that communicate over HTTP:
```
┌──────────────────────┐ ┌─────────────────────┐ ┌────────────────────────┐
│ Atropos Rollout API │ │ Tinker Trainer │ │ Environment │
│ (port 8000) │◄──│ (port 8001) │◄──│ (worker) │
│ │ │ │ │ │
│ • Collects batches │ │ • LoRA training │ │ • Generates prompts │
│ • Coordinates env │ │ • Inference server │ │ • Calls inference API │
│ and trainer │ │ • Weight updates │ │ • Scores responses │
│ │ │ • WandB logging │ │ • Sends scored batches │
└──────────────────────┘ └─────────────────────┘ └────────────────────────┘
```
### Quick Start: GSM8k Agent Training
This example trains a model on math problems using a Python REPL tool — the model learns to write and execute Python code to solve math:
```bash
# Terminal 1: Start Atropos Rollout API
cd tinker-atropos
source ../.venv/bin/activate
set -a && source ../.env && set +a
run-api
# Terminal 2: Start Tinker Trainer + Inference Server
cd tinker-atropos
source ../.venv/bin/activate
set -a && source ../.env && set +a
python launch_training.py --config configs/gsm8k_agent.yaml
# Terminal 3: Start GSM8k Agent Environment
cd tinker-atropos
source ../.venv/bin/activate
set -a && source ../.env && set +a
python tinker_atropos/environments/gsm8k_agent.py serve --config configs/gsm8k_agent.yaml
```
### Available Environments
| Environment | File | Description |
|------------|------|-------------|
| `gsm8k` | `gsm8k_tinker.py` | Standard GSM8k math (no tools) |
| `gsm8k_agent` | `gsm8k_agent.py` | GSM8k with Python REPL tool calling |
### Configuration
Configs are YAML files in `tinker-atropos/configs/` with three sections:
```yaml
env: # Atropos environment settings
group_size: 4 # Parallel rollouts per problem
batch_size: 16 # Training batch size
tokenizer_name: "Qwen/Qwen3-4B-Instruct-2507"
max_token_length: 2048 # Max generation length
total_steps: 20 # Training steps
openai: # Inference server (served by Tinker trainer)
- model_name: "Qwen/Qwen3-4B-Instruct-2507"
base_url: "http://localhost:8001/v1"
tinker: # Tinker training parameters
lora_rank: 16 # LoRA rank (lower = faster, less capacity)
learning_rate: 0.00005 # Learning rate
max_token_trainer_length: 4096 # Max tokens for training
wandb_project: "hermes-agent-rl"
```
### RL CLI (Agent-Driven Training)
For interactive training management via the Hermes agent:
```bash
# Interactive mode - let the agent manage training
python rl_cli.py --interactive
# List available environments
python rl_cli.py --list-environments
# Direct task
python rl_cli.py "Train a model on GSM8k with tool use"
```
### Sandbox Backends for Agent Environments
For agent environments that need isolated tool execution (e.g., SWE tasks), Hermes-Agent supports multiple sandbox backends:
| Backend | Use Case | Command |
|---------|----------|---------|
| **Nomad + Docker** | Default, local development | `--env.tool_pool_mode nomad` |
| **Nomad + Singularity** | HPC clusters without Docker | `--env.tool_pool_mode nomad --env.driver singularity` |
| **Modal** | Cloud-based, auto-scaling | `--env.tool_pool_mode modal` |
See [docs/MODAL_BACKEND.md](docs/MODAL_BACKEND.md) for Modal backend details.
### Cost
Check the [Tinker Rate Card](https://tinker-console.thinkingmachines.ai/rate-card) for available models and pricing.
---
## Troubleshooting ## Troubleshooting
```bash ```bash

41
atropos/Dockerfile Normal file
View File

@@ -0,0 +1,41 @@
# Dockerfile for atropos-agent sandbox server
# Runs inside Nomad containers to handle tool execution
# Includes bubblewrap for namespace-based slot isolation
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
# Bubblewrap for namespace isolation
bubblewrap \
# `script` for PTY allocation (used for stable tmux+asciinema startup)
util-linux \
# Git for SWE-style tasks (cloning repos)
git \
# tmux for stateful terminal sessions (Phase 4.7+)
tmux \
# Common tools agents might need
curl \
wget \
jq \
# Cleanup
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies (sandbox server + optional terminal recording)
RUN pip install --no-cache-dir aiohttp asciinema
# Copy the sandbox server
COPY sandbox_server.py /app/sandbox_server.py
WORKDIR /app
# Create data directory for slot workspaces
RUN mkdir -p /data
# Verify bubblewrap is installed and working
RUN bwrap --version
EXPOSE 8080
# Default command - can be overridden by Nomad job spec
CMD ["python", "sandbox_server.py", "--port", "8080", "--slots", "10", "--data-dir", "/data"]

46
atropos/__init__.py Normal file
View File

@@ -0,0 +1,46 @@
"""
Atropos integration for Hermes-Agent.
This package is intentionally optional: Hermes-Agent should work without Atropos.
If you import anything from `atropos.*` without having `atroposlib` installed,
we raise a clear error with install instructions.
Install (recommended, from repo checkout):
uv sync --extra atropos
Or (pip / editable):
pip install -e '.[atropos]'
"""
from __future__ import annotations
def _require_atroposlib() -> None:
try:
import atroposlib # noqa: F401
except ModuleNotFoundError as exc: # pragma: no cover
raise ModuleNotFoundError(
"Hermes-Agent Atropos integration requires `atroposlib`, but it is not installed.\n"
"Install it with:\n"
" uv sync --extra atropos\n"
"or:\n"
" pip install -e '.[atropos]'\n"
) from exc
_require_atroposlib()
# Re-export the most commonly used pieces for convenience.
from .agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData # noqa: E402
from .envs import AgentEnv, AgentEnvConfig # noqa: E402
__all__ = [
"AtroposAgent",
"AgentConfig",
"AgentResult",
"AgentStep",
"SequenceData",
"AgentEnv",
"AgentEnvConfig",
]

15
atropos/agent/__init__.py Normal file
View File

@@ -0,0 +1,15 @@
"""
Agent abstractions for atropos-agent.
Provides the core AtroposAgent class for running ReACT-style agent loops.
"""
from .atropos_agent import AgentConfig, AgentResult, AgentStep, AtroposAgent, SequenceData
__all__ = [
"AtroposAgent",
"AgentConfig",
"AgentResult",
"AgentStep",
"SequenceData",
]

View File

@@ -0,0 +1,850 @@
"""
ReACT-style agent implementation for atropos-agent.
This module provides the core AtroposAgent class that implements a basic
Reason-Act-Observe loop with tool calling capabilities.
Uses ManagedServer from atroposlib for automatic token/logprob tracking,
making trajectories ready for RL training.
The agent uses Hermes-style XML tags for tool calls:
- <think>...</think> for reasoning
- <tool_call>{"name": "...", "arguments": {...}}</tool_call> for actions
- <tool_response>...</tool_response> for observations
"""
import asyncio
import os
import json
import time
from contextlib import asynccontextmanager
from dataclasses import dataclass, field
from uuid import uuid4
from typing import Any, AsyncGenerator, Awaitable, Callable, Dict, List, Optional, Union
from dotenv import load_dotenv
import httpx
from ..tools import ToolCall, ToolRegistry, ToolResult
from atroposlib.envs.server_handling.managed_server import ManagedServer
load_dotenv()
# Default system prompt with tool calling instructions.
AGENT_SYSTEM_PROMPT = """You are a deep thinking AI. You MUST enclose your internal reasoning inside <think>...</think> tags.
You are a function calling AI model.
You are provided with function signatures within <tools></tools> XML tags.
You must call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
You can ONLY respond without a tool call if you are totally certain you have the final answer to the user's question or task
After calling & executing a function, you will be provided with function results within <tool_response></tool_response> XML tags.
Here are the available tools:
<tools>
{tools_json}
</tools>
Use the following JSON schema for each tool call you will make:
{"title": "FunctionCall", "type": "object", "properties": {"name": {"title": "Name", "type": "string"}, "arguments": {"title": "Arguments", "type": "object"}}, "required": ["name", "arguments"]}
## REQUIRED TOOL FORMAT
When you decide to call a tool, your assistant message MUST be:
1) exactly one <think>...</think> block, followed by
2) one or more <tool_call>...</tool_call> blocks,
and NOTHING else in that message.
If you need to explain anything, put it inside <think>. Do NOT write natural language outside <think> or <tool_call>.
For each function call return a JSON object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{"name": "<function-name>", "arguments": {"arg1": "value1"}}
</tool_call>
Each <tool_call> must be on its own and contain ONLY the JSON object (no extra text).
The JSON inside <tool_call> MUST be valid JSON with double quotes.
Do NOT output <tool_response> in an assistant message.
After you receive tool results, you may either call more tools (same required format) or provide the final answer.
When providing the final answer, do NOT include any <tool_call> blocks.
## TERMINAL TOOL NOTES
- Commands execute under POSIX `/bin/sh` (not bash).
- Each tool call runs in a fresh shell: environment changes (like `cd` or venv activation) do not persist across tool calls.
- Avoid bash-only features like `source`, `[[ ... ]]`, or process substitution.
- Prefer explicit venv usage:
- `python -m venv .venv && . .venv/bin/activate && python -m pip install -e .` (POSIX `.` activation), or
- `.venv/bin/python -m pip install -e .` (no activation required).
## ICL (examples)
User: Show the current directory.
Assistant:
<think>I should run pwd.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "pwd"}}
</tool_call>
User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
Assistant: /tmp
User: List files, then count them.
Assistant:
<think>I should count files.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "ls -1 | wc -l"}}
</tool_call>
User: <tool_response>{"success": true, "output": "3\\n"}</tool_response>
Assistant: 3
User: Run pwd, then print ok (two tool calls).
Assistant:
<think>I should run two commands.</think>
<tool_call>
{"name": "terminal", "arguments": {"command": "pwd"}}
</tool_call>
<tool_call>
{"name": "terminal", "arguments": {"command": "echo ok"}}
</tool_call>
User: <tool_response>{"success": true, "output": "/tmp\\n"}</tool_response>
User: <tool_response>{"success": true, "output": "ok\\n"}</tool_response>
Assistant: ok
"""
@dataclass
class AgentConfig:
"""Configuration for the AtroposAgent."""
# Generation parameters
temperature: Optional[float] = 0.7
# Default to "let the backend decide" (important for tool-tag completions that may be longer).
max_tokens: Optional[int] = None
# Agent behavior
max_steps: int = 50
system_prompt: Optional[str] = None
tool_delay_s: float = 0.0
# Working directory for tools
working_dir: Optional[str] = None
@dataclass
class SequenceData:
"""Token/logprob data from a single completion."""
full_text: str
tokens: List[int]
masked_tokens: List[int] # -100 for prompt, actual IDs for completion
logprobs: List[float] # 1.0 for prompt, actual values for completion
metadata: Optional[Dict[str, Any]] = None
@classmethod
def from_sequence_node(cls, node) -> "SequenceData":
"""Create from a ManagedServer SequenceNode."""
return cls(
full_text=node.full_text,
tokens=node.tokens,
masked_tokens=node.masked_tokens,
logprobs=node.logprobs,
metadata=getattr(node, "metadata", None),
)
@dataclass
class AgentStep:
"""A single step in the agent's trajectory."""
step_number: int
assistant_message: str
tool_calls: List[ToolCall] = field(default_factory=list)
tool_results: List[ToolResult] = field(default_factory=list)
sequence_data: Optional[SequenceData] = None # Token data from this step
@property
def has_tool_calls(self) -> bool:
return len(self.tool_calls) > 0
@dataclass
class AgentResult:
"""Result of running an agent trajectory."""
success: bool
final_response: str
steps: List[AgentStep] = field(default_factory=list)
total_tokens: int = 0
error: Optional[str] = None
metadata: Dict[str, Any] = field(default_factory=dict)
# Full trajectory token data for RL training
trajectory_data: Optional[SequenceData] = None
@property
def num_steps(self) -> int:
return len(self.steps)
@property
def total_tool_calls(self) -> int:
return sum(len(step.tool_calls) for step in self.steps)
def to_messages(self) -> List[Dict[str, str]]:
"""Convert trajectory to messages format for logging."""
messages = []
for step in self.steps:
messages.append({"role": "assistant", "content": step.assistant_message})
if step.tool_results:
# Combine all tool responses
responses = "\n".join(r.to_xml() for r in step.tool_results)
messages.append({"role": "user", "content": responses})
return messages
def to_scored_data(self, score: float) -> Optional[Dict[str, Any]]:
"""
Convert to format suitable for ScoredDataGroup.
Args:
score: The score for this trajectory
Returns:
Dict with tokens, masks, scores suitable for training, or None if no data
"""
if self.trajectory_data is None:
return None
return {
"tokens": self.trajectory_data.tokens,
"masks": self.trajectory_data.masked_tokens,
"scores": score,
"logprobs": self.trajectory_data.logprobs,
}
class AtroposAgent:
"""
A ReACT-style agent that uses LLMs with tool calling.
This implementation wraps ManagedServer for automatic token/logprob tracking,
making trajectories ready for RL training.
Example:
# `server` may be an Atropos `ServerManager` (recommended) or a single `APIServer`.
# In practice, environments usually construct this via `BaseEnv`.
server = ...
tools = ToolRegistry()
tools.register(BashTool())
agent = AtroposAgent(server=server, tools=tools)
result = await agent.run("List the files in the current directory")
# Access token data for training
if result.trajectory_data:
print(f"Tokens: {result.trajectory_data.tokens}")
print(f"Masked: {result.trajectory_data.masked_tokens}")
"""
def __init__(
self,
server, # ServerManager or APIServer
tools: Optional[ToolRegistry] = None,
config: Optional[AgentConfig] = None,
tokenizer: Optional[Any] = None,
execute_tool: Optional[Callable[[ToolCall], Awaitable[ToolResult]]] = None,
):
self.server = server
self.tools = tools or ToolRegistry()
self.config = config or AgentConfig()
self.tokenizer = tokenizer or getattr(server, "tokenizer", None)
self.execute_tool = execute_tool or self.tools.execute
@asynccontextmanager
async def _managed(self) -> AsyncGenerator[Any, None]:
"""
Yield a ManagedServer-like object.
- If `self.server` is a ServerManager, use its `managed_server()` context manager.
- If `self.server` is a single APIServer, wrap it in `ManagedServer` directly.
"""
if os.getenv("ATROPOS_BYPASS_MANAGED_SERVER") == "1":
yield _DirectChatCompletionClient(server=self.server)
return
if hasattr(self.server, "managed_server"):
async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
yield managed
else:
managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
try:
yield managed
finally:
managed.reset()
def _build_system_prompt(self) -> str:
"""Build the system prompt with tool descriptions."""
if self.config.system_prompt:
return self.config.system_prompt
tools_json = self.tools.get_prompt_tool_definitions_json()
# Avoid `str.format()` here because the prompt contains many literal `{}` braces
# in JSON examples; we only want to substitute the single `{tools_json}` token.
return AGENT_SYSTEM_PROMPT.replace("{tools_json}", tools_json)
def _infer_server_model_for_debug(self) -> Optional[str]:
"""
Best-effort inference of the configured model name for debug payload saving.
ManagedServer/server_manager typically injects `model` internally, so `chat_kwargs`
may not contain it. For replaying saved payloads via curl, it's useful to persist it.
"""
servers = getattr(self.server, "servers", None)
if isinstance(servers, list) and servers:
s0 = servers[0]
cfg = getattr(s0, "config", None)
model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
if isinstance(model, str) and model:
return model
model = getattr(self.server, "model_name", None) or getattr(self.server, "model", None)
if isinstance(model, str) and model:
return model
return None
def _infer_server_base_url_for_debug(self) -> Optional[str]:
"""
Best-effort inference of the configured base_url for debug logging.
This is helpful when diagnosing hangs / retries at the transport layer.
"""
servers = getattr(self.server, "servers", None)
if isinstance(servers, list) and servers:
s0 = servers[0]
cfg = getattr(s0, "config", None)
base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
if isinstance(base_url, str) and base_url:
return base_url
base_url = getattr(self.server, "base_url", None)
if isinstance(base_url, str) and base_url:
return base_url
return None
def _extract_response_metadata(self, response: Any) -> Dict[str, Any]:
"""
Extract lightweight, JSON-serializable metadata from an OpenAI-style response.
This is useful for debugging training runs, especially when ManagedServer state
tracking is unavailable (e.g. OpenAI-compatible chat endpoints).
"""
meta: Dict[str, Any] = {}
try:
rid = getattr(response, "id", None)
if isinstance(rid, str) and rid:
meta["id"] = rid
model = getattr(response, "model", None)
if isinstance(model, str) and model:
meta["model"] = model
created = getattr(response, "created", None)
if isinstance(created, int):
meta["created"] = created
system_fingerprint = getattr(response, "system_fingerprint", None)
if isinstance(system_fingerprint, str) and system_fingerprint:
meta["system_fingerprint"] = system_fingerprint
choices = getattr(response, "choices", None)
if isinstance(choices, list) and choices:
fr = getattr(choices[0], "finish_reason", None)
if isinstance(fr, str) and fr:
meta["finish_reason"] = fr
usage = getattr(response, "usage", None)
if usage is not None:
if hasattr(usage, "model_dump"):
meta["usage"] = usage.model_dump()
elif isinstance(usage, dict):
meta["usage"] = usage
except Exception:
pass
return meta
def _debug_dump_request(self, *, step_num: int, chat_kwargs: Dict[str, Any]) -> None:
if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST") != "1":
return
try:
# Avoid dumping megabytes by default; messages can be huge.
meta = {
"step": step_num,
"base_url": self._infer_server_base_url_for_debug(),
"model": chat_kwargs.get("model") or self._infer_server_model_for_debug(),
"chat_kwargs_keys": sorted(list(chat_kwargs.keys())),
"n": chat_kwargs.get("n"),
"max_tokens": chat_kwargs.get("max_tokens"),
"temperature": chat_kwargs.get("temperature"),
"num_messages": len(chat_kwargs.get("messages") or []),
}
print("\n=== ATROPOS_DEBUG_AGENT_REQUEST ===", flush=True)
print(meta, flush=True)
if os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_FULL") == "1":
payload = dict(chat_kwargs)
# Make the payload more legible and less huge.
try:
dumped = json.dumps(payload, ensure_ascii=False, indent=2)
except Exception:
dumped = repr(payload)
print("\n=== ATROPOS_DEBUG_AGENT_REQUEST_FULL ===", flush=True)
print(dumped[:200_000], flush=True)
# Optional: save the FULL request payload to disk (no truncation).
save_dir = os.getenv("ATROPOS_DEBUG_AGENT_REQUEST_SAVE_DIR")
if save_dir:
os.makedirs(save_dir, exist_ok=True)
payload: Dict[str, Any] = dict(chat_kwargs)
if "model" not in payload:
model = self._infer_server_model_for_debug()
if model:
payload["model"] = model
# Use a unique filename so parallel trajectories don't clobber each other.
fname = os.path.join(
save_dir,
f"atropos_agent_request_step{step_num}_{int(time.time()*1000)}_{os.getpid()}_{uuid4().hex}.json",
)
with open(fname, "w", encoding="utf-8") as f:
json.dump(payload, f, ensure_ascii=False, indent=2)
print(f"[AtroposAgent] saved request payload: {fname}", flush=True)
except Exception:
return
def _debug_dump_response(self, *, step_num: int, response: Any) -> None:
if os.getenv("ATROPOS_DEBUG_AGENT_RESPONSE") != "1":
return
print("\n=== ATROPOS_DEBUG_AGENT_RESPONSE ===", flush=True)
print({"step": step_num, "type": type(response).__name__}, flush=True)
try:
dumped = response.model_dump() # openai pydantic model
except Exception:
dumped = getattr(response, "__dict__", {"repr": repr(response)})
# Keep the dump bounded; we only need enough to see the assistant message content.
text = str(dumped)
print(text[:200_000], flush=True)
async def _chat_completion_with_debug(
self, *, managed: Any, step_num: int, chat_kwargs: Dict[str, Any]
) -> Any:
"""
Call `managed.chat_completion()` with optional timeout + richer failure logging.
Debug env vars:
- `ATROPOS_AGENT_CHAT_TIMEOUT_S`: if set, wraps the await in `asyncio.wait_for`.
- `ATROPOS_DEBUG_AGENT_WAIT_EVERY_S`: if set, prints a heartbeat while waiting.
"""
# Hard guardrail: never allow a single chat completion to block for too long.
# This is essential for RL data-gen stability; long hangs should be treated as failures (score=0).
timeout_s_raw = os.getenv("ATROPOS_AGENT_CHAT_TIMEOUT_S")
timeout_s_default = 240.0
timeout_s = float(timeout_s_raw) if timeout_s_raw else timeout_s_default
timeout_s = min(timeout_s, 240.0)
wait_every_raw = os.getenv("ATROPOS_DEBUG_AGENT_WAIT_EVERY_S")
wait_every_s = float(wait_every_raw) if wait_every_raw else None
async def _await_call() -> Any:
if not wait_every_s or wait_every_s <= 0:
return await managed.chat_completion(**chat_kwargs)
# Heartbeat mode: wait in chunks without cancelling the underlying request.
# NOTE: do NOT use `asyncio.wait_for(task, timeout=...)` here, because a timeout
# will cancel the task and surface as `CancelledError` on the next loop.
task = asyncio.create_task(managed.chat_completion(**chat_kwargs))
t0 = time.perf_counter()
try:
while True:
done, _pending = await asyncio.wait({task}, timeout=wait_every_s)
if task in done:
return task.result()
waited = time.perf_counter() - t0
print(
f"[AtroposAgent] step={step_num} still waiting for chat_completion... ({waited:.1f}s)",
flush=True,
)
except asyncio.CancelledError:
task.cancel()
raise
try:
return await asyncio.wait_for(_await_call(), timeout=timeout_s)
except asyncio.TimeoutError as e:
print("\n=== ATROPOS_DEBUG_AGENT_CHAT_TIMEOUT ===", flush=True)
print({"step": step_num, "timeout_s": timeout_s}, flush=True)
raise RuntimeError(f"chat_completion timed out after {timeout_s:.1f}s") from e
except asyncio.CancelledError:
# Treat cancellation as a hard failure rather than crashing the whole env run.
# (Atropos/BaseEnv may cancel tasks during shutdown or retries.)
raise RuntimeError("chat_completion cancelled") from None
except Exception as e:
detail: Dict[str, Any] = {
"step": step_num,
"exc_type": type(e).__name__,
"exc_str": str(e),
}
if isinstance(e, httpx.HTTPStatusError):
try:
detail["status_code"] = e.response.status_code
detail["response_text"] = e.response.text[:20_000]
except Exception:
pass
elif isinstance(e, httpx.RequestError):
detail["request"] = repr(getattr(e, "request", None))
print("\n=== ATROPOS_DEBUG_AGENT_CHAT_FAILURE ===", flush=True)
print(detail, flush=True)
raise
async def run(
self,
task: str,
initial_messages: Optional[List[Dict[str, str]]] = None,
) -> AgentResult:
"""
Run the agent on a task using ManagedServer for token tracking.
Args:
task: The task/prompt for the agent
initial_messages: Optional additional context messages
Returns:
AgentResult with the trajectory, final response, and token data
"""
messages = [
{"role": "system", "content": self._build_system_prompt()},
]
if initial_messages:
messages.extend(initial_messages)
messages.append({"role": "user", "content": task})
steps = []
final_response = ""
final_node = None
final_prompt_messages: Optional[List[Dict[str, str]]] = None
last_node = None
last_prompt_messages: Optional[List[Dict[str, str]]] = None
last_response_text: str = ""
# Use ManagedServer for automatic token tracking
async with self._managed() as managed:
for step_num in range(self.config.max_steps):
# ReACT loop iteration here, just call -> tools -> observe until done (no tools called)
try:
# Keep a copy of the prompt messages used for this completion.
# Useful for reconstructing tokens/masks when state tracking is unavailable.
prompt_messages = list(messages)
chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
if self.config.max_tokens is not None:
chat_kwargs["max_tokens"] = self.config.max_tokens
if self.config.temperature is not None:
chat_kwargs["temperature"] = self.config.temperature
t_req = time.perf_counter()
print(
f"[AtroposAgent] step={step_num+1} chat_completion start "
f"(messages={len(messages)}, max_tokens={self.config.max_tokens}, temp={self.config.temperature})",
flush=True,
)
self._debug_dump_request(step_num=step_num + 1, chat_kwargs=chat_kwargs)
response = await self._chat_completion_with_debug(
managed=managed, step_num=step_num + 1, chat_kwargs=chat_kwargs
)
self._debug_dump_response(step_num=step_num + 1, response=response)
response_meta = self._extract_response_metadata(response)
print(
f"[AtroposAgent] step={step_num+1} chat_completion done in {time.perf_counter() - t_req:.2f}s",
flush=True,
)
current_node = None
if hasattr(managed, "get_state"):
state = managed.get_state()
nodes = state.get("nodes", [])
current_node = nodes[-1] if nodes else None
except Exception as e:
return AgentResult(
success=False,
final_response="",
steps=steps,
error=f"Generation error: {str(e)}",
)
msg = response.choices[0].message
# Some OpenAI-compatible servers populate `message.reasoning` and leave `content=""`.
response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
tool_calls = ToolCall.parse_from_text(response_text)
last_node = current_node
last_prompt_messages = prompt_messages
last_response_text = response_text
step_sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None
if step_sequence_data is None:
if response_meta:
# We still want metadata for debugging even if token/logprob state tracking is unavailable.
step_sequence_data = SequenceData(
full_text=response_text,
tokens=[],
masked_tokens=[],
logprobs=[],
metadata=response_meta,
)
else:
merged = dict(response_meta)
node_meta = step_sequence_data.metadata
if isinstance(node_meta, dict):
merged.update(node_meta)
step_sequence_data.metadata = merged or step_sequence_data.metadata
step = AgentStep(
step_number=step_num + 1,
assistant_message=response_text,
tool_calls=tool_calls,
sequence_data=step_sequence_data,
)
if not tool_calls:
steps.append(step)
final_response = response_text
final_node = current_node
final_prompt_messages = prompt_messages
break
messages.append({"role": "assistant", "content": response_text})
tool_responses = []
for call in tool_calls:
result = await self.execute_tool(call)
step.tool_results.append(result)
tool_responses.append(result.to_xml())
if self.config.tool_delay_s > 0:
await asyncio.sleep(self.config.tool_delay_s)
steps.append(step)
responses_text = "\n".join(tool_responses)
# Tool observations are represented as user content with Hermes-style tags.
# This is compatible with most OpenAI-compatible chat APIs and ensures
# tokenizers/chat templates include tool outputs during training.
messages.append({"role": "user", "content": responses_text})
else:
# Reached max steps without completing
# Return a failure result but include the last observed completion so callers can
# record the trajectory (score=0) without triggering retries.
final_response = last_response_text or final_response
final_node = last_node
final_prompt_messages = last_prompt_messages
trajectory_data = None
if final_node:
trajectory_data = SequenceData.from_sequence_node(final_node)
elif final_prompt_messages is not None and self.tokenizer is not None:
if hasattr(self.tokenizer, "apply_chat_template"):
prompt_text = self.tokenizer.apply_chat_template(
final_prompt_messages, tokenize=False, add_generation_prompt=True
)
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
else:
prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
tokens = prompt_tokens + output_tokens
masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
trajectory_data = SequenceData(
full_text=f"{prompt_text}{final_response}",
tokens=tokens,
masked_tokens=masked_tokens,
logprobs=logprobs,
)
# Preserve response metadata (if any) even on failure trajectories.
try:
if trajectory_data is not None and steps:
last_step = steps[-1]
if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
trajectory_data.metadata = dict(last_step.sequence_data.metadata)
except Exception:
pass
return AgentResult(
success=False,
final_response=final_response,
steps=steps,
error=f"Reached maximum steps ({self.config.max_steps})",
trajectory_data=trajectory_data,
)
# Build result with trajectory data
trajectory_data = None
if final_node:
trajectory_data = SequenceData.from_sequence_node(final_node)
elif final_prompt_messages is not None and self.tokenizer is not None:
if hasattr(self.tokenizer, "apply_chat_template"):
prompt_text = self.tokenizer.apply_chat_template(
final_prompt_messages, tokenize=False, add_generation_prompt=True
)
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=False)
else:
prompt_text = "\n".join([f"{m['role']}: {m['content']}" for m in final_prompt_messages])
prompt_tokens = self.tokenizer.encode(prompt_text, add_special_tokens=True)
output_tokens = self.tokenizer.encode(final_response, add_special_tokens=False)
tokens = prompt_tokens + output_tokens
masked_tokens = ([-100] * len(prompt_tokens)) + output_tokens
logprobs = ([1.0] * len(prompt_tokens)) + ([0.0] * len(output_tokens))
trajectory_data = SequenceData(
full_text=f"{prompt_text}{final_response}",
tokens=tokens,
masked_tokens=masked_tokens,
logprobs=logprobs,
)
# Ensure trajectory_data carries the most recent metadata we observed (if any).
try:
if trajectory_data is not None and steps:
last_step = steps[-1]
if last_step.sequence_data and isinstance(last_step.sequence_data.metadata, dict):
trajectory_data.metadata = dict(last_step.sequence_data.metadata)
except Exception:
pass
return AgentResult(
success=True,
final_response=final_response,
steps=steps,
trajectory_data=trajectory_data,
)
async def run_single_turn(
self,
messages: List[Dict[str, str]],
execute_tools: bool = True,
) -> tuple[str, List[ToolResult], Optional[SequenceData]]:
"""
Run a single turn of the agent (one LLM call + tool execution).
This is useful for integration with BaseEnv where you want more
control over the loop.
Args:
messages: The conversation history
execute_tools: Whether to execute parsed tool calls
Returns:
Tuple of (response_text, tool_results, sequence_data)
"""
async with self._managed() as managed:
chat_kwargs: Dict[str, Any] = {"messages": messages, "n": 1}
if self.config.max_tokens is not None:
chat_kwargs["max_tokens"] = self.config.max_tokens
if self.config.temperature is not None:
chat_kwargs["temperature"] = self.config.temperature
self._debug_dump_request(step_num=1, chat_kwargs=chat_kwargs)
response = await self._chat_completion_with_debug(managed=managed, step_num=1, chat_kwargs=chat_kwargs)
self._debug_dump_response(step_num=1, response=response)
current_node = None
if hasattr(managed, "get_state"):
state = managed.get_state()
nodes = state.get("nodes", [])
current_node = nodes[-1] if nodes else None
msg = response.choices[0].message
response_text = (msg.content or "") or (getattr(msg, "reasoning", None) or "")
tool_results = []
if execute_tools:
tool_calls = ToolCall.parse_from_text(response_text)
for call in tool_calls:
result = await self.execute_tool(call)
tool_results.append(result)
sequence_data = SequenceData.from_sequence_node(current_node) if current_node else None
return response_text, tool_results, sequence_data
class _DirectChatCompletionClient:
"""
Minimal stand-in for ManagedServer that calls the OpenAI-compatible endpoint directly.
This is for isolating issues where `ManagedServer.chat_completion()` hangs or misbehaves.
It intentionally does NOT do token/logprob tracking.
"""
def __init__(self, server: Any):
self._server = server
def _server_config(self) -> tuple[str, str, str]:
# ServerManager case: first configured server.
servers = getattr(self._server, "servers", None)
if isinstance(servers, list) and servers:
s0 = servers[0]
cfg = getattr(s0, "config", None)
base_url = getattr(cfg, "base_url", None) or getattr(s0, "base_url", None)
api_key = getattr(cfg, "api_key", None) or getattr(s0, "api_key", None)
model = getattr(cfg, "model_name", None) or getattr(s0, "model_name", None)
if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
return base_url.rstrip("/"), api_key, model
# APIServer-like fallback.
base_url = getattr(self._server, "base_url", None)
api_key = getattr(self._server, "api_key", None)
model = getattr(self._server, "model_name", None) or getattr(self._server, "model", None)
if isinstance(base_url, str) and isinstance(api_key, str) and isinstance(model, str):
return base_url.rstrip("/"), api_key, model
raise RuntimeError("Unable to resolve server base_url/api_key/model for direct chat completion")
async def chat_completion(self, *, messages: List[Dict[str, str]], n: int = 1, **kwargs: Any) -> Any:
base_url, api_key, model = self._server_config()
url = f"{base_url}/chat/completions"
payload: Dict[str, Any] = {
"model": model,
"messages": messages,
"n": n,
}
# Pass through common generation kwargs.
for k in ("max_tokens", "temperature", "top_p", "presence_penalty", "frequency_penalty", "stop"):
if k in kwargs and kwargs[k] is not None:
payload[k] = kwargs[k]
timeout_s = float(os.getenv("ATROPOS_DIRECT_REQUEST_TIMEOUT_S") or "120")
print(f"[AtroposAgent] DIRECT chat_completion POST {url} (timeout={timeout_s}s)", flush=True)
async with httpx.AsyncClient(timeout=timeout_s) as client:
resp = await client.post(
url,
headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
json=payload,
)
resp.raise_for_status()
data = resp.json()
# Return a very small object compatible with the code paths that read
# `response.choices[0].message.content`.
class _Msg:
def __init__(self, d: Dict[str, Any]):
self.content = d.get("content")
self.reasoning = d.get("reasoning")
class _Choice:
def __init__(self, d: Dict[str, Any]):
self.message = _Msg(d.get("message") or {})
class _Resp:
def __init__(self, d: Dict[str, Any]):
self._d = d
self.choices = [_Choice(c) for c in (d.get("choices") or [])]
def model_dump(self) -> Dict[str, Any]:
return self._d
return _Resp(data)

6
atropos/api/__init__.py Normal file
View File

@@ -0,0 +1,6 @@
"""
FastAPI services for atropos-agent.
- tool_executor_server: queued/batched sandbox tool execution (Phase 4)
"""

View File

@@ -0,0 +1,254 @@
"""
Tool Executor API (Phase 4)
This service provides a queued, batched execution layer on top of a ToolBackend.
It mirrors the stateful FastAPI + app.state pattern used in:
atropos/atroposlib/api/server.py
Run (dev):
uv run uvicorn atropos_agent.api.tool_executor_server:app --host 0.0.0.0 --port 9001
"""
from __future__ import annotations
import os
from typing import Any, Dict, Optional
from pathlib import Path
from fastapi import FastAPI, Header, HTTPException, status
from pydantic import BaseModel, Field
from ..backends.nomad_backend import NomadBackendConfig, NomadToolBackend
from ..tools import ToolRegistry, build_tool_registry
from ..tools.base import (
ArtifactArchiveRequestPayload,
ArtifactArchiveResponsePayload,
ArtifactListRequestPayload,
ArtifactListResponsePayload,
ArtifactReadRequestPayload,
ArtifactReadResponsePayload,
ToolExecutorExecuteRequest,
ToolExecutorReleaseRequest,
ToolResultPayload,
)
from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig
class ToolExecutorServerConfig(BaseModel):
nomad_address: str = Field(default="http://localhost:4646")
job_id: str = Field(default="atropos-sandbox-tool-executor")
image: str = Field(default="atropos-sandbox:local")
slots_per_container: int = Field(default=10)
min_containers: int = Field(default=1)
max_containers: int = Field(default=10)
privileged: bool = Field(default=False)
acquire_timeout_s: float = Field(default=30.0)
batch_window_ms: int = Field(default=20)
max_batch_size: int = Field(default=200)
allow_network: bool = Field(default=True)
tool_server_url: Optional[str] = Field(default=None)
tool_server_token: Optional[str] = Field(default=None)
token: Optional[str] = Field(default=None, description="Bearer token required for requests (optional in dev).")
purge_job_on_shutdown: bool = Field(default=True)
@classmethod
def from_env(cls) -> "ToolExecutorServerConfig":
# In dev, prefer loading secrets/config from the repo-local `.env` (not committed).
try:
from dotenv import load_dotenv # type: ignore
except Exception: # pragma: no cover
load_dotenv = None # type: ignore[assignment]
if load_dotenv is not None:
env_path = Path(__file__).resolve().parents[2] / ".env"
if env_path.exists():
load_dotenv(dotenv_path=env_path)
def _get_bool(name: str, default: bool) -> bool:
raw = os.getenv(name)
if raw is None:
return default
return raw.strip().lower() in {"1", "true", "yes", "y", "on"}
return cls(
nomad_address=os.getenv("TOOL_EXECUTOR_NOMAD_ADDRESS", "http://localhost:4646"),
job_id=os.getenv("TOOL_EXECUTOR_JOB_ID", "atropos-sandbox-tool-executor"),
image=os.getenv("TOOL_EXECUTOR_IMAGE", "atropos-sandbox:local"),
slots_per_container=int(os.getenv("TOOL_EXECUTOR_SLOTS", "10")),
min_containers=int(os.getenv("TOOL_EXECUTOR_MIN_CONTAINERS", "1")),
max_containers=int(os.getenv("TOOL_EXECUTOR_MAX_CONTAINERS", "10")),
privileged=_get_bool("TOOL_EXECUTOR_PRIVILEGED", False),
acquire_timeout_s=float(os.getenv("TOOL_EXECUTOR_ACQUIRE_TIMEOUT_S", "30.0")),
batch_window_ms=int(os.getenv("TOOL_EXECUTOR_BATCH_WINDOW_MS", "20")),
max_batch_size=int(os.getenv("TOOL_EXECUTOR_MAX_BATCH_SIZE", "200")),
allow_network=_get_bool("TOOL_EXECUTOR_ALLOW_NETWORK", True),
tool_server_url=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_URL") or None,
tool_server_token=os.getenv("TOOL_EXECUTOR_TOOL_SERVER_TOKEN") or None,
token=os.getenv("TOOL_EXECUTOR_TOKEN") or None,
purge_job_on_shutdown=_get_bool("TOOL_EXECUTOR_PURGE_JOB_ON_SHUTDOWN", True),
)
app = FastAPI(title="Atropos-Agent Tool Executor")
@app.get("/")
async def root() -> Dict[str, str]:
return {"message": "Atropos-Agent Tool Executor"}
def _check_auth(cfg: ToolExecutorServerConfig, authorization: Optional[str]) -> None:
if not cfg.token:
return
if not authorization:
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
if not authorization.lower().startswith("bearer "):
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
token = authorization.split(" ", 1)[1].strip()
if token != cfg.token:
raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")
@app.on_event("startup")
async def _startup() -> None:
cfg = ToolExecutorServerConfig.from_env()
# Default to Atropos "full" tool surface: sandbox + external (if tool_server_url provided).
tools: ToolRegistry = build_tool_registry(
enabled_toolsets=["full"],
disabled_toolsets=None,
tool_server_url=cfg.tool_server_url,
)
backend = NomadToolBackend(
NomadBackendConfig(
nomad_address=cfg.nomad_address,
sandbox_job_id=cfg.job_id,
sandbox_image=cfg.image,
slots_per_container=cfg.slots_per_container,
min_containers=cfg.min_containers,
max_containers=cfg.max_containers,
privileged=cfg.privileged,
acquire_timeout_s=cfg.acquire_timeout_s,
purge_job_on_start=False,
)
)
await backend.start()
executor = ToolExecutor(
backend=backend,
tools=tools,
config=ToolExecutorConfig(
batch_window_ms=cfg.batch_window_ms,
max_batch_size=cfg.max_batch_size,
allow_network=cfg.allow_network,
tool_server_url=cfg.tool_server_url,
tool_server_token=cfg.tool_server_token,
),
)
await executor.start()
app.state.cfg = cfg
app.state.backend = backend
app.state.executor = executor
@app.on_event("shutdown")
async def _shutdown() -> None:
executor: Optional[ToolExecutor] = getattr(app.state, "executor", None)
backend: Optional[NomadToolBackend] = getattr(app.state, "backend", None)
cfg: Optional[ToolExecutorServerConfig] = getattr(app.state, "cfg", None)
if executor is not None:
await executor.close()
if backend is not None:
await backend.stop(purge=bool(cfg.purge_job_on_shutdown) if cfg else False)
@app.get("/health")
async def health() -> Dict[str, Any]:
return {"status": "ok"}
@app.get("/status")
async def status_endpoint() -> Dict[str, Any]:
executor: ToolExecutor = app.state.executor
backend: NomadToolBackend = app.state.backend
return {
"queue_size": executor.queue_size(),
"total_requests": executor.total_requests,
"total_errors": executor.total_errors,
"pool": backend.get_stats(),
}
@app.post("/execute", response_model=ToolResultPayload)
async def execute_tool(
req: ToolExecutorExecuteRequest,
authorization: Optional[str] = Header(default=None),
status_code: int = status.HTTP_200_OK, # noqa: B008
) -> ToolResultPayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
result = await executor.execute(
trajectory_id=req.trajectory_id,
call=req.tool.to_tool_call(),
timeout_s=req.timeout_s,
)
return ToolResultPayload.from_tool_result(result)
@app.post("/release")
async def release_trajectory(
req: ToolExecutorReleaseRequest,
authorization: Optional[str] = Header(default=None),
) -> Dict[str, Any]:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
await executor.release_trajectory(req.trajectory_id, reset_workspace=req.reset_workspace)
return {"status": "ok"}
@app.post("/artifacts/read", response_model=ArtifactReadResponsePayload)
async def artifacts_read(
req: ArtifactReadRequestPayload,
authorization: Optional[str] = Header(default=None),
) -> ArtifactReadResponsePayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
return await executor.read_artifact(req)
@app.post("/artifacts/list", response_model=ArtifactListResponsePayload)
async def artifacts_list(
req: ArtifactListRequestPayload,
authorization: Optional[str] = Header(default=None),
) -> ArtifactListResponsePayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
return await executor.list_artifacts(req)
@app.post("/artifacts/archive", response_model=ArtifactArchiveResponsePayload)
async def artifacts_archive(
req: ArtifactArchiveRequestPayload,
authorization: Optional[str] = Header(default=None),
) -> ArtifactArchiveResponsePayload:
cfg: ToolExecutorServerConfig = app.state.cfg
_check_auth(cfg, authorization)
executor: ToolExecutor = app.state.executor
return await executor.archive_artifacts(req)
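# Client-side usage sketch (not part of this file): exercising /execute and /release
# over HTTP. The base URL/port are hypothetical, and the JSON shape of `tool` is
# assumed from the payload usage above (name / arguments); adjust if the payload
# models differ.
import asyncio

import httpx

async def demo() -> None:
    headers = {"Authorization": "Bearer change-me"}  # only needed when TOOL_EXECUTOR_TOKEN is set
    async with httpx.AsyncClient(base_url="http://localhost:9001") as client:
        resp = await client.post(
            "/execute",
            headers=headers,
            json={
                "trajectory_id": "demo-1",
                "tool": {"name": "terminal", "arguments": {"command": "echo hi"}},
                "timeout_s": 30.0,
            },
        )
        print(resp.json())
        # Free the trajectory's slot and reset its workspace when finished.
        await client.post(
            "/release",
            headers=headers,
            json={"trajectory_id": "demo-1", "reset_workspace": True},
        )

asyncio.run(demo())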

atropos/api/tool_server.py (new file)

@@ -0,0 +1,140 @@
"""
External ToolServer (Phase 4.5+).
This server executes tools that must NOT run inside the sandbox, typically
because they require credentials or access to external services.
Run (dev):
uv run uvicorn atropos.api.tool_server:app --host 0.0.0.0 --port 9002
"""
from __future__ import annotations
import asyncio
import os
import inspect
from typing import Any, Dict, List, Optional
from pathlib import Path
from fastapi import FastAPI, Header, HTTPException, status
from pydantic import BaseModel, Field
from ..tools import ToolRegistry, build_tool_registry
from ..tools.base import ToolResultPayload, ToolServerExecuteRequest
class ToolServerConfig(BaseModel):
token: Optional[str] = Field(
default=None,
description="Bearer token required for requests (optional in dev).",
)
max_concurrency: int = Field(default=16, ge=1, description="Max concurrent tool executions.")
@classmethod
def from_env(cls) -> "ToolServerConfig":
# In dev, prefer loading secrets from the repo-local `.env` (not committed).
try:
from dotenv import load_dotenv # type: ignore
except Exception: # pragma: no cover
load_dotenv = None # type: ignore[assignment]
if load_dotenv is not None:
env_path = Path(__file__).resolve().parents[2] / ".env"
if env_path.exists():
load_dotenv(dotenv_path=env_path)
token = os.getenv("TOOL_SERVER_TOKEN") or None
max_concurrency = int(os.getenv("TOOL_SERVER_MAX_CONCURRENCY", "16"))
return cls(token=token, max_concurrency=max_concurrency)
app = FastAPI(title="Atropos-Agent Tool Server")
@app.get("/")
async def root() -> Dict[str, str]:
return {"message": "Atropos-Agent Tool Server"}
@app.on_event("startup")
async def _startup() -> None:
cfg = ToolServerConfig.from_env()
# External-only registry. It will only include tools that are enabled by toolsets and
# whose Hermes requirements/keys are satisfied in this process.
tools: ToolRegistry = build_tool_registry(
enabled_toolsets=["all"],
disabled_toolsets=["terminal", "sandbox", "filesystem", "terminal_stateful", "default"],
tool_server_url="enabled",
)
app.state.cfg = cfg
app.state.tools = tools
app.state.semaphore = asyncio.Semaphore(cfg.max_concurrency)
@app.get("/health")
async def health() -> Dict[str, Any]:
return {"status": "ok"}
@app.get("/tools")
async def list_tools() -> Dict[str, Any]:
tools: ToolRegistry = app.state.tools
return {"tools": [s.to_dict() for s in tools.get_schemas()]}
def _check_auth(cfg: ToolServerConfig, authorization: Optional[str]) -> None:
if not cfg.token:
return
if not authorization:
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Missing Authorization header")
if not authorization.lower().startswith("bearer "):
raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid Authorization header")
token = authorization.split(" ", 1)[1].strip()
if token != cfg.token:
raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail="Invalid token")
@app.post("/execute", response_model=ToolResultPayload)
async def execute_tool(
req: ToolServerExecuteRequest,
authorization: Optional[str] = Header(default=None),
) -> ToolResultPayload:
cfg: ToolServerConfig = app.state.cfg
_check_auth(cfg, authorization)
tools: ToolRegistry = app.state.tools
sem: asyncio.Semaphore = app.state.semaphore
tool = tools.get(req.tool.name)
if tool is None:
return ToolResultPayload(
success=False,
error=f"Unknown tool: {req.tool.name}",
uniq_id=req.tool.uniq_id,
)
async with sem:
try:
kwargs = dict(req.tool.arguments)
sig = inspect.signature(tool.execute).parameters
# Some tools can benefit from extra context.
if req.trajectory_id and "trajectory_id" in sig:
kwargs["trajectory_id"] = req.trajectory_id
if req.slot_id and "slot_id" in sig:
kwargs["slot_id"] = req.slot_id
if req.container_addr and "container_addr" in sig:
kwargs["container_addr"] = req.container_addr
if "task_id" in sig:
kwargs["task_id"] = req.trajectory_id
result = await tool.execute(**kwargs)
except Exception as e:
return ToolResultPayload(
success=False,
error=f"Tool execution error: {e}",
uniq_id=req.tool.uniq_id,
)
if result.uniq_id is None:
result.uniq_id = req.tool.uniq_id
return ToolResultPayload.from_tool_result(result)
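# Illustrative sketch: a hypothetical external tool that opts into injected context.
# Because /execute inspects the signature of `tool.execute`, simply declaring
# `trajectory_id` (or `slot_id` / `container_addr`) is enough to receive it.
class EchoContextTool:
    name = "echo_context"

    async def execute(self, *, text: str, trajectory_id: str | None = None):
        # A real tool would return the project's ToolResult type; this sketch only
        # shows which keyword arguments the server can inject.
        return {"text": text, "trajectory_id": trajectory_id}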


@@ -0,0 +1,27 @@
from __future__ import annotations
from typing import Any
from .base import ToolBackend
from .modal_backend import ModalSandboxConfig, ModalToolBackend
from .nomad_backend import NomadBackendConfig, NomadToolBackend
def create_tool_backend(cfg: Any) -> ToolBackend:
mode = str(getattr(cfg, "tool_pool_mode", "nomad")).strip().lower()
if mode == "nomad":
return NomadToolBackend(NomadBackendConfig.from_agent_env_config(cfg))
if mode == "modal":
return ModalToolBackend(ModalSandboxConfig.from_agent_env_config(cfg))
raise ValueError(f"Unknown tool_pool_mode: {mode}")
__all__ = [
"ToolBackend",
"create_tool_backend",
"NomadBackendConfig",
"NomadToolBackend",
"ModalSandboxConfig",
"ModalToolBackend",
]
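# Usage sketch: create_tool_backend() only reads duck-typed attributes, so a
# SimpleNamespace is enough for quick experiments (values here are examples).
from types import SimpleNamespace

cfg = SimpleNamespace(
    tool_pool_mode="nomad",
    nomad_address="http://localhost:4646",
    sandbox_job_id="atropos-sandbox-demo",
    sandbox_image="atropos-sandbox:local",
    slots_per_container=10,
    min_containers=1,
    max_containers=2,
    privileged=False,
    acquire_timeout_s=30.0,
)
backend = create_tool_backend(cfg)  # -> NomadToolBackend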

atropos/backends/base.py (new file)

@@ -0,0 +1,89 @@
"""
Backend interfaces for AgentEnv tool execution.
The goal of this module is to decouple ToolExecutor / AgentEnv from any single
execution backend (Nomad/Docker today; Modal later).
"""
from __future__ import annotations
from typing import Any, Dict, List, Optional, Protocol, Tuple
from ..slots.executor import ExecutionResult
from ..slots.slot import Slot
class ToolBackend(Protocol):
"""
Minimal interface required by ToolExecutor.
Backends provide:
- lifecycle (start/stop)
- slot acquisition/release (workspace affinity)
- batched tool execution across slots
- optional artifact helpers (for env verification / demos)
"""
@property
def default_timeout_s(self) -> Optional[float]:
"""Default sandbox execution timeout in seconds (if any)."""
async def start(self) -> None:
"""Start the backend (provision workers/containers, health checks, etc)."""
async def stop(self, *, purge: bool = False) -> None:
"""Stop the backend and optionally purge remote resources."""
async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
"""Acquire a slot for a trajectory (workspace affinity)."""
async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
"""Release a slot back to the pool."""
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
*,
timeout_s: Optional[float] = None,
) -> List[ExecutionResult]:
"""Execute a batch of sandbox tool calls and return results in order."""
# ---------------------------------------------------------------------
# Optional artifact helpers (supported by the Nomad sandbox-server today)
# ---------------------------------------------------------------------
async def read_artifact(
self,
slot: Slot,
path: str,
*,
encoding: str = "text",
max_bytes: Optional[int] = None,
include_sha256: bool = False,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
raise NotImplementedError
async def list_artifacts(
self,
slot: Slot,
path: str = ".",
*,
recursive: bool = False,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
raise NotImplementedError
async def archive_artifacts(
self,
slot: Slot,
path: str = ".",
*,
archive_format: str = "tar.gz",
max_bytes: Optional[int] = None,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
raise NotImplementedError
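# Backend-agnostic sketch written against the Protocol. It assumes the `str`
# element of each batch tuple is the sandbox tool name (e.g. "terminal"), matching
# how ToolCall(name="terminal", arguments={"command": ...}) is used elsewhere.
async def run_one_command(backend: ToolBackend) -> None:
    await backend.start()
    slot = await backend.acquire("demo-trajectory")
    try:
        results = await backend.execute_batch(
            [(slot, "terminal", {"command": "echo hello"})],
            timeout_s=30.0,
        )
        print(results[0])
    finally:
        await backend.release(slot, reset_workspace=True)
        await backend.stop(purge=False)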

File diff suppressed because it is too large.


@@ -0,0 +1,156 @@
"""
Nomad/Docker tool backend.
This backend is the current default for AgentEnv: it provisions a Nomad job
running `sandbox_server.py` and multiplexes stateless slots inside each container.
"""
from __future__ import annotations
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple
from ..slots import Slot, SlotPool, SlotPoolConfig
from ..slots.executor import ExecutionResult
from .base import ToolBackend
@dataclass(frozen=True)
class NomadBackendConfig:
nomad_address: str
sandbox_job_id: str
sandbox_image: str
slots_per_container: int
min_containers: int
max_containers: int
privileged: bool
acquire_timeout_s: float
purge_job_on_start: bool
# Driver selection: "docker" or "singularity"
driver: str = "docker"
# Path to .sif file for singularity driver (required if driver="singularity")
singularity_image: Optional[str] = None
@classmethod
def from_agent_env_config(cls, cfg: Any) -> "NomadBackendConfig":
return cls(
nomad_address=str(getattr(cfg, "nomad_address")),
sandbox_job_id=str(getattr(cfg, "sandbox_job_id")),
sandbox_image=str(getattr(cfg, "sandbox_image")),
slots_per_container=int(getattr(cfg, "slots_per_container")),
min_containers=int(getattr(cfg, "min_containers")),
max_containers=int(getattr(cfg, "max_containers")),
privileged=bool(getattr(cfg, "privileged")),
acquire_timeout_s=float(getattr(cfg, "acquire_timeout_s")),
purge_job_on_start=bool(getattr(cfg, "purge_job_on_start", False)),
driver=str(getattr(cfg, "driver", "docker")),
singularity_image=getattr(cfg, "singularity_image", None),
)
class NomadToolBackend(ToolBackend):
def __init__(self, config: NomadBackendConfig):
self.config = config
self.pool = SlotPool(
SlotPoolConfig(
nomad_address=config.nomad_address,
job_id=config.sandbox_job_id,
image=config.sandbox_image,
slots_per_container=config.slots_per_container,
min_containers=config.min_containers,
max_containers=config.max_containers,
privileged=config.privileged,
acquire_timeout=config.acquire_timeout_s,
purge_job_on_start=bool(config.purge_job_on_start),
driver=config.driver,
singularity_image=config.singularity_image,
)
)
@property
def default_timeout_s(self) -> Optional[float]:
t = getattr(self.pool.executor, "timeout", None)
total = getattr(t, "total", None)
try:
return float(total) if total is not None else None
except Exception:
return None
async def start(self) -> None:
await self.pool.start()
async def stop(self, *, purge: bool = False) -> None:
await self.pool.stop(purge_job=purge)
async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
return await self.pool.acquire(trajectory_id)
async def release(self, slot: Slot, *, reset_workspace: bool = False) -> None:
await self.pool.release(slot, reset_workspace=reset_workspace)
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
*,
timeout_s: Optional[float] = None,
) -> List[ExecutionResult]:
return await self.pool.execute_batch(requests, timeout=timeout_s)
async def read_artifact(
self,
slot: Slot,
path: str,
*,
encoding: str = "text",
max_bytes: Optional[int] = None,
include_sha256: bool = False,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
return await self.pool.executor.read_artifact(
slot,
path,
encoding=encoding,
max_bytes=max_bytes,
include_sha256=include_sha256,
timeout=timeout_s,
)
async def list_artifacts(
self,
slot: Slot,
path: str = ".",
*,
recursive: bool = False,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
return await self.pool.executor.list_artifacts(
slot,
path,
recursive=recursive,
max_entries=max_entries,
timeout=timeout_s,
)
async def archive_artifacts(
self,
slot: Slot,
path: str = ".",
*,
archive_format: str = "tar.gz",
max_bytes: Optional[int] = None,
max_entries: Optional[int] = None,
timeout_s: Optional[float] = None,
) -> Dict[str, Any]:
return await self.pool.executor.archive_artifacts(
slot,
path,
archive_format=archive_format,
max_bytes=max_bytes,
max_entries=max_entries,
timeout=timeout_s,
)
def get_stats(self) -> Dict[str, Any]:
return self.pool.get_stats()
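# Example (values illustrative): selecting the Singularity driver for HPC hosts
# where Docker is unavailable; `singularity_image` must point at a pre-built .sif.
config = NomadBackendConfig(
    nomad_address="http://localhost:4646",
    sandbox_job_id="atropos-sandbox-hpc",
    sandbox_image="atropos-sandbox:local",
    slots_per_container=10,
    min_containers=1,
    max_containers=4,
    privileged=False,
    acquire_timeout_s=30.0,
    purge_job_on_start=True,
    driver="singularity",
    singularity_image="/scratch/images/atropos-sandbox.sif",
)
backend = NomadToolBackend(config)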

atropos/envs/__init__.py (new file)

@@ -0,0 +1,10 @@
"""
Environment implementations for atropos-agent.
"""
from .agent_env import AgentEnv, AgentEnvConfig
# NOTE: Additional example envs exist as modules (e.g. `test_env`, `swe_smith_oracle_env`),
# but are intentionally not imported here to avoid pulling heavy optional deps at import time.
__all__ = ["AgentEnv", "AgentEnvConfig"]

atropos/envs/agent_env.py (new file)

@@ -0,0 +1,537 @@
"""
AgentEnv - Atropos BaseEnv extension for agent/tool-call workloads.
AgentEnv is responsible for starting the sandbox tool execution backend and
providing helpers for running agent trajectories with queued/batched tool calls.
"""
from __future__ import annotations
import os
import asyncio
import time
import uuid
from abc import ABC, abstractmethod
from typing import Any, Awaitable, Callable, Dict, Generic, List, Optional, Tuple, TypeVar
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, BaseEnv, BaseEnvConfig, Item, ScoredDataGroup, ScoredDataItem
from atroposlib.envs.server_handling.server_baseline import AsyncSemWithAdaptiveWeight
from ..agent import AgentConfig, AgentResult, AtroposAgent
from ..backends import ToolBackend, create_tool_backend
from ..tools import ToolRegistry, build_tool_registry
from ..tools.tool_executor import ToolExecutor, ToolExecutorConfig
# Main BaseEnv child classes. Subclass these to get agent + tooling functionality.
class AgentEnvConfig(BaseEnvConfig):
tool_pool_mode: str = Field(default="nomad", description="Tool execution backend ('nomad' or 'modal')")
allow_network: bool = Field(
default=True,
description="Whether sandbox bash commands may access the network (env policy).",
)
require_sandbox: bool = Field(
default=False,
description="Fail closed if bubblewrap sandboxing is unavailable/unusable for stateless sandbox tools.",
)
require_stateful_sandbox: bool = Field(
default=False,
description="Fail closed if bubblewrap/PID isolation is unavailable for stateful terminal tools (tmux).",
)
tool_batch_window_ms: int = Field(default=20, description="ToolExecutor batching window (ms)")
tool_max_batch_size: int = Field(default=200, description="ToolExecutor maximum batch size")
# nomad mode settings. TODO: Add Modal support, split this into own config
nomad_address: str = Field(default="http://localhost:4646", description="Nomad API address")
sandbox_job_id: str = Field(default="atropos-sandbox-agent-env", description="Nomad job id for sandbox containers")
sandbox_image: str = Field(default="atropos-sandbox:local", description="Docker image for sandbox containers")
slots_per_container: int = Field(default=10, description="Nomad mode: slots per container")
min_containers: int = Field(default=1, description="Nomad mode: minimum containers")
max_containers: int = Field(default=10, description="Nomad mode: maximum containers")
privileged: bool = Field(default=False, description="Nomad mode: run container privileged")
acquire_timeout_s: float = Field(default=30.0, description="Slot acquisition timeout (seconds)")
purge_job_on_start: bool = Field(
default=False,
description=(
"Nomad mode: stop/purge the sandbox job on startup. This is helpful in local dev and training runs "
"to recover from previous crashes that leave the job in a restart backoff state."
),
)
purge_job_on_shutdown: bool = Field(default=True, description="Nomad mode: stop/purge job on shutdown")
# Nomad driver selection (docker or singularity)
driver: str = Field(
default="docker",
description="Nomad task driver: 'docker' (default) or 'singularity' (for HPC without sudo Docker)",
)
singularity_image: Optional[str] = Field(
default=None,
description="Path to .sif file for Singularity driver (required if driver='singularity')",
)
# Modal mode settings
modal_app_name: str = Field(default="atropos-sandbox", description="Modal app name prefix")
modal_image: str = Field(default="python:3.11", description="Modal: container image")
modal_gpu: Optional[str] = Field(default=None, description="Modal: GPU type (None, 'T4', 'A10G', 'A100', 'H100')")
modal_cpu: float = Field(default=1.0, description="Modal: CPU cores")
modal_memory: int = Field(default=2048, description="Modal: memory in MB")
modal_slots_per_sandbox: int = Field(default=10, description="Modal: slots per sandbox")
modal_min_sandboxes: int = Field(default=1, description="Modal: minimum sandboxes")
modal_max_sandboxes: int = Field(default=5, description="Modal: maximum sandboxes")
modal_idle_timeout: int = Field(default=120, description="Modal: server-side idle timeout (seconds)")
modal_max_lifetime: int = Field(default=3600, description="Modal: max sandbox lifetime (seconds)")
modal_acquire_timeout: float = Field(default=60.0, description="Modal: slot acquisition timeout (seconds)")
modal_execution_timeout: float = Field(default=30.0, description="Modal: default command execution timeout (seconds)")
modal_secrets: str = Field(default="", description="Modal: comma-separated list of Modal Secret names")
modal_env_vars: str = Field(default="", description="Modal: semicolon-separated KEY=VALUE pairs for env vars")
modal_workspace_base: str = Field(default="/data", description="Modal: workspace base directory in sandbox")
# basic agent defaults
agent_max_steps: int = Field(default=50, description="Max ReACT steps per trajectory")
agent_temperature: float = Field(default=0.7, description="Sampling temperature")
agent_max_tokens: Optional[int] = Field(
default=None,
description="Max tokens per model response (default: let backend decide)",
)
agent_tool_delay_s: float = Field(default=0.0, description="Delay between tool calls (seconds)")
# tool selection
enabled_toolsets: List[str] = Field(
default_factory=lambda: ["default"],
description="Toolsets to enable (Hermes-style grouping).",
)
disabled_toolsets: List[str] = Field(
default_factory=list,
description="Toolsets to disable (applied after enabled_toolsets).",
)
# external ToolServer routing (Phase 4.5+)
tool_server_url: Optional[str] = Field(
default=None,
description="Base URL for external ToolServer (enables external tools).",
)
tool_server_token: Optional[str] = Field(
default=None,
description="Bearer token for ToolServer auth (optional in dev).",
)
AgentEnvConfigT = TypeVar("AgentEnvConfigT", bound="AgentEnvConfig")
class AgentEnv(BaseEnv, ABC, Generic[AgentEnvConfigT]):
env_config_cls = AgentEnvConfig
def __init__(
self,
config: AgentEnvConfigT,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self.config: AgentEnvConfigT = config
self.tools: ToolRegistry = self.build_tools()
self._backend: Optional[ToolBackend] = None
self._tool_executor: Optional[ToolExecutor] = None
self._tool_server_inprocess: bool = False
self._trajectory_workspace_meta: Dict[str, Dict[str, Any]] = {}
def build_tools(self) -> ToolRegistry:
"""Wraps original Hermes-Agent ToolRegistry for atropos AgentEnv use.
See the Hermes-Agent docs for the available toolsets and tools.
"""
return build_tool_registry(
enabled_toolsets=self.config.enabled_toolsets or ["default"],
disabled_toolsets=self.config.disabled_toolsets or None,
tool_server_url=self.config.tool_server_url,
)
@abstractmethod
def build_task(self, item: Item) -> str:
"""Return the user-facing task string for the agent."""
@abstractmethod
async def score_trajectory(self, item: Item, final_response: str) -> float:
"""Return a scalar score for this trajectory."""
async def setup_trajectory_workspace(
self,
item: Item,
*,
trajectory_id: str,
exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
) -> Dict[str, Any]:
"""
Optional hook: prepare the sandbox workspace before the agent starts.
Examples:
- clone a repo and checkout a commit
- write fixture files (e.g. images) for external-tool demos
- pre-install dependencies
Default: no-op.
"""
_ = (item, trajectory_id, exec_tool)
return {}
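# Example override (sketch; the repo URL is illustrative, and ToolCall is the same
# tool-call type used by endless_terminals_env below):
#
#   async def setup_trajectory_workspace(self, item, *, trajectory_id, exec_tool):
#       await exec_tool(ToolCall(
#           name="terminal",
#           arguments={"command": "git clone https://example.com/repo.git /home/user/repo"},
#       ))
#       return {"repo_dir": "/home/user/repo"}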
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str,
exec_tool: Callable[["ToolCall"], Awaitable["ToolResult"]],
agent_result: Optional[AgentResult] = None,
workspace_meta: Optional[Dict[str, Any]] = None,
) -> tuple[float, Dict[str, Any]]:
"""
Optional hook: run in-sandbox verification before scoring.
Many agent envs need to execute verification inside the same trajectory
workspace (e.g. pytest) before releasing/resetting the slot.
Default: calls `score_trajectory()` and returns empty metadata.
"""
_ = (trajectory_id, exec_tool, agent_result, workspace_meta) # default ignores in-workspace verification
score = await self.score_trajectory(item, final_response)
return score, {}
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
return AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
tool_delay_s=self.config.agent_tool_delay_s,
)
async def setup(self) -> None:
print(f"[AgentEnv] setup(): starting tool backend ({self.config.tool_pool_mode})", flush=True)
await self._start_tool_backend()
print("[AgentEnv] setup(): configuring server concurrency", flush=True)
self._configure_server_concurrency()
print("[AgentEnv] setup(): running env-specific setup_agent_env()", flush=True)
await self.setup_agent_env()
print("[AgentEnv] setup(): done", flush=True)
def _configure_server_concurrency(self) -> None:
"""
Ensure the LLM server concurrency isn't accidentally capped below `group_size`.
In `BaseEnv process` mode, groups are collected concurrently and if the underlying
ServerManager/OpenAIServer semaphore is left at 1, we serialize inference even
when `--env.group_size` is > 1.
"""
desired = int(getattr(self.config, "group_size", 1) or 1)
if desired <= 1:
return
servers = getattr(self.server, "servers", None)
if not isinstance(servers, list) or not servers:
return
for s in servers:
sem = getattr(s, "sem", None)
eval_sem = getattr(s, "eval_sem", None)
# Only increase; never shrink.
if sem is not None and getattr(sem, "max_val", 0) < desired:
s.sem = AsyncSemWithAdaptiveWeight(desired)
if hasattr(s, "config") and hasattr(s.config, "num_max_requests_at_once"):
s.config.num_max_requests_at_once = desired
if eval_sem is not None and getattr(eval_sem, "max_val", 0) < desired:
s.eval_sem = AsyncSemWithAdaptiveWeight(desired)
if hasattr(s, "config") and hasattr(s.config, "num_requests_for_eval"):
s.config.num_requests_for_eval = desired
@abstractmethod
async def setup_agent_env(self) -> None:
"""Subclass hook for env-specific setup."""
async def evaluate(self, *args, **kwargs): # noqa: ARG002
"""
Default eval hook (no-op).
Atropos BaseEnv requires an `evaluate()` implementation. Many agent envs
won't have a meaningful evaluation path during early PoC work; they can
override this when needed.
"""
return {}
async def env_manager(self):
try:
return await super().env_manager()
finally:
await self.shutdown_tool_backend()
async def process_manager(self):
try:
return await super().process_manager()
finally:
await self.shutdown_tool_backend()
async def _start_tool_backend(self) -> None:
if self._tool_executor is not None:
return
tool_server_url = self.config.tool_server_url
tool_server_client = None
if tool_server_url == "inprocess":
import httpx
from ..api.tool_server import app as tool_server_app
await tool_server_app.router.startup()
tool_server_client = httpx.AsyncClient(
transport=httpx.ASGITransport(app=tool_server_app),
base_url="http://toolserver",
)
tool_server_url = "http://toolserver"
self._tool_server_inprocess = True
backend = create_tool_backend(self.config)
await backend.start()
executor = ToolExecutor(
backend=backend,
tools=self.tools,
config=ToolExecutorConfig(
batch_window_ms=self.config.tool_batch_window_ms,
max_batch_size=self.config.tool_max_batch_size,
allow_network=self.config.allow_network,
require_sandbox=self.config.require_sandbox,
require_stateful_sandbox=self.config.require_stateful_sandbox,
tool_server_url=tool_server_url,
tool_server_token=self.config.tool_server_token,
),
)
await executor.start()
if tool_server_client is not None:
executor._tool_server_client = tool_server_client # type: ignore[attr-defined]
self._backend = backend
self._tool_executor = executor
async def shutdown_tool_backend(self) -> None:
executor = self._tool_executor
backend = self._backend
inprocess_tool_server = self._tool_server_inprocess
self._tool_executor = None
self._backend = None
self._tool_server_inprocess = False
if executor is not None:
await executor.close()
if backend is not None:
await backend.stop(purge=bool(self.config.purge_job_on_shutdown))
if inprocess_tool_server:
from ..api.tool_server import app as tool_server_app
await tool_server_app.router.shutdown()
async def collect_trajectory(
self, item: Item
) -> Tuple[Optional[ScoredDataItem], List[Item]]:
if self._tool_executor is None:
raise RuntimeError("Tool backend not started")
trajectory_id = str(uuid.uuid4())
t0 = time.perf_counter()
print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} start", flush=True)
task = self.build_task(item)
agent_config = self.build_agent_config(item)
if os.getenv("ATROPOS_DEBUG_PRINT_TASK") == "1":
print(f"Starting trajectory {trajectory_id} with task: {task}", flush=True)
else:
# Avoid printing the full task prompt by default (can be huge/noisy).
one_line = " ".join(str(task).splitlines()).strip()
preview = one_line[:240] + ("…" if len(one_line) > 240 else "")
print(f"Starting trajectory {trajectory_id} (task preview): {preview}", flush=True)
async def _exec(call):
return await self._tool_executor.execute(trajectory_id, call)
agent = AtroposAgent(
server=self.server,
tokenizer=self.tokenizer,
tools=self.tools,
config=agent_config,
execute_tool=_exec,
)
try:
print(f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() start", flush=True)
workspace_meta = await self.setup_trajectory_workspace(item, trajectory_id=trajectory_id, exec_tool=_exec)
if not isinstance(workspace_meta, dict):
workspace_meta = {}
self._trajectory_workspace_meta[trajectory_id] = workspace_meta
print(
f"[AgentEnv] tid={trajectory_id} setup_trajectory_workspace() done in {time.perf_counter() - t0:.2f}s",
flush=True,
)
print(f"[AgentEnv] tid={trajectory_id} agent.run() start", flush=True)
result = await agent.run(task)
print(
f"[AgentEnv] tid={trajectory_id} agent.run() done in {time.perf_counter() - t0:.2f}s "
f"success={result.success} tool_calls={result.total_tool_calls}",
flush=True,
)
if not result.success or result.trajectory_data is None:
# Do not trigger BaseEnv retries for agent failures.
# Record the trajectory with score 0.0 so training/eval can see the failure mode.
messages = [{"role": "system", "content": agent._build_system_prompt()}] # noqa: SLF001
messages.append({"role": "user", "content": task})
for step in result.steps:
messages.append({"role": "assistant", "content": step.assistant_message})
if step.tool_results:
tool_text = "\n".join(r.to_xml() for r in step.tool_results)
messages.append({"role": "user", "content": tool_text})
scored: ScoredDataItem = {
"tokens": (result.trajectory_data.tokens if result.trajectory_data else []),
"masks": (result.trajectory_data.masked_tokens if result.trajectory_data else []),
"scores": 0.0,
}
if result.trajectory_data is not None:
scored["inference_logprobs"] = result.trajectory_data.logprobs # type: ignore[typeddict-unknown-key]
if getattr(result.trajectory_data, "metadata", None):
scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
if self.config.include_messages:
# Record a final failure marker as a user-side tool_response-like block so it survives templates.
import json
err = result.error or "agent_failed"
messages.append(
{
"role": "user",
"content": f"<tool_response>{json.dumps({'success': False, 'error': err})}</tool_response>",
}
)
scored["messages"] = messages
return scored, []
print(f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() start", flush=True)
score, score_metadata = await self.verify_and_score_trajectory(
item,
result.final_response,
trajectory_id=trajectory_id,
exec_tool=_exec,
agent_result=result,
workspace_meta=workspace_meta,
)
print(
f"[AgentEnv] tid={trajectory_id} verify_and_score_trajectory() done in {time.perf_counter() - t0:.2f}s "
f"score={score}",
flush=True,
)
messages = [{"role": "system", "content": agent._build_system_prompt()}] # noqa: SLF001
messages.append({"role": "user", "content": task})
for step in result.steps:
messages.append({"role": "assistant", "content": step.assistant_message})
if step.tool_results:
tool_text = "\n".join(r.to_xml() for r in step.tool_results)
messages.append({"role": "user", "content": tool_text})
# Optional: allow env verification to attach additional messages (e.g. install logs).
if self.config.include_messages and isinstance(score_metadata, dict):
extra = score_metadata.get("verification_messages")
if isinstance(extra, list):
for m in extra:
if isinstance(m, dict) and isinstance(m.get("role"), str) and isinstance(m.get("content"), str):
messages.append({"role": m["role"], "content": m["content"]})
scored: ScoredDataItem = {
"tokens": result.trajectory_data.tokens,
"masks": result.trajectory_data.masked_tokens,
"scores": score,
}
# Atroposlib expects policy logprobs at the *group* level under `inference_logprobs`.
# We stash per-item values here and lift them into the group in `collect_trajectories()`.
scored["inference_logprobs"] = result.trajectory_data.logprobs # type: ignore[typeddict-unknown-key]
if getattr(result.trajectory_data, "metadata", None):
scored["overrides"] = {"managed_metadata": result.trajectory_data.metadata}
if self.config.include_messages:
scored["messages"] = messages
return scored, []
finally:
self._trajectory_workspace_meta.pop(trajectory_id, None)
print(f"[AgentEnv] tid={trajectory_id} release_trajectory(reset_workspace=True)", flush=True)
await self._tool_executor.release_trajectory(trajectory_id, reset_workspace=True)
print(f"[AgentEnv] collect_trajectory(): tid={trajectory_id} done in {time.perf_counter() - t0:.2f}s", flush=True)
async def collect_trajectories(
self, item: Item
) -> Tuple[Optional[ScoredDataGroup], List[Item]]:
tasks = [self.collect_trajectory(item) for _ in range(self.config.group_size)]
results = await asyncio.gather(*tasks)
backlog: List[Item] = []
items: List[ScoredDataItem] = []
for scored, b in results:
backlog.extend(b)
if scored is not None:
items.append(scored)
if len(items) != self.config.group_size:
return None, backlog
group: ScoredDataGroup = ScoredDataGroup(
tokens=[],
masks=[],
scores=[],
advantages=[],
ref_logprobs=[],
messages=[] if self.config.include_messages else None,
inference_logprobs=[],
group_overrides={},
overrides=[],
images=[],
generation_params=None,
)
for it in items:
group["tokens"].append(it["tokens"])
group["masks"].append(it["masks"])
group["scores"].append(it["scores"])
# policy logprobs (for PPO/GRPO training) if present
lp = it.get("inference_logprobs") # type: ignore[typeddict-item]
if lp is not None:
group["inference_logprobs"].append(lp)
group["overrides"].append(it.get("overrides") or {}) # type: ignore[typeddict-item]
if group.get("messages") is not None and it.get("messages") is not None:
group["messages"].append(it["messages"])
return group, backlog
async def run_agent(self, task: str, *, trajectory_id: Optional[str] = None) -> Tuple[str, Dict[str, Any]]:
"""
Run the AtroposAgent on a single task and return (final_response, debug).
This is a helper intended for simple environments and tests.
"""
if self._tool_executor is None:
raise RuntimeError("Tool backend not started")
tid = trajectory_id or str(uuid.uuid4())
async def _exec(call):
return await self._tool_executor.execute(tid, call)
agent = AtroposAgent(
server=self.server,
tokenizer=self.tokenizer,
tools=self.tools,
config=AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
),
execute_tool=_exec,
)
result = await agent.run(task)
await self._tool_executor.release_trajectory(tid, reset_workspace=True)
return result.final_response, {"success": result.success, "error": result.error, "tool_calls": result.total_tool_calls}
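# A minimal concrete env, sketching the subclassing contract above (module path,
# item shape, and reward rule are illustrative, not part of this diff):
from atropos.envs.agent_env import AgentEnv, AgentEnvConfig

class EchoEnv(AgentEnv[AgentEnvConfig]):
    name = "echo_env"

    async def setup_agent_env(self) -> None:
        pass  # no env-specific setup needed

    async def get_next_item(self):
        # Item feeding build_task(); the shape is assumed for this sketch.
        return {"description": "Use the terminal to print 'hello'."}

    def build_task(self, item) -> str:
        return str(item["description"])

    async def score_trajectory(self, item, final_response: str) -> float:
        return 1.0 if "hello" in final_response.lower() else 0.0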


@@ -0,0 +1,873 @@
"""
Endless Terminals Environment for Hermes-Agent + Atropos RL.
Runs terminal tasks from the Endless Terminals dataset.
Supports three modes:
1. Local directory: tasks from a local folder of task_* dirs (default)
2. HuggingFace dataset: tasks from a HF dataset
3. Procedural: generate tasks on-the-fly via LLM (requires vLLM)
Each task provides a Dockerfile that defines the initial environment.
The agent solves the task using terminal commands inside a Docker container.
Scoring is done by running pytest on `test_final_state.py` in the container.
Run (standalone process mode):
python -m atropos.envs.endless_terminals_env process \
--env.use_wandb false \
--env.total_steps 100 \
--env.group_size 4
Run (Tinker serve mode):
# Terminal 1: run-api
# Terminal 2: python launch_training.py --config configs/endless_terminals.yaml
# Terminal 3:
TINKER_CONFIG=configs/endless_terminals.yaml \
ENDLESS_TERMINALS_DIR=/path/to/endless-terminals \
python -m atropos.envs.endless_terminals_env serve
"""
from __future__ import annotations
import asyncio
import base64
import json
import os
import random
import shutil
import subprocess
import sys
import tempfile
import uuid
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig
from ..backends.docker_direct_backend import (
DockerDirectBackend,
build_docker_image,
docker_image_exists,
)
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
# ---------------------------------------------------------------------------
# Tinker integration
# ---------------------------------------------------------------------------
# When TINKER_CONFIG is set, we load model/training params from the Tinker YAML.
# Custom env fields (ENDLESS_TERMINALS_DIR, etc.) are always read from env vars.
TINKER_CONFIG = os.getenv("TINKER_CONFIG", "")
def _load_tinker_config():
"""Load TinkerAtroposConfig if available, else return None."""
if not TINKER_CONFIG:
return None
config_path = Path(TINKER_CONFIG)
if not config_path.exists():
print(f"[EndlessTerminalsEnv] TINKER_CONFIG={TINKER_CONFIG} not found, ignoring", flush=True)
return None
try:
from tinker_atropos.config import TinkerAtroposConfig
config = TinkerAtroposConfig.from_yaml(config_path)
print(f"[EndlessTerminalsEnv] Loaded Tinker config from {config_path}", flush=True)
return config
except ImportError:
print("[EndlessTerminalsEnv] tinker_atropos not installed, ignoring TINKER_CONFIG", flush=True)
return None
except Exception as e:
print(f"[EndlessTerminalsEnv] Error loading Tinker config: {e}", flush=True)
return None
# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------
class EndlessTerminalsEnvConfig(AgentEnvConfig):
"""Configuration for Endless Terminals environment."""
# ---- Local directory mode (primary) ----
use_local_dir: bool = Field(
default=True,
description="Load tasks from a local directory of task_* folders.",
)
local_tasks_dir: str = Field(
default="",
description="Path to directory containing task_* folders. Required if use_local_dir=True.",
)
prebuild_images: bool = Field(
default=False,
description="Pre-build ALL Docker images during setup (slow but avoids build-during-training).",
)
max_concurrent_builds: int = Field(
default=4,
description="Max parallel Docker image builds during pre-build.",
)
# ---- HuggingFace dataset mode ----
use_dataset: bool = Field(
default=False,
description="Load tasks from HuggingFace dataset.",
)
dataset_name: str = Field(
default="obiwan96/endless-terminals-train",
description="HuggingFace dataset name (if use_dataset=True)",
)
dataset_split: str = Field(default="train")
dataset_cache_dir: str = Field(default="~/.cache/huggingface/datasets")
tasks_base_dir: str = Field(
default="",
description="Base directory containing task_* folders (for dataset mode path resolution).",
)
# ---- Procedural generation mode ----
task_gen_model: str = Field(default="Qwen/Qwen3-32B")
task_gen_temperature: float = Field(default=1.0)
task_gen_max_tokens: int = Field(default=2048)
# ---- Container / scoring ----
container_build_timeout_s: float = Field(default=600.0, description="Docker build timeout")
test_timeout_s: int = Field(default=120, description="Test execution timeout (seconds)")
keep_failed_tasks: bool = Field(default=False)
# ---- Agent defaults ----
agent_max_steps: int = Field(default=32)
agent_temperature: float = Field(default=0.7)
# ---- Docker image prefix ----
docker_image_prefix: str = Field(
default="endless-terminals",
description="Docker image name prefix for built task images.",
)
# ---- Server defaults ----
server_base_url: str = Field(default="http://127.0.0.1:8080")
server_model: str = Field(default="hermes-4-36b")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B")
# ---------------------------------------------------------------------------
# Env
# ---------------------------------------------------------------------------
class EndlessTerminalsEnv(AgentEnv[EndlessTerminalsEnvConfig]):
"""
Endless Terminals environment.
Each task:
1. Has a Dockerfile defining the initial container state
2. Has an instruction.md describing what the agent should do
3. Has tests/test_final_state.py to verify completion
Flow per trajectory:
1. get_next_item() → picks a task
2. setup_trajectory_workspace() → builds Docker image, registers with backend
3. Agent solves task via terminal commands (docker exec in the container)
4. verify_and_score_trajectory() → runs pytest in container, returns binary reward
"""
name = "endless_terminals_env"
env_config_cls = EndlessTerminalsEnvConfig
def __init__(
self,
config: EndlessTerminalsEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iteration = 0
# Local dir mode
self._local_tasks: List[Dict[str, Any]] = []
self._local_task_indices: List[int] = []
self._local_current_index = 0
# Eval split (held-out tasks)
self._eval_tasks: List[Dict[str, Any]] = []
# Training metrics
self._train_scores_buffer: List[float] = []
self._eval_metrics: List[tuple] = []
# HF dataset mode
self._dataset = None
self._dataset_indices: List[int] = []
self._dataset_current_index = 0
# Docker image cache: task_name -> image_tag
self._image_cache: Dict[str, str] = {}
self._build_lock = asyncio.Lock()
# ---- Config init (CLI) ----
@classmethod
def config_init(cls) -> Tuple[EndlessTerminalsEnvConfig, List[APIServerConfig]]:
"""
Initialize config.
Two modes:
1. Tinker mode: TINKER_CONFIG env var points to a Tinker YAML.
Model, training params, and server config come from the YAML.
2. Standalone mode: Everything from env vars (ATROPOS_SERVER_*, etc.)
In both modes, Endless Terminals-specific fields (ENDLESS_TERMINALS_DIR,
PREBUILD_IMAGES, etc.) are always read from env vars.
"""
tinker_cfg = _load_tinker_config()
# ── Endless Terminals-specific fields (always from env vars) ──
local_tasks_dir = os.getenv("ENDLESS_TERMINALS_DIR", "")
use_local_dir = bool(local_tasks_dir)
if tinker_cfg is not None:
# ── Tinker mode ─────────────────────────────────────────
print("[EndlessTerminalsEnv] Using Tinker config", flush=True)
env_config = EndlessTerminalsEnvConfig(
# Standard Atropos fields from Tinker YAML
tokenizer_name=tinker_cfg.base_model,
group_size=tinker_cfg.group_size,
use_wandb=tinker_cfg.use_wandb,
rollout_server_url=tinker_cfg.atropos_api_url,
total_steps=tinker_cfg.num_steps,
batch_size=tinker_cfg.batch_size,
steps_per_eval=tinker_cfg.steps_per_eval,
max_token_length=tinker_cfg.max_token_env_length,
max_num_workers=tinker_cfg.max_num_workers,
max_batches_offpolicy=tinker_cfg.max_batches_offpolicy,
ensure_scores_are_not_same=tinker_cfg.ensure_scores_are_not_same,
wandb_name=f"{tinker_cfg.wandb_run_name}-env",
include_messages=True,
# Tooling: terminal only
enabled_toolsets=["terminal"],
disabled_toolsets=[],
# Agent config
agent_max_steps=int(os.getenv("AGENT_MAX_STEPS", "32")),
agent_temperature=float(os.getenv("AGENT_TEMPERATURE", "0.7")),
# Docker-direct backend (no Nomad needed)
tool_pool_mode="docker_direct",
sandbox_image="ubuntu:22.04",
purge_job_on_start=False,
purge_job_on_shutdown=False,
# Endless Terminals fields
use_local_dir=use_local_dir,
local_tasks_dir=local_tasks_dir,
prebuild_images=os.getenv("PREBUILD_IMAGES", "false").lower() == "true",
use_dataset=os.getenv("USE_DATASET", "false").lower() == "true",
dataset_name=os.getenv("ENDLESS_DATASET", "obiwan96/endless-terminals-train"),
container_build_timeout_s=float(os.getenv("CONTAINER_BUILD_TIMEOUT", "600")),
test_timeout_s=int(os.getenv("TEST_TIMEOUT", "120")),
)
server_configs = [
APIServerConfig(
model_name=tinker_cfg.base_model,
base_url=tinker_cfg.inference_api_url + "/v1",
api_key="x",
server_type="sglang",
num_requests_for_eval=tinker_cfg.num_requests_for_eval,
timeout=600, # Longer timeout for multi-step agent trajectories
),
]
return env_config, server_configs
else:
# ── Standalone mode (env vars) ──────────────────────────
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = (
os.getenv("ATROPOS_SERVER_API_KEY")
or os.getenv("NOUS_API_KEY")
or os.getenv("OPENAI_API_KEY")
or "local"
)
env_config = EndlessTerminalsEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=int(os.getenv("ATROPOS_GROUP_SIZE", "4")),
use_wandb=os.getenv("USE_WANDB", "false").lower() == "true",
include_messages=True,
total_steps=int(os.getenv("ATROPOS_TOTAL_STEPS", "1000")),
batch_size=int(os.getenv("ATROPOS_BATCH_SIZE", "32")),
server_base_url=base_url,
server_model=model,
# Tooling
enabled_toolsets=["terminal"],
disabled_toolsets=[],
# Agent
agent_max_steps=int(os.getenv("AGENT_MAX_STEPS", "32")),
agent_temperature=float(os.getenv("AGENT_TEMPERATURE", "0.7")),
# Docker-direct backend
tool_pool_mode="docker_direct",
sandbox_image="ubuntu:22.04",
purge_job_on_start=False,
purge_job_on_shutdown=False,
# Endless Terminals fields
use_local_dir=use_local_dir,
local_tasks_dir=local_tasks_dir,
prebuild_images=os.getenv("PREBUILD_IMAGES", "false").lower() == "true",
use_dataset=os.getenv("USE_DATASET", "false").lower() == "true",
dataset_name=os.getenv("ENDLESS_DATASET", "obiwan96/endless-terminals-train"),
task_gen_model=os.getenv("TASK_GEN_MODEL", "Qwen/Qwen3-32B"),
container_build_timeout_s=float(os.getenv("CONTAINER_BUILD_TIMEOUT", "600")),
test_timeout_s=int(os.getenv("TEST_TIMEOUT", "120")),
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=int(os.getenv("MAX_CONCURRENT_REQUESTS", "4")),
num_requests_for_eval=int(os.getenv("MAX_EVAL_REQUESTS", "4")),
timeout=300,
)
]
return env_config, server_configs
# ---- Setup ----
async def setup_agent_env(self) -> None:
"""Env-specific setup: scan tasks and optionally pre-build images."""
if self.config.use_local_dir:
await self._setup_local_dir()
elif self.config.use_dataset:
await self._setup_hf_dataset()
else:
print("[EndlessTerminalsEnv] Using procedural task generation", flush=True)
async def _setup_local_dir(self) -> None:
"""Scan local directory for task_* folders."""
tasks_dir = Path(self.config.local_tasks_dir).expanduser().resolve()
if not tasks_dir.is_dir():
raise RuntimeError(f"local_tasks_dir does not exist: {tasks_dir}")
print(f"[EndlessTerminalsEnv] Scanning {tasks_dir} for tasks...", flush=True)
tasks = []
for entry in sorted(tasks_dir.iterdir()):
if not entry.is_dir() or not entry.name.startswith("task_"):
continue
# Validate required files
dockerfile = entry / "environment" / "Dockerfile"
instruction = entry / "instruction.md"
test_final = entry / "tests" / "test_final_state.py"
if not dockerfile.exists():
continue
if not instruction.exists():
continue
if not test_final.exists():
continue
# Read task metadata
task_json_path = entry / "environment" / "task.json"
description = instruction.read_text(encoding="utf-8").strip()
truth = ""
if task_json_path.exists():
try:
task_json = json.loads(task_json_path.read_text(encoding="utf-8"))
# Prefer instruction.md for the description; take only "truth" from task.json.
truth = task_json.get("truth", "")
except Exception:
pass
tasks.append({
"task_name": entry.name,
"task_dir": str(entry),
"dockerfile": str(dockerfile),
"description": description,
"truth": truth,
"test_final": str(test_final),
})
if not tasks:
raise RuntimeError(f"No valid task_* directories found in {tasks_dir}")
# Split into train and eval (hold out ~5% for eval, min 10, max 50)
random.shuffle(tasks)
eval_count = max(10, min(50, len(tasks) // 20))
eval_count = min(eval_count, len(tasks) // 2) # Never more than half
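# e.g. 200 tasks → max(10, min(50, 10)) = 10 eval / 190 train;
#      2000 tasks → max(10, min(50, 100)) = 50 eval / 1950 train.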
self._eval_tasks = tasks[:eval_count]
self._local_tasks = tasks[eval_count:]
self._local_task_indices = list(range(len(self._local_tasks)))
random.shuffle(self._local_task_indices)
self._local_current_index = 0
print(
f"[EndlessTerminalsEnv] Found {len(tasks)} valid tasks "
f"({len(self._local_tasks)} train, {len(self._eval_tasks)} eval)",
flush=True,
)
# Optionally pre-build all Docker images
if self.config.prebuild_images:
await self._prebuild_images()
async def _prebuild_images(self) -> None:
"""Pre-build Docker images for all tasks."""
print("[EndlessTerminalsEnv] Pre-building Docker images...", flush=True)
sem = asyncio.Semaphore(self.config.max_concurrent_builds)
built = 0
skipped = 0
failed = 0
async def _build_one(task: Dict[str, Any]) -> None:
nonlocal built, skipped, failed
image_tag = self._image_tag_for_task(task["task_name"])
if docker_image_exists(image_tag):
self._image_cache[task["task_name"]] = image_tag
skipped += 1
return
async with sem:
ok = await build_docker_image(
task["dockerfile"], image_tag,
timeout_s=self.config.container_build_timeout_s,
)
if ok:
self._image_cache[task["task_name"]] = image_tag
built += 1
else:
failed += 1
await asyncio.gather(*[_build_one(t) for t in self._local_tasks])
print(
f"[EndlessTerminalsEnv] Pre-build: {built} built, {skipped} cached, {failed} failed",
flush=True,
)
async def _setup_hf_dataset(self) -> None:
"""Load HuggingFace dataset."""
print(f"[EndlessTerminalsEnv] Loading dataset: {self.config.dataset_name}", flush=True)
try:
from datasets import load_dataset
loop = asyncio.get_running_loop()
self._dataset = await loop.run_in_executor(
None,
lambda: load_dataset(
self.config.dataset_name,
split=self.config.dataset_split,
cache_dir=os.path.expanduser(self.config.dataset_cache_dir),
),
)
self._dataset_indices = list(range(len(self._dataset)))
random.shuffle(self._dataset_indices)
self._dataset_current_index = 0
print(f"[EndlessTerminalsEnv] Loaded {len(self._dataset)} tasks from dataset", flush=True)
except Exception as e:
print(f"[EndlessTerminalsEnv] ERROR loading dataset: {e}", flush=True)
raise
# ---- Image helpers ----
def _image_tag_for_task(self, task_name: str) -> str:
return f"{self.config.docker_image_prefix}:{task_name}"
async def _ensure_image(self, task: Dict[str, Any]) -> str:
"""Ensure the Docker image for a task is built. Returns image tag."""
task_name = task["task_name"]
image_tag = self._image_tag_for_task(task_name)
# Fast path: already cached
if task_name in self._image_cache:
return self._image_cache[task_name]
async with self._build_lock:
# Double-check after acquiring lock
if task_name in self._image_cache:
return self._image_cache[task_name]
# Check if image exists in Docker
if docker_image_exists(image_tag):
self._image_cache[task_name] = image_tag
return image_tag
# Build it
print(f"[EndlessTerminalsEnv] Building image {image_tag}...", flush=True)
ok = await build_docker_image(
task["dockerfile"], image_tag,
timeout_s=self.config.container_build_timeout_s,
)
if not ok:
raise RuntimeError(f"Failed to build Docker image for {task_name}")
self._image_cache[task_name] = image_tag
return image_tag
# ---- Item generation ----
async def get_next_item(self) -> Item:
self._iteration += 1
if self.config.use_local_dir and self._local_tasks:
return self._get_next_local_item()
elif self.config.use_dataset and self._dataset is not None:
return self._get_next_dataset_item()
else:
return self._get_fallback_item()
def _get_next_local_item(self) -> Item:
"""Pick the next task from local directories."""
idx = self._local_task_indices[self._local_current_index]
task = self._local_tasks[idx]
self._local_current_index += 1
if self._local_current_index >= len(self._local_task_indices):
random.shuffle(self._local_task_indices)
self._local_current_index = 0
print("[EndlessTerminalsEnv] Reshuffled local tasks (epoch complete)", flush=True)
return {
"task_id": f"local_{self._iteration:06d}_{task['task_name']}",
"task_name": task["task_name"],
"description": task["description"],
"truth": task.get("truth", ""),
"task_dir": task["task_dir"],
"dockerfile": task["dockerfile"],
"test_final": task["test_final"],
"from_local_dir": True,
}
def _get_next_dataset_item(self) -> Item:
"""Pick the next task from HuggingFace dataset."""
idx = self._dataset_indices[self._dataset_current_index]
task = self._dataset[idx]
self._dataset_current_index += 1
if self._dataset_current_index >= len(self._dataset_indices):
random.shuffle(self._dataset_indices)
self._dataset_current_index = 0
print("[EndlessTerminalsEnv] Reshuffled dataset (epoch complete)", flush=True)
# Resolve task directory
task_dir = task.get("extra_info", {}).get("task_dir") or task.get("reward_spec", {}).get("ground_truth", "")
if self.config.tasks_base_dir:
task_name = Path(task_dir).name
task_dir = str(Path(self.config.tasks_base_dir) / task_name)
task_dir_path = Path(task_dir)
return {
"task_id": f"dataset_{self._iteration:06d}_{task_dir_path.name}",
"task_name": task_dir_path.name,
"description": task.get("description", ""),
"task_dir": task_dir,
"dockerfile": str(task_dir_path / "environment" / "Dockerfile"),
"test_final": str(task_dir_path / "tests" / "test_final_state.py"),
"from_dataset": True,
}
def _get_fallback_item(self) -> Item:
return {
"task_id": f"fallback_{self._iteration:06d}",
"task_name": "fallback",
"description": (
"Create a file named 'hello.txt' in /home/user/ containing "
"the text 'Hello, World!' on a single line."
),
"task_dir": "",
"dockerfile": "",
"test_final": "",
}
# ---- AgentEnv hooks ----
def build_task(self, item: Item) -> str:
"""Return the task prompt for the agent."""
return str(item.get("description", ""))
def build_agent_config(self, item: Item) -> AgentConfig:
return AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
tool_delay_s=self.config.agent_tool_delay_s,
)
async def setup_trajectory_workspace(
self,
item: Item,
*,
trajectory_id: str,
exec_tool,
) -> Dict[str, Any]:
"""
Build the Docker image for this task and register it with the backend.
The DockerDirectBackend will start a container from this image when the
agent makes its first tool call (lazy acquisition via ToolExecutor).
"""
task_name = item.get("task_name", "unknown")
dockerfile = item.get("dockerfile", "")
if not dockerfile or not Path(dockerfile).exists():
print(f"[EndlessTerminalsEnv] WARNING: No Dockerfile for {task_name}", flush=True)
return {"image": "ubuntu:22.04"}
# Build/get Docker image
image_tag = await self._ensure_image({
"task_name": task_name,
"dockerfile": dockerfile,
})
# Register image with the DockerDirect backend
if isinstance(self._backend, DockerDirectBackend):
self._backend.register_image(trajectory_id, image_tag)
return {"image": image_tag, "task_name": task_name}
async def score_trajectory(self, item: Item, final_response: str) -> float:
"""Not used — scoring happens in verify_and_score_trajectory."""
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str,
exec_tool,
agent_result=None,
workspace_meta=None,
) -> tuple[float, Dict[str, Any]]:
"""
Run test_final_state.py inside the container and return binary reward.
"""
task_id = item.get("task_id", "unknown")
test_final = item.get("test_final", "")
if not test_final or not Path(test_final).exists():
print(f"[EndlessTerminalsEnv] No test file for {task_id}", flush=True)
return 0.0, {"error": "No test file"}
print(f"[EndlessTerminalsEnv] Scoring {task_id}...", flush=True)
try:
# Read the test file and base64-encode it for safe transfer
test_content = Path(test_final).read_text(encoding="utf-8")
encoded = base64.b64encode(test_content.encode("utf-8")).decode("ascii")
# Write the test file into the container and run pytest.
# We write to /tmp to avoid interfering with the agent's workspace.
# printf + `base64 -d` sidesteps quoting issues: base64 output never contains single quotes.
verify_cmd = (
f"printf '%s' '{encoded}' | base64 -d > /tmp/_test_final_state.py && "
f"cd /home/user && "
f"python3 -m pytest /tmp/_test_final_state.py -v --tb=short 2>&1; "
f"echo \"EXIT_CODE=$?\""
)
result = await exec_tool(ToolCall(
name="terminal",
arguments={"command": verify_cmd},
))
output = result.output if hasattr(result, "output") else str(result)
# Check if pytest passed
# Look for EXIT_CODE=0 at the end (most reliable)
success = "EXIT_CODE=0" in output
score = 1.0 if success else 0.0
metadata = {
"task_id": task_id,
"success": success,
"test_output": output[-2000:] if len(output) > 2000 else output,
"total_tool_calls": agent_result.total_tool_calls if agent_result else 0,
}
self._train_scores_buffer.append(score)
print(f"[EndlessTerminalsEnv] {task_id} → score={score}", flush=True)
return score, metadata
except Exception as e:
print(f"[EndlessTerminalsEnv] Error scoring {task_id}: {e}", flush=True)
return 0.0, {"error": str(e)}
# ---- WandB logging ----
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log training metrics to wandb."""
if wandb_metrics is None:
wandb_metrics = {}
# Training pass rate since last log
if self._train_scores_buffer:
wandb_metrics["train/percent_correct"] = (
sum(self._train_scores_buffer) / len(self._train_scores_buffer)
)
wandb_metrics["train/num_trajectories"] = len(self._train_scores_buffer)
self._train_scores_buffer = []
# Eval metrics (populated by evaluate())
for key, value in self._eval_metrics:
wandb_metrics[key] = value
self._eval_metrics = []
await super().wandb_log(wandb_metrics)
# ---- Evaluation ----
async def evaluate(self, *args, **kwargs):
"""
Run the agent on held-out eval tasks and report pass rate.
Each eval task: build Docker container → run agent (temp=0) → pytest → score.
This is expensive (full agent trajectories), so we only eval a subset.
"""
import time as _time
if not self._eval_tasks:
return {}
start_time = _time.time()
eval_sample_size = min(len(self._eval_tasks), 20)
eval_subset = random.sample(self._eval_tasks, eval_sample_size)
print(
f"[EndlessTerminalsEnv] Running evaluation on {eval_sample_size} tasks...",
flush=True,
)
scores = []
samples = []
for task_info in eval_subset:
task_name = task_info["task_name"]
description = task_info["description"]
eval_tid = f"eval_{uuid.uuid4().hex[:8]}"
try:
# Build Docker image
image_tag = await self._ensure_image(task_info)
# Register image with backend
if isinstance(self._backend, DockerDirectBackend):
self._backend.register_image(eval_tid, image_tag)
async def _exec(call, _tid=eval_tid):
return await self._tool_executor.execute(_tid, call)
from ..agent import AtroposAgent as _AtroposAgent
agent = _AtroposAgent(
server=self.server,
tokenizer=self.tokenizer,
tools=self.tools,
config=AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=0.0, # Deterministic for eval
max_tokens=self.config.agent_max_tokens,
),
execute_tool=_exec,
)
result = await agent.run(description)
# Score: run pytest in the container
score = 0.0
test_final = task_info.get("test_final", "")
if result.success and test_final and Path(test_final).exists():
test_content = Path(test_final).read_text(encoding="utf-8")
encoded = base64.b64encode(test_content.encode("utf-8")).decode("ascii")
verify_cmd = (
f"printf '%s' '{encoded}' | base64 -d > /tmp/_test_final_state.py && "
f"cd /home/user && "
f"python3 -m pytest /tmp/_test_final_state.py -v --tb=short 2>&1; "
f'echo "EXIT_CODE=$?"'
)
test_result = await _exec(ToolCall(
name="terminal",
arguments={"command": verify_cmd},
))
test_output = test_result.output if hasattr(test_result, "output") else ""
if "EXIT_CODE=0" in test_output:
score = 1.0
scores.append(score)
samples.append({
"task": task_name,
"score": score,
"tool_calls": result.total_tool_calls,
"success": result.success,
})
# Cleanup
await self._tool_executor.release_trajectory(eval_tid, reset_workspace=True)
print(f" [eval] {task_name}{score}", flush=True)
except Exception as e:
print(f"  [eval] {task_name} → ERROR: {e}", flush=True)
scores.append(0.0)
samples.append({"task": task_name, "score": 0.0, "error": str(e)})
# Best-effort cleanup so a failed eval task doesn't leak its container slot.
try:
await self._tool_executor.release_trajectory(eval_tid, reset_workspace=True)
except Exception:
pass
end_time = _time.time()
percent_correct = sum(scores) / len(scores) if scores else 0.0
print(
f"[EndlessTerminalsEnv] Eval: {percent_correct:.1%} pass rate "
f"({sum(scores):.0f}/{len(scores)}) in {end_time - start_time:.0f}s",
flush=True,
)
# Store for wandb_log to pick up
self._eval_metrics.append(("eval/percent_correct", percent_correct))
self._eval_metrics.append(("eval/num_tasks", len(scores)))
self._eval_metrics.append(("eval/duration_s", end_time - start_time))
# Log via atroposlib
eval_metrics = {
"eval/percent_correct": percent_correct,
"eval/num_tasks": len(scores),
}
await self.evaluate_log(
metrics=eval_metrics,
samples=samples,
start_time=start_time,
end_time=end_time,
generation_parameters={
"temperature": 0.0,
"max_tokens": self.config.agent_max_tokens,
},
)
if __name__ == "__main__":
EndlessTerminalsEnv.cli()


@@ -0,0 +1,171 @@
"""
Hermes-Agent + Atropos (Nomad sandbox) compatibility smoke environment.
This environment is intended to validate, end-to-end:
BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server
It forces the model to use a sandbox tool by asking it to run a command that
generates a high-entropy token inside the sandbox, then repeat it exactly.
Run (process mode):
uv run python -m atropos.envs.hermes_compat_test_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
"""
from __future__ import annotations
import os
from typing import Any, Dict, List, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig, AgentResult
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
def _forced_tool_item() -> Item:
# Use double quotes in the shell command and show JSON escaping explicitly.
# This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
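# Illustrative: {"command": "echo \"hi\""} is valid JSON, while
# {"command": "echo \'hi\'"} is not, since JSON defines \" but no \' escape.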
cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
return {
"command": cmd,
"prompt": (
"You are acting as an agent inside a sandboxed environment.\n"
"You MUST use the terminal tool to execute commands.\n"
"Run this exact command:\n"
f"{cmd}\n"
"When you call the tool, use valid JSON inside <tool_call>. Example:\n"
'<tool_call>{"name": "terminal", "arguments": {"command": '
'"python -c \\\\"import secrets; print(secrets.token_hex(16))\\\\""}}'
"</tool_call>\n"
"Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
"Do not guess. Do not explain."
),
}
class HermesCompatTestEnvConfig(AgentEnvConfig):
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible chat server (without /v1).",
)
server_model: str = Field(default="hermes-4-36b", description="Model name")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class HermesCompatTestEnv(AgentEnv[HermesCompatTestEnvConfig]):
name = "hermes_compat_test_env"
env_config_cls = HermesCompatTestEnvConfig
def __init__(
self,
config: HermesCompatTestEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iter = 0
@classmethod
def config_init(cls) -> Tuple[HermesCompatTestEnvConfig, List[APIServerConfig]]:
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = HermesCompatTestEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
include_messages=True,
ensure_scores_are_not_same=False,
total_steps=2,
batch_size=1,
server_base_url=base_url,
server_model=model,
# Tooling: sandbox-only terminal.
enabled_toolsets=["terminal"],
disabled_toolsets=[],
# Default to Nomad sandboxing; users can override via --env.* args.
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
# In local dev it's common for a previous crash to leave the job in backoff.
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=120,
)
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
return None
async def get_next_item(self) -> Item:
self._iter += 1
return _forced_tool_item()
def build_task(self, item: Item) -> str:
return str(item.get("prompt") or "")
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
# Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
return AgentConfig(
max_steps=min(8, int(self.config.agent_max_steps)),
temperature=0.2,
max_tokens=None,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
# Scoring happens in verify_and_score_trajectory so we can inspect tool results.
_ = (item, final_response)
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str, # noqa: ARG002
exec_tool, # noqa: ARG002
agent_result: AgentResult | None = None,
workspace_meta: Dict[str, Any] | None = None, # noqa: ARG002
) -> tuple[float, Dict[str, Any]]:
if agent_result is None:
return 0.0, {"error": "Missing agent_result"}
observed: str = ""
tool_ok = False
for step in agent_result.steps:
for res in step.tool_results:
if not res.success:
return 0.0, {"error": res.error, "output": res.output}
out = (res.output or "").strip()
if out:
observed = out.splitlines()[-1].strip()
tool_ok = True
final = (final_response or "").strip()
score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}
if __name__ == "__main__":
HermesCompatTestEnv.cli()


@@ -0,0 +1,172 @@
"""
Nomad sandbox terminal smoke environment (training-oriented).
Validates, end-to-end:
BaseEnv.process -> AgentEnv -> ToolExecutor (batched) -> Nomad SlotPool -> sandbox_server
It forces the model to use a sandbox tool by asking it to run a command that
generates a high-entropy token inside the sandbox, then repeat it exactly.
Run (process mode):
uv run python -m atropos.envs.sandbox_terminal_smoke_env process --env.use_wandb false --env.total_steps 2 --env.group_size 1
"""
from __future__ import annotations
import os
from typing import Any, Dict, List, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig, AgentResult
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
STRICT_TOOLCALL_SYSTEM_PROMPT = None  # None falls back to the agent's default system prompt
def _forced_tool_item() -> Item:
# Use double quotes in the shell command and show JSON escaping explicitly.
# This avoids invalid JSON escapes like `\\'` (not valid JSON) that some models produce.
cmd = 'python -c "import secrets; print(secrets.token_hex(16))"'
return {
"command": cmd,
"prompt": (
"You MUST use the terminal tool.\n"
"Run this exact command:\n"
f"{cmd}\n"
"When you call the tool, use valid JSON inside <tool_call>. Example:\n"
'<tool_call>{"name": "terminal", "arguments": {"command": '
'"python -c \\\\"import secrets; print(secrets.token_hex(16))\\\\""}}'
"</tool_call>\n"
"Then respond with EXACTLY what it printed (the hex token) and nothing else.\n"
"Do not guess. Do not explain."
),
}
class SandboxTerminalSmokeEnvConfig(AgentEnvConfig):
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible chat server (without /v1).",
)
server_model: str = Field(default="hermes-4-36b", description="Model name")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class SandboxTerminalSmokeEnv(AgentEnv[SandboxTerminalSmokeEnvConfig]):
name = "sandbox_terminal_smoke_env"
env_config_cls = SandboxTerminalSmokeEnvConfig
def __init__(
self,
config: SandboxTerminalSmokeEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iter = 0
@classmethod
def config_init(cls) -> Tuple[SandboxTerminalSmokeEnvConfig, List[APIServerConfig]]:
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = SandboxTerminalSmokeEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
include_messages=True,
ensure_scores_are_not_same=False,
total_steps=2,
batch_size=1,
server_base_url=base_url,
server_model=model,
# Tooling: sandbox-only terminal.
enabled_toolsets=["terminal"],
disabled_toolsets=[],
# Default to Nomad sandboxing; users can override via --env.* args.
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=120,
)
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
return None
async def get_next_item(self) -> Item:
self._iter += 1
return _forced_tool_item()
def build_task(self, item: Item) -> str:
return str(item.get("prompt") or "")
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
# Avoid imposing max_tokens by default; tool-tag responses can be long for some models.
return AgentConfig(
max_steps=min(8, int(self.config.agent_max_steps)),
temperature=0.2,
max_tokens=None,
system_prompt=STRICT_TOOLCALL_SYSTEM_PROMPT,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
# Scoring happens in verify_and_score_trajectory so we can inspect tool results.
_ = (item, final_response)
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str, # noqa: ARG002
exec_tool, # noqa: ARG002
agent_result: AgentResult | None = None,
workspace_meta: Dict[str, Any] | None = None, # noqa: ARG002
) -> tuple[float, Dict[str, Any]]:
if agent_result is None:
return 0.0, {"error": "Missing agent_result"}
observed: str = ""
tool_ok = False
for step in agent_result.steps:
for res in step.tool_results:
if not res.success:
return 0.0, {"error": res.error, "output": res.output}
out = (res.output or "").strip()
if out:
observed = out.splitlines()[-1].strip()
tool_ok = True
final = (final_response or "").strip()
score = 1.0 if tool_ok and agent_result.total_tool_calls > 0 and observed and final == observed else 0.0
return score, {"observed": observed, "tool_calls": agent_result.total_tool_calls, "command": item.get("command")}
if __name__ == "__main__":
SandboxTerminalSmokeEnv.cli()


@@ -0,0 +1,418 @@
"""
SWE-smith-oracle environment.
This environment is intentionally minimal:
- prepares a sandbox workspace by cloning a public GitHub repo at `base_commit`
- runs an AtroposAgent tool loop to apply a fix
- verifies by running pytest nodeids from the dataset (reward = pass/fail)
- Python only (no multi-language support currently, need to properly build & add to dropbox)
- TODO: Get the other nonpython sandboxes up and running, then add a config knob to switch between them per row
- oh and add to dockerhub
Dataset: NousResearch/SWE-smith-oracle (train; does NOT use SWE-bench eval set).
"""
from __future__ import annotations
import os
import random
import time
from typing import Any, Dict, List, Optional, Tuple
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig
from ..tools import ToolCall
from .agent_env import AgentEnv, AgentEnvConfig
class SweSmithOracleEnvConfig(AgentEnvConfig):
dataset_name: str = Field(default="NousResearch/SWE-smith-oracle")
dataset_split: str = Field(default="train")
max_items: int = Field(default=0, description="0 = no limit")
shuffle: bool = Field(default=True)
seed: int = Field(default=0)
python_only: bool = Field(default=True, description="Filter to Python-evaluable rows")
score_include_fail_to_pass: bool = Field(
default=True,
description=(
"If true (default), score tests on PASS_TO_PASS FAIL_TO_PASS. "
"Disable to only run PASS_TO_PASS (faster but weaker signal)."
),
)
prompt_mode: str = Field(
default="problem_statement",
description="Task prompt content: 'problem_statement' (fast) or 'problem_statement+text' (slower, includes dataset 'text').",
)
repo_base_url: str = Field(default="https://github.com", description="Base URL for repo cloning")
install_timeout_s: float = Field(default=600.0)
test_timeout_s: float = Field(default=600.0)
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class SweSmithOracleEnv(AgentEnv[SweSmithOracleEnvConfig]):
"""
SWE-smith-oracle AgentEnv.
This is designed for benchmarking multiplexed slot execution vs naive container-per-trajectory.
"""
name = "swe_smith_oracle_env"
env_config_cls = SweSmithOracleEnvConfig
def __init__(
self,
config: SweSmithOracleEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._dataset = None
self._indices: List[int] = []
self._cursor = 0
@classmethod
def config_init(cls) -> Tuple[SweSmithOracleEnvConfig, List[APIServerConfig]]:
# Defaults for running the env via CLI in offline `process` mode.
# Override via env vars or `--env.*` flags as needed.
base_url_raw = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
base_url = base_url_raw.rstrip("/")
if not base_url.endswith("/v1"):
base_url = f"{base_url}/v1"
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = SweSmithOracleEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
rollout_server_url="http://localhost:8000",
total_steps=1,
batch_size=1,
steps_per_eval=1,
max_token_length=8192,
inference_weight=1.0,
wandb_name="swe_smith_oracle",
enabled_toolsets=["terminal"],
disabled_toolsets=[],
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=base_url,
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=int(os.getenv("ATROPOS_SERVER_TIMEOUT_S") or "300"),
),
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
from datasets import load_dataset
t0 = time.perf_counter()
print(
f"[SweSmithOracleEnv] loading dataset {self.config.dataset_name}:{self.config.dataset_split} "
f"(python_only={self.config.python_only}, max_items={self.config.max_items or 'all'})",
flush=True,
)
ds = load_dataset(self.config.dataset_name, split=self.config.dataset_split)
self._dataset = ds
indices: List[int] = []
for idx in range(len(ds)):
row = ds[idx]
if self.config.python_only and not self._is_python_row(row):
continue
indices.append(idx)
if self.config.shuffle:
rnd = random.Random(self.config.seed)
rnd.shuffle(indices)
if self.config.max_items and self.config.max_items > 0:
indices = indices[: self.config.max_items]
self._indices = indices
self._cursor = 0
print(
f"[SweSmithOracleEnv] loaded {len(self._indices)} items from {self.config.dataset_name}:{self.config.dataset_split} "
f"in {time.perf_counter() - t0:.2f}s",
flush=True,
)
def _is_python_row(self, row: Dict[str, Any]) -> bool:
nodeids = row.get("PASS_TO_PASS")
if not isinstance(nodeids, list) or not nodeids:
return False
for nid in nodeids:
if not isinstance(nid, str) or ".py::" not in nid:
return False
return True
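# Illustrative nodeids (hypothetical paths):
#   "tests/test_core.py::TestFoo::test_bar"  -> contains ".py::", row kept
#   "pkg/foo_test.go::TestBar"               -> no ".py::", row filtered out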
async def get_next_item(self) -> Item:
print(f"[SweSmithOracleEnv] get_next_item() cursor={self._cursor}/{len(self._indices)}", flush=True)
if not self._dataset or not self._indices:
raise RuntimeError("Dataset not initialized (did setup() run?)")
if self._cursor >= len(self._indices):
self._cursor = 0
idx = self._indices[self._cursor]
self._cursor += 1
return dict(self._dataset[idx])
def _repo_name(self, item: Item) -> str:
repo = item.get("repo") or ""
if isinstance(repo, str) and "/" in repo:
return repo.split("/")[-1]
return "repo"
def build_task(self, item: Item) -> str:
repo = item.get("repo") or ""
base_commit = item.get("base_commit") or ""
problem = str(item.get("problem_statement") or "")
context = str(item.get("text") or "")
nodeids = self._tests_for_item(item)
tests_list = "\n".join(f"- {t}" for t in nodeids)
repo_dir = self._repo_name(item)
tests_block = (
"Run these tests to verify:\n"
f"{tests_list}\n\n"
"When done, briefly describe what you changed and confirm tests pass."
)
prompt_mode = (self.config.prompt_mode or "problem_statement").strip().lower()
if prompt_mode not in {"problem_statement", "problem_statement+text"}:
raise ValueError(
f"Invalid prompt_mode={self.config.prompt_mode!r}. "
"Expected 'problem_statement' or 'problem_statement+text'."
)
context_block = ""
if prompt_mode == "problem_statement+text" and context:
# Note: We intentionally do NOT truncate/cap here. This mode is for debugging / richer prompts and can be slow.
context_block = f"\nAdditional context:\n{context}\n"
return (
"You are a senior software engineer. Fix the repository so the specified tests pass.\n\n"
f"Repository: {repo} (checked out at base_commit={base_commit})\n"
f"Workspace path: ./{repo_dir}\n\n"
"Constraints:\n"
"- You MUST use the terminal tool to inspect, edit, and verify the repository. Do not respond with a patch file.\n"
f"- Start by inspecting the repo (e.g. `ls`, `cd ./{repo_dir}`, `git status`).\n"
"- Use a workspace-local virtualenv (e.g. inside the repo at ./.venv) to avoid cross-run contamination.\n"
"- Use non-interactive commands only.\n\n"
"- Terminal commands run under POSIX /bin/sh and each tool call runs in a fresh shell (no persisted env vars).\n"
" Avoid bash-only `source`; prefer `. .venv/bin/activate` or `.venv/bin/python ...`.\n\n"
"Problem statement:\n"
f"{problem}\n\n"
f"{context_block}\n"
f"{tests_block}"
)
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
# SWE tasks are longer than the simple test env.
return AgentConfig(
max_steps=self.config.agent_max_steps,
temperature=self.config.agent_temperature,
max_tokens=self.config.agent_max_tokens,
tool_delay_s=self.config.agent_tool_delay_s,
)
async def setup_trajectory_workspace(self, item: Item, *, trajectory_id: str, exec_tool) -> Dict[str, Any]:
t0 = time.perf_counter()
repo = item.get("repo")
base_commit = item.get("base_commit")
instance_id = item.get("instance_id") or item.get("id") or item.get("problem_id")
if not isinstance(repo, str) or not isinstance(base_commit, str):
raise RuntimeError("Invalid dataset row: missing repo/base_commit")
repo_dir = self._repo_name(item)
clone_url = f"{self.config.repo_base_url.rstrip('/')}/{repo}.git"
print(
f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): "
f"repo={repo} base_commit={base_commit} instance_id={instance_id} dir=./{repo_dir}",
flush=True,
)
# Repo setup strategy:
# - Maintain a shared, per-container bare repo cache under /data/repo_cache
# - For each trajectory, create an isolated git worktree under the slot workspace
# This avoids cloning/fetching full repos per trajectory and is crucial for multiplexing.
def _repo_cache_slug(repo_name: str) -> str:
return repo_name.replace("/", "__")
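# e.g. _repo_cache_slug("octo-org/widgets") -> "octo-org__widgets" (illustrative repo name)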
repo_slug = _repo_cache_slug(repo)
cache_root = "/data/repo_cache"
bare_repo = f"{cache_root}/{repo_slug}.git"
lock_file = f"{cache_root}/.locks/{repo_slug}.lock"
# Use flock to serialize operations that mutate the shared bare repo (fetch/worktree).
# util-linux (flock) is included in the sandbox image.
worktree_cmd = (
"set -e; "
f"rm -rf {repo_dir}; "
f"mkdir -p {cache_root}/.locks; "
f": > {lock_file}; "
f"flock -x {lock_file} sh -lc '"
f"set -e; "
"export GIT_TERMINAL_PROMPT=0; "
"export GIT_LFS_SKIP_SMUDGE=1; "
f"if [ ! -d \"{bare_repo}\" ]; then "
f" git init --bare \"{bare_repo}\"; "
f" git -C \"{bare_repo}\" remote add origin \"{clone_url}\"; "
"fi; "
f"git -C \"{bare_repo}\" remote set-url origin \"{clone_url}\"; "
f"git -C \"{bare_repo}\" worktree prune || true; "
f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
f" git -C \"{bare_repo}\" fetch --depth 1 origin \"{base_commit}\" || true; "
"fi; "
f"if ! git -C \"{bare_repo}\" cat-file -e \"{base_commit}^{{commit}}\" 2>/dev/null; then "
f" git -C \"{bare_repo}\" fetch --prune origin; "
"fi; "
f"git --git-dir=\"{bare_repo}\" worktree add --detach \"{repo_dir}\" \"{base_commit}\"; "
"'"
)
print(f"[SweSmithOracleEnv] tid={trajectory_id} preparing worktree from repo cache", flush=True)
res = await exec_tool(
ToolCall(
name="terminal",
arguments={"command": worktree_cmd, "timeout": self.config.install_timeout_s},
)
)
if not res.success:
raise RuntimeError(
"git worktree setup failed "
f"(repo={repo}, base_commit={base_commit}, instance_id={instance_id}): {res.error}\n{res.output}"
)
print(
f"[SweSmithOracleEnv] tid={trajectory_id} setup_trajectory_workspace(): worktree ready in {time.perf_counter() - t0:.2f}s",
flush=True,
)
return {"repo_dir": repo_dir, "base_commit": base_commit}
def _tests_for_item(self, item: Item) -> List[str]:
tests: List[str] = []
if self.config.score_include_fail_to_pass:
for key in ("PASS_TO_PASS", "FAIL_TO_PASS"):
nodeids = item.get(key)
if isinstance(nodeids, list):
tests.extend([n for n in nodeids if isinstance(n, str)])
else:
nodeids = item.get("PASS_TO_PASS")
if isinstance(nodeids, list):
tests.extend([n for n in nodeids if isinstance(n, str)])
# Stable order for reproducibility.
return sorted(dict.fromkeys(tests))
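# Illustrative: sorted(dict.fromkeys(["b.py::t2", "a.py::t1", "b.py::t2"]))
# == ["a.py::t1", "b.py::t2"]; dict.fromkeys dedupes, sorted fixes the order.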
def _chunk_nodeids(self, nodeids: List[str], max_per_chunk: int = 50) -> List[List[str]]:
chunks: List[List[str]] = []
for i in range(0, len(nodeids), max_per_chunk):
chunks.append(nodeids[i : i + max_per_chunk])
return chunks
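# Illustrative: 120 nodeids with max_per_chunk=50 -> chunks of 50, 50, and 20,
# keeping each pytest command line comfortably short.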
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str, # noqa: ARG002
*,
trajectory_id: str,
exec_tool,
agent_result=None,
workspace_meta: Optional[Dict[str, Any]] = None,
) -> tuple[float, Dict[str, Any]]:
repo_dir = self._repo_name(item)
# Training correctness: do not reward trajectories that never actually used tools.
if agent_result is not None and getattr(agent_result, "total_tool_calls", 0) <= 0:
print(
f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): no tool calls; score=0.0",
flush=True,
)
return 0.0, {
"verification_mode": "dataset_tests",
"error": "No tool calls were made by the agent",
}
nodeids = self._tests_for_item(item)
if not nodeids:
return 0.0, {"error": "No tests provided"}
print(f"[SweSmithOracleEnv] tid={trajectory_id} verify (dataset_tests): ensuring venv + deps", flush=True)
setup_cmd = (
f"cd {repo_dir} && "
"python -m venv .venv && "
". .venv/bin/activate && "
"python -m pip install -U pip setuptools wheel && "
"python -m pip install -e . && "
"python -m pip install pytest"
)
setup_res = await exec_tool(
ToolCall(name="terminal", arguments={"command": setup_cmd, "timeout": self.config.install_timeout_s})
)
verification_messages = [{"role": "user", "content": setup_res.to_xml()}]
if not setup_res.success:
return 0.0, {
"verification_mode": "dataset_tests",
"phase": "install",
"error": setup_res.error,
"output": setup_res.output,
"verification_messages": verification_messages,
}
chunks = self._chunk_nodeids(nodeids, max_per_chunk=50)
for chunk_idx, chunk in enumerate(chunks):
joined = " ".join(chunk)
cmd = f"cd {repo_dir} && . .venv/bin/activate && python -m pytest -q {joined}"
res = await exec_tool(
ToolCall(
name="terminal",
arguments={"command": cmd, "timeout": self.config.test_timeout_s},
)
)
verification_messages.append({"role": "user", "content": res.to_xml()})
if not res.success:
return 0.0, {
"verification_mode": "dataset_tests",
"phase": "pytest",
"failed_chunk": chunk_idx,
"error": res.error,
"output": res.output,
"verification_messages": verification_messages,
}
return 1.0, {"verification_mode": "dataset_tests", "passed": True, "verification_messages": verification_messages}
async def score_trajectory(self, item: Item, final_response: str) -> float:
# Not used; scoring happens in verify_and_score_trajectory.
_ = (item, final_response)
return 0.0
if __name__ == "__main__":
SweSmithOracleEnv.cli()

217
atropos/envs/test_env.py Normal file

@@ -0,0 +1,217 @@
"""
Simple test environment for validating the atropos-agent setup.
This environment uses a local OpenAI-compatible LLM server to verify:
- BaseEnv extension works correctly
- API communication via OpenAI-compatible endpoint
- Basic trajectory collection
This is a minimal environment for testing, not production use.
"""
import os
from typing import Dict, List, Optional, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import (
APIServerConfig,
Item,
)
from ..agent import AgentConfig
from .agent_env import AgentEnv, AgentEnvConfig
# Load environment variables from .env file
load_dotenv()
# Simple test prompts for validation
TEST_PROMPTS = [
{
"prompt": "What is 2 + 2? Answer with just the number.",
"expected": "4",
},
{
"prompt": "What is the capital of France? Answer with just the city name.",
"expected": "Paris",
},
{
"prompt": "What color is the sky on a clear day? Answer with just the color.",
"expected": "Blue",
},
{
"prompt": "How many days are in a week? Answer with just the number.",
"expected": "7",
},
{
"prompt": "What is 10 * 5? Answer with just the number.",
"expected": "50",
},
]
SYSTEM_PROMPT = (
"You are a helpful assistant. Answer questions concisely and directly. "
"When asked for a simple answer, provide just that answer without explanation."
)
class SimpleTestEnvConfig(AgentEnvConfig):
"""Configuration for the simple test environment."""
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible server (without /v1)",
)
server_model: str = Field(
default="hermes-4-36b",
description="Model name",
)
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class SimpleTestEnv(AgentEnv[SimpleTestEnvConfig]):
"""
A simple test environment to validate the atropos-agent setup.
Uses a local OpenAI-compatible LLM endpoint with basic question-answering tasks.
Scoring is based on whether the response contains the expected answer.
"""
name = "simple_test_env"
env_config_cls = SimpleTestEnvConfig
def __init__(
self,
config: SimpleTestEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self.iter = 0
self.test_prompts = TEST_PROMPTS
self.percent_correct_buffer: List[float] = []
@classmethod
def config_init(cls) -> Tuple[SimpleTestEnvConfig, List[APIServerConfig]]:
"""
Initialize configuration with local server settings from environment variables.
"""
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = SimpleTestEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=4,
use_wandb=False, # Disable wandb for simple testing
rollout_server_url="http://localhost:8000",
total_steps=10,
batch_size=16,
steps_per_eval=5,
max_token_length=2048,
inference_weight=1.0,
wandb_name="simple_test",
server_base_url=base_url,
server_model=model,
)
# OpenAI-compatible servers typically expose chat completions at /v1.
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url}/v1",
api_key=api_key,
num_max_requests_at_once=4,
num_requests_for_eval=8,
timeout=120, # Local models may be slower
),
]
return env_config, server_configs
async def setup_agent_env(self):
"""Setup the environment - load test data."""
print(f"SimpleTestEnv setup complete. {len(self.test_prompts)} test prompts loaded.")
print(f"Using server at: {self.config.server_base_url}")
print(f"Model: {self.config.server_model}")
async def get_next_item(self) -> Item:
"""Get the next test prompt."""
item = self.test_prompts[self.iter % len(self.test_prompts)]
self.iter += 1
return item
def build_task(self, item: Item) -> str:
return item["prompt"]
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
return AgentConfig(
max_steps=5,
temperature=0.7,
max_tokens=256,
system_prompt=SYSTEM_PROMPT,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
expected = item["expected"].lower()
response_lower = (final_response or "").lower()
score = 1.0 if expected in response_lower else 0.0
self.percent_correct_buffer.append(score)
return score
async def evaluate(self, *args, **kwargs):
"""
Simple evaluation - run through all test prompts once.
"""
correct = 0
total = len(self.test_prompts)
for item in self.test_prompts:
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": item["prompt"]},
]
response = await self.server.chat_completion(
messages=messages,
n=1,
max_tokens=256,
temperature=0.0, # Greedy for eval
split="eval",
)
response_text = response.choices[0].message.content or ""
expected = item["expected"].lower()
if expected in response_text.lower():
correct += 1
accuracy = correct / total
print(f"Evaluation: {correct}/{total} = {accuracy:.2%} accuracy")
return {"eval_accuracy": accuracy}
async def wandb_log(self, wandb_metrics: Optional[Dict] = None):
"""Log metrics (simplified for testing)."""
if wandb_metrics is None:
wandb_metrics = {}
if self.percent_correct_buffer:
avg_correct = sum(self.percent_correct_buffer) / len(self.percent_correct_buffer)
wandb_metrics["train/percent_correct"] = avg_correct
print(f"Train accuracy: {avg_correct:.2%}")
self.percent_correct_buffer = []
await super().wandb_log(wandb_metrics)
if __name__ == "__main__":
# Allow running as CLI
SimpleTestEnv.cli()


@@ -0,0 +1,165 @@
"""
ToolServer routing smoke environment.
Validates that:
- sandbox tools run through Nomad SlotPool (terminal -> bash in sandbox)
- external tools run through ToolServer (skills_list)
This env uses ToolServer in-process by default (`tool_server_url="inprocess"`),
so it is self-contained for local testing.
Run:
uv run python -m atropos.envs.toolserver_smoke_env process --env.use_wandb false --env.total_steps 1 --env.group_size 1
"""
from __future__ import annotations
import os
from typing import Any, Dict, List, Tuple
from dotenv import load_dotenv
from pydantic import Field
from atroposlib.envs.base import APIServerConfig, Item
from ..agent import AgentConfig, AgentResult
from .agent_env import AgentEnv, AgentEnvConfig
load_dotenv()
class ToolServerSmokeEnvConfig(AgentEnvConfig):
server_base_url: str = Field(
default="http://127.0.0.1:8080",
description="Base URL for an OpenAI-compatible chat server (without /v1).",
)
server_model: str = Field(default="hermes-4-36b", description="Model name")
tokenizer_name: str = Field(default="NousResearch/Hermes-4.3-36B", description="Tokenizer name for RL tokenization")
class ToolServerSmokeEnv(AgentEnv[ToolServerSmokeEnvConfig]):
name = "toolserver_smoke_env"
env_config_cls = ToolServerSmokeEnvConfig
def __init__(
self,
config: ToolServerSmokeEnvConfig,
server_configs: List[APIServerConfig],
slurm: bool = False,
testing: bool = False,
):
super().__init__(config, server_configs, slurm, testing)
self._iter = 0
@classmethod
def config_init(cls) -> Tuple[ToolServerSmokeEnvConfig, List[APIServerConfig]]:
base_url = (
os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://127.0.0.1:8080"
)
model = os.getenv("ATROPOS_SERVER_MODEL") or os.getenv("LLM_MODEL") or "hermes-4-36b"
api_key = os.getenv("ATROPOS_SERVER_API_KEY") or os.getenv("NOUS_API_KEY") or os.getenv("OPENAI_API_KEY") or "local"
env_config = ToolServerSmokeEnvConfig(
tokenizer_name=os.getenv("ATROPOS_TOKENIZER_NAME") or "NousResearch/Hermes-4.3-36B",
group_size=1,
use_wandb=False,
include_messages=True,
ensure_scores_are_not_same=False,
total_steps=1,
batch_size=1,
server_base_url=base_url,
server_model=model,
enabled_toolsets=["terminal", "skills"],
disabled_toolsets=[],
# Self-contained ToolServer for local smoke.
tool_server_url="inprocess",
sandbox_image=os.getenv("ATROPOS_SANDBOX_IMAGE") or "atropos-sandbox:local",
purge_job_on_start=True,
purge_job_on_shutdown=True,
)
server_configs = [
APIServerConfig(
model_name=model,
base_url=f"{base_url.rstrip('/')}/v1",
api_key=api_key,
num_max_requests_at_once=1,
num_requests_for_eval=1,
timeout=120,
)
]
return env_config, server_configs
async def setup_agent_env(self) -> None:
return None
async def get_next_item(self) -> Item:
self._iter += 1
return {
"prompt": (
"You MUST call exactly one tool per assistant message.\n"
"\n"
"Step 1) Call the skills_list tool (no arguments), then stop.\n"
"Step 2) After you receive the tool response, call the terminal tool to run:\n"
"python -c \"print('ok')\"\n"
"Step 3) After you receive the terminal tool response, answer with just: ok\n"
"\n"
"Tool call format requirements:\n"
"- Every tool call MUST be a complete XML block with a closing tag.\n"
"- Do NOT emit a second <tool_call> in the same assistant message.\n"
"\n"
"Example:\n"
"<tool_call>{\"name\": \"skills_list\", \"arguments\": {}}</tool_call>\n"
"Do not include anything else in your final answer."
)
}
def build_task(self, item: Item) -> str:
return str(item.get("prompt") or "")
def build_agent_config(self, item: Item) -> AgentConfig: # noqa: ARG002
return AgentConfig(
max_steps=min(10, int(self.config.agent_max_steps)),
temperature=0.2,
max_tokens=None,
)
async def score_trajectory(self, item: Item, final_response: str) -> float:
_ = (item, final_response)
return 0.0
async def verify_and_score_trajectory(
self,
item: Item,
final_response: str,
*,
trajectory_id: str, # noqa: ARG002
exec_tool, # noqa: ARG002
agent_result: AgentResult | None = None,
workspace_meta: Dict[str, Any] | None = None, # noqa: ARG002
) -> tuple[float, Dict[str, Any]]:
if agent_result is None:
return 0.0, {"error": "Missing agent_result"}
called = {c.name for s in agent_result.steps for c in s.tool_calls}
need = {"skills_list", "terminal"}
if not need.issubset(called):
return 0.0, {"error": f"Missing tool calls: {sorted(need - called)}", "called": sorted(called)}
terminal_ok = False
for step in agent_result.steps:
for call, res in zip(step.tool_calls, step.tool_results):
if call.name != "terminal":
continue
out = (res.output or "").strip()
if res.success and out and out.splitlines()[-1].strip() == "ok":
terminal_ok = True
score = 1.0 if terminal_ok and (final_response or "").strip() == "ok" else 0.0
return score, {"called": sorted(called), "final": (final_response or "").strip()}
if __name__ == "__main__":
ToolServerSmokeEnv.cli()

11
atropos/nomad/__init__.py Normal file

@@ -0,0 +1,11 @@
"""
Nomad integration for atropos-agent.
Provides:
- NomadClient: Client for Nomad HTTP API
- Job templates for sandbox containers
"""
from .client import NomadClient
__all__ = ["NomadClient"]

500
atropos/nomad/client.py Normal file

@@ -0,0 +1,500 @@
"""
Nomad API Client for atropos-agent.
Provides a simple async client for interacting with the Nomad HTTP API:
- Submit/stop jobs
- Query allocations
- Get allocation addresses
- Scale jobs up/down
"""
import asyncio
import json
import os
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional
import aiohttp
class AllocationStatus(Enum):
"""Nomad allocation status."""
PENDING = "pending"
RUNNING = "running"
COMPLETE = "complete"
FAILED = "failed"
LOST = "lost"
@dataclass
class Allocation:
"""Information about a Nomad allocation."""
id: str
job_id: str
task_group: str
node_id: str
status: AllocationStatus
# Network info for reaching the allocation
address: Optional[str] = None
port: Optional[int] = None
@property
def http_address(self) -> Optional[str]:
"""Get full HTTP address for the allocation."""
if self.address and self.port:
return f"http://{self.address}:{self.port}"
return None
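# Illustrative: an allocation with address="10.0.0.5" and port=23456
# yields http_address == "http://10.0.0.5:23456"; it is None until
# Nomad reports network info.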
@dataclass
class JobStatus:
"""Status of a Nomad job."""
id: str
name: str
status: str
allocations: List[Allocation] = field(default_factory=list)
count: int = 0  # Total desired allocations (summed task-group Count)
class NomadClient:
"""
Async client for Nomad HTTP API.
Usage:
client = NomadClient(address="http://localhost:4646")
# Submit a job
await client.submit_job(job_spec)
# Get allocations
allocs = await client.get_job_allocations("sandbox-python")
# Scale job
await client.scale_job("sandbox-python", count=5)
"""
def __init__(
self,
address: str = "http://localhost:4646",
token: Optional[str] = None,
timeout: float = 30.0,
):
self.address = address.rstrip("/")
self.token = token or os.environ.get("NOMAD_TOKEN")
self.timeout = aiohttp.ClientTimeout(total=timeout)
self._session: Optional[aiohttp.ClientSession] = None
async def _get_session(self) -> aiohttp.ClientSession:
"""Get or create HTTP session."""
if self._session is None or self._session.closed:
headers = {}
if self.token:
headers["X-Nomad-Token"] = self.token
self._session = aiohttp.ClientSession(
timeout=self.timeout,
headers=headers,
)
return self._session
async def close(self):
"""Close the HTTP session."""
if self._session and not self._session.closed:
await self._session.close()
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
await self.close()
async def _request(
self,
method: str,
path: str,
data: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
"""Make an HTTP request to Nomad API."""
session = await self._get_session()
url = f"{self.address}{path}"
try:
async with session.request(method, url, json=data) as response:
if response.status == 404:
return {"error": "not_found", "status": 404}
text = await response.text()
if not text:
return {"status": response.status}
try:
result = json.loads(text)
except json.JSONDecodeError:
return {"text": text, "status": response.status}
if response.status >= 400:
return {"error": result, "status": response.status}
return result if isinstance(result, dict) else {"data": result, "status": response.status}
except aiohttp.ClientError as e:
return {"error": str(e), "status": 0}
# Job Operations
async def submit_job(self, job_spec: Dict[str, Any]) -> Dict[str, Any]:
"""
Submit a job to Nomad.
Args:
job_spec: Job specification dict (HCL converted to JSON)
Returns:
Response with EvalID if successful
"""
return await self._request("POST", "/v1/jobs", {"Job": job_spec})
async def stop_job(self, job_id: str, purge: bool = False) -> Dict[str, Any]:
"""
Stop (and optionally purge) a job.
Args:
job_id: Job identifier
purge: If True, completely remove the job
"""
path = f"/v1/job/{job_id}"
if purge:
path += "?purge=true"
return await self._request("DELETE", path)
async def get_job(self, job_id: str) -> Optional[Dict[str, Any]]:
"""Get job details."""
result = await self._request("GET", f"/v1/job/{job_id}")
if "error" in result and result.get("status") == 404:
return None
return result
async def get_job_status(self, job_id: str) -> Optional[JobStatus]:
"""Get job status with allocations."""
job = await self.get_job(job_id)
if not job:
return None
allocs = await self.get_job_allocations(job_id)
# Get count from task groups
count = 0
task_groups = job.get("TaskGroups", [])
for tg in task_groups:
count += tg.get("Count", 1)
return JobStatus(
id=job_id,
name=job.get("Name", job_id),
status=job.get("Status", "unknown"),
allocations=allocs,
count=count,
)
# Allocation Operations
async def get_job_allocations(self, job_id: str) -> List[Allocation]:
"""Get all allocations for a job."""
result = await self._request("GET", f"/v1/job/{job_id}/allocations")
if "error" in result:
return []
allocs_data = result.get("data", result) if isinstance(result, dict) else result
if not isinstance(allocs_data, list):
return []
allocations = []
for alloc_data in allocs_data:
# Parse allocation info
alloc_id = alloc_data.get("ID", "")
status_str = alloc_data.get("ClientStatus", "unknown")
try:
status = AllocationStatus(status_str)
except ValueError:
status = AllocationStatus.PENDING
# Get network info - need to fetch detailed allocation for this
address = None
port = None
# First try the summary data
resources = alloc_data.get("AllocatedResources") or {}
shared = resources.get("Shared") or {}
networks = shared.get("Networks") or []
# If no networks in summary, fetch detailed allocation
if not networks and alloc_id:
detailed = await self.get_allocation(alloc_id)
if detailed:
resources = detailed.get("AllocatedResources") or {}
shared = resources.get("Shared") or {}
networks = shared.get("Networks") or []
if networks:
network = networks[0]
address = network.get("IP")
# Look for dynamic ports OR reserved ports (Singularity/raw_exec uses reserved)
dyn_ports = network.get("DynamicPorts") or []
reserved_ports = network.get("ReservedPorts") or []
for dp in dyn_ports + reserved_ports:
if dp.get("Label") == "http":
port = dp.get("Value")
break
allocations.append(Allocation(
id=alloc_id,
job_id=job_id,
task_group=alloc_data.get("TaskGroup", ""),
node_id=alloc_data.get("NodeID", ""),
status=status,
address=address,
port=port,
))
return allocations
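# Illustrative shape of the Nomad payload this parser walks (trimmed):
#   {"ID": "...", "ClientStatus": "running",
#    "AllocatedResources": {"Shared": {"Networks": [
#        {"IP": "10.0.0.5", "DynamicPorts": [{"Label": "http", "Value": 23456}]}
#    ]}}}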
async def get_allocation(self, alloc_id: str) -> Optional[Dict[str, Any]]:
"""Get detailed allocation info."""
result = await self._request("GET", f"/v1/allocation/{alloc_id}")
if "error" in result and result.get("status") == 404:
return None
return result
# Scaling Operations
async def scale_job(self, job_id: str, count: int, task_group: str = "sandbox") -> Dict[str, Any]:
"""
Scale a job's task group to specified count.
Args:
job_id: Job identifier
count: Desired number of allocations
task_group: Name of task group to scale
"""
payload = {
"Count": count,
"Target": {
"Group": task_group,
},
}
return await self._request("POST", f"/v1/job/{job_id}/scale", payload)
async def get_job_scale_status(self, job_id: str) -> Dict[str, int]:
"""
Get current scale status for a job.
Returns:
Dict mapping task group name to count
"""
result = await self._request("GET", f"/v1/job/{job_id}/scale")
if "error" in result:
return {}
task_groups = result.get("TaskGroups", {})
return {
name: info.get("Running", 0)
for name, info in task_groups.items()
}
# Health Check
async def is_healthy(self) -> bool:
"""Check if Nomad is reachable and healthy."""
try:
result = await self._request("GET", "/v1/status/leader")
return "error" not in result
except Exception:
return False
async def get_leader(self) -> Optional[str]:
"""Get current Nomad leader address."""
result = await self._request("GET", "/v1/status/leader")
if isinstance(result, dict) and "data" in result:
return result["data"]
return None
def load_job_template(
template_name: str = "sandbox",
**kwargs,
) -> Dict[str, Any]:
"""
Load and configure a job template.
Args:
template_name: Name of template (e.g., "sandbox")
**kwargs: Template variables to substitute
Returns:
Job specification dict ready for Nomad API
"""
# Default job template for sandbox container
if template_name == "sandbox":
return create_sandbox_job(**kwargs)
else:
raise ValueError(f"Unknown template: {template_name}")
def create_sandbox_job(
job_id: str = "atropos-sandbox",
image: str = "atropos-sandbox:local", # Use :local tag to avoid registry pull
count: int = 1,
slots_per_container: int = 10,
privileged: bool = False,
cpu: int = 500,
memory: int = 512,
port: int = 8080,
datacenter: str = "dc1",
driver: str = "docker", # "docker" or "singularity"
singularity_image: Optional[str] = None,  # Path to .sif file for singularity driver
) -> Dict[str, Any]:
"""
Create a sandbox job specification.
This job runs the sandbox_server.py inside a container,
with the specified number of slots for agent workspaces.
Args:
job_id: Unique job identifier
image: Docker image to use (for docker driver)
count: Number of container instances
slots_per_container: Number of slots per container
privileged: Run container in privileged mode (recommended for bubblewrap)
cpu: CPU allocation in MHz
memory: Memory allocation in MB
port: HTTP port for sandbox server
datacenter: Nomad datacenter
driver: Container driver - "docker" or "singularity"
singularity_image: Path to .sif file (required if driver="singularity")
Returns:
Job specification dict
"""
# Build task config based on driver
if driver == "singularity":
if not singularity_image:
raise ValueError("singularity_image path required when driver='singularity'")
# Use raw_exec driver to run apptainer via shell for variable expansion
# The container binds the allocation directory for workspace persistence
# For raw_exec, we use static port since Nomad's dynamic port mapping doesn't
# work the same as Docker - the process runs directly on the host.
shell_cmd = (
f'apptainer run '
f'--bind "$NOMAD_ALLOC_DIR/data:/data" '
f'--pwd /app '
f'--env PYTHONUNBUFFERED=1 '
f'{singularity_image} '
f'python sandbox_server.py '
f'--port {port} '
f'--slots {slots_per_container} '
f'--data-dir /data'
)
task_config = {
"command": "/bin/sh",
"args": ["-c", shell_cmd],
}
task_driver = "raw_exec"
else:
# Docker driver (default)
task_config = {
"image": image,
"force_pull": False, # Use local image, don't try to pull
"ports": ["http"],
"privileged": privileged,
"command": "python",
"args": [
"sandbox_server.py",
"--port", str(port),
"--slots", str(slots_per_container),
"--data-dir", "/data",
],
# Note: On Linux, you can mount persistent storage:
# "volumes": ["${NOMAD_ALLOC_DIR}/data:/data"],
# On macOS/Docker Desktop, skip volumes for PoC
# (container /data is ephemeral but works for testing)
}
task_driver = "docker"
# For Singularity/raw_exec, use static ports since the process runs directly on host.
# For Docker, use dynamic ports with port mapping.
if driver == "singularity":
network_config = {
"Mode": "host",
"ReservedPorts": [
{
"Label": "http",
"Value": port,
}
],
}
else:
network_config = {
"Mode": "host",
"DynamicPorts": [
{
"Label": "http",
"To": port,
}
],
}
return {
"ID": job_id,
"Name": job_id,
"Type": "service",
"Datacenters": [datacenter],
"TaskGroups": [
{
"Name": "sandbox",
"Count": count,
# Speed up deployments and avoid Consul checks. Without this, Nomad may
# keep an "active deployment" around for the default MinHealthyTime,
# which blocks immediate scaling under load.
"Update": {
"HealthCheck": "task_states",
"MinHealthyTime": 0,
},
"Networks": [network_config],
"Tasks": [
{
"Name": "sandbox-server",
"Driver": task_driver,
"Config": task_config,
"Env": {
"PYTHONUNBUFFERED": "1",
"NOMAD_ALLOC_DIR": "${NOMAD_ALLOC_DIR}",
},
"Resources": {
"CPU": cpu,
"MemoryMB": memory,
},
# Note: Services with Checks require Consul, which we skip for the PoC
}
],
"RestartPolicy": {
"Attempts": 3,
"Interval": 300_000_000_000, # 5 minutes
"Delay": 10_000_000_000, # 10 seconds
"Mode": "delay",
},
"ReschedulePolicy": {
"Attempts": 5,
"Interval": 3600_000_000_000, # 1 hour
"Delay": 30_000_000_000, # 30 seconds
"DelayFunction": "exponential",
"MaxDelay": 300_000_000_000, # 5 minutes
"Unlimited": False,
},
}
],
}
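# Hedged usage sketch (assumes a Nomad agent at the default local address;
# the job id and count below are illustrative, not part of this module):
#
#   import asyncio
#
#   async def _demo():
#       async with NomadClient("http://localhost:4646") as client:
#           if not await client.is_healthy():
#               raise RuntimeError("Nomad is not reachable")
#           spec = create_sandbox_job(job_id="atropos-sandbox", count=2)
#           print(await client.submit_job(spec))
#           for alloc in await client.get_job_allocations("atropos-sandbox"):
#               print(alloc.id, alloc.status.value, alloc.http_address)
#
#   asyncio.run(_demo())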

1912
atropos/sandbox_server.py Normal file

File diff suppressed because it is too large.

20
atropos/slots/__init__.py Normal file

@@ -0,0 +1,20 @@
"""
Slot-based multiplexing for atropos-agent.
Provides:
- Slot: Isolated workspace for a single trajectory
- SlotPool: Manages slots across Nomad allocations
- SandboxExecutor: Executes tools in sandbox containers
"""
from .executor import SandboxExecutor
from .pool import SlotPool, SlotPoolConfig
from .slot import Slot, SlotState
__all__ = [
"Slot",
"SlotState",
"SlotPool",
"SlotPoolConfig",
"SandboxExecutor",
]

457
atropos/slots/executor.py Normal file

@@ -0,0 +1,457 @@
"""
SandboxExecutor - HTTP client for sandbox container communication.
Sends tool execution requests to sandbox_server.py running inside Nomad containers.
Supports single and batch execution for efficiency.
"""
import asyncio
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple
import aiohttp
from .slot import Slot, SlotState
from ..tools.base import ToolCall, ToolResult
@dataclass
class ExecutionRequest:
"""Request to execute a tool in a slot."""
slot: Slot
tool_name: str
args: Dict[str, Any]
execution_id: str = field(default_factory=lambda: str(uuid.uuid4()))
timeout: float = 30.0
@dataclass
class ExecutionResult:
"""Result from sandbox execution."""
success: bool
output: str = ""
error: str = ""
execution_id: str = ""
slot_id: str = ""
metadata: Dict[str, Any] = field(default_factory=dict)
def to_tool_result(self) -> ToolResult:
"""Convert to ToolResult for agent consumption."""
return ToolResult(
success=self.success,
output=self.output,
error=self.error,
metadata=self.metadata,
uniq_id=self.execution_id,
)
class SandboxExecutor:
"""
HTTP client for executing tools in sandbox containers.
Communicates with sandbox_server.py running inside Nomad allocations.
Supports both single execution and batched parallel execution.
Usage:
executor = SandboxExecutor()
# Single execution
result = await executor.execute(slot, "bash", {"command": "ls"})
# Batch execution
results = await executor.execute_batch([
(slot1, "bash", {"command": "ls"}),
(slot2, "write_file", {"path": "test.txt", "content": "hello"}),
])
"""
def __init__(
self,
timeout: float = 30.0,
max_retries: int = 3,
retry_delay: float = 1.0,
):
self.timeout = aiohttp.ClientTimeout(total=timeout)
self.max_retries = max_retries
self.retry_delay = retry_delay
self._session: Optional[aiohttp.ClientSession] = None
async def _get_session(self) -> aiohttp.ClientSession:
"""Get or create HTTP session."""
if self._session is None or self._session.closed:
self._session = aiohttp.ClientSession(timeout=self.timeout)
return self._session
async def close(self):
"""Close HTTP session."""
if self._session and not self._session.closed:
await self._session.close()
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
await self.close()
async def execute(
self,
slot: Slot,
tool_name: str,
args: Dict[str, Any],
timeout: Optional[float] = None,
) -> ExecutionResult:
"""
Execute a tool in a slot's workspace.
Args:
slot: Slot to execute in
tool_name: Name of tool (bash, read_file, write_file)
args: Tool arguments
timeout: Optional timeout override
Returns:
ExecutionResult with output or error
"""
execution_id = str(uuid.uuid4())
exec_timeout = timeout or self.timeout.total or 30.0
# Mark slot as executing
try:
if slot.state == SlotState.ACQUIRED:
slot.start_execution(execution_id)
result = await self._send_execute_request(
container_addr=slot.container_addr,
slot_id=slot.slot_id,
tool_name=tool_name,
args=args,
execution_id=execution_id,
timeout=exec_timeout,
)
result.slot_id = slot.slot_id
return result
finally:
# Restore slot state
if slot.state == SlotState.EXECUTING:
slot.end_execution()
async def _send_execute_request(
self,
container_addr: str,
slot_id: str,
tool_name: str,
args: Dict[str, Any],
execution_id: str,
timeout: float,
) -> ExecutionResult:
"""Send execution request to sandbox server with retry logic."""
session = await self._get_session()
url = f"{container_addr}/execute"
payload = {
"slot_id": slot_id,
"tool": tool_name,
"args": args,
"execution_id": execution_id,
"timeout": timeout,
}
last_error = None
for attempt in range(self.max_retries):
try:
async with session.post(url, json=payload) as response:
data = await response.json()
return ExecutionResult(
success=data.get("success", False),
output=data.get("output", ""),
error=data.get("error", ""),
execution_id=data.get("execution_id", execution_id),
metadata=data.get("metadata", {}),
)
except aiohttp.ClientError as e:
last_error = str(e)
if attempt < self.max_retries - 1:
await asyncio.sleep(self.retry_delay * (attempt + 1))
continue
except asyncio.TimeoutError:
last_error = f"Request timed out after {timeout}s"
break
except Exception as e:
last_error = str(e)
break
return ExecutionResult(
success=False,
error=f"Failed after {self.max_retries} attempts: {last_error}",
execution_id=execution_id,
)
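# Note: retries back off linearly; with the defaults (max_retries=3,
# retry_delay=1.0) the waits between attempts are 1.0s then 2.0s, and a
# timeout breaks out immediately instead of retrying.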
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
timeout: Optional[float] = None,
) -> List[ExecutionResult]:
"""
Execute multiple tools in parallel across slots.
This is the key optimization - we batch tool calls to maximize
container utilization while agents are waiting for LLM responses.
Args:
requests: List of (slot, tool_name, args) tuples
timeout: Optional timeout override
Returns:
List of ExecutionResults in same order as requests
"""
if not requests:
return []
# Group requests by container address for batch API
by_container: Dict[str, List[Tuple[int, Slot, str, Dict[str, Any], str]]] = {}
for idx, (slot, tool_name, args) in enumerate(requests):
execution_id = str(uuid.uuid4())
container = slot.container_addr
if container not in by_container:
by_container[container] = []
by_container[container].append((idx, slot, tool_name, args, execution_id))
# Mark slots as executing
if slot.state == SlotState.ACQUIRED:
slot.start_execution(execution_id)
# Execute batches in parallel
exec_timeout = timeout or self.timeout.total or 30.0
batch_tasks = []
for container_addr, batch_requests in by_container.items():
task = self._send_batch_request(
container_addr=container_addr,
batch_requests=batch_requests,
timeout=exec_timeout,
)
batch_tasks.append(task)
# Gather all batch results
batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)
# Collect results in original order
results: List[Optional[ExecutionResult]] = [None] * len(requests)
for batch_result in batch_results:
if isinstance(batch_result, Exception):
# Leave these results as None; they are filled in as failures below
continue
for idx, result in batch_result:
results[idx] = result
# Fill in any missing results
for idx, result in enumerate(results):
if result is None:
slot, tool_name, args = requests[idx]
results[idx] = ExecutionResult(
success=False,
error="Batch execution failed",
slot_id=slot.slot_id,
)
# End execution on all slots
for slot, _, _ in requests:
if slot.state == SlotState.EXECUTING:
slot.end_execution()
return results # type: ignore
async def _send_batch_request(
self,
container_addr: str,
batch_requests: List[Tuple[int, Slot, str, Dict[str, Any], str]],
timeout: float,
) -> List[Tuple[int, ExecutionResult]]:
"""Send batch execution request to a single container."""
session = await self._get_session()
url = f"{container_addr}/batch"
# Build batch payload
payload = [
{
"slot_id": slot.slot_id,
"tool": tool_name,
"args": args,
"execution_id": execution_id,
"timeout": timeout,
}
for _, slot, tool_name, args, execution_id in batch_requests
]
try:
async with session.post(url, json=payload) as response:
data = await response.json()
if not isinstance(data, list):
raise ValueError(f"Expected list response, got {type(data)}")
results = []
for i, (idx, slot, _, _, execution_id) in enumerate(batch_requests):
if i < len(data):
item = data[i]
result = ExecutionResult(
success=item.get("success", False),
output=item.get("output", ""),
error=item.get("error", ""),
execution_id=item.get("execution_id", execution_id),
slot_id=slot.slot_id,
metadata=item.get("metadata", {}),
)
else:
result = ExecutionResult(
success=False,
error="Missing result in batch response",
execution_id=execution_id,
slot_id=slot.slot_id,
)
results.append((idx, result))
return results
except Exception as e:
# Return error for all requests in batch
return [
(idx, ExecutionResult(
success=False,
error=str(e),
execution_id=execution_id,
slot_id=slot.slot_id,
))
for idx, slot, _, _, execution_id in batch_requests
]
async def reset_slot(self, slot: Slot) -> ExecutionResult:
"""
Reset a slot's workspace (delete all files).
Useful when reusing a slot for a new trajectory.
"""
session = await self._get_session()
url = f"{slot.container_addr}/reset"
try:
async with session.post(url, json={"slot_id": slot.slot_id}) as response:
data = await response.json()
return ExecutionResult(
success=data.get("success", False),
output=data.get("output", ""),
error=data.get("error", ""),
slot_id=slot.slot_id,
)
except Exception as e:
return ExecutionResult(
success=False,
error=str(e),
slot_id=slot.slot_id,
)
async def health_check(self, container_addr: str) -> bool:
"""Check if a sandbox container is healthy."""
session = await self._get_session()
url = f"{container_addr}/health"
try:
async with session.get(url) as response:
data = await response.json()
return data.get("status") == "ok"
except Exception:
return False
async def get_container_status(
self,
container_addr: str
) -> Optional[Dict[str, Any]]:
"""Get status info from a sandbox container."""
session = await self._get_session()
url = f"{container_addr}/health"
try:
async with session.get(url) as response:
return await response.json()
except Exception:
return None
# -------------------------------------------------------------------------
# Artifact helpers (optional)
# -------------------------------------------------------------------------
async def _post_json(
self,
url: str,
payload: Dict[str, Any],
timeout: Optional[float] = None,
) -> Dict[str, Any]:
session = await self._get_session()
# Avoid passing timeout=None through to aiohttp (treated as "no timeout"
# rather than the session default); only override when a value is given.
request_kwargs: Dict[str, Any] = {}
if timeout is not None:
request_kwargs["timeout"] = aiohttp.ClientTimeout(total=timeout)
try:
async with session.post(url, json=payload, **request_kwargs) as response:
data = await response.json()
if isinstance(data, dict):
data.setdefault("http_status", response.status)
return data
return {"success": False, "error": f"Unexpected response type: {type(data)}", "http_status": response.status}
except Exception as e:
return {"success": False, "error": str(e)}
async def read_artifact(
self,
slot: Slot,
path: str,
*,
encoding: str = "text",
max_bytes: Optional[int] = None,
include_sha256: bool = False,
timeout: Optional[float] = None,
) -> Dict[str, Any]:
url = f"{slot.container_addr}/artifacts/read"
payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "encoding": encoding, "include_sha256": include_sha256}
if max_bytes is not None:
payload["max_bytes"] = max_bytes
return await self._post_json(url, payload, timeout=timeout)
async def list_artifacts(
self,
slot: Slot,
path: str = ".",
*,
recursive: bool = False,
max_entries: Optional[int] = None,
timeout: Optional[float] = None,
) -> Dict[str, Any]:
url = f"{slot.container_addr}/artifacts/list"
payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "recursive": recursive}
if max_entries is not None:
payload["max_entries"] = max_entries
return await self._post_json(url, payload, timeout=timeout)
async def archive_artifacts(
self,
slot: Slot,
path: str = ".",
*,
archive_format: str = "tar.gz",
max_bytes: Optional[int] = None,
max_entries: Optional[int] = None,
timeout: Optional[float] = None,
) -> Dict[str, Any]:
url = f"{slot.container_addr}/artifacts/archive"
payload: Dict[str, Any] = {"slot_id": slot.slot_id, "path": path, "format": archive_format}
if max_bytes is not None:
payload["max_bytes"] = max_bytes
if max_entries is not None:
payload["max_entries"] = max_entries
return await self._post_json(url, payload, timeout=timeout)
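
A minimal end-to-end sketch of driving SandboxExecutor directly, assuming a sandbox server is already reachable at each slot's container_addr (addresses and alloc ids below are placeholders):

import asyncio

from atropos.slots.executor import SandboxExecutor
from atropos.slots.slot import Slot

async def main() -> None:
    # Two slots on the same (placeholder) container; the batch path groups
    # requests per container_addr and posts them to /batch in one call.
    slot_a = Slot(slot_id="slot_0", alloc_id="alloc-aaaaaaaa", container_addr="http://127.0.0.1:8080")
    slot_b = Slot(slot_id="slot_1", alloc_id="alloc-aaaaaaaa", container_addr="http://127.0.0.1:8080")
    for s in (slot_a, slot_b):
        s.acquire()  # execute() only flips ACQUIRED slots into EXECUTING

    async with SandboxExecutor(timeout=30.0, max_retries=3) as executor:
        single = await executor.execute(slot_a, "bash", {"command": "echo hello"})
        batch = await executor.execute_batch([
            (slot_a, "write_file", {"path": "test.txt", "content": "hello"}),
            (slot_b, "bash", {"command": "ls"}),
        ])
        print(single.output, [r.success for r in batch])

asyncio.run(main())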

659
atropos/slots/pool.py Normal file
View File

@@ -0,0 +1,659 @@
"""
SlotPool - Manages slots across Nomad allocations.
The SlotPool is the core abstraction for slot-based multiplexing:
- Tracks available/acquired slots across containers
- Handles slot acquisition and release
- Auto-scales Nomad job count based on demand
- Provides batched tool execution
"""
import asyncio
import logging
import os
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from ..nomad.client import (
Allocation,
AllocationStatus,
NomadClient,
create_sandbox_job,
)
from .executor import ExecutionResult, SandboxExecutor
from .slot import Slot, SlotState, create_slots_for_allocation
logger = logging.getLogger(__name__)
@dataclass
class SlotPoolConfig:
"""Configuration for SlotPool."""
# Nomad settings
nomad_address: str = "http://localhost:4646"
job_id: str = "atropos-sandbox"
datacenter: str = "dc1"
# Container settings
image: str = "atropos-sandbox:local" # Use :local tag to avoid registry pull
slots_per_container: int = 10
privileged: bool = False
cpu: int = 500 # MHz
memory: int = 512 # MB
# Driver selection: "docker" or "singularity"
driver: str = "docker"
# Path to .sif file for singularity driver (required if driver="singularity")
singularity_image: Optional[str] = None
# Scaling settings
min_containers: int = 1
max_containers: int = 10
# Timeouts
acquire_timeout: float = 30.0 # Seconds between acquire polls (also triggers scale-up attempts)
health_check_interval: float = 30.0 # Seconds between health checks
scale_cooldown: float = 60.0 # Seconds between scale operations
# Job lifecycle
purge_job_on_start: bool = False # Purge any pre-existing job before starting (local dev/training friendly)
# Local Docker image convenience (macOS/Nomad dev mode)
auto_build_local_image: bool = True # If image endswith :local and is missing, build it from the bundled Dockerfile.
dockerfile_path: Optional[str] = None # Override Dockerfile path (default: Hermes-Agent/atropos/Dockerfile).
docker_build_context: Optional[str] = None # Override build context (default: Hermes-Agent/atropos).
class SlotPool:
"""
Manages a pool of slots across Nomad allocations.
The SlotPool:
- Deploys sandbox containers to Nomad
- Tracks slots across all running containers
- Handles slot acquisition/release
- Auto-scales based on demand
- Provides batched execution via SandboxExecutor
Usage:
config = SlotPoolConfig(
nomad_address="http://localhost:4646",
job_id="my-sandbox",
slots_per_container=10,
)
pool = SlotPool(config)
await pool.start()
# Acquire a slot
slot = await pool.acquire()
# Execute tool
result = await pool.execute(slot, "bash", {"command": "ls"})
# Release slot
await pool.release(slot)
# Shutdown
await pool.stop()
"""
def __init__(self, config: Optional[SlotPoolConfig] = None):
self.config = config or SlotPoolConfig()
# Nomad client
self.nomad = NomadClient(address=self.config.nomad_address)
# Sandbox executor for tool execution
self.executor = SandboxExecutor()
# Slot tracking
self._slots: Dict[str, Slot] = {} # slot_key -> Slot
self._available_queue: asyncio.Queue[str] = asyncio.Queue()
self._lock = asyncio.Lock()
self._scale_lock = asyncio.Lock()
# State
self._started = False
self._health_task: Optional[asyncio.Task] = None
self._scale_task: Optional[asyncio.Task] = None
self._last_scale_time = 0.0
def _default_dockerfile_path(self) -> Path:
# Hermes-Agent/atropos/Dockerfile lives next to this module in source checkouts.
return Path(__file__).resolve().parents[1] / "Dockerfile"
def _default_build_context(self) -> Path:
return Path(__file__).resolve().parents[1]
def _docker_image_exists(self, image: str) -> bool:
try:
proc = subprocess.run(
["docker", "image", "inspect", image],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=False,
env={**os.environ, "DOCKER_CLI_HINTS": "false"},
)
return proc.returncode == 0
except FileNotFoundError:
return False
def _try_build_local_image(self, image: str) -> None:
dockerfile = Path(self.config.dockerfile_path) if self.config.dockerfile_path else self._default_dockerfile_path()
context = Path(self.config.docker_build_context) if self.config.docker_build_context else self._default_build_context()
if not dockerfile.exists():
raise RuntimeError(
f"Sandbox Dockerfile not found at {dockerfile}. "
"Build the sandbox image manually or set --env.purge_job_on_start false and provide a non-local image."
)
if not context.exists():
raise RuntimeError(f"Docker build context not found at {context}")
# Prefer buildx+--load to ensure the image ends up in the local daemon (required by Nomad's docker driver).
buildx_cmd = [
"docker",
"buildx",
"build",
"--load",
"-t",
image,
"-f",
str(dockerfile),
str(context),
]
proc = subprocess.run(buildx_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
if proc.returncode == 0:
return
# Fallback to classic docker build if buildx isn't available.
build_cmd = ["docker", "build", "-t", image, "-f", str(dockerfile), str(context)]
proc2 = subprocess.run(build_cmd, check=False, env={**os.environ, "DOCKER_CLI_HINTS": "false"})
if proc2.returncode != 0:
raise RuntimeError(
f"Failed to build local sandbox image {image}. "
f"Tried: {' '.join(buildx_cmd)} and {' '.join(build_cmd)}"
)
def _ensure_local_image(self) -> None:
image = (self.config.image or "").strip()
if not image.endswith(":local"):
return
if not self.config.auto_build_local_image:
return
if self._docker_image_exists(image):
return
logger.info(f"Local sandbox image {image} not found; building it now...")
self._try_build_local_image(image)
def _slot_key(self, alloc_id: str, slot_id: str) -> str:
"""Generate unique key for a slot."""
return f"{alloc_id}:{slot_id}"
@property
def total_slots(self) -> int:
"""Total number of slots in pool."""
return len(self._slots)
@property
def available_slots(self) -> int:
"""Number of available slots."""
return sum(1 for s in self._slots.values() if s.is_available)
@property
def acquired_slots(self) -> int:
"""Number of acquired slots."""
return sum(1 for s in self._slots.values() if s.is_acquired)
async def start(self) -> None:
"""
Start the slot pool.
- Checks if Nomad is healthy
- Deploys sandbox job if not running
- Discovers existing allocations
- Starts health check background task
"""
if self._started:
return
logger.info(f"Starting SlotPool (job_id={self.config.job_id})")
try:
# Make sure local sandbox images exist before Nomad tries to pull them.
# This is a common footgun in macOS dev mode with :local tags.
self._ensure_local_image()
# Check Nomad health
if not await self.nomad.is_healthy():
raise RuntimeError(f"Nomad is not reachable at {self.config.nomad_address}")
if self.config.purge_job_on_start:
logger.info(f"Purging any existing Nomad job: {self.config.job_id}")
await self.nomad.stop_job(self.config.job_id, purge=True)
# Check if job exists (after optional purge)
job = await self.nomad.get_job(self.config.job_id)
if job is None:
# Deploy new job
logger.info(f"Deploying sandbox job: {self.config.job_id} (driver={self.config.driver})")
job_spec = create_sandbox_job(
job_id=self.config.job_id,
image=self.config.image,
count=self.config.min_containers,
slots_per_container=self.config.slots_per_container,
privileged=self.config.privileged,
cpu=self.config.cpu,
memory=self.config.memory,
datacenter=self.config.datacenter,
driver=self.config.driver,
singularity_image=self.config.singularity_image,
)
result = await self.nomad.submit_job(job_spec)
if "error" in result:
raise RuntimeError(f"Failed to submit job: {result}")
# Wait for allocations to be running (even if the job already existed).
await self._wait_for_healthy_allocations(self.config.min_containers)
# Discover existing allocations and slots
await self._refresh_slots()
# Start health check task
self._health_task = asyncio.create_task(self._health_check_loop())
self._started = True
logger.info(f"SlotPool started: {self.total_slots} slots available")
except Exception:
# Ensure aiohttp sessions are not leaked if we fail to start.
await self.stop(purge_job=False)
raise
async def stop(self, purge_job: bool = False) -> None:
"""
Stop the slot pool.
Args:
purge_job: If True, also stop the Nomad job
"""
logger.info("Stopping SlotPool")
# Cancel health check task
if self._health_task:
self._health_task.cancel()
try:
await self._health_task
except asyncio.CancelledError:
pass
finally:
self._health_task = None
if self._scale_task:
self._scale_task.cancel()
try:
await self._scale_task
except asyncio.CancelledError:
pass
finally:
self._scale_task = None
# Optionally stop the job (do this even if start() never completed).
if purge_job:
logger.info(f"Stopping Nomad job: {self.config.job_id}")
await self.nomad.stop_job(self.config.job_id, purge=True)
# Close connections
await self.executor.close()
await self.nomad.close()
self._started = False
self._slots.clear()
# Clear the queue
while not self._available_queue.empty():
try:
self._available_queue.get_nowait()
except asyncio.QueueEmpty:
break
async def acquire(self, trajectory_id: Optional[str] = None) -> Slot:
"""
Acquire an available slot.
If no slot is available, waits up to acquire_timeout seconds, attempts
a scale-up, then resumes waiting; this loops until a slot is acquired,
so callers that need a hard deadline should apply their own timeout.
Args:
trajectory_id: Optional ID of trajectory acquiring the slot
Returns:
Acquired Slot
Raises:
RuntimeError: If the pool has not been started
"""
if not self._started:
raise RuntimeError("SlotPool not started")
while True:
try:
# Try to get an available slot
slot_key = await asyncio.wait_for(
self._available_queue.get(),
timeout=self.config.acquire_timeout,
)
except asyncio.TimeoutError:
# Try to scale up, but keep waiting even if scaling isn't possible.
# In practice, slots may become available shortly (e.g. contention),
# and scaling may be temporarily blocked by Nomad deployments.
await self._try_scale_up()
continue
slot = self._slots.get(slot_key)
if slot is None:
# Slot was removed; discard stale queue entry and retry.
continue
try:
slot.acquire(trajectory_id)
except RuntimeError:
# Slot isn't actually available (e.g. duplicate queue entry); retry.
continue
logger.debug(f"Acquired slot {slot.slot_id} (alloc={slot.alloc_id[:8]})")
return slot
async def release(self, slot: Slot, reset_workspace: bool = False) -> None:
"""
Release a slot back to the pool.
Args:
slot: Slot to release
reset_workspace: If True, clear the workspace files
"""
slot_key = self._slot_key(slot.alloc_id, slot.slot_id)
if slot_key not in self._slots:
logger.warning(f"Releasing unknown slot: {slot_key}")
return
# Optionally reset workspace
if reset_workspace:
await self.executor.reset_slot(slot)
slot.release()
await self._available_queue.put(slot_key)
logger.debug(f"Released slot {slot.slot_id}")
async def execute(
self,
slot: Slot,
tool_name: str,
args: Dict[str, Any],
timeout: Optional[float] = None,
) -> ExecutionResult:
"""
Execute a tool in a slot's workspace.
Args:
slot: Slot to execute in
tool_name: Name of tool (bash, read_file, write_file)
args: Tool arguments
timeout: Optional timeout override
Returns:
ExecutionResult
"""
return await self.executor.execute(slot, tool_name, args, timeout)
async def execute_batch(
self,
requests: List[Tuple[Slot, str, Dict[str, Any]]],
timeout: Optional[float] = None,
) -> List[ExecutionResult]:
"""
Execute multiple tools in parallel.
This is the key optimization - batch execution across multiple slots
maximizes container utilization.
Args:
requests: List of (slot, tool_name, args) tuples
timeout: Optional timeout override
Returns:
List of ExecutionResults in same order
"""
return await self.executor.execute_batch(requests, timeout)
async def _refresh_slots(self) -> None:
"""Refresh slot inventory from Nomad allocations."""
async with self._lock:
allocs = await self.nomad.get_job_allocations(self.config.job_id)
# Track which slots we've seen
seen_keys = set()
for alloc in allocs:
if alloc.status != AllocationStatus.RUNNING:
continue
if not alloc.http_address:
continue
# Check container health
healthy = await self.executor.health_check(alloc.http_address)
if not healthy:
continue
# Create slots for this allocation
for i in range(self.config.slots_per_container):
slot_id = f"slot_{i}"
slot_key = self._slot_key(alloc.id, slot_id)
seen_keys.add(slot_key)
if slot_key not in self._slots:
# New slot
slot = Slot(
slot_id=slot_id,
alloc_id=alloc.id,
container_addr=alloc.http_address,
)
self._slots[slot_key] = slot
await self._available_queue.put(slot_key)
logger.debug(f"Added slot: {slot_key}")
# Remove slots from dead allocations
for slot_key in list(self._slots.keys()):
if slot_key not in seen_keys:
self._slots.pop(slot_key)
logger.debug(f"Removed slot: {slot_key}")
async def _wait_for_healthy_allocations(
self,
min_count: int,
timeout: float = 120.0
) -> None:
"""Wait for allocations to become healthy."""
import time
start = time.time()
def _summarize_alloc_detail(detail: Dict[str, Any]) -> str:
task_states = detail.get("TaskStates") or {}
parts: List[str] = []
if isinstance(task_states, dict):
for task_name, st in task_states.items():
events = (st or {}).get("Events") or []
if isinstance(events, list) and events:
# Include a few recent events; the latest can be a generic restart message
# while the true root cause is slightly earlier (e.g. image pull failure).
recent = events[-3:]
msgs: List[str] = []
for ev in recent:
desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
if desc:
msgs.append(desc)
if msgs:
parts.append(f"{task_name}: " + " | ".join(msgs))
return "; ".join(parts)
def _alloc_events_lower(detail: Dict[str, Any]) -> str:
task_states = detail.get("TaskStates") or {}
texts: List[str] = []
if isinstance(task_states, dict):
for _task_name, st in task_states.items():
events = (st or {}).get("Events") or []
if isinstance(events, list):
for ev in events[-10:]:
desc = ev.get("DisplayMessage") or ev.get("Message") or ev.get("Type") or ""
if desc:
texts.append(desc)
return " ".join(texts).lower()
while time.time() - start < timeout:
allocs = await self.nomad.get_job_allocations(self.config.job_id)
healthy_count = 0
for alloc in allocs:
if alloc.status == AllocationStatus.RUNNING and alloc.http_address:
if await self.executor.health_check(alloc.http_address):
healthy_count += 1
# Fast-fail on obvious driver/image errors to avoid waiting out the full timeout.
if alloc.id:
detail = await self.nomad.get_allocation(alloc.id)
if isinstance(detail, dict):
summary = _summarize_alloc_detail(detail)
lowered = _alloc_events_lower(detail) or summary.lower()
if "failed to pull" in lowered or "pull access denied" in lowered:
raise RuntimeError(
"Nomad allocation failed to start due to a Docker image pull error. "
f"Allocation {alloc.id[:8]}: {summary}\n"
"If you're using a local image tag (e.g. `atropos-sandbox:local`) on macOS, "
"make sure the image is loaded into Docker, e.g.:\n"
" docker buildx build --load -t atropos-sandbox:local -f Hermes-Agent/atropos/Dockerfile Hermes-Agent/atropos"
)
if "exceeded allowed attempts" in lowered:
raise RuntimeError(
"Nomad allocation is crash-looping and has entered restart backoff. "
f"Allocation {alloc.id[:8]}: {summary}\n"
"Inspect logs with:\n"
f" nomad alloc logs -stderr -task sandbox-server {alloc.id}\n"
"Common causes include: missing local Docker image tag, container entrypoint error, "
"or sandbox-server startup failure."
)
if healthy_count >= min_count:
return
await asyncio.sleep(2.0)
# Timed out: include allocation status detail to help debugging.
allocs = await self.nomad.get_job_allocations(self.config.job_id)
alloc_lines: List[str] = []
for alloc in allocs[:10]:
addr = alloc.http_address or "-"
line = f"{alloc.id[:8]} status={alloc.status.value} http={addr}"
detail = await self.nomad.get_allocation(alloc.id)
if isinstance(detail, dict):
summary = _summarize_alloc_detail(detail)
if summary:
line += f" detail={summary}"
alloc_lines.append(line)
hint = (
"Timed out waiting for healthy sandbox allocations.\n"
f"Job: {self.config.job_id}, desired_healthy: {min_count}\n"
"Allocations:\n - " + "\n - ".join(alloc_lines)
)
raise RuntimeError(hint)
async def _try_scale_up(self) -> bool:
"""Attempt to scale up the job."""
import time
async with self._scale_lock:
# Check cooldown
if time.time() - self._last_scale_time < self.config.scale_cooldown:
return False
# Check max containers
status = await self.nomad.get_job_status(self.config.job_id)
if status is None:
return False
current_count = status.count
if current_count >= self.config.max_containers:
logger.warning(f"Cannot scale up: already at max ({self.config.max_containers})")
return False
# Scale up
new_count = min(current_count + 1, self.config.max_containers)
logger.info(f"Scaling up from {current_count} to {new_count} containers")
scale_resp = await self.nomad.scale_job(
self.config.job_id,
count=new_count,
task_group="sandbox",
)
# Nomad may return non-JSON errors (e.g. plain text) with a status field.
if isinstance(scale_resp, dict) and scale_resp.get("status", 200) >= 400:
logger.warning(f"Scale request rejected: {scale_resp}")
self._last_scale_time = time.time()
return False
self._last_scale_time = time.time()
# Wait for new allocation in the background so contended acquires can still
# make progress (e.g. by grabbing slots released by other trajectories).
if self._scale_task is None or self._scale_task.done():
self._scale_task = asyncio.create_task(self._wait_for_scale(new_count))
return True
async def _wait_for_scale(self, desired_count: int) -> None:
try:
await self._wait_for_healthy_allocations(desired_count, timeout=60.0)
await self._refresh_slots()
except asyncio.CancelledError:
raise
except Exception as e:
logger.error(f"Failed to scale up: {e}")
async def _health_check_loop(self) -> None:
"""Background task to monitor container health."""
while True:
try:
await asyncio.sleep(self.config.health_check_interval)
await self._refresh_slots()
except asyncio.CancelledError:
break
except Exception as e:
logger.error(f"Health check error: {e}")
def get_stats(self) -> Dict[str, Any]:
"""Get pool statistics."""
slots_by_state = {}
for slot in self._slots.values():
state = slot.state.value
slots_by_state[state] = slots_by_state.get(state, 0) + 1
container_count = len({s.alloc_id for s in self._slots.values()}) if self._slots else 0
return {
"total_slots": self.total_slots,
"available_slots": self.available_slots,
"acquired_slots": self.acquired_slots,
"containers": container_count,
"slots_by_state": slots_by_state,
"started": self._started,
}
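
A sketch of the full pool lifecycle under the defaults above, assuming a local Nomad dev agent and Docker are running (the job id is a placeholder):

import asyncio

from atropos.slots.pool import SlotPool, SlotPoolConfig

async def main() -> None:
    # Assumes Nomad at localhost:4646 and an atropos-sandbox:local image
    # (built automatically when auto_build_local_image is enabled).
    config = SlotPoolConfig(
        nomad_address="http://localhost:4646",
        job_id="atropos-sandbox-demo",
        slots_per_container=4,
        purge_job_on_start=True,
    )
    pool = SlotPool(config)
    await pool.start()
    try:
        slot = await pool.acquire(trajectory_id="traj-1")
        result = await pool.execute(slot, "bash", {"command": "uname -a"})
        print(result.output or result.error)
        await pool.release(slot, reset_workspace=True)
        print(pool.get_stats())
    finally:
        await pool.stop(purge_job=True)

asyncio.run(main())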

159
atropos/slots/slot.py Normal file
View File

@@ -0,0 +1,159 @@
"""
Slot abstraction for atropos-agent.
A Slot represents an isolated workspace for a single agent trajectory.
Slots are hosted on Nomad allocations and provide workspace isolation
via filesystem directories.
"""
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional
import uuid
class SlotState(Enum):
"""State of a slot in the pool."""
AVAILABLE = "available" # Ready to be acquired
ACQUIRED = "acquired" # Assigned to a trajectory
EXECUTING = "executing" # Currently executing a tool
RELEASING = "releasing" # Being released back to pool
ERROR = "error" # In error state
@dataclass
class Slot:
"""
An isolated workspace for a single agent trajectory.
Slots are the unit of scheduling - each trajectory runs in its own slot,
with an isolated workspace directory. Multiple slots share a container.
Attributes:
slot_id: Unique identifier for this slot (e.g., "slot_0")
alloc_id: Nomad allocation ID hosting this slot
container_addr: HTTP address of the sandbox server (e.g., "http://10.0.0.1:8080")
workspace_dir: Path to workspace in container (e.g., "/data/slot_0")
state: Current state of the slot
trajectory_id: ID of trajectory currently using this slot (if acquired)
metadata: Additional metadata
"""
slot_id: str
alloc_id: str
container_addr: str
workspace_dir: str = ""
state: SlotState = SlotState.AVAILABLE
trajectory_id: Optional[str] = None
metadata: Dict[str, Any] = field(default_factory=dict)
def __post_init__(self):
"""Set default workspace_dir if not provided."""
if not self.workspace_dir:
self.workspace_dir = f"/data/{self.slot_id}"
@property
def is_available(self) -> bool:
"""Check if slot is available for acquisition."""
return self.state == SlotState.AVAILABLE
@property
def is_acquired(self) -> bool:
"""Check if slot is currently acquired."""
return self.state in (SlotState.ACQUIRED, SlotState.EXECUTING)
def acquire(self, trajectory_id: Optional[str] = None) -> None:
"""
Mark slot as acquired by a trajectory.
Args:
trajectory_id: Optional ID of acquiring trajectory
"""
if not self.is_available:
raise RuntimeError(f"Cannot acquire slot {self.slot_id}: state is {self.state}")
self.state = SlotState.ACQUIRED
self.trajectory_id = trajectory_id or str(uuid.uuid4())
def start_execution(self, execution_id: Optional[str] = None) -> None:
"""Mark slot as executing."""
if self.state != SlotState.ACQUIRED:
raise RuntimeError(f"Cannot start execution on slot {self.slot_id}: state is {self.state}")
self.state = SlotState.EXECUTING
if execution_id:
self.metadata["current_execution_id"] = execution_id
def end_execution(self) -> None:
"""Mark execution as complete, return to acquired state."""
if self.state != SlotState.EXECUTING:
raise RuntimeError(f"Cannot end execution on slot {self.slot_id}: state is {self.state}")
self.state = SlotState.ACQUIRED
self.metadata.pop("current_execution_id", None)
def release(self) -> None:
"""Release slot back to available state."""
self.state = SlotState.AVAILABLE
self.trajectory_id = None
self.metadata.pop("current_execution_id", None)
def mark_error(self, error: str) -> None:
"""Mark slot as in error state."""
self.state = SlotState.ERROR
self.metadata["error"] = error
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary for serialization."""
return {
"slot_id": self.slot_id,
"alloc_id": self.alloc_id,
"container_addr": self.container_addr,
"workspace_dir": self.workspace_dir,
"state": self.state.value,
"trajectory_id": self.trajectory_id,
"metadata": self.metadata,
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "Slot":
"""Create from dictionary."""
return cls(
slot_id=data["slot_id"],
alloc_id=data["alloc_id"],
container_addr=data["container_addr"],
workspace_dir=data.get("workspace_dir", ""),
state=SlotState(data.get("state", "available")),
trajectory_id=data.get("trajectory_id"),
metadata=data.get("metadata", {}),
)
def __repr__(self) -> str:
return f"Slot({self.slot_id}, state={self.state.value}, alloc={self.alloc_id[:8]}...)"
def create_slots_for_allocation(
alloc_id: str,
container_addr: str,
num_slots: int = 10,
) -> list["Slot"]:
"""
Create slots for a Nomad allocation.
Args:
alloc_id: Nomad allocation ID
container_addr: HTTP address of sandbox server
num_slots: Number of slots to create
Returns:
List of Slot objects
"""
slots = []
for i in range(num_slots):
slot_id = f"slot_{i}"
slots.append(Slot(
slot_id=slot_id,
alloc_id=alloc_id,
container_addr=container_addr,
workspace_dir=f"/data/{slot_id}",
))
return slots
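
The state machine in practice: a slot must pass through ACQUIRED before EXECUTING, and release() returns it to AVAILABLE (ids below are placeholders):

from atropos.slots.slot import Slot, SlotState, create_slots_for_allocation

slots = create_slots_for_allocation("alloc-aaaaaaaa", "http://127.0.0.1:8080", num_slots=2)
slot = slots[0]

slot.acquire(trajectory_id="traj-1")      # AVAILABLE -> ACQUIRED
slot.start_execution(execution_id="e-1")  # ACQUIRED -> EXECUTING
slot.end_execution()                      # EXECUTING -> ACQUIRED
slot.release()                            # ACQUIRED -> AVAILABLE

assert slot.state is SlotState.AVAILABLE
assert Slot.from_dict(slot.to_dict()) == slot  # serialization round-trips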

View File

@@ -0,0 +1,2 @@
"""Terminal helpers for stateful sandbox interactions."""

View File

@@ -0,0 +1,115 @@
from __future__ import annotations
import json
from typing import Any
import pyte
class AsciinemaStreamDecoder:
def __init__(self, *, default_width: int = 80, default_height: int = 24) -> None:
self._default_width = max(1, int(default_width))
self._default_height = max(1, int(default_height))
self._buffer = ""
self._has_header = False
self.width = self._default_width
self.height = self._default_height
self._screen = pyte.Screen(self.width, self.height)
self._stream = pyte.Stream(self._screen)
def reset(self) -> None:
self._buffer = ""
self._has_header = False
self.width = self._default_width
self.height = self._default_height
self._screen = pyte.Screen(self.width, self.height)
self._stream = pyte.Stream(self._screen)
def feed(self, chunk: str | bytes) -> None:
if not chunk:
return
if isinstance(chunk, bytes):
chunk = chunk.decode("utf-8", errors="replace")
self._buffer += chunk
while True:
line, sep, rest = self._buffer.partition("\n")
if not sep:
break
self._buffer = rest
line = line.strip()
if not line:
continue
parsed = self._parse_json_line(line)
if parsed is None:
continue
if not self._has_header:
if isinstance(parsed, dict):
self._init_from_header(parsed)
continue
if isinstance(parsed, list):
self._has_header = True
self._apply_event(parsed)
continue
continue
if isinstance(parsed, list):
self._apply_event(parsed)
def render(self) -> str:
return "\n".join(self._screen.display)
def _parse_json_line(self, line: str) -> Any | None:
try:
return json.loads(line)
except json.JSONDecodeError:
return None
def _init_from_header(self, header: dict[str, Any]) -> None:
width = _coerce_int(
header.get("width") or header.get("columns") or header.get("cols"),
self._default_width,
)
height = _coerce_int(
header.get("height") or header.get("rows") or header.get("lines"),
self._default_height,
)
self.width = max(1, width)
self.height = max(1, height)
self._screen = pyte.Screen(self.width, self.height)
self._stream = pyte.Stream(self._screen)
self._has_header = True
def _apply_event(self, event: list[Any]) -> None:
if len(event) < 2:
return
event_type = event[1]
payload = event[2] if len(event) > 2 else ""
if event_type == "o":
if isinstance(payload, str):
self._stream.feed(payload)
elif event_type == "r":
width, height = _parse_resize(payload)
if width and height:
self.width = width
self.height = height
self._screen.resize(lines=height, columns=width)  # pyte's resize takes (lines, columns)
def _coerce_int(value: Any, default: int) -> int:
try:
return int(value)
except (TypeError, ValueError):
return int(default)
def _parse_resize(payload: Any) -> tuple[int, int]:
if isinstance(payload, str) and "x" in payload:
left, right = payload.lower().split("x", 1)
return _coerce_int(left, 0), _coerce_int(right, 0)
if isinstance(payload, dict):
width = _coerce_int(payload.get("width") or payload.get("columns") or payload.get("cols"), 0)
height = _coerce_int(payload.get("height") or payload.get("rows") or payload.get("lines"), 0)
return width, height
if isinstance(payload, list) and len(payload) >= 2:
return _coerce_int(payload[0], 0), _coerce_int(payload[1], 0)
return 0, 0
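
A minimal sketch of the decoder on a synthetic asciinema v2 stream; pyte must be installed, and the import path is a guess since the file name is not shown above:

from atropos.terminal.asciinema import AsciinemaStreamDecoder  # hypothetical path

decoder = AsciinemaStreamDecoder(default_width=40, default_height=5)
decoder.feed('{"version": 2, "width": 20, "height": 4}\n')       # header sizes the screen
decoder.feed('[0.10, "o", "hello\\r\\n"]\n')                     # output event
decoder.feed('[0.25, "o", "\\u001b[31mred\\u001b[0m\\r\\n"]\n')  # ANSI handled by pyte
decoder.feed('[0.40, "r", "30x6"]\n')                            # resize event
print(decoder.render())                                          # 6 rendered rows, 30 cols wide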

26
atropos/tools/__init__.py Normal file
View File

@@ -0,0 +1,26 @@
"""
Tool abstractions for atropos-agent.
Provides base Tool class and common tool implementations.
"""
from .base import Tool, ToolCall, ToolRegistry, ToolResult, ToolSchema
from .build_registry import build_tool_registry
from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
from .terminal_stateful_tool import TerminalStatefulTool
from .tmux_tool import TmuxTool
__all__ = [
"Tool",
"ToolCall",
"ToolRegistry",
"ToolResult",
"ToolSchema",
"BashTool",
"ReadFileTool",
"WriteFileTool",
"TerminalTool",
"TerminalStatefulTool",
"TmuxTool",
"build_tool_registry",
]

423
atropos/tools/base.py Normal file
View File

@@ -0,0 +1,423 @@
"""
Base Tool abstraction for atropos-agent.
Tools follow a simple pattern:
1. Define schema (name, description, parameters)
2. Implement execute() method
3. Return ToolResult with output/error
Tool calls use Hermes-style XML tags:
<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>
"""
import json
import re
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Literal, Optional
from pydantic import BaseModel, Field
@dataclass
class ToolSchema:
"""JSON Schema for a tool's parameters."""
name: str
description: str
parameters: Dict[str, Any] = field(default_factory=dict)
required: List[str] = field(default_factory=list)
external: bool = False # Whether the tool must be executed via an external ToolServer (secret proxy) and not inside the sandbox.
def to_dict(self) -> Dict[str, Any]:
"""Convert to OpenAI-compatible function schema."""
return {
"type": "function",
"function": {
"name": self.name,
"description": self.description,
"parameters": {
"type": "object",
"properties": self.parameters,
"required": self.required,
},
},
}
def to_prompt_description(self) -> str:
"""Convert to human-readable description for system prompt."""
params_desc = []
for name, spec in self.parameters.items():
req = "(required)" if name in self.required else "(optional)"
desc = spec.get("description", "")
param_type = spec.get("type", "string")
params_desc.append(f" - {name} ({param_type}) {req}: {desc}")
params_str = "\n".join(params_desc) if params_desc else " (no parameters)"
return f"**{self.name}**: {self.description}\nParameters:\n{params_str}"
@dataclass
class ToolCall:
"""A parsed tool call from model output."""
name: str
arguments: Dict[str, Any]
raw_text: str = "" # Original XML/JSON text
uniq_id: str = field(default_factory=lambda: str(uuid.uuid4())) # Unique tool-call id for traceability/reconstruction.
@classmethod
def parse_from_text(cls, text: str) -> List["ToolCall"]:
"""
Extract tool calls from text using Hermes-style XML tags.
Supported formats (STRICT: requires well-formed closing tags):
- Hermes JSON wrapper:
<tool_call>{"name": "...", "arguments": {...}}</tool_call>
- GLM/llama.cpp style:
<tool_call>terminal{"command":"ls -la"}</tool_call>
"""
calls: List["ToolCall"] = []
if not text:
return calls
def _append_from_payload(*, name: str, arguments: Dict[str, Any], raw: str, uniq_id: Optional[str] = None) -> None:
if not isinstance(name, str) or not name:
return
if not isinstance(arguments, dict):
return
calls.append(
cls(
name=name,
arguments=arguments,
raw_text=raw,
uniq_id=uniq_id or str(uuid.uuid4()),
)
)
# STRICT parsing: only accept well-formed <tool_call>...</tool_call> blocks.
pattern = r"<tool_call>\s*(.*?)\s*</tool_call>"
for inner in re.findall(pattern, text, re.DOTALL):
cleaned = (inner or "").strip()
if not cleaned:
continue
# Hermes JSON wrapper.
if cleaned.startswith("{"):
try:
data = json.loads(cleaned)
except json.JSONDecodeError:
continue
uniq_id = data.get("uniq_id") or data.get("id") or None
_append_from_payload(
name=data.get("name", ""),
arguments=data.get("arguments", {}),
raw=inner,
uniq_id=uniq_id,
)
continue
# GLM/llama.cpp style: terminal{...}
m = re.match(r"^\s*([A-Za-z0-9_.:\\-]+)\s*(\{.*\})\s*$", cleaned, re.DOTALL)
if not m:
continue
name = m.group(1)
args_text = m.group(2)
try:
args = json.loads(args_text)
except json.JSONDecodeError:
continue
_append_from_payload(name=name, arguments=args, raw=inner)
return calls
@classmethod
def has_tool_call(cls, text: str) -> bool:
"""Check if text contains any tool calls."""
return bool(re.search(r"<tool_call>", text))
@dataclass
class ToolResult:
"""Result from executing a tool."""
success: bool
output: str = ""
error: str = ""
metadata: Dict[str, Any] = field(default_factory=dict)
uniq_id: Optional[str] = None # Should match ToolCall.uniq_id for async execution tracking.
def to_xml(self) -> str:
"""Format as XML for including in conversation."""
data = {
"success": self.success,
"output": self.output,
}
if self.uniq_id:
data["uniq_id"] = self.uniq_id
if self.error:
data["error"] = self.error
if self.metadata:
data["metadata"] = self.metadata
return f"<tool_response>{json.dumps(data)}</tool_response>"
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary."""
return {
"success": self.success,
"output": self.output,
"error": self.error,
"metadata": self.metadata,
"uniq_id": self.uniq_id,
}
class Tool(ABC):
"""
Abstract base class for tools.
Subclasses must implement:
- schema: ToolSchema describing the tool
- execute(): async method that performs the tool action
"""
@property
@abstractmethod
def schema(self) -> ToolSchema:
"""Return the tool's schema."""
pass
@property
def name(self) -> str:
"""Tool name (from schema)."""
return self.schema.name
@abstractmethod
async def execute(self, **kwargs) -> ToolResult:
"""
Execute the tool with given arguments.
Args:
**kwargs: Tool-specific arguments
Returns:
ToolResult with success/failure and output
"""
pass
def is_available(self) -> tuple[bool, str | None]:
"""
Return whether this tool should be exposed/executable in the current process.
Tools that depend on optional binaries/services/env vars can override this
to avoid advertising a tool that will fail at runtime.
"""
return True, None
async def __call__(self, **kwargs) -> ToolResult:
"""Allow calling tool instance directly."""
return await self.execute(**kwargs)
# Note: this registry only wraps tool declarations; external tools are executed by the ToolServer (a separate process), and sandbox tools come preinstalled in the envs.
class ToolRegistry:
"""Registry of available tools."""
def __init__(self):
self._tools: Dict[str, Tool] = {}
def register(self, tool: Tool) -> None:
"""Register a tool."""
self._tools[tool.name] = tool
def get(self, name: str) -> Optional[Tool]:
"""Get a tool by name."""
return self._tools.get(name)
def list_tools(self) -> List[Tool]:
"""List all registered tools."""
return list(self._tools.values())
def get_schemas(self) -> List[ToolSchema]:
"""Get schemas for all registered tools."""
return [tool.schema for tool in self._tools.values()]
def get_prompt_description(self) -> str:
"""Generate tool descriptions for system prompt."""
descriptions = [tool.schema.to_prompt_description() for tool in self._tools.values()]
return "\n\n".join(descriptions)
def get_prompt_tool_definitions_json(self) -> str:
"""
Return a Hermes-style JSON list of tool definitions for use inside a `<tools>...</tools>` block.
Hermes trajectories historically use a simplified schema list:
[{"name": ..., "description": ..., "parameters": {...}, "required": null}, ...]
"""
formatted: List[Dict[str, Any]] = []
for tool in self._tools.values():
fn = tool.schema.to_dict().get("function", {})
formatted.append(
{
"name": fn.get("name", tool.name),
"description": fn.get("description", ""),
"parameters": fn.get("parameters", {}),
# Keep parity with Hermes saved trajectories (required is typically null there).
"required": None,
}
)
return json.dumps(formatted, ensure_ascii=False)
async def execute(self, call: ToolCall) -> ToolResult:
"""Execute a tool call."""
tool = self.get(call.name)
if tool is None:
return ToolResult(
success=False,
error=f"Unknown tool: {call.name}",
uniq_id=call.uniq_id,
)
try:
result = await tool.execute(**call.arguments)
if result.uniq_id is None:
result.uniq_id = call.uniq_id
return result
except Exception as e:
return ToolResult(
success=False,
error=f"Tool execution error: {str(e)}",
uniq_id=call.uniq_id,
)
# =============================================================================
# FastAPI / transport models
# =============================================================================
class ToolCallPayload(BaseModel):
name: str
arguments: Dict[str, Any] = Field(default_factory=dict)
uniq_id: str
@classmethod
def from_tool_call(cls, call: ToolCall) -> "ToolCallPayload":
return cls(name=call.name, arguments=call.arguments, uniq_id=call.uniq_id)
def to_tool_call(self) -> ToolCall:
return ToolCall(name=self.name, arguments=self.arguments, uniq_id=self.uniq_id)
class ToolResultPayload(BaseModel):
success: bool
output: str = ""
error: str = ""
metadata: Dict[str, Any] = Field(default_factory=dict)
uniq_id: Optional[str] = None
@classmethod
def from_tool_result(cls, result: ToolResult) -> "ToolResultPayload":
return cls(
success=result.success,
output=result.output,
error=result.error,
metadata=result.metadata,
uniq_id=result.uniq_id,
)
def to_tool_result(self) -> ToolResult:
return ToolResult(
success=self.success,
output=self.output,
error=self.error,
metadata=self.metadata,
uniq_id=self.uniq_id,
)
class ToolExecutorExecuteRequest(BaseModel):
trajectory_id: str
tool: ToolCallPayload
timeout_s: Optional[float] = None
class ToolExecutorReleaseRequest(BaseModel):
trajectory_id: str
reset_workspace: bool = False
class ToolServerExecuteRequest(BaseModel):
trajectory_id: Optional[str] = None
tool: ToolCallPayload
timeout_s: Optional[float] = None
# Optional sandbox context for tools that need workspace artifacts.
# This is set by ToolExecutor and is NOT model-controlled.
slot_id: Optional[str] = None
container_addr: Optional[str] = None
# =============================================================================
# Artifact transport models
# =============================================================================
class ArtifactReadRequestPayload(BaseModel):
trajectory_id: str
path: str
encoding: Literal["text", "base64"] = "text"
max_bytes: Optional[int] = None
include_sha256: bool = False
class ArtifactReadResponsePayload(BaseModel):
success: bool
content: str = ""
error: str = ""
encoding: str = "text"
truncated: bool = False
bytes: int = 0
file_size: Optional[int] = None
path: str = ""
mime: Optional[str] = None
sha256: Optional[str] = None
class ArtifactListRequestPayload(BaseModel):
trajectory_id: str
path: str = "."
recursive: bool = False
max_entries: Optional[int] = None
class ArtifactListEntryPayload(BaseModel):
path: str
is_dir: bool
size: int
mtime: float
class ArtifactListResponsePayload(BaseModel):
success: bool
entries: List[ArtifactListEntryPayload] = Field(default_factory=list)
truncated: bool = False
error: str = ""
class ArtifactArchiveRequestPayload(BaseModel):
trajectory_id: str
path: str = "."
format: Literal["tar.gz", "tgz"] = "tar.gz"
max_bytes: Optional[int] = None
max_entries: Optional[int] = None
class ArtifactArchiveResponsePayload(BaseModel):
success: bool
content: str = ""
error: str = ""
encoding: str = "base64"
format: str = "tar.gz"
bytes: int = 0
entry_count: int = 0
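
Both accepted formats in action, plus the <tool_response> block a result renders to:

from atropos.tools.base import ToolCall, ToolResult

text = (
    'Let me check the directory.\n'
    '<tool_call>{"name": "bash", "arguments": {"command": "ls"}}</tool_call>\n'
    '<tool_call>terminal{"command": "ls -la"}</tool_call>'
)
calls = ToolCall.parse_from_text(text)
assert [c.name for c in calls] == ["bash", "terminal"]
assert calls[1].arguments == {"command": "ls -la"}

result = ToolResult(success=True, output="file.txt", uniq_id=calls[0].uniq_id)
print(result.to_xml())  # <tool_response>{"success": true, "output": "file.txt", "uniq_id": "..."}</tool_response>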

View File

@@ -0,0 +1,64 @@
"""
Unified tool registry builder for Hermes-Agent Atropos integration.
This composes:
- sandbox tool stubs (terminal/bash/read_file/write_file + stateful terminal/tmux)
- Hermes external tools (web/vision/image/moa/skills/browser), executed via ToolServer
ToolExecutor only needs the schema + `external` routing bit; ToolServer executes
the external tools via Hermes' existing implementations.
"""
from __future__ import annotations
from typing import List, Optional
from .base import ToolRegistry
from .hermes_external_tools import build_external_tools
from .sandbox_stubs import BashTool, ReadFileTool, TerminalTool, WriteFileTool
from .terminal_stateful_tool import TerminalStatefulTool
from .tmux_tool import TmuxTool
from .toolset_resolver import resolve_multiple_toolsets
def build_tool_registry(
*,
enabled_toolsets: Optional[List[str]] = None,
disabled_toolsets: Optional[List[str]] = None,
tool_server_url: Optional[str] = None,
) -> ToolRegistry:
"""
Build a ToolRegistry for AgentEnv / ToolExecutor / ToolServer.
If `tool_server_url` is not provided, external tools will be omitted so we do
not advertise tools that cannot execute.
"""
enabled_toolsets = enabled_toolsets or ["default"]
# Resolve tool names using Hermes toolsets plus Atropos additions.
selected = set(resolve_multiple_toolsets(enabled_toolsets))
if disabled_toolsets:
selected -= set(resolve_multiple_toolsets(disabled_toolsets))
reg = ToolRegistry()
# Always register sandbox tools if selected.
sandbox_by_name = {
"terminal": TerminalTool(),
"bash": BashTool(),
"read_file": ReadFileTool(),
"write_file": WriteFileTool(),
"terminal_stateful": TerminalStatefulTool(),
"tmux": TmuxTool(),
}
for name, tool in sandbox_by_name.items():
if name in selected:
reg.register(tool)
# External tools: only include when ToolServer is configured.
if tool_server_url:
for tool in build_external_tools(selected_tool_names=selected):
if tool.name in selected:
reg.register(tool)
return reg
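
A sketch of building the registry, assuming "default" resolves to at least the sandbox tool names; the ToolServer URL is a placeholder, and external tools require Hermes' model_tools on the import path:

from atropos.tools.build_registry import build_tool_registry

# No ToolServer configured: only sandbox tools are registered, so nothing
# is advertised that cannot actually execute.
reg = build_tool_registry(enabled_toolsets=["default"])
print(sorted(t.name for t in reg.list_tools()))

# With a ToolServer, the external Hermes tools selected by the toolsets join in.
reg_full = build_tool_registry(
    enabled_toolsets=["default"],
    tool_server_url="http://127.0.0.1:9000",  # placeholder ToolServer address
)
print(reg_full.get_prompt_description()[:200])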

View File

@@ -0,0 +1,90 @@
"""
Hermes external tool adapter for Atropos ToolServer.
These tools reuse Hermes-Agent's existing tool runner (`model_tools.handle_function_call`)
so we don't duplicate external tool implementations.
Important:
- These are marked `external=True` and should be executed ONLY by ToolServer.
- We run `handle_function_call` in a worker thread because the Hermes implementation
uses `asyncio.run()` internally for some async tools (web_extract, vision, MoA, etc).
"""
from __future__ import annotations
import asyncio
import json
from typing import Any, Dict, List, Optional
import model_tools
from .base import Tool, ToolResult, ToolSchema
def _schema_from_openai_tool_dict(tool: Dict[str, Any], *, external: bool) -> ToolSchema:
fn = tool.get("function") or {}
name = str(fn.get("name") or "")
description = str(fn.get("description") or "")
params = fn.get("parameters") or {}
properties = params.get("properties") or {}
required = params.get("required") or []
if not isinstance(required, list):
required = []
return ToolSchema(
name=name,
description=description,
parameters=dict(properties),
required=[str(x) for x in required if isinstance(x, (str, int))],
external=external,
)
class HermesExternalTool(Tool):
def __init__(self, schema: ToolSchema):
self._schema = schema
@property
def schema(self) -> ToolSchema:
return self._schema
async def execute(self, task_id: Optional[str] = None, **kwargs: Any) -> ToolResult:
# `model_tools.handle_function_call` returns a JSON string (success or error).
# Run in a thread because some Hermes tool handlers call `asyncio.run()`.
raw = await asyncio.to_thread(model_tools.handle_function_call, self.name, kwargs, task_id)
try:
parsed = json.loads(raw)
except Exception:
# Keep as plain string.
return ToolResult(success=True, output=str(raw))
if isinstance(parsed, dict) and parsed.get("error"):
return ToolResult(success=False, error=str(parsed.get("error")), output="")
return ToolResult(success=True, output=json.dumps(parsed, ensure_ascii=False))
def build_external_tools(
*,
selected_tool_names: Optional[set[str]] = None,
) -> List[HermesExternalTool]:
"""
Build external tool wrappers from Hermes tool declarations.
Filters out sandbox-oriented tools (e.g. `terminal`) since those should run
inside the sandbox via ToolExecutor.
"""
# IMPORTANT: Hermes' `model_tools.get_tool_definitions()` only understands Hermes toolsets.
# Atropos envs add extra toolsets (filesystem/sandbox/stateful). To avoid noisy "Unknown toolset"
# prints and accidental filtering, we fetch ALL Hermes tool definitions here and filter by name.
tools = model_tools.get_tool_definitions(enabled_toolsets=None, disabled_toolsets=None, quiet_mode=True)
wrappers: List[HermesExternalTool] = []
for t in tools:
schema = _schema_from_openai_tool_dict(t, external=True)
if schema.name in {"terminal"}:
continue
if selected_tool_names is not None and schema.name not in selected_tool_names:
continue
wrappers.append(HermesExternalTool(schema))
return wrappers
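
A sketch of the wrappers in isolation; running it requires Hermes-Agent's model_tools to be importable, and the selected names below are placeholders for whatever the Hermes declarations actually expose:

from atropos.tools.hermes_external_tools import build_external_tools

tools = build_external_tools(selected_tool_names={"web_search", "web_extract"})
for tool in tools:
    assert tool.schema.external       # routed to ToolServer, never the sandbox
    print(tool.name, "-", tool.schema.description[:60])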

View File

@@ -0,0 +1,99 @@
"""
Sandbox tool stubs for Atropos ToolExecutor.
These tools are executed inside the sandbox containers via:
ToolExecutor -> SlotPool -> sandbox_server.py
They intentionally do NOT execute anything on the host process. If they are
called directly (outside ToolExecutor), they return a clear error.
"""
from __future__ import annotations
from typing import Optional
from .base import Tool, ToolResult, ToolSchema
class TerminalTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="terminal",
description=(
"Execute a command inside the sandbox slot workspace and return stdout/stderr. "
"Filesystem persists within a trajectory slot. Background processes are not supported "
"in stateless mode. Commands run under POSIX /bin/sh and each tool call runs in a fresh "
"shell (no persisted env vars). Avoid bash-only syntax like `source`; prefer `. .venv/bin/activate` "
"or invoke `.venv/bin/python ...` directly."
),
parameters={
"command": {"type": "string", "description": "The command to execute"},
"timeout": {
"type": "integer",
"description": "Command timeout in seconds (optional).",
"minimum": 1,
},
"background": {
"type": "boolean",
"description": "Not supported in sandbox terminal (always false).",
"default": False,
},
},
required=["command"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(
success=False,
error="terminal must be executed via ToolExecutor inside the sandbox",
)
class BashTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="bash",
description="Execute a bash command inside the sandbox slot workspace.",
parameters={"command": {"type": "string", "description": "The bash command to execute"}},
required=["command"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(success=False, error="bash must be executed via ToolExecutor inside the sandbox")
class ReadFileTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="read_file",
description="Read a file from the sandbox slot workspace.",
parameters={"path": {"type": "string", "description": "Path to the file"}},
required=["path"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(success=False, error="read_file must be executed via ToolExecutor inside the sandbox")
class WriteFileTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="write_file",
description="Write a file into the sandbox slot workspace.",
parameters={
"path": {"type": "string", "description": "Path to the file"},
"content": {"type": "string", "description": "File content"},
},
required=["path", "content"],
external=False,
)
async def execute(self, **_kwargs) -> ToolResult:
return ToolResult(success=False, error="write_file must be executed via ToolExecutor inside the sandbox")

View File

@@ -0,0 +1,45 @@
"""
Stateful terminal tool schema.
This is a sandbox tool that routes to the sandbox server as `bash_stateful`
via ToolExecutor mapping. It exists to expose an explicit, opt-in terminal
primitive suitable for stateful workflows (e.g. tmux sessions / TUIs).
"""
from __future__ import annotations
from typing import Optional
from .base import Tool, ToolResult, ToolSchema
class TerminalStatefulTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="terminal_stateful",
description=(
"Execute a command in the sandbox, allowing stateful/background processes to persist "
"across tool calls within the same trajectory slot (e.g. tmux sessions). "
"Use sparingly; output is still non-interactive."
),
parameters={
"command": {"type": "string", "description": "The command to execute"},
"timeout": {
"type": "integer",
"description": "Command timeout in seconds (optional).",
"minimum": 1,
},
},
required=["command"],
)
def is_available(self) -> tuple[bool, str | None]:
return True, None
async def execute(self, command: str, timeout: Optional[int] = None) -> ToolResult:
_ = (command, timeout)
return ToolResult(
success=False,
error="terminal_stateful must be executed via ToolExecutor inside the sandbox",
)

View File

@@ -0,0 +1,89 @@
"""
tmux tool schema (sandbox).
This is a sandbox tool that provides basic tmux session control suitable for
TUI-style terminal interactions:
- send keys (arrow keys, enter, etc.)
- capture the current screen buffer
Execution is routed by ToolExecutor to the sandbox server's `tmux` backend.
"""
from __future__ import annotations
from typing import Any
from .base import Tool, ToolResult, ToolSchema
class TmuxTool(Tool):
@property
def schema(self) -> ToolSchema:
return ToolSchema(
name="tmux",
description=(
"Control a per-trajectory tmux session inside the sandbox (stateful terminal). "
"Use this for TUI-style interactions: send keys and capture the current screen."
),
parameters={
"action": {
"type": "string",
"description": "Action to perform: start | send_keys | stream | stop.",
"enum": ["start", "send_keys", "stream", "stop", "capture"],
},
"keys": {
"description": "Keys to send (string or list of strings) when action=send_keys.",
},
"block": {
"type": "boolean",
"description": "If true, wait for shell command completion (only valid at a shell prompt).",
"default": False,
},
"min_wait_s": {
"type": "number",
"description": "For non-blocking send_keys, sleep this long after sending keys (seconds).",
"default": 0.0,
},
"max_wait_s": {
"type": "number",
"description": "For blocking send_keys, max time to wait for completion (seconds).",
},
"capture_entire": {
"type": "boolean",
"description": "Deprecated. Streaming is preferred.",
"default": False,
},
"max_bytes": {
"type": "integer",
"description": "Max bytes to return per stream call.",
},
"reset": {
"type": "boolean",
"description": "If true, reset stream offset to the beginning of the asciinema recording.",
"default": False,
},
"pane_width": {
"type": "integer",
"description": "Pane width for action=start (columns).",
"minimum": 20,
},
"pane_height": {
"type": "integer",
"description": "Pane height for action=start (rows).",
"minimum": 10,
},
},
required=["action"],
)
def is_available(self) -> tuple[bool, str | None]:
return True, None
async def execute(self, **kwargs: Any) -> ToolResult:
# This tool is intended to be executed via ToolExecutor -> sandbox server.
# We keep a safe fallback for non-sandbox contexts.
action = str(kwargs.get("action") or "").strip()
return ToolResult(
success=False,
error=f"tmux tool must be executed in the sandbox (got action={action!r})",
)
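
The shape of tmux calls a model would emit for a TUI interaction, matching the schema above (the keystrokes are only an example):

from atropos.tools.base import ToolCall

turns = [
    '<tool_call>{"name": "tmux", "arguments": {"action": "start", "pane_width": 80, "pane_height": 24}}</tool_call>',
    '<tool_call>{"name": "tmux", "arguments": {"action": "send_keys", "keys": ["htop", "Enter"], "min_wait_s": 0.5}}</tool_call>',
    '<tool_call>{"name": "tmux", "arguments": {"action": "stream", "max_bytes": 4096}}</tool_call>',
]
for turn in turns:
    (call,) = ToolCall.parse_from_text(turn)
    print(call.name, call.arguments)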

View File

@@ -0,0 +1,500 @@
"""
ToolExecutor - queued, batched tool dispatch for multiplexed agent trajectories.
This component is responsible for:
- Maintaining trajectory -> Slot affinity (workspace continuity)
- Batching sandbox tool calls across trajectories to maximize container utilization
- Routing external tools (ToolSchema.external=True) to a ToolServer (Phase 4.5)
Sandbox tools executed inside slot workspaces:
- terminal / bash (plus the stateful terminal and tmux variants)
- read_file
- write_file
"""
from __future__ import annotations
import asyncio
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
import httpx
from .base import (
ArtifactArchiveRequestPayload,
ArtifactArchiveResponsePayload,
ArtifactListRequestPayload,
ArtifactListResponsePayload,
ArtifactReadRequestPayload,
ArtifactReadResponsePayload,
ToolCall,
ToolCallPayload,
ToolRegistry,
ToolResult,
ToolResultPayload,
ToolServerExecuteRequest,
)
from ..backends.base import ToolBackend
from ..slots import Slot
@dataclass
class ToolExecutorConfig:
batch_window_ms: int = 20
max_batch_size: int = 200
allow_network: bool = True
require_sandbox: bool = False
require_stateful_sandbox: bool = False
tool_server_url: Optional[str] = None
tool_server_token: Optional[str] = None
@dataclass
class _QueuedToolRequest:
trajectory_id: str
call: ToolCall
timeout_s: Optional[float]
future: asyncio.Future
class ToolExecutor:
def __init__(
self,
backend: ToolBackend,
tools: ToolRegistry,
config: Optional[ToolExecutorConfig] = None,
) -> None:
self.backend = backend
self.tools = tools
self.config = config or ToolExecutorConfig()
self._queue: asyncio.Queue[Optional[_QueuedToolRequest]] = asyncio.Queue()
self._task: Optional[asyncio.Task] = None
self._stopping = asyncio.Event()
self._slots_lock = asyncio.Lock()
self._slot_by_trajectory: Dict[str, Slot] = {}
self._tool_server_client: Optional[httpx.AsyncClient] = None
self._tool_server_lock = asyncio.Lock()
# lightweight stats for status endpoints
self.total_requests: int = 0
self.total_errors: int = 0
self.latencies_s: List[float] = []
async def start(self) -> None:
if self._task is None:
self._task = asyncio.create_task(self._run_loop())
def queue_size(self) -> int:
return self._queue.qsize()
async def close(self) -> None:
self._stopping.set()
await self._queue.put(None)
if self._task:
await self._task
self._task = None
client = self._tool_server_client
self._tool_server_client = None
if client is not None:
await client.aclose()
# Best-effort release any remaining slots.
async with self._slots_lock:
slots = list(self._slot_by_trajectory.items())
self._slot_by_trajectory.clear()
for _, slot in slots:
try:
await self.backend.release(slot, reset_workspace=False)
except Exception:
pass
async def execute(
self,
trajectory_id: str,
call: ToolCall,
timeout_s: Optional[float] = None,
) -> ToolResult:
if self._task is None:
raise RuntimeError("ToolExecutor not started (call start() first)")
# Allow tool args to suggest a timeout (Hermes-compatible terminal tool),
# but never let the model choose "infinite" timeouts.
if timeout_s is None:
raw_timeout = call.arguments.get("timeout")
if isinstance(raw_timeout, (int, float)):
timeout_s = float(raw_timeout)
if timeout_s is not None:
timeout_s = max(1.0, min(float(timeout_s), 600.0))
loop = asyncio.get_running_loop()
fut: asyncio.Future = loop.create_future()
started = time.perf_counter()
await self._queue.put(_QueuedToolRequest(trajectory_id=trajectory_id, call=call, timeout_s=timeout_s, future=fut))
try:
result: ToolResult = await fut
return result
finally:
self.latencies_s.append(time.perf_counter() - started)
async def release_trajectory(self, trajectory_id: str, reset_workspace: bool = False) -> None:
async with self._slots_lock:
slot = self._slot_by_trajectory.pop(trajectory_id, None)
if slot is not None:
await self.backend.release(slot, reset_workspace=reset_workspace)
async def _get_slot_if_present(self, trajectory_id: str) -> Optional[Slot]:
async with self._slots_lock:
return self._slot_by_trajectory.get(trajectory_id)
# ---------------------------------------------------------------------
# Artifact helpers (optional)
# ---------------------------------------------------------------------
async def read_artifact(self, req: ArtifactReadRequestPayload) -> ArtifactReadResponsePayload:
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is None:
return ArtifactReadResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
data = await self.backend.read_artifact(
slot,
req.path,
encoding=req.encoding,
max_bytes=req.max_bytes,
include_sha256=req.include_sha256,
)
if isinstance(data, dict):
data = dict(data)
data.pop("http_status", None)
try:
return ArtifactReadResponsePayload(**(data or {}))
except Exception as e:
return ArtifactReadResponsePayload(success=False, error=f"Invalid artifact read response: {e}")
async def list_artifacts(self, req: ArtifactListRequestPayload) -> ArtifactListResponsePayload:
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is None:
return ArtifactListResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
data = await self.backend.list_artifacts(
slot,
req.path,
recursive=req.recursive,
max_entries=req.max_entries,
)
if isinstance(data, dict):
data = dict(data)
data.pop("http_status", None)
try:
return ArtifactListResponsePayload(**(data or {}))
except Exception as e:
return ArtifactListResponsePayload(success=False, error=f"Invalid artifact list response: {e}")
async def archive_artifacts(self, req: ArtifactArchiveRequestPayload) -> ArtifactArchiveResponsePayload:
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is None:
return ArtifactArchiveResponsePayload(success=False, error="No active slot for trajectory (run a sandbox tool first)")
data = await self.backend.archive_artifacts(
slot,
req.path,
archive_format=req.format,
max_bytes=req.max_bytes,
max_entries=req.max_entries,
)
if isinstance(data, dict):
data = dict(data)
data.pop("http_status", None)
try:
return ArtifactArchiveResponsePayload(**(data or {}))
except Exception as e:
return ArtifactArchiveResponsePayload(success=False, error=f"Invalid artifact archive response: {e}")
async def _get_or_acquire_slot(self, trajectory_id: str) -> Slot:
async with self._slots_lock:
existing = self._slot_by_trajectory.get(trajectory_id)
if existing is not None:
return existing
slot = await self.backend.acquire(trajectory_id)
async with self._slots_lock:
existing = self._slot_by_trajectory.get(trajectory_id)
if existing is not None:
# Another coroutine won the race; return its slot.
await self.backend.release(slot, reset_workspace=False)
return existing
self._slot_by_trajectory[trajectory_id] = slot
return slot
async def _run_loop(self) -> None:
pending: List[_QueuedToolRequest] = []
deadline: Optional[float] = None
batch_window_s = max(0.0, self.config.batch_window_ms / 1000.0)
max_batch = max(1, self.config.max_batch_size)
while True:
if self._stopping.is_set() and self._queue.empty() and not pending:
break
timeout = None
if pending and deadline is not None:
timeout = max(0.0, deadline - time.perf_counter())
try:
item = await asyncio.wait_for(self._queue.get(), timeout=timeout)
if item is None:
continue
pending.append(item)
if len(pending) == 1:
deadline = time.perf_counter() + batch_window_s
if len(pending) < max_batch:
continue
except asyncio.TimeoutError:
# batch window elapsed
pass
if not pending:
deadline = None
continue
batch = pending
pending = []
deadline = None
await self._execute_batch(batch)
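    # Batching example (illustrative): with batch_window_ms=20 and
    # max_batch_size=200, the first enqueued request opens a 20 ms window;
    # every request arriving within that window joins the same batch, and the
    # batch is flushed early if it reaches 200 items.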
async def _get_tool_server_client(self) -> httpx.AsyncClient:
url = self.config.tool_server_url
if not url:
raise RuntimeError("ToolServer not configured")
if self._tool_server_client is not None:
return self._tool_server_client
async with self._tool_server_lock:
if self._tool_server_client is None:
self._tool_server_client = httpx.AsyncClient(base_url=url.rstrip("/"))
return self._tool_server_client
def _tool_server_headers(self) -> Dict[str, str]:
token = self.config.tool_server_token
if not token:
return {}
return {"Authorization": f"Bearer {token}"}
async def _execute_external(self, req: _QueuedToolRequest) -> ToolResult:
client = await self._get_tool_server_client()
slot_id: Optional[str] = None
container_addr: Optional[str] = None
slot = await self._get_slot_if_present(req.trajectory_id)
if slot is not None:
slot_id = slot.slot_id
container_addr = slot.container_addr
payload = ToolServerExecuteRequest(
trajectory_id=req.trajectory_id,
tool=ToolCallPayload.from_tool_call(req.call),
timeout_s=req.timeout_s,
slot_id=slot_id,
container_addr=container_addr,
)
try:
resp = await client.post(
"/execute",
json=payload.model_dump(),
headers=self._tool_server_headers(),
timeout=req.timeout_s,
)
resp.raise_for_status()
data = resp.json()
parsed = ToolResultPayload(**data)
result = parsed.to_tool_result()
if result.uniq_id is None:
result.uniq_id = req.call.uniq_id
return result
except Exception as e:
return ToolResult(
success=False,
error=f"External tool failed: {e}",
uniq_id=req.call.uniq_id,
)
async def _execute_batch(self, batch: List[_QueuedToolRequest]) -> None:
# Resolve tool schemas once per request and separate sandbox/external/unknown.
sandbox_items: List[_QueuedToolRequest] = []
external_items: List[_QueuedToolRequest] = []
unknown_items: List[_QueuedToolRequest] = []
for it in batch:
tool = self.tools.get(it.call.name)
if tool is None:
unknown_items.append(it)
continue
schema = tool.schema
if not schema.external:
sandbox_items.append(it)
else:
external_items.append(it)
for it in unknown_items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"Unknown tool: {it.call.name}",
uniq_id=it.call.uniq_id,
)
)
if external_items:
if not self.config.tool_server_url:
for it in external_items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"External tool not available (ToolServer not configured): {it.call.name}",
uniq_id=it.call.uniq_id,
)
)
else:
results = await asyncio.gather(*[self._execute_external(it) for it in external_items])
for it, res in zip(external_items, results):
self.total_requests += 1
if not getattr(res, "success", False):
self.total_errors += 1
if not it.future.done():
it.future.set_result(res)
if not sandbox_items:
return
# Acquire slots for the distinct trajectories in this batch.
try:
traj_ids = list({it.trajectory_id for it in sandbox_items})
slots = await asyncio.gather(*[self._get_or_acquire_slot(tid) for tid in traj_ids])
slot_by_traj = dict(zip(traj_ids, slots))
except Exception as e:
for it in sandbox_items:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"Failed to acquire slot: {e}",
uniq_id=it.call.uniq_id,
)
)
return
# Group by timeout so we don't accidentally make short timeouts wait on long ones.
by_timeout: Dict[float, List[_QueuedToolRequest]] = {}
default_timeout = self.backend.default_timeout_s
for it in sandbox_items:
t = it.timeout_s
if t is None:
t = default_timeout
if t is None:
t = 30.0
by_timeout.setdefault(float(t), []).append(it)
for timeout_s, items in by_timeout.items():
requests = []
dispatched: List[_QueuedToolRequest] = []
for it in items:
slot = slot_by_traj[it.trajectory_id]
tool_name = it.call.name
args = dict(it.call.arguments)
# Hermes compatibility: treat `terminal` as an alias of sandbox `bash`.
if tool_name == "terminal":
if args.get("background"):
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error="terminal background execution is not supported in sandbox",
uniq_id=it.call.uniq_id,
)
)
continue
tool_name = "bash"
# `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
args.pop("timeout", None)
elif tool_name == "terminal_stateful":
tool_name = "bash_stateful"
args.pop("timeout", None)
elif tool_name == "tmux":
# `tmux` is a sandbox tool backed by the stateful session manager.
# Network policy is env-controlled.
args.pop("allow_network", None)
if tool_name == "bash":
# Network policy is set by the environment/executor, not by the model.
args.pop("allow_network", None)
args.pop("require_sandbox", None)
args["allow_network"] = bool(self.config.allow_network)
args["require_sandbox"] = bool(self.config.require_sandbox)
# `timeout` is handled at the ToolExecutor level, not passed to the sandbox tool args.
args.pop("timeout", None)
elif tool_name == "bash_stateful":
# Network policy is set by the environment/executor, not by the model.
args.pop("allow_network", None)
args.pop("require_sandbox", None)
args.pop("require_stateful_sandbox", None)
args["allow_network"] = bool(self.config.allow_network)
args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
args.pop("timeout", None)
elif tool_name == "tmux":
# Network policy applies to the underlying stateful session.
args.pop("allow_network", None)
args.pop("require_sandbox", None)
args.pop("require_stateful_sandbox", None)
args["allow_network"] = bool(self.config.allow_network)
args["require_stateful_sandbox"] = bool(self.config.require_stateful_sandbox)
requests.append((slot, tool_name, args))
dispatched.append(it)
results = None
try:
if not dispatched:
continue
results = await self.backend.execute_batch(requests, timeout_s=timeout_s)
except Exception as e:
                # Only fail requests we actually dispatched; items rejected
                # earlier (e.g. terminal background) were already resolved.
                for it in dispatched:
self.total_requests += 1
self.total_errors += 1
if not it.future.done():
it.future.set_result(
ToolResult(
success=False,
error=f"Batch execution failed: {e}",
uniq_id=it.call.uniq_id,
)
)
continue
for it, res in zip(dispatched, results):
self.total_requests += 1
if not getattr(res, "success", False):
self.total_errors += 1
tool_result = res.to_tool_result()
tool_result.uniq_id = it.call.uniq_id
if not it.future.done():
it.future.set_result(tool_result)

View File

@@ -0,0 +1,88 @@
"""
Toolset resolution for Hermes-Agent Atropos integration.
We primarily reuse Hermes-Agent toolsets (`toolsets.py`), but Atropos training/envs
need a few extra sandbox-oriented toolsets that Hermes doesn't expose by default
(e.g. filesystem + stateful terminal).
"""
from __future__ import annotations
from typing import Any, Dict, List, Optional, Set
import toolsets as hermes_toolsets
ATROPOS_TOOLSETS: Dict[str, Dict[str, Any]] = {
"filesystem": {
"description": "Read/write files in the sandbox workspace.",
"tools": ["read_file", "write_file"],
"includes": [],
},
"terminal_stateful": {
"description": "Stateful terminal execution (tmux/TUI support) inside the sandbox.",
"tools": ["terminal_stateful", "tmux"],
"includes": [],
},
"sandbox": {
"description": "Sandbox tools (terminal + filesystem).",
"tools": [],
"includes": ["terminal", "filesystem"],
},
"default": {
"description": "Default toolset for Atropos AgentEnv tasks.",
"tools": [],
"includes": ["sandbox"],
},
"full": {
"description": "All Hermes tools plus Atropos sandbox additions.",
"tools": [],
"includes": ["all", "filesystem", "sandbox", "terminal_stateful"],
},
}
def validate_toolset(name: str) -> bool:
if name in {"all", "*"}:
return True
return hermes_toolsets.validate_toolset(name) or name in ATROPOS_TOOLSETS
def resolve_toolset(name: str, visited: Optional[Set[str]] = None) -> List[str]:
if visited is None:
visited = set()
if name in {"all", "*"}:
# Union Hermes + Atropos toolsets.
all_tools: Set[str] = set()
for tname in hermes_toolsets.get_toolset_names():
all_tools.update(resolve_toolset(tname, visited=set()))
for tname, spec in ATROPOS_TOOLSETS.items():
# Avoid recursion: some Atropos toolsets (e.g. "full") include "all".
if tname == "full" or "all" in (spec.get("includes") or []):
continue
all_tools.update(resolve_toolset(tname, visited=set()))
return sorted(all_tools)
if name in ATROPOS_TOOLSETS:
if name in visited:
return []
visited.add(name)
spec = ATROPOS_TOOLSETS[name]
tools: Set[str] = set(spec.get("tools", []))
for inc in spec.get("includes", []):
tools.update(resolve_toolset(inc, visited=set(visited)))
return sorted(tools)
# Fall back to Hermes toolsets.
# IMPORTANT: do not pre-add `name` to `visited` here; Hermes' resolver uses
# `visited` for its own cycle detection and will treat the presence of `name`
# as a circular dependency.
return sorted(hermes_toolsets.resolve_toolset(name, visited=set(visited)))
def resolve_multiple_toolsets(names: List[str]) -> List[str]:
tools: Set[str] = set()
for name in names:
tools.update(resolve_toolset(name, visited=set()))
return sorted(tools)
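# Resolution examples (these follow directly from ATROPOS_TOOLSETS above):
#   resolve_toolset("filesystem")
#   # -> ["read_file", "write_file"]
#   resolve_multiple_toolsets(["filesystem", "terminal_stateful"])
#   # -> ["read_file", "terminal_stateful", "tmux", "write_file"]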

415
atropos_compatible_agent.py Normal file
View File

@@ -0,0 +1,415 @@
#!/usr/bin/env python3
"""
Atropos-compatible Hermes agent runner.
This is a minimal subclass of Hermes-Agent's `AIAgent` that swaps the OpenAI
function-calling backend for Atroposlib's `ManagedServer`/`ServerManager` backend
and uses Hermes-style XML tool tags:
- <tool_call>{"name": "...", "arguments": {...}}</tool_call>
- <tool_response>{...}</tool_response>
Tool observations are appended as `role="user"` messages containing one or more
`<tool_response>` blocks so they survive common chat templates during tokenization.
"""
from __future__ import annotations
import asyncio
import json
import re
import time
import warnings
import os
from contextlib import asynccontextmanager
from typing import Any, AsyncGenerator, Dict, List, Optional, Tuple
from model_tools import cleanup_vm, handle_function_call
from run_agent import AIAgent
_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
ATROPOS_TOOL_SYSTEM_PROMPT = """You are a helpful AI assistant with access to tools.
## Available Tools
<tools>
{tool_descriptions}
</tools>
## How to Use Tools
To call a tool, output:
<tool_call>{{"name": "tool_name", "arguments": {{"arg1": "value1"}}}}</tool_call>
You may include optional reasoning in <think>...</think> before tool calls.
After each tool call, you will receive tool results as:
<tool_response>{{...}}</tool_response>
Continue until finished, then provide a final response with no <tool_call> blocks.
"""
class AtroposAIAgent(AIAgent):
"""
Hermes `AIAgent` variant that uses Atroposlib ServerManager/ManagedServer.
Notes:
- The default Hermes `AIAgent` remains unchanged; this class is opt-in.
- The underlying server must expose `managed_server(tokenizer=...)` OR be a single
APIServer-compatible object usable by Atroposlib's `ManagedServer`.
"""
def __init__(
self,
*,
server: Any,
tokenizer: Any = None,
model: str = "local",
max_iterations: int = 10,
tool_delay: float = 0.0,
enabled_toolsets: Optional[List[str]] = None,
disabled_toolsets: Optional[List[str]] = None,
save_trajectories: bool = False,
verbose_logging: bool = False,
quiet_mode: bool = False,
ephemeral_system_prompt: Optional[str] = None,
log_prefix_chars: int = 100,
log_prefix: str = "",
session_id: Optional[str] = None,
temperature: Optional[float] = None,
max_tokens: Optional[int] = None,
):
# Call parent init mainly to reuse tool selection + trajectory saving utilities.
super().__init__(
base_url="http://unused",
api_key="dummy-key",
model=model,
max_iterations=max_iterations,
tool_delay=tool_delay,
enabled_toolsets=enabled_toolsets,
disabled_toolsets=disabled_toolsets,
save_trajectories=save_trajectories,
verbose_logging=verbose_logging,
quiet_mode=quiet_mode,
ephemeral_system_prompt=ephemeral_system_prompt,
log_prefix_chars=log_prefix_chars,
log_prefix=log_prefix,
session_id=session_id,
)
self.server = server
self.tokenizer = tokenizer
self.temperature = temperature
self.max_tokens = max_tokens
@asynccontextmanager
async def _managed(self) -> AsyncGenerator[Any, None]:
if hasattr(self.server, "managed_server"):
with warnings.catch_warnings():
warnings.filterwarnings(
"ignore",
message=r"Using OpenAIServer with managed_server does not allow for state tracking",
category=UserWarning,
)
async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
yield managed
return
# Fall back to directly wrapping a single server object.
from atroposlib.envs.server_handling.managed_server import ManagedServer
managed = ManagedServer(server=self.server, tokenizer=self.tokenizer)
try:
yield managed
finally:
managed.reset()
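    # Both paths yield an object exposing `chat_completion(...)`; the fallback
    # path additionally requires atroposlib to be importable.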
def _tool_descriptions_text(self) -> str:
if not self.tools:
return "(no tools available)"
parts: List[str] = []
for tool in self.tools:
fn = (tool or {}).get("function", {})
name = fn.get("name", "")
desc = (fn.get("description") or "").strip()
if not name:
continue
if desc:
parts.append(f"- {name}: {desc}")
else:
parts.append(f"- {name}")
return "\n".join(parts) if parts else "(no tools available)"
def _build_system_prompt(self, system_message: Optional[str]) -> Optional[str]:
tool_prompt = ATROPOS_TOOL_SYSTEM_PROMPT.format(
tool_descriptions=self._tool_descriptions_text()
)
parts: List[str] = []
if system_message:
parts.append(system_message)
if self.ephemeral_system_prompt:
parts.append(self.ephemeral_system_prompt)
parts.append(tool_prompt)
return "\n\n".join(parts)
def _parse_tool_calls(self, content: str) -> Tuple[List[Tuple[str, Dict[str, Any]]], List[str]]:
"""
Returns:
(calls, errors)
"""
calls: List[Tuple[str, Dict[str, Any]]] = []
errors: List[str] = []
for raw in _TOOL_CALL_RE.findall(content or ""):
try:
payload = json.loads(raw)
except json.JSONDecodeError as exc:
errors.append(f"Invalid JSON inside <tool_call>: {exc}")
continue
name = payload.get("name")
args = payload.get("arguments", {})
if not isinstance(name, str) or not name:
errors.append("Tool call missing 'name' string")
continue
if not isinstance(args, dict):
errors.append("Tool call 'arguments' must be an object")
continue
calls.append((name, args))
return calls, errors
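    # Parsing example (illustrative):
    #   calls, errors = agent._parse_tool_calls(
    #       '<tool_call>{"name": "terminal", "arguments": {"command": "ls"}}</tool_call>'
    #   )
    #   # calls == [("terminal", {"command": "ls"})], errors == []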
async def run_conversation_async(
self,
user_message: str,
system_message: Optional[str] = None,
conversation_history: Optional[List[Dict[str, Any]]] = None,
task_id: Optional[str] = None,
) -> Dict[str, Any]:
import uuid
effective_task_id = task_id or str(uuid.uuid4())
messages: List[Dict[str, Any]] = conversation_history.copy() if conversation_history else []
messages.append({"role": "user", "content": user_message})
active_system_prompt = self._build_system_prompt(system_message)
api_call_count = 0
final_response: Optional[str] = None
managed_state: Optional[Dict[str, Any]] = None
completed = False
try:
async with self._managed() as managed:
while api_call_count < self.max_iterations:
api_call_count += 1
api_messages = messages.copy()
if active_system_prompt:
api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages
chat_kwargs: Dict[str, Any] = {"messages": api_messages, "n": 1}
if self.max_tokens is not None:
chat_kwargs["max_tokens"] = self.max_tokens
if self.temperature is not None:
chat_kwargs["temperature"] = self.temperature
# Prefer OpenAI tool calling when supported by the backend:
# - Many providers normalize Hermes-style <tool_call> tags into tool_calls when `tools` is provided.
# - ManagedServer (atroposlib) does prompt->completion conversion and does not support `tools`.
# Only pass `tools` when we're calling an OpenAI-compatible chat endpoint directly.
tool_schemas = self.tools if self.tools else None
managed_cls = type(managed).__name__
if tool_schemas and managed_cls != "ManagedServer":
chat_kwargs["tools"] = tool_schemas
if os.getenv("HERMES_DEBUG_ATROPOS_REQUEST") == "1":
meta = {
"managed_type": managed_cls,
"model": getattr(getattr(managed, "config", None), "model_name", self.model),
"base_url": getattr(getattr(managed, "config", None), "base_url", None),
"kwargs": chat_kwargs,
}
# Avoid dumping megabytes of data accidentally.
# (Messages can be large; this is still "full" but bounded.)
print("\n=== HERMES_DEBUG_ATROPOS_REQUEST ===", flush=True)
print(json.dumps(meta, ensure_ascii=False, indent=2)[:200_000], flush=True)
response = await managed.chat_completion(**chat_kwargs)
if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
try:
dumped = response.model_dump() # openai pydantic model
except Exception:
dumped = getattr(response, "__dict__", {"repr": repr(response)})
print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: ChatCompletion (raw) ===", flush=True)
print(json.dumps(dumped, ensure_ascii=False, indent=2), flush=True)
if hasattr(managed, "get_state"):
managed_state = managed.get_state()
msg = response.choices[0].message
assistant_content = (msg.content or "")
msg_reasoning = getattr(msg, "reasoning", None)
# Use tool_calls if the backend provides them (preferred).
structured_tool_calls = getattr(msg, "tool_calls", None)
# If the backend emits content="" but includes useful text in reasoning,
# use it for parsing *only if needed* (e.g. tool tags).
if assistant_content == "" and isinstance(msg_reasoning, str) and msg_reasoning:
if os.getenv("HERMES_DEBUG_ATROPOS_RESPONSE") == "1":
print("\n=== HERMES_DEBUG_ATROPOS_RESPONSE: message.reasoning present (content empty) ===", flush=True)
print(msg_reasoning, flush=True)
assistant_msg: Dict[str, Any] = {"role": "assistant", "content": assistant_content}
if structured_tool_calls:
# Preserve tool_calls so the next request is consistent with OpenAI protocol.
try:
assistant_msg["tool_calls"] = [
{
"id": tc.id,
"type": tc.type,
"function": {"name": tc.function.name, "arguments": tc.function.arguments},
}
for tc in structured_tool_calls
]
except Exception:
# Best-effort; keep conversation moving.
pass
messages.append(assistant_msg)
# Mode A: OpenAI tool calling (preferred when supported)
if structured_tool_calls:
for tc in structured_tool_calls:
tool_start = time.time()
try:
tool_args = json.loads(tc.function.arguments or "{}")
except Exception:
tool_args = {}
tool_result = handle_function_call(tc.function.name, tool_args, effective_task_id)
tool_duration = time.time() - tool_start
# Keep the raw tool result as tool content (OpenAI protocol expects role=tool).
messages.append(
{
"role": "tool",
"tool_call_id": tc.id,
"content": tool_result,
}
)
if self.tool_delay and self.tool_delay > 0:
await asyncio.sleep(self.tool_delay)
# Continue loop after tool execution.
continue
# Mode B: Hermes XML tool tags in assistant text (fallback).
parse_source = assistant_content or (msg_reasoning or "")
tool_calls, parse_errors = self._parse_tool_calls(parse_source)
if parse_errors and not tool_calls:
# Ask the model to retry with valid tool JSON.
err_text = "; ".join(parse_errors[:3])
messages.append(
{
"role": "user",
"content": (
f"<tool_response>{json.dumps({'error': err_text}, ensure_ascii=False)}</tool_response>\n"
"The previous <tool_call> blocks were invalid. Please output valid JSON inside <tool_call>."
),
}
)
continue
if not tool_calls:
# No tool calls: treat as final answer.
final_response = (assistant_content or "").strip()
completed = True
break
tool_responses: List[str] = []
for tool_name, tool_args in tool_calls:
tool_start = time.time()
tool_result = handle_function_call(tool_name, tool_args, effective_task_id)
tool_duration = time.time() - tool_start
try:
parsed = json.loads(tool_result)
payload: Any = parsed
except Exception:
payload = tool_result
tool_payload = {
"name": tool_name,
"duration_s": round(tool_duration, 3),
"result": payload,
}
tool_responses.append(
f"<tool_response>{json.dumps(tool_payload, ensure_ascii=False)}</tool_response>"
)
if self.tool_delay and self.tool_delay > 0:
await asyncio.sleep(self.tool_delay)
messages.append({"role": "user", "content": "\n".join(tool_responses)})
if final_response is None:
final_response = "I've reached the maximum number of iterations."
finally:
try:
cleanup_vm(effective_task_id)
except Exception:
pass
# Save trajectory using Hermes formatting (optional).
self._save_trajectory(messages, user_message, completed=completed)
return {
"final_response": final_response,
"messages": messages,
"api_calls": api_call_count,
"completed": completed,
"managed_state": managed_state,
"system_prompt": active_system_prompt,
"task_id": effective_task_id,
}
def run_conversation(self, *args: Any, **kwargs: Any) -> Dict[str, Any]:
"""
Sync wrapper for convenience.
If called from within a running event loop (e.g. prompt_toolkit), this
runs the async conversation in a dedicated thread to avoid nested loops.
"""
try:
asyncio.get_running_loop()
except RuntimeError:
return asyncio.run(self.run_conversation_async(*args, **kwargs))
import queue
import threading
out: "queue.Queue[object]" = queue.Queue(maxsize=1)
def runner() -> None:
try:
out.put(asyncio.run(self.run_conversation_async(*args, **kwargs)))
except BaseException as exc: # noqa: BLE001
out.put(exc)
thread = threading.Thread(target=runner, daemon=True)
thread.start()
result = out.get()
if isinstance(result, BaseException):
raise result
return result # type: ignore[return-value]
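# Hypothetical usage sketch (server/tokenizer construction is environment-specific
# and not shown here):
#   agent = AtroposAIAgent(server=server_manager, tokenizer=tok, max_iterations=8)
#   out = await agent.run_conversation_async("List files in /tmp", task_id="demo-1")
#   print(out["final_response"], out["api_calls"])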

View File

@@ -0,0 +1,83 @@
# Endless Terminals Environment Configuration
#
# Two modes:
# 1. Dataset mode (default): Load pre-generated tasks from HuggingFace
# 2. Procedural mode: Generate tasks on-demand via LLM
#
# Usage:
# python -m atropos.envs.endless_terminals_env process \
# --config configs/endless_terminals.yaml
# Environment settings
env:
# Dataset mode (primary - recommended)
use_dataset: true # Load from HuggingFace (fast, no vLLM needed)
dataset_name: "obiwan96/endless-terminals-train"
dataset_split: "train"
dataset_cache_dir: "~/.cache/huggingface/datasets"
tasks_base_dir: "" # Set to dir containing task_* folders if not using default paths
# Example: "/path/to/endless-terminals-train"
# Task generation (fallback if use_dataset=false)
task_gen_model: "Qwen/Qwen3-32B" # Only needed if use_dataset=false
task_gen_temperature: 1.0
task_gen_max_tokens: 2048
# Container settings
base_container_image: "ubuntu:22.04"
container_timeout_s: 180
test_timeout_s: 60
# Workspace
workspace_dir: "/tmp/endless_terminals_workspace"
keep_failed_tasks: false # Set true to debug failed tasks
# Agent config (increased for long traces)
agent_max_steps: 32
agent_temperature: 0.7
agent_max_tokens: null # Let backend decide
# Tooling: terminal only
enabled_toolsets: ["terminal"]
disabled_toolsets: []
# Training settings
group_size: 4 # Parallel trajectory collection
batch_size: 32
total_steps: 1000 # Total training episodes
use_wandb: false # Enable for experiment tracking
include_messages: true
# Tool execution backend (nomad or modal)
tool_pool_mode: "nomad"
# Nomad settings (if using nomad)
nomad_address: "http://localhost:4646"
sandbox_job_id: "atropos-sandbox-endless"
sandbox_image: "atropos-sandbox:local"
slots_per_container: 10
min_containers: 1
max_containers: 10
privileged: false
acquire_timeout_s: 30.0
purge_job_on_start: true
purge_job_on_shutdown: true
# Modal settings (if using modal instead)
# modal_app_name: "atropos-endless"
# modal_image: "python:3.11"
# modal_slots_per_sandbox: 10
# modal_min_sandboxes: 1
# modal_max_sandboxes: 5
# Server config
server_base_url: "http://127.0.0.1:8080"
server_model: "hermes-4-36b"
tokenizer_name: "NousResearch/Hermes-4.3-36B"
# Server configs are auto-generated from env vars and env.server_* settings
# Override via environment variables:
# ATROPOS_SERVER_BASE_URL
# ATROPOS_SERVER_MODEL
# ATROPOS_SERVER_API_KEY
# ATROPOS_TOKENIZER_NAME
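# Example override (illustrative; the values are placeholders):
#   ATROPOS_SERVER_BASE_URL=http://127.0.0.1:9090 \
#   python -m atropos.envs.endless_terminals_env process \
#     --config configs/endless_terminals.yaml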

224
docs/MODAL_BACKEND.md Normal file
View File

@@ -0,0 +1,224 @@
# Modal Backend
Hermes Agent uses [Modal](https://modal.com) for scalable, isolated cloud execution environments. There are two Modal integrations:
1. **Terminal Tool** (`tools/terminal_tool.py`) - For CLI/agent command execution
2. **Atropos Backend** (`atropos/backends/modal_backend.py`) - For batch RL training workloads
---
## Terminal Tool (CLI/Agent)
The terminal tool provides a simple interface for executing commands in Modal sandboxes.
### Configuration
Set environment variables:
```bash
export TERMINAL_ENV=modal
export TERMINAL_MODAL_IMAGE=python:3.11
export TERMINAL_MODAL_APP_NAME=hermes-sandbox
```
Or use a YAML config file (`modal_profiles.yaml`):
```yaml
profiles:
default:
image: python:3.11
cpu: 1.0
memory: 2048
min_pool: 1
max_pool: 5
idle_timeout: 120
gpu:
image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
gpu: T4
memory: 16384
min_pool: 0
max_pool: 2
```
### Features
| Feature | Description |
|---------|-------------|
| **Sandbox Pool** | Pre-warmed sandboxes for low latency |
| **Auto-scaling** | Grows/shrinks pool based on demand |
| **Idle Timeout** | Sandboxes auto-terminate when unused |
| **Profile Selection** | Different configs for different workloads |
| **Credential Injection** | `modal.Secret` integration |
### Usage
```python
from tools.terminal_tool import terminal_tool
# Simple command
output = terminal_tool("echo hello", task_id="my-task")
# With profile selection
output = terminal_tool("python train.py", task_id="training", profile="gpu")
# Cleanup when done
from tools.terminal_tool import cleanup_vm
cleanup_vm("my-task")
```
### Architecture
```
_ModalPoolManager (singleton)
├── "default" pool → [sandbox-0, sandbox-1, ...]
└── "gpu" pool → [sandbox-0, ...]
Each pool:
- Maintains min_pool warm sandboxes
- Scales up to max_pool on demand
- Background thread scales down idle sandboxes
```
---
## Atropos Backend (RL Training)
The Atropos backend is designed for high-throughput batch execution during reinforcement learning training.
### Key Concept: Slot-based Multiplexing
Instead of one sandbox per trajectory, multiple trajectories share sandboxes via **slots**:
```
Sandbox (1 container)
├── Slot 0 → Trajectory A (workspace: /data/slot_0)
├── Slot 1 → Trajectory B (workspace: /data/slot_1)
└── Slot 2 → Trajectory C (workspace: /data/slot_2)
```
**Benefits**:
- Fewer containers = lower cost
- Shared warm-up time
- Better GPU utilization
### Configuration
```python
from atropos.backends.modal_backend import ModalSandboxConfig, ModalToolBackend
config = ModalSandboxConfig(
name="default",
image="python:3.11",
cpu=1.0,
memory=2048,
slots_per_sandbox=10, # 10 trajectories per container
min_sandboxes=1,
max_sandboxes=5,
)
backend = ModalToolBackend(config.with_app_name("my-training"))
```
### Multi-Profile Support
Different trajectory types can request different resources:
```python
backend = ModalToolBackend.with_profiles(
app_name="rl-training",
profiles={
"default": ModalSandboxConfig(
name="default",
cpu=1.0,
memory=2048,
),
"pytorch-gpu": ModalSandboxConfig(
name="pytorch-gpu",
image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
gpu="T4",
memory=16384,
),
}
)
# CPU task
slot1 = await backend.acquire("traj-1", profile="default")
# GPU task
slot2 = await backend.acquire("traj-2", profile="pytorch-gpu")
```
### Batched Execution
The key optimization is executing many commands in parallel:
```python
# Acquire slots for multiple trajectories
slots = [await backend.acquire(f"traj-{i}") for i in range(50)]
# Execute batch across all slots in parallel
results = await backend.execute_batch([
(slot, "bash", {"command": "python step.py"})
for slot in slots
])
# Release slots
for slot in slots:
await backend.release(slot)
```
### Architecture
```
ModalToolBackend
└── _ModalMultiProfileManager
├── "default" → _ModalSandboxPool
│ ├── Sandbox 0 (slots 0-9)
│ └── Sandbox 1 (slots 0-9)
└── "pytorch-gpu" → _ModalSandboxPool
└── Sandbox 0 (slots 0-9)
```
---
## Credentials
Inject secrets securely using Modal's secret management:
```bash
# Create secret in Modal dashboard or CLI
modal secret create my-api-key API_KEY=sk-xxx
```
```python
# Reference in config
config = ModalSandboxConfig(
secrets=["my-api-key"], # Modal secret names
env_vars={"DEBUG": "1"}, # Additional env vars
)
```
## Troubleshooting
### "Modal package not installed"
```bash
pip install modal
modal token new # Authenticate
```
### "Sandbox creation failed"
- Check Modal dashboard for quota limits
- Verify image exists and is accessible
- Check secret names are correct
### Shutdown errors
These are harmless warnings during Python interpreter shutdown:
```
[Modal] Error terminating ...: cannot schedule new futures after interpreter shutdown
```
Sandboxes will still auto-terminate via Modal's `idle_timeout`.

34
hermes
View File

@@ -7,6 +7,40 @@ Usage: ./hermes [options]
""" """
if __name__ == "__main__": if __name__ == "__main__":
"""
Fire (google/python-fire) does not support POSIX-style short flags like `-p`.
We translate the most common shorthands to their long equivalents so wrapper
scripts can reliably use:
- `-p "..."` -> `--prompt "..."` (no TUI/banner; print result and exit)
- `-q "..."` -> `--query "..."` (single-shot with banner UX)
"""
import sys
def _rewrite_short_flags(argv: list[str]) -> list[str]:
rewritten: list[str] = []
i = 0
while i < len(argv):
arg = argv[i]
if arg == "-p":
rewritten.append("--prompt")
if i + 1 < len(argv):
rewritten.append(argv[i + 1])
i += 2
continue
if arg == "-q":
rewritten.append("--query")
if i + 1 < len(argv):
rewritten.append(argv[i + 1])
i += 2
continue
rewritten.append(arg)
i += 1
return rewritten
sys.argv = [sys.argv[0]] + _rewrite_short_flags(sys.argv[1:])
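    # Example (illustrative): ["-p", "hi", "--model", "m"] -> ["--prompt", "hi", "--model", "m"]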
    from cli import main
    import fire
    fire.Fire(main)

View File

@@ -0,0 +1,659 @@
Metadata-Version: 2.4
Name: hermes-agent
Version: 0.1.0
Summary: AI agent with advanced tool-calling and toolsets
Author: Nous Research
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: openai
Requires-Dist: python-dotenv
Requires-Dist: fire
Requires-Dist: httpx
Requires-Dist: rich
Requires-Dist: tenacity
Requires-Dist: pyyaml
Requires-Dist: prompt_toolkit
Requires-Dist: requests
Requires-Dist: jinja2
Requires-Dist: pydantic>=2.0
Requires-Dist: firecrawl-py
Requires-Dist: fal-client
Requires-Dist: litellm>=1.75.5
Requires-Dist: typer
Requires-Dist: platformdirs
Provides-Extra: modal
Requires-Dist: modal; extra == "modal"
Requires-Dist: boto3; extra == "modal"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Provides-Extra: atropos
Requires-Dist: atroposlib @ git+https://github.com/NousResearch/atropos.git ; extra == "atropos"
Requires-Dist: aiohttp; extra == "atropos"
Requires-Dist: fastapi; extra == "atropos"
Requires-Dist: uvicorn; extra == "atropos"
Requires-Dist: pyte; extra == "atropos"
# Hermes Agent
An AI agent with advanced tool-calling capabilities, featuring a flexible toolsets system for organizing and managing tools.
## Features
- **Interactive CLI**: Beautiful terminal interface with animated feedback, personalities, and session management
- **Web Tools**: Search, extract content, and crawl websites
- **Terminal Tools**: Execute commands via local, Docker, Singularity, Modal, or SSH backends
- **Browser Tools**: Automate web browsers to navigate, click, type, and extract content
- **Vision Tools**: Analyze images from URLs
- **Reasoning Tools**: Advanced multi-model reasoning (Mixture of Agents)
- **Creative Tools**: Generate images from text prompts
- **Skills Tools**: On-demand knowledge documents with progressive disclosure
- **Toolsets System**: Organize tools into logical groups for different scenarios
- **Batch Processing**: Process datasets in parallel with checkpointing and statistics tracking
- **Ephemeral System Prompts**: Guide model behavior without polluting training datasets
## Quick Start (CLI)
```bash
# After setup (see below), just run:
./hermes
# Or with options:
./hermes --model "anthropic/claude-sonnet-4" --toolsets "web,terminal"
```
The CLI provides:
- Animated spinners during thinking and tool execution
- Kawaii-style feedback messages
- `/commands` for configuration, history, and session management
- Customizable personalities (`/personality kawaii`, `/personality pirate`, etc.)
- Persistent configuration via `cli-config.yaml`
## Setup
### 1. Clone the Repository
```bash
# Clone with submodules (recommended)
git clone --recurse-submodules https://github.com/NousResearch/Hermes-Agent.git
cd Hermes-Agent
# Or if already cloned without submodules:
git submodule update --init --recursive
```
### 2. Install Dependencies
```bash
# Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install Python packages
pip install -r requirements.txt
# Install mini-swe-agent for terminal tools
pip install -e ./mini-swe-agent
# Install Node.js dependencies for browser tools (requires Node.js)
npm install
```
### 3. Configure Environment Variables
```bash
# Copy the example environment file
cp .env.example .env
# Edit .env and add your API keys
nano .env # or use your preferred editor
```
**Required API Keys:**
- `OPENROUTER_API_KEY` - LLM access via OpenRouter (get at: https://openrouter.ai/keys)
- `FIRECRAWL_API_KEY` - Web tools (get at: https://firecrawl.dev/)
- `NOUS_API_KEY` - Vision & reasoning tools (get at: https://inference-api.nousresearch.com/)
- `FAL_KEY` - Image generation (get at: https://fal.ai/)
**Optional API Keys (for specific features):**
- `BROWSERBASE_API_KEY` - Browser automation (get at: https://browserbase.com/)
- `BROWSERBASE_PROJECT_ID` - From Browserbase dashboard
- `MORPH_API_KEY` - For legacy Hecate terminal backend (get at: https://morph.so/)
### 4. Configure Terminal Backend
The terminal tool uses **mini-swe-agent** environments. Configure in `.env` or `cli-config.yaml`:
```bash
# Backend: "local", "docker", "singularity", "modal", or "ssh"
TERMINAL_ENV=local # Default: runs on host machine (no isolation)
TERMINAL_ENV=ssh # Remote execution via SSH (agent code stays local)
TERMINAL_ENV=singularity # Recommended for HPC: Apptainer/Singularity containers
TERMINAL_ENV=docker # Isolated Docker containers
TERMINAL_ENV=modal # Cloud execution via Modal
# Container image (for docker/singularity/modal backends)
TERMINAL_DOCKER_IMAGE=python:3.11-slim
TERMINAL_SINGULARITY_IMAGE=docker://python:3.11-slim
TERMINAL_TIMEOUT=60
# SSH backend (for ssh)
TERMINAL_SSH_HOST=my-server.example.com
TERMINAL_SSH_USER=myuser
TERMINAL_SSH_KEY=~/.ssh/id_rsa # Optional, uses ssh-agent if not set
```
**Backend Requirements:**
- **local**: No extra setup (runs directly on your machine, no isolation)
- **ssh**: SSH access to remote machine (great for sandboxing - agent can't touch its own code)
- **singularity**: Requires Apptainer or Singularity installed (common on HPC clusters, no root needed)
- **docker**: Requires Docker installed and user in `docker` group
- **modal**: Requires Modal account (see setup below)
### Singularity/Apptainer Setup (Recommended for HPC)
Singularity/Apptainer provides rootless container execution, ideal for HPC clusters:
```bash
# 1. Verify Apptainer is installed
apptainer --version # or: singularity --version
# 2. Set up cache directories (important for parallel workers)
# Use /scratch if available (HPC), otherwise /tmp
export APPTAINER_CACHEDIR=/scratch/$USER/.apptainer
export APPTAINER_TMPDIR=/scratch/$USER/.apptainer/tmp
mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
# 3. Pre-build SIF image (recommended for parallel batch processing)
# This avoids race conditions when multiple workers start simultaneously
apptainer build $APPTAINER_CACHEDIR/python-nodejs.sif docker://nikolaik/python-nodejs:python3.11-nodejs20
# 4. Configure .env to use the local SIF
TERMINAL_ENV=singularity
TERMINAL_SINGULARITY_IMAGE=/scratch/$USER/.apptainer/python-nodejs.sif
```
**Tip:** The batch scripts in `configs/` automatically handle SIF pre-building if `/scratch` is available.
### Modal Cloud Backend Setup
[Modal](https://modal.com) provides serverless cloud compute for running sandboxed environments at scale.
```bash
# 1. Install Modal and dependencies
pip install modal boto3
# 2. Authenticate with Modal (opens browser)
modal setup
# 3. Set terminal backend to modal in .env
TERMINAL_ENV=modal
```
Modal uses CLI-based authentication (stored in `~/.modal/`), so no API key is needed in `.env`. After running `modal setup`, commands will automatically execute in Modal's cloud sandboxes.
### Browser Tools Setup
Browser tools enable the agent to navigate websites, fill forms, click buttons, and extract content. They use [agent-browser](https://github.com/vercel-labs/agent-browser) CLI with [Browserbase](https://browserbase.com) cloud execution.
```bash
# 1. Install Node.js (if not already installed)
# Use nvm (recommended) or your package manager
# 2. Install agent-browser CLI (choose one option):
npm install -g agent-browser # Option A: Global install (recommended)
npm install # Option B: Local install (uses npx fallback)
# 3. Get Browserbase credentials
# Sign up at https://browserbase.com/ and get your:
# - API Key (from Settings → API Keys)
# - Project ID (from your project dashboard)
# 4. Add to your .env file:
BROWSERBASE_API_KEY=your_api_key_here
BROWSERBASE_PROJECT_ID=your_project_id_here
```
**Available Browser Tools:**
| Tool | Description |
|------|-------------|
| `browser_navigate` | Navigate to a URL |
| `browser_snapshot` | Get text-based page snapshot with element refs |
| `browser_click` | Click an element by ref (e.g., `@e5`) |
| `browser_type` | Type text into an input field |
| `browser_scroll` | Scroll up or down |
| `browser_back` | Go back in browser history |
| `browser_press` | Press a keyboard key (Enter, Tab, etc.) |
| `browser_close` | Close the browser session |
| `browser_get_images` | Get list of images on the page |
**Example Usage:**
```bash
# Use browser tools with web search and vision
python run_agent.py \
--query "Go to amazon.com and find the price of the latest Kindle" \
--enabled_toolsets=browser,web,vision
# Use browser-focused distribution
python batch_runner.py \
--dataset_file=browser_tasks.jsonl \
--distribution=browser_use \
--run_name=browser_run
```
See `.env.example` for all available configuration options including debug settings.
### Skills Tools
Skills are on-demand knowledge documents the agent can load when needed. They follow a **progressive disclosure** pattern to minimize token usage:
```
skills/
├── mlops/ # Category folder
│ ├── axolotl/ # Skill folder
│ │ ├── SKILL.md # Main instructions (required)
│ │ ├── references/ # Additional docs, API specs
│ │ └── templates/ # Output formats, configs
│ └── vllm/
│ └── SKILL.md
```
**Available Skills Tools:**
| Tool | Description |
|------|-------------|
| `skills_categories` | List available skill categories (~50 tokens) |
| `skills_list` | List skills with name + description (~3k tokens for 40 skills) |
| `skill_view` | Load full skill content, tags, and linked files |
**Example Usage:**
```bash
# Use skills tools
python run_agent.py \
--query "What skills do you have for fine-tuning? Show me the axolotl skill." \
--enabled_toolsets=skills
```
**Creating Skills:**
Skills use YAML frontmatter for metadata:
```yaml
---
name: my-skill
description: Brief description shown in skills_list
tags: [tag1, tag2]
related_skills: [other-skill]
version: 1.0.0
---
# Skill Content
Instructions, examples, and guidelines here...
```
Skills can include:
- `references/` - Additional documentation, API specs, examples
- `templates/` - Output formats, config files, boilerplate code
- `scripts/` - Executable helpers (Python, shell scripts)
## Session Logging
Every conversation is automatically logged to `logs/` for debugging and inspection:
```
logs/
├── session_20260201_143052_a1b2c3.json
├── session_20260201_150217_d4e5f6.json
└── ...
```
**Log Format:**
```json
{
"session_id": "20260201_143052_a1b2c3",
"model": "anthropic/claude-sonnet-4",
"session_start": "2026-02-01T14:30:52.123456",
"last_updated": "2026-02-01T14:35:12.789012",
"message_count": 8,
"conversations": [
{"from": "system", "value": "..."},
{"from": "human", "value": "..."},
{"from": "gpt", "value": "..."},
{"from": "tool", "value": "..."}
]
}
```
- **Automatic**: Logs are created and updated automatically after each conversation turn
- **Session ID in Banner**: The CLI displays the session ID in the welcome banner
- **Trajectory Format**: Uses the same format as batch processing for consistency
- **Git Ignored**: `logs/` is in `.gitignore` so logs aren't committed
## Interactive CLI
The CLI provides a rich interactive experience for working with the agent.
### Running the CLI
```bash
# Basic usage
./hermes
# With specific model
./hermes --model "anthropic/claude-sonnet-4"
# With specific toolsets
./hermes --toolsets "web,terminal,skills"
```
### CLI Commands
| Command | Description |
|---------|-------------|
| `/help` | Show available commands |
| `/tools` | List available tools by toolset |
| `/toolsets` | List available toolsets |
| `/model [name]` | Show or change the current model |
| `/prompt [text]` | View/set custom system prompt |
| `/personality [name]` | Set a predefined personality |
| `/clear` | Clear screen and reset conversation |
| `/reset` | Reset conversation only |
| `/history` | Show conversation history |
| `/save` | Save current conversation to file |
| `/config` | Show current configuration |
| `/quit` | Exit the CLI |
### Configuration
Copy `cli-config.yaml.example` to `cli-config.yaml` and customize:
```yaml
# Model settings
model:
default: "anthropic/claude-sonnet-4"
# Terminal backend (local, docker, singularity, modal, or ssh)
terminal:
env_type: "local"
cwd: "." # Use current directory
# Or use SSH for remote execution (keeps agent code isolated)
# terminal:
# env_type: "ssh"
# ssh_host: "my-server.example.com"
# ssh_user: "myuser"
# ssh_key: "~/.ssh/id_rsa"
# cwd: "/home/myuser/project"
# Enable specific toolsets
toolsets:
- all # or: web, terminal, browser, vision, etc.
# Custom personalities (use with /personality command)
agent:
personalities:
helpful: "You are a helpful assistant."
kawaii: "You are a kawaii assistant! Use cute expressions..."
```
### Personalities
Built-in personalities available via `/personality`:
- `helpful`, `concise`, `technical`, `creative`, `teacher`
- `kawaii`, `catgirl`, `pirate`, `shakespeare`, `surfer`
- `noir`, `uwu`, `philosopher`, `hype`
## Toolsets System
The agent uses a toolsets system for organizing and managing tools. All tools must be part of a toolset to be accessible; individual tool selection is not supported. This ensures consistent and logical grouping of capabilities.
### Key Concepts
- **Toolsets**: Logical groups of tools for specific use cases (e.g., "research", "development", "debugging")
- **Composition**: Toolsets can include other toolsets for powerful combinations
- **Custom Toolsets**: Create your own toolsets at runtime or by editing `toolsets.py`
- **Toolset-Only Access**: Tools are only accessible through toolsets, not individually
### Available Toolsets
See `toolsets.py` for the complete list of predefined toolsets including:
- Basic toolsets (web, terminal, vision, creative, reasoning)
- Composite toolsets (research, development, analysis, etc.)
- Scenario-specific toolsets (debugging, documentation, API testing, etc.)
- Special toolsets (safe mode without terminal, minimal, offline)
### Using Toolsets
```bash
# Use a predefined toolset
python run_agent.py --enabled_toolsets=research --query "Find latest AI papers"
# Combine multiple toolsets
python run_agent.py --enabled_toolsets=web,vision --query "Analyze this website"
# Enable all toolsets explicitly (same as omitting the flag)
python run_agent.py --enabled_toolsets=all --query "Do web research and run commands if helpful"
# Safe mode (no terminal access)
python run_agent.py --enabled_toolsets=safe --query "Help without running commands"
# List all available toolsets and tools
python run_agent.py --list_tools
```
See `toolsets.py` for the complete list of available toolsets and how to create custom ones.
## Basic Usage
### Default (all tools enabled)
```bash
# Uses OpenRouter by default - just set OPENROUTER_API_KEY in .env
python run_agent.py \
--query "search up the latest docs on jit in python 3.13 and write me basic example that's not in their docs. profile its perf" \
--max_turns 20 \
--model anthropic/claude-sonnet-4-20250514
```
### With specific toolset
```bash
python run_agent.py \
--query "Debug this Python error" \
--enabled_toolsets=debugging \
--model anthropic/claude-sonnet-4-20250514
```
### Python API
```python
from run_agent import AIAgent
# Uses OpenRouter by default (reads OPENROUTER_API_KEY from .env)
agent = AIAgent(
model="anthropic/claude-sonnet-4-20250514",
enabled_toolsets=["research"]
)
response = agent.chat("Find information about quantum computing")
# Create custom toolset at runtime
from toolsets import create_custom_toolset
create_custom_toolset(
name="my_tools",
description="My custom toolkit",
tools=["web_search"],
includes=["terminal", "vision"]
)
agent = AIAgent(enabled_toolsets=["my_tools"])
```
## Batch Processing
Process multiple prompts from a dataset in parallel with automatic checkpointing and statistics tracking:
```bash
# Basic batch processing
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--run_name=my_run
# With specific distribution
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--run_name=image_run \
--distribution=image_gen \
--num_workers=4
```
**Key Features:**
- Parallel processing with configurable workers
- Toolset distributions for varied data generation
- Automatic checkpointing and resume capability
- Combined output in `data/<run_name>/trajectories.jsonl`
- Tool usage statistics and success rates
Use `--list_distributions` to see available toolset distributions for varied data generation.
### Trajectory Compression
Post-process trajectories to fit within token budgets for training:
```bash
# Compress a directory of JSONL files
python trajectory_compressor.py --input=data/my_run
# Compress a single JSONL file
python trajectory_compressor.py --input=data/trajectories.jsonl
# Compress a 15% sample (useful for creating smaller training sets)
python trajectory_compressor.py --input=data/trajectories.jsonl --sample_percent=15
# Custom output and token target
python trajectory_compressor.py \
--input=data/trajectories.jsonl \
--output=data/compressed.jsonl \
--target_max_tokens=16000
```
**Features:**
- Protects first turns (system, human, first GPT response, first tool call)
- Protects last N turns (configurable)
- Summarizes middle turns using LLM to fit target token budget
- Supports both directory and single file input
- Optional random sampling with `--sample_percent`
- Configurable via `configs/trajectory_compression.yaml`
### Ephemeral System Prompts
The ephemeral system prompt feature allows you to guide the model's behavior during batch processing **without** saving that prompt to the training dataset trajectories. This is useful for:
- Guiding model behavior during data collection
- Adding task-specific instructions
- Keeping saved trajectories clean and focused on tool-calling format
**Example:**
```bash
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=10 \
--run_name=my_run \
--ephemeral_system_prompt="You are a helpful assistant focused on image generation."
```
The ephemeral prompt will influence the model's behavior during execution, but **only the standard tool-calling system prompt** will be saved in the trajectory files.
## Command Line Arguments
**Single Agent (`run_agent.py`):**
- `--query`: The question or task for the agent
- `--model`: Model to use (default: claude-opus-4-20250514)
- `--api_key`: API key for authentication
- `--base_url`: API endpoint URL
- `--max_turns`: Maximum number of tool-calling iterations
- `--enabled_toolsets`: Comma-separated list of toolsets to enable. Use `all` (or `*`) to enable everything. If omitted, all toolsets are enabled by default.
- `--disabled_toolsets`: Comma-separated list of toolsets to disable
- `--list_tools`: List all available toolsets and tools
- `--save_trajectories`: Save conversation trajectories to JSONL files
**Batch Processing (`batch_runner.py`):**
- `--dataset_file`: Path to JSONL file with prompts
- `--batch_size`: Number of prompts per batch
- `--run_name`: Name for this run (for output/checkpointing)
- `--distribution`: Toolset distribution to use (default: "default")
- `--num_workers`: Number of parallel workers (default: 4)
- `--resume`: Resume from checkpoint if interrupted
- `--ephemeral_system_prompt`: System prompt used during execution but NOT saved to trajectories
- `--list_distributions`: List available toolset distributions
## Environment Variables
All environment variables can be configured in the `.env` file (copy from `.env.example`).
**LLM Provider (OpenRouter):**
- `OPENROUTER_API_KEY`: Primary LLM access via OpenRouter (supports Claude, GPT-4, Gemini, etc.)
- `LLM_MODEL`: Default model (e.g., `anthropic/claude-sonnet-4`, `openai/gpt-4o`)
**Tool API Keys:**
- `FIRECRAWL_API_KEY`: Web tools (search, extract, crawl)
- `NOUS_API_KEY`: Vision and reasoning tools
- `FAL_KEY`: Image generation tools
**Terminal Tool Configuration (mini-swe-agent backend):**
- `TERMINAL_ENV`: Backend type - `local`, `docker`, `singularity`, `modal`, or `ssh` (default: `local`)
- `TERMINAL_DOCKER_IMAGE`: Docker image for docker backend (default: `python:3.11-slim`)
- `TERMINAL_SINGULARITY_IMAGE`: Singularity/Apptainer image (can be `docker://...` URL or local `.sif` path)
- `TERMINAL_TIMEOUT`: Command timeout in seconds (default: `60`)
- `TERMINAL_LIFETIME_SECONDS`: Cleanup inactive environments after this time (default: `300`)
- `TERMINAL_CWD`: Working directory inside containers (default: `/tmp`)
- `TERMINAL_SCRATCH_DIR`: Custom scratch directory for sandbox storage (optional, auto-detects `/scratch`)
- `SUDO_PASSWORD`: Enable sudo commands by piping password via `sudo -S` (works with all backends)
- If unset in CLI mode, you'll be prompted interactively when sudo is needed (45s timeout)
**SSH Backend Configuration (for remote execution):**
- `TERMINAL_SSH_HOST`: Remote server hostname or IP
- `TERMINAL_SSH_USER`: SSH username
- `TERMINAL_SSH_PORT`: SSH port (default: `22`)
- `TERMINAL_SSH_KEY`: Path to SSH private key (optional, uses ssh-agent if not set)
**Browser Tool Configuration (agent-browser + Browserbase):**
- `BROWSERBASE_API_KEY`: Browserbase API key for cloud browser execution
- `BROWSERBASE_PROJECT_ID`: Browserbase project ID
- `BROWSER_SESSION_TIMEOUT`: Session timeout in seconds (default: `300`)
**Legacy Hecate Terminal Backend (optional):**
- `MORPH_API_KEY`: For Hecate/MorphCloud terminal backend
- `HECATE_VM_LIFETIME_SECONDS`: VM lifetime (default: 300)
- `HECATE_DEFAULT_SNAPSHOT_ID`: Default snapshot (default: snapshot_p5294qxt)
**Debug Options:**
- `WEB_TOOLS_DEBUG`, `VISION_TOOLS_DEBUG`, `MOA_TOOLS_DEBUG`, `IMAGE_TOOLS_DEBUG`: Enable debug logging
## Key Files
| File | Purpose |
|------|---------|
| `hermes` | CLI launcher script (run with `./hermes`) |
| `cli.py` | Interactive CLI implementation |
| `cli-config.yaml` | CLI configuration (copy from `.example`) |
| `run_agent.py` | Main agent runner - single query execution |
| `batch_runner.py` | Parallel batch processing with checkpointing |
| `model_tools.py` | Core tool definitions and handlers |
| `toolsets.py` | Toolset definitions and composition |
| `toolset_distributions.py` | Probability distributions for data generation |
| `trajectory_compressor.py` | Post-process trajectories for training |
| `tools/` | Individual tool implementations |
| `tools/skills_tool.py` | Skills system with progressive disclosure |
| `skills/` | On-demand knowledge documents |
| `docs/` | Documentation |
| `configs/` | Example batch run scripts |
# Atropos Integrations & RL Training
## Nomad Setup
Follow this: https://developer.hashicorp.com/nomad/docs/deploy
## Atropos dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[atropos]'

View File

@@ -0,0 +1,70 @@
README.md
atropos_compatible_agent.py
batch_runner.py
local_server.py
model_tools.py
pyproject.toml
run_agent.py
toolset_distributions.py
toolsets.py
trajectory_compressor.py
atropos/__init__.py
atropos/sandbox_server.py
atropos/agent/__init__.py
atropos/agent/atropos_agent.py
atropos/api/__init__.py
atropos/api/tool_executor_server.py
atropos/api/tool_server.py
atropos/backends/__init__.py
atropos/backends/base.py
atropos/backends/modal_backend.py
atropos/backends/nomad_backend.py
atropos/envs/__init__.py
atropos/envs/agent_env.py
atropos/envs/hermes_compat_test_env.py
atropos/envs/sandbox_terminal_smoke_env.py
atropos/envs/swe_smith_oracle_env.py
atropos/envs/test_env.py
atropos/envs/toolserver_smoke_env.py
atropos/nomad/__init__.py
atropos/nomad/client.py
atropos/slots/__init__.py
atropos/slots/executor.py
atropos/slots/pool.py
atropos/slots/slot.py
atropos/terminal/__init__.py
atropos/terminal/asciinema_stream.py
atropos/tools/__init__.py
atropos/tools/base.py
atropos/tools/build_registry.py
atropos/tools/hermes_external_tools.py
atropos/tools/sandbox_stubs.py
atropos/tools/terminal_stateful_tool.py
atropos/tools/tmux_tool.py
atropos/tools/tool_executor.py
atropos/tools/toolset_resolver.py
hermes_agent.egg-info/PKG-INFO
hermes_agent.egg-info/SOURCES.txt
hermes_agent.egg-info/dependency_links.txt
hermes_agent.egg-info/entry_points.txt
hermes_agent.egg-info/requires.txt
hermes_agent.egg-info/top_level.txt
tests/test_batch_runner.py
tests/test_checkpoint_resumption.py
tests/test_modal_integration.py
tests/test_modal_stress.py
tests/test_modal_terminal.py
tests/test_nous_api_limits.py
tests/test_nous_api_pattern.py
tests/test_temperature_fix.py
tests/test_tool_call_parsing.py
tests/test_web_tools.py
tools/__init__.py
tools/browser_tool.py
tools/image_generation_tool.py
tools/mixture_of_agents_tool.py
tools/skills_tool.py
tools/terminal_hecate.py
tools/terminal_tool.py
tools/vision_tools.py
tools/web_tools.py

View File

@@ -0,0 +1 @@

View File

@@ -0,0 +1,4 @@
[console_scripts]
hermes-agent = run_agent:main
hermes-atropos-sandbox-smoke = atropos.envs.sandbox_terminal_smoke_env:SandboxTerminalSmokeEnv.cli
hermes-atropos-toolserver-smoke = atropos.envs.toolserver_smoke_env:ToolServerSmokeEnv.cli

View File

@@ -0,0 +1,31 @@
openai
python-dotenv
fire
httpx
rich
tenacity
pyyaml
prompt_toolkit
requests
jinja2
pydantic>=2.0
firecrawl-py
fal-client
litellm>=1.75.5
typer
platformdirs
[atropos]
atroposlib @ git+https://github.com/NousResearch/atropos.git
aiohttp
fastapi
uvicorn
pyte
[dev]
pytest
pytest-asyncio
[modal]
modal
boto3

View File

@@ -0,0 +1,10 @@
atropos
atropos_compatible_agent
batch_runner
local_server
model_tools
run_agent
tools
toolset_distributions
toolsets
trajectory_compressor

353
local_server.py Normal file
View File

@@ -0,0 +1,353 @@
"""
Local OpenAI-compatible server implementation for Hermes-Agent (Atropos integration).
Extends the Atropos APIServer to work with local OpenAI-compatible APIs (e.g. vLLM, SGLang),
providing tokens_and_logprobs_completion support via client-side tokenization.
"""
import asyncio
import os
import warnings
from typing import Any, List, Optional
import openai
from openai.types.chat.chat_completion import ChatCompletion
from openai.types.completion import Completion
from atroposlib.envs.server_handling.server_baseline import (
APIServer,
APIServerConfig,
ReasoningConfig,
)
class LocalServer(APIServer):
"""
OpenAI-compatible local server with tokens_and_logprobs support.
Uses an OpenAI-compatible API (typically at a /v1 endpoint) and handles
token extraction via client-side tokenization.
Note: Many local servers don't return per-token logprobs in the standard API,
so this implementation uses placeholder logprobs (0.0) for PoC purposes.
For production training, use vLLM/SGLang servers that return real logprobs.
"""
def __init__(
self,
config: APIServerConfig,
tokenizer: Optional[Any] = None,
tokenizer_name: str = "gpt2",
reasoning_config: Optional[ReasoningConfig] = None,
):
"""
Initialize the local server.
Args:
config: Server configuration
tokenizer: Pre-initialized tokenizer (optional)
tokenizer_name: Name of tokenizer to load if tokenizer not provided
reasoning_config: Optional reasoning configuration
"""
# Build the OpenAI client pointing to the server's /v1 endpoint
base_url = config.base_url
if base_url and not base_url.endswith("/v1"):
base_url = f"{base_url.rstrip('/')}/v1"
self.openai = openai.AsyncClient(
api_key=config.api_key or "local", # Local servers often ignore auth
base_url=base_url,
timeout=config.timeout,
)
# Initialize tokenizer
if tokenizer is not None:
self.tokenizer = tokenizer
else:
try:
from transformers import AutoTokenizer # type: ignore
except ModuleNotFoundError as exc:
raise ModuleNotFoundError(
"Missing optional dependency 'transformers'. Pass a tokenizer instance to LocalServer, "
"or install transformers to enable `tokenizer_name` auto-loading."
) from exc
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
# Add a simple chat template if the tokenizer doesn't have one
# This is needed for ManagedServer's chat_completion to work
if not hasattr(self.tokenizer, 'chat_template') or self.tokenizer.chat_template is None:
# Simple ChatML-style template
self.tokenizer.chat_template = (
"{% for message in messages %}"
"{% if message['role'] == 'system' %}<|im_start|>system\n{{ message['content'] }}<|im_end|>\n"
"{% elif message['role'] == 'user' %}<|im_start|>user\n{{ message['content'] }}<|im_end|>\n"
"{% elif message['role'] == 'assistant' %}<|im_start|>assistant\n{{ message['content'] }}<|im_end|>\n"
"{% endif %}"
"{% endfor %}"
"{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)
super().__init__(config, reasoning_config=reasoning_config)
# Local servers are treated as always-healthy unless a status task is enabled.
self.server_healthy = True
@classmethod
def from_env(
cls,
base_url: Optional[str] = None,
model: Optional[str] = None,
api_key: Optional[str] = None,
tokenizer_name: str = "gpt2",
**kwargs,
) -> "LocalServer":
"""
Create a LocalServer from environment variables (or explicit overrides).
Env vars (checked in order):
- base URL: ATROPOS_SERVER_BASE_URL, OPENAI_BASE_URL, LOCAL_LLM_BASE_URL, LLM_BASE_URL
- model: ATROPOS_SERVER_MODEL, LLM_MODEL, LOCAL_LLM_MODEL
- api key: ATROPOS_SERVER_API_KEY, OPENAI_API_KEY, LOCAL_LLM_API_KEY, LLM_API_KEY
"""
from dotenv import load_dotenv
load_dotenv()
base_url = (
base_url
or os.getenv("ATROPOS_SERVER_BASE_URL")
or os.getenv("OPENAI_BASE_URL")
or os.getenv("LOCAL_LLM_BASE_URL")
or os.getenv("LLM_BASE_URL")
or "http://localhost:11434"
)
model = (
model
or os.getenv("ATROPOS_SERVER_MODEL")
or os.getenv("LLM_MODEL")
or os.getenv("LOCAL_LLM_MODEL")
or "hermes3:8b"
)
api_key = (
api_key
or os.getenv("ATROPOS_SERVER_API_KEY")
or os.getenv("OPENAI_API_KEY")
or os.getenv("LOCAL_LLM_API_KEY")
or os.getenv("LLM_API_KEY")
)
config = APIServerConfig(
model_name=model,
base_url=base_url,
api_key=api_key or "local",
timeout=kwargs.get("timeout", 120),
num_max_requests_at_once=kwargs.get("num_max_requests_at_once", 4),
num_requests_for_eval=kwargs.get("num_requests_for_eval", 4),
health_check=False, # Local dev servers often lack /health
)
return cls(config, tokenizer_name=tokenizer_name)
async def check_server_status_task(self, chat_completion: bool = True):
"""
Check if the server is healthy.
For local development, we generally assume the server is healthy.
"""
while True:
try:
# Simple health check via a minimal completion
if chat_completion:
await self.openai.chat.completions.create(
model=self.config.model_name,
messages=[{"role": "user", "content": "hi"}],
max_tokens=1,
)
else:
await self.openai.completions.create(
model=self.config.model_name,
prompt="hi",
max_tokens=1,
)
self.server_healthy = True
except Exception:
self.server_healthy = False
await asyncio.sleep(5)
async def _chat_completion_wrapper(self, **kwargs) -> ChatCompletion:
"""
Wrapper for chat completion using an OpenAI-compatible API.
"""
assert kwargs.get("model") is not None, "Model is required!"
assert kwargs.get("messages") is not None, "Messages are required!"
n = kwargs.get("n", 1)
# Some OpenAI-compatible servers don't support n > 1, so we make multiple requests.
if n > 1:
completion_list = await asyncio.gather(
*[self.openai.chat.completions.create(**{**kwargs, "n": 1}) for _ in range(n)]
)
# Merge completions
completions = completion_list[0]
for c in completion_list[1:]:
for choice in c.choices:
choice.index = len(completions.choices)
completions.choices.append(choice)
return completions
else:
return await self.openai.chat.completions.create(**kwargs)
async def _completion_wrapper(self, **kwargs) -> Completion:
"""
Wrapper for completion using an OpenAI-compatible API.
"""
assert kwargs.get("model") is not None, "Model is required!"
assert kwargs.get("prompt") is not None, "Prompt is required!"
n = kwargs.get("n", 1)
# Some OpenAI-compatible servers don't support n > 1.
if n > 1:
completion_list = await asyncio.gather(
*[self.openai.completions.create(**{**kwargs, "n": 1}) for _ in range(n)]
)
completions = completion_list[0]
for c in completion_list[1:]:
for choice in c.choices:
choice.index = len(completions.choices)
completions.choices.append(choice)
return completions
else:
return await self.openai.completions.create(**kwargs)
async def _tokens_and_logprobs_completion_wrapper(
self, **kwargs
) -> tuple[List[int], List[List[int]], List[List[float]], List[str]]:
"""
Wrapper for tokens and logprobs completion.
Returns:
Tuple of (prompt_tokens, output_tokens_list, output_logprobs_list, finish_reasons)
Note: Many OpenAI-compatible local servers don't return per-token logprobs,
so we use placeholder logprobs (0.0). For real training, use vLLM/SGLang.
"""
model = kwargs.get("model")
assert model is not None, "Model is required!"
# Handle input_ids (from ManagedServer) or prompt
if "input_ids" in kwargs:
prompt_tokens = kwargs.pop("input_ids")
prompt = self.tokenizer.decode(prompt_tokens)
kwargs.pop("prompt", None)
else:
prompt = kwargs.pop("prompt", "")
prompt_tokens = self.tokenizer.encode(prompt, add_special_tokens=True)
n = kwargs.pop("n", 1)
max_tokens = kwargs.pop("max_tokens", 256)
temperature = kwargs.pop("temperature", 0.7)
stop = kwargs.pop("stop", None)
# Make completion requests
completions = []
for _ in range(n):
try:
response = await self.openai.completions.create(
model=model,
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
stop=stop,
)
completions.append(response)
except Exception as e:
# Fallback to chat completion if completion endpoint not supported
warnings.warn(f"Completion API failed, trying chat: {e}")
response = await self.openai.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
temperature=temperature,
stop=stop,
)
# Convert to completion-like response
completions.append(response)
output_tokens_list = []
output_logprobs_list = []
finish_reasons = []
for completion in completions:
# Extract text from response
if hasattr(completion.choices[0], "text"):
# Completion API response
text = completion.choices[0].text
finish_reason = completion.choices[0].finish_reason or "stop"
else:
# Chat completion API response
text = completion.choices[0].message.content or ""
finish_reason = completion.choices[0].finish_reason or "stop"
# Tokenize output
output_tokens = self.tokenizer.encode(text, add_special_tokens=False)
# Placeholder logprobs (many local servers don't provide per-token logprobs).
# In production, use vLLM/SGLang which return real logprobs
output_logprobs = [0.0] * len(output_tokens)
output_tokens_list.append(output_tokens)
output_logprobs_list.append(output_logprobs)
finish_reasons.append(finish_reason)
return prompt_tokens, output_tokens_list, output_logprobs_list, finish_reasons
def managed_server(self, tokenizer=None, track_tree: bool = False):
"""
Create a ManagedServer context manager for this server.
Args:
tokenizer: Optional tokenizer override
track_tree: Whether to maintain tree structure for multi-turn
Returns:
ManagedServer context manager
"""
from atroposlib.envs.server_handling.managed_server import ManagedServer
return ManagedServerContext(
self,
tokenizer=tokenizer or self.tokenizer,
track_tree=track_tree,
)
class ManagedServerContext:
"""
Context manager wrapper for ManagedServer.
Usage:
async with server.managed_server(tokenizer=tokenizer) as managed:
response = await managed.chat_completion(...)
state = managed.get_state()
"""
def __init__(self, server: LocalServer, tokenizer, track_tree: bool = False):
self.server = server
self.tokenizer = tokenizer
self.track_tree = track_tree
self.managed = None
async def __aenter__(self):
from atroposlib.envs.server_handling.managed_server import ManagedServer
self.managed = ManagedServer(
self.server,
tokenizer=self.tokenizer,
track_tree=self.track_tree,
)
return self.managed
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.managed:
self.managed.reset()
return False

View File

@@ -0,0 +1,61 @@
# Active Context
## Current Focus
Tinker RL training integration - pipeline fully wired up, waiting on Tinker billing to test.
## Recently Completed (Feb 9, 2026)
### Tinker RL Training Integration
Created a complete agent training pipeline using Tinker (Thinking Machines) + Atropos:
**New Files Created:**
1. `tinker-atropos/tinker_atropos/environments/gsm8k_agent.py` - Agent GSM8k environment with:
- Python REPL tool calling (Hermes-style `<tool_call>` format)
- Multi-step agent loop within `collect_trajectories()`
- Math answer verification via `math_verify`
- Subprocess-based Python execution
- WandB metrics (percent_correct, tool_use_rate)
2. `tinker-atropos/configs/gsm8k_agent.yaml` - Config for Qwen3-4B-Instruct training
**Dependencies Updated:**
- `pyproject.toml` `[atropos]` extra now includes: tinker SDK, torch, wandb, math-verify
- Installed: tinker 0.12.0, tinker-atropos 0.1.0, torch (CPU)
**README Updated:**
- Added comprehensive "RL Training with Tinker" section with architecture diagram, quick start, config docs
- Added TINKER_API_KEY and WANDB_API_KEY to optional keys table
**Verified Working:**
- Tinker SDK connection ✅
- All imports (tinker, tinker_atropos, trainer, environment) ✅
- Python REPL execution + tool call parsing ✅
- Math verification ✅
- Atropos run-api (port 8000) ✅
- Tinker trainer starts, loads config, creates inference server (port 8001) ✅
**Blocked:** Tinker billing (402 error) - user's payment didn't process (possibly regional card issue)
### Main Branch Merge (Feb 9, 2026)
Merged `origin/main` into `atropos-integrations` - 22,560 lines, 79 files, 5 conflicts resolved.
### Modal Backend (Feb 8, 2026)
Merged modal-integration branch, working with Modal Sandboxes.
### Singularity/Apptainer (Feb 6, 2026)
Completed and tested.
## Architecture: Training Pipeline
```
Terminal 1: run-api (port 8000) - Atropos Rollout API
Terminal 2: launch_training.py (port 8001) - Tinker Trainer + FastAPI inference
Terminal 3: gsm8k_agent.py serve - Environment (generates trajectories)
```
The agent env gets math problems → model calls Python REPL tool → scores answer → sends to Atropos → Tinker does LoRA training → updates sampling weights → repeat.
## Next Steps
- [ ] Resolve Tinker billing to test full training loop
- [ ] Run GSM8k agent training for ~20 steps (proof of concept)
- [ ] Monitor WandB for reward improvement
- [ ] Graduate to more complex agent envs (SWE tasks with Modal backend)

View File

@@ -0,0 +1,55 @@
# Product Context: Hermes-Agent
## Why This Project Exists
Hermes-Agent addresses several key challenges in the AI agent space:
1. **Unified Tool Interface** - Provides a clean, consistent interface for LLMs to use various tools (web, terminal, browser, vision, etc.) without requiring custom integration for each model provider.
2. **Training Data Generation** - Enables efficient generation of high-quality tool-calling trajectories for fine-tuning LLMs, with features like batch processing, checkpointing, and trajectory compression.
3. **Flexible Deployment** - Supports multiple execution environments (local, Docker, Singularity, Modal, SSH) to accommodate different security and isolation requirements.
4. **Developer Experience** - Offers a beautiful, interactive CLI with kawaii-style feedback that makes working with AI agents enjoyable.
## Problems It Solves
### For AI Researchers
- **Data Generation at Scale**: Parallel batch processing with content-based checkpointing for fault tolerance
- **Clean Trajectories**: Trajectory compression to fit token budgets while preserving important information
- **Toolset Distributions**: Probability-based tool selection for varied training data
### For Developers
- **Tool Orchestration**: Logical grouping of tools into toolsets (research, development, debugging, etc.)
- **Session Persistence**: Conversation history and session logging for debugging
- **Multi-Model Support**: Works with any OpenAI-compatible API (OpenRouter, local models, etc.)
### For MLOps
- **Skills System**: On-demand knowledge documents for specific tools/frameworks (Axolotl, vLLM, TRL, etc.)
- **Sandboxed Execution**: Terminal commands can run in isolated environments (Docker, Singularity, Modal)
- **Configurable Backends**: Easy switching between local and cloud execution
## How It Should Work
### User Flow (CLI)
1. User launches `./hermes`
2. Beautiful welcome banner displays with caduceus logo, model info, and available tools
3. User types a natural language request
4. Agent processes request, potentially calling tools with animated feedback
5. Agent responds with results, conversation continues
6. Session is automatically logged for debugging
### User Flow (Batch Processing)
1. User prepares JSONL file with prompts
2. Runs `batch_runner.py` with distribution and worker count
3. System processes prompts in parallel, saves checkpoints
4. Completed trajectories saved to `data/<run_name>/trajectories.jsonl`
5. Optional: compress trajectories with `trajectory_compressor.py`
## User Experience Goals
- **Delightful Interaction**: Kawaii ASCII faces, animated spinners, cute messages
- **Informative Feedback**: Clear progress indication during tool execution
- **Configurable Personalities**: From "helpful" to "pirate" to "Shakespeare"
- **Easy Configuration**: YAML config file + environment variables + CLI flags
- **Graceful Degradation**: Missing tools/APIs don't break the system, just disable features

96
memory-bank/progress.md Normal file
View File

@@ -0,0 +1,96 @@
# Progress
## Completed Features
### ✅ Modal Backend Integration (Feb 8, 2026 - MERGED & TESTED)
Merged the `modal-integration` branch and fixed integration issues.
**What Works:**
- `ModalToolBackend` implements full `ToolBackend` interface (start, stop, acquire, release, execute_batch)
- Modal Sandboxes used for long-lived containers (not Functions)
- `sandbox.exec()` for direct command execution (no HTTP server needed)
- Slot-based multiplexing matching Nomad pattern
- Multi-profile support (`ModalSandboxConfig`, `_ModalMultiProfileManager`)
- YAML profile loading (`modal_profiles.yaml`)
- `AgentEnvConfig` fields for all Modal settings (`--env.modal_*`)
- `create_tool_backend()` supports `tool_pool_mode="modal"`
- Terminal tool (`tools/terminal_tool.py`) native Modal integration with pool management
- Named sandbox recovery via `Sandbox.from_name()`
- Auto-scaling sandbox pool per profile
- Artifact helpers (read, list, archive)
**CLI Usage:**
```bash
# Atropos backend
python -m atropos.envs.swe_smith_oracle_env process \
--env.tool_pool_mode modal \
--env.modal_image python:3.11
# Terminal tool
TERMINAL_ENV=modal ./hermes
```
**Files Modified/Created:**
- `atropos/backends/modal_backend.py` - Full implementation (~1200 lines)
- `atropos/backends/__init__.py` - `create_tool_backend()` updated
- `atropos/envs/agent_env.py` - 15 Modal config fields added
- `tools/terminal_tool.py` - Native Modal sandbox pool
- `docs/MODAL_BACKEND.md` - Documentation
- `modal_profiles.yaml.example` - Example profiles
- `tests/test_modal_integration.py` - Integration tests
- `tests/test_modal_stress.py` - Stress tests
- `tests/test_modal_terminal.py` - Terminal tool tests
### ✅ Singularity/Apptainer Sandbox Integration (Feb 6, 2026 - FULLY TESTED)
Adapted the Atropos sandbox environment from Docker to Singularity/Apptainer for HPC clusters.
**What Works:**
- `create_sandbox_job()` supports both `driver="docker"` and `driver="singularity"`
- SlotPoolConfig and NomadBackendConfig propagate driver settings
- Singularity container runs sandbox_server.py via Nomad's raw_exec driver
- All sandbox operations work: bash execution, file read/write
- **CLI arguments** `--env.driver` and `--env.singularity_image` for AgentEnvConfig
- **Static port binding** for Singularity (ReservedPorts vs DynamicPorts)
### ✅ Memory Bank Initialized (Feb 5, 2026)
Set up project documentation structure for context persistence.
## In Progress
None currently.
## Known Issues
- Modal backend not yet live-tested with actual Modal cloud credentials
- `bwrap_available: false` in Singularity containers
- Health check timing - may need longer wait for container startup on slower systems
## What's Left to Build
### Modal Backend
- [ ] Live test with Modal credentials on actual cloud
- [ ] Test multi-profile GPU workflows
- [ ] Test sandbox recovery after restart
- [ ] Integrate with SWE-smith-oracle env for GRPO training loop
- [ ] Performance benchmarking vs Nomad backend
### HPC Deployment
- [ ] Test on actual HPC cluster with Slurm/PBS integration
- [ ] Document cluster-specific deployment procedures
### Documentation
- [ ] Add Singularity deployment to README
- [ ] Create HPC deployment skill in skills/mlops/
## Evolution of Decisions
### Container Runtime Selection
- **Initial**: Docker-only via Nomad docker driver
- **Problem**: HPC clusters don't allow Docker without sudo
- **Solution**: Added Singularity/Apptainer support via raw_exec driver
- **Result**: Both runtimes now supported with same API
### Modal Backend Architecture
- **Initial**: Stub placeholder raising RuntimeError
- **Investigation**: Modal Sandboxes vs Functions - chose Sandboxes for long-lived containers
- **Design**: Direct `sandbox.exec()` instead of HTTP/sandbox_server.py (simpler, no networking needed)
- **Implementation**: Merged from `modal-integration` branch, fixed agent_env.py config fields
- **Result**: Three backends now supported: Nomad/Docker, Nomad/Singularity, Modal

View File

@@ -0,0 +1,44 @@
# Project Brief: Hermes-Agent
## Overview
Hermes-Agent is an AI agent harness for LLMs with advanced tool-calling capabilities, featuring a flexible toolsets system for organizing and managing tools. Named after Hermes, the Greek messenger god, it serves as a bridge between human intent and AI-powered task execution.
## Core Requirements
### Primary Goals
1. **Interactive CLI Experience** - Beautiful terminal interface with animated feedback, personalities, and session management
2. **Flexible Tool System** - Modular tools organized into logical toolsets for different use cases
3. **Batch Processing** - Process multiple prompts in parallel with checkpointing and statistics
4. **Multi-Backend Support** - Support for local, Docker, Singularity, Modal, and SSH terminal backends
5. **Training Data Generation** - Save conversation trajectories in formats suitable for LLM fine-tuning
### Target Users
- AI researchers generating training data
- Developers needing an AI assistant with tool access
- MLOps practitioners automating workflows
- Anyone needing a powerful CLI-based AI agent
## Scope
### In Scope
- Interactive CLI with rich formatting and kawaii-style feedback
- Web tools (search, extract, crawl via Firecrawl)
- Terminal tools (command execution across multiple backends)
- Browser automation (via agent-browser + Browserbase)
- Vision tools (image analysis)
- Image generation (FLUX via FAL.ai)
- Mixture-of-Agents reasoning
- Skills system for on-demand knowledge
- Batch processing with parallel workers
- Trajectory compression for training
### Out of Scope (Current)
- Proactive suggestions (agent only runs on request)
- Clipboard integration (no local system access)
- Real-time streaming of thinking/reasoning (deferred)
## Success Metrics
- Clean, maintainable tool architecture
- Reliable tool execution with proper error handling
- Efficient context management for long conversations
- High-quality trajectory data for training

View File

@@ -0,0 +1,191 @@
# System Patterns: Hermes-Agent
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ CLI (cli.py) │
│ - Rich welcome banner with caduceus │
│ - prompt_toolkit for input with history │
│ - Kawaii-style feedback and personalities │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ AIAgent (run_agent.py) │
│ - Conversation loop with tool calling │
│ - KawaiiSpinner for animated feedback │
│ - Retry logic with exponential backoff │
│ - Session logging to logs/ directory │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Tool Routing (model_tools.py) │
│ - get_tool_definitions() - returns tools for API calls │
│ - handle_function_call() - dispatches to tool handlers │
│ - Toolset filtering (enabled/disabled) │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Web Tools │ │ Terminal │ │ Browser │
│ (Firecrawl)│ │ (mini-swe)│ │(agent-brw)│
└───────────┘ └───────────┘ └───────────┘
│ │ │
└─────────────────┼─────────────────┘
┌───────────────┐
│ Toolsets │
│ (toolsets.py)│
│ Composition │
└───────────────┘
```
## Key Design Patterns
### 1. Toolset Composition Pattern
Toolsets can include other toolsets, allowing flexible composition:
```python
TOOLSETS = {
"web": {"tools": ["web_search", "web_extract"], "includes": []},
"debugging": {"tools": ["terminal"], "includes": ["web"]},
"full_stack": {"tools": [], "includes": ["web", "terminal", "vision", "browser"]}
}
```
Resolution is recursive with cycle detection.
### 2. Graceful Degradation Pattern
Each tool module has a `check_*_requirements()` function:
- Tools are only loaded if requirements are met
- Missing API keys disable tools, not crash the system
- Import errors are caught and tools marked unavailable
```python
try:
from tools.web_tools import web_search_tool, check_firecrawl_api_key
except ModuleNotFoundError:
web_search_tool = None
def check_firecrawl_api_key(): return False
```
### 3. Session Isolation Pattern (task_id)
Stateful tools (terminal, browser) use `task_id` to isolate concurrent sessions:
- Each batch worker gets unique task_id
- VMs and browser sessions are tracked per task_id
- Cleanup functions release resources: `cleanup_vm(task_id)`, `cleanup_browser(task_id)`
### 4. Trajectory Format Pattern
Conversations are saved in ShareGPT format for training:
```json
{"from": "system", "value": "System prompt with <tools>...</tools>"}
{"from": "human", "value": "User message"}
{"from": "gpt", "value": "<think>reasoning</think>\n<tool_call>{...}</tool_call>"}
{"from": "tool", "value": "<tool_response>{...}</tool_response>"}
{"from": "gpt", "value": "Final response"}
```
### 5. Ephemeral System Prompt Pattern
Guide model behavior during data collection without saving to trajectories:
- `ephemeral_system_prompt` influences execution
- Only standard tool-calling system prompt saved to trajectories
- Keeps training data clean
### 6. Retry with Validation Pattern
The agent validates responses before accepting:
- Check tool names against `valid_tool_names` set
- Validate JSON arguments can be parsed
- Check for content after `<think>` blocks
- Roll back to last valid state on persistent failures
## Component Relationships
### AIAgent Class
- Central orchestrator for conversations
- Manages conversation history
- Calls OpenAI-compatible API
- Routes tool calls to handlers
- Provides animated feedback (KawaiiSpinner)
### Tool Modules (tools/*.py)
- Self-contained tool implementations
- Export: handler function + check function + schema
- Return JSON strings (never raw dicts)
- Accept optional `task_id` for stateful tools
### Toolsets System (toolsets.py)
- Defines logical groupings of tools
- Supports composition via `includes`
- `resolve_toolset()` recursively resolves all tools
- `validate_toolset()` checks if name is valid
### Model Tools (model_tools.py)
- Aggregates all tool definitions
- Routes function calls to correct handlers
- Filters tools based on enabled/disabled toolsets
- Bridge between agent and tool implementations
## Critical Implementation Paths
### Tool Execution Flow
1. AIAgent receives tool_calls from API response
2. Validates tool names against `valid_tool_names`
3. Validates JSON arguments can be parsed
4. Calls `handle_function_call()` with tool name, args, task_id
5. `handle_function_call()` routes to appropriate handler
6. Tool executes, returns JSON string
7. Result added to conversation as tool message
8. Loop continues until natural language response
### Configuration Loading Flow
1. `cli.py` calls `load_cli_config()`
2. Loads `cli-config.yaml`, merges with defaults
3. Sets environment variables for terminal config
4. `AIAgent` reads env vars when initializing terminal tool
5. Terminal tool creates appropriate backend based on `TERMINAL_ENV`
## Atropos Backend Architecture
### Backend Hierarchy
```
ToolBackend (Protocol - base.py)
├── NomadToolBackend → SlotPool → NomadClient + SandboxExecutor (HTTP)
│ ├── Docker driver (default)
│ └── Singularity driver (HPC)
└── ModalToolBackend → _ModalSandboxPool → modal.Sandbox.exec() (direct)
└── _ModalMultiProfileManager (multi-profile support)
```
### Slot-Based Multiplexing Pattern
All backends share the same slot multiplexing concept:
- **Sandbox/Container**: Long-lived compute unit
- **Slot**: Isolated workspace directory within a sandbox (e.g., `/data/slot_0`)
- **Trajectory**: One agent task using one slot
- Multiple trajectories share a sandbox via different slots
### Nomad Backend (HTTP-based)
- Deploys `sandbox_server.py` inside containers (Docker or Singularity)
- Uses `SandboxExecutor` for HTTP communication (POST /execute, POST /batch)
- Nomad manages container lifecycle (scaling, health checks)
- Tools: bash, bash_stateful, read_file, write_file, tmux
### Modal Backend (exec-based)
- Creates `modal.Sandbox` instances (long-lived containers)
- Uses `sandbox.exec("bash", "-c", command)` directly (no HTTP server)
- Modal manages container lifecycle (idle_timeout, max_lifetime)
- Multi-profile support: different resource configs (CPU, GPU, memory)
- Named sandboxes for recovery: `Sandbox.from_name(app_name, sandbox_name)`
- YAML config via `modal_profiles.yaml`
### Backend Selection
```python
# In agent_env.py / create_tool_backend()
if mode == "nomad":
return NomadToolBackend(NomadBackendConfig.from_agent_env_config(cfg))
if mode == "modal":
return ModalToolBackend(ModalSandboxConfig.from_agent_env_config(cfg))
```

113
memory-bank/techContext.md Normal file
View File

@@ -0,0 +1,113 @@
# Technical Context: Hermes-Agent
## Technologies Used
### Core Stack
- **Python 3.11+** - Primary language
- **OpenAI SDK** - For LLM API interactions (OpenAI-compatible)
- **OpenRouter** - Default LLM provider (supports multiple models)
- **Rich** - Terminal formatting and panels
- **prompt_toolkit** - Interactive input with history
- **Fire** - CLI argument parsing
- **PyYAML** - Configuration files
- **python-dotenv** - Environment variable management
### Tool Dependencies
- **Firecrawl** - Web search and extraction (`FIRECRAWL_API_KEY`)
- **mini-swe-agent** - Terminal tool backend (local/docker/singularity/modal/ssh)
- **agent-browser** - Browser automation (npm package)
- **Browserbase** - Cloud browser execution (`BROWSERBASE_API_KEY`)
- **FAL.ai** - Image generation with FLUX (`FAL_KEY`)
- **Nous API** - Vision and MoA tools (`NOUS_API_KEY`)
### Optional Dependencies
- **Modal** - Cloud compute for sandboxed environments
- **Singularity/Apptainer** - Rootless containers (HPC environments)
- **Docker** - Container isolation
## Development Setup
### Quick Start
```bash
# Clone with submodules
git clone --recurse-submodules https://github.com/NousResearch/Hermes-Agent.git
cd Hermes-Agent
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
pip install -e ./mini-swe-agent
# Install browser tools (optional)
npm install
# Configure environment
cp .env.example .env
# Edit .env with your API keys
```
### Key Configuration Files
- `.env` - API keys and secrets
- `cli-config.yaml` - CLI configuration (model, terminal, toolsets, personalities)
- `configs/` - Batch run scripts and configuration
### Environment Variables
**Required for Full Functionality:**
- `OPENROUTER_API_KEY` - Primary LLM access
- `FIRECRAWL_API_KEY` - Web tools
- `NOUS_API_KEY` - Vision and reasoning tools
- `FAL_KEY` - Image generation
**Terminal Backend:**
- `TERMINAL_ENV` - Backend type: `local`, `docker`, `singularity`, `modal`, `ssh`
- `TERMINAL_CWD` - Working directory
- `TERMINAL_DOCKER_IMAGE` / `TERMINAL_SINGULARITY_IMAGE` - Container images
- `TERMINAL_SSH_HOST/USER/KEY` - SSH backend config
- `SUDO_PASSWORD` - Optional sudo support
**Browser:**
- `BROWSERBASE_API_KEY` - Browser automation
- `BROWSERBASE_PROJECT_ID` - Browserbase project
## Technical Constraints
1. **Context Window Limits** - Long tool outputs can exhaust context; trajectory compression helps
2. **API Rate Limits** - OpenRouter and tool APIs have rate limits; exponential backoff implemented
3. **Tool Availability** - Tools gracefully degrade if dependencies/keys missing
4. **Async Compatibility** - Some tools are async, handled via `asyncio.run()` in sync context
## Dependency Graph
```
tools/*.py → tools/__init__.py → model_tools.py → toolsets.py → toolset_distributions.py
run_agent.py ──────────────────────────┘
cli.py → run_agent.py (uses AIAgent with quiet_mode=True)
batch_runner.py → run_agent.py + toolset_distributions.py
```
## Tool Usage Patterns
### Adding a New Tool
1. Create `tools/your_tool.py` with handler + requirements check
2. Export in `tools/__init__.py`
3. Register in `model_tools.py` (definitions + handler routing)
4. Add to toolset in `toolsets.py`
5. Optionally add to `toolset_distributions.py` for batch processing
### Tool Handler Pattern
```python
def your_tool(param: str, task_id: str = None) -> str:
"""Execute tool and return JSON string result."""
try:
result = {"success": True, "data": "..."}
return json.dumps(result, ensure_ascii=False)
except Exception as e:
return json.dumps({"error": str(e)}, ensure_ascii=False)
```
All tool handlers MUST return a JSON string, never raw dicts.

134
modal_profiles.yaml.example Normal file
View File

@@ -0,0 +1,134 @@
# Modal Sandbox Profiles Configuration
# =====================================
# This file defines different sandbox profiles for heterogeneous workloads.
# Copy to modal_profiles.yaml and customize as needed.
#
# Usage:
# terminal_tool("python train.py", profile="pytorch-gpu")
# terminal_tool("npm test", profile="node")
#
# Each profile can specify:
# - image: Docker image to use
# - gpu: GPU type (null, "T4", "A10G", "A100", "H100")
# - cpu: CPU cores (float)
# - memory: Memory in MB
# - min_pool: Minimum warm sandboxes (cost vs latency tradeoff)
# - max_pool: Maximum sandboxes (hard cost cap)
# - idle_timeout: Server-side auto-cleanup in seconds
# - max_lifetime: Maximum sandbox lifetime in seconds
# - scale_down_idle: Client-side scale-down threshold in seconds
# - workdir: Working directory inside container
# - secrets: List of Modal Secret names to inject (created via dashboard/CLI)
# - env_vars: Dict of environment variables to pass directly
# - use_dotenv: If true, loads local .env file into sandbox
#
# SECRETS SETUP:
# Create secrets via Modal dashboard or CLI:
# modal secret create huggingface-token HF_TOKEN=hf_xxx
# modal secret create openai-key OPENAI_API_KEY=sk-xxx
# Then reference by name in profile's secrets list.
# Default profile used when no profile specified
default_profile: default
profiles:
# Default Python environment - good for most tasks
default:
image: python:3.11
gpu: null
cpu: 1.0
memory: 2048
min_pool: 1 # Keep 1 warm for fast response
max_pool: 5
idle_timeout: 120 # Modal terminates if idle 2 min
max_lifetime: 3600 # Max 1 hour
scale_down_idle: 180
workdir: /workspace
secrets: [] # Add secret names here: ["my-api-keys"]
env_vars: {} # Add env vars here: {DEBUG: "1"}
use_dotenv: false # Set to true to load local .env
# PyTorch with GPU for ML training/inference
pytorch-gpu:
image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
gpu: T4 # Options: T4, A10G, A100, H100
cpu: 4.0
memory: 16384 # 16GB
min_pool: 0 # Don't keep GPU sandboxes warm (expensive!)
max_pool: 2
idle_timeout: 60 # Shorter idle timeout for GPU (cost)
max_lifetime: 1800 # 30 min max for GPU tasks
scale_down_idle: 60
workdir: /workspace
# ML-specific secrets
secrets:
- huggingface-token # HF_TOKEN env var
- wandb-key # WANDB_API_KEY env var
env_vars:
CUDA_VISIBLE_DEVICES: "0"
PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
# High-end GPU for large models
pytorch-a100:
image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
gpu: A100
cpu: 8.0
memory: 65536 # 64GB
min_pool: 0
max_pool: 1 # Only 1 at a time (very expensive)
idle_timeout: 30
max_lifetime: 3600
scale_down_idle: 30
workdir: /workspace
# Node.js for JavaScript/TypeScript tasks
node:
image: node:18
gpu: null
cpu: 1.0
memory: 2048
min_pool: 0 # Create on-demand
max_pool: 3
idle_timeout: 120
max_lifetime: 3600
scale_down_idle: 180
workdir: /workspace
# High memory for data processing
high-memory:
image: python:3.11
gpu: null
cpu: 4.0
memory: 32768 # 32GB
min_pool: 0
max_pool: 2
idle_timeout: 120
max_lifetime: 3600
scale_down_idle: 180
workdir: /workspace
# Rust development environment
rust:
image: rust:1.75
gpu: null
cpu: 2.0
memory: 4096
min_pool: 0
max_pool: 2
idle_timeout: 120
max_lifetime: 3600
scale_down_idle: 180
workdir: /workspace
# Go development environment
golang:
image: golang:1.21
gpu: null
cpu: 2.0
memory: 4096
min_pool: 0
max_pool: 2
idle_timeout: 120
max_lifetime: 3600
scale_down_idle: 180
workdir: /workspace

37
nomad-dev.hcl Normal file
View File

@@ -0,0 +1,37 @@
# Nomad Development Configuration (Hermes-Agent)
# Run with: nomad agent -dev -config=nomad-dev.hcl
#
# This is intended for local development only.
client {
enabled = true
options {
# Enable Docker volume mounts for persistent slot workspaces
"docker.volumes.enabled" = "true"
}
}
# Docker driver plugin configuration
plugin "docker" {
config {
# CRITICAL: Enable volume mounts
volumes {
enabled = true
}
# Allow privileged containers if needed
allow_privileged = false
# Garbage collection settings
gc {
image = true
# NOTE: For local dev we often rely on locally built images like `atropos-sandbox:local`.
# A short image GC delay can delete these between runs, causing confusing "Failed to pull"
# crash loops. Keep this comfortably long; tighten it for CI/production if needed.
image_delay = "24h"
container = true
}
}
}

31
nomad-singularity.hcl Normal file
View File

@@ -0,0 +1,31 @@
# Nomad Configuration for Singularity/Apptainer Sandbox
# Run with: nomad agent -dev -config=nomad-singularity.hcl
#
# This uses the raw_exec driver to run Apptainer containers.
# Suitable for HPC environments where Docker cannot run without sudo.
client {
enabled = true
options {
# Enable raw_exec driver for Singularity/Apptainer
"driver.raw_exec.enable" = "1"
}
}
# raw_exec driver plugin configuration
plugin "raw_exec" {
config {
enabled = true
}
}
# Optional: If you have the nomad-driver-singularity plugin installed,
# uncomment the following instead of using raw_exec:
# plugin "singularity" {
# config {
# enabled = true
# # Allow bind mounts
# bind_paths = ["/tmp", "/var/tmp"]
# }
# }

View File

@@ -19,6 +19,7 @@ dependencies = [
"rich", "rich",
"tenacity", "tenacity",
"pyyaml", "pyyaml",
"prompt_toolkit",
"requests", "requests",
"jinja2", "jinja2",
"pydantic>=2.0", "pydantic>=2.0",
@@ -39,6 +40,19 @@ dev = ["pytest", "pytest-asyncio"]
messaging = ["python-telegram-bot>=20.0", "discord.py>=2.0", "aiohttp>=3.9.0"] messaging = ["python-telegram-bot>=20.0", "discord.py>=2.0", "aiohttp>=3.9.0"]
cron = ["croniter"] cron = ["croniter"]
cli = ["simple-term-menu"] cli = ["simple-term-menu"]
# Install Atropos + Tinker training integration from source.
atropos = [
"atroposlib @ git+https://github.com/NousResearch/atropos.git",
"tinker @ git+https://github.com/thinking-machines-lab/tinker.git",
# Atropos integration runtime deps (kept optional for Hermes-only users)
"aiohttp",
"fastapi",
"uvicorn",
"pyte",
"torch",
"wandb",
"math-verify",
]
all = [ all = [
"hermes-agent[modal]", "hermes-agent[modal]",
"hermes-agent[messaging]", "hermes-agent[messaging]",
@@ -50,9 +64,21 @@ all = [
[project.scripts] [project.scripts]
hermes = "hermes_cli.main:main" hermes = "hermes_cli.main:main"
hermes-agent = "run_agent:main" hermes-agent = "run_agent:main"
hermes-atropos-sandbox-smoke = "atropos.envs.sandbox_terminal_smoke_env:SandboxTerminalSmokeEnv.cli"
hermes-atropos-toolserver-smoke = "atropos.envs.toolserver_smoke_env:ToolServerSmokeEnv.cli"
[tool.setuptools] [tool.setuptools]
py-modules = ["run_agent", "model_tools", "toolsets", "batch_runner", "trajectory_compressor", "toolset_distributions", "cli"] py-modules = [
"run_agent",
"model_tools",
"toolsets",
"batch_runner",
"trajectory_compressor",
"toolset_distributions",
"atropos_compatible_agent",
"local_server",
"cli",
]
[tool.setuptools.packages.find] [tool.setuptools.packages.find]
include = ["tools", "hermes_cli", "gateway", "cron"] include = ["tools", "hermes_cli", "gateway", "cron", "atropos", "atropos.*"]

View File

@@ -30,7 +30,6 @@ import threading
import uuid import uuid
from typing import List, Dict, Any, Optional from typing import List, Dict, Any, Optional
from openai import OpenAI from openai import OpenAI
import fire
from datetime import datetime from datetime import datetime
from pathlib import Path from pathlib import Path
@@ -1581,6 +1580,16 @@ class AIAgent:
if active_system_prompt: if active_system_prompt:
# Insert system message at the beginning # Insert system message at the beginning
api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages
if os.getenv("HERMES_DEBUG_OPENAI_REQUEST") == "1":
meta = {
"model": self.model,
"base_url": self.base_url,
"messages": api_messages,
"tools": self.tools if self.tools else None,
}
print("\n=== HERMES_DEBUG_OPENAI_REQUEST ===", flush=True)
print(json.dumps(meta, ensure_ascii=False, indent=2)[:200_000], flush=True)
# Calculate approximate request size for logging # Calculate approximate request size for logging
total_chars = sum(len(str(msg)) for msg in api_messages) total_chars = sum(len(str(msg)) for msg in api_messages)
@@ -1594,12 +1603,13 @@ class AIAgent:
print(f"{self.log_prefix} 📊 Request size: {len(api_messages)} messages, ~{approx_tokens:,} tokens (~{total_chars:,} chars)") print(f"{self.log_prefix} 📊 Request size: {len(api_messages)} messages, ~{approx_tokens:,} tokens (~{total_chars:,} chars)")
print(f"{self.log_prefix} 🔧 Available tools: {len(self.tools) if self.tools else 0}") print(f"{self.log_prefix} 🔧 Available tools: {len(self.tools) if self.tools else 0}")
else: else:
# Animated thinking spinner in quiet mode # Animated thinking spinner in quiet mode (disable for wrappers/non-TTY usage)
face = random.choice(KawaiiSpinner.KAWAII_THINKING) if os.getenv("HERMES_DISABLE_SPINNER") != "1":
verb = random.choice(KawaiiSpinner.THINKING_VERBS) face = random.choice(KawaiiSpinner.KAWAII_THINKING)
spinner_type = random.choice(['brain', 'sparkle', 'pulse', 'moon', 'star']) verb = random.choice(KawaiiSpinner.THINKING_VERBS)
thinking_spinner = KawaiiSpinner(f"{face} {verb}...", spinner_type=spinner_type) spinner_type = random.choice(['brain', 'sparkle', 'pulse', 'moon', 'star'])
thinking_spinner.start() thinking_spinner = KawaiiSpinner(f"{face} {verb}...", spinner_type=spinner_type)
thinking_spinner.start()
# Log request details if verbose # Log request details if verbose
if self.verbose_logging: if self.verbose_logging:
@@ -1659,6 +1669,14 @@ class AIAgent:
api_kwargs["extra_body"] = extra_body api_kwargs["extra_body"] = extra_body
response = self.client.chat.completions.create(**api_kwargs) response = self.client.chat.completions.create(**api_kwargs)
if os.getenv("HERMES_DEBUG_OPENAI_RESPONSE") == "1":
try:
dumped = response.model_dump()
except Exception:
dumped = getattr(response, "__dict__", {"repr": repr(response)})
print("\n=== HERMES_DEBUG_OPENAI_RESPONSE: ChatCompletion (raw) ===", flush=True)
print(json.dumps(dumped, ensure_ascii=False, indent=2), flush=True)
api_duration = time.time() - api_start_time api_duration = time.time() - api_start_time
@@ -2137,7 +2155,7 @@ class AIAgent:
tool_start_time = time.time() tool_start_time = time.time()
# Execute the tool - with animated spinner in quiet mode # Execute the tool - with animated spinner in quiet mode
if self.quiet_mode: if self.quiet_mode and os.getenv("HERMES_DISABLE_SPINNER") != "1":
# Tool-specific spinner animations # Tool-specific spinner animations
tool_spinners = { tool_spinners = {
'web_search': ('arrows', ['🔍', '🌐', '📡', '🔎']), 'web_search': ('arrows', ['🔍', '🌐', '📡', '🔎']),
@@ -2167,6 +2185,9 @@ class AIAgent:
tool_duration = time.time() - tool_start_time tool_duration = time.time() - tool_start_time
cute_msg = self._get_cute_tool_message(function_name, function_args, tool_duration) cute_msg = self._get_cute_tool_message(function_name, function_args, tool_duration)
spinner.stop(cute_msg) spinner.stop(cute_msg)
elif self.quiet_mode:
function_result = handle_function_call(function_name, function_args, effective_task_id)
tool_duration = time.time() - tool_start_time
else: else:
function_result = handle_function_call(function_name, function_args, effective_task_id) function_result = handle_function_call(function_name, function_args, effective_task_id)
tool_duration = time.time() - tool_start_time tool_duration = time.time() - tool_start_time
@@ -2635,4 +2656,11 @@ def main(
if __name__ == "__main__": if __name__ == "__main__":
try:
import fire # type: ignore
except ModuleNotFoundError as exc:
raise SystemExit(
"Missing optional dependency 'fire'. Install hermes-agent with its CLI extras or add `fire` "
f"to your environment. Original error: {exc}"
) from exc
fire.Fire(main) fire.Fire(main)

View File

@@ -0,0 +1,62 @@
#!/usr/bin/env bash
set -euo pipefail
# Launch a local llama.cpp OpenAI-compatible server running GLM-4.7-Flash (GGUF).
#
# Requires:
# - `llama-server` installed (e.g. `brew install llama.cpp`)
#
# Default settings are chosen to avoid clashing with Atropos sandbox_server
# (which commonly uses port 8080 in local dev).
#
# Usage:
# Hermes-Agent/scripts/launch_llama_cpp_glm47_flash.sh
#
# Override defaults:
# LLAMA_CPP_HOST=127.0.0.1 LLAMA_CPP_PORT=8082 \
# LLAMA_CPP_HF_REPO=ggml-org/GLM-4.7-Flash-GGUF \
# LLAMA_CPP_HF_FILE=GLM-4.7-Flash-Q4_K.gguf \
# Hermes-Agent/scripts/launch_llama_cpp_glm47_flash.sh
HOST="${LLAMA_CPP_HOST:-127.0.0.1}"
PORT="${LLAMA_CPP_PORT:-8080}"
HF_REPO="${LLAMA_CPP_HF_REPO:-ggml-org/GLM-4.7-Flash-GGUF}"
HF_FILE="${LLAMA_CPP_HF_FILE:-GLM-4.7-Flash-Q4_K.gguf}"
ALIAS="${LLAMA_CPP_ALIAS:-glm-4.7-flash}"
if ! command -v llama-server >/dev/null 2>&1; then
echo "Error: llama-server not found in PATH."
echo "Install via Homebrew: brew install llama.cpp"
exit 1
fi
echo "Launching llama.cpp server..."
echo " host: $HOST"
echo " port: $PORT"
echo " repo: $HF_REPO"
echo " file: $HF_FILE"
echo " alias: $ALIAS"
echo
echo "Suggested env vars for Hermes/Atropos integration:"
echo " export ATROPOS_SERVER_BASE_URL=http://${HOST}:${PORT}"
echo " export ATROPOS_SERVER_MODEL=${ALIAS}"
echo " export ATROPOS_SERVER_API_KEY=local"
echo
if command -v lsof >/dev/null 2>&1; then
if lsof -nP -iTCP:"$PORT" -sTCP:LISTEN >/dev/null 2>&1; then
echo "Error: port $PORT is already in use."
echo "Pick a different port, e.g.:"
echo " LLAMA_CPP_PORT=8082 Hermes-Agent/scripts/launch_llama_cpp_glm47_flash.sh"
exit 1
fi
fi
exec llama-server \
--host "$HOST" \
--port "$PORT" \
--hf-repo "$HF_REPO" \
--hf-file "$HF_FILE" \
--alias "$ALIAS" \
-c 32768 \
-n -1

View File

@@ -0,0 +1,70 @@
#!/usr/bin/env bash
set -euo pipefail
# Launch a local llama.cpp OpenAI-compatible server running Hermes 4.3 36B (GGUF).
#
# Requires:
# - `llama-server` installed (e.g. `brew install llama.cpp`)
#
# Note: Port choice can conflict with other local dev servers. If 8080 is already
# in use, override via `LLAMA_CPP_PORT=...`.
#
# Usage:
# Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
#
# Override defaults:
# LLAMA_CPP_HOST=127.0.0.1 LLAMA_CPP_PORT=8082 \
# LLAMA_CPP_HF_REPO=NousResearch/Hermes-4.3-36B-GGUF \
# LLAMA_CPP_HF_FILE=hermes-4_3_36b-Q4_K_M.gguf \
# LLAMA_CPP_ALIAS=hermes-4-36b \
# LLAMA_CPP_PARALLEL=4 LLAMA_CPP_THREADS_HTTP=4 \
# Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh
HOST="${LLAMA_CPP_HOST:-127.0.0.1}"
PORT="${LLAMA_CPP_PORT:-8080}"
HF_REPO="${LLAMA_CPP_HF_REPO:-NousResearch/Hermes-4.3-36B-GGUF}"
HF_FILE="${LLAMA_CPP_HF_FILE:-hermes-4_3_36b-Q4_K_M.gguf}"
ALIAS="${LLAMA_CPP_ALIAS:-hermes-4-36b}"
PARALLEL="${LLAMA_CPP_PARALLEL:-4}"
THREADS_HTTP="${LLAMA_CPP_THREADS_HTTP:-4}"
if ! command -v llama-server >/dev/null 2>&1; then
echo "Error: llama-server not found in PATH."
echo "Install via Homebrew: brew install llama.cpp"
exit 1
fi
echo "Launching llama.cpp server..."
echo " host: $HOST"
echo " port: $PORT"
echo " repo: $HF_REPO"
echo " file: $HF_FILE"
echo " alias: $ALIAS"
echo " slots: $PARALLEL"
echo
echo "Suggested env vars for Hermes/Atropos integration:"
echo " export ATROPOS_SERVER_BASE_URL=http://${HOST}:${PORT}"
echo " export ATROPOS_SERVER_MODEL=${ALIAS}"
echo " export ATROPOS_TOKENIZER_NAME=NousResearch/Hermes-4.3-36B"
echo " export ATROPOS_SERVER_API_KEY=local"
echo
if command -v lsof >/dev/null 2>&1; then
if lsof -nP -iTCP:"$PORT" -sTCP:LISTEN >/dev/null 2>&1; then
echo "Error: port $PORT is already in use."
echo "Pick a different port, e.g.:"
echo " LLAMA_CPP_PORT=8082 Hermes-Agent/scripts/launch_llama_cpp_hermes_4_36b.sh"
exit 1
fi
fi
exec llama-server \
--host "$HOST" \
--port "$PORT" \
--hf-repo "$HF_REPO" \
--hf-file "$HF_FILE" \
--alias "$ALIAS" \
--parallel "$PARALLEL" \
--threads-http "$THREADS_HTTP" \
-c 32768 \
-n -1

View File

@@ -0,0 +1,15 @@
{"prompt": "Test prompt 0: What is 2+2? Just answer briefly.", "test_id": 0}
{"prompt": "Test prompt 1: What is 2+2? Just answer briefly.", "test_id": 1}
{"prompt": "Test prompt 2: What is 2+2? Just answer briefly.", "test_id": 2}
{"prompt": "Test prompt 3: What is 2+2? Just answer briefly.", "test_id": 3}
{"prompt": "Test prompt 4: What is 2+2? Just answer briefly.", "test_id": 4}
{"prompt": "Test prompt 5: What is 2+2? Just answer briefly.", "test_id": 5}
{"prompt": "Test prompt 6: What is 2+2? Just answer briefly.", "test_id": 6}
{"prompt": "Test prompt 7: What is 2+2? Just answer briefly.", "test_id": 7}
{"prompt": "Test prompt 8: What is 2+2? Just answer briefly.", "test_id": 8}
{"prompt": "Test prompt 9: What is 2+2? Just answer briefly.", "test_id": 9}
{"prompt": "Test prompt 10: What is 2+2? Just answer briefly.", "test_id": 10}
{"prompt": "Test prompt 11: What is 2+2? Just answer briefly.", "test_id": 11}
{"prompt": "Test prompt 12: What is 2+2? Just answer briefly.", "test_id": 12}
{"prompt": "Test prompt 13: What is 2+2? Just answer briefly.", "test_id": 13}
{"prompt": "Test prompt 14: What is 2+2? Just answer briefly.", "test_id": 14}

View File

@@ -0,0 +1,5 @@
{"prompt": "Test prompt 0: What is 2+2? Just answer briefly.", "test_id": 0}
{"prompt": "Test prompt 1: What is 2+2? Just answer briefly.", "test_id": 1}
{"prompt": "Test prompt 2: What is 2+2? Just answer briefly.", "test_id": 2}
{"prompt": "Test prompt 3: What is 2+2? Just answer briefly.", "test_id": 3}
{"prompt": "Test prompt 4: What is 2+2? Just answer briefly.", "test_id": 4}

File diff suppressed because it is too large Load Diff

923
tests/test_modal_stress.py Normal file
View File

@@ -0,0 +1,923 @@
#!/usr/bin/env python3
"""
Modal Integration Stress Tests & Full Integration Tests
This test suite includes:
1. Stress tests for Modal sandbox pools (concurrent load, scaling)
2. Atropos backend tests (requires atroposlib)
3. mini-swe-agent integration tests
Prerequisites:
# Install dev dependencies
pip install -e '.[dev,modal]'
# Install atroposlib for Atropos tests
pip install -e '.[atropos]'
# Clone mini-swe-agent (if not present)
git clone https://github.com/anthropics/mini-swe-agent.git mini-swe-agent
# Or as submodule:
git submodule add https://github.com/anthropics/mini-swe-agent.git mini-swe-agent
Run with:
# All tests
python tests/test_modal_stress.py
# Stress tests only
python tests/test_modal_stress.py --category stress
# Atropos tests only
python tests/test_modal_stress.py --category atropos
# Mini-swe-agent tests only
python tests/test_modal_stress.py --category miniswe
# Dry run (no Modal calls)
python tests/test_modal_stress.py --dry-run
"""
import asyncio
import json
import os
import sys
import time
import random
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, Any, List, Optional, Tuple
from dataclasses import dataclass
# Add parent to path for imports
sys.path.insert(0, str(Path(__file__).parent.parent))
# =============================================================================
# Test Configuration
# =============================================================================
@dataclass
class StressTestConfig:
dry_run: bool = False
verbose: bool = True
category: Optional[str] = None
# Stress test parameters (reduced defaults for faster first-run)
concurrent_tasks: int = 3 # Start small - Modal cold starts are slow
total_operations: int = 10
max_sandboxes: int = 3
slots_per_sandbox: int = 3
# =============================================================================
# Test Results Tracking
# =============================================================================
class TestResults:
def __init__(self):
self.passed: List[str] = []
self.failed: List[Tuple[str, str]] = []
self.skipped: List[Tuple[str, str]] = []
self.metrics: Dict[str, Any] = {}
def record_pass(self, name: str, metrics: Optional[Dict] = None):
self.passed.append(name)
if metrics:
self.metrics[name] = metrics
print(f"{name}")
if metrics:
for k, v in metrics.items():
print(f" 📊 {k}: {v}")
def record_fail(self, name: str, error: str):
self.failed.append((name, error))
print(f"{name}: {error}")
def record_skip(self, name: str, reason: str):
self.skipped.append((name, reason))
print(f" ⏭️ {name}: {reason}")
def summary(self):
total = len(self.passed) + len(self.failed) + len(self.skipped)
print(f"\n{'='*70}")
print(f"STRESS TEST RESULTS: {len(self.passed)}/{total} passed")
print(f" Passed: {len(self.passed)}")
print(f" Failed: {len(self.failed)}")
print(f" Skipped: {len(self.skipped)}")
if self.failed:
print(f"\nFailed tests:")
for name, error in self.failed:
print(f" - {name}: {error}")
if self.metrics:
print(f"\nPerformance Metrics:")
for test, metrics in self.metrics.items():
print(f" {test}:")
for k, v in metrics.items():
print(f" - {k}: {v}")
return len(self.failed) == 0
results = TestResults()
# =============================================================================
# Helper: Atropos Import
# =============================================================================
def try_import_atropos():
"""Try importing Atropos backend components."""
try:
from atropos.backends.modal_backend import (
ModalToolBackend, ModalSandboxConfig,
_ModalMultiProfileManager
)
from atropos.slots.slot import Slot, SlotState
return ModalToolBackend, ModalSandboxConfig, Slot, SlotState
except (ImportError, ModuleNotFoundError) as e:
return None
def try_import_miniswe():
"""Try importing mini-swe-agent components."""
try:
# Check if mini-swe-agent path exists and has content
mini_swe_path = Path(__file__).parent.parent / "mini-swe-agent" / "src"
if mini_swe_path.exists() and list(mini_swe_path.iterdir()):
sys.path.insert(0, str(mini_swe_path))
import minisweagent
return minisweagent
return None
except (ImportError, ModuleNotFoundError) as e:
return None
# =============================================================================
# CATEGORY 1: Stress Tests (Terminal Tool)
# =============================================================================
def test_stress_concurrent_tasks(config: StressTestConfig):
"""Stress test: Multiple concurrent task_ids hitting the pool."""
if config.dry_run:
results.record_skip("test_stress_concurrent_tasks", "Dry run mode")
return
from tools.terminal_tool import terminal_tool, cleanup_vm
original_env = os.environ.get("TERMINAL_ENV")
os.environ["TERMINAL_ENV"] = "modal"
try:
num_tasks = config.concurrent_tasks
task_ids = [f"stress-concurrent-{i}-{int(time.time())}" for i in range(num_tasks)]
start_time = time.time()
errors = []
successes = 0
def run_task(task_id: str) -> Tuple[bool, str]:
try:
result = json.loads(terminal_tool(
f"echo 'Hello from {task_id}' && sleep 0.5",
task_id=task_id,
))
success = result["exit_code"] == 0
# IMPORTANT: Clean up immediately after task completes
# This releases the sandbox back to the pool for other tasks
try:
cleanup_vm(task_id)
except:
pass
if success:
return True, ""
# Include more details for debugging
error_detail = result.get("error", "no error message")
output = result.get("output", "")[:100] # First 100 chars
return False, f"Exit code: {result['exit_code']}, error: {error_detail}, output: {output}"
except Exception as e:
# Clean up even on failure
try:
cleanup_vm(task_id)
except:
pass
import traceback
return False, f"Exception: {str(e)}\n{traceback.format_exc()}"
# Run all tasks concurrently using threads
with ThreadPoolExecutor(max_workers=num_tasks) as executor:
futures = {executor.submit(run_task, tid): tid for tid in task_ids}
for future in as_completed(futures):
task_id = futures[future]
try:
success, error = future.result(timeout=60)
if success:
successes += 1
else:
errors.append(f"{task_id}: {error}")
except Exception as e:
errors.append(f"{task_id}: {str(e)}")
elapsed = time.time() - start_time
# No need for cleanup here - each task cleans up immediately
# Report
success_rate = successes / num_tasks * 100
if success_rate >= 90: # Allow 10% failure rate for stress test
results.record_pass("test_stress_concurrent_tasks", {
"concurrent_tasks": num_tasks,
"successes": successes,
"failures": len(errors),
"success_rate": f"{success_rate:.1f}%",
"total_time": f"{elapsed:.2f}s",
"avg_time_per_task": f"{elapsed/num_tasks:.2f}s",
})
else:
results.record_fail(
"test_stress_concurrent_tasks",
f"Success rate {success_rate:.1f}% < 90%. Errors: {errors[:3]}"
)
except Exception as e:
results.record_fail("test_stress_concurrent_tasks", str(e))
finally:
if original_env:
os.environ["TERMINAL_ENV"] = original_env
elif "TERMINAL_ENV" in os.environ:
del os.environ["TERMINAL_ENV"]
def test_stress_rapid_fire(config: StressTestConfig):
"""Stress test: Rapid sequential commands to same task_id."""
if config.dry_run:
results.record_skip("test_stress_rapid_fire", "Dry run mode")
return
from tools.terminal_tool import terminal_tool, cleanup_vm
original_env = os.environ.get("TERMINAL_ENV")
os.environ["TERMINAL_ENV"] = "modal"
try:
task_id = f"stress-rapid-{int(time.time())}"
num_commands = config.total_operations
start_time = time.time()
successes = 0
errors = []
for i in range(num_commands):
try:
result = json.loads(terminal_tool(f"echo {i}", task_id=task_id))
if result["exit_code"] == 0 and str(i) in result["output"]:
successes += 1
else:
errors.append(f"Command {i}: unexpected result")
except Exception as e:
errors.append(f"Command {i}: {str(e)}")
elapsed = time.time() - start_time
cleanup_vm(task_id)
success_rate = successes / num_commands * 100
commands_per_second = num_commands / elapsed
if success_rate >= 95:
results.record_pass("test_stress_rapid_fire", {
"total_commands": num_commands,
"successes": successes,
"success_rate": f"{success_rate:.1f}%",
"total_time": f"{elapsed:.2f}s",
"commands_per_second": f"{commands_per_second:.1f}",
})
else:
results.record_fail(
"test_stress_rapid_fire",
f"Success rate {success_rate:.1f}% < 95%"
)
except Exception as e:
results.record_fail("test_stress_rapid_fire", str(e))
finally:
if original_env:
os.environ["TERMINAL_ENV"] = original_env
elif "TERMINAL_ENV" in os.environ:
del os.environ["TERMINAL_ENV"]
def test_stress_pool_scaling(config: StressTestConfig):
"""Stress test: Force pool to scale up and down by running tasks in batches."""
if config.dry_run:
results.record_skip("test_stress_pool_scaling", "Dry run mode")
return
from tools.terminal_tool import terminal_tool, cleanup_vm, _ModalPoolManager
original_env = os.environ.get("TERMINAL_ENV")
os.environ["TERMINAL_ENV"] = "modal"
try:
# Run tasks in batches matching max_sandboxes to test pool reuse
# This verifies sandboxes can be acquired, used, released, and reused
batch_size = config.max_sandboxes
num_batches = 3
total_tasks = batch_size * num_batches
start_time = time.time()
successes = 0
for batch in range(num_batches):
task_ids = [f"stress-scale-{batch}-{i}-{int(time.time())}" for i in range(batch_size)]
def run_task(task_id: str):
try:
result = json.loads(terminal_tool(
"echo done", # Fast command to test scaling
task_id=task_id,
))
success = result["exit_code"] == 0
try:
cleanup_vm(task_id)
except:
pass
return success
except:
try:
cleanup_vm(task_id)
except:
pass
return False
# Run batch concurrently
with ThreadPoolExecutor(max_workers=batch_size) as executor:
batch_results = list(executor.map(run_task, task_ids))
successes += sum(batch_results)
elapsed = time.time() - start_time
# Check pool status
try:
manager = _ModalPoolManager.get_instance()
pool_status = manager.get_status() if hasattr(manager, 'get_status') else {}
except:
pool_status = {}
success_rate = successes / total_tasks * 100
if success_rate >= 80: # Allow some tolerance
results.record_pass("test_stress_pool_scaling", {
"total_tasks": total_tasks,
"num_batches": num_batches,
"batch_size": batch_size,
"successes": successes,
"success_rate": f"{success_rate:.1f}%",
"total_time": f"{elapsed:.2f}s",
"pool_status": pool_status,
})
else:
results.record_fail(
"test_stress_pool_scaling",
f"Success rate {success_rate:.1f}% < 80%"
)
except Exception as e:
results.record_fail("test_stress_pool_scaling", str(e))
finally:
if original_env:
os.environ["TERMINAL_ENV"] = original_env
elif "TERMINAL_ENV" in os.environ:
del os.environ["TERMINAL_ENV"]
def test_stress_large_output(config: StressTestConfig):
"""Stress test: Commands producing large output."""
if config.dry_run:
results.record_skip("test_stress_large_output", "Dry run mode")
return
from tools.terminal_tool import terminal_tool, cleanup_vm
original_env = os.environ.get("TERMINAL_ENV")
os.environ["TERMINAL_ENV"] = "modal"
try:
task_id = f"stress-large-{int(time.time())}"
# First verify basic connectivity with simple command
warmup = json.loads(terminal_tool("echo warmup", task_id=task_id))
if warmup["exit_code"] != 0:
results.record_fail(
"test_stress_large_output",
f"Warmup failed: {warmup.get('error', 'unknown')}"
)
return
# Generate output - use seq which is more portable
start_time = time.time()
result = json.loads(terminal_tool(
'seq 1 500 | while read i; do echo "Line $i: This is test content for large output"; done',
task_id=task_id,
timeout=60,
))
elapsed = time.time() - start_time
cleanup_vm(task_id)
output_size = len(result.get("output", ""))
error_msg = result.get("error", "")
if result["exit_code"] == 0 and output_size > 5000:
results.record_pass("test_stress_large_output", {
"output_size": f"{output_size:,} bytes",
"time": f"{elapsed:.2f}s",
"throughput": f"{output_size/elapsed/1024:.1f} KB/s" if elapsed > 0 else "N/A",
})
else:
results.record_fail(
"test_stress_large_output",
f"Exit code: {result['exit_code']}, output size: {output_size}, error: {error_msg}"
)
except Exception as e:
import traceback
results.record_fail("test_stress_large_output", f"{str(e)}\n{traceback.format_exc()}")
finally:
try:
cleanup_vm(task_id)
except:
pass
if original_env:
os.environ["TERMINAL_ENV"] = original_env
elif "TERMINAL_ENV" in os.environ:
del os.environ["TERMINAL_ENV"]
def test_stress_error_recovery(config: StressTestConfig):
"""Stress test: Commands that fail and verify sandbox continues working."""
if config.dry_run:
results.record_skip("test_stress_error_recovery", "Dry run mode")
return
from tools.terminal_tool import terminal_tool, cleanup_vm
original_env = os.environ.get("TERMINAL_ENV")
os.environ["TERMINAL_ENV"] = "modal"
try:
task_id = f"stress-error-{int(time.time())}"
# Run some failing commands
failing_commands = [
"exit 1",
"false",
"cat /nonexistent/file",
"command_that_does_not_exist",
]
for cmd in failing_commands:
result = json.loads(terminal_tool(cmd, task_id=task_id))
# These should fail but not crash
assert result["exit_code"] != 0 or result.get("error"), f"Expected failure for: {cmd}"
# Now run a command that should succeed
result = json.loads(terminal_tool("echo 'recovery success'", task_id=task_id))
cleanup_vm(task_id)
if result["exit_code"] == 0 and "recovery success" in result["output"]:
results.record_pass("test_stress_error_recovery", {
"failed_commands": len(failing_commands),
"recovery": "success",
})
else:
results.record_fail(
"test_stress_error_recovery",
f"Recovery failed: {result}"
)
except Exception as e:
results.record_fail("test_stress_error_recovery", str(e))
finally:
if original_env:
os.environ["TERMINAL_ENV"] = original_env
elif "TERMINAL_ENV" in os.environ:
del os.environ["TERMINAL_ENV"]
# =============================================================================
# CATEGORY 2: Atropos Backend Stress Tests
# =============================================================================
async def test_atropos_stress_slot_churn(config: StressTestConfig):
"""Atropos stress test: Rapid slot acquire/release cycles."""
if config.dry_run:
results.record_skip("test_atropos_stress_slot_churn", "Dry run mode")
return
imports = try_import_atropos()
if imports is None:
results.record_skip("test_atropos_stress_slot_churn", "Requires atroposlib")
return
ModalToolBackend, ModalSandboxConfig, _, _ = imports
try:
backend_config = ModalSandboxConfig(
app_name=f"stress-churn-{int(time.time())}",
min_sandboxes=1,
max_sandboxes=3,
slots_per_sandbox=5,
)
backend = ModalToolBackend(backend_config)
await backend.start()
try:
num_cycles = config.total_operations
start_time = time.time()
successes = 0
for i in range(num_cycles):
try:
slot = await backend.acquire(f"churn-{i}")
# Quick command
results_list = await backend.execute_batch([
(slot, "bash", {"command": f"echo {i}"})
])
if results_list[0].success:
successes += 1
await backend.release(slot, reset_workspace=(i % 5 == 0))
except Exception as e:
pass # Count as failure
elapsed = time.time() - start_time
success_rate = successes / num_cycles * 100
if success_rate >= 90:
results.record_pass("test_atropos_stress_slot_churn", {
"cycles": num_cycles,
"successes": successes,
"success_rate": f"{success_rate:.1f}%",
"total_time": f"{elapsed:.2f}s",
"cycles_per_second": f"{num_cycles/elapsed:.1f}",
})
else:
results.record_fail(
"test_atropos_stress_slot_churn",
f"Success rate {success_rate:.1f}% < 90%"
)
finally:
await backend.stop(purge=True)
except Exception as e:
results.record_fail("test_atropos_stress_slot_churn", str(e))
async def test_atropos_stress_parallel_batches(config: StressTestConfig):
"""Atropos stress test: Multiple parallel batch executions."""
if config.dry_run:
results.record_skip("test_atropos_stress_parallel_batches", "Dry run mode")
return
imports = try_import_atropos()
if imports is None:
results.record_skip("test_atropos_stress_parallel_batches", "Requires atroposlib")
return
ModalToolBackend, ModalSandboxConfig, _, _ = imports
try:
backend_config = ModalSandboxConfig(
app_name=f"stress-batch-{int(time.time())}",
min_sandboxes=2,
max_sandboxes=4,
slots_per_sandbox=5,
)
backend = ModalToolBackend(backend_config)
await backend.start()
try:
num_slots = 10
slots = []
# Acquire multiple slots
for i in range(num_slots):
slot = await backend.acquire(f"batch-{i}")
slots.append(slot)
# Run multiple batches in parallel
start_time = time.time()
num_batches = 5
async def run_batch(batch_id: int):
requests = [
(slot, "bash", {"command": f"echo 'batch{batch_id}-slot{i}'"})
for i, slot in enumerate(slots)
]
return await backend.execute_batch(requests)
batch_tasks = [run_batch(i) for i in range(num_batches)]
all_results = await asyncio.gather(*batch_tasks)
elapsed = time.time() - start_time
# Count successes
total_commands = num_batches * num_slots
successes = sum(
1 for batch_result in all_results
for r in batch_result
if r.success
)
# Release slots
for slot in slots:
await backend.release(slot)
success_rate = successes / total_commands * 100
if success_rate >= 90:
results.record_pass("test_atropos_stress_parallel_batches", {
"batches": num_batches,
"slots": num_slots,
"total_commands": total_commands,
"successes": successes,
"success_rate": f"{success_rate:.1f}%",
"total_time": f"{elapsed:.2f}s",
"commands_per_second": f"{total_commands/elapsed:.1f}",
})
else:
results.record_fail(
"test_atropos_stress_parallel_batches",
f"Success rate {success_rate:.1f}% < 90%"
)
finally:
await backend.stop(purge=True)
except Exception as e:
results.record_fail("test_atropos_stress_parallel_batches", str(e))
async def test_atropos_stress_multi_profile_load(config: StressTestConfig):
"""Atropos stress test: Load across multiple profiles."""
if config.dry_run:
results.record_skip("test_atropos_stress_multi_profile_load", "Dry run mode")
return
imports = try_import_atropos()
if imports is None:
results.record_skip("test_atropos_stress_multi_profile_load", "Requires atroposlib")
return
ModalToolBackend, ModalSandboxConfig, _, _ = imports
try:
backend = ModalToolBackend.with_profiles(
app_name=f"stress-multiprofile-{int(time.time())}",
profiles={
"cpu-light": ModalSandboxConfig(
name="cpu-light",
cpu=0.5,
memory=1024,
min_sandboxes=1,
max_sandboxes=2,
slots_per_sandbox=5,
),
"cpu-heavy": ModalSandboxConfig(
name="cpu-heavy",
cpu=2.0,
memory=4096,
min_sandboxes=0,
max_sandboxes=2,
slots_per_sandbox=3,
),
}
)
await backend.start(profiles_to_start=["cpu-light", "cpu-heavy"])
try:
num_tasks_per_profile = 5
slots = []
# Acquire from both profiles
for i in range(num_tasks_per_profile):
light_slot = await backend.acquire(f"light-{i}", profile="cpu-light")
heavy_slot = await backend.acquire(f"heavy-{i}", profile="cpu-heavy")
slots.append((light_slot, "cpu-light"))
slots.append((heavy_slot, "cpu-heavy"))
# Execute batch across all profiles
start_time = time.time()
requests = [
(slot, "bash", {"command": f"echo 'profile={profile}'"})
for slot, profile in slots
]
batch_results = await backend.execute_batch(requests)
elapsed = time.time() - start_time
successes = sum(1 for r in batch_results if r.success)
# Release all
for slot, _ in slots:
await backend.release(slot)
status = backend.get_status()
success_rate = successes / len(slots) * 100
if success_rate >= 90:
results.record_pass("test_atropos_stress_multi_profile_load", {
"profiles": 2,
"tasks_per_profile": num_tasks_per_profile,
"total_tasks": len(slots),
"successes": successes,
"success_rate": f"{success_rate:.1f}%",
"time": f"{elapsed:.2f}s",
"status": status,
})
else:
results.record_fail(
"test_atropos_stress_multi_profile_load",
f"Success rate {success_rate:.1f}% < 90%"
)
finally:
await backend.stop(purge=True)
except Exception as e:
results.record_fail("test_atropos_stress_multi_profile_load", str(e))
# =============================================================================
# CATEGORY 3: Mini-SWE-Agent Integration Tests
# =============================================================================
def test_miniswe_environment_available():
"""Check if mini-swe-agent is properly set up."""
mini_swe_path = Path(__file__).parent.parent / "mini-swe-agent" / "src"
if not mini_swe_path.exists():
results.record_skip(
"test_miniswe_environment_available",
"mini-swe-agent not found. Run: git clone https://github.com/anthropics/mini-swe-agent.git mini-swe-agent"
)
return
if not list(mini_swe_path.iterdir()):
results.record_skip(
"test_miniswe_environment_available",
"mini-swe-agent directory is empty. Run: git submodule update --init"
)
return
miniswe = try_import_miniswe()
if miniswe is None:
results.record_fail(
"test_miniswe_environment_available",
"Failed to import minisweagent module"
)
return
results.record_pass("test_miniswe_environment_available", {
"path": str(mini_swe_path),
"module": miniswe.__name__,
})
def test_miniswe_modal_backend(config: StressTestConfig):
"""Test mini-swe-agent with Modal backend."""
if config.dry_run:
results.record_skip("test_miniswe_modal_backend", "Dry run mode")
return
miniswe = try_import_miniswe()
if miniswe is None:
results.record_skip(
"test_miniswe_modal_backend",
"mini-swe-agent not available"
)
return
try:
# Check if ModalEnvironment exists in minisweagent
if not hasattr(miniswe, 'ModalEnvironment'):
results.record_skip(
"test_miniswe_modal_backend",
"minisweagent.ModalEnvironment not found"
)
return
# Create Modal environment
env = miniswe.ModalEnvironment(
image="python:3.11",
timeout=60,
)
# Execute a command
result = env.execute("echo 'Hello from mini-swe-agent Modal'")
env.cleanup()
if "Hello from mini-swe-agent Modal" in str(result):
results.record_pass("test_miniswe_modal_backend")
else:
results.record_fail(
"test_miniswe_modal_backend",
f"Unexpected result: {result}"
)
except Exception as e:
results.record_fail("test_miniswe_modal_backend", str(e))
# =============================================================================
# Test Runner
# =============================================================================
def run_sync_tests(config: StressTestConfig):
"""Run synchronous tests."""
if config.category in (None, "stress"):
print("\n" + "="*70)
print("STRESS TESTS (Terminal Tool)")
print("="*70)
test_stress_concurrent_tasks(config)
test_stress_rapid_fire(config)
test_stress_pool_scaling(config)
test_stress_large_output(config)
test_stress_error_recovery(config)
if config.category in (None, "miniswe"):
print("\n" + "="*70)
print("MINI-SWE-AGENT INTEGRATION TESTS")
print("="*70)
test_miniswe_environment_available()
test_miniswe_modal_backend(config)
async def run_async_tests(config: StressTestConfig):
"""Run asynchronous tests."""
if config.category in (None, "atropos"):
print("\n" + "="*70)
print("ATROPOS BACKEND STRESS TESTS")
print("="*70)
await test_atropos_stress_slot_churn(config)
await test_atropos_stress_parallel_batches(config)
await test_atropos_stress_multi_profile_load(config)
def main():
import argparse
parser = argparse.ArgumentParser(description="Modal Stress Test Suite")
parser.add_argument("--dry-run", action="store_true", help="Skip tests requiring Modal")
parser.add_argument("--category", choices=["stress", "atropos", "miniswe"], help="Run specific category")
parser.add_argument("--concurrent", type=int, default=10, help="Number of concurrent tasks")
parser.add_argument("--operations", type=int, default=50, help="Total operations for stress tests")
parser.add_argument("--verbose", action="store_true", default=True)
args = parser.parse_args()
config = StressTestConfig(
dry_run=args.dry_run,
verbose=args.verbose,
category=args.category,
concurrent_tasks=args.concurrent,
total_operations=args.operations,
)
print("="*70)
print("MODAL STRESS & INTEGRATION TEST SUITE")
print("="*70)
print(f"Mode: {'DRY RUN' if config.dry_run else 'LIVE'}")
print(f"Category: {config.category or 'ALL'}")
print(f"Concurrent tasks: {config.concurrent_tasks}")
print(f"Total operations: {config.total_operations}")
# Run sync tests
run_sync_tests(config)
# Run async tests
asyncio.run(run_async_tests(config))
# Summary
success = results.summary()
sys.exit(0 if success else 1)
if __name__ == "__main__":
main()

View File

@@ -236,6 +236,63 @@ def test_environment_isolation():
return isolated
def test_pool_status():
"""Test that the Modal pool manager reports status correctly."""
print("\n" + "=" * 60)
print("TEST 7: Pool Status")
print("=" * 60)
try:
# Import pool manager
_ModalPoolManager = terminal_module._ModalPoolManager
# Get pool manager instance
manager = _ModalPoolManager.get_instance()
status = manager.get_status()
print(f"\nPool Manager Status:")
print(f" App name: {manager.app_name}")
print(f" Default profile: {manager.default_profile}")
print(f" Available profiles: {list(manager.profiles.keys())}")
print(f" Active pools: {list(status.keys())}")
for pool_name, pool_status in status.items():
print(f"\n Pool '{pool_name}':")
print(f" Size: {pool_status['pool_size']}/{pool_status['max_pool']}")
print(f" In use: {pool_status['in_use']}")
print(f" Min pool: {pool_status['min_pool']}")
print(f"\nTest: ✅ Passed")
return True
except Exception as e:
print(f"\nError: {e}")
print(f"\nTest: ❌ Failed")
return False
def test_profile_selection():
"""Test that profile parameter is accepted (even if profile doesn't exist)."""
print("\n" + "=" * 60)
print("TEST 8: Profile Selection")
print("=" * 60)
test_task_id = "modal_test_profile"
# Test with default profile (no profile specified)
print("Testing with default profile...")
result = terminal_tool("echo 'default profile'", task_id=test_task_id)
result_json = json.loads(result)
success = result_json.get('exit_code') == 0
print(f" Default profile: {'' if success else ''} (exit code: {result_json.get('exit_code')})")
# Cleanup
cleanup_vm(test_task_id)
print(f"\nTest: {'✅ Passed' if success else '❌ Failed'}")
return success
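# Illustrative only (not part of the original test): with multiple profiles
# configured, a specific one can be requested via terminal_tool's new
# `profile` parameter; the profile name below is made up.
#   terminal_tool("nvidia-smi", task_id="gpu_check", profile="gpu")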
def main():
"""Run all Modal terminal tests."""
print("🧪 Modal Terminal Tool Test Suite")
@@ -247,6 +304,8 @@ def main():
print(f" TERMINAL_ENV: {config['env_type']}") print(f" TERMINAL_ENV: {config['env_type']}")
print(f" TERMINAL_MODAL_IMAGE: {config['modal_image']}") print(f" TERMINAL_MODAL_IMAGE: {config['modal_image']}")
print(f" TERMINAL_TIMEOUT: {config['timeout']}s") print(f" TERMINAL_TIMEOUT: {config['timeout']}s")
print(f" TERMINAL_MODAL_APP_NAME: {os.getenv('TERMINAL_MODAL_APP_NAME', 'hermes-sandbox')}")
print(f" TERMINAL_MODAL_DEFAULT_PROFILE: {os.getenv('TERMINAL_MODAL_DEFAULT_PROFILE', 'default')}")
if config['env_type'] != 'modal':
print(f"\n⚠️ WARNING: TERMINAL_ENV is set to '{config['env_type']}', not 'modal'")
@@ -270,6 +329,8 @@ def main():
results['pip_install'] = test_pip_install()
results['filesystem_persistence'] = test_filesystem_persistence()
results['environment_isolation'] = test_environment_isolation()
results['pool_status'] = test_pool_status()
results['profile_selection'] = test_profile_selection()
# Summary
print("\n" + "=" * 60)

View File

@@ -0,0 +1,31 @@
from __future__ import annotations
from atropos.tools.base import ToolCall
def test_parse_tool_call_json_wrapper() -> None:
text = '<tool_call>{"name":"terminal","arguments":{"command":"pwd"}}</tool_call>'
calls = ToolCall.parse_from_text(text)
assert len(calls) == 1
assert calls[0].name == "terminal"
assert calls[0].arguments == {"command": "pwd"}
def test_parse_tool_call_glm_style() -> None:
text = '<tool_call>terminal{"command":"ls -la"}</tool_call>'
calls = ToolCall.parse_from_text(text)
assert len(calls) == 1
assert calls[0].name == "terminal"
assert calls[0].arguments == {"command": "ls -la"}
def test_parse_tool_call_missing_close_tag() -> None:
text = '<tool_call>terminal{"command":"echo hi"}'
calls = ToolCall.parse_from_text(text)
assert calls == []
def test_parse_tool_call_strips_accidental_xml() -> None:
text = '<tool_call>terminal{"command":"ls -la"}</arg_value></tool_call>'
calls = ToolCall.parse_from_text(text)
assert calls == []

View File

@@ -16,14 +16,6 @@ The tools are imported into model_tools.py which provides a unified interface
for the AI agent to access all capabilities.
"""
# Export all tools for easy importing
from .web_tools import (
web_search_tool,
web_extract_tool,
web_crawl_tool,
check_firecrawl_api_key
)
# Primary terminal tool (mini-swe-agent backend: local/docker/singularity/modal)
from .terminal_tool import (
terminal_tool,
@@ -34,54 +26,106 @@ from .terminal_tool import (
TERMINAL_TOOL_DESCRIPTION
)
# Alternative terminal tool (Hecate/MorphCloud cloud VMs)
from .terminal_hecate import (
terminal_hecate_tool,
check_hecate_requirements,
TERMINAL_HECATE_DESCRIPTION
)
from .vision_tools import (
vision_analyze_tool,
check_vision_requirements
)
from .mixture_of_agents_tool import (
mixture_of_agents_tool,
check_moa_requirements
)
from .image_generation_tool import (
image_generate_tool,
check_image_generation_requirements
)
from .skills_tool import (
skills_categories,
skills_list,
skill_view,
check_skills_requirements,
SKILLS_TOOL_DESCRIPTION
)
# Browser automation tools (agent-browser + Browserbase)
from .browser_tool import (
browser_navigate,
browser_snapshot,
browser_click,
browser_type,
browser_scroll,
browser_back,
browser_press,
browser_close,
browser_get_images,
browser_vision,
cleanup_browser,
cleanup_all_browsers,
get_active_browser_sessions,
check_browser_requirements,
BROWSER_TOOL_SCHEMAS
)
# Optional toolsets: keep imports soft so users can run subsets of tools without
# installing every dependency (requirements gating lives in model_tools.py).
try:
from .web_tools import check_firecrawl_api_key, web_crawl_tool, web_extract_tool, web_search_tool
except ModuleNotFoundError: # pragma: no cover
web_search_tool = None # type: ignore[assignment]
web_extract_tool = None # type: ignore[assignment]
web_crawl_tool = None # type: ignore[assignment]
def check_firecrawl_api_key() -> bool: # type: ignore[no-redef]
return False
try:
# Alternative terminal tool (Hecate/MorphCloud cloud VMs)
from .terminal_hecate import TERMINAL_HECATE_DESCRIPTION, check_hecate_requirements, terminal_hecate_tool
except ModuleNotFoundError: # pragma: no cover
terminal_hecate_tool = None # type: ignore[assignment]
TERMINAL_HECATE_DESCRIPTION = ""
def check_hecate_requirements() -> bool: # type: ignore[no-redef]
return False
try:
from .vision_tools import check_vision_requirements, vision_analyze_tool
except ModuleNotFoundError: # pragma: no cover
vision_analyze_tool = None # type: ignore[assignment]
def check_vision_requirements() -> bool: # type: ignore[no-redef]
return False
try:
from .mixture_of_agents_tool import check_moa_requirements, mixture_of_agents_tool
except ModuleNotFoundError: # pragma: no cover
mixture_of_agents_tool = None # type: ignore[assignment]
def check_moa_requirements() -> bool: # type: ignore[no-redef]
return False
try:
from .image_generation_tool import check_image_generation_requirements, image_generate_tool
except ModuleNotFoundError: # pragma: no cover
image_generate_tool = None # type: ignore[assignment]
def check_image_generation_requirements() -> bool: # type: ignore[no-redef]
return False
try:
from .skills_tool import (
SKILLS_TOOL_DESCRIPTION,
check_skills_requirements,
skill_view,
skills_categories,
skills_list,
)
except ModuleNotFoundError: # pragma: no cover
skills_categories = None # type: ignore[assignment]
skills_list = None # type: ignore[assignment]
skill_view = None # type: ignore[assignment]
SKILLS_TOOL_DESCRIPTION = ""
def check_skills_requirements() -> bool: # type: ignore[no-redef]
return False
try:
# Browser automation tools (agent-browser + Browserbase)
from .browser_tool import (
BROWSER_TOOL_SCHEMAS,
browser_back,
browser_click,
browser_close,
browser_get_images,
browser_navigate,
browser_press,
browser_scroll,
browser_snapshot,
browser_type,
browser_vision,
check_browser_requirements,
cleanup_all_browsers,
cleanup_browser,
get_active_browser_sessions,
)
except ModuleNotFoundError: # pragma: no cover
browser_navigate = None # type: ignore[assignment]
browser_snapshot = None # type: ignore[assignment]
browser_click = None # type: ignore[assignment]
browser_type = None # type: ignore[assignment]
browser_scroll = None # type: ignore[assignment]
browser_back = None # type: ignore[assignment]
browser_press = None # type: ignore[assignment]
browser_close = None # type: ignore[assignment]
browser_get_images = None # type: ignore[assignment]
browser_vision = None # type: ignore[assignment]
cleanup_browser = None # type: ignore[assignment]
cleanup_all_browsers = None # type: ignore[assignment]
get_active_browser_sessions = None # type: ignore[assignment]
BROWSER_TOOL_SCHEMAS = []
def check_browser_requirements() -> bool: # type: ignore[no-redef]
return False
# Cronjob management tools (CLI-only, hermes-cli toolset)
from .cronjob_tools import (
@@ -206,4 +250,3 @@ __all__ = [
'clear_file_ops_cache',
'check_file_requirements',
]

View File

@@ -36,8 +36,14 @@ import shutil
import subprocess
import tempfile
import uuid
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Dict, Any
from typing import Optional, Dict, Any, ClassVar, List
try:
import yaml
except ImportError:
yaml = None
# Add mini-swe-agent to path if not installed
mini_swe_path = Path(__file__).parent.parent / "mini-swe-agent" / "src"
@@ -951,38 +957,585 @@ class _DockerEnvironment:
pass
class _ModalEnvironment:
"""
Modal cloud execution environment wrapper with sudo support.
Wraps mini-swe-agent's SwerexModalEnvironment but adds:
- SUDO_PASSWORD support via _transform_sudo_command
Note: stdin handling is not needed for Modal since it uses remote async execution.
"""
def __init__(self, image: str, cwd: str = "/root", timeout: int = 60):
from minisweagent.environments.extra.swerex_modal import SwerexModalEnvironment
self._inner = SwerexModalEnvironment(image=image, cwd=cwd, timeout=timeout)
@dataclass
class ModalProfile:
"""
Configuration for a Modal sandbox profile.
Each profile defines the container image, resources, and pool scaling behavior.
Different profiles can be used for different workloads.
Secrets:
secrets: List of Modal Secret names to inject into the sandbox.
These secrets must be created on Modal dashboard or via CLI.
env_vars: Dict of environment variables to pass directly to sandbox.
Use for non-sensitive configuration.
Example: {"DEBUG": "1", "LOG_LEVEL": "info"}
use_dotenv: If True, load the local .env file and pass it to the sandbox.
"""
name: str
image: str = "python:3.11"
gpu: Optional[str] = None # None, "T4", "A10G", "A100", "H100"
cpu: float = 1.0
memory: int = 2048 # MB
min_pool: int = 1
max_pool: int = 5
idle_timeout: int = 120 # Modal server-side auto-cleanup (seconds)
max_lifetime: int = 3600 # Max sandbox lifetime (seconds)
scale_down_idle: int = 180 # Client-side scale down threshold (seconds)
workdir: str = "/workspace"
# Secrets and environment variables
secrets: List[str] = field(default_factory=list) # Modal Secret names
env_vars: Dict[str, str] = field(default_factory=dict) # Direct env vars
use_dotenv: bool = False # Load .env file and pass to sandbox
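# Illustrative construction (not part of the original file; the profile and
# secret names below are made up):
#   profile = ModalProfile(
#       name="gpu-light",
#       image="python:3.11",
#       gpu="T4",
#       secrets=["my-api-keys"],     # must already exist in Modal
#       env_vars={"DEBUG": "1"},     # non-sensitive config only
#   )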
@classmethod
def from_env(cls, profile_name: str) -> "ModalProfile":
"""Load profile configuration from environment variables."""
prefix = f"TERMINAL_MODAL_PROFILE_{profile_name}_"
# Parse secrets list from comma-separated string
secrets_str = os.getenv(f"{prefix}SECRETS", "")
secrets = [s.strip() for s in secrets_str.split(",") if s.strip()]
# Parse env_vars from KEY=VALUE pairs separated by semicolons
env_vars_str = os.getenv(f"{prefix}ENV_VARS", "")
env_vars = {}
if env_vars_str:
for pair in env_vars_str.split(";"):
if "=" in pair:
k, v = pair.split("=", 1)
env_vars[k.strip()] = v.strip()
return cls(
name=profile_name,
image=os.getenv(f"{prefix}IMAGE", "python:3.11"),
gpu=os.getenv(f"{prefix}GPU"),
cpu=float(os.getenv(f"{prefix}CPU", "1.0")),
memory=int(os.getenv(f"{prefix}MEMORY", "2048")),
min_pool=int(os.getenv(f"{prefix}MIN_POOL", "1")),
max_pool=int(os.getenv(f"{prefix}MAX_POOL", "5")),
idle_timeout=int(os.getenv(f"{prefix}IDLE_TIMEOUT", "120")),
max_lifetime=int(os.getenv(f"{prefix}MAX_LIFETIME", "3600")),
scale_down_idle=int(os.getenv(f"{prefix}SCALE_DOWN_IDLE", "180")),
workdir=os.getenv(f"{prefix}WORKDIR", "/workspace"),
secrets=secrets,
env_vars=env_vars,
use_dotenv=os.getenv(f"{prefix}USE_DOTENV", "").lower() in ("true", "1", "yes"),
)
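# Example of the env-var pattern parsed by from_env (hypothetical values):
#   TERMINAL_MODAL_PROFILE_gpu_IMAGE=python:3.11
#   TERMINAL_MODAL_PROFILE_gpu_GPU=A10G
#   TERMINAL_MODAL_PROFILE_gpu_ENV_VARS="DEBUG=1;LOG_LEVEL=info"
# ModalProfile.from_env("gpu") would then yield gpu="A10G" and
# env_vars={"DEBUG": "1", "LOG_LEVEL": "info"}.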
@classmethod
def load_profiles(cls, config_file: Optional[str] = None) -> Dict[str, "ModalProfile"]:
"""
Load all profiles from YAML file or environment variables.
Priority:
1. YAML file specified by config_file or TERMINAL_MODAL_PROFILES_FILE
2. Environment variables with TERMINAL_MODAL_PROFILE_<name>_* pattern
3. Default profile with basic settings
"""
profiles = {}
# Try YAML file first
yaml_path = config_file or os.getenv("TERMINAL_MODAL_PROFILES_FILE", "modal_profiles.yaml")
if Path(yaml_path).exists():
try:
with open(yaml_path) as f:
config = yaml.safe_load(f)
for name, cfg in config.get("profiles", {}).items():
profiles[name] = cls(name=name, **cfg)
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Loaded {len(profiles)} profiles from {yaml_path}")
return profiles
except Exception as e:
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Warning: Failed to load {yaml_path}: {e}")
# Check for environment variable profiles
# Look for any env vars starting with TERMINAL_MODAL_PROFILE_
profile_names = set()
for key in os.environ:
if key.startswith("TERMINAL_MODAL_PROFILE_") and "_IMAGE" in key:
# Extract profile name: TERMINAL_MODAL_PROFILE_<name>_IMAGE
parts = key.replace("TERMINAL_MODAL_PROFILE_", "").rsplit("_IMAGE", 1)
if parts[0]:
profile_names.add(parts[0])
for name in profile_names:
profiles[name] = cls.from_env(name)
# If no profiles found, create a default one
if not profiles:
default_name = os.getenv("TERMINAL_MODAL_DEFAULT_PROFILE", "default")
profiles[default_name] = cls(
name=default_name,
image=os.getenv("TERMINAL_MODAL_IMAGE", "python:3.11"),
min_pool=int(os.getenv("TERMINAL_MODAL_MIN_POOL", "1")),
max_pool=int(os.getenv("TERMINAL_MODAL_MAX_POOL", "5")),
idle_timeout=int(os.getenv("TERMINAL_MODAL_IDLE_TIMEOUT", "120")),
max_lifetime=int(os.getenv("TERMINAL_MODAL_MAX_LIFETIME", "3600")),
scale_down_idle=int(os.getenv("TERMINAL_MODAL_SCALE_DOWN_IDLE", "180")),
)
return profiles
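# Sketch of the YAML shape load_profiles accepts (file contents illustrative;
# keys under each profile must match ModalProfile's dataclass fields):
#   profiles:
#     default:
#       image: python:3.11
#       max_pool: 5
#     gpu:
#       image: python:3.11
#       gpu: A10G
#       min_pool: 0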
class _ModalSandboxPool:
"""
Auto-scaling pool of warm Modal sandboxes for a single profile.
Features:
- Named sandboxes for recovery after restart
- Reactive scale-up when demand exceeds capacity
- Background scale-down when sandboxes are idle
- Server-side idle_timeout for orphan protection
""" """
def __init__(self, image: str, cwd: str = "/root", timeout: int = 60): def __init__(self, profile: ModalProfile, app_name: str):
from minisweagent.environments.extra.swerex_modal import SwerexModalEnvironment self.profile = profile
self._inner = SwerexModalEnvironment(image=image, cwd=cwd, timeout=timeout) self.app_name = app_name
self._app = None
self._modal_image = None
self._pool: Dict[str, Any] = {} # sandbox_name -> modal.Sandbox
self._in_use: Dict[str, str] = {} # task_id -> sandbox_name
self._last_used: Dict[str, float] = {} # sandbox_name -> timestamp
self._lock = threading.Lock()
self._running = True
self._next_index = 0
# Start scale-down monitor if min_pool > 0 (worth keeping warm)
self._monitor_thread = None
if profile.min_pool > 0 or profile.max_pool > 0:
self._monitor_thread = threading.Thread(
target=self._scale_down_monitor,
daemon=True,
name=f"modal-pool-{profile.name}"
)
self._monitor_thread.start()
def _get_sandbox_name(self, index: int) -> str:
"""Generate a unique sandbox name for this profile."""
return f"hermes-{self.profile.name}-{index}"
def _ensure_app(self):
"""Lazy initialization of Modal app and image."""
if self._app is None:
try:
import modal
self._app = modal.App.lookup(self.app_name, create_if_missing=True)
self._modal_image = modal.Image.from_registry(self.profile.image)
except ImportError:
raise ImportError("Modal package not installed. Run: pip install modal")
def _recover_or_create_sandbox(self, name: str) -> Any:
"""
Try to recover an existing named sandbox, or create a new one.
Uses Modal's named sandbox feature for recovery after Hermes restart.
Supports Modal Secrets for secure credential injection.
"""
import modal
# Try to recover existing sandbox
try:
sb = modal.Sandbox.from_name(self.app_name, name)
if sb.poll() is None: # Still running
# Health check - verify sandbox is responsive
try:
sb.exec("echo", "ok", timeout=10)
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Recovered existing sandbox: {name}")
return sb
except Exception:
# Sandbox is not healthy, will create new
pass
except modal.exception.NotFoundError:
pass
except Exception as e:
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Could not recover sandbox {name}: {e}")
# Build create kwargs based on profile
create_kwargs = {
"app": self._app,
"name": name,
"image": self._modal_image,
"timeout": self.profile.max_lifetime,
"idle_timeout": self.profile.idle_timeout,
"workdir": self.profile.workdir,
}
# Add resource specs
if self.profile.cpu != 1.0:
create_kwargs["cpu"] = self.profile.cpu
if self.profile.memory != 2048:
create_kwargs["memory"] = self.profile.memory
# Add GPU if specified
if self.profile.gpu:
create_kwargs["gpu"] = self.profile.gpu
# Build secrets list
secrets_list = []
# Add named secrets from Modal dashboard/CLI
for secret_name in self.profile.secrets:
try:
secrets_list.append(modal.Secret.from_name(secret_name))
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Adding secret: {secret_name}")
except Exception as e:
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Warning: Could not load secret '{secret_name}': {e}")
# Add direct environment variables
if self.profile.env_vars:
secrets_list.append(modal.Secret.from_dict(self.profile.env_vars))
# Add .env file if requested
if self.profile.use_dotenv:
try:
secrets_list.append(modal.Secret.from_dotenv())
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Loading .env file into sandbox")
except Exception as e:
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Warning: Could not load .env file: {e}")
# Add global secrets from environment variable
global_secrets_str = os.getenv("TERMINAL_MODAL_SECRETS", "")
if global_secrets_str:
for secret_name in global_secrets_str.split(","):
secret_name = secret_name.strip()
if secret_name and secret_name not in self.profile.secrets:
try:
secrets_list.append(modal.Secret.from_name(secret_name))
except Exception as e:
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Warning: Could not load global secret '{secret_name}': {e}")
if secrets_list:
create_kwargs["secrets"] = secrets_list
if not os.getenv("HERMES_QUIET"):
gpu_str = f" with GPU={self.profile.gpu}" if self.profile.gpu else ""
secrets_str = f" with {len(secrets_list)} secret(s)" if secrets_list else ""
print(f"[Modal] Creating sandbox: {name}{gpu_str}{secrets_str}")
return modal.Sandbox.create(**create_kwargs)
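# Named secrets referenced above must already exist in Modal; they can be
# created from the host via the Modal CLI, e.g. (hypothetical name/value):
#   modal secret create my-api-keys OPENAI_API_KEY=sk-...
# before being listed in a profile's `secrets` field.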
def _find_available_slot(self) -> Optional[str]:
"""Find an available sandbox in the pool (not currently in use)."""
in_use_names = set(self._in_use.values())
for name in self._pool:
if name not in in_use_names:
# Verify sandbox is still running
try:
if self._pool[name].poll() is None:
return name
else:
# Sandbox died, remove it
del self._pool[name]
self._last_used.pop(name, None)
except:
pass
return None
def _current_size(self) -> int:
"""Get current pool size."""
return len(self._pool)
def acquire(self, task_id: str, timeout: float = 60.0) -> Any:
"""
Acquire a sandbox for a task.
- Returns existing sandbox if task already has one
- Finds available sandbox in pool if any
- Scales up if under max_pool and all busy
- Waits if at max_pool and all busy
"""
deadline = time.time() + timeout
while True:
with self._lock:
# Task already has a sandbox?
if task_id in self._in_use:
name = self._in_use[task_id]
self._last_used[name] = time.time()
return self._pool[name]
self._ensure_app()
# Find available slot in pool
available = self._find_available_slot()
if available:
self._in_use[task_id] = available
self._last_used[available] = time.time()
return self._pool[available]
# Scale up if under max
if self._current_size() < self.profile.max_pool:
name = self._get_sandbox_name(self._next_index)
self._next_index += 1
try:
sb = self._recover_or_create_sandbox(name)
self._pool[name] = sb
self._in_use[task_id] = name
self._last_used[name] = time.time()
return sb
except Exception as e:
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Failed to create sandbox: {e}")
raise
# At capacity - wait and retry
if time.time() > deadline:
raise TimeoutError(
f"No Modal sandbox available for profile '{self.profile.name}' "
f"within {timeout}s (pool size: {self._current_size()}/{self.profile.max_pool})"
)
time.sleep(0.5)
def release(self, task_id: str, terminate: bool = False):
"""
Release a sandbox back to the pool.
If terminate=False, sandbox stays warm for reuse.
If terminate=True, sandbox is terminated immediately.
"""
with self._lock:
if task_id not in self._in_use:
return
name = self._in_use.pop(task_id)
self._last_used[name] = time.time()
if terminate:
self._terminate_sandbox(name)
def _terminate_sandbox(self, name: str, during_shutdown: bool = False):
"""Terminate and remove a sandbox from the pool."""
if name in self._pool:
try:
self._pool[name].terminate()
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Terminated sandbox: {name}")
except Exception as e:
if not during_shutdown and not os.getenv("HERMES_QUIET"):
print(f"[Modal] Error terminating {name}: {e}")
del self._pool[name]
self._last_used.pop(name, None)
def _scale_down_monitor(self):
"""Background thread: terminate idle sandboxes above min_pool size."""
while self._running:
time.sleep(30) # Check every 30 seconds
with self._lock:
if self._current_size() <= self.profile.min_pool:
continue
now = time.time()
in_use_names = set(self._in_use.values())
# Find idle sandboxes to terminate
to_terminate = []
for name, last_used in list(self._last_used.items()):
if name in in_use_names:
continue
if now - last_used > self.profile.scale_down_idle:
# Don't go below min_pool
if self._current_size() - len(to_terminate) > self.profile.min_pool:
to_terminate.append(name)
for name in to_terminate:
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Scaling down idle sandbox: {name}")
self._terminate_sandbox(name)
def shutdown(self, during_shutdown: bool = False):
"""Stop monitor thread and terminate all sandboxes."""
self._running = False
with self._lock:
for name in list(self._pool.keys()):
self._terminate_sandbox(name, during_shutdown=during_shutdown)
class _ModalPoolManager:
"""
Manages multiple sandbox pools, one per profile.
Singleton pattern - shared across all _ModalSandboxEnvironment instances.
Each profile has its own pool with independent scaling.
"""
_instance: ClassVar[Optional["_ModalPoolManager"]] = None
_init_lock: ClassVar[threading.Lock] = threading.Lock()
@classmethod
def get_instance(cls) -> "_ModalPoolManager":
"""Get or create the singleton instance."""
with cls._init_lock:
if cls._instance is None:
cls._instance = cls()
return cls._instance
@classmethod
def reset_instance(cls):
"""Reset the singleton (for testing)."""
with cls._init_lock:
if cls._instance is not None:
cls._instance.shutdown()
cls._instance = None
def __init__(self):
self.app_name = os.getenv("TERMINAL_MODAL_APP_NAME", "hermes-sandbox")
self.profiles = ModalProfile.load_profiles()
self.default_profile = os.getenv("TERMINAL_MODAL_DEFAULT_PROFILE", "default")
# Fall back to first profile if default not found
if self.default_profile not in self.profiles and self.profiles:
self.default_profile = next(iter(self.profiles.keys()))
self._pools: Dict[str, _ModalSandboxPool] = {}
self._pools_lock = threading.Lock()
if not os.getenv("HERMES_QUIET"):
print(f"[Modal] Pool manager initialized with profiles: {list(self.profiles.keys())}")
print(f"[Modal] Default profile: {self.default_profile}")
def _get_pool(self, profile_name: str) -> _ModalSandboxPool:
"""Get or create a pool for a profile."""
with self._pools_lock:
if profile_name not in self._pools:
if profile_name not in self.profiles:
available = list(self.profiles.keys())
raise ValueError(
f"Unknown Modal profile: '{profile_name}'. "
f"Available profiles: {available}"
)
profile = self.profiles[profile_name]
self._pools[profile_name] = _ModalSandboxPool(profile, self.app_name)
return self._pools[profile_name]
def acquire(self, task_id: str, profile: Optional[str] = None, timeout: float = 60.0) -> Any:
"""Acquire a sandbox from the appropriate profile's pool."""
profile_name = profile or self.default_profile
return self._get_pool(profile_name).acquire(task_id, timeout=timeout)
def release(self, task_id: str, profile: Optional[str] = None, terminate: bool = False):
"""Release a sandbox back to its pool."""
profile_name = profile or self.default_profile
if profile_name in self._pools:
self._pools[profile_name].release(task_id, terminate=terminate)
def get_status(self) -> Dict[str, Any]:
"""Get status of all pools."""
status = {}
with self._pools_lock:
for name, pool in self._pools.items():
with pool._lock:
status[name] = {
"pool_size": pool._current_size(),
"in_use": len(pool._in_use),
"max_pool": pool.profile.max_pool,
"min_pool": pool.profile.min_pool,
}
return status
def shutdown(self, during_shutdown: bool = False):
"""Shutdown all pools."""
with self._pools_lock:
for pool in self._pools.values():
pool.shutdown(during_shutdown=during_shutdown)
self._pools.clear()
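# Minimal usage sketch of the manager API defined above (assumes Modal
# credentials are configured; the task id is arbitrary):
#   manager = _ModalPoolManager.get_instance()
#   sb = manager.acquire("task-123")             # default profile
#   proc = sb.exec("bash", "-c", "echo hi")      # modal.Sandbox handle
#   manager.release("task-123")                  # sandbox stays warm for reuse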
class _ModalSandboxEnvironment:
"""
Modal Sandbox environment with profile-based pool management.
Features:
- Profile selection for heterogeneous workloads
- Auto-scaling warm sandbox pool
- Named sandbox recovery
- SUDO_PASSWORD support
"""
def __init__(
self,
image: str, # Used only if no profile config
cwd: str = "/workspace",
timeout: int = 60,
task_id: str = "",
profile: Optional[str] = None, # Profile name (e.g., "pytorch-gpu")
):
self.cwd = cwd
self.timeout = timeout
self.task_id = task_id or str(uuid.uuid4())
self.profile = profile
self._released = False
# Acquire sandbox from pool
manager = _ModalPoolManager.get_instance()
self._sandbox = manager.acquire(self.task_id, profile=profile)
def execute(self, command: str, cwd: str = "", *, timeout: int | None = None) -> dict:
"""Execute a command in Modal with sudo support."""
# Delegate to inner environment with transformed command
return self._inner.execute(exec_command, cwd=cwd, timeout=timeout)
"""Execute a command in the Modal sandbox."""
# Transform sudo commands if SUDO_PASSWORD is available
exec_command = _transform_sudo_command(command)
work_dir = cwd or self.cwd
try:
# Run command via bash with proper working directory
process = self._sandbox.exec(
"bash", "-c", f"cd {work_dir} && {exec_command}",
timeout=timeout or self.timeout
)
# Read output
stdout = process.stdout.read()
stderr = process.stderr.read()
process.wait()
# Combine stdout and stderr
output = stdout
if stderr:
output = output + stderr if output else stderr
return {"output": output, "returncode": process.returncode}
except Exception as e:
error_msg = str(e)
if "timeout" in error_msg.lower():
return {"output": f"Command timed out after {timeout or self.timeout}s", "returncode": 124}
return {"output": f"Modal execution error: {error_msg}", "returncode": 1}
def cleanup(self):
"""Cleanup the Modal deployment."""
if hasattr(self._inner, 'stop'):
self._inner.stop()
"""Release sandbox back to pool (stays warm for reuse)."""
if not self._released:
self._released = True
_ModalPoolManager.get_instance().release(
self.task_id,
profile=self.profile,
terminate=False
)
def stop(self):
"""Stop the Modal deployment."""
self.cleanup()
"""Terminate this sandbox explicitly."""
if not self._released:
self._released = True
_ModalPoolManager.get_instance().release(
self.task_id,
profile=self.profile,
terminate=True
)
def __del__(self):
"""Cleanup on destruction."""
@@ -1090,8 +1643,14 @@ def _create_environment(env_type: str, image: str, cwd: str, timeout: int, ssh_c
return _SingularityEnvironment(image=image, cwd=cwd, timeout=timeout)
elif env_type == "modal":
# Use custom Modal wrapper with sudo support
return _ModalEnvironment(image=image, cwd=cwd, timeout=timeout)
# Use native Modal Sandbox with auto-scaling pool and profile support
return _ModalSandboxEnvironment(
image=image,
cwd=cwd,
timeout=timeout,
task_id=task_id,
profile=profile,
)
elif env_type == "ssh": elif env_type == "ssh":
if not ssh_config or not ssh_config.get("host") or not ssh_config.get("user"): if not ssh_config or not ssh_config.get("host") or not ssh_config.get("user"):
@@ -1286,13 +1845,24 @@ def cleanup_vm(task_id: str):
atexit.register(_stop_cleanup_thread)
def _shutdown_modal_pools():
"""Shutdown Modal pool manager on exit (silently, as interpreter is shutting down)."""
try:
if _ModalPoolManager._instance is not None:
_ModalPoolManager._instance.shutdown(during_shutdown=True)
except:
pass # Ignore all errors during interpreter shutdown
atexit.register(_shutdown_modal_pools)
def terminal_tool(
command: str,
background: bool = False,
timeout: Optional[int] = None,
task_id: Optional[str] = None,
force: bool = False,
profile: Optional[str] = None,
) -> str:
"""
Execute a command using mini-swe-agent's execution environments.

3845
uv.lock generated Normal file

File diff suppressed because it is too large

1
wandb/latest-run Symbolic link
View File

@@ -0,0 +1 @@
run-20260206_003827-82b0oahi

View File

@@ -0,0 +1,180 @@
_wandb:
value:
cli_version: 0.24.2
e:
2gw7xuffca69jbm2b60l3w5ymo5pb5lf:
args:
- process
- --env.driver
- singularity
- --env.singularity_image
- /root/Hermes-Agent/atropos/atropos-sandbox.sif
email: shannon@nousresearch.com
executable: /root/Hermes-Agent/.venv/bin/python
git:
commit: 4d619bcd21feedc9eed36c53c038585d97e7295e
remote: https://github.com/NousResearch/Hermes-Agent.git
host: vultr
os: Linux-6.8.0-90-generic-x86_64-with-glibc2.39
program: -m atropos.envs.swe_smith_oracle_env
python: CPython 3.12.3
root: /root/Hermes-Agent
startedAt: "2026-02-06T00:38:27.351013Z"
writerId: 2gw7xuffca69jbm2b60l3w5ymo5pb5lf
m: []
python_version: 3.12.3
t:
"1":
- 11
- 49
- 51
- 95
"3":
- 13
- 16
"4": 3.12.3
"5": 0.24.2
"6": 5.0.0
"12": 0.24.2
"13": linux-x86_64
acquire_timeout_s:
value: 30
agent_max_steps:
value: 50
agent_max_tokens:
value: null
agent_temperature:
value: 0.7
agent_tool_delay_s:
value: 0
allow_network:
value: true
batch_size:
value: 1
custom_thinking_prompt:
value: null
data_dir_to_save_evals:
value: null
data_path_to_save_groups:
value: data/swe_smith_oracle_env_2.jsonl
dataset_name:
value: NousResearch/SWE-smith-oracle
dataset_split:
value: train
disabled_toolsets:
value: []
driver:
value: singularity
enabled_toolsets:
value:
- terminal
ensure_scores_are_not_same:
value: false
eval_handling:
value: STOP_TRAIN
eval_limit_ratio:
value: 0.5
group_size:
value: 1
include_messages:
value: true
inference_weight:
value: 1
install_timeout_s:
value: 600
max_batches_offpolicy:
value: 3
max_containers:
value: 10
max_eval_workers:
value: 16
max_items:
value: 0
max_num_workers:
value: -1
max_num_workers_per_node:
value: 8
max_reasoning_tokens:
value: null
max_token_length:
value: 8192
min_batch_allocation:
value: null
min_containers:
value: 1
min_items_sent_before_logging:
value: 2
modal_app_name:
value: atropos-sandbox
modal_function_name:
value: sandbox_server
modal_volume_mount_path:
value: /data
modal_volume_name:
value: null
nomad_address:
value: http://localhost:4646
num_rollouts_per_group_for_logging:
value: 1
num_rollouts_to_keep:
value: 32
privileged:
value: false
prompt_mode:
value: problem_statement
purge_job_on_shutdown:
value: true
purge_job_on_start:
value: true
python_only:
value: true
reasoning_effort:
value: null
repo_base_url:
value: https://github.com
require_sandbox:
value: false
require_stateful_sandbox:
value: false
rollout_server_url:
value: http://localhost:8000
sandbox_image:
value: atropos-sandbox:local
sandbox_job_id:
value: atropos-sandbox-agent-env
score_include_fail_to_pass:
value: true
seed:
value: 0
shuffle:
value: true
singularity_image:
value: /root/Hermes-Agent/atropos/atropos-sandbox.sif
slots_per_container:
value: 10
steps_per_eval:
value: 1
test_timeout_s:
value: 600
thinking_mode:
value: false
tokenizer_name:
value: NousResearch/Hermes-4.3-36B
tool_batch_window_ms:
value: 20
tool_max_batch_size:
value: 200
tool_pool_mode:
value: nomad
tool_server_token:
value: null
tool_server_url:
value: null
total_steps:
value: 1
use_wandb:
value: true
wandb_name:
value: swe_smith_oracle
worker_timeout:
value: 600

Binary file not shown.