Compare commits

...

51 Commits

Author SHA1 Message Date
hjc-puro
e06a15b3ab add profiling 2025-11-18 07:12:05 -05:00
hjc-puro
349e37de0a add linewise profiling 2025-11-17 23:21:36 -05:00
hjc-puro
ab7293bed6 don't log exit code !=0 as terminal failure 2025-11-17 18:39:16 -05:00
hjc-puro
1614c15bb1 rate limits 2025-11-17 18:35:36 -05:00
hjc-puro
f813959750 add simple terminal 2025-11-17 01:14:31 -05:00
teknium
f957ec2267 update distribution and gitignore 2025-11-16 01:03:23 +00:00
Teknium
92e3074c10 Merge pull request #9 from NousResearch/tc-logging
Add logging for first 100 chars of the tool call args json / tool response
2025-11-15 14:03:24 -08:00
hjc-puro
31c733383b add tracking for cluster failures 2025-11-15 00:01:19 -05:00
hjc-puro
0c618482c4 add logging of prefix of tool call and tool response 2025-11-07 14:43:44 -05:00
hjc-puro
2d8f6c46f1 log first 20 chars 2025-11-07 14:08:06 -05:00
teknium
c27787f09f fix gitignore again 2025-11-05 06:43:03 +00:00
teknium
d90fcd4e2b update gitignore 2025-11-05 06:43:03 +00:00
Teknium
69fd0ca9aa Merge pull request #7 from NousResearch/test
some cleanups
2025-11-04 19:54:49 -08:00
Teknium
4135cf4682 Merge branch 'main' into test 2025-11-04 19:54:40 -08:00
teknium
c82741c3d8 some cleanups 2025-11-05 03:47:17 +00:00
Teknium
9573b2ac2d Merge pull request #6 from NousResearch/fix-leakage
Fix VM instance sharing across tasks
2025-11-04 02:15:32 -08:00
hjc-puro
fbd3a2fdb8 prevent leakage of morph instances between tasks 2025-11-04 03:32:43 -05:00
hjc-puro
a4db3fdee5 fix leakage 2025-11-03 17:42:23 -05:00
Teknium
ab5c9fc37b Merge pull request #5 from NousResearch/update-snapshot
Update snapshot
2025-11-02 21:30:08 -08:00
hjc-puro
0ca3e0aaa9 update snapshot 2025-11-02 23:13:49 -05:00
teknium
f6f75cbe2b update webtools 2025-11-02 06:03:21 +00:00
Teknium
d4544f08c5 Merge pull request #4 from NousResearch/fix-terminal
Fix terminal interactivity
2025-11-01 22:39:21 -07:00
hjc-puro
a6ec79730c terminal tool 2025-11-02 08:57:04 +08:00
hjc-puro
faecbddd9b fix terminal interactivity 2025-11-02 08:52:05 +08:00
teknium
de9c0edc51 some bugfixes 2025-10-15 18:07:06 +00:00
teknium
8d256779d8 Update vision_tools.py to include image downloading and base64 conversion features.
add excluding tmp image dl's in .gitignore
2025-10-08 02:38:04 +00:00
teknium
d36790de91 Add ephemeral system prompt support in batch and agent runners. Update README with usage examples and documentation for the new feature. Ensure prompt is not saved to trajectories. 2025-10-08 02:33:58 +00:00
teknium
a398d320b7 update gitignore 2025-10-07 14:09:37 +00:00
teknium
22b6d5866c Fix some issues around async and tool constraints 2025-10-07 14:08:46 +00:00
teknium
0e2e69a71d Add batch processing capabilities with checkpointing and statistics tracking, along with toolset distribution management. Update README and add test scripts for validation. 2025-10-06 03:17:58 +00:00
teknium
bc5f0e62d9 Add support for enabling all toolsets with 'all' or '*' alias in README and toolset resolution logic 2025-10-03 13:45:29 +00:00
teknium
6fac6fecde Enhance import handling for Hecate in terminal_tool.py to manage local folder shadowing and improve error reporting for import failures. 2025-10-03 09:46:44 +00:00
teknium
c42d9055ed Move test run back to repo root. weirdness occurred 2025-10-02 20:05:09 +00:00
teknium
a7ff4d49e9 A bit of restructuring for simplicity and organization 2025-10-01 23:29:25 +00:00
teknium
0411ca1880 Add environment configuration file, restructure tool imports, and enhance README setup instructions 2025-10-01 09:54:17 +00:00
Teknium
c5386ed7e6 add better logging when requests fail 2025-09-10 00:51:41 -07:00
Teknium
2082c7caa3 update gitignore 2025-09-10 00:50:56 -07:00
Teknium
17608c1142 Update to use toolsets and make them easy to create and configure 2025-09-10 00:43:55 -07:00
Teknium
c7fa4447b8 cleanup 2025-09-06 22:07:38 -07:00
Teknium
587d1cf720 Fix Web Tools, Upgrade MoA to GPT5, Add Trajectory Saving 2025-08-31 03:04:10 -07:00
Teknium
4ece87efb0 update to firecrawl 2025-08-21 08:12:24 -07:00
Teknium
96cff78335 cleanup 2025-08-21 08:12:01 -07:00
Teknium
58d5fa1e4c update fal requirements 2025-08-21 08:10:54 -07:00
Teknium
f4ff1f496b update gitignore 2025-08-09 17:42:15 -07:00
Teknium
e1710378b7 update model_tools for imagen and moa 2025-08-09 09:52:25 -07:00
Teknium
bc71dffd4c update requirements for fal image api 2025-08-09 09:52:07 -07:00
Teknium
ebb46ba0e6 add image generation tool 2025-08-09 09:51:53 -07:00
Teknium
3078053795 add mixture of agents tool 2025-08-09 09:51:45 -07:00
Teknium
cde7e64418 add vision model tool, cli updates for exclusive and inclusive toolsets 2025-08-04 00:14:16 -07:00
Teknium
bf4223f381 implement first pass of scrape/crawl content compression 2025-07-31 10:11:27 -07:00
Teknium
1dacd941f6 Merge pull request #1 from NousResearch/terminal
Terminal tool
2025-07-26 07:13:34 -07:00
33 changed files with 9437 additions and 1204 deletions

23
.cursorrules Normal file

@@ -0,0 +1,23 @@
Hermes-Agent is an agent harness for LLMs.
When building, the tool functionality lives in the tools/ directory, where each specific tool (or, in some cases, a group of tools built for the same execution category or API) is placed in its own script.
Each tool is then consolidated in the model_tools.py file in the repo root.
There is also a way to consolidate sets of tools in toolsets.py for the agent to use.
The primary agent runner code is in run_agent, but other runners could be developed using the tools and framework.
Always ensure consistency between the tools, model_tools.py, and toolsets.py when changing any of them; otherwise they can become desynced in a way that is detrimental to functionality.
The expected pathway for API keys is to set them up in a .env file in the repo root.
Test scripts will be placed in tests/
The run_agent loop is setup to:
- Process the enabled toolsets to provide to the model,
- Pipe in a prompt or problem from the input to the agent,
- Loop the LLM each time it calls a tool, until the model decides no more tools are needed and provides a natural language response,
- Return that response.
There are additional caveats for logging, where we restructure the "tools" as a system prompt so they can be stored in a format that can be used and handled properly later.
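For orientation, here is a minimal sketch of the run_agent loop described above (process the enabled toolsets, send the prompt, keep calling tools until the model answers in natural language). It targets the OpenAI-compatible client the runner uses; `resolve_toolsets` and `execute_tool` are hypothetical stand-ins for the real logic in `toolsets.py` and `model_tools.py`.

```python
# Minimal sketch of the tool-calling loop from .cursorrules.
# resolve_toolsets() and execute_tool() are placeholders, not the repo's real helpers.
import json
from openai import OpenAI

def resolve_toolsets(names):
    """Placeholder: the real tool schemas come from toolsets.py / model_tools.py."""
    return []

def execute_tool(name, args):
    """Placeholder: the real dispatch lives in model_tools.py."""
    return json.dumps({"error": f"tool {name} is not wired up in this sketch"})

def run_agent_sketch(query, enabled_toolsets, model="claude-sonnet-4-20250514", max_turns=20):
    client = OpenAI()  # base_url / api_key come from CLI flags or .env in the real runner
    tools = resolve_toolsets(enabled_toolsets)
    messages = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        kwargs = {"tools": tools} if tools else {}
        response = client.chat.completions.create(model=model, messages=messages, **kwargs)
        msg = response.choices[0].message
        if not msg.tool_calls:          # no more tools needed -> natural language answer
            return msg.content
        messages.append(msg)            # keep the assistant's tool-call turn in context
        for call in msg.tool_calls:     # execute each requested tool and feed the result back
            result = execute_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "Reached max_turns without a final response."
```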

49
.env.example Normal file

@@ -0,0 +1,49 @@
# Hermes Agent Environment Configuration
# Copy this file to .env and fill in your API keys
# Get API keys from the URLs listed below
# =============================================================================
# REQUIRED API KEYS
# =============================================================================
# Anthropic API Key - Main agent model
# Get at: https://console.anthropic.com/
ANTHROPIC_API_KEY=
# Firecrawl API Key - Web search, extract, and crawl
# Get at: https://firecrawl.dev/
FIRECRAWL_API_KEY=
# Nous Research API Key - Vision analysis and multi-model reasoning
# Get at: https://inference-api.nousresearch.com/
NOUS_API_KEY=
# Morph API Key - Terminal/command execution tools
# Get at: https://morph.so/
MORPH_API_KEY=
# FAL.ai API Key - Image generation
# Get at: https://fal.ai/
FAL_KEY=
# =============================================================================
# OPTIONAL API KEYS
# =============================================================================
# OpenAI API Key - Optional, for enhanced Hecate features
# Get at: https://platform.openai.com/
OPENAI_API_KEY=
# =============================================================================
# OPTIONAL CONFIGURATION
# =============================================================================
# Terminal Tool Settings
HECATE_VM_LIFETIME_SECONDS=300
HECATE_DEFAULT_SNAPSHOT_ID=snapshot_p5294qxt
# Debug Logging (set to "true" to enable, logs saved to ./logs/)
WEB_TOOLS_DEBUG=false
VISION_TOOLS_DEBUG=false
MOA_TOOLS_DEBUG=false
IMAGE_TOOLS_DEBUG=false

32
.gitignore vendored

@@ -1,2 +1,32 @@
/venv/
/_pycache/
/_pycache/
hecate/
hecate-lib/
*.pyc*
__pycache__/
.venv/
.vscode/
.env
.env.local
.env.development.local
.env.test.local
.env.production.local
.env.development
.env.test
export*
__pycache__/model_tools.cpython-310.pyc
__pycache__/web_tools.cpython-310.pyc
logs/
data/
.pytest_cache/
tmp/
temp_vision_images/
hermes-*/*
examples/
tests/quick_test_dataset.jsonl
tests/sample_dataset.jsonl
run_datagen_kimik2-thinking.sh
run_datagen_megascience_glm4-6.sh
run_datagen_sonnet.sh
source-data/*
run_datagen_megascience_glm4-6.sh

230
README.md

@@ -1,13 +1,99 @@
# Hermes Agent
An AI agent with advanced tool-calling capabilities, featuring a flexible toolsets system for organizing and managing tools.
## Features
- **Web Tools**: Search, extract content, and crawl websites
- **Terminal Tools**: Execute commands with interactive session support
- **Vision Tools**: Analyze images from URLs
- **Reasoning Tools**: Advanced multi-model reasoning (Mixture of Agents)
- **Creative Tools**: Generate images from text prompts
- **Toolsets System**: Organize tools into logical groups for different scenarios
- **Batch Processing**: Process datasets in parallel with checkpointing and statistics tracking
- **Ephemeral System Prompts**: Guide model behavior without polluting training datasets
## Setup
```
### 1. Install Dependencies
```bash
# Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install required packages
pip install -r requirements.txt
# Install Hecate for terminal tools
git clone git@github.com:NousResearch/hecate.git
cd hecate
pip install -e .
cd ..
```
## Run
### 2. Configure Environment Variables
```bash
# Copy the example environment file
cp .env.example .env
# Edit .env and add your API keys
nano .env # or use your preferred editor
```
**Required API Keys:**
- `ANTHROPIC_API_KEY` - Main agent model (get at: https://console.anthropic.com/)
- `FIRECRAWL_API_KEY` - Web tools (get at: https://firecrawl.dev/)
- `NOUS_API_KEY` - Vision & reasoning tools (get at: https://inference-api.nousresearch.com/)
- `MORPH_API_KEY` - Terminal tools (get at: https://morph.so/)
- `FAL_KEY` - Image generation (get at: https://fal.ai/)
- `OPENAI_API_KEY` - Optional, for some Hecate features
See `.env.example` for all available configuration options including debug settings and terminal tool configuration.
## Toolsets System
The agent uses a toolsets system for organizing and managing tools. All tools must be part of a toolset to be accessible - individual tool selection is not supported. This ensures consistent and logical grouping of capabilities.
### Key Concepts
- **Toolsets**: Logical groups of tools for specific use cases (e.g., "research", "development", "debugging")
- **Composition**: Toolsets can include other toolsets for powerful combinations
- **Custom Toolsets**: Create your own toolsets at runtime or by editing `toolsets.py`
- **Toolset-Only Access**: Tools are only accessible through toolsets, not individually
### Available Toolsets
See `toolsets.py` for the complete list of predefined toolsets including:
- Basic toolsets (web, terminal, vision, creative, reasoning)
- Composite toolsets (research, development, analysis, etc.)
- Scenario-specific toolsets (debugging, documentation, API testing, etc.)
- Special toolsets (safe mode without terminal, minimal, offline)
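To make composition concrete, a predefined toolset entry might look roughly like the sketch below; the exact schema is an assumption, and the real definitions live in `toolsets.py`.

```python
# Hypothetical shape of predefined toolsets; see toolsets.py for the actual schema and names.
TOOLSETS = {
    "web": {
        "description": "Search, extract content, and crawl websites",
        "tools": ["web_search"],            # plus the other web tools defined in model_tools.py
    },
    "research": {
        "description": "Web research plus multi-model reasoning",
        "includes": ["web", "reasoning"],   # composition: a toolset can pull in other toolsets
    },
    "safe": {
        "description": "Everything except terminal access",
        "includes": ["web", "vision", "reasoning", "creative"],
    },
}
```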
### Using Toolsets
```bash
# Use a predefined toolset
python run_agent.py --enabled_toolsets=research --query "Find latest AI papers"
# Combine multiple toolsets
python run_agent.py --enabled_toolsets=web,vision --query "Analyze this website"
# Enable all toolsets explicitly (same as omitting the flag)
python run_agent.py --enabled_toolsets=all --query "Do web research and run commands if helpful"
# Safe mode (no terminal access)
python run_agent.py --enabled_toolsets=safe --query "Help without running commands"
# List all available toolsets and tools
python run_agent.py --list_tools
```
For detailed documentation on toolsets, see `TOOLSETS_README.md`.
## Basic Usage
### Default (all tools enabled)
```bash
python run_agent.py \
--query "search up the latest docs on jit in python 3.13 and write me basic example that's not in their docs. profile its perf" \
--max_turns 20 \
@@ -15,3 +101,143 @@ python run_agent.py \
--base_url https://api.anthropic.com/v1/ \
--api_key $ANTHROPIC_API_KEY
```
### With specific toolset
```bash
python run_agent.py \
--query "Debug this Python error" \
--enabled_toolsets=debugging \
--model claude-sonnet-4-20250514 \
--api_key $ANTHROPIC_API_KEY
```
### Python API
```python
from run_agent import AIAgent
# Use a specific toolset
agent = AIAgent(
model="claude-opus-4-20250514",
enabled_toolsets=["research"]
)
response = agent.chat("Find information about quantum computing")
# Create custom toolset at runtime
from toolsets import create_custom_toolset
create_custom_toolset(
name="my_tools",
description="My custom toolkit",
tools=["web_search"],
includes=["terminal", "vision"]
)
agent = AIAgent(enabled_toolsets=["my_tools"])
```
## Batch Processing
Process multiple prompts from a dataset in parallel with automatic checkpointing and statistics tracking:
```bash
# Basic batch processing
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--run_name=my_run
# With specific distribution
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--run_name=image_run \
--distribution=image_gen \
--num_workers=4
```
**Key Features:**
- Parallel processing with configurable workers
- Toolset distributions for varied data generation
- Automatic checkpointing and resume capability
- Combined output in `data/<run_name>/trajectories.jsonl`
- Tool usage statistics and success rates
**Quick Start:** See [QUICKSTART_BATCH.md](QUICKSTART_BATCH.md) for a 5-minute getting started guide.
**Full Documentation:** See [BATCH_PROCESSING.md](BATCH_PROCESSING.md) for comprehensive documentation.
### Ephemeral System Prompts
The ephemeral system prompt feature allows you to guide the model's behavior during batch processing **without** saving that prompt to the training dataset trajectories. This is useful for:
- Guiding model behavior during data collection
- Adding task-specific instructions
- Keeping saved trajectories clean and focused on tool-calling format
**Example:**
```bash
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=10 \
--run_name=my_run \
--ephemeral_system_prompt="You are a helpful assistant focused on image generation."
```
The ephemeral prompt will influence the model's behavior during execution, but **only the standard tool-calling system prompt** will be saved in the trajectory files.
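As a rough sketch of how that separation can work (the actual logic lives in `batch_runner.py`; the function names here are assumptions): the ephemeral prompt is added only to the messages sent to the API, and the saved trajectory is rebuilt without it.

```python
# Hedged sketch of the ephemeral-prompt separation; not the repository's exact implementation.
def build_messages(prompt, tool_system_prompt, ephemeral=None):
    """Messages sent to the model: ephemeral guidance is prepended if present."""
    messages = [{"role": "system", "content": tool_system_prompt},
                {"role": "user", "content": prompt}]
    if ephemeral:
        messages.insert(0, {"role": "system", "content": ephemeral})
    return messages

def to_trajectory(messages, ephemeral=None):
    """Saved trajectory: drop the ephemeral system message, keep everything else."""
    if not ephemeral:
        return list(messages)
    return [m for m in messages
            if not (m.get("role") == "system" and m.get("content") == ephemeral)]
```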
**Documentation:** See [docs/ephemeral_system_prompt.md](docs/ephemeral_system_prompt.md) for complete details.
## Command Line Arguments
**Single Agent (`run_agent.py`):**
- `--query`: The question or task for the agent
- `--model`: Model to use (default: claude-opus-4-20250514)
- `--api_key`: API key for authentication
- `--base_url`: API endpoint URL
- `--max_turns`: Maximum number of tool-calling iterations
- `--enabled_toolsets`: Comma-separated list of toolsets to enable. Use `all` (or `*`) to enable everything. If omitted, all toolsets are enabled by default.
- `--disabled_toolsets`: Comma-separated list of toolsets to disable
- `--list_tools`: List all available toolsets and tools
- `--save_trajectories`: Save conversation trajectories to JSONL files
**Batch Processing (`batch_runner.py`):**
- `--dataset_file`: Path to JSONL file with prompts
- `--batch_size`: Number of prompts per batch
- `--run_name`: Name for this run (for output/checkpointing)
- `--distribution`: Toolset distribution to use (default: "default")
- `--num_workers`: Number of parallel workers (default: 4)
- `--resume`: Resume from checkpoint if interrupted
- `--ephemeral_system_prompt`: System prompt used during execution but NOT saved to trajectories
- `--list_distributions`: List available toolset distributions
## Environment Variables
All environment variables can be configured in the `.env` file (copy from `.env.example`).
**Core API Keys:**
- `ANTHROPIC_API_KEY`: Main agent model
- `FIRECRAWL_API_KEY`: Web tools (search, extract, crawl)
- `NOUS_API_KEY`: Vision and reasoning tools
- `MORPH_API_KEY`: Terminal tools
- `FAL_KEY`: Image generation tools
- `OPENAI_API_KEY`: Optional, for some Hecate features
**Configuration Options:**
- `HECATE_VM_LIFETIME_SECONDS`: VM lifetime (default: 300)
- `HECATE_DEFAULT_SNAPSHOT_ID`: Default snapshot (default: snapshot_p5294qxt)
- `WEB_TOOLS_DEBUG`, `VISION_TOOLS_DEBUG`, `MOA_TOOLS_DEBUG`, `IMAGE_TOOLS_DEBUG`: Enable debug logging
## Documentation
**Single Agent Usage:**
- `TOOLSETS_README.md`: Comprehensive guide to the toolsets system
- `toolsets.py`: View and modify available toolsets
- `model_tools.py`: Core tool definitions and handlers
**Batch Processing:**
- `QUICKSTART_BATCH.md`: 5-minute quick start guide
- `BATCH_PROCESSING.md`: Complete batch processing documentation
- `toolset_distributions.py`: Toolset distributions for data generation
## Examples
See `TOOLSETS_README.md` for extensive examples of using different toolsets for various scenarios.

1262
batch_runner.py Normal file

File diff suppressed because it is too large

File diff suppressed because it is too large

381
profiling.py Normal file

@@ -0,0 +1,381 @@
"""
Profiling module for tracking timing statistics of tools and LLM API calls.
This module provides a centralized way to track timing information for various
operations in the agent system, including:
- Individual tool executions
- OpenAI API calls
- Aggregate statistics (min, max, median, mean, total)
"""
import time
from typing import Dict, List, Optional
from dataclasses import dataclass, field
from collections import defaultdict
import statistics
@dataclass
class ProfilingStats:
"""Statistics for a particular operation type."""
call_count: int = 0
total_time: float = 0.0
min_time: float = float('inf')
max_time: float = 0.0
times: List[float] = field(default_factory=list)
def add_timing(self, duration: float):
"""Add a timing measurement."""
self.call_count += 1
self.total_time += duration
self.min_time = min(self.min_time, duration)
self.max_time = max(self.max_time, duration)
self.times.append(duration)
@property
def mean_time(self) -> float:
"""Calculate mean time."""
return self.total_time / self.call_count if self.call_count > 0 else 0.0
@property
def median_time(self) -> float:
"""Calculate median time."""
return statistics.median(self.times) if self.times else 0.0
def to_dict(self) -> Dict:
"""Convert to dictionary for serialization."""
return {
"call_count": self.call_count,
"total_time": self.total_time,
"min_time": self.min_time if self.min_time != float('inf') else 0.0,
"max_time": self.max_time,
"mean_time": self.mean_time,
"median_time": self.median_time
}
class Profiler:
"""
Global profiler for tracking timing statistics across tools and API calls.
Usage:
profiler = Profiler()
# Time a tool execution
with profiler.time_tool("web_search"):
# ... tool execution code ...
pass
# Time an API call
with profiler.time_api_call():
# ... API call code ...
pass
# Get statistics
stats = profiler.get_statistics()
"""
def __init__(self):
"""Initialize the profiler."""
self.tool_stats: Dict[str, ProfilingStats] = defaultdict(ProfilingStats)
self.api_stats: ProfilingStats = ProfilingStats()
self._enabled = True
def enable(self):
"""Enable profiling."""
self._enabled = True
def disable(self):
"""Disable profiling."""
self._enabled = False
def reset(self):
"""Reset all profiling data."""
self.tool_stats.clear()
self.api_stats = ProfilingStats()
def record_tool_timing(self, tool_name: str, duration: float):
"""Record timing for a tool execution."""
if self._enabled:
self.tool_stats[tool_name].add_timing(duration)
def record_api_timing(self, duration: float):
"""Record timing for an API call."""
if self._enabled:
self.api_stats.add_timing(duration)
def get_statistics(self) -> Dict:
"""
Get all profiling statistics.
Returns:
Dictionary containing tool and API statistics
"""
return {
"tools": {
tool_name: stats.to_dict()
for tool_name, stats in sorted(self.tool_stats.items())
},
"api_calls": self.api_stats.to_dict()
}
def print_statistics(self, detailed: bool = True):
"""
Print profiling statistics in a readable format.
Args:
detailed: If True, show per-tool breakdown. If False, show summary only.
"""
print("\n" + "="*80)
print("📊 PROFILING STATISTICS")
print("="*80)
# API Call Statistics
print("\n🔷 OpenAI API Calls:")
if self.api_stats.call_count > 0:
api_dict = self.api_stats.to_dict()
print(f" Total Calls: {api_dict['call_count']}")
print(f" Total Time: {api_dict['total_time']:.2f}s")
print(f" Min Time: {api_dict['min_time']:.2f}s")
print(f" Max Time: {api_dict['max_time']:.2f}s")
print(f" Mean Time: {api_dict['mean_time']:.2f}s")
print(f" Median Time: {api_dict['median_time']:.2f}s")
else:
print(" No API calls recorded")
# Tool Statistics
print("\n🔧 Tool Executions:")
if self.tool_stats:
if detailed:
for tool_name in sorted(self.tool_stats.keys()):
stats_dict = self.tool_stats[tool_name].to_dict()
print(f"\n 📌 {tool_name}:")
print(f" Total Calls: {stats_dict['call_count']}")
print(f" Total Time: {stats_dict['total_time']:.2f}s")
print(f" Min Time: {stats_dict['min_time']:.2f}s")
print(f" Max Time: {stats_dict['max_time']:.2f}s")
print(f" Mean Time: {stats_dict['mean_time']:.2f}s")
print(f" Median Time: {stats_dict['median_time']:.2f}s")
# Summary
total_tool_calls = sum(s.call_count for s in self.tool_stats.values())
total_tool_time = sum(s.total_time for s in self.tool_stats.values())
print(f"\n 📊 Summary:")
print(f" Total Tool Calls: {total_tool_calls}")
print(f" Total Tool Time: {total_tool_time:.2f}s")
print(f" Unique Tools Used: {len(self.tool_stats)}")
else:
print(" No tool executions recorded")
# Overall Summary
total_api_time = self.api_stats.total_time
total_tool_time = sum(s.total_time for s in self.tool_stats.values())
print(f"\n📈 Overall Summary:")
print(f" Total API Time: {total_api_time:.2f}s")
print(f" Total Tool Time: {total_tool_time:.2f}s")
print(f" Total Time: {total_api_time + total_tool_time:.2f}s")
print("="*80 + "\n")
def export_to_json(self) -> str:
"""Export statistics as JSON string."""
import json
return json.dumps(self.get_statistics(), indent=2)
def export_to_file(self, filepath: str):
"""
Export statistics to a JSON file.
Args:
filepath: Path to output file
"""
import json
with open(filepath, 'w') as f:
json.dump(self.get_statistics(), f, indent=2)
print(f"📁 Profiling statistics exported to: {filepath}")
# Global profiler instance
_global_profiler: Optional[Profiler] = None
def get_profiler() -> Profiler:
"""Get or create the global profiler instance."""
global _global_profiler
if _global_profiler is None:
_global_profiler = Profiler()
return _global_profiler
def reset_profiler():
"""Reset the global profiler."""
global _global_profiler
if _global_profiler is not None:
_global_profiler.reset()
class TimingContext:
"""Context manager for timing operations."""
def __init__(self, profiler: Profiler, operation_type: str, operation_name: Optional[str] = None):
"""
Initialize timing context.
Args:
profiler: Profiler instance to record timing
operation_type: 'tool' or 'api'
operation_name: Name of the operation (required for tools)
"""
self.profiler = profiler
self.operation_type = operation_type
self.operation_name = operation_name
self.start_time = None
def __enter__(self):
"""Start timing."""
self.start_time = time.time()
return self
def __exit__(self, exc_type, exc_val, exc_tb):
"""Stop timing and record."""
duration = time.time() - self.start_time
if self.operation_type == 'tool':
self.profiler.record_tool_timing(self.operation_name, duration)
elif self.operation_type == 'api':
self.profiler.record_api_timing(duration)
return False # Don't suppress exceptions
def aggregate_profiling_stats(stats_list: List[Dict]) -> Dict:
"""
Aggregate multiple profiling statistics dictionaries into one.
This is useful for batch processing where each worker process has its own
profiler instance that needs to be combined.
Args:
stats_list: List of statistics dictionaries from get_statistics()
Returns:
Dict: Aggregated statistics with combined tool and API call data
"""
aggregated = {
"tools": defaultdict(lambda: {"times": []}),
"api_calls": {"times": []}
}
# Aggregate tool statistics
for stats in stats_list:
# Aggregate tool timings
for tool_name, tool_stats in stats.get("tools", {}).items():
# Reconstruct individual timings from aggregated stats
# Since we have mean_time and call_count, we approximate
aggregated["tools"][tool_name]["times"].extend(
[tool_stats.get("mean_time", 0.0)] * tool_stats.get("call_count", 0)
)
# Aggregate API call timings
api_stats = stats.get("api_calls", {})
if api_stats.get("call_count", 0) > 0:
aggregated["api_calls"]["times"].extend(
[api_stats.get("mean_time", 0.0)] * api_stats.get("call_count", 0)
)
# Calculate final statistics for tools
final_stats = {"tools": {}, "api_calls": {}}
for tool_name, data in aggregated["tools"].items():
times = data["times"]
if times:
final_stats["tools"][tool_name] = {
"call_count": len(times),
"total_time": sum(times),
"min_time": min(times),
"max_time": max(times),
"mean_time": statistics.mean(times),
"median_time": statistics.median(times)
}
# Calculate final statistics for API calls
api_times = aggregated["api_calls"]["times"]
if api_times:
final_stats["api_calls"] = {
"call_count": len(api_times),
"total_time": sum(api_times),
"min_time": min(api_times),
"max_time": max(api_times),
"mean_time": statistics.mean(api_times),
"median_time": statistics.median(api_times)
}
else:
final_stats["api_calls"] = {
"call_count": 0,
"total_time": 0.0,
"min_time": 0.0,
"max_time": 0.0,
"mean_time": 0.0,
"median_time": 0.0
}
return final_stats
def print_aggregated_statistics(stats: Dict, detailed: bool = True):
"""
Print aggregated profiling statistics in a readable format.
Args:
stats: Aggregated statistics dictionary from aggregate_profiling_stats()
detailed: If True, show per-tool breakdown. If False, show summary only.
"""
print("\n" + "="*80)
print("📊 AGGREGATED PROFILING STATISTICS")
print("="*80)
# API Call Statistics
print("\n🔷 OpenAI API Calls:")
api_stats = stats.get("api_calls", {})
if api_stats.get("call_count", 0) > 0:
print(f" Total Calls: {api_stats['call_count']}")
print(f" Total Time: {api_stats['total_time']:.2f}s")
print(f" Min Time: {api_stats['min_time']:.2f}s")
print(f" Max Time: {api_stats['max_time']:.2f}s")
print(f" Mean Time: {api_stats['mean_time']:.2f}s")
print(f" Median Time: {api_stats['median_time']:.2f}s")
else:
print(" No API calls recorded")
# Tool Statistics
print("\n🔧 Tool Executions:")
tool_stats = stats.get("tools", {})
if tool_stats:
if detailed:
for tool_name in sorted(tool_stats.keys()):
stats_dict = tool_stats[tool_name]
print(f"\n 📌 {tool_name}:")
print(f" Total Calls: {stats_dict['call_count']}")
print(f" Total Time: {stats_dict['total_time']:.2f}s")
print(f" Min Time: {stats_dict['min_time']:.2f}s")
print(f" Max Time: {stats_dict['max_time']:.2f}s")
print(f" Mean Time: {stats_dict['mean_time']:.2f}s")
print(f" Median Time: {stats_dict['median_time']:.2f}s")
# Summary
total_tool_calls = sum(s["call_count"] for s in tool_stats.values())
total_tool_time = sum(s["total_time"] for s in tool_stats.values())
print(f"\n 📊 Summary:")
print(f" Total Tool Calls: {total_tool_calls}")
print(f" Total Tool Time: {total_tool_time:.2f}s")
print(f" Unique Tools Used: {len(tool_stats)}")
else:
print(" No tool executions recorded")
# Overall Summary
total_api_time = api_stats.get("total_time", 0.0)
total_tool_time = sum(s["total_time"] for s in tool_stats.values())
print(f"\n📈 Overall Summary:")
print(f" Total API Time: {total_api_time:.2f}s")
print(f" Total Tool Time: {total_tool_time:.2f}s")
print(f" Total Time: {total_api_time + total_tool_time:.2f}s")
print("="*80 + "\n")

28
pyproject.toml Normal file

@@ -0,0 +1,28 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "hermes-agent"
version = "0.1.0"
description = "AI agent with advanced tool-calling and toolsets"
readme = "README.md"
requires-python = ">=3.10"
authors = [{ name = "Hermes Agent" }]
license = { text = "MIT" }
dependencies = [
"firecrawl-py",
"openai",
"fal-client",
"python-dotenv",
"fire"
]
[project.scripts]
hermes-agent = "run_agent:main"
[tool.setuptools]
py-modules = ["run_agent", "model_tools", "toolsets"]
[tool.setuptools.packages.find]
include = ["tools"]

requirements.txt

@@ -1,2 +1,6 @@
tavily-python
openai
firecrawl-py
openai
fal-client
python-dotenv
fire
httpx

File diff suppressed because it is too large Load Diff

12
run_datagen_images.sh Normal file

@@ -0,0 +1,12 @@
python batch_runner.py \
--dataset_file="hermes-agent-imagen-data/hermes_agent_imagen_eval.jsonl" \
--batch_size=10 \
--run_name="imagen_eval_gpt5" \
--distribution="image_gen" \
--model="gpt-5" \
--base_url="https://api.openai.com/v1" \
--api_key="${OPENAI_API_KEY}" \
--num_workers=4 \
--max_turns=5 \
--verbose \
--ephemeral_system_prompt="When generating an image for the user view the image by using the vision_analyze tool to ensure it is what the user wanted. If it isn't feel free to retry a few times. If none are perfect, choose the best option that is the closest match, and explain its imperfections. If the image generation tool fails, try again a few times. If the vision analyze tool fails, provide the image to the user and explain it is your best effort attempt."

12
run_datagen_megascience.sh Executable file

@@ -0,0 +1,12 @@
python batch_runner.py \
--dataset_file="hermes-agent-megascience-data/hermes_agent_megascience_eval.jsonl" \
--batch_size=10 \
--run_name="megascience_eval_gpt5_2" \
--distribution="science" \
--model="gpt-5" \
--base_url="https://api.openai.com/v1" \
--api_key="${OPENAI_API_KEY}" \
--num_workers=5 \
--max_turns=30 \
--verbose \
--ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build off of the results of prior tools you've used results. Always use a tool if it can provide additional context, verify formulas, double check concepts and recent studies and understanding, doing all calculations, etc. You should not be confident in your own reasoning, knowledge, or calculations without using a tool to verify or validate your work."

12
run_datagen_megascience_glm4-6.sh

@@ -0,0 +1,12 @@
python batch_runner.py \
--dataset_file="hermes-agent-megascience-data/hermes_agent_megascience_eval.jsonl" \
--batch_size=10 \
--run_name="megascience_eval_glm4-6-fixedterminal-2" \
--distribution="science" \
--model="z-ai/glm-4.6" \
--base_url="https://openrouter.ai/api/v1" \
--api_key="${OPENROUTER_API_KEY}" \
--num_workers=5 \
--max_turns=30 \
--verbose \
--ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build off of the results of prior tools you've used results. Always use a tool if it can provide additional context, verify formulas, double check concepts and recent studies and understanding, doing all calculations, etc. You should only be confident in your own reasoning, knowledge, or calculations if you've exhaustively used all tools available to you to that can help you verify or validate your work. Always pip install any packages you need to use the python scripts you want to run."

20
safe_print.py Normal file

@@ -0,0 +1,20 @@
#!/usr/bin/env python3
"""Simple safe print that tries rich, falls back to regular print."""
try:
from rich import print as rich_print
RICH_AVAILABLE = True
except ImportError:
RICH_AVAILABLE = False
def safe_print(*args, **kwargs):
"""Try rich.print, fall back to regular print if it fails."""
if RICH_AVAILABLE:
try:
rich_print(*args, **kwargs)
return
except Exception:
pass
# Fallback to regular print
print(*args, **kwargs)


@@ -1,234 +0,0 @@
#!/usr/bin/env python3
"""
Terminal Tool Module
This module provides a single terminal tool using Hecate's VM infrastructure.
It wraps Hecate's functionality to provide a simple interface for executing commands
on Morph VMs with automatic lifecycle management.
Available tool:
- terminal_tool: Execute commands with optional interactive session support
Usage:
from terminal_tool import terminal_tool
# Execute a single command
result = terminal_tool("ls -la")
# Execute in an interactive session
result = terminal_tool("python", input_keys="print('hello')\\nexit()\\n")
"""
import json
import os
from typing import Optional, Dict, Any
from hecate import run_tool_with_lifecycle_management
from morphcloud._llm import ToolCall
# Detailed description for the terminal tool based on Hermes Terminal system prompt
TERMINAL_TOOL_DESCRIPTION = """Execute commands on a secure, persistent Linux VM environment with full interactive application support.
**Environment:**
- Minimal Debian-based OS with internet access
- Automatic VM lifecycle management (creates on-demand, reuses, cleans up)
- **Full state persistence across tool calls**: current directory (pwd), environment variables, activated virtual environments (conda/venv), running processes, and command history all persist between consecutive tool calls
- Session state managed automatically via tmux
**Command Execution:**
- Simple commands: Just provide the 'command' parameter
- Background processes: Set 'background': True for servers/long-running tasks
- Interactive applications automatically detected and handled
**Interactive Applications (TUIs/Pagers/Prompts):**
When commands enter interactive mode (vim, nano, less, git prompts, package managers, etc.), you'll receive screen content with "frozen" status. This is NORMAL - the session is still active and waiting for input.
**To interact with frozen sessions:**
1. Use 'input_keys' parameter with keystrokes to send
2. System auto-detects and uses the active session
3. Session stays active until application exits
**Special Key Syntax for input_keys:**
- `<ESC>`: Escape key
- `<ENTER>`: Enter/Return
- `<CTRL+C>`, `<CTRL+D>`, `<CTRL+Z>`: Control combinations
- `<UP>`, `<DOWN>`, `<LEFT>`, `<RIGHT>`: Arrow keys
- `<TAB>`, `<BACKSPACE>`: Tab and Backspace
- `<F1>` through `<F12>`: Function keys
- `<SHIFT+TAB>`: Shift+Tab
- Uppercase letters for Shift+letter (e.g., 'V' for Shift+V)
- Symbols for Shift+number (e.g., '!' for Shift+1, ':' for Shift+;)
**Examples:**
- Start vim: `{"command": "vim file.txt"}`
- Type in vim: `{"input_keys": "iHello World<ESC>"}`
- Save and quit: `{"input_keys": ":wq<ENTER>"}`
- Navigate in less: `{"input_keys": "j"}`
- Quit less: `{"input_keys": "q"}`
**Best Practices:**
- Run servers/long processes in background with separate tool calls
- Chain multiple foreground commands in single call if needed
- Monitor disk usage for large tasks, clean up to free space
- Test components incrementally with mock inputs
- Install whatever tools needed - full system access provided"""
def terminal_tool(
command: Optional[str] = None,
input_keys: Optional[str] = None,
session_id: Optional[str] = None,
background: bool = False,
idle_threshold: float = 5.0,
timeout: Optional[int] = None
) -> str:
"""
Execute a command on a Morph VM with optional interactive session support.
This tool uses Hecate's VM lifecycle management to automatically create
and manage VMs. VMs are reused within the configured lifetime window
and automatically cleaned up after inactivity.
Args:
command: The command to execute (optional if continuing existing session)
input_keys: Keystrokes to send to interactive session (e.g., "hello\\n")
session_id: ID of existing session to continue (optional)
background: Whether to run the command in the background (default: False)
idle_threshold: Seconds to wait for output before considering session idle (default: 5.0)
timeout: Command timeout in seconds (optional)
Returns:
str: JSON string containing command output, session info, exit code, and any errors
Examples:
# Execute a simple command
>>> result = terminal_tool(command="ls -la /tmp")
# Start an interactive Python session
>>> result = terminal_tool(command="python3")
>>> session_data = json.loads(result)
>>> session_id = session_data["session_id"]
# Send input to the session
>>> result = terminal_tool(input_keys="print('Hello')\\n", session_id=session_id)
# Run a background task
>>> result = terminal_tool(command="sleep 60", background=True)
"""
try:
# Build tool input based on provided parameters
tool_input = {}
if command:
tool_input["command"] = command
if input_keys:
tool_input["input_keys"] = input_keys
if session_id:
tool_input["session_id"] = session_id
if background:
tool_input["background"] = background
if idle_threshold != 5.0:
tool_input["idle_threshold"] = idle_threshold
if timeout is not None:
tool_input["timeout"] = timeout
tool_call = ToolCall(
name="run_command",
input=tool_input
)
# Execute with lifecycle management
result = run_tool_with_lifecycle_management(tool_call)
# Format the result with all possible fields
# Map hecate's "stdout" to "output" for compatibility
formatted_result = {
"output": result.get("stdout", result.get("output", "")),
"screen": result.get("screen", ""),
"session_id": result.get("session_id"),
"exit_code": result.get("returncode", result.get("exit_code", -1)),
"error": result.get("error"),
"status": "active" if result.get("session_id") else "ended"
}
return json.dumps(formatted_result)
except Exception as e:
return json.dumps({
"output": "",
"screen": "",
"session_id": None,
"exit_code": -1,
"error": f"Failed to execute terminal command: {str(e)}",
"status": "error"
})
def check_hecate_requirements() -> bool:
"""
Check if all requirements for terminal tools are met.
Returns:
bool: True if all requirements are met, False otherwise
"""
# Check for required environment variables
required_vars = ["MORPH_API_KEY"]
optional_vars = ["OPENAI_API_KEY"] # Needed for Hecate's LLM features
missing_required = [var for var in required_vars if not os.getenv(var)]
missing_optional = [var for var in optional_vars if not os.getenv(var)]
if missing_required:
print(f"Missing required environment variables: {', '.join(missing_required)}")
return False
if missing_optional:
print(f"Warning: Missing optional environment variables: {', '.join(missing_optional)}")
print(" (Some Hecate features may be limited)")
# Check if Hecate is importable
try:
import hecate
return True
except ImportError:
print("Hecate is not installed. Please install it with: pip install hecate")
return False
# Module-level initialization check
_requirements_met = check_hecate_requirements()
if __name__ == "__main__":
"""
Simple test/demo when run directly
"""
print("Terminal Tool Module")
print("=" * 40)
if not _requirements_met:
print("Requirements not met. Please check the messages above.")
exit(1)
print("All requirements met!")
print("\nAvailable Tool:")
print(" - terminal_tool: Execute commands with optional interactive session support")
print("\nUsage Examples:")
print(" # Execute a command")
print(" result = terminal_tool(command='ls -la')")
print(" ")
print(" # Start an interactive session")
print(" result = terminal_tool(command='python3')")
print(" session_data = json.loads(result)")
print(" session_id = session_data['session_id']")
print(" ")
print(" # Send input to the session")
print(" result = terminal_tool(")
print(" input_keys='print(\"Hello\")\\\\n',")
print(" session_id=session_id")
print(" )")
print(" ")
print(" # Run a background task")
print(" result = terminal_tool(command='sleep 60', background=True)")
print("\nEnvironment Variables:")
print(f" MORPH_API_KEY: {'Set' if os.getenv('MORPH_API_KEY') else 'Not set'}")
print(f" OPENAI_API_KEY: {'Set' if os.getenv('OPENAI_API_KEY') else 'Not set (optional)'}")
print(f" HECATE_VM_LIFETIME_SECONDS: {os.getenv('HECATE_VM_LIFETIME_SECONDS', '300')} (default: 300)")
print(f" HECATE_DEFAULT_SNAPSHOT_ID: {os.getenv('HECATE_DEFAULT_SNAPSHOT_ID', 'snapshot_p5294qxt')} (default: snapshot_p5294qxt)")

23
test_run.sh Executable file

@@ -0,0 +1,23 @@
#!/bin/bash
# Check if a prompt argument was provided
if [ $# -eq 0 ]; then
echo "Error: Please provide a prompt as an argument"
echo "Usage: $0 \"your prompt here\""
exit 1
fi
# Get the prompt from the first argument
PROMPT="$1"
# Set debug mode for web tools
export WEB_TOOLS_DEBUG=true
# Run the agent with the provided prompt
python run_agent.py \
--query "$PROMPT" \
--max_turns 30 \
--model claude-sonnet-4-5-20250929 \
--base_url https://api.anthropic.com/v1/ \
--api_key $ANTHROPIC_API_KEY \
--save_trajectories

0
tests/__init__.py Normal file

129
tests/test_batch_runner.py Normal file

@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""
Test script for batch runner
This script tests the batch runner with a small sample dataset
to verify functionality before running large batches.
"""
import json
import shutil
from pathlib import Path
def create_test_dataset():
"""Create a small test dataset."""
test_file = Path("tests/test_dataset.jsonl")
test_file.parent.mkdir(exist_ok=True)
prompts = [
{"prompt": "What is 2 + 2?"},
{"prompt": "What is the capital of France?"},
{"prompt": "Explain what Python is in one sentence."},
]
with open(test_file, 'w') as f:
for prompt in prompts:
f.write(json.dumps(prompt, ensure_ascii=False) + "\n")
print(f"✅ Created test dataset: {test_file}")
return test_file
def cleanup_test_run(run_name):
"""Clean up test run output."""
output_dir = Path("data") / run_name
if output_dir.exists():
shutil.rmtree(output_dir)
print(f"🗑️ Cleaned up test output: {output_dir}")
def verify_output(run_name):
"""Verify that output files were created correctly."""
output_dir = Path("data") / run_name
# Check directory exists
if not output_dir.exists():
print(f"❌ Output directory not found: {output_dir}")
return False
# Check for checkpoint
checkpoint_file = output_dir / "checkpoint.json"
if not checkpoint_file.exists():
print(f"❌ Checkpoint file not found: {checkpoint_file}")
return False
# Check for statistics
stats_file = output_dir / "statistics.json"
if not stats_file.exists():
print(f"❌ Statistics file not found: {stats_file}")
return False
# Check for batch files
batch_files = list(output_dir.glob("batch_*.jsonl"))
if not batch_files:
print(f"❌ No batch files found in: {output_dir}")
return False
print(f"✅ Output verification passed:")
print(f" - Checkpoint: {checkpoint_file}")
print(f" - Statistics: {stats_file}")
print(f" - Batch files: {len(batch_files)}")
# Load and display statistics
with open(stats_file) as f:
stats = json.load(f)
print(f"\n📊 Statistics Summary:")
print(f" - Total prompts: {stats['total_prompts']}")
print(f" - Total batches: {stats['total_batches']}")
print(f" - Duration: {stats['duration_seconds']}s")
if stats.get('tool_statistics'):
print(f" - Tool calls:")
for tool, tool_stats in stats['tool_statistics'].items():
print(f"{tool}: {tool_stats['count']} calls, {tool_stats['success_rate']:.1f}% success")
return True
def main():
"""Run the test."""
print("🧪 Batch Runner Test")
print("=" * 60)
run_name = "test_run"
# Clean up any previous test run
cleanup_test_run(run_name)
# Create test dataset
test_file = create_test_dataset()
print(f"\n📝 To run the test manually:")
print(f" python batch_runner.py \\")
print(f" --dataset_file={test_file} \\")
print(f" --batch_size=2 \\")
print(f" --run_name={run_name} \\")
print(f" --distribution=minimal \\")
print(f" --num_workers=2")
print(f"\n💡 Or test with different distributions:")
print(f" python batch_runner.py --list_distributions")
print(f"\n🔍 After running, you can verify output with:")
print(f" python tests/test_batch_runner.py --verify")
# Note: We don't actually run the batch runner here to avoid API calls during testing
# Users should run it manually with their API keys configured
if __name__ == "__main__":
import sys
if "--verify" in sys.argv:
run_name = "test_run"
verify_output(run_name)
else:
main()

424
tests/test_checkpoint_resumption.py

@@ -0,0 +1,424 @@
#!/usr/bin/env python3
"""
Test script to verify checkpoint behavior in batch_runner.py
This script simulates batch processing with intentional failures to test:
1. Whether checkpoints are saved incrementally during processing
2. Whether resume functionality works correctly after interruption
3. Whether data integrity is maintained across checkpoint cycles
Usage:
# Test current implementation
python tests/test_checkpoint_resumption.py --test_current
# Test after fix is applied
python tests/test_checkpoint_resumption.py --test_fixed
# Run full comparison
python tests/test_checkpoint_resumption.py --compare
"""
import json
import os
import shutil
import sys
import time
import signal
from pathlib import Path
from typing import List, Dict, Any
import traceback
# Add parent directory to path to import batch_runner
sys.path.insert(0, str(Path(__file__).parent.parent))
def create_test_dataset(num_prompts: int = 20) -> Path:
"""Create a small test dataset for checkpoint testing."""
test_data_dir = Path("tests/test_data")
test_data_dir.mkdir(parents=True, exist_ok=True)
dataset_file = test_data_dir / "checkpoint_test_dataset.jsonl"
with open(dataset_file, 'w', encoding='utf-8') as f:
for i in range(num_prompts):
entry = {
"prompt": f"Test prompt {i}: What is 2+2? Just answer briefly.",
"test_id": i
}
f.write(json.dumps(entry, ensure_ascii=False) + "\n")
print(f"✅ Created test dataset: {dataset_file} ({num_prompts} prompts)")
return dataset_file
def monitor_checkpoint_during_run(checkpoint_file: Path, duration: int = 30) -> List[Dict[str, Any]]:
"""
Monitor checkpoint file during a batch run to see when it gets updated.
Args:
checkpoint_file: Path to checkpoint file to monitor
duration: How long to monitor (seconds)
Returns:
List of checkpoint snapshots with timestamps
"""
snapshots = []
start_time = time.time()
last_mtime = None
print(f"\n🔍 Monitoring checkpoint file: {checkpoint_file}")
print(f" Duration: {duration}s")
print("-" * 70)
while time.time() - start_time < duration:
if checkpoint_file.exists():
current_mtime = checkpoint_file.stat().st_mtime
# Check if file was modified
if last_mtime is None or current_mtime != last_mtime:
elapsed = time.time() - start_time
try:
with open(checkpoint_file, 'r') as f:
checkpoint_data = json.load(f)
snapshot = {
"elapsed_seconds": round(elapsed, 2),
"completed_count": len(checkpoint_data.get("completed_prompts", [])),
"completed_prompts": checkpoint_data.get("completed_prompts", [])[:5], # First 5 for display
"timestamp": checkpoint_data.get("last_updated")
}
snapshots.append(snapshot)
print(f"[{elapsed:6.2f}s] Checkpoint updated: {snapshot['completed_count']} prompts completed")
except Exception as e:
print(f"[{elapsed:6.2f}s] Error reading checkpoint: {e}")
last_mtime = current_mtime
else:
if len(snapshots) == 0:
print(f"[{time.time() - start_time:6.2f}s] Checkpoint file not yet created...")
time.sleep(0.5) # Check every 0.5 seconds
return snapshots
def test_current_implementation():
"""Test the current checkpoint implementation."""
print("\n" + "=" * 70)
print("TEST 1: Current Implementation - Checkpoint Timing")
print("=" * 70)
print("\n📝 Testing whether checkpoints are saved incrementally during run...")
# Setup
dataset_file = create_test_dataset(num_prompts=12)
run_name = "checkpoint_test_current"
output_dir = Path("data") / run_name
# Clean up any existing test data
if output_dir.exists():
shutil.rmtree(output_dir)
# Import here to avoid issues if module changes
from batch_runner import BatchRunner
checkpoint_file = output_dir / "checkpoint.json"
# Start monitoring in a separate process would be ideal, but for simplicity
# we'll just check before and after
print(f"\n▶️ Starting batch run...")
print(f" Dataset: {dataset_file}")
print(f" Batch size: 3 (4 batches total)")
print(f" Workers: 2")
print(f" Expected behavior: If incremental, checkpoint should update during run")
start_time = time.time()
try:
runner = BatchRunner(
dataset_file=str(dataset_file),
batch_size=3,
run_name=run_name,
distribution="default",
max_iterations=3, # Keep it short
model="claude-opus-4-20250514",
num_workers=2,
verbose=False
)
# Run with monitoring
import threading
snapshots = []
def monitor():
nonlocal snapshots
snapshots = monitor_checkpoint_during_run(checkpoint_file, duration=60)
monitor_thread = threading.Thread(target=monitor, daemon=True)
monitor_thread.start()
runner.run(resume=False)
monitor_thread.join(timeout=2)
except Exception as e:
print(f"❌ Error during run: {e}")
traceback.print_exc()
return False
elapsed = time.time() - start_time
# Analyze results
print("\n" + "=" * 70)
print("📊 TEST RESULTS")
print("=" * 70)
print(f"Total run time: {elapsed:.2f}s")
print(f"Checkpoint updates observed: {len(snapshots)}")
if len(snapshots) == 0:
print("\n❌ ISSUE: No checkpoint updates observed during run")
print(" This suggests checkpoints are only saved at the end")
return False
elif len(snapshots) == 1:
print("\n⚠️ WARNING: Only 1 checkpoint update (likely at the end)")
print(" This confirms the bug - no incremental checkpointing")
return False
else:
print(f"\n✅ GOOD: Multiple checkpoint updates ({len(snapshots)}) observed")
print(" Checkpointing appears to be incremental")
# Show timeline
print("\n📈 Checkpoint Timeline:")
for i, snapshot in enumerate(snapshots, 1):
print(f" {i}. [{snapshot['elapsed_seconds']:6.2f}s] "
f"{snapshot['completed_count']} prompts completed")
return True
def test_interruption_and_resume():
"""Test that resume actually works after interruption."""
print("\n" + "=" * 70)
print("TEST 2: Interruption and Resume")
print("=" * 70)
print("\n📝 Testing whether resume works after manual interruption...")
# Setup
dataset_file = create_test_dataset(num_prompts=15)
run_name = "checkpoint_test_resume"
output_dir = Path("data") / run_name
# Clean up any existing test data
if output_dir.exists():
shutil.rmtree(output_dir)
from batch_runner import BatchRunner
checkpoint_file = output_dir / "checkpoint.json"
print(f"\n▶️ Starting first run (will process 5 prompts, then simulate interruption)...")
try:
# Create a modified dataset with only first 5 prompts for initial run
temp_dataset = Path("tests/test_data/checkpoint_test_resume_partial.jsonl")
with open(dataset_file, 'r') as f:
lines = f.readlines()[:5]
with open(temp_dataset, 'w') as f:
f.writelines(lines)
runner = BatchRunner(
dataset_file=str(temp_dataset),
batch_size=2,
run_name=run_name,
distribution="default",
max_iterations=3,
model="claude-opus-4-20250514",
num_workers=1,
verbose=False
)
runner.run(resume=False)
# Check checkpoint after first run
if not checkpoint_file.exists():
print("❌ ERROR: Checkpoint file not created after first run")
return False
with open(checkpoint_file, 'r') as f:
checkpoint_data = json.load(f)
initial_completed = len(checkpoint_data.get("completed_prompts", []))
print(f"✅ First run completed: {initial_completed} prompts saved to checkpoint")
# Now try to resume with full dataset
print(f"\n▶️ Starting resume run with full dataset (15 prompts)...")
runner2 = BatchRunner(
dataset_file=str(dataset_file),
batch_size=2,
run_name=run_name,
distribution="default",
max_iterations=3,
model="claude-opus-4-20250514",
num_workers=1,
verbose=False
)
runner2.run(resume=True)
# Check final checkpoint
with open(checkpoint_file, 'r') as f:
final_checkpoint = json.load(f)
final_completed = len(final_checkpoint.get("completed_prompts", []))
print("\n" + "=" * 70)
print("📊 TEST RESULTS")
print("=" * 70)
print(f"Initial completed: {initial_completed}")
print(f"Final completed: {final_completed}")
print(f"Expected: 15")
if final_completed == 15:
print("\n✅ PASS: Resume successfully completed all prompts")
return True
else:
print(f"\n❌ FAIL: Expected 15 completed, got {final_completed}")
return False
except Exception as e:
print(f"❌ Error during test: {e}")
traceback.print_exc()
return False
def test_simulated_crash():
"""Test behavior when process crashes mid-execution."""
print("\n" + "=" * 70)
print("TEST 3: Simulated Crash During Execution")
print("=" * 70)
print("\n📝 This test would require running in a subprocess and killing it...")
print(" Skipping for safety - manual testing recommended")
return None
def print_test_plan():
"""Print the detailed test and fix plan."""
print("\n" + "=" * 70)
print("CHECKPOINT FIX - DETAILED PLAN")
print("=" * 70)
print("""
📋 PROBLEM SUMMARY
------------------
Current implementation uses pool.map() which blocks until ALL batches complete.
Checkpoint is only saved after all batches finish (line 558-559).
If process crashes during batch processing:
- All progress is lost
- Resume does nothing (no incremental checkpoint was saved)
📋 PROPOSED SOLUTION
--------------------
Replace pool.map() with pool.imap_unordered() to get results as they complete.
Save checkpoint after EACH batch completes using a multiprocessing Lock.
Key changes:
1. Use Manager().Lock() for thread-safe checkpoint writes
2. Replace pool.map() with pool.imap_unordered()
3. Update checkpoint after each batch result
4. Maintain backward compatibility with existing checkpoints
📋 IMPLEMENTATION STEPS
-----------------------
1. Add Manager and Lock initialization before Pool creation
2. Pass shared checkpoint data and lock to workers (via Manager)
3. Replace pool.map() with pool.imap_unordered()
4. In result loop: save checkpoint after each batch
5. Add error handling for checkpoint write failures
📋 RISKS & MITIGATIONS
----------------------
Risk: Checkpoint file corruption if two processes write simultaneously
→ Mitigation: Use multiprocessing.Lock() for exclusive access
Risk: Performance impact from frequent checkpoint writes
→ Mitigation: Checkpoint writes are fast (small JSON), negligible impact
Risk: Breaking existing runs that are already checkpointed
→ Mitigation: Maintain checkpoint format, only change timing
Risk: Bugs in multiprocessing lock/manager code
→ Mitigation: Thorough testing with this test script
📋 TESTING STRATEGY
-------------------
1. Run test_current_implementation() - Confirm bug exists
2. Apply fix to batch_runner.py
3. Run test_current_implementation() again - Should see incremental updates
4. Run test_interruption_and_resume() - Verify resume works
5. Manual test: Start run, kill process mid-batch, resume
📋 ROLLBACK PLAN
----------------
If issues arise:
1. Git revert the changes
2. Original code is working (just missing incremental checkpoint)
3. No data corruption risk - checkpoints are write-only
""")
def main(
test_current: bool = False,
test_resume: bool = False,
test_crash: bool = False,
compare: bool = False,
show_plan: bool = False
):
"""
Run checkpoint behavior tests.
Args:
test_current: Test current implementation checkpoint timing
test_resume: Test interruption and resume functionality
test_crash: Test simulated crash scenario (manual)
compare: Run all tests and compare
show_plan: Show detailed fix plan
"""
if show_plan or (not any([test_current, test_resume, test_crash, compare])):
print_test_plan()
return
results = {}
if test_current or compare:
results['current'] = test_current_implementation()
if test_resume or compare:
results['resume'] = test_interruption_and_resume()
if test_crash or compare:
results['crash'] = test_simulated_crash()
# Summary
if results:
print("\n" + "=" * 70)
print("OVERALL TEST SUMMARY")
print("=" * 70)
for test_name, result in results.items():
if result is None:
status = "⏭️ SKIPPED"
elif result:
status = "✅ PASS"
else:
status = "❌ FAIL"
print(f"{status} - {test_name}")
if __name__ == "__main__":
import fire
fire.Fire(main)

176
tests/test_nous_api_limits.py Executable file

@@ -0,0 +1,176 @@
#!/usr/bin/env python3
"""
Test script to diagnose Nous API 400 errors with gemini-2.5-flash model.
This tests various content lengths and parameters to identify what causes failures.
"""
import asyncio
import os
from openai import AsyncOpenAI
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Initialize the Nous API client
nous_client = AsyncOpenAI(
api_key=os.getenv("NOUS_API_KEY"),
base_url="https://inference-api.nousresearch.com/v1"
)
MODEL = "gemini-2.5-flash"
async def test_api_call(test_name: str, content_length: int, **kwargs):
"""Test an API call with specific parameters."""
print(f"\n{'='*60}")
print(f"Test: {test_name}")
print(f"Content length: {content_length:,} characters")
print(f"Additional params: {kwargs}")
print(f"{'='*60}")
# Generate test content
content = "A" * content_length
system_prompt = """You are an expert content analyst. Your job is to process web content and create a comprehensive yet concise summary that preserves all important information while dramatically reducing bulk.
Create a well-structured markdown summary that includes:
1. Key excerpts (quotes, code snippets, important facts) in their original format
2. Comprehensive summary of all other important information
3. Proper markdown formatting with headers, bullets, and emphasis
Your goal is to preserve ALL important information while reducing length. Never lose key facts, figures, insights, or actionable information. Make it scannable and well-organized."""
user_prompt = f"""Please process this web content and create a comprehensive markdown summary:
CONTENT TO PROCESS:
{content}
Create a markdown summary that captures all key information in a well-organized, scannable format. Include important quotes and code snippets in their original formatting. Focus on actionable information, specific details, and unique insights."""
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
**kwargs
)
result = response.choices[0].message.content
print(f"✅ SUCCESS")
print(f" Response length: {len(result)} characters")
print(f" Model used: {response.model}")
print(f" Usage: {response.usage}")
return True
except Exception as e:
print(f"❌ FAILED: {str(e)}")
return False
async def main():
"""Run all tests."""
print("Testing Nous API with gemini-2.5-flash model")
print(f"API Key present: {'Yes' if os.getenv('NOUS_API_KEY') else 'No'}")
results = {}
# Test 1: Small content (should always work)
results['small'] = await test_api_call(
"Small content (5,000 chars)",
5000,
temperature=0.1,
max_tokens=4000
)
await asyncio.sleep(1)
# Test 2: Medium content (around what was failing)
results['medium'] = await test_api_call(
"Medium content (20,000 chars)",
20000,
temperature=0.1,
max_tokens=4000
)
await asyncio.sleep(1)
# Test 3: Large content (79,625 chars like the error)
results['large'] = await test_api_call(
"Large content (79,625 chars)",
79625,
temperature=0.1,
max_tokens=4000
)
await asyncio.sleep(1)
# Test 4: Very large content (100k chars)
results['very_large'] = await test_api_call(
"Very large content (100,000 chars)",
100000,
temperature=0.1,
max_tokens=4000
)
await asyncio.sleep(1)
# Test 5: Same as working case but different max_tokens
results['diff_max_tokens'] = await test_api_call(
"Medium content with higher max_tokens",
20000,
temperature=0.1,
max_tokens=8000
)
await asyncio.sleep(1)
# Test 6: No max_tokens specified
results['no_max_tokens'] = await test_api_call(
"Medium content without max_tokens",
20000,
temperature=0.1
)
await asyncio.sleep(1)
# Test 7: With actual web content (mixed characters)
mixed_content = """
This is a test of web content with various characters:
- Unicode: 你好世界 🌍
- Special chars: <>&"'
- Numbers: 123456789
- Markdown: **bold** _italic_ `code`
- URLs: https://example.com
""" * 1000 # Repeat to make it ~79k chars
print(f"\n{'='*60}")
print(f"Test: Mixed content (real-world scenario)")
print(f"Content length: {len(mixed_content):,} characters")
print(f"{'='*60}")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "Summarize this content."},
{"role": "user", "content": mixed_content}
],
temperature=0.1,
max_tokens=4000
)
print(f"✅ SUCCESS")
results['mixed_content'] = True
except Exception as e:
print(f"❌ FAILED: {str(e)}")
results['mixed_content'] = False
# Summary
print(f"\n{'='*60}")
print("SUMMARY OF RESULTS:")
print(f"{'='*60}")
for test, passed in results.items():
status = "✅ PASS" if passed else "❌ FAIL"
print(f"{test:20s}: {status}")
passed = sum(results.values())
total = len(results)
print(f"\nTotal: {passed}/{total} tests passed")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,131 @@
#!/usr/bin/env python3
"""
Test to understand the pattern of failures - it's not about content length!
"""
import asyncio
import os
from openai import AsyncOpenAI
from dotenv import load_dotenv
load_dotenv()
nous_client = AsyncOpenAI(
api_key=os.getenv("NOUS_API_KEY"),
base_url="https://inference-api.nousresearch.com/v1"
)
MODEL = "gemini-2.5-flash"
async def quick_test(description: str, content: str, **kwargs):
"""Quick API test."""
print(f"\n{description} ({len(content):,} chars)...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "Summarize this."},
{"role": "user", "content": content}
],
**kwargs
)
print(f"✅ SUCCESS")
return True
except Exception as e:
print(f"❌ FAILED: {str(e)[:80]}")
return False
async def main():
print("Testing different content types and parameters...")
# Theory 1: Repeated characters trigger validation
print("\n" + "="*60)
print("THEORY 1: Repeated characters")
print("="*60)
await quick_test("Repeated 'A's (5k)", "A" * 5000, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("Repeated 'A's (79k)", "A" * 79625, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("Varied text (5k)", "Test content. " * 400, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("Varied text (79k)", "Test content with variety. " * 3000, temperature=0.1, max_tokens=4000)
# Theory 2: max_tokens parameter
print("\n" + "="*60)
print("THEORY 2: max_tokens parameter")
print("="*60)
content = "Test " * 4000 # 20k chars
await quick_test("max_tokens=4000", content, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("max_tokens=8000", content, temperature=0.1, max_tokens=8000)
await asyncio.sleep(0.5)
await quick_test("max_tokens=2000", content, temperature=0.1, max_tokens=2000)
await asyncio.sleep(0.5)
await quick_test("No max_tokens", content, temperature=0.1)
# Theory 3: Temperature parameter
print("\n" + "="*60)
print("THEORY 3: Temperature parameter")
print("="*60)
content = "Test " * 4000
await quick_test("temperature=0.1", content, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("temperature=0.0", content, temperature=0.0, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("temperature=0.5", content, temperature=0.5, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("No temperature", content, max_tokens=4000)
# Theory 4: System prompt impact
print("\n" + "="*60)
print("THEORY 4: System prompt length")
print("="*60)
short_system = "Summarize this."
long_system = """You are an expert content analyst. Your job is to process web content and create a comprehensive yet concise summary that preserves all important information while dramatically reducing bulk.
Create a well-structured markdown summary that includes:
1. Key excerpts (quotes, code snippets, important facts) in their original format
2. Comprehensive summary of all other important information
3. Proper markdown formatting with headers, bullets, and emphasis
Your goal is to preserve ALL important information while reducing length."""
content = "A" * 5000
print(f"\nShort system prompt...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": short_system},
{"role": "user", "content": content}
],
temperature=0.1,
max_tokens=4000
)
print(f"✅ SUCCESS")
except Exception as e:
print(f"❌ FAILED")
await asyncio.sleep(0.5)
print(f"Long system prompt...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": long_system},
{"role": "user", "content": content}
],
temperature=0.1,
max_tokens=4000
)
print(f"✅ SUCCESS")
except Exception as e:
print(f"❌ FAILED")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,109 @@
#!/usr/bin/env python3
"""
Test to confirm: temperature < 0.3 causes failures on Nous API
"""
import asyncio
import os
from openai import AsyncOpenAI
from dotenv import load_dotenv
load_dotenv()
nous_client = AsyncOpenAI(
api_key=os.getenv("NOUS_API_KEY"),
base_url="https://inference-api.nousresearch.com/v1"
)
MODEL = "gemini-2.5-flash"
async def test_temp(temp_value):
"""Test a specific temperature value."""
content = "Test content. " * 1000 # 14k chars
print(f"Testing temperature={temp_value}...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "Summarize this content."},
{"role": "user", "content": content}
],
temperature=temp_value,
max_tokens=4000
)
print(f"✅ SUCCESS")
return True
except Exception as e:
print(f"❌ FAILED")
return False
async def main():
print("Testing temperature threshold for Nous API...")
print("="*60)
temps = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0]
for temp in temps:
await test_temp(temp)
await asyncio.sleep(0.5)
print("="*60)
print("\nNow testing with ACTUAL web_tools.py content and parameters:")
print("="*60)
# Simulate the actual web_tools.py call
system_prompt = """You are an expert content analyst. Your job is to process web content and create a comprehensive yet concise summary that preserves all important information while dramatically reducing bulk.
Create a well-structured markdown summary that includes:
1. Key excerpts (quotes, code snippets, important facts) in their original format
2. Comprehensive summary of all other important information
3. Proper markdown formatting with headers, bullets, and emphasis
Your goal is to preserve ALL important information while reducing length. Never lose key facts, figures, insights, or actionable information. Make it scannable and well-organized."""
content = "Sample web page content. " * 3000 # ~75k chars like the real failures
user_prompt = f"""Please process this web content and create a comprehensive markdown summary:
CONTENT TO PROCESS:
{content}
Create a markdown summary that captures all key information in a well-organized, scannable format. Include important quotes and code snippets in their original formatting. Focus on actionable information, specific details, and unique insights."""
print(f"\nActual web_tools call (temp=0.1, {len(content):,} chars)...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.1,
max_tokens=4000
)
print(f"✅ SUCCESS")
except Exception:
print(f"❌ FAILED")
await asyncio.sleep(0.5)
print(f"Same call but with temp=0.3...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.3,
max_tokens=4000
)
print(f"✅ SUCCESS")
except Exception:
print(f"❌ FAILED")
if __name__ == "__main__":
asyncio.run(main())

620
tests/test_web_tools.py Normal file
View File

@@ -0,0 +1,620 @@
#!/usr/bin/env python3
"""
Comprehensive Test Suite for Web Tools Module
This script tests all web tools functionality to ensure they work correctly.
Run this after any updates to the web_tools.py module or Firecrawl library.
Usage:
python test_web_tools.py # Run all tests
python test_web_tools.py --no-llm # Skip LLM processing tests
python test_web_tools.py --verbose # Show detailed output
Requirements:
- FIRECRAWL_API_KEY environment variable must be set
- NOUS_API_KEY environment variable (optional, for LLM tests)
"""
import json
import asyncio
import sys
import os
import argparse
from datetime import datetime
from typing import List, Dict, Any
# Import the web tools to test (updated path after moving tools/)
from tools.web_tools import (
web_search_tool,
web_extract_tool,
web_crawl_tool,
check_firecrawl_api_key,
check_nous_api_key,
get_debug_session_info
)
class Colors:
"""ANSI color codes for terminal output"""
HEADER = '\033[95m'
BLUE = '\033[94m'
CYAN = '\033[96m'
GREEN = '\033[92m'
WARNING = '\033[93m'
FAIL = '\033[91m'
ENDC = '\033[0m'
BOLD = '\033[1m'
UNDERLINE = '\033[4m'
def print_header(text: str):
"""Print a formatted header"""
print(f"\n{Colors.HEADER}{Colors.BOLD}{'='*60}{Colors.ENDC}")
print(f"{Colors.HEADER}{Colors.BOLD}{text}{Colors.ENDC}")
print(f"{Colors.HEADER}{Colors.BOLD}{'='*60}{Colors.ENDC}")
def print_section(text: str):
"""Print a formatted section header"""
print(f"\n{Colors.CYAN}{Colors.BOLD}📌 {text}{Colors.ENDC}")
print(f"{Colors.CYAN}{'-'*50}{Colors.ENDC}")
def print_success(text: str):
"""Print success message"""
print(f"{Colors.GREEN}{text}{Colors.ENDC}")
def print_error(text: str):
"""Print error message"""
print(f"{Colors.FAIL}{text}{Colors.ENDC}")
def print_warning(text: str):
"""Print warning message"""
print(f"{Colors.WARNING}⚠️ {text}{Colors.ENDC}")
def print_info(text: str, indent: int = 0):
"""Print info message"""
indent_str = " " * indent
print(f"{indent_str}{Colors.BLUE} {text}{Colors.ENDC}")
class WebToolsTester:
"""Test suite for web tools"""
def __init__(self, verbose: bool = False, test_llm: bool = True):
self.verbose = verbose
self.test_llm = test_llm
self.test_results = {
"passed": [],
"failed": [],
"skipped": []
}
self.start_time = None
self.end_time = None
def log_result(self, test_name: str, status: str, details: str = ""):
"""Log test result"""
result = {
"test": test_name,
"status": status,
"details": details,
"timestamp": datetime.now().isoformat()
}
if status == "passed":
self.test_results["passed"].append(result)
print_success(f"{test_name}: {details}" if details else test_name)
elif status == "failed":
self.test_results["failed"].append(result)
print_error(f"{test_name}: {details}" if details else test_name)
elif status == "skipped":
self.test_results["skipped"].append(result)
print_warning(f"{test_name} skipped: {details}" if details else f"{test_name} skipped")
def test_environment(self) -> bool:
"""Test environment setup and API keys"""
print_section("Environment Check")
# Check Firecrawl API key
if not check_firecrawl_api_key():
self.log_result("Firecrawl API Key", "failed", "FIRECRAWL_API_KEY not set")
return False
else:
self.log_result("Firecrawl API Key", "passed", "Found")
# Check Nous API key (optional)
if not check_nous_api_key():
self.log_result("Nous API Key", "skipped", "NOUS_API_KEY not set (LLM tests will be skipped)")
self.test_llm = False
else:
self.log_result("Nous API Key", "passed", "Found")
# Check debug mode
debug_info = get_debug_session_info()
if debug_info["enabled"]:
print_info(f"Debug mode enabled - Session: {debug_info['session_id']}")
print_info(f"Debug log: {debug_info['log_path']}")
return True
def test_web_search(self) -> List[str]:
"""Test web search functionality"""
print_section("Test 1: Web Search")
test_queries = [
("Python web scraping tutorial", 5),
("Firecrawl API documentation", 3),
("inflammatory arthritis symptoms treatment", 8) # Test medical query from your example
]
extracted_urls = []
for query, limit in test_queries:
try:
print(f"\n Testing search: '{query}' (limit={limit})")
if self.verbose:
print(f" Calling web_search_tool(query='{query}', limit={limit})")
# Perform search
result = web_search_tool(query, limit)
# Parse result
try:
data = json.loads(result)
except json.JSONDecodeError as e:
self.log_result(f"Search: {query[:30]}...", "failed", f"Invalid JSON: {e}")
if self.verbose:
print(f" Raw response (first 500 chars): {result[:500]}...")
continue
if "error" in data:
self.log_result(f"Search: {query[:30]}...", "failed", f"API error: {data['error']}")
continue
# Check structure
if "success" not in data or "data" not in data:
self.log_result(f"Search: {query[:30]}...", "failed", "Missing success or data fields")
if self.verbose:
print(f" Response keys: {list(data.keys())}")
continue
web_results = data.get("data", {}).get("web", [])
if not web_results:
self.log_result(f"Search: {query[:30]}...", "failed", "Empty web results array")
if self.verbose:
print(f" data.web content: {data.get('data', {}).get('web')}")
continue
# Validate each result
valid_results = 0
missing_fields = []
for i, result in enumerate(web_results):
required_fields = ["url", "title", "description"]
has_all_fields = all(key in result for key in required_fields)
if has_all_fields:
valid_results += 1
# Collect URLs for extraction test
if len(extracted_urls) < 3:
extracted_urls.append(result["url"])
if self.verbose:
print(f" Result {i+1}: ✓ {result['title'][:50]}...")
print(f" URL: {result['url'][:60]}...")
else:
missing = [f for f in required_fields if f not in result]
missing_fields.append(f"Result {i+1} missing: {missing}")
if self.verbose:
print(f" Result {i+1}: ✗ Missing fields: {missing}")
# Log results
if valid_results == len(web_results):
self.log_result(
f"Search: {query[:30]}...",
"passed",
f"All {valid_results} results valid"
)
else:
self.log_result(
f"Search: {query[:30]}...",
"failed",
f"Only {valid_results}/{len(web_results)} valid. Issues: {'; '.join(missing_fields[:3])}"
)
except Exception as e:
self.log_result(f"Search: {query[:30]}...", "failed", f"Exception: {type(e).__name__}: {str(e)}")
if self.verbose:
import traceback
print(f" Traceback: {traceback.format_exc()}")
if self.verbose and extracted_urls:
print(f"\n URLs collected for extraction test: {len(extracted_urls)}")
for url in extracted_urls:
print(f" - {url}")
return extracted_urls
async def test_web_extract(self, urls: List[str] = None):
"""Test web content extraction"""
print_section("Test 2: Web Extract (without LLM)")
# Use provided URLs or defaults
if not urls:
urls = [
"https://docs.firecrawl.dev/introduction",
"https://www.python.org/about/"
]
print(f" Using default URLs for testing")
else:
print(f" Using {len(urls)} URLs from search results")
# Test extraction
if urls:
try:
test_urls = urls[:2] # Test with max 2 URLs
print(f"\n Extracting content from {len(test_urls)} URL(s)...")
for url in test_urls:
print(f" - {url}")
if self.verbose:
print(f" Calling web_extract_tool(urls={test_urls}, format='markdown', use_llm_processing=False)")
result = await web_extract_tool(
test_urls,
format="markdown",
use_llm_processing=False
)
# Parse result
try:
data = json.loads(result)
except json.JSONDecodeError as e:
self.log_result("Extract (no LLM)", "failed", f"Invalid JSON: {e}")
if self.verbose:
print(f" Raw response (first 500 chars): {result[:500]}...")
return
if "error" in data:
self.log_result("Extract (no LLM)", "failed", f"API error: {data['error']}")
return
results = data.get("results", [])
if not results:
self.log_result("Extract (no LLM)", "failed", "No results in response")
if self.verbose:
print(f" Response keys: {list(data.keys())}")
return
# Validate each result
valid_results = 0
failed_results = 0
total_content_length = 0
extraction_details = []
for i, result in enumerate(results):
title = result.get("title", "No title")
content = result.get("content", "")
error = result.get("error")
if error:
failed_results += 1
extraction_details.append(f"Page {i+1}: ERROR - {error}")
if self.verbose:
print(f" Page {i+1}: ✗ Error - {error}")
elif content:
content_len = len(content)
total_content_length += content_len
valid_results += 1
extraction_details.append(f"Page {i+1}: {title[:40]}... ({content_len} chars)")
if self.verbose:
print(f" Page {i+1}: ✓ {title[:50]}... - {content_len} characters")
print(f" First 100 chars: {content[:100]}...")
else:
extraction_details.append(f"Page {i+1}: {title[:40]}... (EMPTY)")
if self.verbose:
print(f" Page {i+1}: ⚠ {title[:50]}... - Empty content")
# Log results
if valid_results > 0:
self.log_result(
"Extract (no LLM)",
"passed",
f"{valid_results}/{len(results)} pages extracted, {total_content_length} total chars"
)
else:
self.log_result(
"Extract (no LLM)",
"failed",
f"No valid content. {failed_results} errors, {len(results) - failed_results} empty"
)
if self.verbose:
print(f"\n Extraction details:")
for detail in extraction_details:
print(f" {detail}")
except Exception as e:
self.log_result("Extract (no LLM)", "failed", f"Exception: {type(e).__name__}: {str(e)}")
if self.verbose:
import traceback
print(f" Traceback: {traceback.format_exc()}")
async def test_web_extract_with_llm(self, urls: List[str] = None):
"""Test web extraction with LLM processing"""
print_section("Test 3: Web Extract (with Gemini LLM)")
if not self.test_llm:
self.log_result("Extract (with LLM)", "skipped", "LLM testing disabled")
return
# Use a URL likely to have substantial content
test_url = urls[0] if urls else "https://docs.firecrawl.dev/features/scrape"
try:
print(f"\n Extracting and processing: {test_url}")
result = await web_extract_tool(
[test_url],
format="markdown",
use_llm_processing=True,
min_length=1000 # Lower threshold for testing
)
data = json.loads(result)
if "error" in data:
self.log_result("Extract (with LLM)", "failed", data["error"])
return
results = data.get("results", [])
if not results:
self.log_result("Extract (with LLM)", "failed", "No results returned")
return
result = results[0]
content = result.get("content", "")
if content:
content_len = len(content)
# Check that the LLM returned non-empty processed content
if content_len > 0:
self.log_result(
"Extract (with LLM)",
"passed",
f"Content processed: {content_len} chars"
)
if self.verbose:
print(f"\n First 300 chars of processed content:")
print(f" {content[:300]}...")
else:
self.log_result("Extract (with LLM)", "failed", "No content after processing")
else:
self.log_result("Extract (with LLM)", "failed", "No content field in result")
except json.JSONDecodeError as e:
self.log_result("Extract (with LLM)", "failed", f"Invalid JSON: {e}")
except Exception as e:
self.log_result("Extract (with LLM)", "failed", str(e))
async def test_web_crawl(self):
"""Test web crawling functionality"""
print_section("Test 4: Web Crawl")
test_sites = [
("https://docs.firecrawl.dev", None, 2), # Test docs site
("https://firecrawl.dev", None, 3), # Test main site
]
for url, instructions, expected_min_pages in test_sites:
try:
print(f"\n Testing crawl of: {url}")
if instructions:
print(f" Instructions: {instructions}")
else:
print(f" No instructions (general crawl)")
print(f" Expected minimum pages: {expected_min_pages}")
# Show what's being called
if self.verbose:
print(f" Calling web_crawl_tool(url='{url}', instructions={instructions}, use_llm_processing=False)")
result = await web_crawl_tool(
url,
instructions=instructions,
use_llm_processing=False # Disable LLM for faster testing
)
# Check if result is valid JSON
try:
data = json.loads(result)
except json.JSONDecodeError as e:
self.log_result(f"Crawl: {url}", "failed", f"Invalid JSON response: {e}")
if self.verbose:
print(f" Raw response (first 500 chars): {result[:500]}...")
continue
# Check for errors
if "error" in data:
self.log_result(f"Crawl: {url}", "failed", f"API error: {data['error']}")
continue
# Get results
results = data.get("results", [])
if not results:
self.log_result(f"Crawl: {url}", "failed", "No pages in results array")
if self.verbose:
print(f" Full response: {json.dumps(data, indent=2)[:1000]}...")
continue
# Analyze pages
valid_pages = 0
empty_pages = 0
total_content = 0
page_details = []
for i, page in enumerate(results):
content = page.get("content", "")
title = page.get("title", "Untitled")
error = page.get("error")
if error:
page_details.append(f"Page {i+1}: ERROR - {error}")
elif content:
valid_pages += 1
content_len = len(content)
total_content += content_len
page_details.append(f"Page {i+1}: {title[:40]}... ({content_len} chars)")
else:
empty_pages += 1
page_details.append(f"Page {i+1}: {title[:40]}... (EMPTY)")
# Show detailed results if verbose
if self.verbose:
print(f"\n Crawl Results:")
print(f" Total pages returned: {len(results)}")
print(f" Valid pages (with content): {valid_pages}")
print(f" Empty pages: {empty_pages}")
print(f" Total content size: {total_content} characters")
print(f"\n Page Details:")
for detail in page_details[:10]: # Show first 10 pages
print(f" - {detail}")
if len(page_details) > 10:
print(f" ... and {len(page_details) - 10} more pages")
# Determine pass/fail
if valid_pages >= expected_min_pages:
self.log_result(
f"Crawl: {url}",
"passed",
f"{valid_pages}/{len(results)} valid pages, {total_content} chars total"
)
else:
self.log_result(
f"Crawl: {url}",
"failed",
f"Only {valid_pages} valid pages (expected >= {expected_min_pages}), {empty_pages} empty, {len(results)} total"
)
except Exception as e:
self.log_result(f"Crawl: {url}", "failed", f"Exception: {type(e).__name__}: {str(e)}")
if self.verbose:
import traceback
print(f" Traceback:")
print(" " + "\n ".join(traceback.format_exc().split("\n")))
async def run_all_tests(self):
"""Run all tests"""
self.start_time = datetime.now()
print_header("WEB TOOLS TEST SUITE")
print(f"Started at: {self.start_time.strftime('%Y-%m-%d %H:%M:%S')}")
# Test environment
if not self.test_environment():
print_error("\nCannot proceed without required API keys!")
return False
# Test search and collect URLs
urls = self.test_web_search()
# Test extraction
await self.test_web_extract(urls if urls else None)
# Test extraction with LLM
if self.test_llm:
await self.test_web_extract_with_llm(urls if urls else None)
# Test crawling
await self.test_web_crawl()
# Print summary
self.end_time = datetime.now()
duration = (self.end_time - self.start_time).total_seconds()
print_header("TEST SUMMARY")
print(f"Duration: {duration:.2f} seconds")
print(f"\n{Colors.GREEN}Passed: {len(self.test_results['passed'])}{Colors.ENDC}")
print(f"{Colors.FAIL}Failed: {len(self.test_results['failed'])}{Colors.ENDC}")
print(f"{Colors.WARNING}Skipped: {len(self.test_results['skipped'])}{Colors.ENDC}")
# List failed tests
if self.test_results["failed"]:
print(f"\n{Colors.FAIL}{Colors.BOLD}Failed Tests:{Colors.ENDC}")
for test in self.test_results["failed"]:
print(f" - {test['test']}: {test['details']}")
# Save results to file
self.save_results()
return len(self.test_results["failed"]) == 0
def save_results(self):
"""Save test results to a JSON file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"test_results_web_tools_{timestamp}.json"
results = {
"test_suite": "Web Tools",
"start_time": self.start_time.isoformat() if self.start_time else None,
"end_time": self.end_time.isoformat() if self.end_time else None,
"duration_seconds": (self.end_time - self.start_time).total_seconds() if self.start_time and self.end_time else None,
"summary": {
"passed": len(self.test_results["passed"]),
"failed": len(self.test_results["failed"]),
"skipped": len(self.test_results["skipped"])
},
"results": self.test_results,
"environment": {
"firecrawl_api_key": check_firecrawl_api_key(),
"nous_api_key": check_nous_api_key(),
"debug_mode": get_debug_session_info()["enabled"]
}
}
try:
with open(filename, 'w') as f:
json.dump(results, f, indent=2, ensure_ascii=False)
print_info(f"Test results saved to: {filename}")
except Exception as e:
print_warning(f"Failed to save results: {e}")
async def main():
"""Main entry point"""
parser = argparse.ArgumentParser(description="Test Web Tools Module")
parser.add_argument("--no-llm", action="store_true", help="Skip LLM processing tests")
parser.add_argument("--verbose", "-v", action="store_true", help="Show detailed output")
parser.add_argument("--debug", action="store_true", help="Enable debug mode for web tools")
args = parser.parse_args()
# Set debug mode if requested
if args.debug:
os.environ["WEB_TOOLS_DEBUG"] = "true"
print_info("Debug mode enabled for web tools")
# Create tester
tester = WebToolsTester(
verbose=args.verbose,
test_llm=not args.no_llm
)
# Run tests
success = await tester.run_all_tests()
# Exit with appropriate code
sys.exit(0 if success else 1)
if __name__ == "__main__":
asyncio.run(main())

67
tools/__init__.py Normal file
View File

@@ -0,0 +1,67 @@
#!/usr/bin/env python3
"""
Tools Package
This package contains all the specific tool implementations for the Hermes Agent.
Each module provides specialized functionality for different capabilities:
- web_tools: Web search, content extraction, and crawling
- terminal_tool: Command execution on virtual machines
- vision_tools: Image analysis and understanding
- mixture_of_agents_tool: Multi-model collaborative reasoning
- image_generation_tool: Text-to-image generation with upscaling
The tools are imported into model_tools.py which provides a unified interface
for the AI agent to access all capabilities.
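A typical downstream import (the names below are all re-exported by this
package; exact usage in model_tools.py may differ):
    from tools import web_search_tool, terminal_tool, vision_analyze_tool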
"""
# Export all tools for easy importing
from .web_tools import (
web_search_tool,
web_extract_tool,
web_crawl_tool,
check_firecrawl_api_key
)
from .terminal_tool import (
terminal_tool,
check_hecate_requirements,
TERMINAL_TOOL_DESCRIPTION
)
from .vision_tools import (
vision_analyze_tool,
check_vision_requirements
)
from .mixture_of_agents_tool import (
mixture_of_agents_tool,
check_moa_requirements
)
from .image_generation_tool import (
image_generate_tool,
check_image_generation_requirements
)
__all__ = [
# Web tools
'web_search_tool',
'web_extract_tool',
'web_crawl_tool',
'check_firecrawl_api_key',
# Terminal tools
'terminal_tool',
'check_hecate_requirements',
'TERMINAL_TOOL_DESCRIPTION',
# Vision tools
'vision_analyze_tool',
'check_vision_requirements',
# MoA tools
'mixture_of_agents_tool',
'check_moa_requirements',
# Image generation tools
'image_generate_tool',
'check_image_generation_requirements',
]

View File

@@ -0,0 +1,563 @@
#!/usr/bin/env python3
"""
Image Generation Tools Module
This module provides image generation tools using FAL.ai's FLUX.1 Krea model with
automatic upscaling via FAL.ai's Clarity Upscaler for enhanced image quality.
Available tools:
- image_generate_tool: Generate images from text prompts with automatic upscaling
Features:
- High-quality image generation using FLUX.1 Krea model
- Automatic 2x upscaling using Clarity Upscaler for enhanced quality
- Comprehensive parameter control (size, steps, guidance, etc.)
- Proper error handling and validation with fallback to original images
- Debug logging support
- Sync mode for immediate results
Usage:
from image_generation_tool import image_generate_tool
import asyncio
# Generate and automatically upscale an image
result = await image_generate_tool(
prompt="A serene mountain landscape with cherry blossoms",
image_size="landscape_4_3",
num_images=1
)
"""
import json
import os
import asyncio
import uuid
import datetime
from pathlib import Path
from typing import Dict, Any, Optional, Union
import fal_client
# Configuration for image generation
DEFAULT_MODEL = "fal-ai/flux/krea"
DEFAULT_IMAGE_SIZE = "landscape_4_3"
DEFAULT_NUM_INFERENCE_STEPS = 50
DEFAULT_GUIDANCE_SCALE = 4.5
DEFAULT_NUM_IMAGES = 1
DEFAULT_OUTPUT_FORMAT = "png"
# Configuration for automatic upscaling
UPSCALER_MODEL = "fal-ai/clarity-upscaler"
UPSCALER_FACTOR = 2
UPSCALER_SAFETY_CHECKER = False
UPSCALER_DEFAULT_PROMPT = "masterpiece, best quality, highres"
UPSCALER_NEGATIVE_PROMPT = "(worst quality, low quality, normal quality:2)"
UPSCALER_CREATIVITY = 0.35
UPSCALER_RESEMBLANCE = 0.6
UPSCALER_GUIDANCE_SCALE = 4
UPSCALER_NUM_INFERENCE_STEPS = 18
# Valid parameter values for validation based on FLUX Krea documentation
VALID_IMAGE_SIZES = [
"square_hd", "square", "portrait_4_3", "portrait_16_9", "landscape_4_3", "landscape_16_9"
]
VALID_OUTPUT_FORMATS = ["jpeg", "png"]
VALID_ACCELERATION_MODES = ["none", "regular", "high"]
# Debug mode configuration
DEBUG_MODE = os.getenv("IMAGE_TOOLS_DEBUG", "false").lower() == "true"
DEBUG_SESSION_ID = str(uuid.uuid4())
DEBUG_LOG_PATH = Path("./logs")
DEBUG_DATA = {
"session_id": DEBUG_SESSION_ID,
"start_time": datetime.datetime.now().isoformat(),
"debug_enabled": DEBUG_MODE,
"tool_calls": []
} if DEBUG_MODE else None
# Create logs directory if debug mode is enabled
if DEBUG_MODE:
DEBUG_LOG_PATH.mkdir(exist_ok=True)
print(f"🐛 Image generation debug mode enabled - Session ID: {DEBUG_SESSION_ID}")
def _log_debug_call(tool_name: str, call_data: Dict[str, Any]) -> None:
"""
Log a debug call entry to the global debug data structure.
Args:
tool_name (str): Name of the tool being called
call_data (Dict[str, Any]): Data about the call including parameters and results
"""
if not DEBUG_MODE or not DEBUG_DATA:
return
call_entry = {
"timestamp": datetime.datetime.now().isoformat(),
"tool_name": tool_name,
**call_data
}
DEBUG_DATA["tool_calls"].append(call_entry)
def _save_debug_log() -> None:
"""
Save the current debug data to a JSON file in the logs directory.
"""
if not DEBUG_MODE or not DEBUG_DATA:
return
try:
debug_filename = f"image_tools_debug_{DEBUG_SESSION_ID}.json"
debug_filepath = DEBUG_LOG_PATH / debug_filename
# Update end time
DEBUG_DATA["end_time"] = datetime.datetime.now().isoformat()
DEBUG_DATA["total_calls"] = len(DEBUG_DATA["tool_calls"])
with open(debug_filepath, 'w', encoding='utf-8') as f:
json.dump(DEBUG_DATA, f, indent=2, ensure_ascii=False)
print(f"🐛 Image generation debug log saved: {debug_filepath}")
except Exception as e:
print(f"❌ Error saving image generation debug log: {str(e)}")
def _validate_parameters(
image_size: Union[str, Dict[str, int]],
num_inference_steps: int,
guidance_scale: float,
num_images: int,
output_format: str,
acceleration: str = "none"
) -> Dict[str, Any]:
"""
Validate and normalize image generation parameters for FLUX Krea model.
Args:
image_size: Either a preset string or custom size dict
num_inference_steps: Number of inference steps
guidance_scale: Guidance scale value
num_images: Number of images to generate
output_format: Output format for images
acceleration: Acceleration mode for generation speed
Returns:
Dict[str, Any]: Validated and normalized parameters
Raises:
ValueError: If any parameter is invalid
"""
validated = {}
# Validate image_size
if isinstance(image_size, str):
if image_size not in VALID_IMAGE_SIZES:
raise ValueError(f"Invalid image_size '{image_size}'. Must be one of: {VALID_IMAGE_SIZES}")
validated["image_size"] = image_size
elif isinstance(image_size, dict):
if "width" not in image_size or "height" not in image_size:
raise ValueError("Custom image_size must contain 'width' and 'height' keys")
if not isinstance(image_size["width"], int) or not isinstance(image_size["height"], int):
raise ValueError("Custom image_size width and height must be integers")
if image_size["width"] < 64 or image_size["height"] < 64:
raise ValueError("Custom image_size dimensions must be at least 64x64")
if image_size["width"] > 2048 or image_size["height"] > 2048:
raise ValueError("Custom image_size dimensions must not exceed 2048x2048")
validated["image_size"] = image_size
else:
raise ValueError("image_size must be either a preset string or a dict with width/height")
# Validate num_inference_steps
if not isinstance(num_inference_steps, int) or num_inference_steps < 1 or num_inference_steps > 100:
raise ValueError("num_inference_steps must be an integer between 1 and 100")
validated["num_inference_steps"] = num_inference_steps
# Validate guidance_scale (FLUX Krea default is 4.5)
if not isinstance(guidance_scale, (int, float)) or guidance_scale < 0.1 or guidance_scale > 20.0:
raise ValueError("guidance_scale must be a number between 0.1 and 20.0")
validated["guidance_scale"] = float(guidance_scale)
# Validate num_images
if not isinstance(num_images, int) or num_images < 1 or num_images > 4:
raise ValueError("num_images must be an integer between 1 and 4")
validated["num_images"] = num_images
# Validate output_format
if output_format not in VALID_OUTPUT_FORMATS:
raise ValueError(f"Invalid output_format '{output_format}'. Must be one of: {VALID_OUTPUT_FORMATS}")
validated["output_format"] = output_format
# Validate acceleration
if acceleration not in VALID_ACCELERATION_MODES:
raise ValueError(f"Invalid acceleration '{acceleration}'. Must be one of: {VALID_ACCELERATION_MODES}")
validated["acceleration"] = acceleration
return validated
async def _upscale_image(image_url: str, original_prompt: str) -> Dict[str, Any]:
"""
Upscale an image using FAL.ai's Clarity Upscaler.
Args:
image_url (str): URL of the image to upscale
original_prompt (str): Original prompt used to generate the image
Returns:
Dict[str, Any]: Upscaled image data or None if upscaling fails
"""
try:
print(f"🔍 Upscaling image with Clarity Upscaler...")
# Prepare arguments for upscaler
upscaler_arguments = {
"image_url": image_url,
"prompt": f"{UPSCALER_DEFAULT_PROMPT}, {original_prompt}",
"upscale_factor": UPSCALER_FACTOR,
"negative_prompt": UPSCALER_NEGATIVE_PROMPT,
"creativity": UPSCALER_CREATIVITY,
"resemblance": UPSCALER_RESEMBLANCE,
"guidance_scale": UPSCALER_GUIDANCE_SCALE,
"num_inference_steps": UPSCALER_NUM_INFERENCE_STEPS,
"enable_safety_checker": UPSCALER_SAFETY_CHECKER
}
# Submit upscaler request
handler = await fal_client.submit_async(
UPSCALER_MODEL,
arguments=upscaler_arguments
)
# Get the upscaled result
result = await handler.get()
if result and "image" in result:
upscaled_image = result["image"]
print(f"✅ Image upscaled successfully to {upscaled_image.get('width', 'unknown')}x{upscaled_image.get('height', 'unknown')}")
return {
"url": upscaled_image["url"],
"width": upscaled_image.get("width", 0),
"height": upscaled_image.get("height", 0),
"upscaled": True,
"upscale_factor": UPSCALER_FACTOR
}
else:
print("❌ Upscaler returned invalid response")
return None
except Exception as e:
print(f"❌ Error upscaling image: {str(e)}")
return None
async def image_generate_tool(
prompt: str,
image_size: Union[str, Dict[str, int]] = DEFAULT_IMAGE_SIZE,
num_inference_steps: int = DEFAULT_NUM_INFERENCE_STEPS,
guidance_scale: float = DEFAULT_GUIDANCE_SCALE,
num_images: int = DEFAULT_NUM_IMAGES,
enable_safety_checker: bool = True,
output_format: str = DEFAULT_OUTPUT_FORMAT,
acceleration: str = "none",
allow_nsfw_images: bool = True,
seed: Optional[int] = None
) -> str:
"""
Generate images from text prompts using FAL.ai's FLUX.1 Krea model with automatic upscaling.
This tool uses FAL.ai's FLUX.1 Krea model for high-quality text-to-image generation
with extensive customization options. Generated images are automatically upscaled 2x
using FAL.ai's Clarity Upscaler for enhanced quality. The final upscaled images are
returned as URLs that can be displayed using <img src="{URL}"></img> tags.
Args:
prompt (str): The text prompt describing the desired image
image_size (Union[str, Dict[str, int]]): Preset size or custom {"width": int, "height": int}
num_inference_steps (int): Number of denoising steps (1-100, default: 50)
guidance_scale (float): How closely to follow prompt (0.1-20.0, default: 4.5)
num_images (int): Number of images to generate (1-4, default: 1)
enable_safety_checker (bool): Enable content safety filtering (default: True)
output_format (str): Image format "jpeg" or "png" (default: "png")
acceleration (str): Generation speed "none", "regular", or "high" (default: "none")
allow_nsfw_images (bool): Allow generation of NSFW content (default: True)
seed (Optional[int]): Random seed for reproducible results (optional)
Returns:
str: JSON string containing minimal generation results:
{
"success": bool,
"image": str or None # URL of the upscaled image, or None if failed
}
"""
debug_call_data = {
"parameters": {
"prompt": prompt,
"image_size": image_size,
"num_inference_steps": num_inference_steps,
"guidance_scale": guidance_scale,
"num_images": num_images,
"enable_safety_checker": enable_safety_checker,
"output_format": output_format,
"acceleration": acceleration,
"allow_nsfw_images": allow_nsfw_images,
"seed": seed
},
"error": None,
"success": False,
"images_generated": 0,
"generation_time": 0
}
start_time = datetime.datetime.now()
try:
print(f"🎨 Generating {num_images} image(s) with FLUX Krea: {prompt[:80]}{'...' if len(prompt) > 80 else ''}")
# Validate prompt
if not prompt or not isinstance(prompt, str) or len(prompt.strip()) == 0:
raise ValueError("Prompt is required and must be a non-empty string")
# Check API key availability
if not os.getenv("FAL_KEY"):
raise ValueError("FAL_KEY environment variable not set")
# Validate parameters
validated_params = _validate_parameters(
image_size, num_inference_steps, guidance_scale, num_images, output_format, acceleration
)
# Prepare arguments for FAL.ai FLUX Krea API
arguments = {
"prompt": prompt.strip(),
"image_size": validated_params["image_size"],
"num_inference_steps": validated_params["num_inference_steps"],
"guidance_scale": validated_params["guidance_scale"],
"num_images": validated_params["num_images"],
"enable_safety_checker": enable_safety_checker,
"output_format": validated_params["output_format"],
"acceleration": validated_params["acceleration"],
"allow_nsfw_images": allow_nsfw_images,
"sync_mode": True # Use sync mode for immediate results
}
# Add seed if provided
if seed is not None and isinstance(seed, int):
arguments["seed"] = seed
print(f"🚀 Submitting generation request to FAL.ai FLUX Krea...")
print(f" Model: {DEFAULT_MODEL}")
print(f" Size: {validated_params['image_size']}")
print(f" Steps: {validated_params['num_inference_steps']}")
print(f" Guidance: {validated_params['guidance_scale']}")
print(f" Acceleration: {validated_params['acceleration']}")
# Submit request to FAL.ai
handler = await fal_client.submit_async(
DEFAULT_MODEL,
arguments=arguments
)
# Get the result
result = await handler.get()
generation_time = (datetime.datetime.now() - start_time).total_seconds()
# Process the response
if not result or "images" not in result:
raise ValueError("Invalid response from FAL.ai API - no images returned")
images = result.get("images", [])
if not images:
raise ValueError("No images were generated")
# Format image data and upscale images
formatted_images = []
for img in images:
if isinstance(img, dict) and "url" in img:
original_image = {
"url": img["url"],
"width": img.get("width", 0),
"height": img.get("height", 0)
}
# Attempt to upscale the image
upscaled_image = await _upscale_image(img["url"], prompt.strip())
if upscaled_image:
# Use upscaled image if successful
formatted_images.append(upscaled_image)
else:
# Fall back to original image if upscaling fails
print(f"⚠️ Using original image as fallback")
original_image["upscaled"] = False
formatted_images.append(original_image)
if not formatted_images:
raise ValueError("No valid image URLs returned from API")
upscaled_count = sum(1 for img in formatted_images if img.get("upscaled", False))
print(f"✅ Generated {len(formatted_images)} image(s) in {generation_time:.1f}s ({upscaled_count} upscaled)")
# Prepare successful response - minimal format
response_data = {
"success": True,
"image": formatted_images[0]["url"] if formatted_images else None
}
debug_call_data["success"] = True
debug_call_data["images_generated"] = len(formatted_images)
debug_call_data["generation_time"] = generation_time
# Log debug information
_log_debug_call("image_generate_tool", debug_call_data)
_save_debug_log()
return json.dumps(response_data, indent=2, ensure_ascii=False)
except Exception as e:
generation_time = (datetime.datetime.now() - start_time).total_seconds()
error_msg = f"Error generating image: {str(e)}"
print(f"{error_msg}")
# Prepare error response - minimal format
response_data = {
"success": False,
"image": None
}
debug_call_data["error"] = error_msg
debug_call_data["generation_time"] = generation_time
_log_debug_call("image_generate_tool", debug_call_data)
_save_debug_log()
return json.dumps(response_data, indent=2, ensure_ascii=False)
def check_fal_api_key() -> bool:
"""
Check if the FAL.ai API key is available in environment variables.
Returns:
bool: True if API key is set, False otherwise
"""
return bool(os.getenv("FAL_KEY"))
def check_image_generation_requirements() -> bool:
"""
Check if all requirements for image generation tools are met.
Returns:
bool: True if requirements are met, False otherwise
"""
try:
# Check API key
if not check_fal_api_key():
return False
# Check if fal_client is available
import fal_client
return True
except ImportError:
return False
def get_debug_session_info() -> Dict[str, Any]:
"""
Get information about the current debug session.
Returns:
Dict[str, Any]: Dictionary containing debug session information
"""
if not DEBUG_MODE or not DEBUG_DATA:
return {
"enabled": False,
"session_id": None,
"log_path": None,
"total_calls": 0
}
return {
"enabled": True,
"session_id": DEBUG_SESSION_ID,
"log_path": str(DEBUG_LOG_PATH / f"image_tools_debug_{DEBUG_SESSION_ID}.json"),
"total_calls": len(DEBUG_DATA["tool_calls"])
}
if __name__ == "__main__":
"""
Simple test/demo when run directly
"""
print("🎨 Image Generation Tools Module - FLUX.1 Krea + Auto Upscaling")
print("=" * 60)
# Check if API key is available
api_available = check_fal_api_key()
if not api_available:
print("❌ FAL_KEY environment variable not set")
print("Please set your API key: export FAL_KEY='your-key-here'")
print("Get API key at: https://fal.ai/")
exit(1)
else:
print("✅ FAL.ai API key found")
# Check if fal_client is available
try:
import fal_client
print("✅ fal_client library available")
except ImportError:
print("❌ fal_client library not found")
print("Please install: pip install fal-client")
exit(1)
print("🛠️ Image generation tools ready for use!")
print(f"🤖 Using model: {DEFAULT_MODEL}")
print(f"🔍 Auto-upscaling with: {UPSCALER_MODEL} ({UPSCALER_FACTOR}x)")
# Show debug mode status
if DEBUG_MODE:
print(f"🐛 Debug mode ENABLED - Session ID: {DEBUG_SESSION_ID}")
print(f" Debug logs will be saved to: ./logs/image_tools_debug_{DEBUG_SESSION_ID}.json")
else:
print("🐛 Debug mode disabled (set IMAGE_TOOLS_DEBUG=true to enable)")
print("\nBasic usage:")
print(" from image_generation_tool import image_generate_tool")
print(" import asyncio")
print("")
print(" async def main():")
print(" # Generate image with automatic 2x upscaling")
print(" result = await image_generate_tool(")
print(" prompt='A serene mountain landscape with cherry blossoms',")
print(" image_size='landscape_4_3',")
print(" num_images=1")
print(" )")
print(" print(result)")
print(" asyncio.run(main())")
print("\nSupported image sizes:")
for size in VALID_IMAGE_SIZES:
print(f" - {size}")
print(" - Custom: {'width': 512, 'height': 768} (if needed)")
print("\nAcceleration modes:")
for mode in VALID_ACCELERATION_MODES:
print(f" - {mode}")
print("\nExample prompts:")
print(" - 'A candid street photo of a woman with a pink bob and bold eyeliner'")
print(" - 'Modern architecture building with glass facade, sunset lighting'")
print(" - 'Abstract art with vibrant colors and geometric patterns'")
print(" - 'Portrait of a wise old owl perched on ancient tree branch'")
print(" - 'Futuristic cityscape with flying cars and neon lights'")
print("\nDebug mode:")
print(" # Enable debug logging")
print(" export IMAGE_TOOLS_DEBUG=true")
print(" # Debug logs capture all image generation calls and results")
print(" # Logs saved to: ./logs/image_tools_debug_UUID.json")

View File

@@ -0,0 +1,586 @@
#!/usr/bin/env python3
"""
Mixture-of-Agents Tool Module
This module implements the Mixture-of-Agents (MoA) methodology that leverages
the collective strengths of multiple LLMs through a layered architecture to
achieve state-of-the-art performance on complex reasoning tasks.
Based on the research paper: "Mixture-of-Agents Enhances Large Language Model Capabilities"
by Junlin Wang et al. (arXiv:2406.04692v1)
Key Features:
- Multi-layer LLM collaboration for enhanced reasoning
- Parallel processing of reference models for efficiency
- Intelligent aggregation and synthesis of diverse responses
- Specialized for extremely difficult problems requiring intense reasoning
- Optimized for coding, mathematics, and complex analytical tasks
Available Tool:
- mixture_of_agents_tool: Process complex queries using multiple frontier models
Architecture:
1. Reference models generate diverse initial responses in parallel
2. Aggregator model synthesizes responses into a high-quality output
3. Multiple layers can be used for iterative refinement (future enhancement)
Models Used:
- Reference Models: claude-opus-4-20250514, gemini-2.5-pro, gpt-5, deepseek-r1
- Aggregator Model: claude-opus-4-20250514 (highest capability for synthesis)
Configuration:
To customize the MoA setup, modify the configuration constants at the top of this file:
- REFERENCE_MODELS: List of models for generating diverse initial responses
- AGGREGATOR_MODEL: Model used to synthesize the final response
- REFERENCE_TEMPERATURE/AGGREGATOR_TEMPERATURE: Sampling temperatures
- MIN_SUCCESSFUL_REFERENCES: Minimum successful models needed to proceed
Usage:
from mixture_of_agents_tool import mixture_of_agents_tool
import asyncio
# Process a complex query
result = await mixture_of_agents_tool(
user_prompt="Solve this complex mathematical proof..."
)
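    # A call that overrides the default model lineup (illustrative; any models
    # served through the Nous inference API can be listed here)
    result = await mixture_of_agents_tool(
        user_prompt="Design an O(n log n) algorithm for ...",
        reference_models=["claude-opus-4-20250514", "deepseek-r1"],
        aggregator_model="claude-opus-4-20250514"
    )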
"""
import json
import os
import asyncio
import uuid
import datetime
from pathlib import Path
from typing import Dict, Any, List, Optional
from openai import AsyncOpenAI
# Initialize Nous Research API client for MoA processing
nous_client = AsyncOpenAI(
api_key=os.getenv("NOUS_API_KEY"),
base_url="https://inference-api.nousresearch.com/v1"
)
# Configuration for MoA processing
# Reference models - these generate diverse initial responses in parallel
REFERENCE_MODELS = [
"claude-opus-4-20250514",
"gemini-2.5-pro",
"gpt-5",
"deepseek-r1"
]
# Aggregator model - synthesizes reference responses into final output
AGGREGATOR_MODEL = "claude-opus-4-20250514" # Use highest capability model for aggregation
# Temperature settings optimized for MoA performance
REFERENCE_TEMPERATURE = 0.6 # Balanced creativity for diverse perspectives
AGGREGATOR_TEMPERATURE = 0.4 # Focused synthesis for consistency
# Failure handling configuration
MIN_SUCCESSFUL_REFERENCES = 1 # Minimum successful reference models needed to proceed
# System prompt for the aggregator model (from the research paper)
AGGREGATOR_SYSTEM_PROMPT = """You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.
Responses from models:"""
# Debug mode configuration
DEBUG_MODE = os.getenv("MOA_TOOLS_DEBUG", "false").lower() == "true"
DEBUG_SESSION_ID = str(uuid.uuid4())
DEBUG_LOG_PATH = Path("./logs")
DEBUG_DATA = {
"session_id": DEBUG_SESSION_ID,
"start_time": datetime.datetime.now().isoformat(),
"debug_enabled": DEBUG_MODE,
"tool_calls": []
} if DEBUG_MODE else None
# Create logs directory if debug mode is enabled
if DEBUG_MODE:
DEBUG_LOG_PATH.mkdir(exist_ok=True)
print(f"🐛 MoA debug mode enabled - Session ID: {DEBUG_SESSION_ID}")
def _log_debug_call(tool_name: str, call_data: Dict[str, Any]) -> None:
"""
Log a debug call entry to the global debug data structure.
Args:
tool_name (str): Name of the tool being called
call_data (Dict[str, Any]): Data about the call including parameters and results
"""
if not DEBUG_MODE or not DEBUG_DATA:
return
call_entry = {
"timestamp": datetime.datetime.now().isoformat(),
"tool_name": tool_name,
**call_data
}
DEBUG_DATA["tool_calls"].append(call_entry)
def _save_debug_log() -> None:
"""
Save the current debug data to a JSON file in the logs directory.
"""
if not DEBUG_MODE or not DEBUG_DATA:
return
try:
debug_filename = f"moa_tools_debug_{DEBUG_SESSION_ID}.json"
debug_filepath = DEBUG_LOG_PATH / debug_filename
# Update end time
DEBUG_DATA["end_time"] = datetime.datetime.now().isoformat()
DEBUG_DATA["total_calls"] = len(DEBUG_DATA["tool_calls"])
with open(debug_filepath, 'w', encoding='utf-8') as f:
json.dump(DEBUG_DATA, f, indent=2, ensure_ascii=False)
print(f"🐛 MoA debug log saved: {debug_filepath}")
except Exception as e:
print(f"❌ Error saving MoA debug log: {str(e)}")
def _construct_aggregator_prompt(system_prompt: str, responses: List[str]) -> str:
"""
Construct the final system prompt for the aggregator including all model responses.
Args:
system_prompt (str): Base system prompt for aggregation
responses (List[str]): List of responses from reference models
Returns:
str: Complete system prompt with enumerated responses
"""
response_text = "\n".join([f"{i+1}. {response}" for i, response in enumerate(responses)])
return f"{system_prompt}\n\n{response_text}"
async def _run_reference_model_safe(
model: str,
user_prompt: str,
temperature: float = REFERENCE_TEMPERATURE,
max_tokens: int = 32000,
max_retries: int = 6
) -> tuple[str, str, bool]:
"""
Run a single reference model with retry logic and graceful failure handling.
Args:
model (str): Model identifier to use
user_prompt (str): The user's query
temperature (float): Sampling temperature for response generation
max_tokens (int): Maximum tokens in response
max_retries (int): Maximum number of retry attempts
Returns:
tuple[str, str, bool]: (model_name, response_content_or_error, success_flag)
"""
for attempt in range(max_retries):
try:
print(f"🤖 Querying {model} (attempt {attempt + 1}/{max_retries})")
# Build parameters for the API call
api_params = {
"model": model,
"messages": [{"role": "user", "content": user_prompt}]
}
# Some GPT-family models (e.g. gpt-5 and the o-series reasoning models) reject custom temperature values
# Only include temperature for non-GPT models
if not model.lower().startswith('gpt-'):
api_params["temperature"] = temperature
response = await nous_client.chat.completions.create(**api_params)
content = response.choices[0].message.content.strip()
print(f"{model} responded ({len(content)} characters)")
return model, content, True
except Exception as e:
error_str = str(e)
# Log more detailed error information for debugging
if "invalid" in error_str.lower():
print(f"⚠️ {model} invalid request error (attempt {attempt + 1}): {error_str}")
elif "rate" in error_str.lower() or "limit" in error_str.lower():
print(f"⚠️ {model} rate limit error (attempt {attempt + 1}): {error_str}")
else:
print(f"⚠️ {model} unknown error (attempt {attempt + 1}): {error_str}")
if attempt < max_retries - 1:
# Exponential backoff for rate limiting: 2s, 4s, 8s, 16s, 32s, 60s
sleep_time = min(2 ** (attempt + 1), 60)
print(f" Retrying in {sleep_time}s...")
await asyncio.sleep(sleep_time)
else:
error_msg = f"{model} failed after {max_retries} attempts: {error_str}"
print(f"{error_msg}")
return model, error_msg, False
async def _run_aggregator_model(
system_prompt: str,
user_prompt: str,
temperature: float = AGGREGATOR_TEMPERATURE,
max_tokens: int = None
) -> str:
"""
Run the aggregator model to synthesize the final response.
Args:
system_prompt (str): System prompt with all reference responses
user_prompt (str): Original user query
temperature (float): Focused temperature for consistent aggregation
max_tokens (int): Maximum tokens in final response
Returns:
str: Synthesized final response
"""
print(f"🧠 Running aggregator model: {AGGREGATOR_MODEL}")
# Build parameters for the API call
api_params = {
"model": AGGREGATOR_MODEL,
"messages": [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
]
}
# Some GPT-family models (e.g. gpt-5 and the o-series reasoning models) reject custom temperature values
# Only include temperature for non-GPT models
if not AGGREGATOR_MODEL.lower().startswith('gpt-'):
api_params["temperature"] = temperature
response = await nous_client.chat.completions.create(**api_params)
content = response.choices[0].message.content.strip()
print(f"✅ Aggregation complete ({len(content)} characters)")
return content
async def mixture_of_agents_tool(
user_prompt: str,
reference_models: Optional[List[str]] = None,
aggregator_model: Optional[str] = None
) -> str:
"""
Process a complex query using the Mixture-of-Agents methodology.
This tool leverages multiple frontier language models to collaboratively solve
extremely difficult problems requiring intense reasoning. It's particularly
effective for:
- Complex mathematical proofs and calculations
- Advanced coding problems and algorithm design
- Multi-step analytical reasoning tasks
- Problems requiring diverse domain expertise
- Tasks where single models show limitations
The MoA approach uses a fixed 2-layer architecture:
1. Layer 1: Multiple reference models generate diverse responses in parallel (temp=0.6)
2. Layer 2: Aggregator model synthesizes the best elements into final response (temp=0.4)
Args:
user_prompt (str): The complex query or problem to solve
reference_models (Optional[List[str]]): Custom reference models to use
aggregator_model (Optional[str]): Custom aggregator model to use
Returns:
str: JSON string containing the MoA results with the following structure:
{
"success": bool,
"response": str,
"models_used": {
"reference_models": List[str],
"aggregator_model": str
},
"processing_time": float
}
Raises:
Exception: If MoA processing fails or API key is not set
"""
start_time = datetime.datetime.now()
debug_call_data = {
"parameters": {
"user_prompt": user_prompt[:200] + "..." if len(user_prompt) > 200 else user_prompt,
"reference_models": reference_models or REFERENCE_MODELS,
"aggregator_model": aggregator_model or AGGREGATOR_MODEL,
"reference_temperature": REFERENCE_TEMPERATURE,
"aggregator_temperature": AGGREGATOR_TEMPERATURE,
"min_successful_references": MIN_SUCCESSFUL_REFERENCES
},
"error": None,
"success": False,
"reference_responses_count": 0,
"failed_models_count": 0,
"failed_models": [],
"final_response_length": 0,
"processing_time_seconds": 0,
"models_used": {}
}
try:
print(f"🚀 Starting Mixture-of-Agents processing...")
print(f"📝 Query: {user_prompt[:100]}{'...' if len(user_prompt) > 100 else ''}")
# Validate API key availability
if not os.getenv("NOUS_API_KEY"):
raise ValueError("NOUS_API_KEY environment variable not set")
# Use provided models or defaults
ref_models = reference_models or REFERENCE_MODELS
agg_model = aggregator_model or AGGREGATOR_MODEL
print(f"🔄 Using {len(ref_models)} reference models in 2-layer MoA architecture")
# Layer 1: Generate diverse responses from reference models (with failure handling)
print("📡 Layer 1: Generating reference responses...")
model_results = await asyncio.gather(*[
_run_reference_model_safe(model, user_prompt, REFERENCE_TEMPERATURE)
for model in ref_models
])
# Separate successful and failed responses
successful_responses = []
failed_models = []
for model_name, content, success in model_results:
if success:
successful_responses.append(content)
else:
failed_models.append(model_name)
successful_count = len(successful_responses)
failed_count = len(failed_models)
print(f"📊 Reference model results: {successful_count} successful, {failed_count} failed")
if failed_models:
print(f"⚠️ Failed models: {', '.join(failed_models)}")
# Check if we have enough successful responses to proceed
if successful_count < MIN_SUCCESSFUL_REFERENCES:
raise ValueError(f"Insufficient successful reference models ({successful_count}/{len(ref_models)}). Need at least {MIN_SUCCESSFUL_REFERENCES} successful responses.")
debug_call_data["reference_responses_count"] = successful_count
debug_call_data["failed_models_count"] = failed_count
debug_call_data["failed_models"] = failed_models
# Layer 2: Aggregate responses using the aggregator model
print("🧠 Layer 2: Synthesizing final response...")
aggregator_system_prompt = _construct_aggregator_prompt(
AGGREGATOR_SYSTEM_PROMPT,
successful_responses
)
final_response = await _run_aggregator_model(
aggregator_system_prompt,
user_prompt,
AGGREGATOR_TEMPERATURE
)
# Calculate processing time
end_time = datetime.datetime.now()
processing_time = (end_time - start_time).total_seconds()
print(f"✅ MoA processing completed in {processing_time:.2f} seconds")
# Prepare successful response (only final aggregated result, minimal fields)
result = {
"success": True,
"response": final_response,
"models_used": {
"reference_models": ref_models,
"aggregator_model": agg_model
}
}
debug_call_data["success"] = True
debug_call_data["final_response_length"] = len(final_response)
debug_call_data["processing_time_seconds"] = processing_time
debug_call_data["models_used"] = result["models_used"]
# Log debug information
_log_debug_call("mixture_of_agents_tool", debug_call_data)
_save_debug_log()
return json.dumps(result, indent=2, ensure_ascii=False)
except Exception as e:
error_msg = f"Error in MoA processing: {str(e)}"
print(f"{error_msg}")
# Calculate processing time even for errors
end_time = datetime.datetime.now()
processing_time = (end_time - start_time).total_seconds()
# Prepare error response (minimal fields)
result = {
"success": False,
"response": "MoA processing failed. Please try again or use a single model for this query.",
"models_used": {
"reference_models": reference_models or REFERENCE_MODELS,
"aggregator_model": aggregator_model or AGGREGATOR_MODEL
},
"error": error_msg
}
debug_call_data["error"] = error_msg
debug_call_data["processing_time_seconds"] = processing_time
_log_debug_call("mixture_of_agents_tool", debug_call_data)
_save_debug_log()
return json.dumps(result, indent=2, ensure_ascii=False)
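# Illustrative usage sketch (not part of the original tool surface): shows how a
# caller might decode the JSON string returned by mixture_of_agents_tool. Only the
# keys emitted above ("success", "response", "error") are assumed; the prompt text
# is a hypothetical placeholder.
async def _example_moa_usage() -> None:
    raw = await mixture_of_agents_tool(
        user_prompt="Prove that the sum of two even integers is even."
    )
    data = json.loads(raw)
    if data["success"]:
        print(data["response"])
    else:
        print(f"MoA failed: {data.get('error')}")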
def check_nous_api_key() -> bool:
"""
Check if the Nous Research API key is available in environment variables.
Returns:
bool: True if API key is set, False otherwise
"""
return bool(os.getenv("NOUS_API_KEY"))
def check_moa_requirements() -> bool:
"""
Check if all requirements for MoA tools are met.
Returns:
bool: True if requirements are met, False otherwise
"""
return check_nous_api_key()
def get_debug_session_info() -> Dict[str, Any]:
"""
Get information about the current debug session.
Returns:
Dict[str, Any]: Dictionary containing debug session information
"""
if not DEBUG_MODE or not DEBUG_DATA:
return {
"enabled": False,
"session_id": None,
"log_path": None,
"total_calls": 0
}
return {
"enabled": True,
"session_id": DEBUG_SESSION_ID,
"log_path": str(DEBUG_LOG_PATH / f"moa_tools_debug_{DEBUG_SESSION_ID}.json"),
"total_calls": len(DEBUG_DATA["tool_calls"])
}
def get_available_models() -> Dict[str, List[str]]:
"""
Get information about available models for MoA processing.
Returns:
Dict[str, List[str]]: Dictionary with reference and aggregator models
"""
return {
"reference_models": REFERENCE_MODELS,
"aggregator_models": [AGGREGATOR_MODEL],
"supported_models": REFERENCE_MODELS + [AGGREGATOR_MODEL]
}
def get_moa_configuration() -> Dict[str, Any]:
"""
Get the current MoA configuration settings.
Returns:
Dict[str, Any]: Dictionary containing all configuration parameters
"""
return {
"reference_models": REFERENCE_MODELS,
"aggregator_model": AGGREGATOR_MODEL,
"reference_temperature": REFERENCE_TEMPERATURE,
"aggregator_temperature": AGGREGATOR_TEMPERATURE,
"min_successful_references": MIN_SUCCESSFUL_REFERENCES,
"total_reference_models": len(REFERENCE_MODELS),
"failure_tolerance": f"{len(REFERENCE_MODELS) - MIN_SUCCESSFUL_REFERENCES}/{len(REFERENCE_MODELS)} models can fail"
}
if __name__ == "__main__":
"""
Simple test/demo when run directly
"""
print("🤖 Mixture-of-Agents Tool Module")
print("=" * 50)
# Check if API key is available
api_available = check_nous_api_key()
if not api_available:
print("❌ NOUS_API_KEY environment variable not set")
print("Please set your API key: export NOUS_API_KEY='your-key-here'")
print("Get API key at: https://inference-api.nousresearch.com/")
exit(1)
else:
print("✅ Nous Research API key found")
print("🛠️ MoA tools ready for use!")
# Show current configuration
config = get_moa_configuration()
print(f"\n⚙️ Current Configuration:")
print(f" 🤖 Reference models ({len(config['reference_models'])}): {', '.join(config['reference_models'])}")
print(f" 🧠 Aggregator model: {config['aggregator_model']}")
print(f" 🌡️ Reference temperature: {config['reference_temperature']}")
print(f" 🌡️ Aggregator temperature: {config['aggregator_temperature']}")
print(f" 🛡️ Failure tolerance: {config['failure_tolerance']}")
print(f" 📊 Minimum successful models: {config['min_successful_references']}")
# Show debug mode status
if DEBUG_MODE:
print(f"\n🐛 Debug mode ENABLED - Session ID: {DEBUG_SESSION_ID}")
print(f" Debug logs will be saved to: ./logs/moa_tools_debug_{DEBUG_SESSION_ID}.json")
else:
print("\n🐛 Debug mode disabled (set MOA_TOOLS_DEBUG=true to enable)")
print("\nBasic usage:")
print(" from mixture_of_agents_tool import mixture_of_agents_tool")
print(" import asyncio")
print("")
print(" async def main():")
print(" result = await mixture_of_agents_tool(")
print(" user_prompt='Solve this complex mathematical proof...'")
print(" )")
print(" print(result)")
print(" asyncio.run(main())")
print("\nBest use cases:")
print(" - Complex mathematical proofs and calculations")
print(" - Advanced coding problems and algorithm design")
print(" - Multi-step analytical reasoning tasks")
print(" - Problems requiring diverse domain expertise")
print(" - Tasks where single models show limitations")
print("\nPerformance characteristics:")
print(" - Higher latency due to multiple model calls")
print(" - Significantly improved quality for complex tasks")
print(" - Parallel processing for efficiency")
print(f" - Optimized temperatures: {REFERENCE_TEMPERATURE} for reference models, {AGGREGATOR_TEMPERATURE} for aggregation")
print(" - Token-efficient: only returns final aggregated response")
print(" - Resilient: continues with partial model failures")
print(f" - Configurable: easy to modify models and settings at top of file")
print(" - State-of-the-art results on challenging benchmarks")
print("\nDebug mode:")
print(" # Enable debug logging")
print(" export MOA_TOOLS_DEBUG=true")
print(" # Debug logs capture all MoA processing steps and metrics")
print(" # Logs saved to: ./logs/moa_tools_debug_UUID.json")


@@ -0,0 +1,395 @@
#!/usr/bin/env python3
"""
Simple Terminal Tool Module
A simplified terminal tool that executes commands on MorphCloud VMs without tmux.
No session persistence, no interactive app support - just simple command execution.
Features:
- Direct SSH command execution
- Background task support
- VM lifecycle management with TTL
- Automatic cleanup after inactivity
Usage:
from simple_terminal_tool import simple_terminal_tool
# Execute a simple command
result = simple_terminal_tool("ls -la")
# Execute in background
result = simple_terminal_tool("python server.py", background=True)
"""
import json
import os
import time
import threading
import atexit
from typing import Optional, Dict, Any
# Tool description for LLM
SIMPLE_TERMINAL_TOOL_DESCRIPTION = """Execute commands on a secure Linux VM environment.
**Environment:**
- Minimal Debian-based OS with internet access
- Automatic VM lifecycle management (creates on-demand, reuses, cleans up)
- Filesystem is persisted between tool calls, but environment variables, venvs, etc. are reset.
**Command Execution:**
- Simple commands: Just provide the 'command' parameter
- Background processes: Set 'background': True for servers/long-running tasks
- Command timeout: Optional 'timeout' parameter in seconds
**Examples:**
- Run command: `{"command": "ls -la"}`
- Background task: `{"command": "source path/to/my/venv/bin/activate && python server.py", "background": True}`
- With timeout: `{"command": "long_task.sh", "timeout": 300}`
**Best Practices:**
- Run servers/long processes in background
- Monitor disk usage for large tasks
- Install whatever tools you need with sudo apt-get
- Do not be afraid to run pip with --break-system-packages
**Things to avoid**
- Do NOT use interactive tools such as tmux, vim, nano, or the Python REPL - you will get stuck. Even git sometimes becomes interactive if the output is large; if you're not sure, pipe to cat.
"""
# Global state for VM lifecycle management
_active_instances: Dict[str, Any] = {}
_last_activity: Dict[str, float] = {}
_instance_lock = threading.Lock()
_cleanup_thread = None
_cleanup_running = False
def _cleanup_inactive_vms(vm_lifetime_seconds: int = 300):
"""Clean up VMs that have been inactive for longer than vm_lifetime_seconds."""
global _active_instances, _last_activity
current_time = time.time()
tasks_to_cleanup = []
with _instance_lock:
for task_id, last_time in list(_last_activity.items()):
if current_time - last_time > vm_lifetime_seconds:
tasks_to_cleanup.append(task_id)
for task_id in tasks_to_cleanup:
try:
if task_id in _active_instances:
instance = _active_instances[task_id]
if hasattr(instance, 'terminate'):
instance.terminate()
elif hasattr(instance, 'stop'):
instance.stop()
elif hasattr(instance, 'delete'):
instance.delete()
del _active_instances[task_id]
print(f"[VM Cleanup] Terminated inactive VM for task: {task_id}")
if task_id in _last_activity:
del _last_activity[task_id]
except Exception as e:
# 404 errors are benign - VM already cleaned up by TTL
error_str = str(e)
if "404" in error_str or "InstanceNotFoundError" in error_str or "not found" in error_str.lower():
print(f"[VM Cleanup] VM for task {task_id} already cleaned up (likely TTL expiration)")
else:
print(f"[VM Cleanup] Error cleaning up VM for task {task_id}: {e}")
def _cleanup_thread_worker():
"""Background thread worker that periodically cleans up inactive VMs."""
global _cleanup_running
while _cleanup_running:
try:
vm_lifetime = int(os.getenv("HECATE_VM_LIFETIME_SECONDS", "300"))
_cleanup_inactive_vms(vm_lifetime)
except Exception as e:
print(f"[VM Cleanup] Error in cleanup thread: {e}")
for _ in range(60):
if not _cleanup_running:
break
time.sleep(1)
def _start_cleanup_thread():
"""Start the background cleanup thread if not already running."""
global _cleanup_thread, _cleanup_running
with _instance_lock:
if _cleanup_thread is None or not _cleanup_thread.is_alive():
_cleanup_running = True
_cleanup_thread = threading.Thread(target=_cleanup_thread_worker, daemon=True)
_cleanup_thread.start()
def _stop_cleanup_thread():
"""Stop the background cleanup thread."""
global _cleanup_running
_cleanup_running = False
if _cleanup_thread is not None:
_cleanup_thread.join(timeout=5)
def cleanup_vm(task_id: str):
"""Manually clean up a specific VM by task_id."""
global _active_instances, _last_activity
with _instance_lock:
try:
if task_id in _active_instances:
instance = _active_instances[task_id]
if hasattr(instance, 'terminate'):
instance.terminate()
elif hasattr(instance, 'stop'):
instance.stop()
elif hasattr(instance, 'delete'):
instance.delete()
del _active_instances[task_id]
print(f"[VM Cleanup] Manually terminated VM for task: {task_id}")
if task_id in _last_activity:
del _last_activity[task_id]
except Exception as e:
# 404 errors are benign - VM already cleaned up by TTL
error_str = str(e)
if "404" in error_str or "InstanceNotFoundError" in error_str or "not found" in error_str.lower():
print(f"[VM Cleanup] VM for task {task_id} already cleaned up (likely TTL expiration)")
else:
print(f"[VM Cleanup] Error manually cleaning up VM for task {task_id}: {e}")
atexit.register(_stop_cleanup_thread)
def _execute_ssh_command(instance, command: str, timeout: Optional[int] = None) -> Dict[str, Any]:
"""
Execute a command via SSH on the VM instance.
Args:
instance: MorphVM instance
command: Command to execute
timeout: Optional timeout in seconds
Returns:
dict with stdout, stderr, returncode
"""
ssh_context_manager = None
try:
# Use the instance's SSH context manager
ssh_context_manager = instance.ssh()
ssh_context = ssh_context_manager.__enter__()
# Execute the command
result = ssh_context.run(command, get_pty=False, timeout=timeout or 120)
# Close the SSH connection
if ssh_context_manager:
try:
ssh_context_manager.__exit__(None, None, None)
except Exception:
pass
return {
"stdout": result.stdout or "",
"stderr": result.stderr or "",
"returncode": result.returncode
}
except Exception as e:
# Close connection on error
if ssh_context_manager:
try:
ssh_context_manager.__exit__(None, None, None)
except Exception:
pass
# Check if it's a timeout
error_str = str(e).lower()
if "timeout" in error_str:
return {
"stdout": "",
"stderr": f"Command timed out after {timeout or 120} seconds",
"returncode": 124
}
return {
"stdout": "",
"stderr": f"SSH execution failed: {str(e)}",
"returncode": -1
}
def simple_terminal_tool(
command: str,
background: bool = False,
timeout: Optional[int] = None,
task_id: Optional[str] = None
) -> str:
"""
Execute a command on a MorphCloud VM without session persistence.
Args:
command: The command to execute
background: Whether to run in background (default: False)
timeout: Command timeout in seconds (default: 120)
task_id: Unique identifier for VM isolation (optional)
Returns:
str: JSON string with output, exit_code, and error fields
Examples:
# Execute a simple command
>>> result = simple_terminal_tool(command="ls -la /tmp")
# Run a background task
>>> result = simple_terminal_tool(command="python server.py", background=True)
# With custom timeout
>>> result = simple_terminal_tool(command="long_task.sh", timeout=300)
"""
global _active_instances, _last_activity
try:
# Import required modules
try:
from morphcloud.api import MorphCloudClient
except ImportError as import_error:
return json.dumps({
"output": "",
"exit_code": -1,
"error": f"Terminal tool disabled: {import_error}",
"status": "disabled"
}, ensure_ascii=False)
# Get configuration
vm_ttl_seconds = int(os.getenv("HECATE_VM_TTL_SECONDS", "1200"))
snapshot_id = os.getenv("HECATE_DEFAULT_SNAPSHOT_ID", "snapshot_defv9tjg")
# Check API key
morph_api_key = os.getenv("MORPH_API_KEY")
if not morph_api_key:
return json.dumps({
"output": "",
"exit_code": -1,
"error": "MORPH_API_KEY environment variable not set",
"status": "disabled"
}, ensure_ascii=False)
# Use task_id for VM isolation
effective_task_id = task_id or "default"
# Start cleanup thread
_start_cleanup_thread()
# Get or create VM instance
with _instance_lock:
if effective_task_id not in _active_instances:
morph_client = MorphCloudClient(api_key=morph_api_key)
_active_instances[effective_task_id] = morph_client.instances.start(
snapshot_id=snapshot_id,
ttl_seconds=vm_ttl_seconds,
ttl_action="stop"
)
# Update last activity time
_last_activity[effective_task_id] = time.time()
instance = _active_instances[effective_task_id]
# Wait for instance to be ready
instance.wait_until_ready()
# Prepare command for execution
if background:
# Run in background with nohup and redirect output
exec_command = f"nohup {command} > /tmp/bg_output.log 2>&1 &"
result = _execute_ssh_command(instance, exec_command, timeout=10)
# For background tasks, return immediately with info
if result["returncode"] == 0:
return json.dumps({
"output": "Background task started successfully",
"exit_code": 0,
"error": None
}, ensure_ascii=False)
else:
return json.dumps({
"output": result["stdout"],
"exit_code": result["returncode"],
"error": result["stderr"]
}, ensure_ascii=False)
else:
# Run foreground command
result = _execute_ssh_command(instance, command, timeout=timeout)
# Combine stdout and stderr for output
output = result["stdout"]
if result["stderr"] and result["returncode"] != 0:
output = f"{output}\n{result['stderr']}" if output else result["stderr"]
return json.dumps({
"output": output.strip(),
"exit_code": result["returncode"],
"error": result["stderr"] if result["returncode"] != 0 else None
}, ensure_ascii=False)
except Exception as e:
return json.dumps({
"output": "",
"exit_code": -1,
"error": f"Failed to execute command: {str(e)}",
"status": "error"
}, ensure_ascii=False)
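# Illustrative usage sketch: simple_terminal_tool returns a JSON string, so callers
# typically decode it before branching on the exit code. The command and task_id
# below are hypothetical placeholders; only the keys emitted above (output,
# exit_code, error) are assumed.
def _example_simple_terminal_usage() -> None:
    raw = simple_terminal_tool(command="uname -a", timeout=60, task_id="example-task")
    result = json.loads(raw)
    if result["exit_code"] == 0:
        print(result["output"])
    else:
        print(f"command failed: {result['error']}")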
def check_requirements() -> bool:
"""Check if all requirements for the simple terminal tool are met."""
required_vars = ["MORPH_API_KEY"]
missing_required = [var for var in required_vars if not os.getenv(var)]
if missing_required:
print(f"Missing required environment variables: {', '.join(missing_required)}")
return False
try:
from morphcloud.api import MorphCloudClient
return True
except Exception as e:
print(f"MorphCloud not available: {e}")
return False
if __name__ == "__main__":
"""Simple test when run directly."""
print("Simple Terminal Tool Module")
print("=" * 40)
if not check_requirements():
print("Requirements not met. Please check the messages above.")
exit(1)
print("All requirements met!")
print("\nAvailable Tool:")
print(" - simple_terminal_tool: Execute commands without session persistence")
print("\nUsage Examples:")
print(" # Execute a command")
print(" result = simple_terminal_tool(command='ls -la')")
print(" ")
print(" # Run a background task")
print(" result = simple_terminal_tool(command='python server.py', background=True)")
print("\nEnvironment Variables:")
print(f" MORPH_API_KEY: {'Set' if os.getenv('MORPH_API_KEY') else 'Not set'}")
print(f" HECATE_VM_TTL_SECONDS: {os.getenv('HECATE_VM_TTL_SECONDS', '1200')} (default: 1200 / 20 minutes)")
print(f" HECATE_VM_LIFETIME_SECONDS: {os.getenv('HECATE_VM_LIFETIME_SECONDS', '300')} (default: 300 / 5 minutes)")
print(f" HECATE_DEFAULT_SNAPSHOT_ID: {os.getenv('HECATE_DEFAULT_SNAPSHOT_ID', 'snapshot_defv9tjg')}")

tools/terminal_tool.py Normal file

@@ -0,0 +1,456 @@
#!/usr/bin/env python3
"""
Terminal Tool Module
This module provides a single terminal tool using Hecate's VM infrastructure.
It wraps Hecate's functionality to provide a simple interface for executing commands
on Morph VMs with automatic lifecycle management.
VM Lifecycle:
- VMs have a TTL (time to live) set at creation (default: 20 minutes)
- VMs are also cleaned up locally after 5 minutes of inactivity
- Timer resets with each use
Available tool:
- terminal_tool: Execute commands with optional interactive session support
Usage:
from terminal_tool import terminal_tool
# Execute a single command
result = terminal_tool("ls -la")
# Execute in an interactive session
result = terminal_tool("python", input_keys="print('hello')\\nexit()\\n")
"""
import json
import os
import uuid
import threading
import time
import atexit
from typing import Optional, Dict, Any
# Detailed description for the terminal tool based on Hermes Terminal system prompt
TERMINAL_TOOL_DESCRIPTION = """Execute commands on a secure, persistent Linux VM environment with full interactive application support.
**Environment:**
- Minimal Debian-based OS with internet access
- Automatic VM lifecycle management (creates on-demand, reuses, cleans up)
- **Full state persistence across tool calls**: current directory (pwd), environment variables, activated virtual environments (conda/venv), running processes, and command history all persist between consecutive tool calls
- Session state managed automatically via tmux
**Command Execution:**
- Simple commands: Just provide the 'command' parameter
- Background processes: Set 'background': True for servers/long-running tasks
- Interactive applications automatically detected and handled
**Interactive Applications (TUIs/Pagers/Prompts):**
When commands enter interactive mode (vim, nano, less, git prompts, package managers, etc.), you'll receive screen content with "frozen" status. This is NORMAL - the session is still active and waiting for input.
**To interact with frozen sessions:**
1. Use 'input_keys' parameter with keystrokes to send
2. System auto-detects and uses the active session
3. Session stays active until application exits
**Special Key Syntax for input_keys:**
- `<ESC>`: Escape key
- `<ENTER>`: Enter/Return
- `<CTRL+C>`, `<CTRL+D>`, `<CTRL+Z>`: Control combinations
- `<UP>`, `<DOWN>`, `<LEFT>`, `<RIGHT>`: Arrow keys
- `<TAB>`, `<BACKSPACE>`: Tab and Backspace
- `<F1>` through `<F12>`: Function keys
- `<SHIFT+TAB>`: Shift+Tab
- Uppercase letters for Shift+letter (e.g., 'V' for Shift+V)
- Symbols for Shift+number (e.g., '!' for Shift+1, ':' for Shift+;)
**Examples:**
- Start vim: `{"command": "vim file.txt"}`
- Type in vim: `{"input_keys": "iHello World<ESC>"}`
- Save and quit: `{"input_keys": ":wq<ENTER>"}`
- Navigate in less: `{"input_keys": "j"}`
- Quit less: `{"input_keys": "q"}`
**Best Practices:**
- Run servers/long processes in background with separate tool calls
- Chain multiple foreground commands in single call if needed
- Monitor disk usage for large tasks, clean up to free space
- Test components incrementally with mock inputs
- Install whatever tools needed - full system access provided"""
# Global state for VM lifecycle management
# These persist across tool calls to enable session continuity
# Changed to dictionaries keyed by task_id to prevent leakage between concurrent tasks
_active_instances: Dict[str, Any] = {}
_active_contexts: Dict[str, Any] = {}
_last_activity: Dict[str, float] = {} # Track last activity time for each VM
_instance_lock = threading.Lock()
_cleanup_thread = None
_cleanup_running = False
def _cleanup_inactive_vms(vm_lifetime_seconds: int = 300):
"""
Clean up VMs that have been inactive for longer than vm_lifetime_seconds.
This function should be called periodically by a background thread.
Args:
vm_lifetime_seconds: Maximum lifetime in seconds for inactive VMs (default: 300)
"""
global _active_instances, _active_contexts, _last_activity
current_time = time.time()
tasks_to_cleanup = []
with _instance_lock:
# Find all VMs that have been inactive for too long
for task_id, last_time in list(_last_activity.items()):
if current_time - last_time > vm_lifetime_seconds:
tasks_to_cleanup.append(task_id)
# Clean up the inactive VMs
for task_id in tasks_to_cleanup:
try:
if task_id in _active_instances:
instance = _active_instances[task_id]
# Terminate the VM instance
if hasattr(instance, 'terminate'):
instance.terminate()
elif hasattr(instance, 'stop'):
instance.stop()
elif hasattr(instance, 'delete'):
instance.delete()
# Remove from tracking dictionaries
del _active_instances[task_id]
print(f"[VM Cleanup] Terminated inactive VM for task: {task_id}")
if task_id in _active_contexts:
del _active_contexts[task_id]
if task_id in _last_activity:
del _last_activity[task_id]
except Exception as e:
print(f"[VM Cleanup] Error cleaning up VM for task {task_id}: {e}")
def _cleanup_thread_worker():
"""
Background thread worker that periodically cleans up inactive VMs.
Runs every 60 seconds.
"""
global _cleanup_running
while _cleanup_running:
try:
vm_lifetime = int(os.getenv("HECATE_VM_LIFETIME_SECONDS", "300"))
_cleanup_inactive_vms(vm_lifetime)
except Exception as e:
print(f"[VM Cleanup] Error in cleanup thread: {e}")
# Sleep for 60 seconds, but check every second if we should stop
for _ in range(60):
if not _cleanup_running:
break
time.sleep(1)
def _start_cleanup_thread():
"""
Start the background cleanup thread if it's not already running.
"""
global _cleanup_thread, _cleanup_running
with _instance_lock:
if _cleanup_thread is None or not _cleanup_thread.is_alive():
_cleanup_running = True
_cleanup_thread = threading.Thread(target=_cleanup_thread_worker, daemon=True)
_cleanup_thread.start()
def _stop_cleanup_thread():
"""
Stop the background cleanup thread.
"""
global _cleanup_running
_cleanup_running = False
if _cleanup_thread is not None:
_cleanup_thread.join(timeout=5)
def cleanup_vm(task_id: str):
"""
Manually clean up a specific VM by task_id.
This should be called when a task is completed.
Args:
task_id: The task ID of the VM to clean up
"""
global _active_instances, _active_contexts, _last_activity
with _instance_lock:
try:
if task_id in _active_instances:
instance = _active_instances[task_id]
# Terminate the VM instance
if hasattr(instance, 'terminate'):
instance.terminate()
elif hasattr(instance, 'stop'):
instance.stop()
elif hasattr(instance, 'delete'):
instance.delete()
# Remove from tracking dictionaries
del _active_instances[task_id]
print(f"[VM Cleanup] Manually terminated VM for task: {task_id}")
if task_id in _active_contexts:
del _active_contexts[task_id]
if task_id in _last_activity:
del _last_activity[task_id]
except Exception as e:
print(f"[VM Cleanup] Error manually cleaning up VM for task {task_id}: {e}")
# Register cleanup on program exit
atexit.register(_stop_cleanup_thread)
def terminal_tool(
command: Optional[str] = None,
input_keys: Optional[str] = None,
session_id: Optional[str] = None,
background: bool = False,
idle_threshold: float = 5.0,
timeout: Optional[int] = None,
task_id: Optional[str] = None
) -> str:
"""
Execute a command on a Morph VM with optional interactive session support.
This tool uses Hecate's VM lifecycle management to automatically create
and manage VMs. VMs are reused within the configured lifetime window
and automatically cleaned up after inactivity.
Args:
command: The command to execute (optional if continuing existing session)
input_keys: Keystrokes to send to interactive session (e.g., "hello\\n")
session_id: ID of existing session to continue (optional)
background: Whether to run the command in the background (default: False)
idle_threshold: Seconds to wait for output before considering session idle (default: 5.0)
timeout: Command timeout in seconds (optional)
task_id: Unique identifier for this task to isolate VMs between concurrent tasks (optional)
Returns:
str: JSON string containing command output, session info, exit code, and any errors
Examples:
# Execute a simple command
>>> result = terminal_tool(command="ls -la /tmp")
# Start an interactive Python session
>>> result = terminal_tool(command="python3")
>>> session_data = json.loads(result)
>>> session_id = session_data["session_id"]
# Send input to the session
>>> result = terminal_tool(input_keys="print('Hello')\\n", session_id=session_id)
# Run a background task
>>> result = terminal_tool(command="sleep 60", background=True)
"""
global _active_instances, _active_contexts
try:
# Import required modules lazily so this module can be imported
# even when hecate is not installed
try:
from morphcloud._llm import ToolCall
from morphcloud.api import MorphCloudClient
from hecate.cli import run_tool, ExecutionContext
from rich.console import Console
import io
except ImportError as import_error:
return json.dumps({
"output": "",
"screen": "",
"exit_code": -1,
"error": f"Terminal tool is disabled due to import error: {import_error}",
"status": "disabled"
}, ensure_ascii=False)
# Get configuration from environment
vm_lifetime_seconds = int(os.getenv("HECATE_VM_LIFETIME_SECONDS", "300"))
vm_ttl_seconds = int(os.getenv("HECATE_VM_TTL_SECONDS", "1200")) # 20 minutes default
snapshot_id = os.getenv("HECATE_DEFAULT_SNAPSHOT_ID", "snapshot_defv9tjg")
# Check API key
morph_api_key = os.getenv("MORPH_API_KEY")
if not morph_api_key:
return json.dumps({
"output": "",
"screen": "",
"exit_code": -1,
"error": "MORPH_API_KEY environment variable not set",
"status": "disabled"
}, ensure_ascii=False)
# Use task_id to isolate VMs between concurrent tasks
# If no task_id provided, use "default" for backward compatibility
effective_task_id = task_id or "default"
# Start the cleanup thread if not already running
_start_cleanup_thread()
# Get or create VM instance and execution context per task
# This is critical for interactive session support - the context must persist!
with _instance_lock:
if effective_task_id not in _active_instances:
morph_client = MorphCloudClient(api_key=morph_api_key)
_active_instances[effective_task_id] = morph_client.instances.start(
snapshot_id=snapshot_id,
ttl_seconds=vm_ttl_seconds,
ttl_action="stop"
)
# Get or create persistent execution context per task
if effective_task_id not in _active_contexts:
_active_contexts[effective_task_id] = ExecutionContext()
# Update last activity time for this VM (resets the inactivity timer)
_last_activity[effective_task_id] = time.time()
instance = _active_instances[effective_task_id]
ctx = _active_contexts[effective_task_id]
# Build tool input based on provided parameters
tool_input = {}
if command:
tool_input["command"] = command
if input_keys:
tool_input["input_keys"] = input_keys
if session_id:
tool_input["session_id"] = session_id
if background:
tool_input["background"] = background
if idle_threshold != 5.0:
tool_input["idle_threshold"] = idle_threshold
if timeout is not None:
tool_input["timeout"] = timeout
tool_call = ToolCall(
name="run_command",
input=tool_input
)
# Create a console for output (redirect to string buffer to avoid printing)
console_output = io.StringIO()
console = Console(file=console_output, force_terminal=False, legacy_windows=False)
# Generate unique tool block ID
tool_block_id = f"tool_{uuid.uuid4().hex[:8]}"
# Execute the tool with hecate
result = run_tool(
tool_call=tool_call,
instance=instance,
console=console,
tool_block_id=tool_block_id,
ctx=ctx
)
# Format the result with only essential fields for the LLM
# Map hecate's "stdout" to "output" for compatibility
formatted_result = {
"output": result.get("stdout", result.get("output", "")),
"screen": result.get("screen", ""),
"exit_code": result.get("returncode", result.get("exit_code", -1)),
"error": result.get("error")
}
return json.dumps(formatted_result, ensure_ascii=False)
except Exception as e:
return json.dumps({
"output": "",
"screen": "",
"exit_code": -1,
"error": f"Failed to execute terminal command: {str(e)}",
"status": "error"
}, ensure_ascii=False)
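# Illustrative usage sketch: decoding the JSON returned by terminal_tool. Only the
# fields forwarded in formatted_result above (output, screen, exit_code, error) are
# assumed; the command and task_id are hypothetical placeholders.
def _example_terminal_usage() -> None:
    raw = terminal_tool(command="ls -la /tmp", task_id="demo-task")
    result = json.loads(raw)
    print(result["output"])
    if result["exit_code"] != 0:
        print(f"terminal error: {result['error']}")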
def check_hecate_requirements() -> bool:
"""
Check if all requirements for terminal tools are met.
Returns:
bool: True if all requirements are met, False otherwise
"""
# Check for required environment variables
required_vars = ["MORPH_API_KEY"]
optional_vars = ["OPENAI_API_KEY"] # Needed for Hecate's LLM features
missing_required = [var for var in required_vars if not os.getenv(var)]
missing_optional = [var for var in optional_vars if not os.getenv(var)]
if missing_required:
print(f"Missing required environment variables: {', '.join(missing_required)}")
return False
if missing_optional:
print(f"Warning: Missing optional environment variables: {', '.join(missing_optional)}")
print(" (Some Hecate features may be limited)")
# Check if Hecate and required modules are importable
try:
from morphcloud._llm import ToolCall
from morphcloud.api import MorphCloudClient
from hecate.cli import run_tool, ExecutionContext
from rich.console import Console
return True
except Exception as e:
print(f"Hecate not available: {e}")
print(f"Make sure hecate is installed and MORPH_API_KEY is set.")
return False
# Module-level initialization check
_requirements_met = check_hecate_requirements()
if __name__ == "__main__":
"""
Simple test/demo when run directly
"""
print("Terminal Tool Module")
print("=" * 40)
if not _requirements_met:
print("Requirements not met. Please check the messages above.")
exit(1)
print("All requirements met!")
print("\nAvailable Tool:")
print(" - terminal_tool: Execute commands with optional interactive session support")
print("\nUsage Examples:")
print(" # Execute a command")
print(" result = terminal_tool(command='ls -la')")
print(" ")
print(" # Start an interactive session")
print(" result = terminal_tool(command='python3')")
print(" session_data = json.loads(result)")
print(" session_id = session_data['session_id']")
print(" ")
print(" # Send input to the session")
print(" result = terminal_tool(")
print(" input_keys='print(\"Hello\")\\\\n',")
print(" session_id=session_id")
print(" )")
print(" ")
print(" # Run a background task")
print(" result = terminal_tool(command='sleep 60', background=True)")
print("\nEnvironment Variables:")
print(f" MORPH_API_KEY: {'Set' if os.getenv('MORPH_API_KEY') else 'Not set'}")
print(f" OPENAI_API_KEY: {'Set' if os.getenv('OPENAI_API_KEY') else 'Not set (optional)'}")
print(f" HECATE_VM_TTL_SECONDS: {os.getenv('HECATE_VM_TTL_SECONDS', '1200')} (default: 1200 / 20 minutes)")
print(f" HECATE_VM_LIFETIME_SECONDS: {os.getenv('HECATE_VM_LIFETIME_SECONDS', '300')} (default: 300 / 5 minutes)")
print(f" HECATE_DEFAULT_SNAPSHOT_ID: {os.getenv('HECATE_DEFAULT_SNAPSHOT_ID', 'snapshot_defv9tjg')} (default: snapshot_defv9tjg)")

tools/vision_tools.py Normal file

@@ -0,0 +1,471 @@
#!/usr/bin/env python3
"""
Vision Tools Module
This module provides vision analysis tools that work with image URLs.
Uses Gemini Flash via Nous Research API for intelligent image understanding.
Available tools:
- vision_analyze_tool: Analyze images from URLs with custom prompts
Features:
- Downloads images from URLs and converts to base64 for API compatibility
- Comprehensive image description
- Context-aware analysis based on user queries
- Automatic temporary file cleanup
- Proper error handling and validation
- Debug logging support
Usage:
from vision_tools import vision_analyze_tool
import asyncio
# Analyze an image
result = await vision_analyze_tool(
image_url="https://example.com/image.jpg",
user_prompt="What architectural style is this building?"
)
"""
import json
import os
import asyncio
import uuid
import datetime
import base64
from pathlib import Path
from typing import Dict, Any, Optional
from openai import AsyncOpenAI
import httpx # Use httpx for async HTTP requests
# Initialize Nous Research API client for vision processing
nous_client = AsyncOpenAI(
api_key=os.getenv("NOUS_API_KEY"),
base_url="https://inference-api.nousresearch.com/v1"
)
# Configuration for vision processing
DEFAULT_VISION_MODEL = "gemini-2.5-flash"
# Debug mode configuration
DEBUG_MODE = os.getenv("VISION_TOOLS_DEBUG", "false").lower() == "true"
DEBUG_SESSION_ID = str(uuid.uuid4())
DEBUG_LOG_PATH = Path("./logs")
DEBUG_DATA = {
"session_id": DEBUG_SESSION_ID,
"start_time": datetime.datetime.now().isoformat(),
"debug_enabled": DEBUG_MODE,
"tool_calls": []
} if DEBUG_MODE else None
# Create logs directory if debug mode is enabled
if DEBUG_MODE:
DEBUG_LOG_PATH.mkdir(exist_ok=True)
print(f"🐛 Vision debug mode enabled - Session ID: {DEBUG_SESSION_ID}")
def _log_debug_call(tool_name: str, call_data: Dict[str, Any]) -> None:
"""
Log a debug call entry to the global debug data structure.
Args:
tool_name (str): Name of the tool being called
call_data (Dict[str, Any]): Data about the call including parameters and results
"""
if not DEBUG_MODE or not DEBUG_DATA:
return
call_entry = {
"timestamp": datetime.datetime.now().isoformat(),
"tool_name": tool_name,
**call_data
}
DEBUG_DATA["tool_calls"].append(call_entry)
def _save_debug_log() -> None:
"""
Save the current debug data to a JSON file in the logs directory.
"""
if not DEBUG_MODE or not DEBUG_DATA:
return
try:
debug_filename = f"vision_tools_debug_{DEBUG_SESSION_ID}.json"
debug_filepath = DEBUG_LOG_PATH / debug_filename
# Update end time
DEBUG_DATA["end_time"] = datetime.datetime.now().isoformat()
DEBUG_DATA["total_calls"] = len(DEBUG_DATA["tool_calls"])
with open(debug_filepath, 'w', encoding='utf-8') as f:
json.dump(DEBUG_DATA, f, indent=2, ensure_ascii=False)
print(f"🐛 Vision debug log saved: {debug_filepath}")
except Exception as e:
print(f"❌ Error saving vision debug log: {str(e)}")
def _validate_image_url(url: str) -> bool:
"""
Basic validation of image URL format.
Args:
url (str): The URL to validate
Returns:
bool: True if URL appears to be valid, False otherwise
"""
if not url or not isinstance(url, str):
return False
# Check if it's a valid URL format
if not (url.startswith('http://') or url.startswith('https://')):
return False
# Allow all HTTP/HTTPS URLs for flexibility; many image URLs have no file
# extension, so no extension check is performed here
return True
async def _download_image(image_url: str, destination: Path) -> Path:
"""
Download an image from a URL to a local destination (async).
Args:
image_url (str): The URL of the image to download
destination (Path): The path where the image should be saved
Returns:
Path: The path to the downloaded image
Raises:
Exception: If download fails or response is invalid
"""
# Create parent directories if they don't exist
destination.parent.mkdir(parents=True, exist_ok=True)
# Download the image with appropriate headers using async httpx
async with httpx.AsyncClient(timeout=30.0) as client:
response = await client.get(
image_url,
headers={"User-Agent": "hermes-agent-vision/1.0"},
)
response.raise_for_status()
# Save the image content
destination.write_bytes(response.content)
return destination
def _determine_mime_type(image_path: Path) -> str:
"""
Determine the MIME type of an image based on its file extension.
Args:
image_path (Path): Path to the image file
Returns:
str: The MIME type (defaults to image/jpeg if unknown)
"""
extension = image_path.suffix.lower()
mime_types = {
'.jpg': 'image/jpeg',
'.jpeg': 'image/jpeg',
'.png': 'image/png',
'.gif': 'image/gif',
'.bmp': 'image/bmp',
'.webp': 'image/webp',
'.svg': 'image/svg+xml'
}
return mime_types.get(extension, 'image/jpeg')
def _image_to_base64_data_url(image_path: Path, mime_type: Optional[str] = None) -> str:
"""
Convert an image file to a base64-encoded data URL.
Args:
image_path (Path): Path to the image file
mime_type (Optional[str]): MIME type of the image (auto-detected if None)
Returns:
str: Base64-encoded data URL (e.g., "data:image/jpeg;base64,...")
"""
# Read the image as bytes
data = image_path.read_bytes()
# Encode to base64
encoded = base64.b64encode(data).decode("ascii")
# Determine MIME type
mime = mime_type or _determine_mime_type(image_path)
# Create data URL
data_url = f"data:{mime};base64,{encoded}"
return data_url
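# Illustrative sketch: the two helpers above are typically chained - download the
# image to a temporary path, then embed its bytes as a data URL. The URL and path
# below are hypothetical placeholders.
async def _example_image_to_data_url() -> str:
    temp_path = Path("./temp_vision_images/example.jpg")
    await _download_image("https://example.com/image.jpg", temp_path)
    return _image_to_base64_data_url(temp_path)  # e.g. "data:image/jpeg;base64,..."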
async def vision_analyze_tool(
image_url: str,
user_prompt: str,
model: str = DEFAULT_VISION_MODEL
) -> str:
"""
Analyze an image from a URL using vision AI.
This tool downloads images from URLs, converts them to base64, and processes
them using Gemini Flash via Nous Research API. The image is downloaded to a
temporary location and automatically cleaned up after processing.
The user_prompt parameter is expected to be pre-formatted by the calling
function (typically model_tools.py) to include both full description
requests and specific questions.
Args:
image_url (str): The URL of the image to analyze (must be http:// or https://)
user_prompt (str): The pre-formatted prompt for the vision model
model (str): The vision model to use (default: gemini-2.5-flash)
Returns:
str: JSON string containing the analysis results with the following structure:
{
"success": bool,
"analysis": str (defaults to error message if None)
}
Raises:
Exception: If download fails, analysis fails, or API key is not set
Note:
- Temporary images are stored in ./temp_vision_images/
- Images are automatically deleted after processing
- Supports common image formats (JPEG, PNG, GIF, WebP, etc.)
"""
debug_call_data = {
"parameters": {
"image_url": image_url,
"user_prompt": user_prompt[:200] + "..." if len(user_prompt) > 200 else user_prompt,
"model": model
},
"error": None,
"success": False,
"analysis_length": 0,
"model_used": model,
"image_size_bytes": 0
}
temp_image_path = None
try:
print(f"🔍 Analyzing image from URL: {image_url[:60]}{'...' if len(image_url) > 60 else ''}", flush=True)
print(f"📝 User prompt: {user_prompt[:100]}{'...' if len(user_prompt) > 100 else ''}", flush=True)
# Validate image URL
if not _validate_image_url(image_url):
raise ValueError("Invalid image URL format. Must start with http:// or https://")
# Check API key availability
if not os.getenv("NOUS_API_KEY"):
raise ValueError("NOUS_API_KEY environment variable not set")
# Download the image to a temporary location
print(f"⬇️ Downloading image from URL...", flush=True)
temp_dir = Path("./temp_vision_images")
temp_image_path = temp_dir / f"temp_image_{uuid.uuid4()}.jpg"
await _download_image(image_url, temp_image_path)
# Get image file size for logging
image_size_bytes = temp_image_path.stat().st_size
image_size_kb = image_size_bytes / 1024
print(f"✅ Image downloaded successfully ({image_size_kb:.1f} KB)", flush=True)
# Convert image to base64 data URL
print(f"🔄 Converting image to base64...", flush=True)
image_data_url = _image_to_base64_data_url(temp_image_path)
# Calculate size in KB for better readability
data_size_kb = len(image_data_url) / 1024
print(f"✅ Image converted to base64 ({data_size_kb:.1f} KB)", flush=True)
debug_call_data["image_size_bytes"] = image_size_bytes
# Use the prompt as provided (model_tools.py now handles full description formatting)
comprehensive_prompt = user_prompt
# Prepare the message with base64-encoded image
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": comprehensive_prompt
},
{
"type": "image_url",
"image_url": {
"url": image_data_url
}
}
]
}
]
print(f"🧠 Processing image with {model}...", flush=True)
# Call the vision API
response = await nous_client.chat.completions.create(
model=model,
messages=messages,
temperature=0.1, # Low temperature for consistent analysis
max_tokens=2000 # Generous limit for detailed analysis
)
# Extract the analysis
analysis = response.choices[0].message.content.strip()
analysis_length = len(analysis)
print(f"✅ Image analysis completed ({analysis_length} characters)", flush=True)
# Prepare successful response
result = {
"success": True,
"analysis": analysis or "There was a problem with the request and the image could not be analyzed."
}
debug_call_data["success"] = True
debug_call_data["analysis_length"] = analysis_length
# Log debug information
_log_debug_call("vision_analyze_tool", debug_call_data)
_save_debug_log()
return json.dumps(result, indent=2, ensure_ascii=False)
except Exception as e:
error_msg = f"Error analyzing image: {str(e)}"
print(f"{error_msg}", flush=True)
# Prepare error response
result = {
"success": False,
"analysis": "There was a problem with the request and the image could not be analyzed."
}
debug_call_data["error"] = error_msg
_log_debug_call("vision_analyze_tool", debug_call_data)
_save_debug_log()
return json.dumps(result, indent=2, ensure_ascii=False)
finally:
# Clean up temporary image file
if temp_image_path and temp_image_path.exists():
try:
temp_image_path.unlink()
print(f"🧹 Cleaned up temporary image file", flush=True)
except Exception as cleanup_error:
print(f"⚠️ Warning: Could not delete temporary file: {cleanup_error}", flush=True)
def check_nous_api_key() -> bool:
"""
Check if the Nous Research API key is available in environment variables.
Returns:
bool: True if API key is set, False otherwise
"""
return bool(os.getenv("NOUS_API_KEY"))
def check_vision_requirements() -> bool:
"""
Check if all requirements for vision tools are met.
Returns:
bool: True if requirements are met, False otherwise
"""
return check_nous_api_key()
def get_debug_session_info() -> Dict[str, Any]:
"""
Get information about the current debug session.
Returns:
Dict[str, Any]: Dictionary containing debug session information
"""
if not DEBUG_MODE or not DEBUG_DATA:
return {
"enabled": False,
"session_id": None,
"log_path": None,
"total_calls": 0
}
return {
"enabled": True,
"session_id": DEBUG_SESSION_ID,
"log_path": str(DEBUG_LOG_PATH / f"vision_tools_debug_{DEBUG_SESSION_ID}.json"),
"total_calls": len(DEBUG_DATA["tool_calls"])
}
if __name__ == "__main__":
"""
Simple test/demo when run directly
"""
print("👁️ Vision Tools Module")
print("=" * 40)
# Check if API key is available
api_available = check_nous_api_key()
if not api_available:
print("❌ NOUS_API_KEY environment variable not set")
print("Please set your API key: export NOUS_API_KEY='your-key-here'")
print("Get API key at: https://inference-api.nousresearch.com/")
exit(1)
else:
print("✅ Nous Research API key found")
print("🛠️ Vision tools ready for use!")
print(f"🧠 Using model: {DEFAULT_VISION_MODEL}")
# Show debug mode status
if DEBUG_MODE:
print(f"🐛 Debug mode ENABLED - Session ID: {DEBUG_SESSION_ID}")
print(f" Debug logs will be saved to: ./logs/vision_tools_debug_{DEBUG_SESSION_ID}.json")
else:
print("🐛 Debug mode disabled (set VISION_TOOLS_DEBUG=true to enable)")
print("\nBasic usage:")
print(" from vision_tools import vision_analyze_tool")
print(" import asyncio")
print("")
print(" async def main():")
print(" result = await vision_analyze_tool(")
print(" image_url='https://example.com/image.jpg',")
print(" user_prompt='What do you see in this image?'")
print(" )")
print(" print(result)")
print(" asyncio.run(main())")
print("\nExample prompts:")
print(" - 'What architectural style is this building?'")
print(" - 'Describe the emotions and mood in this image'")
print(" - 'What text can you read in this image?'")
print(" - 'Identify any safety hazards visible'")
print(" - 'What products or brands are shown?'")
print("\nDebug mode:")
print(" # Enable debug logging")
print(" export VISION_TOOLS_DEBUG=true")
print(" # Debug logs capture all vision analysis calls and results")
print(" # Logs saved to: ./logs/vision_tools_debug_UUID.json")

tools/web_tools.py Normal file

File diff suppressed because it is too large

toolset_distributions.py Normal file

@@ -0,0 +1,282 @@
#!/usr/bin/env python3
"""
Toolset Distributions Module
This module defines distributions of toolsets for data generation runs.
Each distribution specifies which toolsets should be used and their probability
of being selected for any given prompt during the batch processing.
A distribution is a dictionary mapping toolset names to their selection probability (%).
Probabilities should sum to 100, but the system will normalize if they don't.
Usage:
from toolset_distributions import get_distribution, list_distributions
# Get a specific distribution
dist = get_distribution("image_gen")
# List all available distributions
all_dists = list_distributions()
"""
from typing import Any, Dict, List, Optional
import random
from toolsets import validate_toolset
# Distribution definitions
# Each key is a distribution name, and the value is a dict of toolset_name: probability_percentage
DISTRIBUTIONS = {
# Default: All tools available 100% of the time
"default": {
"description": "All available tools, all the time",
"toolsets": {
"web": 100,
"vision": 100,
"image_gen": 100,
"terminal": 100,
"moa": 100
}
},
# Image generation focused distribution
"image_gen": {
"description": "Heavy focus on image generation with vision and web support",
"toolsets": {
"image_gen": 90, # 80% chance of image generation tools
"vision": 90, # 60% chance of vision tools
"web": 55, # 40% chance of web tools
"terminal": 45,
"moa": 10 # 20% chance of reasoning tools
}
},
# Research-focused distribution
"research": {
"description": "Web research with vision analysis and reasoning",
"toolsets": {
"web": 90, # 90% chance of web tools
"vision": 50, # 50% chance of vision tools
"moa": 40, # 40% chance of reasoning tools
"terminal": 10 # 10% chance of terminal tools
}
},
# Scientific problem solving focused distribution
"science": {
"description": "Web research with vision analysis and reasoning",
"toolsets": {
"web": 94, # 90% chance of web tools
"vision": 65, # 50% chance of vision tools
"moa": 10, # 40% chance of reasoning tools
"terminal": 94, # 10% chance of terminal tools
"image_gen": 15 # 80% chance of image generation tools
}
},
# Development-focused distribution
"development": {
"description": "Terminal and reasoning with occasional web lookup",
"toolsets": {
"terminal": 80, # 80% chance of terminal tools
"moa": 60, # 60% chance of reasoning tools
"web": 30, # 30% chance of web tools
"vision": 10 # 10% chance of vision tools
}
},
# Safe mode (no terminal)
"safe": {
"description": "All tools except terminal for safety",
"toolsets": {
"web": 80,
"vision": 60,
"image_gen": 60,
"moa": 50
}
},
# Balanced distribution
"balanced": {
"description": "Equal probability of all toolsets",
"toolsets": {
"web": 50,
"vision": 50,
"image_gen": 50,
"terminal": 50,
"moa": 50
}
},
# Minimal (web only)
"minimal": {
"description": "Only web tools for basic research",
"toolsets": {
"web": 100
}
},
# Creative (vision + image generation)
"creative": {
"description": "Image generation and vision analysis focus",
"toolsets": {
"image_gen": 90,
"vision": 90,
"web": 30
}
},
# Reasoning heavy
"reasoning": {
"description": "Heavy mixture of agents usage with minimal other tools",
"toolsets": {
"moa": 90,
"web": 30,
"terminal": 20
}
}
}
def get_distribution(name: str) -> Optional[Dict[str, Any]]:
"""
Get a toolset distribution by name.
Args:
name (str): Name of the distribution
Returns:
Dict: Distribution definition with description and toolsets
None: If distribution not found
"""
return DISTRIBUTIONS.get(name)
def list_distributions() -> Dict[str, Dict]:
"""
List all available distributions.
Returns:
Dict: All distribution definitions
"""
return DISTRIBUTIONS.copy()
def sample_toolsets_from_distribution(distribution_name: str) -> List[str]:
"""
Sample toolsets based on a distribution's probabilities.
Each toolset in the distribution has a % chance of being included.
This allows multiple toolsets to be active simultaneously.
Args:
distribution_name (str): Name of the distribution to sample from
Returns:
List[str]: List of sampled toolset names
Raises:
ValueError: If distribution name is not found
"""
dist = get_distribution(distribution_name)
if not dist:
raise ValueError(f"Unknown distribution: {distribution_name}")
# Sample each toolset independently based on its probability
selected_toolsets = []
for toolset_name, probability in dist["toolsets"].items():
# Validate toolset exists
if not validate_toolset(toolset_name):
print(f"⚠️ Warning: Toolset '{toolset_name}' in distribution '{distribution_name}' is not valid")
continue
# Roll the dice - if random value is less than probability, include this toolset
if random.random() * 100 < probability:
selected_toolsets.append(toolset_name)
# If no toolsets were selected (can happen with low probabilities),
# ensure at least one toolset is selected by picking the highest probability one
if not selected_toolsets and dist["toolsets"]:
# Find toolset with highest probability
highest_prob_toolset = max(dist["toolsets"].items(), key=lambda x: x[1])[0]
if validate_toolset(highest_prob_toolset):
selected_toolsets.append(highest_prob_toolset)
return selected_toolsets
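# Illustrative sketch: each toolset is sampled independently, so repeated draws
# from the same distribution usually return different subsets (output varies per
# run; the example values below are hypothetical).
def _example_sampling() -> None:
    for _ in range(3):
        print(sample_toolsets_from_distribution("research"))
    # e.g. ['web'], ['web', 'vision', 'moa'], ['web', 'moa']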
def validate_distribution(distribution_name: str) -> bool:
"""
Check if a distribution name is valid.
Args:
distribution_name (str): Distribution name to validate
Returns:
bool: True if valid, False otherwise
"""
return distribution_name in DISTRIBUTIONS
def print_distribution_info(distribution_name: str) -> None:
"""
Print detailed information about a distribution.
Args:
distribution_name (str): Distribution name
"""
dist = get_distribution(distribution_name)
if not dist:
print(f"❌ Unknown distribution: {distribution_name}")
return
print(f"\n📊 Distribution: {distribution_name}")
print(f" Description: {dist['description']}")
print(f" Toolsets:")
for toolset, prob in sorted(dist["toolsets"].items(), key=lambda x: x[1], reverse=True):
print(f"{toolset:15} : {prob:3}% chance")
if __name__ == "__main__":
"""
Demo and testing of the distributions system
"""
print("📊 Toolset Distributions Demo")
print("=" * 60)
# List all distributions
print("\n📋 Available Distributions:")
print("-" * 40)
for name, dist in list_distributions().items():
print(f"\n {name}:")
print(f" {dist['description']}")
toolset_list = ", ".join([f"{ts}({p}%)" for ts, p in dist["toolsets"].items()])
print(f" Toolsets: {toolset_list}")
# Demo sampling
print("\n\n🎲 Sampling Examples:")
print("-" * 40)
test_distributions = ["image_gen", "research", "balanced", "default"]
for dist_name in test_distributions:
print(f"\n{dist_name}:")
# Sample 5 times to show variability
samples = []
for _ in range(5):
sampled = sample_toolsets_from_distribution(dist_name)
samples.append(sorted(sampled))
print(f" Sample 1: {samples[0]}")
print(f" Sample 2: {samples[1]}")
print(f" Sample 3: {samples[2]}")
print(f" Sample 4: {samples[3]}")
print(f" Sample 5: {samples[4]}")
# Show detailed info
print("\n\n📊 Detailed Distribution Info:")
print("-" * 40)
print_distribution_info("image_gen")
print_distribution_info("research")

toolsets.py Normal file

@@ -0,0 +1,339 @@
#!/usr/bin/env python3
"""
Toolsets Module
This module provides a flexible system for defining and managing tool aliases/toolsets.
Toolsets allow you to group tools together for specific scenarios and can be composed
from individual tools or other toolsets.
Features:
- Define custom toolsets with specific tools
- Compose toolsets from other toolsets
- Built-in common toolsets for typical use cases
- Easy extension for new toolsets
- Support for dynamic toolset resolution
Usage:
from toolsets import get_toolset, resolve_toolset, get_all_toolsets
# Get tools for a specific toolset
tools = get_toolset("research")
# Resolve a toolset to get all tool names (including from composed toolsets)
all_tools = resolve_toolset("full_stack")
"""
from typing import List, Dict, Any, Set, Optional
import json
# Core toolset definitions
# These can include individual tools or reference other toolsets
TOOLSETS = {
# Basic toolsets - individual tool categories
"web": {
"description": "Web research and content extraction tools",
"tools": ["web_search", "web_extract", "web_crawl"],
"includes": [] # No other toolsets included
},
"vision": {
"description": "Image analysis and vision tools",
"tools": ["vision_analyze"],
"includes": []
},
"image_gen": {
"description": "Creative generation tools (images)",
"tools": ["image_generate"],
"includes": []
},
"terminal": {
"description": "Terminal/command execution tools",
"tools": ["terminal"],
"includes": []
},
"moa": {
"description": "Advanced reasoning and problem-solving tools",
"tools": ["mixture_of_agents"],
"includes": []
},
# Scenario-specific toolsets
"debugging": {
"description": "Debugging and troubleshooting toolkit",
"tools": ["terminal"],
"includes": ["web"] # For searching error messages and solutions
},
"safe": {
"description": "Safe toolkit without terminal access",
"tools": ["mixture_of_agents"],
"includes": ["web", "vision", "creative"]
}
}
def get_toolset(name: str) -> Optional[Dict[str, Any]]:
"""
Get a toolset definition by name.
Args:
name (str): Name of the toolset
Returns:
Dict: Toolset definition with description, tools, and includes
None: If toolset not found
"""
# Return toolset definition
return TOOLSETS.get(name)
def resolve_toolset(name: str, visited: Set[str] = None) -> List[str]:
"""
Recursively resolve a toolset to get all tool names.
This function handles toolset composition by recursively resolving
included toolsets and combining all tools.
Args:
name (str): Name of the toolset to resolve
visited (Set[str]): Set of already visited toolsets (for cycle detection)
Returns:
List[str]: List of all tool names in the toolset
"""
if visited is None:
visited = set()
# Special aliases that represent all tools across every toolset
# This ensures future toolsets are automatically included without changes.
if name in {"all", "*"}:
all_tools: Set[str] = set()
for toolset_name in get_toolset_names():
# Use a fresh visited set per branch to avoid cross-branch contamination
resolved = resolve_toolset(toolset_name, visited.copy())
all_tools.update(resolved)
return list(all_tools)
# Check for cycles
if name in visited:
print(f"⚠️ Circular dependency detected in toolset '{name}'")
return []
visited.add(name)
# Get toolset definition
toolset = TOOLSETS.get(name)
if not toolset:
return []
# Collect direct tools
tools = set(toolset.get("tools", []))
# Recursively resolve included toolsets
for included_name in toolset.get("includes", []):
included_tools = resolve_toolset(included_name, visited.copy())
tools.update(included_tools)
return list(tools)
def resolve_multiple_toolsets(toolset_names: List[str]) -> List[str]:
"""
Resolve multiple toolsets and combine their tools.
Args:
toolset_names (List[str]): List of toolset names to resolve
Returns:
List[str]: Combined list of all tool names (deduplicated)
"""
all_tools = set()
for name in toolset_names:
tools = resolve_toolset(name)
all_tools.update(tools)
return list(all_tools)
def get_all_toolsets() -> Dict[str, Dict[str, Any]]:
"""
Get all available toolsets with their definitions.
Returns:
Dict: All toolset definitions
"""
return TOOLSETS.copy()
def get_toolset_names() -> List[str]:
"""
Get names of all available toolsets (excluding aliases).
Returns:
List[str]: List of toolset names
"""
return list(TOOLSETS.keys())
def validate_toolset(name: str) -> bool:
"""
Check if a toolset name is valid.
Args:
name (str): Toolset name to validate
Returns:
bool: True if valid, False otherwise
"""
# Accept special alias names for convenience
if name in {"all", "*"}:
return True
return name in TOOLSETS
def create_custom_toolset(
name: str,
description: str,
tools: Optional[List[str]] = None,
includes: Optional[List[str]] = None
) -> None:
"""
Create a custom toolset at runtime.
Args:
name (str): Name for the new toolset
description (str): Description of the toolset
tools (List[str]): Direct tools to include
includes (List[str]): Other toolsets to include
"""
TOOLSETS[name] = {
"description": description,
"tools": tools or [],
"includes": includes or []
}
def get_toolset_info(name: str) -> Optional[Dict[str, Any]]:
"""
Get detailed information about a toolset including resolved tools.
Args:
name (str): Toolset name
Returns:
Dict: Detailed toolset information
"""
toolset = get_toolset(name)
if not toolset:
return None
resolved_tools = resolve_toolset(name)
return {
"name": name,
"description": toolset["description"],
"direct_tools": toolset["tools"],
"includes": toolset["includes"],
"resolved_tools": resolved_tools,
"tool_count": len(resolved_tools),
"is_composite": len(toolset["includes"]) > 0
}
def print_toolset_tree(name: str, indent: int = 0) -> None:
"""
Print a tree view of a toolset and its composition.
Args:
name (str): Toolset name
indent (int): Current indentation level
"""
prefix = " " * indent
toolset = get_toolset(name)
if not toolset:
print(f"{prefix}❌ Unknown toolset: {name}")
return
# Print toolset name and description
print(f"{prefix}📦 {name}: {toolset['description']}")
# Print direct tools
if toolset["tools"]:
print(f"{prefix} 🔧 Tools: {', '.join(toolset['tools'])}")
# Print included toolsets
if toolset["includes"]:
print(f"{prefix} 📂 Includes:")
for included in toolset["includes"]:
print_toolset_tree(included, indent + 2)
if __name__ == "__main__":
"""
Demo and testing of the toolsets system
"""
print("🎯 Toolsets System Demo")
print("=" * 60)
# Show all available toolsets
print("\n📦 Available Toolsets:")
print("-" * 40)
for name, toolset in get_all_toolsets().items():
info = get_toolset_info(name)
composite = "📂" if info["is_composite"] else "🔧"
print(f"{composite} {name:20} - {toolset['description']}")
print(f" Tools: {len(info['resolved_tools'])} total")
# Demo toolset resolution
print("\n🔍 Toolset Resolution Examples:")
print("-" * 40)
examples = ["research", "development", "full_stack", "minimal", "safe"]
for name in examples:
tools = resolve_toolset(name)
print(f"\n{name}:")
print(f" Resolved to {len(tools)} tools: {', '.join(sorted(tools))}")
# Show toolset composition tree
print("\n🌳 Toolset Composition Tree:")
print("-" * 40)
print("\nExample: 'content_creation' toolset:")
print_toolset_tree("content_creation")
print("\nExample: 'full_stack' toolset:")
print_toolset_tree("full_stack")
# Demo multiple toolset resolution
print("\n🔗 Multiple Toolset Resolution:")
print("-" * 40)
combined = resolve_multiple_toolsets(["minimal", "vision", "reasoning"])
print(f"Combining ['minimal', 'vision', 'reasoning']:")
print(f" Result: {', '.join(sorted(combined))}")
# Demo custom toolset creation
print("\n Custom Toolset Creation:")
print("-" * 40)
create_custom_toolset(
name="my_custom",
description="My custom toolset for specific tasks",
tools=["web_search"],
includes=["terminal", "vision"]
)
custom_info = get_toolset_info("my_custom")
print(f"Created 'my_custom' toolset:")
print(f" Description: {custom_info['description']}")
print(f" Resolved tools: {', '.join(custom_info['resolved_tools'])}")

web_tools.py
View File

@@ -1,265 +0,0 @@
#!/usr/bin/env python3
"""
Standalone Web Tools Module
This module provides generic web tools that work with multiple backend providers.
Currently uses Tavily as the backend, but the interface makes it easy to swap
to other providers like Firecrawl without changing the function signatures.
Available tools:
- web_search_tool: Search the web for information
- web_extract_tool: Extract content from specific web pages
- web_crawl_tool: Crawl websites with specific instructions
Backend compatibility:
- Tavily: https://docs.tavily.com/
- Firecrawl: https://docs.firecrawl.dev/features/search
Usage:
from web_tools import web_search_tool, web_extract_tool, web_crawl_tool
# Search the web
results = web_search_tool("Python machine learning libraries", limit=3)
# Extract content from URLs
content = web_extract_tool(["https://example.com"], format="markdown")
# Crawl a website
crawl_data = web_crawl_tool("example.com", "Find contact information")
"""
#TODO: Search Capabilities over the scraped pages
#TODO: Store the pages in something
#TODO: Tool to see what pages are available/saved to search over
import json
import os
import re
from typing import List
from tavily import TavilyClient
# Initialize Tavily client once at module level
tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))
def clean_base64_images(text: str) -> str:
"""
Remove base64 encoded images from text to reduce token count and clutter.
This function finds and removes base64 encoded images in various formats:
- (data:image/png;base64,...)
- (data:image/jpeg;base64,...)
- (data:image/svg+xml;base64,...)
- data:image/[type];base64,... (without parentheses)
Args:
text: The text content to clean
Returns:
Cleaned text with base64 images replaced with placeholders
"""
# Pattern to match base64 encoded images wrapped in parentheses
# Matches: (data:image/[type];base64,[base64-string])
base64_with_parens_pattern = r'\(data:image/[^;]+;base64,[A-Za-z0-9+/=]+\)'
# Pattern to match base64 encoded images without parentheses
# Matches: data:image/[type];base64,[base64-string]
base64_pattern = r'data:image/[^;]+;base64,[A-Za-z0-9+/=]+'
# Replace parentheses-wrapped images first
cleaned_text = re.sub(base64_with_parens_pattern, '[BASE64_IMAGE_REMOVED]', text)
# Then replace any remaining non-parentheses images
cleaned_text = re.sub(base64_pattern, '[BASE64_IMAGE_REMOVED]', cleaned_text)
return cleaned_text
def web_search_tool(query: str, limit: int = 5) -> str:
"""
Search the web for information using available search API backend.
This function provides a generic interface for web search that can work
with multiple backends. Currently uses Tavily but can be easily swapped.
Args:
query (str): The search query to look up
limit (int): Maximum number of results to return (default: 5)
Returns:
str: JSON string containing search results with the following structure:
{
"query": str,
"results": [
{
"title": str,
"url": str,
"content": str,
"score": float
},
...
]
}
Raises:
Exception: If search fails or API key is not set
"""
try:
print(f"🔍 Searching the web for: '{query}' (limit: {limit})")
# Use Tavily's search functionality
response = tavily_client.search(query=query, max_results=limit, search_depth="advanced")
print(f"✅ Found {len(response.get('results', []))} results")
result_json = json.dumps(response, indent=2)
# Clean base64 images from search results
return clean_base64_images(result_json)
except Exception as e:
error_msg = f"Error searching web: {str(e)}"
print(f"{error_msg}")
return json.dumps({"error": error_msg})
def web_extract_tool(urls: List[str], format: str = None) -> str:
"""
Extract content from specific web pages using available extraction API backend.
This function provides a generic interface for web content extraction that
can work with multiple backends. Currently uses Tavily but can be easily swapped.
Args:
urls (List[str]): List of URLs to extract content from
format (str): Desired output format ("markdown" or "html", optional)
Returns:
str: JSON string containing extracted content with the following structure:
{
"results": [
{
"url": str,
"title": str,
"raw_content": str,
"content": str
},
...
]
}
Raises:
Exception: If extraction fails or API key is not set
"""
try:
print(f"📄 Extracting content from {len(urls)} URL(s)")
# Use Tavily's extract functionality
response = tavily_client.extract(urls=urls, format=format)
print(f"✅ Extracted content from {len(response.get('results', []))} pages")
# Print summary of extracted pages for debugging
for result in response.get('results', []):
url = result.get('url', 'Unknown URL')
content_length = len(result.get('raw_content', ''))
print(f" 📝 {url} ({content_length} characters)")
result_json = json.dumps(response, indent=2)
# Clean base64 images from extracted content
return clean_base64_images(result_json)
except Exception as e:
error_msg = f"Error extracting content: {str(e)}"
print(f"{error_msg}")
return json.dumps({"error": error_msg})
def web_crawl_tool(url: str, instructions: str = None, depth: str = "basic") -> str:
"""
Crawl a website with specific instructions using available crawling API backend.
This function provides a generic interface for web crawling that can work
with multiple backends. Currently uses Tavily but can be easily swapped.
Args:
url (str): The base URL to crawl (can include or exclude https://)
instructions (str): Instructions for what to crawl/extract using LLM intelligence (optional)
depth (str): Depth of extraction ("basic" or "advanced", default: "basic")
Returns:
str: JSON string containing crawled content with the following structure:
{
"results": [
{
"url": str,
"title": str,
"content": str
},
...
]
}
Raises:
Exception: If crawling fails or API key is not set
"""
try:
instructions_text = f" with instructions: '{instructions}'" if instructions else ""
print(f"🕷️ Crawling {url}{instructions_text}")
# Use Tavily's crawl functionality
response = tavily_client.crawl(
url=url,
limit=20, # Reasonable limit for most use cases
instructions=instructions or "Get all available content",
extract_depth=depth
)
print(f"✅ Crawled {len(response.get('results', []))} pages")
# Print summary of crawled pages for debugging
for result in response.get('results', []):
page_url = result.get('url', 'Unknown URL')
content_length = len(result.get('content', ''))
print(f" 🌐 {page_url} ({content_length} characters)")
result_json = json.dumps(response, indent=2)
# Clean base64 images from crawled content
return clean_base64_images(result_json)
except Exception as e:
error_msg = f"Error crawling website: {str(e)}"
print(f"{error_msg}")
return json.dumps({"error": error_msg})
# Convenience function to check if API key is available
def check_tavily_api_key() -> bool:
"""
Check if the Tavily API key is available in environment variables.
Returns:
bool: True if API key is set, False otherwise
"""
return bool(os.getenv("TAVILY_API_KEY"))
if __name__ == "__main__":
"""
Simple test/demo when run directly
"""
print("🌐 Standalone Web Tools Module")
print("=" * 40)
# Check if API key is available
if not check_tavily_api_key():
print("❌ TAVILY_API_KEY environment variable not set")
print("Please set your API key: export TAVILY_API_KEY='your-key-here'")
print("Get API key at: https://tavily.com/")
exit(1)
print("✅ Tavily API key found")
print("🛠️ Web tools ready for use!")
print("\nExample usage:")
print(" from web_tools import web_search_tool, web_extract_tool, web_crawl_tool")
print(" results = web_search_tool('Python tutorials')")
print(" content = web_extract_tool(['https://example.com'])")
print(" crawl_data = web_crawl_tool('example.com', 'Find documentation')")