mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-03 01:07:31 +08:00
Salvaged from PR #3816 by 0xbyt4. Stripped unrelated changes (telegram thread retry, cache logging in quiet_mode), preserved existing beta headers (interleaved-thinking, fine-grained-tool-streaming), and rebased onto current main. New computer_use toolset: - Screenshot capture via macOS native screencapture + sips - Mouse: click, double/triple/right/middle click, drag, move - Keyboard: type text (clipboard paste for Unicode), key combos - Zoom for inspecting small screen regions at full resolution - Auto-screenshot after destructive actions (saves API round-trips) Architecture: - Dual-schema: stub (OpenAI format) for dispatch + native (computer_20251124) injected into Anthropic API calls - Provider gating: stripped from non-Anthropic providers at init - Beta API routing: messages.create → beta.messages.create when native tools present (both streaming and non-streaming) - Multimodal results: _anthropic_content_blocks on tool messages, content stays string for session DB / trajectory compatibility Token optimization: - Server-side context editing (context-management-2025-06-27 beta) - Client-side screenshot-aware pruning in context compressor - Image eviction: keeps only 3 most recent screenshots - Image-aware token estimation (flat 1500 tokens per image) Safety: - Hard-blocked key combos (empty trash, force delete, lock screen) - Blocked type patterns (curl|bash, sudo -S -p '' rm -rf, privilege escalation) - Anti-injection system prompt guidance - Approval callback wired (disabled during beta) Includes: 102 tests, 657-line macOS workflow skill (auto-loaded), feature docs page, reference catalog updates.
202 lines
7.1 KiB
Markdown
202 lines
7.1 KiB
Markdown
---
|
||
title: Computer Use
|
||
description: Control the macOS desktop via screenshots, mouse clicks, keyboard input, and scrolling using Anthropic's Computer Use API.
|
||
sidebar_label: Computer Use
|
||
sidebar_position: 6
|
||
---
|
||
|
||
# Computer Use
|
||
|
||
Hermes Agent can control your macOS desktop through Anthropic's Computer Use API — taking screenshots, clicking UI elements, typing text, scrolling, and using keyboard shortcuts. This enables the agent to interact with **any** application on your computer, not just the terminal or browser.
|
||
|
||
:::caution Beta Feature
|
||
Computer Use is in beta. It requires macOS, the Anthropic provider (`anthropic_messages` API mode), and `pyautogui` for mouse/keyboard control.
|
||
:::
|
||
|
||
## Setup
|
||
|
||
### 1. Install dependencies
|
||
|
||
```bash
|
||
uv pip install -e '.[computer-use]'
|
||
# or
|
||
pip install -e '.[computer-use]'
|
||
```
|
||
|
||
This installs `pyautogui` and its macOS dependencies (`pyobjc-framework-Quartz`).
|
||
|
||
### 2. Grant macOS permissions
|
||
|
||
The tool needs two macOS permissions:
|
||
|
||
- **Screen Recording**: System Settings → Privacy & Security → Screen Recording → add your Terminal app
|
||
- **Accessibility**: System Settings → Privacy & Security → Accessibility → add your Terminal app
|
||
|
||
After granting permissions, **fully restart Terminal** (not just new tab).
|
||
|
||
### 3. Enable the toolset
|
||
|
||
**Option A — Interactive setup (recommended):**
|
||
|
||
```bash
|
||
hermes setup tools
|
||
# or
|
||
hermes tools
|
||
```
|
||
|
||
Select `computer_use` from the checklist and choose which platforms to enable it for (CLI, Telegram, Discord, Slack, WhatsApp, Signal, Email, DingTalk).
|
||
|
||
**Option B — CLI command:**
|
||
|
||
```bash
|
||
# Enable for CLI
|
||
hermes tools enable computer_use --platform cli
|
||
|
||
# Enable for Telegram
|
||
hermes tools enable computer_use --platform telegram
|
||
|
||
# Enable for Discord
|
||
hermes tools enable computer_use --platform discord
|
||
```
|
||
|
||
**Option C — Edit `~/.hermes/config.yaml` manually:**
|
||
|
||
```yaml
|
||
platform_toolsets:
|
||
cli:
|
||
- computer_use
|
||
- terminal
|
||
- file
|
||
# ... other toolsets
|
||
telegram:
|
||
- computer_use
|
||
# ... other toolsets
|
||
```
|
||
|
||
**Option D — Enable temporarily for one session:**
|
||
|
||
```bash
|
||
hermes -t computer_use
|
||
```
|
||
|
||
## How It Works
|
||
|
||
1. **Screenshot**: Agent captures the screen and sees it via Claude's vision
|
||
2. **Decide**: Claude identifies UI elements and coordinates from the screenshot
|
||
3. **Act**: Agent performs mouse/keyboard actions at the identified coordinates
|
||
4. **Verify**: Agent takes another screenshot to confirm the action worked
|
||
|
||
The coordinate system matches your logical screen resolution (e.g., 1470×956 on a Retina MacBook). Screenshots are automatically resized to this resolution so coordinates map 1:1 to `pyautogui` — no manual scaling needed.
|
||
|
||
## Available Actions
|
||
|
||
| Action | Description | Parameters |
|
||
|--------|-------------|------------|
|
||
| `screenshot` | Capture current screen | — |
|
||
| `left_click` | Click at position | `coordinate: [x, y]` |
|
||
| `right_click` | Right-click at position | `coordinate: [x, y]` |
|
||
| `double_click` | Double-click at position | `coordinate: [x, y]` |
|
||
| `triple_click` | Triple-click (select line) | `coordinate: [x, y]` |
|
||
| `middle_click` | Middle-click at position | `coordinate: [x, y]` |
|
||
| `mouse_move` | Move cursor (drag-aware when button held) | `coordinate: [x, y]` |
|
||
| `left_click_drag` | Atomic drag from A to B | `start_coordinate`, `coordinate` |
|
||
| `left_mouse_down` | Press and hold left button | `coordinate: [x, y]` |
|
||
| `left_mouse_up` | Release left button | — |
|
||
| `type` | Type text (via clipboard paste) | `text: "hello"` |
|
||
| `key` | Press key or shortcut | `key: "command+l"` |
|
||
| `hold_key` | Press and hold a key for duration | `key: "shift"`, `duration: 2` |
|
||
| `scroll` | Scroll at position | `coordinate`, `scroll_direction`, `scroll_amount` |
|
||
| `zoom` | Inspect a screen region at full resolution | `region: [x1, y1, x2, y2]` |
|
||
| `wait` | Pause for N seconds (max 10) | `duration: 2` |
|
||
|
||
## Usage Examples
|
||
|
||
### Take a screenshot and describe it
|
||
|
||
```
|
||
You: What's on my screen?
|
||
Agent: [takes screenshot] I see Chrome open with GitHub, Terminal in the background...
|
||
```
|
||
|
||
### Open a website
|
||
|
||
```
|
||
You: Open x.com in Chrome
|
||
Agent: [activates Chrome via osascript, Cmd+L, types URL, presses Enter]
|
||
```
|
||
|
||
### Fill a form
|
||
|
||
```
|
||
You: Fill in the search box on this page
|
||
Agent: [clicks on search field, types text, presses Enter]
|
||
```
|
||
|
||
## CLI vs Gateway Mode
|
||
|
||
### CLI Mode
|
||
|
||
The terminal running Hermes has focus. After using `osascript` or `open` via the terminal tool, Terminal regains focus. The agent must re-activate the target app before typing.
|
||
|
||
### Gateway Mode (Recommended)
|
||
|
||
When running via Telegram/Discord gateway, the agent runs in the background with no terminal window. Focus issues don't occur, making this the most reliable mode for desktop automation.
|
||
|
||
Screenshots are sent as images to the chat. Each screenshot generates a unique file path (e.g., `MEDIA:/tmp/hermes_screenshot_a1b2c3d4.png`). The agent extracts this path from the tool result's `text_summary` and includes it in the response, and the gateway delivers it as a native image.
|
||
|
||
## Skills
|
||
|
||
When the `computer_use` toolset is enabled, the **macOS Computer Use** skill is automatically available. This skill teaches the agent:
|
||
|
||
- Reliable app switching patterns (osascript > Cmd+Tab > click)
|
||
- macOS keyboard shortcuts for system, browser, and text editing
|
||
- Typing via clipboard paste (keyboard layout independent)
|
||
- Scrolling alternatives when the scroll action fails
|
||
- Click accuracy strategies
|
||
- Error recovery patterns
|
||
- Safety rules (what NOT to do)
|
||
|
||
The agent loads this skill automatically when handling computer use tasks.
|
||
|
||
## Configuration
|
||
|
||
Computer Use is configured via the `computer_use` toolset. No additional environment variables are needed.
|
||
|
||
```yaml
|
||
platform_toolsets:
|
||
cli:
|
||
- computer_use # Enable for CLI
|
||
telegram:
|
||
- computer_use # Enable for Telegram gateway
|
||
discord:
|
||
- computer_use # Enable for Discord gateway
|
||
```
|
||
|
||
The tool is gated behind a requirements check — it only loads on macOS when `pyautogui` is installed.
|
||
|
||
## Limitations
|
||
|
||
- **macOS only** — not available on Linux or Windows
|
||
- **Anthropic provider only** — requires `anthropic_messages` API mode (uses beta API)
|
||
- **Primary display only** — multi-monitor setups: secondary displays are not visible
|
||
- **Coordinate accuracy**: ~1-2px after scaling — precise for most UI targets
|
||
- **Type overwrites clipboard** — the `type` action uses `pbcopy` + `Cmd+V`
|
||
- **Scroll unreliable** — use keyboard shortcuts (`space`, `Page_Down`) as fallback
|
||
- **Wait capped at 10s** — chain multiple waits for longer pauses
|
||
- **No Touch Bar** — Touch Bar interactions not supported
|
||
- **No Spaces/Mission Control** — full-screen spaces not navigable
|
||
|
||
## Troubleshooting
|
||
|
||
### "No such file or directory: '['"
|
||
Coordinate formatting issue — fixed in latest version. Update your Hermes installation.
|
||
|
||
### Screenshots return empty
|
||
Missing Screen Recording permission. Grant it in System Settings → Privacy & Security → Screen Recording and restart Terminal.
|
||
|
||
### Clicks/typing don't work
|
||
Missing Accessibility permission. Grant it in System Settings → Privacy & Security → Accessibility and restart Terminal.
|
||
|
||
### Tool not loading
|
||
Ensure `pyautogui` is installed (`pip install pyautogui`) and you're on macOS. Check `hermes doctor` for tool availability.
|