hermes-agent/website/docs/user-guide/features/computer-use.md
Teknium 8e3803f3ce feat: Computer Use Tool — macOS desktop control via Anthropic native API
Salvaged from PR #3816 by 0xbyt4. Stripped unrelated changes (telegram
thread retry, cache logging in quiet_mode), preserved existing beta
headers (interleaved-thinking, fine-grained-tool-streaming), and
rebased onto current main.

New computer_use toolset:
- Screenshot capture via macOS native screencapture + sips
- Mouse: click, double/triple/right/middle click, drag, move
- Keyboard: type text (clipboard paste for Unicode), key combos
- Zoom for inspecting small screen regions at full resolution
- Auto-screenshot after destructive actions (saves API round-trips)

Architecture:
- Dual-schema: stub (OpenAI format) for dispatch + native
  (computer_20251124) injected into Anthropic API calls
- Provider gating: stripped from non-Anthropic providers at init
- Beta API routing: messages.create → beta.messages.create when
  native tools present (both streaming and non-streaming)
- Multimodal results: _anthropic_content_blocks on tool messages,
  content stays string for session DB / trajectory compatibility

Token optimization:
- Server-side context editing (context-management-2025-06-27 beta)
- Client-side screenshot-aware pruning in context compressor
- Image eviction: keeps only 3 most recent screenshots
- Image-aware token estimation (flat 1500 tokens per image)

Safety:
- Hard-blocked key combos (empty trash, force delete, lock screen)
- Blocked type patterns (curl|bash, sudo -S -p '' rm -rf, privilege escalation)
- Anti-injection system prompt guidance
- Approval callback wired (disabled during beta)

Includes: 102 tests, 657-line macOS workflow skill (auto-loaded),
feature docs page, reference catalog updates.
2026-04-02 01:59:32 -07:00

---
title: Computer Use
description: Control the macOS desktop via screenshots, mouse clicks, keyboard input, and scrolling using Anthropic's Computer Use API.
sidebar_label: Computer Use
sidebar_position: 6
---
# Computer Use
Hermes Agent can control your macOS desktop through Anthropic's Computer Use API — taking screenshots, clicking UI elements, typing text, scrolling, and using keyboard shortcuts. This enables the agent to interact with **any** application on your computer, not just the terminal or browser.
:::caution Beta Feature
Computer Use is in beta. It requires macOS, the Anthropic provider (`anthropic_messages` API mode), and `pyautogui` for mouse/keyboard control.
:::
## Setup
### 1. Install dependencies
```bash
uv pip install -e '.[computer-use]'
# or
pip install -e '.[computer-use]'
```
This installs `pyautogui` and its macOS dependencies (`pyobjc-framework-Quartz`).
### 2. Grant macOS permissions
The tool needs two macOS permissions:
- **Screen Recording**: System Settings → Privacy & Security → Screen Recording → add your Terminal app
- **Accessibility**: System Settings → Privacy & Security → Accessibility → add your Terminal app
After granting permissions, **fully quit and restart Terminal** (opening a new tab is not enough).
### 3. Enable the toolset
**Option A — Interactive setup (recommended):**
```bash
hermes setup tools
# or
hermes tools
```
Select `computer_use` from the checklist and choose which platforms to enable it for (CLI, Telegram, Discord, Slack, WhatsApp, Signal, Email, DingTalk).
**Option B — CLI command:**
```bash
# Enable for CLI
hermes tools enable computer_use --platform cli
# Enable for Telegram
hermes tools enable computer_use --platform telegram
# Enable for Discord
hermes tools enable computer_use --platform discord
```
**Option C — Edit `~/.hermes/config.yaml` manually:**
```yaml
platform_toolsets:
  cli:
    - computer_use
    - terminal
    - file
    # ... other toolsets
  telegram:
    - computer_use
    # ... other toolsets
```
**Option D — Enable temporarily for one session:**
```bash
hermes -t computer_use
```
## How It Works
1. **Screenshot**: Agent captures the screen and sees it via Claude's vision
2. **Decide**: Claude identifies UI elements and coordinates from the screenshot
3. **Act**: Agent performs mouse/keyboard actions at the identified coordinates
4. **Verify**: Agent takes another screenshot to confirm the action worked
The coordinate system matches your logical screen resolution (e.g., 1470×956 on a Retina MacBook). Screenshots are automatically resized to this resolution so coordinates map 1:1 to `pyautogui` — no manual scaling needed.
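The 1:1 mapping can be illustrated with a minimal sketch. It assumes a capture taken at physical pixel resolution (for example 2940×1912 on a 2× Retina display) that is scaled down to the logical 1470×956 size. The function name `to_logical` and the sample sizes are hypothetical and are not taken from the Hermes source.

```python
def to_logical(px, py, physical_size, logical_size):
    """Map a point in a physical-pixel capture to logical screen points."""
    pw, ph = physical_size
    lw, lh = logical_size
    # Scale each axis independently, then clamp to the visible area.
    x = round(px * lw / pw)
    y = round(py * lh / ph)
    return min(max(x, 0), lw - 1), min(max(y, 0), lh - 1)


# Hypothetical 2x Retina display: 2940x1912 physical, 1470x956 logical.
print(to_logical(1000, 500, (2940, 1912), (1470, 956)))  # (500, 250)
```

Because screenshots are already resized before Claude sees them, the agent itself never needs to perform this conversion; the sketch only shows why the resized image and the `pyautogui` coordinate space agree.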
## Available Actions
| Action | Description | Parameters |
|--------|-------------|------------|
| `screenshot` | Capture current screen | — |
| `left_click` | Click at position | `coordinate: [x, y]` |
| `right_click` | Right-click at position | `coordinate: [x, y]` |
| `double_click` | Double-click at position | `coordinate: [x, y]` |
| `triple_click` | Triple-click (select line) | `coordinate: [x, y]` |
| `middle_click` | Middle-click at position | `coordinate: [x, y]` |
| `mouse_move` | Move cursor (drag-aware when button held) | `coordinate: [x, y]` |
| `left_click_drag` | Atomic drag from A to B | `start_coordinate`, `coordinate` |
| `left_mouse_down` | Press and hold left button | `coordinate: [x, y]` |
| `left_mouse_up` | Release left button | — |
| `type` | Type text (via clipboard paste) | `text: "hello"` |
| `key` | Press key or shortcut | `key: "command+l"` |
| `hold_key` | Press and hold a key for duration | `key: "shift"`, `duration: 2` |
| `scroll` | Scroll at position | `coordinate`, `scroll_direction`, `scroll_amount` |
| `zoom` | Inspect a screen region at full resolution | `region: [x1, y1, x2, y2]` |
| `wait` | Pause for N seconds (max 10) | `duration: 2` |
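
As a rough illustration of how the table above could be checked in code, the following sketch builds an action payload and rejects calls with missing parameters. It assumes every parameter listed in the table is required and that payloads are plain dictionaries with an `action` key; `REQUIRED_PARAMS` and `build_action` are hypothetical names, not part of the Hermes API.

```python
# Parameters per action, transcribed from the table above.
REQUIRED_PARAMS = {
    "screenshot": (),
    "left_click": ("coordinate",),
    "right_click": ("coordinate",),
    "double_click": ("coordinate",),
    "triple_click": ("coordinate",),
    "middle_click": ("coordinate",),
    "mouse_move": ("coordinate",),
    "left_click_drag": ("start_coordinate", "coordinate"),
    "left_mouse_down": ("coordinate",),
    "left_mouse_up": (),
    "type": ("text",),
    "key": ("key",),
    "hold_key": ("key", "duration"),
    "scroll": ("coordinate", "scroll_direction", "scroll_amount"),
    "zoom": ("region",),
    "wait": ("duration",),
}

MAX_WAIT_SECONDS = 10  # documented cap for the wait action


def build_action(action, **params):
    """Return an action payload, raising ValueError on invalid input."""
    if action not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action: {action}")
    missing = [p for p in REQUIRED_PARAMS[action] if p not in params]
    if missing:
        raise ValueError(f"{action} is missing: {', '.join(missing)}")
    if action == "wait" and params["duration"] > MAX_WAIT_SECONDS:
        raise ValueError(f"wait is capped at {MAX_WAIT_SECONDS} seconds")
    return {"action": action, **params}


print(build_action("left_click", coordinate=[100, 200]))
```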
## Usage Examples
### Take a screenshot and describe it
```
You: What's on my screen?
Agent: [takes screenshot] I see Chrome open with GitHub, Terminal in the background...
```
### Open a website
```
You: Open x.com in Chrome
Agent: [activates Chrome via osascript, Cmd+L, types URL, presses Enter]
```
### Fill a form
```
You: Fill in the search box on this page
Agent: [clicks on search field, types text, presses Enter]
```
## CLI vs Gateway Mode
### CLI Mode
The terminal running Hermes has focus. After using `osascript` or `open` via the terminal tool, Terminal regains focus. The agent must re-activate the target app before typing.
### Gateway Mode (Recommended)
When running via Telegram/Discord gateway, the agent runs in the background with no terminal window. Focus issues don't occur, making this the most reliable mode for desktop automation.
Screenshots are sent to the chat as images. Each screenshot is saved to a unique file path (e.g., `MEDIA:/tmp/hermes_screenshot_a1b2c3d4.png`). The agent extracts this path from the tool result's `text_summary` and includes it in its response; the gateway then delivers the file as a native image.
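
The path extraction step might look like the sketch below. The exact layout of `text_summary` is not documented here, so this assumes only that the path follows a `MEDIA:` prefix, contains no whitespace, and ends in `.png`; `extract_media_paths` is a hypothetical helper.

```python
import re

# Assumed format: "MEDIA:" followed by a whitespace-free path ending in .png
MEDIA_PATTERN = re.compile(r"MEDIA:(\S+\.png)")


def extract_media_paths(text_summary):
    """Return every screenshot path referenced in a tool result summary."""
    return MEDIA_PATTERN.findall(text_summary)


summary = "Screenshot captured. MEDIA:/tmp/hermes_screenshot_a1b2c3d4.png"
print(extract_media_paths(summary))
```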
## Skills
When the `computer_use` toolset is enabled, the **macOS Computer Use** skill is automatically available. This skill teaches the agent:
- Reliable app switching patterns (osascript > Cmd+Tab > click)
- macOS keyboard shortcuts for system, browser, and text editing
- Typing via clipboard paste (keyboard layout independent)
- Scrolling alternatives when the scroll action fails
- Click accuracy strategies
- Error recovery patterns
- Safety rules (what NOT to do)
The agent loads this skill automatically when handling computer use tasks.
## Configuration
Computer Use is configured via the `computer_use` toolset. No additional environment variables are needed.
```yaml
platform_toolsets:
  cli:
    - computer_use # Enable for CLI
  telegram:
    - computer_use # Enable for Telegram gateway
  discord:
    - computer_use # Enable for Discord gateway
```
The tool is gated behind a requirements check — it only loads on macOS when `pyautogui` is installed.
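
A requirements check of this kind can be approximated as follows. This is a minimal sketch under the assumption that the check tests only the platform and whether `pyautogui` is importable; the function name and its parameters are hypothetical, and the actual Hermes check may test more than this.

```python
import importlib.util
import sys


def computer_use_available(platform=None, module="pyautogui"):
    """Return True when running on macOS with the given module installed."""
    platform = sys.platform if platform is None else platform
    if platform != "darwin":  # sys.platform is "darwin" on macOS
        return False
    # find_spec returns None when a top-level module cannot be located.
    return importlib.util.find_spec(module) is not None


print(computer_use_available())
```

The `platform` and `module` parameters exist only so the check can be exercised on other systems.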
## Limitations
- **macOS only** — not available on Linux or Windows
- **Anthropic provider only** — requires `anthropic_messages` API mode (uses beta API)
- **Primary display only**: on multi-monitor setups, secondary displays are not visible to the agent
- **Coordinate accuracy**: ~1-2px after scaling — precise for most UI targets
- **Type overwrites clipboard** — the `type` action uses `pbcopy` + `Cmd+V`
- **Scroll unreliable** — use keyboard shortcuts (`space`, `Page_Down`) as fallback
- **Wait capped at 10s** — chain multiple waits for longer pauses
- **No Touch Bar** — Touch Bar interactions not supported
- **No Spaces/Mission Control** — full-screen spaces not navigable
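
To illustrate the wait cap noted in the list above, a longer pause can be split into several `wait` actions of at most 10 seconds each. This is a sketch only; `chain_waits` is a hypothetical helper, and the dictionary shape is an assumption based on the Available Actions table.

```python
def chain_waits(total_seconds, cap=10):
    """Split a long pause into wait actions that each respect the cap."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    actions = []
    remaining = total_seconds
    while remaining > 0:
        step = min(remaining, cap)
        actions.append({"action": "wait", "duration": step})
        remaining -= step
    return actions


# A 25 second pause becomes waits of 10, 10, and 5 seconds.
print([a["duration"] for a in chain_waits(25)])  # [10, 10, 5]
```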
## Troubleshooting
### "No such file or directory: '['"
This error was caused by a coordinate formatting bug that has since been fixed. Update your Hermes installation.
### Screenshots return empty
Missing Screen Recording permission. Grant it in System Settings → Privacy & Security → Screen Recording and restart Terminal.
### Clicks/typing don't work
Missing Accessibility permission. Grant it in System Settings → Privacy & Security → Accessibility and restart Terminal.
### Tool not loading
Ensure `pyautogui` is installed (`pip install pyautogui`) and you're on macOS. Check `hermes doctor` for tool availability.