Salvaged from PR #3816 by 0xbyt4. Stripped unrelated changes (telegram thread retry, cache logging in quiet_mode), preserved existing beta headers (interleaved-thinking, fine-grained-tool-streaming), and rebased onto current main.

New computer_use toolset:
- Screenshot capture via macOS native screencapture + sips
- Mouse: click, double/triple/right/middle click, drag, move
- Keyboard: type text (clipboard paste for Unicode), key combos
- Zoom for inspecting small screen regions at full resolution
- Auto-screenshot after destructive actions (saves API round-trips)

Architecture:
- Dual-schema: stub (OpenAI format) for dispatch + native (computer_20251124) injected into Anthropic API calls
- Provider gating: stripped from non-Anthropic providers at init
- Beta API routing: messages.create → beta.messages.create when native tools present (both streaming and non-streaming)
- Multimodal results: _anthropic_content_blocks on tool messages, content stays string for session DB / trajectory compatibility

Token optimization:
- Server-side context editing (context-management-2025-06-27 beta)
- Client-side screenshot-aware pruning in context compressor
- Image eviction: keeps only 3 most recent screenshots
- Image-aware token estimation (flat 1500 tokens per image)

Safety:
- Hard-blocked key combos (empty trash, force delete, lock screen)
- Blocked type patterns (curl|bash, sudo -S -p '' rm -rf, privilege escalation)
- Anti-injection system prompt guidance
- Approval callback wired (disabled during beta)

Includes: 102 tests, 657-line macOS workflow skill (auto-loaded), feature docs page, reference catalog updates.
| title | description | sidebar_label | sidebar_position |
|---|---|---|---|
| Computer Use | Control the macOS desktop via screenshots, mouse clicks, keyboard input, and scrolling using Anthropic's Computer Use API. | Computer Use | 6 |
Computer Use
Hermes Agent can control your macOS desktop through Anthropic's Computer Use API — taking screenshots, clicking UI elements, typing text, scrolling, and using keyboard shortcuts. This enables the agent to interact with any application on your computer, not just the terminal or browser.
:::caution Beta Feature
Computer Use is in beta. It requires macOS, the Anthropic provider (anthropic_messages API mode), and pyautogui for mouse/keyboard control.
:::
Setup
1. Install dependencies
uv pip install -e '.[computer-use]'
# or
pip install -e '.[computer-use]'
This installs pyautogui and its macOS dependencies (pyobjc-framework-Quartz).
2. Grant macOS permissions
The tool needs two macOS permissions:
- Screen Recording: System Settings → Privacy & Security → Screen Recording → add your Terminal app
- Accessibility: System Settings → Privacy & Security → Accessibility → add your Terminal app
After granting permissions, quit and relaunch Terminal. Opening a new tab or window is not enough.
3. Enable the toolset
Option A — Interactive setup (recommended):
hermes setup tools
# or
hermes tools
Select computer_use from the checklist and choose which platforms to enable it for (CLI, Telegram, Discord, Slack, WhatsApp, Signal, Email, DingTalk).
Option B — CLI command:
# Enable for CLI
hermes tools enable computer_use --platform cli
# Enable for Telegram
hermes tools enable computer_use --platform telegram
# Enable for Discord
hermes tools enable computer_use --platform discord
Option C — Edit ~/.hermes/config.yaml manually:
platform_toolsets:
  cli:
    - computer_use
    - terminal
    - file
    # ... other toolsets
  telegram:
    - computer_use
    # ... other toolsets
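For scripted setups, the same edit can be made programmatically. This is a minimal sketch that assumes the config has already been loaded into a plain dict with the platform_toolsets layout shown above; `enable_toolset` is a hypothetical helper for illustration, not part of Hermes.

```python
from copy import deepcopy


def enable_toolset(config: dict, platform: str, toolset: str = "computer_use") -> dict:
    """Return a copy of config with toolset listed under platform_toolsets[platform]."""
    updated = deepcopy(config)
    toolsets = updated.setdefault("platform_toolsets", {}).setdefault(platform, [])
    if toolset not in toolsets:  # avoid duplicate entries on repeated runs
        toolsets.append(toolset)
    return updated


config = {"platform_toolsets": {"cli": ["terminal", "file"]}}
config = enable_toolset(config, "cli")
config = enable_toolset(config, "telegram")
print(config["platform_toolsets"])
```

The helper copies the input instead of mutating it, so a failed write later on cannot leave a half-edited config in memory.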
Option D — Enable temporarily for one session:
hermes -t computer_use
How It Works
- Screenshot: Agent captures the screen and sees it via Claude's vision
- Decide: Claude identifies UI elements and coordinates from the screenshot
- Act: Agent performs mouse/keyboard actions at the identified coordinates
- Verify: Agent takes another screenshot to confirm the action worked
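The four steps above form a loop. The sketch below shows its shape with the screenshot, decision, and action steps passed in as plain callables; `run_task` and its parameters are hypothetical names for illustration and do not reflect how Hermes is actually structured internally.

```python
from typing import Callable, Optional


def run_task(
    take_screenshot: Callable[[], bytes],
    decide: Callable[[bytes], Optional[dict]],
    act: Callable[[dict], None],
    max_steps: int = 20,
) -> int:
    """Run the screenshot -> decide -> act -> verify loop; return actions performed."""
    steps = 0
    screenshot = take_screenshot()        # Screenshot
    while steps < max_steps:
        action = decide(screenshot)       # Decide: None means the task is done
        if action is None:
            break
        act(action)                       # Act
        steps += 1
        screenshot = take_screenshot()    # Verify: a fresh screenshot feeds the next decision
    return steps


# Stub run: two planned actions, no real screen or model involved.
planned = [
    {"action": "left_click", "coordinate": [100, 200]},
    {"action": "type", "text": "hello"},
]
performed = []
count = run_task(
    take_screenshot=lambda: b"fake-png",
    decide=lambda _img: planned[len(performed)] if len(performed) < len(planned) else None,
    act=performed.append,
)
print(count, performed)
```

The `max_steps` cap is an assumption added here so that a model that never reports completion cannot loop forever.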
The coordinate system matches your logical screen resolution (e.g., 1470×956 on a Retina MacBook). Screenshots are automatically resized to this resolution so coordinates map 1:1 to pyautogui — no manual scaling needed.
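To illustrate what the automatic resizing accounts for, here is a small sketch of the mapping from raw capture pixels to logical coordinates. It assumes a 2x Retina capture of 2940×1912 for the 1470×956 logical size quoted above; `to_logical` is a hypothetical helper, and Hermes performs the equivalent step for you.

```python
def to_logical(point, captured_size, logical_size):
    """Map a pixel in the raw capture to logical screen coordinates."""
    (x, y), (cw, ch), (lw, lh) = point, captured_size, logical_size
    lx = round(x * lw / cw)
    ly = round(y * lh / ch)
    # Clamp so rounding never produces a point outside the screen.
    return min(max(lx, 0), lw - 1), min(max(ly, 0), lh - 1)


RETINA_CAPTURE = (2940, 1912)  # assumed 2x backing-store size
LOGICAL = (1470, 956)          # logical size quoted above

print(to_logical((2000, 1000), RETINA_CAPTURE, LOGICAL))  # (1000, 500)
```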
Available Actions
| Action | Description | Parameters |
|---|---|---|
| screenshot | Capture current screen | — |
| left_click | Click at position | coordinate: [x, y] |
| right_click | Right-click at position | coordinate: [x, y] |
| double_click | Double-click at position | coordinate: [x, y] |
| triple_click | Triple-click (select line) | coordinate: [x, y] |
| middle_click | Middle-click at position | coordinate: [x, y] |
| mouse_move | Move cursor (drag-aware when button held) | coordinate: [x, y] |
| left_click_drag | Atomic drag from A to B | start_coordinate, coordinate |
| left_mouse_down | Press and hold left button | coordinate: [x, y] |
| left_mouse_up | Release left button | — |
| type | Type text (via clipboard paste) | text: "hello" |
| key | Press key or shortcut | key: "command+l" |
| hold_key | Press and hold a key for duration | key: "shift", duration: 2 |
| scroll | Scroll at position | coordinate, scroll_direction, scroll_amount |
| zoom | Inspect a screen region at full resolution | region: [x1, y1, x2, y2] |
| wait | Pause for N seconds (max 10) | duration: 2 |
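The table can be read as a small schema. The sketch below encodes the parameters listed for each action and reports any that a call is missing. The flat dict layout with an `action` field, and the treatment of every listed parameter as required, are assumptions made for illustration; `missing_params` is a hypothetical helper and not Hermes's own validation.

```python
# Parameters per action, transcribed from the table above.
REQUIRED_PARAMS = {
    "screenshot": [],
    "left_click": ["coordinate"],
    "right_click": ["coordinate"],
    "double_click": ["coordinate"],
    "triple_click": ["coordinate"],
    "middle_click": ["coordinate"],
    "mouse_move": ["coordinate"],
    "left_click_drag": ["start_coordinate", "coordinate"],
    "left_mouse_down": ["coordinate"],
    "left_mouse_up": [],
    "type": ["text"],
    "key": ["key"],
    "hold_key": ["key", "duration"],
    "scroll": ["coordinate", "scroll_direction", "scroll_amount"],
    "zoom": ["region"],
    "wait": ["duration"],
}


def missing_params(call: dict) -> list:
    """Return the names of parameters the given action call is missing."""
    action = call.get("action")
    if action not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action: {action!r}")
    return [name for name in REQUIRED_PARAMS[action] if name not in call]


print(missing_params({"action": "left_click", "coordinate": [640, 400]}))  # []
print(missing_params({"action": "scroll", "coordinate": [640, 400]}))
```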
Usage Examples
Take a screenshot and describe it
You: What's on my screen?
Agent: [takes screenshot] I see Chrome open with GitHub, Terminal in the background...
Open a website
You: Open x.com in Chrome
Agent: [activates Chrome via osascript, Cmd+L, types URL, presses Enter]
Fill a form
You: Fill in the search box on this page
Agent: [clicks on search field, types text, presses Enter]
CLI vs Gateway Mode
CLI Mode
The terminal running Hermes has focus. After using osascript or open via the terminal tool, Terminal regains focus. The agent must re-activate the target app before typing.
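Re-activating the target app can be done with the standard AppleScript `activate` command via osascript. A minimal sketch follows; `activate_command` and `activate_app` are hypothetical helper names, and the agent may issue the equivalent command differently.

```python
import subprocess


def activate_command(app_name: str) -> list:
    """Build the osascript argv that brings app_name to the foreground."""
    # Escape backslashes and double quotes so the name stays inside the AppleScript string.
    escaped = app_name.replace("\\", "\\\\").replace('"', '\\"')
    return ["osascript", "-e", f'tell application "{escaped}" to activate']


def activate_app(app_name: str) -> None:
    """Run the command (macOS only); raises CalledProcessError on failure."""
    subprocess.run(activate_command(app_name), check=True)


print(activate_command("Google Chrome"))
```

Passing the command as an argv list rather than a shell string avoids a second layer of shell quoting.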
Gateway Mode (Recommended)
When running via Telegram/Discord gateway, the agent runs in the background with no terminal window. Focus issues don't occur, making this the most reliable mode for desktop automation.
Screenshots are sent as images to the chat. Each screenshot generates a unique file path (e.g., MEDIA:/tmp/hermes_screenshot_a1b2c3d4.png). The agent extracts this path from the tool result's text_summary and includes it in the response, and the gateway delivers it as a native image.
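Extracting the path from the summary text comes down to a pattern match on the MEDIA: prefix. The sketch below assumes the summary is free text that contains the marker in the form shown above; the surrounding wording of the summary and the `extract_media_path` helper are hypothetical.

```python
import re
from typing import Optional

# Matches the MEDIA:<absolute path>.png marker shown in the example above.
MEDIA_PATTERN = re.compile(r"MEDIA:(/\S+\.png)")


def extract_media_path(text_summary: str) -> Optional[str]:
    """Return the screenshot path embedded in a tool result summary, if any."""
    match = MEDIA_PATTERN.search(text_summary)
    return match.group(1) if match else None


summary = "Screenshot captured. MEDIA:/tmp/hermes_screenshot_a1b2c3d4.png"
print(extract_media_path(summary))
```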
Skills
When the computer_use toolset is enabled, the macOS Computer Use skill is automatically available. This skill teaches the agent:
- Reliable app switching patterns (osascript > Cmd+Tab > click)
- macOS keyboard shortcuts for system, browser, and text editing
- Typing via clipboard paste (keyboard layout independent)
- Scrolling alternatives when the scroll action fails
- Click accuracy strategies
- Error recovery patterns
- Safety rules (what NOT to do)
The agent loads this skill automatically when handling computer use tasks.
Configuration
Computer Use is configured via the computer_use toolset. No additional environment variables are needed.
platform_toolsets:
  cli:
    - computer_use    # Enable for CLI
  telegram:
    - computer_use    # Enable for Telegram gateway
  discord:
    - computer_use    # Enable for Discord gateway
The tool is gated behind a requirements check — it only loads on macOS when pyautogui is installed.
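A requirements check of this kind can be sketched in a few lines. This is an illustration only, not the actual Hermes gate: `computer_use_available` is a hypothetical name, and the optional arguments exist only so the logic can be exercised on any machine.

```python
import importlib.util
import platform
from typing import Optional


def computer_use_available(
    system: Optional[str] = None,
    has_pyautogui: Optional[bool] = None,
) -> bool:
    """True only on macOS with pyautogui importable."""
    if system is None:
        system = platform.system()  # "Darwin" on macOS
    if has_pyautogui is None:
        # find_spec locates the package without importing it.
        has_pyautogui = importlib.util.find_spec("pyautogui") is not None
    return system == "Darwin" and has_pyautogui


print(computer_use_available())
```

Using `find_spec` rather than a real import keeps the check cheap and avoids side effects that importing a GUI library can have on a headless machine.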
Limitations
- macOS only: not available on Linux or Windows
- Anthropic provider only: requires anthropic_messages API mode (uses beta API)
- Primary display only: in multi-monitor setups, secondary displays are not visible
- Coordinate accuracy: ~1-2px after scaling, precise enough for most UI targets
- Type overwrites clipboard: the type action uses pbcopy + Cmd+V
- Scroll unreliable: use keyboard shortcuts (space, Page_Down) as fallback
- Wait capped at 10s: chain multiple waits for longer pauses
- No Touch Bar: Touch Bar interactions not supported
- No Spaces/Mission Control: full-screen spaces not navigable
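As an example of working within the 10-second wait cap, a longer pause can be split into consecutive wait calls. The sketch below assumes the same flat action dict layout as the actions table; `chain_waits` is a hypothetical helper.

```python
MAX_WAIT_SECONDS = 10  # cap stated in the limitations above


def chain_waits(total_seconds: float) -> list:
    """Split a long pause into wait actions of at most MAX_WAIT_SECONDS each."""
    calls = []
    remaining = total_seconds
    while remaining > 0:
        chunk = min(remaining, MAX_WAIT_SECONDS)
        calls.append({"action": "wait", "duration": chunk})
        remaining -= chunk
    return calls


print(chain_waits(25))
```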
Troubleshooting
"No such file or directory: '['"
This was a coordinate formatting bug that is fixed in the latest version. Update your Hermes installation.
Screenshots return empty
Missing Screen Recording permission. Grant it in System Settings → Privacy & Security → Screen Recording and restart Terminal.
Clicks/typing don't work
Missing Accessibility permission. Grant it in System Settings → Privacy & Security → Accessibility and restart Terminal.
Tool not loading
Ensure pyautogui is installed (pip install pyautogui) and you're on macOS. Check hermes doctor for tool availability.