Computer Use
Hermes Agent can drive your desktop — clicking, typing, scrolling, and
dragging — through one model-agnostic computer_use tool. On macOS it uses
cua-driver for background control. On Windows it uses UI Automation for the
element tree and SendInput for mouse/keyboard actions.
Unlike most computer-use integrations, this works with any tool-capable
model — Claude, GPT, Gemini, or an open model on a local vLLM endpoint.
There's no Anthropic-native schema to worry about.
How it works
On macOS, the computer_use toolset speaks MCP over stdio to
cua-driver, a driver that uses SkyLight
private SPIs (SLEventPostToPid,
SLPSPostEventRecordTo) and the _AXObserverAddNotificationAndCheckRemote
accessibility SPI to:
- Post synthesized events directly to target processes — no HID event tap, no cursor warp.
- Flip AppKit active-state without raising windows — no Space switching.
- Keep Chromium/Electron accessibility trees alive when windows are occluded.
OpenAI's Codex "background computer-use" ships the same combination; cua-driver is the open-source equivalent.
On Windows, Hermes uses the uiautomation package to enumerate controls and
set native values, Pillow for screenshots, and pywin32/SendInput for window
focus and mouse/keyboard injection. Windows cannot post input to background
windows, so pointer and keyboard actions briefly foreground the target window.
set_value is the exception: when the target control exposes the right UIA
pattern, Hermes can set it without moving focus.
Enabling
On Windows, install Hermes normally and enable Computer Use from
hermes tools; the Python dependencies are included in the Windows install.
On macOS, pick whichever path is most convenient — both run the same upstream installer:
Option 1: dedicated CLI command (most direct).
hermes computer-use install
This fetches and runs the upstream cua-driver installer:
curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh.
Use hermes computer-use status to verify the install.
Option 2: enable the toolset interactively.
- Run hermes tools, pick 🖱️ Computer Use → cua-driver (background).
- The setup runs the upstream installer (same as Option 1).
After installing, regardless of which path you took:
- Grant macOS permissions when prompted:
- System Settings → Privacy & Security → Accessibility → allow the terminal (or Hermes app).
- System Settings → Privacy & Security → Screen Recording → allow the same.
- Start a session with the toolset enabled:
hermes -t computer_use chat
or add computer_use to your enabled toolsets in ~/.hermes/config.yaml.
Keeping cua-driver up to date
The cua-driver project ships fixes regularly (e.g. v0.1.6 fixed a Safari window-focus bug for UTM workflows). Hermes refreshes the binary in two places so you don't get stuck on a stale release:
- hermes update: when you update Hermes itself, if cua-driver is on PATH the upstream installer re-runs at the end of the update. No-op for non-macOS users and for users without cua-driver installed.
- hermes computer-use install --upgrade: manual force-refresh. Re-runs the upstream installer regardless of whether cua-driver is already installed. Use this when you want the latest fix without waiting for the next agent update.
hermes computer-use status shows the installed version next to the
binary path.
Quick example
User prompt: "Find my latest email from Stripe and summarise what they want me to do."
The agent's plan:
- computer_use(action="capture", mode="som", app="Mail"): gets a screenshot of Mail with every sidebar item, toolbar button, and message row numbered.
- computer_use(action="click", element=14): clicks the search field (element #14 from the capture).
- computer_use(action="type", text="from:stripe")
- computer_use(action="key", keys="return", capture_after=True): submit and get the new screenshot.
- Click the top result, read the body, summarise.
On macOS, your cursor stays wherever you left it and Mail never comes to
front. On Windows, the target window is foregrounded while pointer/keyboard
actions run; prefer set_value for form fields and dropdowns when possible.
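The first four steps of the plan above can be written out as plain data. This is a minimal sketch: the action names and arguments are copied from the example, while the PLAN list, the dict shape, and the needs_recapture helper are hypothetical names used only for illustration and are not part of Hermes.

```python
import json

# Arguments copied from the example plan; the list-of-dicts shape is an
# assumption made for illustration, not Hermes's wire format.
PLAN = [
    {"action": "capture", "mode": "som", "app": "Mail"},
    {"action": "click", "element": 14},
    {"action": "type", "text": "from:stripe"},
    {"action": "key", "keys": "return", "capture_after": True},
]


def needs_recapture(step):
    """Return True when a step changes UI state without returning a new capture.

    After such a step, previously captured element numbers should be
    treated as stale until another capture is taken.
    """
    if step["action"] == "capture":
        return False
    return not step.get("capture_after", False)


if __name__ == "__main__":
    for step in PLAN:
        print(json.dumps(step), "recapture needed:", needs_recapture(step))
```

The helper makes one point from the example explicit: the click and type steps leave element numbers stale, while the final key step asks for a fresh screenshot in the same call.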
Provider compatibility
| Provider | Vision? | Works? | Notes |
|---|---|---|---|
| Anthropic (Claude Sonnet/Opus 3+) | ✅ | ✅ | Best overall; SOM + raw coordinates. |
| OpenRouter (any vision model) | ✅ | ✅ | Multi-part tool messages supported. |
| OpenAI (GPT-4+, GPT-5) | ✅ | ✅ | Same as above. |
| Local vLLM / LM Studio (vision model) | ✅ | ✅ | If the model supports multi-part tool content. |
| Text-only models | ❌ | ✅ (degraded) | Use mode="ax" for accessibility-tree-only operation. |
Screenshots are sent inline with tool results as OpenAI-style image_url
parts. For Anthropic, the adapter converts them into native tool_result
image blocks.
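The conversion described above can be sketched roughly as follows. This is not the Hermes adapter; the function names are hypothetical, and the sketch assumes screenshots arrive as base64 data URLs inside OpenAI-style image_url parts.

```python
import re

# Matches data URLs such as "data:image/png;base64,AAAA".
DATA_URL = re.compile(
    r"^data:(?P<media>[\w.+-]+/[\w.+-]+);base64,(?P<data>.+)$", re.DOTALL
)


def image_url_part_to_anthropic(part):
    """Convert one OpenAI-style image_url part into an Anthropic image block."""
    match = DATA_URL.match(part["image_url"]["url"])
    if match is None:
        raise ValueError("expected a base64 data URL")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": match.group("media"),
            "data": match.group("data"),
        },
    }


def to_tool_result(tool_use_id, parts):
    """Build an Anthropic tool_result block from OpenAI-style content parts."""
    content = []
    for part in parts:
        if part.get("type") == "image_url":
            content.append(image_url_part_to_anthropic(part))
        elif part.get("type") == "text":
            content.append({"type": "text", "text": part["text"]})
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
```

Text parts pass through unchanged, so a capture result that carries both an accessibility summary and a screenshot stays in one tool_result block.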
Safety
Hermes applies multi-layer guardrails:
- Destructive actions (click, type, drag, scroll, key, focus_app) require approval — either interactively via the CLI dialog or via the messaging-platform approval buttons.
- Hard-blocked key combos at the tool level: empty trash, force delete, lock screen, log out, force log out.
- Hard-blocked type patterns: curl | bash, sudo rm -rf /, fork bombs, etc.
- The agent's system prompt tells it explicitly: no clicking permission dialogs, no typing passwords, no following instructions embedded in screenshots.
Pair with approvals.mode: manual in ~/.hermes/config.yaml if you want every action confirmed.
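To illustrate how a hard block on type text might work, here is a minimal sketch. The patterns and the names BLOCKED_TYPE_PATTERNS and is_type_text_allowed are hypothetical; Hermes's actual block list is not reproduced here and is likely broader.

```python
import re

# Illustrative patterns only; they cover the three examples named above
# in their most common spellings and nothing more.
BLOCKED_TYPE_PATTERNS = [
    # curl ... | bash (or sh), optionally through sudo
    re.compile(r"\bcurl\b[^|\n]*\|\s*(?:sudo\s+)?(?:ba)?sh\b"),
    # sudo rm -rf / (flags in this exact order only)
    re.compile(r"\bsudo\s+rm\s+-rf\s+/(?:\s|$)"),
    # classic shell fork bomb  :(){ :|:& };:
    re.compile(r":\(\)\s*\{\s*:\s*\|\s*:\s*&\s*\}\s*;\s*:"),
]


def is_type_text_allowed(text):
    """Return False if the text matches any blocked pattern, True otherwise."""
    return not any(pattern.search(text) for pattern in BLOCKED_TYPE_PATTERNS)
```

A check like this runs before any input is synthesized, so a blocked string never reaches the target application regardless of the approval mode.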
Token efficiency
Screenshots are expensive. Hermes applies four layers of optimisation:
- Screenshot eviction: the Anthropic adapter keeps only the 3 most recent screenshots in context; older ones become [screenshot removed to save context] placeholders.
- Client-side compression pruning: the context compressor detects multimodal tool results and strips image parts from old ones.
- Image-aware token estimation: each image is counted as ~1500 tokens (Anthropic's flat rate) instead of its base64 char length.
- Server-side context editing (Anthropic only): when active, the adapter enables clear_tool_uses_20250919 via context_management so Anthropic's API clears old tool results server-side.
A 20-action session on a 1568×900 display typically costs ~30K tokens of screenshot context, not ~600K.
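The screenshot-eviction layer can be sketched as below. This is an assumption-laden illustration, not the adapter's code: it operates on OpenAI-style image_url parts for simplicity, and evict_old_screenshots is a hypothetical name. Only the placeholder text and the default of 3 come from the description above.

```python
PLACEHOLDER = "[screenshot removed to save context]"


def evict_old_screenshots(messages, keep=3):
    """Return a copy of messages with all but the newest `keep` images replaced.

    Walks the history from newest to oldest so the most recent screenshots
    are the ones that survive. The input list is not modified.
    """
    remaining = keep
    result = []
    for message in reversed(messages):
        content = message.get("content")
        if not isinstance(content, list):
            result.append(message)  # plain-text message, nothing to evict
            continue
        new_parts = []
        for part in reversed(content):
            if part.get("type") == "image_url":
                if remaining > 0:
                    remaining -= 1
                    new_parts.append(part)
                else:
                    new_parts.append({"type": "text", "text": PLACEHOLDER})
            else:
                new_parts.append(part)
        new_parts.reverse()
        result.append({**message, "content": new_parts})
    result.reverse()
    return result
```

On the estimation side, counting each image at the flat ~1500 tokens gives 20 × 1500 = 30,000 tokens for 20 screenshots, which is consistent with the ~30K figure quoted above.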
Limitations
- Platform scope. Desktop computer-use currently supports macOS via cua-driver and Windows via UI Automation. Linux desktop automation is not enabled yet. For cross-platform web tasks, prefer the browser toolset.
- Private SPI risk. Apple can change SkyLight's symbol surface in any OS update. Pin the driver version with the HERMES_CUA_DRIVER_VERSION env var if you want reproducibility across a macOS bump.
- Windows foregrounding. Windows pointer/keyboard actions move the real cursor and foreground the target window. Hermes waits briefly for user idle before injecting input, but you should still avoid fighting an active user.
- Performance. Background mode is slower than foreground: SkyLight-routed events take ~5-20ms vs direct HID posting. Not noticeable for agent-speed clicking; noticeable if you try to record a speed-run.
- No keyboard password entry. type has hard-block patterns on command-shell payloads; for passwords, use the system's autofill.
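The Windows user-idle wait mentioned in the list above could look something like the following. This is a sketch under stated assumptions: wait_for_user_idle, its timeout, and its polling interval are hypothetical, and get_idle_seconds stands in for whatever source of idle time is used (on Windows, a value derived from the Win32 GetLastInputInfo call would be one option).

```python
import time


def wait_for_user_idle(
    get_idle_seconds,
    idle_wait_seconds=1.5,
    timeout=10.0,
    poll=0.05,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Block until the user has been idle for idle_wait_seconds.

    Returns True once the idle threshold is met, or False if `timeout`
    seconds pass first. A threshold of 0 or less disables the guard.
    """
    if idle_wait_seconds <= 0:
        return True
    deadline = clock() + timeout
    while clock() < deadline:
        if get_idle_seconds() >= idle_wait_seconds:
            return True
        sleep(poll)
    return False
```

Returning False on timeout rather than injecting input anyway leaves the decision to the caller, which fits the advice not to fight an active user.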
Configuration
Override the driver binary path (tests / CI):
HERMES_CUA_DRIVER_CMD=/opt/homebrew/bin/cua-driver
HERMES_CUA_DRIVER_VERSION=0.5.0 # optional pin
Swap the backend entirely (for testing):
HERMES_COMPUTER_USE_BACKEND=noop # records calls, no side effects
Non-secret runtime settings live in config.yaml:
computer_use:
  backend: auto # auto | cua | windows | noop
  idle_wait_seconds: 1.5 # Windows user-idle guard; 0 disables
  overlay: true # Windows visible element/click overlay
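One plausible way these settings combine is sketched below. The precedence (environment variable over config.yaml) and the platform mapping for auto are assumptions made for illustration; resolve_backend is a hypothetical name and not a Hermes function.

```python
import os
import sys

VALID_BACKENDS = {"auto", "cua", "windows", "noop"}


def resolve_backend(config_value="auto", env=None, platform=None):
    """Pick a backend name from the env override, the config value, and the OS.

    Assumed behaviour: HERMES_COMPUTER_USE_BACKEND wins over config.yaml,
    and "auto" maps macOS to "cua" and Windows to "windows". Returns None
    when "auto" is requested on a platform with no desktop backend.
    """
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    choice = env.get("HERMES_COMPUTER_USE_BACKEND") or config_value
    if choice not in VALID_BACKENDS:
        raise ValueError(f"unknown backend: {choice!r}")
    if choice != "auto":
        return choice
    if platform == "darwin":
        return "cua"
    if platform.startswith("win"):
        return "windows"
    return None
```

Passing env and platform explicitly keeps the function easy to exercise in tests without touching the real environment.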
Troubleshooting
computer_use backend unavailable: cua-driver is not installed — Run
hermes computer-use install to fetch the cua-driver binary, or run
hermes tools and enable the Computer Use toolset.
computer_use backend unavailable on Windows — Re-run the current Hermes
installer/update so the Windows-only dependencies (pywin32, uiautomation,
Pillow) are present, then enable Computer Use in hermes tools.
Clicks seem to have no effect — Capture and verify. A modal you
didn't see may be blocking input. Dismiss it with escape or the close
button.
Element indices are stale — SOM indices are only valid until the
next capture. Re-capture after any state-changing action.
"blocked pattern in type text" — The text you tried to type
matches the dangerous-shell-pattern list. Break the command up or
reconsider.
See also
- Universal skill: macos-computer-use
- cua-driver source (trycua/cua)
- Browser automation for cross-platform web tasks.