The old defaults (StartLimitIntervalSec=600, StartLimitBurst=5,
RestartSec=30) meant any network outage over ~5 minutes would
permanently kill the gateway until manual intervention.
Changes:
- StartLimitIntervalSec=0 (never give up)
- Restart=always (not just on-failure)
- RestartSec=60 with RestartMaxDelaySec=300, RestartSteps=5
(exponential backoff: 60 → 120 → 180 → 240 → 300s cap)
- After=network-online.target + Wants= (both units now wait for
actual connectivity, not just network.target)
Power outage → internet down → internet back = auto-recovery.
When the dashboard is bound to 0.0.0.0 with --insecure (e.g. behind
Tailscale Serve), WebSocket endpoints (/api/pty, /api/ws, /api/pub,
/api/events) rejected connections from non-loopback client IPs with
code 4403 — causing 'events feed disconnected' in the UI.
Extract the repeated loopback check into _ws_client_is_allowed() which
respects the public bind flag. Session token auth still guards all
endpoints regardless of bind mode.
FixesNousResearch/hermes-agent#11768
Root cause: target.strip().lower() was lowercasing the entire target string,
corrupting case-sensitive chat IDs like Slack C123ABC and Matrix !RoomABC.
Fix: Only lowercase the platform prefix for case-insensitive matching;
preserve the original case for chat_id and thread_id values.
YAML loads a bare numeric value such as
discord:
free_response_channels: 1491973769726791812
as an int. _discord_free_response_channels() / _slack_free_response_channels()
checked `isinstance(raw, list)` and `isinstance(raw, str)` in that order and
then fell through to `return set()`, so a single-channel config that happened
to be unquoted was silently dropped with no log line — the bot kept demanding
@mentions even though the channel was configured to free-response.
A multi-channel value like `1234567890,9876543210` does not trip this because
the comma forces YAML to parse it as a string. Single-channel configs are
the only case that breaks, which is exactly the footgun that's hardest to
diagnose (the config "looks right" and the feature just doesn't activate).
Note that the old-schema env-var bridge at gateway/config.py:614+ already
runs `str(frc)` when forwarding to SLACK_/DISCORD_FREE_RESPONSE_CHANNELS,
so the env-var fallback worked. The bug only surfaces on the
`config.extra["free_response_channels"]` path populated by the `platforms:`
bridge at gateway/config.py:576, which passes the raw YAML value through
unchanged.
Fix at the reader: treat any non-list value as a scalar, coerce with str(),
then apply the same CSV split semantics. This keeps the public contract
stable (list or str-like continues to work identically) while accepting
the ints that the YAML loader is free to hand us.
Added tests for both Discord and Slack covering:
- bare int value in config.extra
- list of ints in config.extra
Slack has built-in slash commands (e.g. /status, /me, /join) that apps
cannot register. When running `hermes slack manifest --write`, the
generated manifest included /status, causing Slack to reject the entire
manifest with a reserved-command error.
Add _SLACK_RESERVED_COMMANDS frozenset of all known Slack built-ins and
skip them in slack_native_slashes(). Affected commands remain reachable
via /hermes <command>.
Tests updated:
- New test_excludes_slack_reserved_commands validates no leaks
- test_includes_canonical_commands no longer asserts /status
- test_telegram_parity accounts for expected Slack-only exclusions
Self-review fixes for the slash ephemeral ack:
- Only stash response_url when text starts with '/' (gateway command).
Free-form questions via '/hermes <question>' must produce public agent
replies visible to the whole channel, not ephemeral.
- Use a ContextVar (_slash_user_id) to thread the invoking user's ID
from _handle_slash_command through to send(). _pop_slash_context now
matches the exact (channel_id, user_id) key when the ContextVar is
set, preventing concurrent users on the same channel from stealing
each other's ephemeral context. ContextVars propagate to child
asyncio.Tasks, so the value survives through handle_message →
_process_message_background → _send_with_retry → send().
- Add truncate_message() in _send_slash_ephemeral to prevent silent
failures on long responses (response_url has the same ~40k limit).
- Log send_private_notice failures at debug level instead of bare
except/pass — aids diagnostics without spamming.
- Document app_mention dedup dependency on shared event ts.
- Add tests: free-form question must NOT stash context, concurrent
users on the same channel get isolated contexts, non-slash send()
path fallback behavior.
Adds platform-level private notice delivery abstraction so operational
messages (e.g. sethome prompt) can be sent ephemerally on Slack when
configured with `slack.notice_delivery: private`.
Changes:
- gateway/config.py: _normalize_notice_delivery() + GatewayConfig.get_notice_delivery()
with per-platform config bridging
- gateway/platforms/base.py: send_private_notice() default implementation
(falls through to send())
- gateway/platforms/slack.py: send_private_notice() via chat_postEphemeral
- gateway/run.py: _deliver_platform_notice() helper replaces direct
adapter.send() for the sethome notice, with private→public fallback
- gateway/platforms/slack.py: app_mention handler now forwards to
_handle_slack_message (safe due to ts-based dedup) instead of no-op pass,
fixing edge-case Slack configs where mentions arrive only as app_mention
- gateway/platforms/slack.py format_message: negative lookbehind prevents
markdown images (![]()) from becoming broken Slack links; italic regex
now requires non-whitespace boundaries so 'a * b * c' stays literal
Based on PR #9340 by @probepark.
Slack slash commands (/q, /btw, /stop, /model, etc.) previously showed
no user-visible acknowledgement and posted command replies as public
channel messages. This diverged from Discord, which uses ephemeral
deferred responses for slash commands.
Changes:
- handle_hermes_command now passes response_type='ephemeral' and a
'Running /cmd…' text to ack(), giving the user immediate 'Only visible
to you' feedback when they invoke any native slash command.
- _handle_slash_command stashes the Slack response_url from the command
payload in a per-channel context dict before dispatching to
handle_message.
- send() checks for a pending slash context and, when found, POSTs to
the response_url with replace_original=true to swap the initial ack
with the real command reply (e.g. 'Queued for the next turn.'),
keeping it ephemeral.
- Stale slash contexts are garbage-collected on lookup (120s TTL).
- The response_url POST is non-fatal: if it fails, the user already saw
the initial ack, and send() returns success=True.
Fixes#18182
Introduce the Electron desktop app with a split app/chat/settings structure and shared nanostore state so UI areas own their state instead of routing it through the root.
Long-running gateway processes that survive 'hermes update' keep
pre-update modules cached in sys.modules. When new tool files on
disk then try to 'from hermes_cli.config import cfg_get' (added in
PR #17304), the import resolves against the stale module object
and raises ImportError — hitting users on Matrix, Telegram, Feishu,
and other platforms.
Two defenses:
1. Gateway self-check (gateway/run.py). On __init__, snapshot the
newest mtime across sentinel source files (hermes_cli/config.py,
run_agent.py, gateway/run.py, etc.). On every inbound message,
re-read those mtimes; if any is newer than boot time + 2s slack,
request a graceful restart via the normal drain path and return
a one-line ack to the user. Idempotent, works regardless of how
the update happened (hermes update, manual git pull, installer).
2. Post-restart survivor sweep ('hermes update'). After the existing
restart loop, sleep 3s, rescan for gateway PIDs we already tried
to kill, and SIGKILL any survivors. The detached profile watchers
and systemd then relaunch with fresh code instead of waiting out
the 120s watcher timeout.
Closes#17648.
* fix(curator): defer first run and add --dry-run preview (#18373)
Curator was meant to run 7 days after install, not on the very first
gateway tick. On a fresh install (no .curator_state), should_run_now()
returned True immediately because last_run_at was None — so the gateway
cron ticker fired Curator against a fresh skill library moments after
'hermes update'. Combined with the binary 'agent-created' provenance
model (anything not bundled and not hub-installed), this consolidated
hand-authored user workflow skills without consent.
Changes:
- should_run_now(): first observation seeds last_run_at='now' and returns
False. The next real pass fires one full interval_hours later (7 days
by default), matching the original design intent.
- hermes curator run --dry-run: produces the same review report without
applying automatic transitions OR permitting the LLM to call
skill_manage / terminal mv. A DRY-RUN banner is prepended to the
prompt and the caller skips apply_automatic_transitions. State is
NOT advanced so a preview doesn't defer the next scheduled real pass.
- hermes update: prints a one-liner on fresh installs pointing at
--dry-run, pause, and the docs. Silent on steady state.
- Docs: curator.md and cli-commands.md explain the deferred first-run
behavior and warn that hand-written SKILL.md files share the
'agent-created' bucket, with guidance to pin or preview before the
first pass.
Tests:
- test_first_run_defers replaces the old 'first run always eligible'
assertion — same fixture, inverted expectation.
- test_maybe_run_curator_defers_on_fresh_install covers the gateway tick
path end-to-end.
- Three new dry-run tests cover state-advance suppression, prompt
banner injection, and apply_automatic_transitions skipping.
Fixes#18373.
* feat(curator): pre-run backup + rollback (#18373)
Every real curator pass now snapshots ~/.hermes/skills/ into
~/.hermes/skills/.curator_backups/<utc-iso>/skills.tar.gz before calling
apply_automatic_transitions or the LLM review. If a run consolidates or
archives something the user didn't want touched, 'hermes curator
rollback' restores the tree in one command. Dry-run is skipped — no
mutation means no snapshot needed.
Changes:
- agent/curator_backup.py (new): tar.gz snapshot + safe rollback. The
snapshot excludes .curator_backups/ (would recurse) and .hub/ (managed
by the skills hub). Extract refuses absolute paths and .. components,
and uses tarfile's filter='data' on Python 3.12+. Rollback takes a
pre-rollback safety snapshot FIRST, stages the current tree into
.rollback-staging-<ts>/ so the extract lands in an empty dir, and
cleans the staging dir on success. A failed extract restores the
staged contents.
- agent/curator.py: run_curator_review() calls curator_backup.
snapshot_skills(reason='pre-curator-run') before apply_automatic_
transitions. Best-effort — a failed snapshot logs at debug and the
run continues (a transient disk issue shouldn't silently disable
curator forever).
- hermes_cli/curator.py: new 'hermes curator backup' and 'hermes curator
rollback' subcommands. rollback supports --list, --id <ts>, -y.
- hermes_cli/config.py: curator.backup.{enabled, keep} config block
with sane defaults (enabled=true, keep=5).
- Docs: curator.md gets a 'Backups and rollback' section; cli-commands
.md table gets the new rows.
Tests (new file tests/agent/test_curator_backup.py, 16 cases):
- snapshot creates tarball + manifest with correct counts
- snapshot excludes .curator_backups/ (recursion guard) and .hub/
- snapshot disabled via config returns None without creating anything
- snapshot uniquifies ids within the same second (-01 suffix)
- prune honors keep count, newest-first
- list_backups + _resolve_backup cover newest-default and unknown-id
- rollback restores a deleted skill with content intact
- rollback is itself undoable — safety snapshot shows up in list_backups
- rollback with no snapshots returns an error
- rollback refuses tarballs with absolute paths or .. components
- real curator runs take a 'pre-curator-run' snapshot; dry-runs do not
All curator tests: 210 passing locally.
Prevents ghost sessions from accumulating in state.db when the TUI/web
dashboard is opened and closed without sending a message.
Changes:
- run_agent.py: Add _ensure_db_session() gate method, called at
run_conversation() entry. Remove eager create_session() from __init__.
Handle compression rotation flag correctly.
- tui_gateway/server.py: Remove eager db.create_session() in
_start_agent_build(). Add post-first-message pending_title re-apply.
- hermes_state.py: Extract _insert_session_row() shared helper (DRY).
Add prune_empty_ghost_sessions() for one-time migration.
- cli.py: One-time ghost session prune on startup. Fix _pending_title
to call _ensure_db_session() before set_session_title().
- hermes_cli/main.py: Guard TUI exit summary on message_count > 0.
- tests: Update test_860_dedup to call _ensure_db_session() before
direct _flush_messages_to_session_db() calls.
Closes: ghost session clutter in hermes sessions list and web dashboard.
Telegram's client does not display empty forum topics in the chat's
topic list. After createForumTopic succeeds, send a short pin message
into the new topic so it becomes immediately visible to the user.
Only fires for newly created topics (no thread_id in config yet).
Failure to send the seed is non-fatal (debug-logged, topic still works).
The bot-owner identity check inside OwnerCommandMiddleware was commented
out and replaced with a hardcoded `is_owner = True`, so any group member
could trigger allowlisted privileged commands (/approve, /deny, /stop,
/reset, /retry, /undo, /new, /background, /bg, /btw, /queue, /q) by
sending the slash command without @-mentioning the bot. The most severe
case is /approve: a non-owner could approve a dangerous tool call the
bot was waiting on the owner to confirm.
Re-enable the documented identity check (push.from_account ==
push.bot_owner_id) so only the configured owner can issue these
commands.
Adds a new top-of-sidebar docs page at /docs/user-stories that is a
masonry-style collage of 99 real user stories sourced from X/Twitter,
GitHub issues/PRs, Reddit, Hacker News, YouTube, blogs (Medium, Substack,
dev.to), podcasts, LinkedIn, GitHub Gists, and Product Hunt.
Every tile links to the original post/issue/video/gist where someone
described a specific use case: personal assistants, dev workflows,
trading bots, research briefs, family WhatsApp agents, Kubernetes
deployments, legal-domain self-hosted setups, and more.
- docs/user-stories.mdx: MDX entry mounting the collage component
- src/components/UserStoriesCollage: React component with category +
source filters, CSS-columns masonry layout, per-category accent colors
- src/data/userStories.json: source-of-truth dataset (force-added; the
root .gitignore's unanchored 'data/' rule would otherwise swallow it,
same reason skills.json is explicitly listed in website/.gitignore)
- sidebars.ts: link added at the top of the docs sidebar
Four callsites hardcoded Path.home() / '.hermes' with no HERMES_HOME
check, breaking Docker deployments and profile isolation (hermes -p):
- plugins/hermes-achievements/dashboard/plugin_api.py:
state_path(), snapshot_path(), checkpoint_path() bare-literal paths
- scripts/profile-tui.py:
DEFAULT_STATE_DB and DEFAULT_LOG defaults ignored HERMES_HOME
- hermes_cli/slack_cli.py:
except-Exception fallback for slack-manifest.json dump
- optional-skills/migration/openclaw-migration/scripts/openclaw_to_hermes.py:
--target argparse default
Use get_hermes_home() (with an ImportError shim for the standalone
scripts) or 'os.environ.get("HERMES_HOME") or str(Path.home()/".hermes")'
where importing hermes_constants is impractical.
E2E-verified: with HERMES_HOME=/tmp/x all three achievements paths and
both profile-tui defaults route under /tmp/x.
Salvaged from #18068 (original scope was broader mechanical cleanup
claiming 23 callsites were buggy; most were already respecting
HERMES_HOME via os.environ.get(key, default) — only these 4 had no env
check at all). Credit: @web-dev0521.
Two machine-readable entry points to the Hermes Agent docs:
/llms.txt curated index of every doc page, one link per page
with short descriptions. ~17 KB, safe to load into
an LLM context window.
/llms-full.txt every page under website/docs/ concatenated as markdown.
~1.8 MB. For one-shot ingestion by coding agents and
RAG pipelines.
Both files are also served from /docs/llms.txt and /docs/llms-full.txt
(Docusaurus serves website/static/ under baseUrl=/docs/). Some agents and
IDE plugins probe the classic site-root path; the deploy workflow now copies
both files to _site root so either URL works.
Conforms to the emerging llmstxt.org spec: H1 project name, blockquote
summary, short install command, GitHub link, then curated sections
mirroring the docs-site navigation (Getting Started, Using Hermes,
Features, Messaging, Integrations, Guides, Developer Guide, Reference).
Generated by website/scripts/generate-llms-txt.py. Wired into prebuild.mjs
so every 'npm run build' and 'npm run start' refreshes the files alongside
the existing skills.json extraction. Both outputs are gitignored (same
precedent as src/data/skills.json).
Descriptions in llms.txt are pulled from each page's frontmatter, so they
stay current automatically. All ~80 section slugs are validated against
the filesystem at generation time; an invalid slug would fail the prebuild.
Adds a proper feature page at user-guide/features/goals.md covering
the /goal slash command — Hermes' take on the Ralph loop shipped in
PR #18262. The slash-commands reference table had two table rows but
no narrative doc walking through the judge model, fail-open semantics,
turn budget, persistence, user-message preemption, or the aux-model
config override.
Adds a walkthrough example showing a multi-turn goal running to
completion, covers the two judge failure modes with how to recover,
and credits Codex CLI 0.128.0 / Eric Traut as prior art.
Also cross-links both slash-commands.md rows to the new page so
readers discovering /goal from the command reference can dive in.
The anyOf collapse in _repair_schema returned early, skipping the
nullable-strip and enum-cleanup steps. When a schema had anyOf
[{enum: [..., null, '']}, {type: null}] alongside a parent-level
'nullable: true', collapsing to the single non-null branch produced a
merged node that still had both 'nullable' and the bad enum values —
Moonshot would still 400 on it.
Fix: fall through to Rules 1/3 when the collapse produces a single
merged node; only return early for the multi-branch case (pure
anyOf preservation) or when there was no null branch to remove.
Adds a test that locks in the combined-case expectation.
When a schema node inside anyOf has enum values but no explicit 'type',
Rule 3 (enum cleanup) ran before _fill_missing_type, so node_type was
None and the enum was never cleaned. Moonshot then rejected the schema
with 'enum value (<nil>) does not match any type in [string]'.
Fix: reorder operations — fill missing type first, strip nullable,
then clean enum. This ensures enum cleanup always has a type to check.
Also fixes test expectation: empty string in enum is now correctly
stripped (Moonshot rejects it too).
Closes#16875
Add a standing-goal slash command that keeps Hermes working toward a
user-stated objective across turns until it is achieved, paused, or
the turn budget runs out. Our take on the Ralph loop — cf. Codex CLI
0.128.0's /goal.
After each turn, a lightweight auxiliary-model judge call asks 'is
this goal satisfied by the assistant's last response?'. If not, and
we're under the turn budget (default 20), Hermes feeds a continuation
prompt back into the same session as a normal user message. Any real
user message preempts the continuation loop automatically.
Judge failures fail OPEN (continue) so a flaky judge never wedges
progress — the turn budget is the real backstop.
### Commands
- `/goal <text>` — set a standing goal (kicks off the first turn)
- `/goal` or `/goal status` — show current state
- `/goal pause` — pause the continuation loop
- `/goal resume` — resume (resets turn counter)
- `/goal clear` — drop the goal
Works on both CLI and gateway platforms via the central CommandDef
registry.
### Design invariants preserved
- **Prompt cache**: continuation prompts are regular user-role
messages appended to history. No system-prompt mutation, no toolset
swap.
- **Role alternation**: continuation is a user turn, never injected
mid-tool-loop.
- **Session persistence**: goal state lives in SessionDB.state_meta
keyed by `goal:<session_id>`, so `/resume` picks it up.
- **Mid-run safety**: on the gateway, `/goal status|pause|clear` are
allowed mid-run (control-plane only); setting a new goal requires
`/stop` first so we don't race a second continuation prompt against
the current turn.
### Files
- `hermes_cli/goals.py` (new, 380 lines) — GoalManager + judge + state
- `hermes_cli/commands.py` — CommandDef entry
- `hermes_cli/config.py` — `goals.max_turns` default
- `hermes_cli/web_server.py` — dashboard category merge
- `cli.py` — /goal handler + post-turn continuation hook in
process_loop
- `gateway/run.py` — /goal handler + post-turn continuation hook
wrapping _handle_message_with_agent
- `tests/hermes_cli/test_goals.py` (new, 26 tests) — judge parsing,
fail-open semantics, lifecycle, persistence, budget exhaustion
- `website/docs/reference/slash-commands.md` — docs entry
* docs(sidebar): collapse exploding skills tree to a single Skills node
The Skills sub-tree in the left sidebar expanded to 200+ entries
(22 bundled categories + 15 optional categories, every skill a page).
That's most of the nav on a first visit — docs for the actual product
get drowned in it.
Collapse the sidebar to:
Skills
godmode (hand-written spotlight)
google-workspace (hand-written spotlight)
Bundled catalog (reference/skills-catalog — table of all bundled)
Optional catalog (reference/optional-skills-catalog — table of all optional)
Per-skill pages still generate and are still reachable at their URLs;
they're linked from the two catalog tables and from the Skills overview
page. They just don't appear in the left nav anymore.
sidebars.ts goes from 649 lines to 247. generate-skill-docs.py loses
the bundled/optional sidebar render helpers.
Also picks up incidental generator output drift on current main
(comfyui skill content refresh; 4 new skill pages for
devops-kanban-orchestrator, devops-kanban-worker,
productivity-here-now, productivity-shopify; two catalog refreshes).
These are what the generator produces on main today — keeping them
committed avoids the next docs build showing 'working tree dirty'.
* docs(sidebar): drop godmode and google-workspace spotlight pages
Keep the Skills sidebar node strictly principled: two catalog links,
nothing else. There was no rule for which skills got spotlight pages
and which got auto-generated pages — just that these two happened to
be hand-written first.
Both pages still build and are still reachable at
/docs/user-guide/skills/godmode and
/docs/user-guide/skills/google-workspace. They're linked from the
catalog tables and the Skills overview page.
Sidebar Skills node now:
Skills
├── Bundled catalog
└── Optional catalog