hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-05 02:07:34 +08:00

Author	SHA1	Message	Date
GodsBoy	f5bd77b3e1	fix(kanban): anchor board, workspaces, and worker logs at the shared Hermes root The Kanban board is documented as shared across all Hermes profiles, but `kanban_db_path()` and `workspaces_root()` resolved through `get_hermes_home()`, which returns the active profile's HERMES_HOME. When the dispatcher spawned a worker with `hermes -p <profile> --skills kanban-worker chat -q "work kanban task <id>"`, the worker rewrote HERMES_HOME to the profile subdirectory before kanban_db.py imported, opening a profile-local `kanban.db` that did not contain the dispatcher's task. `kanban_show` and `kanban_complete` failed; the dispatcher's row stayed `running` and was retried/crashed. The same defect applied to `_default_spawn`'s log directory and `worker_log_path`, so `hermes kanban tail` did not see the worker's output. Add `kanban_home()` in `hermes_cli/kanban_db.py` that resolves through `HERMES_KANBAN_HOME` (explicit override) then `get_default_hermes_root()`, which already understands the `<root>/profiles/<name>` and Docker / custom HERMES_HOME shapes. Reroute `kanban_db_path`, `workspaces_root`, the `_default_spawn` log directory, `gc_worker_logs`, and `worker_log_path` through it. Profile-specific config, `.env`, memory, and sessions stay isolated as before; only the kanban surface is shared. Add a `TestSharedBoardPaths` regression class to `tests/hermes_cli/test_kanban_db.py` covering: default install, profile-worker convergence, Docker custom HERMES_HOME, Docker profile layout, explicit `HERMES_KANBAN_HOME` override, and a real SQLite round-trip across dispatcher and worker HERMES_HOME perspectives. The dispatcher/worker convergence tests fail on origin/main and pass after the fix. Update the `kanban.md` user-guide page and the misleading docstrings in `kanban_db.py` to describe the shared-root behavior. Fixes #19348	2026-05-03 15:13:39 -07:00
Siddharth Balyan	167b5648ea	Revert "fix(cli): CLI/TUI on local backend always uses launch directory, ignores terminal.cwd (#19242)" (#19329) This reverts commit `9eaddfafa3`.	2026-05-04 00:43:58 +05:30
Siddharth Balyan	9eaddfafa3	fix(cli): CLI/TUI on local backend always uses launch directory, ignores terminal.cwd (#19242 ) CLI/TUI sessions on the local backend now unconditionally use os.getcwd() as the working directory. The terminal.cwd config value is only consumed by gateway/cron/delegation modes (where there's no shell to cd from). Previously, 'hermes setup' would write an absolute path (e.g. $HOME) into terminal.cwd which then pinned the CLI to that directory regardless of where the user launched hermes from. This was a silent foot-gun — the user's 'cd' was being ignored. Changes: 1. cli.py: Restructured CWD resolution — if TERMINAL_CWD is not already set by the gateway, and the backend is local, always use os.getcwd(). Config terminal.cwd is irrelevant for interactive CLI/TUI sessions. 2. setup.py: Moved the cwd prompt from setup_terminal_backend() to setup_gateway(). It now only appears when configuring messaging platforms and is labeled 'Gateway working directory'. 3. Tests: Rewrote test_cwd_env_respect.py to validate the new behavior: explicit config paths are ignored for CLI, gateway pre-set values are preserved, non-local backends keep their config paths. 4. Docs: Updated configuration.md, profiles.md, and environment-variables.md to clarify that terminal.cwd only affects gateway/cron mode on local backend. Closes #19214	2026-05-04 00:14:36 +05:30
GodsBoy	b8ae8cc801	fix(debug): redact log content at upload time in hermes debug share Apply agent.redact.redact_sensitive_text with force=True to log content captured by _capture_log_snapshot before it reaches upload_to_pastebin. On-disk logs are untouched. Compatible with the off-by-default local redaction policy from #16794: this is upload-time-only and applies regardless of security.redact_secrets because the public paste service is the leak surface. A visible banner is prepended to each uploaded log paste so reviewers know redaction was applied. --no-redact preserves deliberate unredacted sharing for maintainer-coordinated cases. The bug-report, setup-help, and feature-request issue templates direct users to run hermes debug share and paste the resulting public URLs. With redaction off by default per #16794, those uploads have been carrying credentials onto paste.rs and dpaste.com. force=True is non-negotiable: without it, redact_sensitive_text short-circuits at agent/redact.py:322 when the env var is unset, so the fix would silently be a no-op for its target audience. A regression test pins this down. Fixes #19316	2026-05-03 11:42:20 -07:00
Siddharth Balyan	c9a3f36f56	feat: add video_analyze tool for native video understanding (#19301 ) * feat: add video_analyze tool for native video understanding Adds a video_analyze tool that sends video files to multimodal LLMs (e.g. Gemini) for analysis via the OpenRouter-compatible video_url content type. Mirrors vision_analyze in structure, error handling, and registration pattern. Key design: - Base64 encodes entire video (no frame extraction, no ffmpeg dep) - Uses 'video_url' content block type (OpenRouter standard) - Supports mp4, webm, mov, avi, mkv, mpeg formats - 50 MB hard cap, 20 MB warning threshold - 180s minimum timeout (videos take longer than images) - AUXILIARY_VIDEO_MODEL env override, falls back to AUXILIARY_VISION_MODEL - Same SSRF protection, retry logic, and cleanup as vision_analyze Default disabled: registered in 'video' toolset (not in _HERMES_CORE_TOOLS). Users opt in via: hermes tools enable video, or enabled_toolsets=['video']. * feat(video): add models.dev capability pre-check + CONFIGURABLE_TOOLSETS entry - Pre-checks model video capability via models.dev modalities.input before expensive base64 encoding. Fails early with helpful message suggesting video-capable alternatives (gemini, mimo-v2.5-pro). - Passes optimistically if model unknown or lookup fails. - Adds ModelInfo.supports_video_input() helper. - Adds 'video' to CONFIGURABLE_TOOLSETS and _DEFAULT_OFF_TOOLSETS so 'hermes tools enable video' works from CLI. - 8 new tests for the capability check (37 total). * refactor(video): remove models.dev capability pre-check Removes _check_video_model_capability and ModelInfo.supports_video_input. The vision_analyze tool doesn't pre-check image capability either — both tools rely on the same pattern: send request, handle API errors gracefully with categorized user-facing messages. The pre-check was inconsistent (only worked for some providers/models) so drop it for parity. 
* cleanup: compress comments, fix fragile timeout coupling - Replace _VISION_DOWNLOAD_TIMEOUT * 2 with hardcoded 60s (no silent breakage if vision timeout changes independently) - Strip verbose comments and redundant log lines throughout - No behavioral changes	2026-05-04 00:04:36 +05:30
Bartok9	e527240b27	fix(tools): write_file handler now rejects missing 'content'/'path' args instead of silently writing zero-byte files (#19096 ) Under context pressure, frontier models sometimes emit tool calls with required fields dropped. Previously _handle_write_file() used args.get('content', '') which substituted an empty string for the missing key, returned success with bytes_written=0, and created a zero-byte file on disk. The model had no way to detect the failure. Changes: - Reject calls where 'path' is absent or not a non-empty string - Reject calls where 'content' key is entirely absent (key-presence check, not truthiness) — distinguishing a legitimately empty file from a dropped arg - Reject calls where 'content' is a non-string type - All error messages include guidance to re-emit the tool call or switch to execute_code with hermes_tools.write_file() for large payloads - Explicit empty string content (file truncation) continues to work Regression tests added for all four cases: missing path, missing content, explicit-empty content, and wrong content type. Fixes #19096	2026-05-03 08:52:41 -07:00
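The key-presence check described in e527240b27 can be sketched roughly as follows. This is a minimal illustration, not the actual `_handle_write_file()` code: the function name `validate_write_file_args` and the error wording are hypothetical, and only the four cases named in the commit message are modeled.

```python
from typing import Any, Optional


def validate_write_file_args(args: dict[str, Any]) -> Optional[str]:
    """Return an error message for a malformed call, or None if the call is valid.

    Hypothetical helper; the real handler's name and messages may differ.
    """
    path = args.get("path")
    # 'path' must be present and a non-empty string.
    if not isinstance(path, str) or not path:
        return "write_file: 'path' is missing or empty; re-emit the tool call."
    # Key-presence check, not truthiness: an explicit "" is a legitimate
    # truncation, while an absent key means the model dropped the argument.
    if "content" not in args:
        return "write_file: 'content' is missing; re-emit the tool call."
    if not isinstance(args["content"], str):
        return "write_file: 'content' must be a string."
    return None
```

Under these assumptions, `{"path": "a.txt", "content": ""}` still passes (an empty file is allowed), while `{"path": "a.txt"}` is rejected.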
Tranquil-Flow	6b4fb9f878	fix(cron): treat non-dict origin as missing instead of crashing tick ``_resolve_origin`` called ``origin.get('platform')`` on whatever ``job.get('origin')`` returned. The leading ``if not origin: return None`` short-circuited the falsy cases (None, empty dict, "") but a non-empty string passed that guard and then crashed with ``AttributeError: 'str' object has no attribute 'get'`` on every fire attempt. Observed in the wild after a migration script tagged jobs with free-form provenance strings (e.g. ``"combined-digest-replaces-x-and-y-20260503"``). ``mark_job_run`` did record ``last_status: error, last_error: "'str' object has no attribute 'get'"`` once, but the next tick re-loaded the same poisoned origin and crashed identically. The job stayed enabled, fired every tick, and accumulated cascading errors in the log until ``origin`` was patched manually. Replace the falsy guard with ``isinstance(origin, dict)``. Non-dict origins (string, int, list, tuple, float — anything that survived a hand-edit, JSON-script write, or migration) are now treated the same as a missing origin: the job continues with ``deliver`` falling back through its normal home-channel path instead of crashing the scheduler loop. Test parametrises the non-dict shapes that can appear in jobs.json through external writers and asserts ``_resolve_origin`` returns None for each. Note: this fix scope is the non-dict-``origin`` crash only. The ``next_run_at: null`` recurring-job recovery (the second sub-bug in #18722) is independently addressed by the in-flight #18825, which extends the never-silently-disable defense from #16265 to ``get_due_jobs()`` — that approach is well-aligned with the existing recovery pattern and ships fine without a competing change here. Fixes #18722 (non-dict origin crash; recurring-job recovery covered by #18825)	2026-05-03 08:51:50 -07:00
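The guard change in 6b4fb9f878 amounts to swapping a falsy check for a type check. A minimal sketch, assuming a simplified function body: `resolve_origin_sketch` is a hypothetical stand-in for `_resolve_origin`, and the handling of a dict without a `platform` key is an assumption, not something the commit message states.

```python
from typing import Any, Optional


def resolve_origin_sketch(job: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the job's origin dict, or None when it is missing or malformed."""
    origin = job.get("origin")
    # isinstance() rejects None and "" like the old falsy guard did, but also
    # rejects non-empty strings, ints, lists, tuples, and floats.
    if not isinstance(origin, dict):
        return None
    # Assumption: an origin without a platform is treated as missing too.
    if not origin.get("platform"):
        return None
    return origin
```

With this shape, a free-form provenance string in `origin` falls through to the same path as a missing origin instead of raising `AttributeError`.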
leprincep35700	b59bb4e351	fix(gateway): preserve home-channel thread targets across restart notifications	2026-05-03 08:47:49 -07:00
Teknium	d87fd9f039	fix(goals): make /goal work in TUI and fix gateway verdict delivery (#19209 ) /goal was silently broken outside the classic CLI. TUI: /goal was routed through the HermesCLI slash-worker subprocess, which set the goal row in SessionDB but then called _pending_input.put(state.goal) — the subprocess has no reader for that queue, so the kickoff message was discarded. No post-turn judge was wired into prompt.submit either, so even a manual kickoff would not continue the goal loop. Intercept /goal in command.dispatch instead, drive GoalManager directly, and return {type: send, notice, message} so the TUI client renders the Goal-set notice and fires the kickoff. Run the judge in _run_prompt_submit after message.complete, surface the verdict via status.update {kind: goal}, and chain the continuation turn after the running guard is released. Gateway: _post_turn_goal_continuation was gated on hasattr(adapter, 'send_message'), but adapters only expose send(). That branch was dead on every platform — users never saw '✓ Goal achieved', 'Continuing toward goal', or budget-exhausted messages. Replace the dead call with adapter.send(chat_id, content, metadata) and drop a broken reference to self._loop. Tests: - tests/tui_gateway/test_goal_command.py — full /goal dispatch matrix (set / status / pause / resume / clear / stop / done / whitespace) plus regressions for slash.exec → 4018 and 'goal' staying in _PENDING_INPUT_COMMANDS. - tests/gateway/test_goal_verdict_send.py — locks in the adapter.send path for done / continue / budget-exhausted and verifies the hook no-ops when no goal is set or the adapter lacks send().	2026-05-03 05:49:12 -07:00
kshitijk4poor	6f2dab248a	fix: update tests for resume_pending semantics + add AUTHOR_MAP entries Tests updated to reflect suspend_recently_active now setting resume_pending=True (preserves session) instead of suspended=True (wipes session history). AUTHOR_MAP entries: millerc79 (#19033), shellybotmoyer (#18915)	2026-05-03 03:54:03 -07:00
nftpoetrist	6c1322b997	fix(slack): close previous handler in connect() to prevent zombie Socket Mode connections SlackAdapter.connect() overwrote self._handler, self._app, and self._socket_mode_task without closing the prior AsyncSocketModeHandler first. If connect() was called a second time on the same adapter (e.g. during a gateway restart or in-process reconnect attempt), the old Socket Mode websocket stayed alive. Both the old and new connections received every Slack event and dispatched it twice — producing double responses with different wording, the same bug that affected DiscordAdapter (#18187, fixed in #18758). Fix: add a close-before-reassign guard at the start of the connection setup path, mirroring the guard DiscordAdapter.connect() already has. When self._handler is None (fresh adapter, first connect()) the block is a harmless no-op. Scoped to the handler/app fields only — no behavior change for any path that does not call connect() twice. Fixes #18980	2026-05-03 03:47:49 -07:00
0xyg3n	19ba9e43b6	fix(gateway/discord): require allowlist auth on slash commands Slash commands (_run_simple_slash, _handle_thread_create_slash) bypassed every DISCORD_ALLOWED_* gate enforced by on_message. Any guild member could invoke /background (RCE via terminal), /restart, /model, /skill, etc. CVSS 9.8 Critical. - _evaluate_slash_authorization mirrors on_message gates (user, role, channel, ignored channel) with fail-closed semantics - _check_slash_authorization sends ephemeral reject + logs + admin alert - Auth gate runs before defer() so rejections are ephemeral - /skill autocomplete returns [] for unauthorized users (no catalog leak) - Component views (ExecApproval, SlashConfirm, UpdatePrompt, ModelPicker) now honor role allowlists via shared _component_check_auth helper - Optional DISCORD_HIDE_SLASH_COMMANDS defense-in-depth - Cross-platform admin alert (Telegram/Slack fallback) on unauthorized attempts Based on PR #18125 by @0xyg3n.	2026-05-03 03:44:55 -07:00
kshitijk4poor	5d5b8912be	test: add tests for cmd_key preservation through name clamping - TestClampCommandNamesTriples: unit tests for 3-tuple support in _clamp_command_names (short names, long names, collisions, multiple entries, backward compat with 2-tuples) - TestDiscordSkillCmdKeyDispatch: integration test through the full discord_skill_commands pipeline verifying long skill names retain their original cmd_key after clamping - Add contributor CharlieKerfoot to AUTHOR_MAP	2026-05-03 03:25:45 -07:00
kshitij	457c7b76cd	feat(openrouter): add response caching support (#19132 ) Enable OpenRouter's response caching feature (beta) via X-OpenRouter-Cache headers. When enabled, identical API requests return cached responses for free (zero billing), reducing both latency and cost. Configuration via config.yaml: openrouter: response_cache: true # default: on response_cache_ttl: 300 # 1-86400 seconds Changes: - Add openrouter config section to DEFAULT_CONFIG (response_cache + TTL) - Add build_or_headers() in auxiliary_client.py that builds attribution headers plus optional cache headers based on config - Replace inline _OR_HEADERS dicts with build_or_headers() at all 5 sites: run_agent.py __init__, _apply_client_headers_for_base_url(), and auxiliary_client.py _try_openrouter() + _to_async_client() - Add _check_openrouter_cache_status() method to AIAgent that reads X-OpenRouter-Cache-Status from streaming response headers and logs HIT/MISS status - Document in cli-config.yaml.example - Add 28 tests (22 unit + 6 integration) Ref: https://openrouter.ai/docs/guides/features/response-caching	2026-05-03 01:54:24 -07:00
Henkey	9987f3d824	fix(acp): compact Zed tool replay rendering	2026-05-03 01:44:23 -07:00
Henkey	19854c7cd2	Schedule ACP history replay and fence file output	2026-05-03 01:44:23 -07:00
Henkey	eb612f5574	fix(acp): keep web extract rendering compact	2026-05-03 01:44:23 -07:00
Henkey	b294d1d022	fix(acp): keep read-file starts compact	2026-05-03 01:44:23 -07:00
Henkey	72c8037a24	fix(acp): polish common tool rendering	2026-05-03 01:44:23 -07:00
Henkey	ef9a08a872	fix(acp): polish Zed context and tool rendering	2026-05-03 01:44:23 -07:00
Henkey	e26f9b2070	fix(acp): route Zed thoughts to reasoning callbacks	2026-05-03 01:44:23 -07:00
helix4u	4f37669170	fix(tools): reconfigure enabled unconfigured toolsets	2026-05-03 00:33:02 -07:00
helix4u	d409a4409c	fix(model): avoid bedrock credential probe in provider picker	2026-05-03 00:32:55 -07:00
liuhao1024	af98122793	fix(auxiliary): propagate explicit_api_key to _try_openrouter() When resolve_provider_client() passes explicit_api_key for OpenRouter auxiliary tasks, _try_openrouter() now accepts and honors this parameter instead of silently ignoring it and falling back to OPENROUTER_API_KEY env var. Root cause: _try_openrouter() had no explicit_api_key parameter, so even when callers wanted to pass a runtime credential pool key, it could not be used. Fix: - Add explicit_api_key: str = None parameter to _try_openrouter() - Prioritize explicit_api_key over pool key and env var - Update resolve_provider_client() call site to pass explicit_api_key Regression coverage: - Test that explicit_api_key is passed to OpenAI client when provided - Test that fallback to OPENROUTER_API_KEY still works when explicit_api_key is None Closes #18338	2026-05-02 02:27:49 -07:00
teknium1	762eb79f1e	fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451 ) Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot + Feishu on macOS behind Cloudflare Warp. 1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py). Every long-lived platform adapter now constructs httpx.AsyncClient with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the default 5s window let idle keepalive sockets sit in CLOSE_WAIT long enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback, plus the transient Feishu helper) to compound to the 256-fd ulimit. Tunable via HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars. 2. whatsapp.send_typing aiohttp leak. The call was 'await self._http_session.post(...)' with no 'async with' and no variable capture — the ClientResponse went out of scope unclosed, holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in 'async with'. This was the only bare-await aiohttp leak in the gateway/tools/plugins tree per audit; all other aiohttp sites use the context-manager pattern correctly. The underlying reporter also saw Feishu SDK (lark-oapi) connections in CLOSE_WAIT — those are inside the SDK and out of our direct control, but tightening httpx keepalive across adapters reduces the aggregate pool pressure regardless of which individual adapter leaks.	2026-05-02 02:23:37 -07:00
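One way the tunable keepalive settings from 762eb79f1e could be resolved is sketched below. The environment variable names and the defaults (10 connections, 2.0 s expiry) come from the commit message; the function name `keepalive_limit_kwargs` and the fall-back-on-invalid-value behavior are assumptions. The sketch returns plain keyword arguments so it runs without httpx installed.

```python
import os
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


def _read(env: Mapping[str, str], name: str, cast: Callable[[str], T], default: T) -> T:
    """Read and cast an env value; fall back to the default if unset or invalid."""
    raw = env.get(name)
    if raw is None:
        return default
    try:
        return cast(raw)
    except ValueError:
        # Assumption: a malformed override is ignored rather than fatal.
        return default


def keepalive_limit_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments matching httpx.Limits' keepalive parameters."""
    env = os.environ if env is None else env
    return {
        "max_keepalive_connections": _read(env, "HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE", int, 10),
        "keepalive_expiry": _read(env, "HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY", float, 2.0),
    }
```

An adapter could then build its client with something like `httpx.AsyncClient(limits=httpx.Limits(**keepalive_limit_kwargs()))`.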
beibi9966	38dd057e91	fix(feishu): finalize remote document downloads inside httpx.AsyncClient context (#18502) Snapshot Content-Type and body while the client context is still active so pooled connections fully release on exit. Previously the read happened after `async with httpx.AsyncClient(...)` returned, which works today only because httpx eagerly buffers non-streaming responses; a future refactor to `.stream()` would silently read-after-close. Part of the #18451 connection-hygiene audit. Salvage of #18502.	2026-05-02 02:23:37 -07:00
luyao618	13f344c5ce	fix(agent): try fallback providers at init when primary credential pool is exhausted (#17929) When a provider's credential pool has a single entry in 429-cooldown, resolve_provider_client returns None and AIAgent.__init__ raises a misleading RuntimeError suggesting the API key is missing, even when valid fallback_providers are configured. This patch makes __init__ iterate the fallback chain before raising, mirroring the existing in-flight fallback logic in the request loop. If a fallback resolves, the agent initializes against it and sets _fallback_activated=True so _restore_primary_runtime can pick the primary back up after cooldown. Closes #17929	2026-05-02 02:09:46 -07:00
Teknium	1dce908930	fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) (#18761 ) * fix(gateway): config.yaml wins over .env for agent/display/timezone settings Regression from the silent config→env bridge. The bridge at module import time is correct for max_turns (unconditional overwrite), but every other agent., display., timezone, and security bridge key was guarded by 'if X not in os.environ' — so a stale .env entry from an old 'hermes setup' run would shadow the user's current config.yaml indefinitely. Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60 in .env from an old setup, and the gateway silently capped at 60 iterations per turn. Gateway logs confirmed api_calls never exceeded 60. Three changes: 1. gateway/run.py: drop the 'not in os.environ' guards for all agent., display., timezone, and security.* bridge keys. config.yaml is now authoritative for these settings — same semantics already in place for max_turns, terminal., and auxiliary.. Also surface the bridge failure (previously 'except Exception: pass') to stderr so operators see bridge errors instead of silently falling back to .env. 2. gateway/run.py: INFO-log the resolved max_iterations at gateway start so operators can verify the config→env bridge did the right thing instead of chasing a phantom budget ceiling. 3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in the setup wizard. config.yaml is the single source of truth. Also clean up any stale .env entry left behind by pre-fix setups. Regression tests in tests/gateway/test_config_env_bridge_authority.py guard each config→env key against the 'stale .env shadows config' bug. * fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) Three issues observed in production gateway.log during a rapid restart chain on 2026-05-02, all fixed here. 1. _send_restart_notification logged unconditional success adapter.send() catches provider errors (e.g. 
Telegram 'Chat not found') and returns SendResult(success=False); it never raises. The caller ignored the return value and always logged 'Sent restart notification to <chat>' at INFO, producing a misleading success line directly below the 'Failed to send Telegram message' traceback on every boot. Now inspects result.success and logs WARNING with the error otherwise. 2. WhatsApp bridge SIGTERM on shutdown classified as fatal error _check_managed_bridge_exit() saw the bridge's returncode -15 (our own SIGTERM from disconnect()) and fired the full fatal-error path, producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus 'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every planned shutdown, immediately before the normal '✓ whatsapp disconnected'. Adds a _shutting_down flag that disconnect() sets before the terminate, and _check_managed_bridge_exit() returns None for returncode in {0, -2, -15} while shutting down. OOM-kill (137) and other non-signal exits still hit the fatal path. 3. restart_drain_timeout default 60s → 180s On 2026-05-02 01:43:27 a user /restart fired while three agents were mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget expired and all three were force-interrupted. 180s covers realistic in-flight agent turns; users on very-long-reasoning models can still raise it further via agent.restart_drain_timeout in config.yaml. Existing explicit user values are preserved by deep-merge. Tests - tests/gateway/test_restart_notification.py: two new tests assert INFO is only logged on SendResult(success=True) and WARNING with the error string is logged on SendResult(success=False). - tests/gateway/test_whatsapp_connect.py: parametrized test for returncode in {0, -2, -15} proves shutdown-time exits are suppressed; separate test proves returncode 137 (SIGKILL/OOM) still surfaces as fatal even when _shutting_down is set. 
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-default so existing _make_adapter() test helpers that bypass __init__ (pitfall #17 in AGENTS.md) keep working unmodified.	2026-05-02 02:08:06 -07:00
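The shutdown-time classification from item 2 of 1dce908930 can be illustrated with a small predicate. This is a sketch under stated assumptions: `is_fatal_bridge_exit` is a hypothetical name, and the real `_check_managed_bridge_exit()` returns None or an error rather than a boolean. The suppressed return codes {0, -2, -15} are taken from the commit message.

```python
from typing import Optional

# Exit codes expected when the gateway itself stops the bridge:
# clean exit, SIGINT (-2), and SIGTERM (-15).
EXPECTED_SHUTDOWN_CODES = frozenset({0, -2, -15})


def is_fatal_bridge_exit(returncode: Optional[int], shutting_down: bool) -> bool:
    """Decide whether a bridge exit should take the fatal-error path."""
    if returncode is None:
        # Process is still running, nothing to report.
        return False
    if shutting_down and returncode in EXPECTED_SHUTDOWN_CODES:
        # Planned shutdown: the exit was caused by our own terminate call.
        return False
    # Anything else (e.g. 137 from an OOM kill) is still treated as fatal,
    # even while shutting down.
    return True
```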
Teknium	5eac6084bc	fix(discord): warn on 32-char clamp collisions in the /skill collector (#18759 ) Discord's per-command name limit is 32 chars. When two skill slugs share the same first 32 chars (or a skill slug clamps onto a reserved gateway command name), only the first seen wins — the second is dropped from the /skill autocomplete. The old behavior incremented a ``hidden`` counter silently, so skill authors had no way to discover the drop short of noticing their skill was missing from the picker. Not an actively-biting bug today (no collisions on the default catalog as of 2026-05), but a landmine the moment someone ships a skill with a long name. The earlier series in #18745 / #18753 / #18754 dropped the other silent data-loss paths in the Discord /skill collector; this one lights up the last remaining one. Fix: promote ``_names_used`` from a set to a dict keyed by the clamped name, mapping to the source cmd_key (or a ``"<reserved>"`` sentinel for names inherited via ``reserved_names``). On collision, log a WARNING naming both sides — the winner, the loser, the clamped name, and what to rename. Two phrasings: * skill-vs-skill — "both clamp to X on Discord's 32-char command-name limit; only the winner appears in /skill. Rename one skill's frontmatter ``name:`` to differ in its first 32 chars." * skill-vs-reserved — "collides with a reserved gateway command name; the skill will not appear in /skill. Rename the skill's frontmatter ``name:``." Tests: three cases in ``tests/hermes_cli/test_discord_skill_clamp_warning.py`` — skill-vs-skill collision (warning names both cmd_keys + clamped prefix), skill-vs-reserved collision (warning uses the distinct phrasing), and a no-collision negative (zero warnings emitted).	2026-05-02 02:05:01 -07:00
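The set-to-dict promotion in 5eac6084bc can be sketched as below. This is an illustrative approximation, not the collector's actual code: `clamp_skill_names` and the logger name are hypothetical, and the warning text is abbreviated. The 32-character limit and the `"<reserved>"` sentinel follow the commit message.

```python
import logging
from typing import Iterable

logger = logging.getLogger("discord_skill_clamp_sketch")  # hypothetical logger name

RESERVED = "<reserved>"
DISCORD_NAME_LIMIT = 32


def clamp_skill_names(
    cmd_keys: Iterable[str], reserved_names: Iterable[str] = ()
) -> list[tuple[str, str]]:
    """Clamp names to Discord's limit; return (clamped_name, cmd_key) winners."""
    # Dict keyed by clamped name so a collision can name the earlier winner.
    names_used = {name[:DISCORD_NAME_LIMIT]: RESERVED for name in reserved_names}
    kept = []
    for cmd_key in cmd_keys:
        clamped = cmd_key[:DISCORD_NAME_LIMIT]
        winner = names_used.get(clamped)
        if winner is None:
            names_used[clamped] = cmd_key
            kept.append((clamped, cmd_key))
        elif winner == RESERVED:
            logger.warning(
                "%r collides with reserved gateway command name %r; "
                "it will not appear in /skill", cmd_key, clamped)
        else:
            logger.warning(
                "%r and %r both clamp to %r; only %r appears in /skill",
                winner, cmd_key, clamped, winner)
    return kept
```

Keeping the source `cmd_key` as the dict value is what lets the warning name both sides of a collision; a plain set could only report that some collision happened.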
teknium1	e363ced3c3	test(discord): regression coverage for zombie-websocket guard in connect() Covers PR #18224 fix for issue #18187 — when DiscordAdapter.connect() is called a second time without an intervening disconnect(), the previous commands.Bot must be closed before a new one is created. Otherwise both websockets stay connected to Discord's gateway and both fire on_message, producing double responses with different wording.	2026-05-02 02:04:14 -07:00
teknium1	0a6865b328	test(credential_pool): regression coverage for .env vs os.environ precedence Covers PR #18256 fix for issue #18254 — when OPENROUTER_API_KEY is set in BOTH os.environ (stale from parent shell) and ~/.hermes/.env (fresh), _seed_from_env must prefer the .env value. Also guards the fallback case where .env omits the key entirely (Docker/K8s/systemd deployments that only inject via runtime env).	2026-05-02 02:00:32 -07:00
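The precedence rule this test guards can be summarized in a few lines. A minimal sketch, assuming the `.env` file has already been parsed into a dict: `pick_seed_key` is a hypothetical name and does not reflect the real `_seed_from_env` signature.

```python
from typing import Mapping, Optional


def pick_seed_key(
    name: str, dotenv_values: Mapping[str, str], environ: Mapping[str, str]
) -> Optional[str]:
    """Prefer the value from .env; fall back to the process environment."""
    value = dotenv_values.get(name)
    if value:
        # Fresh value from ~/.hermes/.env wins over a stale parent-shell value.
        return value
    # .env omits the key (e.g. Docker/K8s/systemd inject it at runtime).
    return environ.get(name)
```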
Teknium	10297fa23c	fix(discord): `/reload-skills` now refreshes the `/skill` autocomplete live (#18754 ) `_register_skill_group` captured the skill catalog in closure variables (`entries` and `skill_lookup`) so the single `tree.add_command` call at startup owned the only live copy. The closure is never re-entered after startup, so `/reload-skills` — which rescans the on-disk skills dir and refreshes the in-process `_skill_commands` registry — had no way to propagate results into the `/skill` autocomplete on Discord. New skills stayed invisible in the dropdown, and deleted skills returned "Unknown skill" when the stale autocomplete entry was clicked. The fix is purely a dataflow change: promote `entries` and `skill_lookup` to instance attributes (`_skill_entries`, `_skill_lookup`), split the collector-driven rebuild into a helper (`_refresh_skill_catalog_state`), and add a public `refresh_skill_group()` method that re-runs the helper and is safe to call at any point after the initial registration. The gateway's `_handle_reload_skills_command` then iterates `self.adapters` and calls `refresh_skill_group()` on any adapter that exposes it (currently only Discord). Both sync and async implementations are supported; adapters that don't override the method (Telegram's BotCommand menu, Slack subcommand map, etc.) are silently skipped — the in-process `reload_skills()` call covers them. No `tree.sync()` is required because Discord fetches autocomplete options dynamically on every keystroke — mutating the instance state the callbacks already read from is sufficient. That sidesteps the per-app command-bucket rate limit (~5 writes / 20 s) that made the previous bulk-sync-on-reload approach unusable (#16713 context). 
Tests: tests/gateway/test_reload_skills_discord_resync.py — five cases covering (1) refresh replaces entries, (2) entries stay sorted after refresh, (3) collector exception leaves cached state intact, (4) `_refresh_skill_catalog_state` populates the instance attrs, (5) orchestrator calls `refresh_skill_group()` on sync + async adapters and skips adapters that don't expose it.	2026-05-02 02:00:11 -07:00
Teknium	6ec74aec07	fix(gateway): match disabled/optional skills by frontmatter slug, not dir name (#18753 ) _check_unavailable_skill is meant to turn a typed "/foo" command that doesn't resolve into a specific hint — "disabled, enable with hermes skills config" or "available but not installed, install with hermes skills install …" — instead of the generic "unknown command" reply. It was doing the match with `skill_md.parent.name.lower().replace("_", "-")`, comparing that to the typed command. For every skill whose directory name drifted from its declared frontmatter `name:`, that comparison failed and the user got the unhelpful generic path. On a standard install today 19 skills have this drift, e.g.: dir: mlops/stable-diffusion frontmatter: name: Stable Diffusion Image Generation registered slug (what the user types): /stable-diffusion-image-generation dir: mlops/qdrant frontmatter: name: Qdrant Vector Search registered slug: /qdrant-vector-search dir: mlops/flash-attention frontmatter: name: Optimizing Attention Flash registered slug: /optimizing-attention-flash In every case, _check_unavailable_skill would fall through because "stable-diffusion" != "stable-diffusion-image-generation", even with the skill sitting right there on disk. Fix: extract a small `_skill_slug_from_frontmatter` helper that reads the SKILL.md frontmatter and normalizes exactly like scan_skill_commands (lower, spaces/underscores → hyphens, strip non-[a-z0-9-], collapse runs of hyphens, strip edges). Use it in both the disabled-skills branch and the optional-skills branch. The disabled-set membership check now uses the declared frontmatter name (which is what `hermes skills config` writes into skills.disabled / platform_disabled), not the slug. Tests: five cases in tests/gateway/test_unavailable_skill_hint.py — the drift case for the disabled branch, unknown-command negative, matched-but-not-disabled negative, non-alnum stripping, and the drift case for the optional-skills branch. 
All five fail against main and pass with the fix.	2026-05-02 02:00:09 -07:00
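The normalization steps listed in 6ec74aec07 (lower, spaces/underscores to hyphens, strip non-[a-z0-9-], collapse hyphen runs, strip edges) can be written out as a short function. This is a sketch of those steps only: `slugify_skill_name` is a hypothetical name, and reading the frontmatter from SKILL.md is left out.

```python
import re


def slugify_skill_name(name: str) -> str:
    """Normalize a frontmatter `name:` value into a command slug."""
    slug = name.lower()
    slug = re.sub(r"[ _]", "-", slug)        # spaces/underscores -> hyphens
    slug = re.sub(r"[^a-z0-9-]", "", slug)   # strip anything outside [a-z0-9-]
    slug = re.sub(r"-{2,}", "-", slug)       # collapse runs of hyphens
    return slug.strip("-")                   # strip leading/trailing hyphens
```

Applied to the drift examples in the commit message, `slugify_skill_name("Stable Diffusion Image Generation")` gives `stable-diffusion-image-generation`, which is the slug the user types rather than the directory name `stable-diffusion`.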
Teknium	8825e9044c	fix(discord): complete #18741 for /skill autocomplete and drop legacy 25x25 caps (#18745 ) ``discord_skill_commands_by_category`` was lagging the flat ``discord_skill_commands`` collector on two counts. Both were actively dropping skills from Discord's ``/skill`` autocomplete dropdown. 1. External-dir skills were filtered out. #18741 widened the flat collector to accept ``SKILLS_DIR + skills.external_dirs`` but left this sibling collector — the one ``_register_skill_group`` actually uses on Discord — still matching ``SKILLS_DIR`` only. External skills were visible in ``hermes skills list`` and the agent's ``/skill-name`` dispatch but silently absent from Discord's ``/skill`` picker. Widen the accepted roots to match, and derive categories from whichever root the skill lives under so ``<ext>/mlops/foo/SKILL.md`` still lands in the ``mlops`` group. 2. 25-group × 25-subcommand caps were still applied. PR #11580 refactored ``/skill`` to a flat autocomplete (whose options Discord fetches dynamically — no per-command payload concern) and its docstring promises "no hidden skills." The collector kept the old nested-layout caps anyway, silently dropping anything past the 25th alphabetical category. On installs with 29 category dirs today (real example: tail categories ``social-media``, ``software-development``, ``yuanbao`` going missing) this was biting immediately. Remove the caps; ``hidden`` now reports only 32-char name-clamp collisions against reserved names. Tests: guard both behaviors. ``test_no_legacy_25x25_cap`` builds 30 categories × 30 skills each and asserts all 900 are returned. ``test_external_dirs_skills_included`` monkeypatches ``get_external_skills_dirs`` and asserts an external-dir skill makes it into the result grouped under its own top-level directory.	2026-05-02 02:00:06 -07:00
Jacob Lizarraga	2470434d60	fix(telegram): probe polling liveness after reconnect to detect wedged Updater After a transient Telegram 502, _handle_polling_network_error's stop()+start_polling() cycle can leave PTB's Updater with `running=True` but a wedged consumer task that never makes progress. No error_callback fires in that state, so the reconnect ladder never advances past attempt 1, the MAX_NETWORK_RETRIES fatal-error path is never reached, and the gateway sits silent indefinitely. Schedule a heartbeat probe (60s after a successful reconnect) that verifies Updater.running is still True and bot.get_me() responds within a tight asyncio.wait_for timeout. Either failure feeds back into the reconnect ladder so the existing escalation path fires. No PTB-internal coupling, no Application rebuild — minimal additive defense inside the existing reconnect abstraction. Tests cover healthy / Updater non-running / probe timeout / probe network error / already-fatal cases, plus an integration check that the probe is actually scheduled after a successful start_polling(). Closes the silent-wedge case observed in the wild after a transient Telegram 502; existing reconnect tests updated to mock bot.get_me() now that the success path schedules a heartbeat probe.	2026-05-02 01:55:04 -07:00
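A rough sketch of the liveness probe described in the commit above: check that the updater still reports running, then require a `get_me()`-style call to answer within a timeout. The function name `probe_polling_alive`, its callable parameters, and the 10-second default are assumptions made for illustration; the commit only says the timeout is "tight" and does not show the real code.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def probe_polling_alive(
    is_running: Callable[[], bool],
    get_me: Callable[[], Awaitable[Any]],
    timeout: float = 10.0,  # assumed value, not taken from the commit
) -> bool:
    """Return True only if the updater claims to run AND the bot answers in time."""
    if not is_running():
        return False
    try:
        # wait_for cancels the call and raises a timeout error if it stalls.
        await asyncio.wait_for(get_me(), timeout=timeout)
    except Exception:
        # Timeout or network error: treat both as "not alive".
        return False
    return True
```

A caller would feed a False result back into its existing reconnect logic instead of handling recovery inside the probe.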
liuhao1024	9bf260472b	fix(tools): deduplicate tool names at API boundary for Vertex/Azure/Bedrock Providers like Google Vertex, Azure, and Amazon Bedrock reject API requests with duplicate tool names (HTTP 400: 'Tool names must be unique'). The upstream injection paths in run_agent.py already dedup after PR #17335, but two API-boundary functions pass tools through without checking: - agent/auxiliary_client.py: _build_call_kwargs() (all non-Anthropic providers in chat_completions mode) - agent/anthropic_adapter.py: convert_tools_to_anthropic() (Anthropic Messages API path) Add defensive dedup guards at both sites. Duplicates are dropped with a warning log, converting a hard 400 failure into a recoverable condition. This is intentionally conservative — the root-cause dedup in run_agent.py is the primary defense; these guards add resilience against future injection-path regressions. Includes 8 new tests covering unique passthrough, duplicate removal, empty/None edge cases. Closes #18478	2026-05-02 01:51:51 -07:00
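The defensive guard described in the commit above might look roughly like this. It is a sketch under assumptions: the name `dedupe_tools_by_name` is hypothetical, tools are assumed to be dicts carrying their name either under `function.name` (chat-completions style) or at the top level (Anthropic style), and the first occurrence is the one kept.

```python
import logging
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)


def dedupe_tools_by_name(
    tools: Optional[List[Dict[str, Any]]],
) -> Optional[List[Dict[str, Any]]]:
    """Drop tools whose name was already seen, keeping the first occurrence."""
    if not tools:
        # None and empty lists pass through unchanged.
        return tools
    seen = set()
    unique = []
    for tool in tools:
        name = (tool.get("function") or {}).get("name") or tool.get("name")
        if name is not None and name in seen:
            logger.warning("Dropping duplicate tool name: %s", name)
            continue
        if name is not None:
            seen.add(name)
        unique.append(tool)
    return unique
```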
Teknium	699b3679bc	fix(constants): warn once when get_hermes_home() falls back under an active profile (#18746) When HERMES_HOME is unset but ~/.hermes/active_profile names a non-default profile, any data this process writes lands in the default profile, not the one the operator expects. Before this change the fallback was silent, so cross-profile contamination (#18594) was invisible until a user noticed their memory/state ended up in the wrong place. Now we emit a one-shot warning to stderr the first time this happens in a process. No raise: there are 30+ module-level callers of get_hermes_home() and raising from any of them would brick import. Behavior is otherwise unchanged; subprocess spawners (systemd template, kanban dispatcher, docker entrypoint) already propagate HERMES_HOME correctly. Bypasses logging.getLogger() because this runs before logging is configured in a significant fraction of callers (module import time). Refs #18594. Credit to @liuhao1024 for surfacing the silent-fallback case in PR #18600; we kept the diagnostic signal without the import-time raise.	2026-05-02 01:49:55 -07:00
CoreyNoDream	c5e3a6fb5b	fix(cli): decode .env as UTF-8 to avoid GBK crash on Windows Path.read_text() uses the system locale by default. On Windows CN/JP/KR locales (GBK/CP932/CP949), reading a UTF-8 .env raises UnicodeDecodeError as soon as it contains any non-ASCII byte (e.g. an em dash). Pin encoding="utf-8" on every .env read in hermes_cli to match how the rest of the codebase (load_dotenv at doctor.py:26) already decodes it. Adds a regression test that monkeypatches Path.read_text to simulate a GBK locale and asserts 'hermes doctor' no longer raises. Refs #18637	2026-05-02 01:40:31 -07:00
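The fix in the commit above comes down to passing `encoding="utf-8"` to `Path.read_text()` instead of relying on the locale default. A minimal sketch, assuming a simple KEY=VALUE file; the `read_env_file` name and the parsing rules are illustrative only and do not reflect how hermes_cli parses .env files.

```python
from pathlib import Path
from typing import Dict, Union


def read_env_file(path: Union[str, Path]) -> Dict[str, str]:
    """Hypothetical reader: decode as UTF-8 regardless of the system locale."""
    values: Dict[str, str] = {}
    # Pinning the encoding avoids UnicodeDecodeError under GBK/CP932/CP949 locales.
    text = Path(path).read_text(encoding="utf-8")
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values
```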
Teknium	e2cea6eeba	fix(gateway): include external_dirs skills in Telegram/Discord slash commands (#18741) Skills configured through `skills.external_dirs` in config.yaml were visible via `hermes skills list`, `get_skill_commands()`, and the agent's `/skill-name` dispatch, but silently excluded from the Telegram and Discord slash-command menus. The filter in `_collect_gateway_skill_entries` only accepted skills whose `skill_md_path` started with `SKILLS_DIR`, so anything under an external directory fell through. Widen the accepted-prefix set to include all configured external dirs alongside the local skills dir. Every prefix is now slash-terminated so `/my-skills` cannot also admit `/my-skills-extra`. Also guard against empty `skill_md_path` values so they can't accidentally match. Fixes #8110 Salvages #8790 by luyao618. Co-authored-by: Yao <34041715+luyao618@users.noreply.github.com>	2026-05-02 01:36:57 -07:00
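The slash-terminated prefix check mentioned in the commit above can be illustrated like this. The helper name `is_under_any_root` is hypothetical and POSIX-style "/" separators are assumed; the point is only that a terminated prefix stops `/my-skills` from admitting `/my-skills-extra`.

```python
from typing import Iterable


def is_under_any_root(skill_md_path: str, roots: Iterable[str]) -> bool:
    """Hypothetical check: does the path live under one of the accepted roots?"""
    if not skill_md_path:
        # Empty paths must never match any prefix.
        return False
    # Normalize every root so it ends with exactly one trailing slash.
    prefixes = [root.rstrip("/") + "/" for root in roots if root]
    return any(skill_md_path.startswith(prefix) for prefix in prefixes)
```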
Teknium	c73594fe41	fix(skills): rescan skill_commands cache when platform scope changes (#18739) The process-global `_skill_commands` dict in agent/skill_commands.py was seeded by whichever platform scanned first, and `get_skill_commands()` only rescanned when the cache was empty. In a long-lived gateway process serving multiple platforms (Telegram + Discord + Slack), the first platform's `skills.platform_disabled` view was silently inherited by the others, so a skill disabled for Telegram would also disappear from Discord's slash menu, and vice versa. Track the platform scope the cache was populated for (`_skill_commands_platform`) and rescan in `get_skill_commands()` when the currently-active platform no longer matches. Platform resolution uses the same precedence as `_is_skill_disabled`: `HERMES_PLATFORM` env var then `HERMES_SESSION_PLATFORM` from the gateway session context. Fixes #14536 Salvages #14570 by LeonSGP43. Co-authored-by: LeonSGP <leon@sgp43.com>	2026-05-02 01:36:53 -07:00
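The caching change in the commit above can be pictured as a cache that remembers which platform it was filled for. The sketch below is an illustration only: the class `PlatformScopedCache` and its `scan` callback are hypothetical, and the real code uses module-level state (`_skill_commands`, `_skill_commands_platform`) rather than a class.

```python
from typing import Any, Callable, Dict, Optional


class PlatformScopedCache:
    """Hypothetical cache that rescans when the active platform changes."""

    def __init__(self, scan: Callable[[Optional[str]], Dict[str, Any]]) -> None:
        self._scan = scan
        self._commands: Dict[str, Any] = {}
        self._platform: Optional[str] = None
        self._populated = False

    def get(self, platform: Optional[str]) -> Dict[str, Any]:
        # Rescan on first use, or when the cached view belongs to another platform.
        if not self._populated or platform != self._platform:
            self._commands = self._scan(platform)
            self._platform = platform
            self._populated = True
        return self._commands
```

Repeated calls for the same platform reuse the cached result; switching from one platform to another triggers exactly one rescan.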
Teknium	97acd66b4c	fix(curator): authoritative absorbed_into on delete + restore cron skill links on rollback (#18671 ) (#18731 ) * fix(curator): authoritative absorbed_into declarations on skill delete Closes #18671. The classification pipeline that feeds cron-ref rewriting used to infer consolidation vs pruning from two brittle signals: the curator model's post-hoc YAML summary block, and a substring heuristic scanning other tool calls for the removed skill's name. Both miss in real consolidations — the model forgets the YAML under reasoning pressure, and the heuristic misses when the umbrella's patch content describes the absorbed behavior abstractly instead of naming the old slug. When both miss, the skill falls through to 'no-evidence fallback' pruned, and #18253's cron rewriter drops the cron ref entirely instead of mapping it to the umbrella. Same observable symptom as pre-#18253: 'Skill(s) not found and skipped' at the next cron run. The fix makes the model declare intent at the moment of deletion. skill_manage(action='delete') now accepts absorbed_into: - absorbed_into='<umbrella>' -> consolidated, target must exist on disk - absorbed_into='' -> explicit prune, no forwarding target - missing -> legacy path, falls through to heuristic/YAML The curator reconciler reads these declarations off llm_meta.tool_calls BEFORE either the YAML block or the substring heuristic. Declaration wins. Fallback logic stays intact for backward compat with any caller (human or older curator conversation) that doesn't populate the arg. Changes - tools/skill_manager_tool.py: add absorbed_into param to skill_manage + _delete_skill. Validate target exists when non-empty. Reject absorbed_into=<self>. Wire through dispatcher + registry + schema. - agent/curator.py: new _extract_absorbed_into_declarations() walks tool calls for skill_manage(delete) with the arg. _reconcile_classification accepts absorbed_declarations= and treats them as authoritative. 
Curator prompt updated to require the arg on every delete. - Tests: 7 new skill_manager tests covering the tool contract (valid target, empty string, nonexistent target, self-reference, whitespace, backward compat, dispatcher plumbing). 11 new curator tests covering the extractor + authoritative reconciler path + mixed-legacy-and- declared runs. Validation - 307/307 targeted tests pass (curator + cron + skill_manager suites). - E2E #18671 repro: 3 narrow skills, 1 umbrella, cron job referencing all 3. Model emits NO YAML block. Heuristic misses (patch prose doesn't name old slugs). Delete calls carry absorbed_into. Result: both PR skills correctly classified 'consolidated' + cron rewritten ['pr-review-format', 'pr-review-checklist', 'stale-junk'] -> ['hermes-agent-dev']; stale-junk pruned via absorbed_into=''. - E2E backward-compat: delete without absorbed_into, model emits YAML -> routed via existing 'model' source, cron still rewritten correctly. * feat(curator): capture + restore cron skill links across snapshot/rollback Before this, rolling back a curator run restored the skills tree but cron jobs still pointed at the umbrella skills the curator had rewritten them to. The user would see their old narrow skills back on disk but their cron jobs still configured with the merged umbrella — not actually 'back to how it was'. Snapshot side: snapshot_skills() now captures ~/.hermes/cron/jobs.json alongside the skills tarball, as cron-jobs.json. The manifest gets a new 'cron_jobs' block with {backed_up, jobs_count} so rollback (and the CLI confirm dialog) can surface what's in the snapshot. If jobs.json is missing/unreadable/malformed, snapshot proceeds without cron data — the skills backup is the core guarantee; cron is additive. Rollback side: after the skills extract succeeds, the new _restore_cron_skill_links() reconciles the backed-up jobs into the live jobs.json SURGICALLY. Only 'skills' and 'skill' fields are restored, and only on jobs matched by id. 
Everything else about a cron job — schedule, last_run_at, next_run_at, enabled, prompt, workdir, hooks — is live state the user or scheduler has modified since the snapshot; overwriting it would regress unrelated activity. Reconciliation rules: - Job in backup AND live, skills differ → skills restored. - Job in backup AND live, skills match → no-op. - Job in backup, NOT in live → skipped (user deleted it after snapshot; their choice is later than the snapshot). - Job in live, NOT in backup → untouched (user created it after snapshot). - Snapshot missing cron-jobs.json at all → rollback still succeeds, reports 'not captured' (older pre-feature snapshots keep working). Writes go through cron.jobs.save_jobs under the same _jobs_file_lock the scheduler uses, so rollback doesn't race tick(). Also: - hermes_cli/curator.py: rollback confirm dialog now shows 'cron jobs: N (will be restored for skill-link fields only)' when the snapshot has cron data, or 'not in snapshot (<reason>)' otherwise. - rollback()'s message string includes a 'cron links: ...' clause summarizing the reconciliation outcome. Tests - 9 new cases: snapshot-with-cron, snapshot-without-cron, malformed-json captured-as-raw, full rollback-restores-skills-and-cron, rollback touches only skill fields, rollback skips user-deleted jobs, rollback leaves user-created jobs untouched, rollback still works with pre-feature snapshot that has no cron-jobs.json, standalone unit test on _restore_cron_skill_links exercising the full report shape. Validation - 484/484 targeted tests pass (curator + cron + skill_manager suites). - E2E: real snapshot_skills, real cron rewrite, real rollback. Before: ['pr-review-format', 'pr-review-checklist', 'pr-triage-salvage']. After curator: ['hermes-agent-dev']. After rollback: ['pr-review-format', 'pr-review-checklist', 'pr-triage-salvage']. Non-skill fields (id, name, prompt) preserved across the round trip.	2026-05-02 01:29:57 -07:00
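The reconciliation rules in the commit above (restore only the 'skills' and 'skill' fields, only on jobs matched by id) can be sketched as below. This is not the repository's `_restore_cron_skill_links`: the function name `restore_skill_links`, the in-place mutation, and the returned list of ids are assumptions, and file locking and the report shape are omitted.

```python
import copy
from typing import Any, Dict, List

SKILL_FIELDS = ("skills", "skill")


def restore_skill_links(
    backup_jobs: List[Dict[str, Any]],
    live_jobs: List[Dict[str, Any]],
) -> List[Any]:
    """Restore skill-link fields on live jobs in place; return the ids changed."""
    backup_by_id = {job["id"]: job for job in backup_jobs if "id" in job}
    restored = []
    for job in live_jobs:
        old = backup_by_id.get(job.get("id"))
        if old is None:
            # Job created after the snapshot: leave it untouched.
            continue
        changed = False
        for field in SKILL_FIELDS:
            if old.get(field) == job.get(field):
                continue  # already matches, no-op
            if field in old:
                job[field] = copy.deepcopy(old[field])
            else:
                job.pop(field, None)
            changed = True
        if changed:
            restored.append(job["id"])
    # Jobs present only in the backup are skipped: the user deleted them later.
    return restored
```

Schedule, enabled state, and other live fields are never read from the backup, which is what keeps the rollback from regressing unrelated activity.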
Amr Essam	d05a87e686	fix(gateway): clear slack assistant thread status	2026-05-01 14:01:26 -07:00
hinotoi-agent	a147164d3c	fix(slack): preserve per-user slash-command session isolation	2026-05-01 14:01:26 -07:00
nightq	5cdc39e29a	fix(gateway): preserve case-sensitive chat IDs in DeliveryTarget.parse Fixes NousResearch/hermes-agent#11768 Root cause: target.strip().lower() was lowercasing the entire target string, corrupting case-sensitive chat IDs like Slack C123ABC and Matrix !RoomABC. Fix: Only lowercase the platform prefix for case-insensitive matching; preserve the original case for chat_id and thread_id values.	2026-05-01 14:01:26 -07:00
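The fix in the commit above, lowercasing only the platform prefix, might be sketched like this. The target shape `platform:rest` and the helper name `split_platform` are assumptions for illustration; the actual `DeliveryTarget.parse` format and its handling of thread IDs are not shown in the commit message.

```python
from typing import Tuple


def split_platform(target: str) -> Tuple[str, str]:
    """Hypothetical parser: lowercase the platform prefix, keep the rest verbatim."""
    target = target.strip()
    platform, _sep, rest = target.partition(":")
    # Only the platform name is case-insensitive; chat IDs keep their case.
    return platform.lower(), rest
```

The earlier behavior, `target.strip().lower()`, would have turned a Slack ID such as C123ABC into c123abc before any splitting happened.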
YAMAGUCHI Seiji	2b3923ff13	fix(gateway): coerce scalar free_response_channels to str before split YAML loads a bare numeric value such as discord: free_response_channels: 1491973769726791812 as an int. _discord_free_response_channels() / _slack_free_response_channels() checked `isinstance(raw, list)` and `isinstance(raw, str)` in that order and then fell through to `return set()`, so a single-channel config that happened to be unquoted was silently dropped with no log line — the bot kept demanding @mentions even though the channel was configured to free-response. A multi-channel value like `1234567890,9876543210` does not trip this because the comma forces YAML to parse it as a string. Single-channel configs are the only case that breaks, which is exactly the footgun that's hardest to diagnose (the config "looks right" and the feature just doesn't activate). Note that the old-schema env-var bridge at gateway/config.py:614+ already runs `str(frc)` when forwarding to SLACK_/DISCORD_FREE_RESPONSE_CHANNELS, so the env-var fallback worked. The bug only surfaces on the `config.extra["free_response_channels"]` path populated by the `platforms:` bridge at gateway/config.py:576, which passes the raw YAML value through unchanged. Fix at the reader: treat any non-list value as a scalar, coerce with str(), then apply the same CSV split semantics. This keeps the public contract stable (list or str-like continues to work identically) while accepting the ints that the YAML loader is free to hand us. Added tests for both Discord and Slack covering: - bare int value in config.extra - list of ints in config.extra	2026-05-01 14:01:26 -07:00
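The reader-side fix described in the commit above (treat any non-list value as a scalar, coerce with str(), then apply the CSV split) can be sketched as follows. The function name `parse_free_response_channels` is hypothetical and stands in for the Discord and Slack readers named in the commit.

```python
from typing import Any, Set


def parse_free_response_channels(raw: Any) -> Set[str]:
    """Hypothetical reader: accept a list, a CSV string, or a bare scalar."""
    if raw is None:
        return set()
    if isinstance(raw, list):
        # List items may themselves be ints when YAML parses unquoted IDs.
        items = [str(item) for item in raw]
    else:
        # Any other scalar (including an int) is coerced, then CSV-split.
        items = str(raw).split(",")
    return {item.strip() for item in items if item.strip()}
```

An unquoted single channel ID, which YAML hands over as an int, now yields a one-element set instead of being dropped.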
Prive FE Coder	a717199bbf	fix(slack): exclude reserved Slack commands from native slash manifest Slack has built-in slash commands (e.g. /status, /me, /join) that apps cannot register. When running `hermes slack manifest --write`, the generated manifest included /status, causing Slack to reject the entire manifest with a reserved-command error. Add _SLACK_RESERVED_COMMANDS frozenset of all known Slack built-ins and skip them in slack_native_slashes(). Affected commands remain reachable via /hermes <command>. Tests updated: - New test_excludes_slack_reserved_commands validates no leaks - test_includes_canonical_commands no longer asserts /status - test_telegram_parity accounts for expected Slack-only exclusions	2026-05-01 14:01:26 -07:00
kshitijk4poor	8fcc160f6b	fix(gateway/slack): review fixes — scope ephemeral to commands, user isolation Self-review fixes for the slash ephemeral ack: - Only stash response_url when text starts with '/' (gateway command). Free-form questions via '/hermes <question>' must produce public agent replies visible to the whole channel, not ephemeral. - Use a ContextVar (_slash_user_id) to thread the invoking user's ID from _handle_slash_command through to send(). _pop_slash_context now matches the exact (channel_id, user_id) key when the ContextVar is set, preventing concurrent users on the same channel from stealing each other's ephemeral context. ContextVars propagate to child asyncio.Tasks, so the value survives through handle_message → _process_message_background → _send_with_retry → send(). - Add truncate_message() in _send_slash_ephemeral to prevent silent failures on long responses (response_url has the same ~40k limit). - Log send_private_notice failures at debug level instead of bare except/pass — aids diagnostics without spamming. - Document app_mention dedup dependency on shared event ts. - Add tests: free-form question must NOT stash context, concurrent users on the same channel get isolated contexts, non-slash send() path fallback behavior.	2026-05-01 13:33:06 -07:00
probepark	0ab2d752ff	feat(gateway): private notice delivery and Slack format_message fixes Adds platform-level private notice delivery abstraction so operational messages (e.g. sethome prompt) can be sent ephemerally on Slack when configured with `slack.notice_delivery: private`. Changes: - gateway/config.py: _normalize_notice_delivery() + GatewayConfig.get_notice_delivery() with per-platform config bridging - gateway/platforms/base.py: send_private_notice() default implementation (falls through to send()) - gateway/platforms/slack.py: send_private_notice() via chat_postEphemeral - gateway/run.py: _deliver_platform_notice() helper replaces direct adapter.send() for the sethome notice, with private→public fallback - gateway/platforms/slack.py: app_mention handler now forwards to _handle_slack_message (safe due to ts-based dedup) instead of no-op pass, fixing edge-case Slack configs where mentions arrive only as app_mention - gateway/platforms/slack.py format_message: negative lookbehind prevents markdown images (![]()) from becoming broken Slack links; italic regex now requires non-whitespace boundaries so 'a * b * c' stays literal Based on PR #9340 by @probepark.	2026-05-01 13:33:06 -07:00
kshitijk4poor	7cda0e5224	fix(gateway/slack): ephemeral ack and routing for slash commands Slack slash commands (/q, /btw, /stop, /model, etc.) previously showed no user-visible acknowledgement and posted command replies as public channel messages. This diverged from Discord, which uses ephemeral deferred responses for slash commands. Changes: - handle_hermes_command now passes response_type='ephemeral' and a 'Running /cmd…' text to ack(), giving the user immediate 'Only visible to you' feedback when they invoke any native slash command. - _handle_slash_command stashes the Slack response_url from the command payload in a per-channel context dict before dispatching to handle_message. - send() checks for a pending slash context and, when found, POSTs to the response_url with replace_original=true to swap the initial ack with the real command reply (e.g. 'Queued for the next turn.'), keeping it ephemeral. - Stale slash contexts are garbage-collected on lookup (120s TTL). - The response_url POST is non-fatal: if it fails, the user already saw the initial ack, and send() returns success=True. Fixes #18182	2026-05-01 13:33:06 -07:00
Teknium	f99676e315	fix(gateway): auto-restart when source files change out from under us (#17648) (#18409) Long-running gateway processes that survive 'hermes update' keep pre-update modules cached in sys.modules. When new tool files on disk then try to 'from hermes_cli.config import cfg_get' (added in PR #17304), the import resolves against the stale module object and raises ImportError, hitting users on Matrix, Telegram, Feishu, and other platforms. Two defenses: 1. Gateway self-check (gateway/run.py). On __init__, snapshot the newest mtime across sentinel source files (hermes_cli/config.py, run_agent.py, gateway/run.py, etc.). On every inbound message, re-read those mtimes; if any is newer than boot time + 2s slack, request a graceful restart via the normal drain path and return a one-line ack to the user. Idempotent, works regardless of how the update happened (hermes update, manual git pull, installer). 2. Post-restart survivor sweep ('hermes update'). After the existing restart loop, sleep 3s, rescan for gateway PIDs we already tried to kill, and SIGKILL any survivors. The detached profile watchers and systemd then relaunch with fresh code instead of waiting out the 120s watcher timeout. Closes #17648.	2026-05-01 09:50:08 -07:00
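The gateway self-check from the commit above (snapshot the newest mtime at boot, then compare on each inbound message with 2s of slack) might look roughly like this. The helper names `newest_mtime` and `source_changed` are hypothetical, and the restart request itself is out of scope for the sketch.

```python
import os
from typing import Iterable


def newest_mtime(paths: Iterable[str]) -> float:
    """Newest modification time across the sentinel files; 0.0 if none are readable."""
    mtimes = []
    for path in paths:
        try:
            mtimes.append(os.path.getmtime(path))
        except OSError:
            # A missing or unreadable sentinel is ignored rather than fatal.
            continue
    return max(mtimes, default=0.0)


def source_changed(paths: Iterable[str], boot_mtime: float, slack: float = 2.0) -> bool:
    """True when any sentinel is newer than the boot snapshot plus the slack window."""
    return newest_mtime(paths) > boot_mtime + slack
```

A process would call `newest_mtime` once at startup, keep the result, and call `source_changed` with it on each inbound message.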


3054 Commits