Files
hermes-agent/hermes_cli/_subprocess_compat.py

235 lines
9.3 KiB
Python
Raw Normal View History

feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags Second pass on native Windows support, driven by a systematic audit across five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG, os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns (npm.cmd batch shims, start_new_session no-op on Windows), subsystem health (cron, gateway daemon, update flow), and module-level import guards. Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux, test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix. ## New module - hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command, windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs). All no-ops on non-Windows. ## CRITICAL fixes (would crash or silently break on Windows) - tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would AttributeError on import on Windows, breaking `hermes --tui` entirely (it spawns this module as a subprocess). Guard each signal.signal() call with hasattr() and add SIGBREAK as Windows' SIGHUP equivalent. - hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop behind `os.name != "nt"` — Windows has no zombies anyway. - tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0 ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows). Generated sandbox client parses both. - cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use shutil.which("bash") so Windows (Git Bash via MinGit) works, with a readable error when bash is genuinely absent. - 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py (npm install + node version probe), browser_tool.py x2. On Windows npm is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...]) fails with WinError 193. shutil.which(...) returns the absolute .cmd path which CreateProcessW accepts because the extension routes through cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the same path subprocess would resolve itself). ## HIGH fixes (silent misbehaviour on Windows) - tools/environments/local.py get_temp_dir: hardcoded /tmp returned on Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd tracking silently broken — `cd` in terminal tool did nothing. Windows branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes (works in both bash and Python, guaranteed no spaces). - tools/environments/local.py _make_run_env PATH injection: `/usr/bin not in split(":")` heuristic mangles Windows PATH (";" separator). Gate the injection behind `not _IS_WINDOWS`. - hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer Popen + watcher-script Popen both used start_new_session=True, which Windows silently ignores. Watcher stayed attached to CLI's console, died when user closed terminal after `hermes update`, left gateway stale. Now branches through windows_detach_popen_kwargs() helper (CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on Windows, start_new_session=True on POSIX — identical to main). ## MEDIUM fixes - gateway/run.py /restart and /update handlers: hardcoded bash/setsid chain crashes on Windows when user triggers /update in-gateway. Now has sys.platform=="win32" branch using sys.executable + a tiny Python watcher with proper detach flags. POSIX path is unchanged. - cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/... style paths that break subprocess.Popen(cwd=...) and Path().resolve(). Added _normalize_git_bash_path() helper that translates /c/Users, /cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op. _git_repo_root() now routes every result through it. - cli.py worktree .worktreeinclude: os.symlink on directories failed hard on Windows (requires admin or Developer Mode). Falls back to shutil.copytree with a warning log. ## Tests - 29 new tests in tests/tools/test_windows_native_support.py covering: subprocess_compat helpers, TUI entry signal guards, kanban waitpid guard, code_execution TCP fallback source-level invariants, cron bash resolution, npm/npx bare-spawn lint per-file, local env Windows temp dir, PATH injection gating, git bash path normalization, symlink fallback, gateway detached watcher flags. - One existing test assertion adjusted in test_browser_homebrew_paths: it compared captured Popen argv to the BARE `"npx"` literal; after the shutil.which() change argv[0] is the absolute path. New assertion checks the shape (two items, second is `agent-browser`) rather than the exact first-item string. Behaviour unchanged; test was too strict. All 56 tests pass on Linux (30 from previous commits + 26 new). 267 tests from the affected files/dirs (browser, code_exec, local_env, process_registry, kanban_db, windows_compat) all pass — zero regressions. tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged; all pre-existing test failures confirmed unrelated via `git stash` re-run. ## What's still deferred (LOW priority) - Visible cmd-window flashes on short-lived console apps (~14 sites) — cosmetic, needs a follow-up pass once we have user reports. - agent/file_safety.py POSIX-only security deny patterns — separate hardening task. - tools/process_registry.py returning "/tmp" as fallback — theoretical; reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
"""Windows subprocess compatibility helpers.
Hermes is developed on Linux / macOS and tested natively on Windows too.
Several common subprocess patterns break silently-or-loudly on Windows:
* ``["npm", "install", ...]`` on Windows ``npm`` is ``npm.cmd``, a batch
shim. ``subprocess.Popen(["npm", ...])`` fails with WinError 193
("not a valid Win32 application") because CreateProcessW can't run a
``.cmd`` file without ``shell=True`` or PATHEXT resolution.
* ``start_new_session=True`` on POSIX, this maps to ``os.setsid()`` and
actually detaches the child. On Windows it's silently ignored; the
Windows equivalent is ``CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS``
creationflags, which Python only applies when you pass them explicitly.
* Console-window flashes every ``subprocess.Popen`` of a ``.exe`` on
Windows spawns a cmd window briefly unless ``CREATE_NO_WINDOW`` is
passed. Cosmetic but jarring for background daemons.
This module centralizes the platform-branching logic so the rest of the
codebase doesn't sprinkle ``if sys.platform == "win32":`` everywhere.
**All helpers are no-ops on non-Windows** calling them in Linux/macOS
code paths is safe by design. That's the "do no damage on POSIX"
guarantee.
"""
from __future__ import annotations
import shutil
import sys
from typing import Sequence
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags Second pass on native Windows support, driven by a systematic audit across five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG, os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns (npm.cmd batch shims, start_new_session no-op on Windows), subsystem health (cron, gateway daemon, update flow), and module-level import guards. Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux, test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix. ## New module - hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command, windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs). All no-ops on non-Windows. ## CRITICAL fixes (would crash or silently break on Windows) - tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would AttributeError on import on Windows, breaking `hermes --tui` entirely (it spawns this module as a subprocess). Guard each signal.signal() call with hasattr() and add SIGBREAK as Windows' SIGHUP equivalent. - hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop behind `os.name != "nt"` — Windows has no zombies anyway. - tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0 ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows). Generated sandbox client parses both. - cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use shutil.which("bash") so Windows (Git Bash via MinGit) works, with a readable error when bash is genuinely absent. - 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py (npm install + node version probe), browser_tool.py x2. On Windows npm is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...]) fails with WinError 193. shutil.which(...) returns the absolute .cmd path which CreateProcessW accepts because the extension routes through cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the same path subprocess would resolve itself). ## HIGH fixes (silent misbehaviour on Windows) - tools/environments/local.py get_temp_dir: hardcoded /tmp returned on Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd tracking silently broken — `cd` in terminal tool did nothing. Windows branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes (works in both bash and Python, guaranteed no spaces). - tools/environments/local.py _make_run_env PATH injection: `/usr/bin not in split(":")` heuristic mangles Windows PATH (";" separator). Gate the injection behind `not _IS_WINDOWS`. - hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer Popen + watcher-script Popen both used start_new_session=True, which Windows silently ignores. Watcher stayed attached to CLI's console, died when user closed terminal after `hermes update`, left gateway stale. Now branches through windows_detach_popen_kwargs() helper (CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on Windows, start_new_session=True on POSIX — identical to main). ## MEDIUM fixes - gateway/run.py /restart and /update handlers: hardcoded bash/setsid chain crashes on Windows when user triggers /update in-gateway. Now has sys.platform=="win32" branch using sys.executable + a tiny Python watcher with proper detach flags. POSIX path is unchanged. - cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/... style paths that break subprocess.Popen(cwd=...) and Path().resolve(). Added _normalize_git_bash_path() helper that translates /c/Users, /cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op. _git_repo_root() now routes every result through it. - cli.py worktree .worktreeinclude: os.symlink on directories failed hard on Windows (requires admin or Developer Mode). Falls back to shutil.copytree with a warning log. ## Tests - 29 new tests in tests/tools/test_windows_native_support.py covering: subprocess_compat helpers, TUI entry signal guards, kanban waitpid guard, code_execution TCP fallback source-level invariants, cron bash resolution, npm/npx bare-spawn lint per-file, local env Windows temp dir, PATH injection gating, git bash path normalization, symlink fallback, gateway detached watcher flags. - One existing test assertion adjusted in test_browser_homebrew_paths: it compared captured Popen argv to the BARE `"npx"` literal; after the shutil.which() change argv[0] is the absolute path. New assertion checks the shape (two items, second is `agent-browser`) rather than the exact first-item string. Behaviour unchanged; test was too strict. All 56 tests pass on Linux (30 from previous commits + 26 new). 267 tests from the affected files/dirs (browser, code_exec, local_env, process_registry, kanban_db, windows_compat) all pass — zero regressions. tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged; all pre-existing test failures confirmed unrelated via `git stash` re-run. ## What's still deferred (LOW priority) - Visible cmd-window flashes on short-lived console apps (~14 sites) — cosmetic, needs a follow-up pass once we have user reports. - agent/file_safety.py POSIX-only security deny patterns — separate hardening task. - tools/process_registry.py returning "/tmp" as fallback — theoretical; reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
__all__ = [
"IS_WINDOWS",
"resolve_node_command",
"windows_detach_flags",
fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit (#40956) #40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`, which fixed the headline bug (gateway dies after Desktop GUI update and never comes back). The flag's own docstring acknowledges that restrictive parent job objects can still refuse breakaway with `ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen` call: "Callers in this codebase already wrap detached spawns in try/except OSError and fall back to a cmd.exe wrapper, so the breakaway-denied case degrades gracefully rather than crashing." That's true for `_spawn_detached` in `gateway_windows.py` (the `hermes gateway start` path), which has both the breakaway bit AND a retry-without-breakaway fallback. It's NOT true for the post-update watcher path in `launch_detached_profile_gateway_restart` (`hermes_cli/gateway.py`), which only has `except OSError: return False` and gives up entirely. If a user's shell/terminal/container wraps Hermes in a breakaway-denying job, the gateway-respawn watcher silently fails to launch instead of trying again without breakaway. This PR closes that gap and adds the regression tests that were missing from the original fix. ## Changes ### `hermes_cli/_subprocess_compat.py` Adds a sibling helper `windows_detach_flags_without_breakaway()` so callers can express the fallback symbolically (via the helper) rather than coding the magic `& ~0x01000000` mask at every site. Documented on `windows_detach_flags` and `windows_detach_flags_without_breakaway` with the recommended try/except pattern. ### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart` Two changes, both aligned with the canonical pattern in `gateway_windows._spawn_detached`: 1. The outer watcher Popen now wraps in `try/except OSError`, and on failure retries with `windows_detach_flags_without_breakaway()` (POSIX never reaches this branch — `start_new_session=True` can't raise OSError). 2. The inlined respawn payload (the `python -c` watcher) also wraps its CreateProcess in try/except OSError and retries with `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters because the watcher's job-object inheritance is independent of the outer process's — even if the outer Popen succeeds with breakaway, the respawned gateway might inherit a job that doesn't. ### Regression tests in `tests/tools/test_windows_native_support.py` #40909 shipped the fix without any test that the breakaway bit is present (the existing `test_windows_detach_flags_has_expected_win32_bits` asserted only the three legacy bits). Four new tests close that: - `test_windows_detach_flags_includes_breakaway_from_job` — explicit assertion that the breakaway bit is in the default bundle, with the rationale spelled out in the docstring so a future maintainer staring at this test understands why removing it would resurrect the gateway-dies-after-GUI-update bug. - `test_windows_detach_flags_without_breakaway_drops_only_that_bit` — fallback payload keeps the other three detach bits intact. - `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway` — static-text check on the stringified watcher payload. The inlined Python program isn't reachable via normal import-time inspection because it lives in a `textwrap.dedent("""...""")` literal that gets passed to a separate `python -c` interpreter. Asserting that both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex literal) appear inside the dedent block is a sufficient regression guard against accidental refactors. - `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback` — static check that this PR's fallback retry is wired up symbolically. Without standing up a real Windows job object that refuses breakaway, we can't trigger the OSError in a unit test; the text guard catches the case where a future refactor removes the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry. Also extends `test_windows_detach_flags_has_expected_win32_bits` to include the breakaway bit assertion and updates `test_windows_flags_zero_on_posix` to cover the new helper. ## Tests Locally on Windows: 8/8 in the `-k "detach or breakaway or popen_kwargs or launch_detached or gateway_run_update or hermes_cli_gateway"` slice pass. Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`: 172 passed, 10 failed. All 10 failures are pre-existing POSIX-only tests running on a Windows host (os.geteuid, SIGKILL fallback, is_linux fixture mismatches). Stashing this PR and re-running on bare post-#40909 main reproduces all 10 identically — none are regressions. POSIX paths unchanged: `windows_detach_flags()` and `windows_detach_flags_without_breakaway()` both return 0 off Windows, `windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`. ## Out of scope - The other detached-spawn site in `hermes_cli/gateway.py` (around line 3068) also uses `windows_detach_popen_kwargs()` + `except OSError`. It deserves the same fallback treatment but the codepath is different enough (not the update-flow watcher) that it warrants a separate PR with its own scrutiny. - `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs` too — same reasoning. ## Context Follow-up to #40909 (merged). I had a parallel PR (#40934, closed) that duplicated the core breakaway fix; the bits unique to that PR that #40909 didn't cover are the contents of this one. Closing #40934 and opening this slimmed-down version as the focused follow-up.
2026-06-07 01:21:58 -07:00
"windows_detach_flags_without_breakaway",
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags Second pass on native Windows support, driven by a systematic audit across five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG, os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns (npm.cmd batch shims, start_new_session no-op on Windows), subsystem health (cron, gateway daemon, update flow), and module-level import guards. Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux, test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix. ## New module - hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command, windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs). All no-ops on non-Windows. ## CRITICAL fixes (would crash or silently break on Windows) - tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would AttributeError on import on Windows, breaking `hermes --tui` entirely (it spawns this module as a subprocess). Guard each signal.signal() call with hasattr() and add SIGBREAK as Windows' SIGHUP equivalent. - hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop behind `os.name != "nt"` — Windows has no zombies anyway. - tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0 ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows). Generated sandbox client parses both. - cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use shutil.which("bash") so Windows (Git Bash via MinGit) works, with a readable error when bash is genuinely absent. - 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py (npm install + node version probe), browser_tool.py x2. On Windows npm is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...]) fails with WinError 193. shutil.which(...) returns the absolute .cmd path which CreateProcessW accepts because the extension routes through cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the same path subprocess would resolve itself). ## HIGH fixes (silent misbehaviour on Windows) - tools/environments/local.py get_temp_dir: hardcoded /tmp returned on Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd tracking silently broken — `cd` in terminal tool did nothing. Windows branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes (works in both bash and Python, guaranteed no spaces). - tools/environments/local.py _make_run_env PATH injection: `/usr/bin not in split(":")` heuristic mangles Windows PATH (";" separator). Gate the injection behind `not _IS_WINDOWS`. - hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer Popen + watcher-script Popen both used start_new_session=True, which Windows silently ignores. Watcher stayed attached to CLI's console, died when user closed terminal after `hermes update`, left gateway stale. Now branches through windows_detach_popen_kwargs() helper (CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on Windows, start_new_session=True on POSIX — identical to main). ## MEDIUM fixes - gateway/run.py /restart and /update handlers: hardcoded bash/setsid chain crashes on Windows when user triggers /update in-gateway. Now has sys.platform=="win32" branch using sys.executable + a tiny Python watcher with proper detach flags. POSIX path is unchanged. - cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/... style paths that break subprocess.Popen(cwd=...) and Path().resolve(). Added _normalize_git_bash_path() helper that translates /c/Users, /cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op. _git_repo_root() now routes every result through it. - cli.py worktree .worktreeinclude: os.symlink on directories failed hard on Windows (requires admin or Developer Mode). Falls back to shutil.copytree with a warning log. ## Tests - 29 new tests in tests/tools/test_windows_native_support.py covering: subprocess_compat helpers, TUI entry signal guards, kanban waitpid guard, code_execution TCP fallback source-level invariants, cron bash resolution, npm/npx bare-spawn lint per-file, local env Windows temp dir, PATH injection gating, git bash path normalization, symlink fallback, gateway detached watcher flags. - One existing test assertion adjusted in test_browser_homebrew_paths: it compared captured Popen argv to the BARE `"npx"` literal; after the shutil.which() change argv[0] is the absolute path. New assertion checks the shape (two items, second is `agent-browser`) rather than the exact first-item string. Behaviour unchanged; test was too strict. All 56 tests pass on Linux (30 from previous commits + 26 new). 267 tests from the affected files/dirs (browser, code_exec, local_env, process_registry, kanban_db, windows_compat) all pass — zero regressions. tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged; all pre-existing test failures confirmed unrelated via `git stash` re-run. ## What's still deferred (LOW priority) - Visible cmd-window flashes on short-lived console apps (~14 sites) — cosmetic, needs a follow-up pass once we have user reports. - agent/file_safety.py POSIX-only security deny patterns — separate hardening task. - tools/process_registry.py returning "/tmp" as fallback — theoretical; reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
"windows_hide_flags",
"windows_detach_popen_kwargs",
]
IS_WINDOWS = sys.platform == "win32"
# -----------------------------------------------------------------------------
# Node ecosystem launcher resolution
# -----------------------------------------------------------------------------
def resolve_node_command(name: str, argv: Sequence[str]) -> list[str]:
"""Resolve a Node-ecosystem command name to an absolute-path argv.
On Windows, commands like ``npm``, ``npx``, ``yarn``, ``pnpm``,
``playwright``, ``prettier`` ship as ``.cmd`` files (batch shims).
``subprocess.Popen(["npm", "install"])`` fails with WinError 193
because CreateProcessW doesn't execute batch files directly.
``shutil.which(name)`` *does* resolve ``.cmd`` via PATHEXT and returns
the fully-qualified path which CreateProcessW accepts because the
extension tells Windows to route through ``cmd.exe /c``.
On POSIX ``shutil.which`` also returns a fully-qualified path when
found. That's a small change from bare-name resolution (the OS does
its own PATH search) but functionally identical and has the side
benefit of making the argv reproducible in logs.
Behavior when the command is not on PATH:
- On Windows: return the bare name caller can still try with
``shell=True`` as a last resort, OR the subsequent Popen will
raise FileNotFoundError with a readable error we want to surface.
- On POSIX: same. Bare ``npm`` on a Linux box without npm installed
fails the same way it did before this function existed.
Args:
name: The command name to resolve (``npm``, ``npx``, ``node`` ).
argv: The remaining arguments. Must NOT include ``name`` itself
this function builds the full argv list.
Returns:
A list suitable for passing to subprocess.Popen/run/call.
"""
resolved = shutil.which(name)
if resolved:
return [resolved, *argv]
return [name, *argv]
# -----------------------------------------------------------------------------
# Detached / hidden process creation
# -----------------------------------------------------------------------------
# Win32 CreationFlags — defined here rather than imported from subprocess
# because CREATE_NO_WINDOW and DETACHED_PROCESS aren't guaranteed to be
# present on stdlib subprocess on older Pythons or non-Windows builds.
_CREATE_NEW_PROCESS_GROUP = 0x00000200
_DETACHED_PROCESS = 0x00000008
_CREATE_NO_WINDOW = 0x08000000
fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909) * fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep Three coordinated fixes for the Windows gateway reliability story: 1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn The 'hermes update' triggered from the Electron Desktop GUI ran inside Electron's job object. Without breakaway, the post-update gateway watcher spawned by update — already DETACHED_PROCESS — was still reaped when Electron's job tore down, so the gateway never came back after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000) to: - hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by every helper that calls windows_detach_popen_kwargs(), including launch_detached_profile_gateway_restart() - The watcher subprocess's own respawn snippet in hermes_cli/gateway.py (inlined flags so the watcher's child respawn also breaks away) _spawn_detached() in gateway_windows.py already had the flag; this change brings the rest of the codebase to parity. 2. Per-minute supervisor Scheduled Task — Windows equivalent of systemd Restart=always Introduces hermes_cli/gateway_supervisor.py and registers it as a second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1, LIMITED rights) alongside the existing ONLOGON task. Every minute, the supervisor uses the same gateway.status.get_running_pid() probe as 'hermes gateway status' and, if no gateway is alive, calls gateway_windows._spawn_detached() (which now includes BREAKAWAY) to bring one back. Covers every crash mode, not just 'machine rebooted': taskkill, OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw startup per minute when down, one PID-existence check per minute when up. Wired into both the schtasks-success and Startup-folder-fallback install paths via _install_supervisor_best_effort(), and removed in uninstall(). Best-effort: a failing supervisor install logs a warning but doesn't roll back the primary install. 3. 'hermes gateway status --deep' shows per-probe PASS/FAIL Replaces the existing terse '--deep' output (which only printed paths) with an actual diagnostic table: [1] PID file present [2] Lock file held by a live process [3] get_running_pid() result [4] _pid_exists(pid) — OS-level liveness [5] gateway_state.json (state + age) [6] Last lifecycle event from gateway-exit-diag.log When the high-level summary disagrees with reality, the user can see exactly which signal is lying. Test-leak fix ------------- tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages monkey-patched is_linux/is_wsl/supports_systemd_services to simulate WSL but did NOT stub is_windows(). On a Windows host, the dispatcher in _gateway_command_inner takes the is_windows() branch BEFORE the WSL guidance branch, so the test invoked gateway_windows.install() for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd — the REAL user Startup folder, never sandboxed by tmp_path — pointing at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper. When pytest tore down the tmp_path, every subsequent Windows login flashed a cmd.exe window that failed to find the missing target. Stubs is_windows=False on all four affected tests: test_install_wsl_no_systemd test_start_wsl_no_systemd test_status_wsl_running_manual test_status_wsl_not_running Defense-in-depth: _build_startup_launcher() now prefixes the launcher with 'if not exist <target> exit /b 0', so any future stale Startup entry silently no-ops instead of flashing a console window. Status enhancements ------------------- - status() now reports supervisor task presence alongside the existing schtasks/Startup info, and nudges the user to reinstall if the supervisor isn't registered. - Deep mode dumps both the supervisor task name + script path. * fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes Earlier in this branch we added a per-minute schtasks-based supervisor to respawn the gateway after crashes / GUI-update SIGTERMs. The implementation flashed a brief console window on every firing, which stole window focus. We tried several variants: - cmd.exe wrapper invoking pythonw -> flashes (cmd.exe is console-subsystem) - schtasks /TR pointing at pythonw -> flashes (uv venv launcher pythonw is actually subsystem=Console, not GUI; it respawns the real pythonw) - schtasks /TR pointing at base uv -> still flashes (Task Scheduler-side conhost preallocation; documented Windows quirk) - XML registration with <Hidden>true> -> still flashes (<Hidden> only hides the task in the Task Scheduler UI, not the spawned window) Researched what leading projects do: - Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor. - Tailscale: real Windows Service via SCM. Session 0, no console possible. - Syncthing: --no-console flag inside the binary + Startup folder. - openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but Super User Q971162 confirms focus-steal still occurs in some cases. None of these use a per-minute polling scheduled task. The 'auto-restart on crash' responsibility belongs INSIDE the daemon (Tailscale's in-process recovery / Ollama's monitor+worker pair) OR is delegated to the Windows Service Control Manager — not Task Scheduler. So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB fix in _subprocess_compat.py (from commit c1e5fa433) survives — that is the *real* fix for problem #2 (GUI-update kills gateway): the post-update watcher in launch_detached_profile_gateway_restart() now breaks out of Electron's job object, so the gateway respawn watcher survives the GUI quit and successfully respawns the gateway. Surviving from c1e5fa433: * CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2) * Inlined breakaway flag in the watcher respawn snippet in gateway.py * hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility) * 'if not exist <target> exit /b 0' guard in _build_startup_launcher (fixes #3 — silent no-op for stale Startup entries) * tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause of #3 — pytest WSL tests no longer leak Startup entries on Win hosts) Removed in this commit: * hermes_cli/gateway_supervisor.py (entire file) * Supervisor section in hermes_cli/gateway_windows.py (~180 lines): get_supervisor_task_name, get_supervisor_script_path, _build_supervisor_cmd_script, _write_supervisor_script, _install_supervisor_task, is_supervisor_task_registered, _install_supervisor_best_effort * _install_supervisor_best_effort() calls in install() (3 spots) * supervisor cleanup block in uninstall() * supervisor display lines in status() / status(deep=True) Future direction (out of scope for this PR): the right place for Windows 'Restart=always' semantics is a real Windows Service installed via pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM auto-restart, no console window possible. That's a meaningful next-PR project, not a band-aid. Tests: 51 pass / 2 pre-existing failures in tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are TestSupportsSystemdServicesWSL cases that fail on origin/main too — unrelated to this PR).
2026-06-06 19:53:58 -07:00
# Escape any Win32 job object the parent process belongs to. Without this,
# a detached child still inherits its parent's job object membership, and
# when that parent (Electron, Tauri, Windows Terminal, the Desktop GUI's
# bootstrap-installer) dies, the OS tears down the whole job — taking the
# "detached" child with it. Critical for the post-update gateway watcher:
# Electron spawns the Tauri updater inside its own job, the updater spawns
# the watcher subprocess; without BREAKAWAY the watcher dies the instant
# Electron exits, so the gateway never gets respawned after a `hermes
# update` triggered from the GUI. See fix/windows-gateway-reliability.
_CREATE_BREAKAWAY_FROM_JOB = 0x01000000
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags Second pass on native Windows support, driven by a systematic audit across five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG, os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns (npm.cmd batch shims, start_new_session no-op on Windows), subsystem health (cron, gateway daemon, update flow), and module-level import guards. Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux, test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix. ## New module - hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command, windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs). All no-ops on non-Windows. ## CRITICAL fixes (would crash or silently break on Windows) - tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would AttributeError on import on Windows, breaking `hermes --tui` entirely (it spawns this module as a subprocess). Guard each signal.signal() call with hasattr() and add SIGBREAK as Windows' SIGHUP equivalent. - hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop behind `os.name != "nt"` — Windows has no zombies anyway. - tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0 ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows). Generated sandbox client parses both. - cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use shutil.which("bash") so Windows (Git Bash via MinGit) works, with a readable error when bash is genuinely absent. - 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py (npm install + node version probe), browser_tool.py x2. On Windows npm is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...]) fails with WinError 193. shutil.which(...) returns the absolute .cmd path which CreateProcessW accepts because the extension routes through cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the same path subprocess would resolve itself). ## HIGH fixes (silent misbehaviour on Windows) - tools/environments/local.py get_temp_dir: hardcoded /tmp returned on Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd tracking silently broken — `cd` in terminal tool did nothing. Windows branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes (works in both bash and Python, guaranteed no spaces). - tools/environments/local.py _make_run_env PATH injection: `/usr/bin not in split(":")` heuristic mangles Windows PATH (";" separator). Gate the injection behind `not _IS_WINDOWS`. - hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer Popen + watcher-script Popen both used start_new_session=True, which Windows silently ignores. Watcher stayed attached to CLI's console, died when user closed terminal after `hermes update`, left gateway stale. Now branches through windows_detach_popen_kwargs() helper (CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on Windows, start_new_session=True on POSIX — identical to main). ## MEDIUM fixes - gateway/run.py /restart and /update handlers: hardcoded bash/setsid chain crashes on Windows when user triggers /update in-gateway. Now has sys.platform=="win32" branch using sys.executable + a tiny Python watcher with proper detach flags. POSIX path is unchanged. - cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/... style paths that break subprocess.Popen(cwd=...) and Path().resolve(). Added _normalize_git_bash_path() helper that translates /c/Users, /cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op. _git_repo_root() now routes every result through it. - cli.py worktree .worktreeinclude: os.symlink on directories failed hard on Windows (requires admin or Developer Mode). Falls back to shutil.copytree with a warning log. ## Tests - 29 new tests in tests/tools/test_windows_native_support.py covering: subprocess_compat helpers, TUI entry signal guards, kanban waitpid guard, code_execution TCP fallback source-level invariants, cron bash resolution, npm/npx bare-spawn lint per-file, local env Windows temp dir, PATH injection gating, git bash path normalization, symlink fallback, gateway detached watcher flags. - One existing test assertion adjusted in test_browser_homebrew_paths: it compared captured Popen argv to the BARE `"npx"` literal; after the shutil.which() change argv[0] is the absolute path. New assertion checks the shape (two items, second is `agent-browser`) rather than the exact first-item string. Behaviour unchanged; test was too strict. All 56 tests pass on Linux (30 from previous commits + 26 new). 267 tests from the affected files/dirs (browser, code_exec, local_env, process_registry, kanban_db, windows_compat) all pass — zero regressions. tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged; all pre-existing test failures confirmed unrelated via `git stash` re-run. ## What's still deferred (LOW priority) - Visible cmd-window flashes on short-lived console apps (~14 sites) — cosmetic, needs a follow-up pass once we have user reports. - agent/file_safety.py POSIX-only security deny patterns — separate hardening task. - tools/process_registry.py returning "/tmp" as fallback — theoretical; reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
def windows_detach_flags() -> int:
"""Return Win32 creationflags that detach a child from the parent
console and process group. 0 on non-Windows.
Pair with ``start_new_session=False`` (default) when calling
subprocess.Popen on POSIX use ``start_new_session=True`` instead,
which maps to ``os.setsid()`` in the child.
Rationale:
- ``CREATE_NEW_PROCESS_GROUP`` child has its own process group so
Ctrl+C in the parent console doesn't propagate.
- ``DETACHED_PROCESS`` child has no console at all. Necessary for
background daemons (gateway watchers, update respawners) because
without it, closing the console kills the child.
- ``CREATE_NO_WINDOW`` suppress the brief cmd flash that would
otherwise appear when launching a console app. Redundant with
DETACHED_PROCESS but explicit for clarity.
fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909) * fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep Three coordinated fixes for the Windows gateway reliability story: 1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn The 'hermes update' triggered from the Electron Desktop GUI ran inside Electron's job object. Without breakaway, the post-update gateway watcher spawned by update — already DETACHED_PROCESS — was still reaped when Electron's job tore down, so the gateway never came back after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000) to: - hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by every helper that calls windows_detach_popen_kwargs(), including launch_detached_profile_gateway_restart() - The watcher subprocess's own respawn snippet in hermes_cli/gateway.py (inlined flags so the watcher's child respawn also breaks away) _spawn_detached() in gateway_windows.py already had the flag; this change brings the rest of the codebase to parity. 2. Per-minute supervisor Scheduled Task — Windows equivalent of systemd Restart=always Introduces hermes_cli/gateway_supervisor.py and registers it as a second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1, LIMITED rights) alongside the existing ONLOGON task. Every minute, the supervisor uses the same gateway.status.get_running_pid() probe as 'hermes gateway status' and, if no gateway is alive, calls gateway_windows._spawn_detached() (which now includes BREAKAWAY) to bring one back. Covers every crash mode, not just 'machine rebooted': taskkill, OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw startup per minute when down, one PID-existence check per minute when up. Wired into both the schtasks-success and Startup-folder-fallback install paths via _install_supervisor_best_effort(), and removed in uninstall(). Best-effort: a failing supervisor install logs a warning but doesn't roll back the primary install. 3. 'hermes gateway status --deep' shows per-probe PASS/FAIL Replaces the existing terse '--deep' output (which only printed paths) with an actual diagnostic table: [1] PID file present [2] Lock file held by a live process [3] get_running_pid() result [4] _pid_exists(pid) — OS-level liveness [5] gateway_state.json (state + age) [6] Last lifecycle event from gateway-exit-diag.log When the high-level summary disagrees with reality, the user can see exactly which signal is lying. Test-leak fix ------------- tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages monkey-patched is_linux/is_wsl/supports_systemd_services to simulate WSL but did NOT stub is_windows(). On a Windows host, the dispatcher in _gateway_command_inner takes the is_windows() branch BEFORE the WSL guidance branch, so the test invoked gateway_windows.install() for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd — the REAL user Startup folder, never sandboxed by tmp_path — pointing at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper. When pytest tore down the tmp_path, every subsequent Windows login flashed a cmd.exe window that failed to find the missing target. Stubs is_windows=False on all four affected tests: test_install_wsl_no_systemd test_start_wsl_no_systemd test_status_wsl_running_manual test_status_wsl_not_running Defense-in-depth: _build_startup_launcher() now prefixes the launcher with 'if not exist <target> exit /b 0', so any future stale Startup entry silently no-ops instead of flashing a console window. Status enhancements ------------------- - status() now reports supervisor task presence alongside the existing schtasks/Startup info, and nudges the user to reinstall if the supervisor isn't registered. - Deep mode dumps both the supervisor task name + script path. * fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes Earlier in this branch we added a per-minute schtasks-based supervisor to respawn the gateway after crashes / GUI-update SIGTERMs. The implementation flashed a brief console window on every firing, which stole window focus. We tried several variants: - cmd.exe wrapper invoking pythonw -> flashes (cmd.exe is console-subsystem) - schtasks /TR pointing at pythonw -> flashes (uv venv launcher pythonw is actually subsystem=Console, not GUI; it respawns the real pythonw) - schtasks /TR pointing at base uv -> still flashes (Task Scheduler-side conhost preallocation; documented Windows quirk) - XML registration with <Hidden>true> -> still flashes (<Hidden> only hides the task in the Task Scheduler UI, not the spawned window) Researched what leading projects do: - Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor. - Tailscale: real Windows Service via SCM. Session 0, no console possible. - Syncthing: --no-console flag inside the binary + Startup folder. - openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but Super User Q971162 confirms focus-steal still occurs in some cases. None of these use a per-minute polling scheduled task. The 'auto-restart on crash' responsibility belongs INSIDE the daemon (Tailscale's in-process recovery / Ollama's monitor+worker pair) OR is delegated to the Windows Service Control Manager — not Task Scheduler. So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB fix in _subprocess_compat.py (from commit c1e5fa433) survives — that is the *real* fix for problem #2 (GUI-update kills gateway): the post-update watcher in launch_detached_profile_gateway_restart() now breaks out of Electron's job object, so the gateway respawn watcher survives the GUI quit and successfully respawns the gateway. Surviving from c1e5fa433: * CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2) * Inlined breakaway flag in the watcher respawn snippet in gateway.py * hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility) * 'if not exist <target> exit /b 0' guard in _build_startup_launcher (fixes #3 — silent no-op for stale Startup entries) * tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause of #3 — pytest WSL tests no longer leak Startup entries on Win hosts) Removed in this commit: * hermes_cli/gateway_supervisor.py (entire file) * Supervisor section in hermes_cli/gateway_windows.py (~180 lines): get_supervisor_task_name, get_supervisor_script_path, _build_supervisor_cmd_script, _write_supervisor_script, _install_supervisor_task, is_supervisor_task_registered, _install_supervisor_best_effort * _install_supervisor_best_effort() calls in install() (3 spots) * supervisor cleanup block in uninstall() * supervisor display lines in status() / status(deep=True) Future direction (out of scope for this PR): the right place for Windows 'Restart=always' semantics is a real Windows Service installed via pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM auto-restart, no console window possible. That's a meaningful next-PR project, not a band-aid. Tests: 51 pass / 2 pre-existing failures in tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are TestSupportsSystemdServicesWSL cases that fail on origin/main too — unrelated to this PR).
2026-06-06 19:53:58 -07:00
- ``CREATE_BREAKAWAY_FROM_JOB`` escape any job object the parent is
in. Electron (Desktop app) and Tauri (bootstrap installer) wrap
their children in job objects; without breakaway, those children
die when the parent process exits even if they were spawned with
DETACHED_PROCESS. This was the missing flag that made the
post-update gateway respawn watcher silently die alongside the
Tauri updater after the Electron Desktop's update flow finished.
If a process is in a job that disallows breakaway (rare
JOB_OBJECT_LIMIT_BREAKAWAY_OK isn't set), CreateProcess returns
ERROR_ACCESS_DENIED. Python surfaces that as ``PermissionError``
on the ``subprocess.Popen`` call. Callers in this codebase already
wrap detached spawns in ``try/except OSError`` and fall back to a
cmd.exe wrapper, so the breakaway-denied case degrades gracefully
rather than crashing.
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags Second pass on native Windows support, driven by a systematic audit across five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG, os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns (npm.cmd batch shims, start_new_session no-op on Windows), subsystem health (cron, gateway daemon, update flow), and module-level import guards. Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux, test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix. ## New module - hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command, windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs). All no-ops on non-Windows. ## CRITICAL fixes (would crash or silently break on Windows) - tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would AttributeError on import on Windows, breaking `hermes --tui` entirely (it spawns this module as a subprocess). Guard each signal.signal() call with hasattr() and add SIGBREAK as Windows' SIGHUP equivalent. - hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop behind `os.name != "nt"` — Windows has no zombies anyway. - tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0 ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows). Generated sandbox client parses both. - cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use shutil.which("bash") so Windows (Git Bash via MinGit) works, with a readable error when bash is genuinely absent. - 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py (npm install + node version probe), browser_tool.py x2. On Windows npm is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...]) fails with WinError 193. shutil.which(...) returns the absolute .cmd path which CreateProcessW accepts because the extension routes through cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the same path subprocess would resolve itself). ## HIGH fixes (silent misbehaviour on Windows) - tools/environments/local.py get_temp_dir: hardcoded /tmp returned on Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd tracking silently broken — `cd` in terminal tool did nothing. Windows branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes (works in both bash and Python, guaranteed no spaces). - tools/environments/local.py _make_run_env PATH injection: `/usr/bin not in split(":")` heuristic mangles Windows PATH (";" separator). Gate the injection behind `not _IS_WINDOWS`. - hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer Popen + watcher-script Popen both used start_new_session=True, which Windows silently ignores. Watcher stayed attached to CLI's console, died when user closed terminal after `hermes update`, left gateway stale. Now branches through windows_detach_popen_kwargs() helper (CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on Windows, start_new_session=True on POSIX — identical to main). ## MEDIUM fixes - gateway/run.py /restart and /update handlers: hardcoded bash/setsid chain crashes on Windows when user triggers /update in-gateway. Now has sys.platform=="win32" branch using sys.executable + a tiny Python watcher with proper detach flags. POSIX path is unchanged. - cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/... style paths that break subprocess.Popen(cwd=...) and Path().resolve(). Added _normalize_git_bash_path() helper that translates /c/Users, /cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op. _git_repo_root() now routes every result through it. - cli.py worktree .worktreeinclude: os.symlink on directories failed hard on Windows (requires admin or Developer Mode). Falls back to shutil.copytree with a warning log. ## Tests - 29 new tests in tests/tools/test_windows_native_support.py covering: subprocess_compat helpers, TUI entry signal guards, kanban waitpid guard, code_execution TCP fallback source-level invariants, cron bash resolution, npm/npx bare-spawn lint per-file, local env Windows temp dir, PATH injection gating, git bash path normalization, symlink fallback, gateway detached watcher flags. - One existing test assertion adjusted in test_browser_homebrew_paths: it compared captured Popen argv to the BARE `"npx"` literal; after the shutil.which() change argv[0] is the absolute path. New assertion checks the shape (two items, second is `agent-browser`) rather than the exact first-item string. Behaviour unchanged; test was too strict. All 56 tests pass on Linux (30 from previous commits + 26 new). 267 tests from the affected files/dirs (browser, code_exec, local_env, process_registry, kanban_db, windows_compat) all pass — zero regressions. tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged; all pre-existing test failures confirmed unrelated via `git stash` re-run. ## What's still deferred (LOW priority) - Visible cmd-window flashes on short-lived console apps (~14 sites) — cosmetic, needs a follow-up pass once we have user reports. - agent/file_safety.py POSIX-only security deny patterns — separate hardening task. - tools/process_registry.py returning "/tmp" as fallback — theoretical; reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
"""
if not IS_WINDOWS:
return 0
fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909) * fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep Three coordinated fixes for the Windows gateway reliability story: 1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn The 'hermes update' triggered from the Electron Desktop GUI ran inside Electron's job object. Without breakaway, the post-update gateway watcher spawned by update — already DETACHED_PROCESS — was still reaped when Electron's job tore down, so the gateway never came back after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000) to: - hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by every helper that calls windows_detach_popen_kwargs(), including launch_detached_profile_gateway_restart() - The watcher subprocess's own respawn snippet in hermes_cli/gateway.py (inlined flags so the watcher's child respawn also breaks away) _spawn_detached() in gateway_windows.py already had the flag; this change brings the rest of the codebase to parity. 2. Per-minute supervisor Scheduled Task — Windows equivalent of systemd Restart=always Introduces hermes_cli/gateway_supervisor.py and registers it as a second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1, LIMITED rights) alongside the existing ONLOGON task. Every minute, the supervisor uses the same gateway.status.get_running_pid() probe as 'hermes gateway status' and, if no gateway is alive, calls gateway_windows._spawn_detached() (which now includes BREAKAWAY) to bring one back. Covers every crash mode, not just 'machine rebooted': taskkill, OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw startup per minute when down, one PID-existence check per minute when up. Wired into both the schtasks-success and Startup-folder-fallback install paths via _install_supervisor_best_effort(), and removed in uninstall(). Best-effort: a failing supervisor install logs a warning but doesn't roll back the primary install. 3. 'hermes gateway status --deep' shows per-probe PASS/FAIL Replaces the existing terse '--deep' output (which only printed paths) with an actual diagnostic table: [1] PID file present [2] Lock file held by a live process [3] get_running_pid() result [4] _pid_exists(pid) — OS-level liveness [5] gateway_state.json (state + age) [6] Last lifecycle event from gateway-exit-diag.log When the high-level summary disagrees with reality, the user can see exactly which signal is lying. Test-leak fix ------------- tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages monkey-patched is_linux/is_wsl/supports_systemd_services to simulate WSL but did NOT stub is_windows(). On a Windows host, the dispatcher in _gateway_command_inner takes the is_windows() branch BEFORE the WSL guidance branch, so the test invoked gateway_windows.install() for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd — the REAL user Startup folder, never sandboxed by tmp_path — pointing at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper. When pytest tore down the tmp_path, every subsequent Windows login flashed a cmd.exe window that failed to find the missing target. Stubs is_windows=False on all four affected tests: test_install_wsl_no_systemd test_start_wsl_no_systemd test_status_wsl_running_manual test_status_wsl_not_running Defense-in-depth: _build_startup_launcher() now prefixes the launcher with 'if not exist <target> exit /b 0', so any future stale Startup entry silently no-ops instead of flashing a console window. Status enhancements ------------------- - status() now reports supervisor task presence alongside the existing schtasks/Startup info, and nudges the user to reinstall if the supervisor isn't registered. - Deep mode dumps both the supervisor task name + script path. * fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes Earlier in this branch we added a per-minute schtasks-based supervisor to respawn the gateway after crashes / GUI-update SIGTERMs. The implementation flashed a brief console window on every firing, which stole window focus. We tried several variants: - cmd.exe wrapper invoking pythonw -> flashes (cmd.exe is console-subsystem) - schtasks /TR pointing at pythonw -> flashes (uv venv launcher pythonw is actually subsystem=Console, not GUI; it respawns the real pythonw) - schtasks /TR pointing at base uv -> still flashes (Task Scheduler-side conhost preallocation; documented Windows quirk) - XML registration with <Hidden>true> -> still flashes (<Hidden> only hides the task in the Task Scheduler UI, not the spawned window) Researched what leading projects do: - Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor. - Tailscale: real Windows Service via SCM. Session 0, no console possible. - Syncthing: --no-console flag inside the binary + Startup folder. - openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but Super User Q971162 confirms focus-steal still occurs in some cases. None of these use a per-minute polling scheduled task. The 'auto-restart on crash' responsibility belongs INSIDE the daemon (Tailscale's in-process recovery / Ollama's monitor+worker pair) OR is delegated to the Windows Service Control Manager — not Task Scheduler. So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB fix in _subprocess_compat.py (from commit c1e5fa433) survives — that is the *real* fix for problem #2 (GUI-update kills gateway): the post-update watcher in launch_detached_profile_gateway_restart() now breaks out of Electron's job object, so the gateway respawn watcher survives the GUI quit and successfully respawns the gateway. Surviving from c1e5fa433: * CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2) * Inlined breakaway flag in the watcher respawn snippet in gateway.py * hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility) * 'if not exist <target> exit /b 0' guard in _build_startup_launcher (fixes #3 — silent no-op for stale Startup entries) * tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause of #3 — pytest WSL tests no longer leak Startup entries on Win hosts) Removed in this commit: * hermes_cli/gateway_supervisor.py (entire file) * Supervisor section in hermes_cli/gateway_windows.py (~180 lines): get_supervisor_task_name, get_supervisor_script_path, _build_supervisor_cmd_script, _write_supervisor_script, _install_supervisor_task, is_supervisor_task_registered, _install_supervisor_best_effort * _install_supervisor_best_effort() calls in install() (3 spots) * supervisor cleanup block in uninstall() * supervisor display lines in status() / status(deep=True) Future direction (out of scope for this PR): the right place for Windows 'Restart=always' semantics is a real Windows Service installed via pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM auto-restart, no console window possible. That's a meaningful next-PR project, not a band-aid. Tests: 51 pass / 2 pre-existing failures in tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are TestSupportsSystemdServicesWSL cases that fail on origin/main too — unrelated to this PR).
2026-06-06 19:53:58 -07:00
return (
_CREATE_NEW_PROCESS_GROUP
| _DETACHED_PROCESS
| _CREATE_NO_WINDOW
| _CREATE_BREAKAWAY_FROM_JOB
)
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags Second pass on native Windows support, driven by a systematic audit across five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG, os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns (npm.cmd batch shims, start_new_session no-op on Windows), subsystem health (cron, gateway daemon, update flow), and module-level import guards. Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux, test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix. ## New module - hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command, windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs). All no-ops on non-Windows. ## CRITICAL fixes (would crash or silently break on Windows) - tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would AttributeError on import on Windows, breaking `hermes --tui` entirely (it spawns this module as a subprocess). Guard each signal.signal() call with hasattr() and add SIGBREAK as Windows' SIGHUP equivalent. - hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop behind `os.name != "nt"` — Windows has no zombies anyway. - tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0 ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows). Generated sandbox client parses both. - cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use shutil.which("bash") so Windows (Git Bash via MinGit) works, with a readable error when bash is genuinely absent. - 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py (npm install + node version probe), browser_tool.py x2. On Windows npm is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...]) fails with WinError 193. shutil.which(...) returns the absolute .cmd path which CreateProcessW accepts because the extension routes through cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the same path subprocess would resolve itself). ## HIGH fixes (silent misbehaviour on Windows) - tools/environments/local.py get_temp_dir: hardcoded /tmp returned on Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd tracking silently broken — `cd` in terminal tool did nothing. Windows branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes (works in both bash and Python, guaranteed no spaces). - tools/environments/local.py _make_run_env PATH injection: `/usr/bin not in split(":")` heuristic mangles Windows PATH (";" separator). Gate the injection behind `not _IS_WINDOWS`. - hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer Popen + watcher-script Popen both used start_new_session=True, which Windows silently ignores. Watcher stayed attached to CLI's console, died when user closed terminal after `hermes update`, left gateway stale. Now branches through windows_detach_popen_kwargs() helper (CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on Windows, start_new_session=True on POSIX — identical to main). ## MEDIUM fixes - gateway/run.py /restart and /update handlers: hardcoded bash/setsid chain crashes on Windows when user triggers /update in-gateway. Now has sys.platform=="win32" branch using sys.executable + a tiny Python watcher with proper detach flags. POSIX path is unchanged. - cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/... style paths that break subprocess.Popen(cwd=...) and Path().resolve(). Added _normalize_git_bash_path() helper that translates /c/Users, /cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op. _git_repo_root() now routes every result through it. - cli.py worktree .worktreeinclude: os.symlink on directories failed hard on Windows (requires admin or Developer Mode). Falls back to shutil.copytree with a warning log. ## Tests - 29 new tests in tests/tools/test_windows_native_support.py covering: subprocess_compat helpers, TUI entry signal guards, kanban waitpid guard, code_execution TCP fallback source-level invariants, cron bash resolution, npm/npx bare-spawn lint per-file, local env Windows temp dir, PATH injection gating, git bash path normalization, symlink fallback, gateway detached watcher flags. - One existing test assertion adjusted in test_browser_homebrew_paths: it compared captured Popen argv to the BARE `"npx"` literal; after the shutil.which() change argv[0] is the absolute path. New assertion checks the shape (two items, second is `agent-browser`) rather than the exact first-item string. Behaviour unchanged; test was too strict. All 56 tests pass on Linux (30 from previous commits + 26 new). 267 tests from the affected files/dirs (browser, code_exec, local_env, process_registry, kanban_db, windows_compat) all pass — zero regressions. tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged; all pre-existing test failures confirmed unrelated via `git stash` re-run. ## What's still deferred (LOW priority) - Visible cmd-window flashes on short-lived console apps (~14 sites) — cosmetic, needs a follow-up pass once we have user reports. - agent/file_safety.py POSIX-only security deny patterns — separate hardening task. - tools/process_registry.py returning "/tmp" as fallback — theoretical; reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit (#40956) #40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`, which fixed the headline bug (gateway dies after Desktop GUI update and never comes back). The flag's own docstring acknowledges that restrictive parent job objects can still refuse breakaway with `ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen` call: "Callers in this codebase already wrap detached spawns in try/except OSError and fall back to a cmd.exe wrapper, so the breakaway-denied case degrades gracefully rather than crashing." That's true for `_spawn_detached` in `gateway_windows.py` (the `hermes gateway start` path), which has both the breakaway bit AND a retry-without-breakaway fallback. It's NOT true for the post-update watcher path in `launch_detached_profile_gateway_restart` (`hermes_cli/gateway.py`), which only has `except OSError: return False` and gives up entirely. If a user's shell/terminal/container wraps Hermes in a breakaway-denying job, the gateway-respawn watcher silently fails to launch instead of trying again without breakaway. This PR closes that gap and adds the regression tests that were missing from the original fix. ## Changes ### `hermes_cli/_subprocess_compat.py` Adds a sibling helper `windows_detach_flags_without_breakaway()` so callers can express the fallback symbolically (via the helper) rather than coding the magic `& ~0x01000000` mask at every site. Documented on `windows_detach_flags` and `windows_detach_flags_without_breakaway` with the recommended try/except pattern. ### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart` Two changes, both aligned with the canonical pattern in `gateway_windows._spawn_detached`: 1. The outer watcher Popen now wraps in `try/except OSError`, and on failure retries with `windows_detach_flags_without_breakaway()` (POSIX never reaches this branch — `start_new_session=True` can't raise OSError). 2. The inlined respawn payload (the `python -c` watcher) also wraps its CreateProcess in try/except OSError and retries with `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters because the watcher's job-object inheritance is independent of the outer process's — even if the outer Popen succeeds with breakaway, the respawned gateway might inherit a job that doesn't. ### Regression tests in `tests/tools/test_windows_native_support.py` #40909 shipped the fix without any test that the breakaway bit is present (the existing `test_windows_detach_flags_has_expected_win32_bits` asserted only the three legacy bits). Four new tests close that: - `test_windows_detach_flags_includes_breakaway_from_job` — explicit assertion that the breakaway bit is in the default bundle, with the rationale spelled out in the docstring so a future maintainer staring at this test understands why removing it would resurrect the gateway-dies-after-GUI-update bug. - `test_windows_detach_flags_without_breakaway_drops_only_that_bit` — fallback payload keeps the other three detach bits intact. - `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway` — static-text check on the stringified watcher payload. The inlined Python program isn't reachable via normal import-time inspection because it lives in a `textwrap.dedent("""...""")` literal that gets passed to a separate `python -c` interpreter. Asserting that both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex literal) appear inside the dedent block is a sufficient regression guard against accidental refactors. - `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback` — static check that this PR's fallback retry is wired up symbolically. Without standing up a real Windows job object that refuses breakaway, we can't trigger the OSError in a unit test; the text guard catches the case where a future refactor removes the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry. Also extends `test_windows_detach_flags_has_expected_win32_bits` to include the breakaway bit assertion and updates `test_windows_flags_zero_on_posix` to cover the new helper. ## Tests Locally on Windows: 8/8 in the `-k "detach or breakaway or popen_kwargs or launch_detached or gateway_run_update or hermes_cli_gateway"` slice pass. Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`: 172 passed, 10 failed. All 10 failures are pre-existing POSIX-only tests running on a Windows host (os.geteuid, SIGKILL fallback, is_linux fixture mismatches). Stashing this PR and re-running on bare post-#40909 main reproduces all 10 identically — none are regressions. POSIX paths unchanged: `windows_detach_flags()` and `windows_detach_flags_without_breakaway()` both return 0 off Windows, `windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`. ## Out of scope - The other detached-spawn site in `hermes_cli/gateway.py` (around line 3068) also uses `windows_detach_popen_kwargs()` + `except OSError`. It deserves the same fallback treatment but the codepath is different enough (not the update-flow watcher) that it warrants a separate PR with its own scrutiny. - `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs` too — same reasoning. ## Context Follow-up to #40909 (merged). I had a parallel PR (#40934, closed) that duplicated the core breakaway fix; the bits unique to that PR that #40909 didn't cover are the contents of this one. Closing #40934 and opening this slimmed-down version as the focused follow-up.
2026-06-07 01:21:58 -07:00
def windows_detach_flags_without_breakaway() -> int:
"""Same as :func:`windows_detach_flags` minus ``CREATE_BREAKAWAY_FROM_JOB``.
The docstring on :func:`windows_detach_flags` notes that a process in
a job which disallows breakaway (no ``JOB_OBJECT_LIMIT_BREAKAWAY_OK``)
will see ``ERROR_ACCESS_DENIED`` from CreateProcess, surfacing as
``OSError`` (``PermissionError``) on the ``subprocess.Popen`` call.
Callers that want to recover by retrying without the breakaway
bit can pair the two helpers symbolically rather than coding the
``& ~0x01000000`` magic at every site:
.. code-block:: python
try:
subprocess.Popen(argv, creationflags=windows_detach_flags(), )
except OSError:
subprocess.Popen(
argv,
creationflags=windows_detach_flags_without_breakaway(),
,
)
See ``gateway_windows.py::_spawn_detached`` for the canonical
implementation of this pattern. Returns 0 on non-Windows.
"""
if not IS_WINDOWS:
return 0
return _CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags Second pass on native Windows support, driven by a systematic audit across five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG, os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns (npm.cmd batch shims, start_new_session no-op on Windows), subsystem health (cron, gateway daemon, update flow), and module-level import guards. Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux, test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix. ## New module - hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command, windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs). All no-ops on non-Windows. ## CRITICAL fixes (would crash or silently break on Windows) - tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would AttributeError on import on Windows, breaking `hermes --tui` entirely (it spawns this module as a subprocess). Guard each signal.signal() call with hasattr() and add SIGBREAK as Windows' SIGHUP equivalent. - hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop behind `os.name != "nt"` — Windows has no zombies anyway. - tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0 ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows). Generated sandbox client parses both. - cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use shutil.which("bash") so Windows (Git Bash via MinGit) works, with a readable error when bash is genuinely absent. - 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py (npm install + node version probe), browser_tool.py x2. On Windows npm is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...]) fails with WinError 193. shutil.which(...) returns the absolute .cmd path which CreateProcessW accepts because the extension routes through cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the same path subprocess would resolve itself). ## HIGH fixes (silent misbehaviour on Windows) - tools/environments/local.py get_temp_dir: hardcoded /tmp returned on Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd tracking silently broken — `cd` in terminal tool did nothing. Windows branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes (works in both bash and Python, guaranteed no spaces). - tools/environments/local.py _make_run_env PATH injection: `/usr/bin not in split(":")` heuristic mangles Windows PATH (";" separator). Gate the injection behind `not _IS_WINDOWS`. - hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer Popen + watcher-script Popen both used start_new_session=True, which Windows silently ignores. Watcher stayed attached to CLI's console, died when user closed terminal after `hermes update`, left gateway stale. Now branches through windows_detach_popen_kwargs() helper (CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on Windows, start_new_session=True on POSIX — identical to main). ## MEDIUM fixes - gateway/run.py /restart and /update handlers: hardcoded bash/setsid chain crashes on Windows when user triggers /update in-gateway. Now has sys.platform=="win32" branch using sys.executable + a tiny Python watcher with proper detach flags. POSIX path is unchanged. - cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/... style paths that break subprocess.Popen(cwd=...) and Path().resolve(). Added _normalize_git_bash_path() helper that translates /c/Users, /cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op. _git_repo_root() now routes every result through it. - cli.py worktree .worktreeinclude: os.symlink on directories failed hard on Windows (requires admin or Developer Mode). Falls back to shutil.copytree with a warning log. ## Tests - 29 new tests in tests/tools/test_windows_native_support.py covering: subprocess_compat helpers, TUI entry signal guards, kanban waitpid guard, code_execution TCP fallback source-level invariants, cron bash resolution, npm/npx bare-spawn lint per-file, local env Windows temp dir, PATH injection gating, git bash path normalization, symlink fallback, gateway detached watcher flags. - One existing test assertion adjusted in test_browser_homebrew_paths: it compared captured Popen argv to the BARE `"npx"` literal; after the shutil.which() change argv[0] is the absolute path. New assertion checks the shape (two items, second is `agent-browser`) rather than the exact first-item string. Behaviour unchanged; test was too strict. All 56 tests pass on Linux (30 from previous commits + 26 new). 267 tests from the affected files/dirs (browser, code_exec, local_env, process_registry, kanban_db, windows_compat) all pass — zero regressions. tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged; all pre-existing test failures confirmed unrelated via `git stash` re-run. ## What's still deferred (LOW priority) - Visible cmd-window flashes on short-lived console apps (~14 sites) — cosmetic, needs a follow-up pass once we have user reports. - agent/file_safety.py POSIX-only security deny patterns — separate hardening task. - tools/process_registry.py returning "/tmp" as fallback — theoretical; reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
def windows_hide_flags() -> int:
"""Return Win32 creationflags that merely hide the child's console
window without detaching the child. 0 on non-Windows.
Use for short-lived console apps spawned as part of a larger
operation (``taskkill``, ``where``, version probes) where we want no
flash but also want to collect stdout/exit code synchronously.
The key difference from :func:`windows_detach_flags`: NO
``DETACHED_PROCESS`` the child still inherits stdio handles so
``capture_output=True`` works. ``DETACHED_PROCESS`` would sever
stdio and break stdout capture.
"""
if not IS_WINDOWS:
return 0
return _CREATE_NO_WINDOW
def windows_detach_popen_kwargs() -> dict:
"""Return a dict of Popen kwargs that detach a child on Windows and
fall back to the POSIX equivalent (``start_new_session=True``) on
Linux/macOS.
Usage pattern:
.. code-block:: python
subprocess.Popen(
argv,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
stdin=subprocess.DEVNULL,
close_fds=True,
**windows_detach_popen_kwargs(),
)
This replaces the unsafe-on-Windows pattern:
.. code-block:: python
subprocess.Popen(..., start_new_session=True)
which silently fails to detach on Windows (the flag is accepted but
has no effect the child stays attached to the parent's console
and dies when the console closes).
"""
if IS_WINDOWS:
return {"creationflags": windows_detach_flags()}
return {"start_new_session": True}