feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
|
|
|
"""Windows subprocess compatibility helpers.
|
|
|
|
|
|
|
|
|
|
Hermes is developed on Linux / macOS and tested natively on Windows too.
|
|
|
|
|
Several common subprocess patterns break silently-or-loudly on Windows:
|
|
|
|
|
|
|
|
|
|
* ``["npm", "install", ...]`` — on Windows ``npm`` is ``npm.cmd``, a batch
|
|
|
|
|
shim. ``subprocess.Popen(["npm", ...])`` fails with WinError 193
|
|
|
|
|
("not a valid Win32 application") because CreateProcessW can't run a
|
|
|
|
|
``.cmd`` file without ``shell=True`` or PATHEXT resolution.
|
|
|
|
|
|
|
|
|
|
* ``start_new_session=True`` — on POSIX, this maps to ``os.setsid()`` and
|
|
|
|
|
actually detaches the child. On Windows it's silently ignored; the
|
|
|
|
|
Windows equivalent is ``CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS``
|
|
|
|
|
creationflags, which Python only applies when you pass them explicitly.
|
|
|
|
|
|
|
|
|
|
* Console-window flashes — every ``subprocess.Popen`` of a ``.exe`` on
|
|
|
|
|
Windows spawns a cmd window briefly unless ``CREATE_NO_WINDOW`` is
|
|
|
|
|
passed. Cosmetic but jarring for background daemons.
|
|
|
|
|
|
|
|
|
|
This module centralizes the platform-branching logic so the rest of the
|
|
|
|
|
codebase doesn't sprinkle ``if sys.platform == "win32":`` everywhere.
|
|
|
|
|
|
|
|
|
|
**All helpers are no-ops on non-Windows** — calling them in Linux/macOS
|
|
|
|
|
code paths is safe by design. That's the "do no damage on POSIX"
|
|
|
|
|
guarantee.
|
|
|
|
|
"""
|
|
|
|
|
|
|
|
|
|
from __future__ import annotations
|
|
|
|
|
|
|
|
|
|
import shutil
|
|
|
|
|
import sys
|
chore: prune unused imports and duplicate import redefinitions
Remove unused imports (F401) and duplicate/shadowed import
redefinitions (F811) across the codebase using ruff's safe
autofixes. No behavioral changes -- imports only.
- ~1400 safe autofixes applied across 644 files (net -1072 lines)
- __init__.py re-exports preserved (excluded from F401 removal so
public re-export surfaces stay intact)
- Re-exports that are imported or monkeypatched by tests but look
unused in their defining module are kept with explicit # noqa:
F401 (gateway/run.py load_dotenv; run_agent re-exports from
agent.message_sanitization, agent.context_compressor,
agent.retry_utils, agent.prompt_builder, agent.process_bootstrap,
agent.codex_responses_adapter)
- Unsafe F841 (unused-variable) fixes deliberately skipped -- those
can change behavior when the RHS has side effects
- ruff lints remain disabled in pyproject.toml (only PLW1514 is
selected); this is a one-time cleanup, not a config change
Verification:
- python -m compileall: clean
- pytest --collect-only: all 27161 tests collect (zero import errors)
- core entry points import clean (run_agent, model_tools, cli,
toolsets, hermes_state, batch_runner, gateway)
- static scan: every name any test imports directly from an edited
module still resolves
2026-05-29 02:04:58 +05:30
|
|
|
from typing import Sequence
|
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
|
|
|
|
|
|
|
|
__all__ = [
|
|
|
|
|
"IS_WINDOWS",
|
|
|
|
|
"resolve_node_command",
|
|
|
|
|
"windows_detach_flags",
|
fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit (#40956)
#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`,
which fixed the headline bug (gateway dies after Desktop GUI update
and never comes back). The flag's own docstring acknowledges that
restrictive parent job objects can still refuse breakaway with
`ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen`
call:
"Callers in this codebase already wrap detached spawns in
try/except OSError and fall back to a cmd.exe wrapper, so the
breakaway-denied case degrades gracefully rather than crashing."
That's true for `_spawn_detached` in `gateway_windows.py` (the
`hermes gateway start` path), which has both the breakaway bit AND a
retry-without-breakaway fallback. It's NOT true for the post-update
watcher path in `launch_detached_profile_gateway_restart`
(`hermes_cli/gateway.py`), which only has `except OSError: return
False` and gives up entirely. If a user's shell/terminal/container
wraps Hermes in a breakaway-denying job, the gateway-respawn watcher
silently fails to launch instead of trying again without breakaway.
This PR closes that gap and adds the regression tests that were
missing from the original fix.
## Changes
### `hermes_cli/_subprocess_compat.py`
Adds a sibling helper `windows_detach_flags_without_breakaway()` so
callers can express the fallback symbolically (via the helper) rather
than coding the magic `& ~0x01000000` mask at every site. Documented
on `windows_detach_flags` and `windows_detach_flags_without_breakaway`
with the recommended try/except pattern.
### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart`
Two changes, both aligned with the canonical pattern in
`gateway_windows._spawn_detached`:
1. The outer watcher Popen now wraps in `try/except OSError`, and on
failure retries with `windows_detach_flags_without_breakaway()`
(POSIX never reaches this branch — `start_new_session=True` can't
raise OSError).
2. The inlined respawn payload (the `python -c` watcher) also
wraps its CreateProcess in try/except OSError and retries with
`_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters
because the watcher's job-object inheritance is independent of the
outer process's — even if the outer Popen succeeds with breakaway,
the respawned gateway might inherit a job that doesn't.
### Regression tests in `tests/tools/test_windows_native_support.py`
#40909 shipped the fix without any test that the breakaway bit is
present (the existing `test_windows_detach_flags_has_expected_win32_bits`
asserted only the three legacy bits). Four new tests close that:
- `test_windows_detach_flags_includes_breakaway_from_job` — explicit
assertion that the breakaway bit is in the default bundle, with the
rationale spelled out in the docstring so a future maintainer
staring at this test understands why removing it would resurrect
the gateway-dies-after-GUI-update bug.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
— fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
— static-text check on the stringified watcher payload. The inlined
Python program isn't reachable via normal import-time inspection
because it lives in a `textwrap.dedent("""...""")` literal that
gets passed to a separate `python -c` interpreter. Asserting that
both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex
literal) appear inside the dedent block is a sufficient regression
guard against accidental refactors.
- `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback`
— static check that this PR's fallback retry is wired up
symbolically. Without standing up a real Windows job object that
refuses breakaway, we can't trigger the OSError in a unit test;
the text guard catches the case where a future refactor removes
the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry.
Also extends `test_windows_detach_flags_has_expected_win32_bits` to
include the breakaway bit assertion and updates
`test_windows_flags_zero_on_posix` to cover the new helper.
## Tests
Locally on Windows: 8/8 in the `-k "detach or breakaway or
popen_kwargs or launch_detached or gateway_run_update or
hermes_cli_gateway"` slice pass.
Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`:
172 passed, 10 failed. All 10 failures are pre-existing POSIX-only
tests running on a Windows host (os.geteuid, SIGKILL fallback,
is_linux fixture mismatches). Stashing this PR and re-running on bare
post-#40909 main reproduces all 10 identically — none are regressions.
POSIX paths unchanged: `windows_detach_flags()` and
`windows_detach_flags_without_breakaway()` both return 0 off Windows,
`windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`.
## Out of scope
- The other detached-spawn site in `hermes_cli/gateway.py` (around
line 3068) also uses `windows_detach_popen_kwargs()` + `except
OSError`. It deserves the same fallback treatment but the codepath
is different enough (not the update-flow watcher) that it warrants
a separate PR with its own scrutiny.
- `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs`
too — same reasoning.
## Context
Follow-up to #40909 (merged). I had a parallel PR (#40934, closed)
that duplicated the core breakaway fix; the bits unique to that PR
that #40909 didn't cover are the contents of this one. Closing #40934
and opening this slimmed-down version as the focused follow-up.
2026-06-07 01:21:58 -07:00
|
|
|
"windows_detach_flags_without_breakaway",
|
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
|
|
|
"windows_hide_flags",
|
|
|
|
|
"windows_detach_popen_kwargs",
|
|
|
|
|
]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
IS_WINDOWS = sys.platform == "win32"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# -----------------------------------------------------------------------------
|
|
|
|
|
# Node ecosystem launcher resolution
|
|
|
|
|
# -----------------------------------------------------------------------------
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
def resolve_node_command(name: str, argv: Sequence[str]) -> list[str]:
|
|
|
|
|
"""Resolve a Node-ecosystem command name to an absolute-path argv.
|
|
|
|
|
|
|
|
|
|
On Windows, commands like ``npm``, ``npx``, ``yarn``, ``pnpm``,
|
|
|
|
|
``playwright``, ``prettier`` ship as ``.cmd`` files (batch shims).
|
|
|
|
|
``subprocess.Popen(["npm", "install"])`` fails with WinError 193
|
|
|
|
|
because CreateProcessW doesn't execute batch files directly.
|
|
|
|
|
|
|
|
|
|
``shutil.which(name)`` *does* resolve ``.cmd`` via PATHEXT and returns
|
|
|
|
|
the fully-qualified path — which CreateProcessW accepts because the
|
|
|
|
|
extension tells Windows to route through ``cmd.exe /c``.
|
|
|
|
|
|
|
|
|
|
On POSIX ``shutil.which`` also returns a fully-qualified path when
|
|
|
|
|
found. That's a small change from bare-name resolution (the OS does
|
|
|
|
|
its own PATH search) but functionally identical and has the side
|
|
|
|
|
benefit of making the argv reproducible in logs.
|
|
|
|
|
|
|
|
|
|
Behavior when the command is not on PATH:
|
|
|
|
|
- On Windows: return the bare name — caller can still try with
|
|
|
|
|
``shell=True`` as a last resort, OR the subsequent Popen will
|
|
|
|
|
raise FileNotFoundError with a readable error we want to surface.
|
|
|
|
|
- On POSIX: same. Bare ``npm`` on a Linux box without npm installed
|
|
|
|
|
fails the same way it did before this function existed.
|
|
|
|
|
|
|
|
|
|
Args:
|
|
|
|
|
name: The command name to resolve (``npm``, ``npx``, ``node`` …).
|
|
|
|
|
argv: The remaining arguments. Must NOT include ``name`` itself —
|
|
|
|
|
this function builds the full argv list.
|
|
|
|
|
|
|
|
|
|
Returns:
|
|
|
|
|
A list suitable for passing to subprocess.Popen/run/call.
|
|
|
|
|
"""
|
|
|
|
|
resolved = shutil.which(name)
|
|
|
|
|
if resolved:
|
|
|
|
|
return [resolved, *argv]
|
|
|
|
|
return [name, *argv]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# -----------------------------------------------------------------------------
|
|
|
|
|
# Detached / hidden process creation
|
|
|
|
|
# -----------------------------------------------------------------------------
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Win32 CreationFlags — defined here rather than imported from subprocess
|
|
|
|
|
# because CREATE_NO_WINDOW and DETACHED_PROCESS aren't guaranteed to be
|
|
|
|
|
# present on stdlib subprocess on older Pythons or non-Windows builds.
|
|
|
|
|
_CREATE_NEW_PROCESS_GROUP = 0x00000200
|
|
|
|
|
_DETACHED_PROCESS = 0x00000008
|
|
|
|
|
_CREATE_NO_WINDOW = 0x08000000
|
fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909)
* fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep
Three coordinated fixes for the Windows gateway reliability story:
1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn
The 'hermes update' triggered from the Electron Desktop GUI ran inside
Electron's job object. Without breakaway, the post-update gateway
watcher spawned by update — already DETACHED_PROCESS — was still
reaped when Electron's job tore down, so the gateway never came back
after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000)
to:
- hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by
every helper that calls windows_detach_popen_kwargs(), including
launch_detached_profile_gateway_restart()
- The watcher subprocess's own respawn snippet in
hermes_cli/gateway.py (inlined flags so the watcher's child
respawn also breaks away)
_spawn_detached() in gateway_windows.py already had the flag; this
change brings the rest of the codebase to parity.
2. Per-minute supervisor Scheduled Task — Windows equivalent of
systemd Restart=always
Introduces hermes_cli/gateway_supervisor.py and registers it as a
second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1,
LIMITED rights) alongside the existing ONLOGON task. Every minute,
the supervisor uses the same gateway.status.get_running_pid() probe
as 'hermes gateway status' and, if no gateway is alive, calls
gateway_windows._spawn_detached() (which now includes BREAKAWAY) to
bring one back.
Covers every crash mode, not just 'machine rebooted': taskkill,
OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw
startup per minute when down, one PID-existence check per minute
when up.
Wired into both the schtasks-success and Startup-folder-fallback
install paths via _install_supervisor_best_effort(), and removed in
uninstall(). Best-effort: a failing supervisor install logs a
warning but doesn't roll back the primary install.
3. 'hermes gateway status --deep' shows per-probe PASS/FAIL
Replaces the existing terse '--deep' output (which only printed
paths) with an actual diagnostic table:
[1] PID file present
[2] Lock file held by a live process
[3] get_running_pid() result
[4] _pid_exists(pid) — OS-level liveness
[5] gateway_state.json (state + age)
[6] Last lifecycle event from gateway-exit-diag.log
When the high-level summary disagrees with reality, the user can
see exactly which signal is lying.
Test-leak fix
-------------
tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages
monkey-patched is_linux/is_wsl/supports_systemd_services to simulate
WSL but did NOT stub is_windows(). On a Windows host, the dispatcher
in _gateway_command_inner takes the is_windows() branch BEFORE the
WSL guidance branch, so the test invoked gateway_windows.install()
for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd
— the REAL user Startup folder, never sandboxed by tmp_path — pointing
at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper.
When pytest tore down the tmp_path, every subsequent Windows login
flashed a cmd.exe window that failed to find the missing target.
Stubs is_windows=False on all four affected tests:
test_install_wsl_no_systemd
test_start_wsl_no_systemd
test_status_wsl_running_manual
test_status_wsl_not_running
Defense-in-depth: _build_startup_launcher() now prefixes the launcher
with 'if not exist <target> exit /b 0', so any future stale Startup
entry silently no-ops instead of flashing a console window.
Status enhancements
-------------------
- status() now reports supervisor task presence alongside the existing
schtasks/Startup info, and nudges the user to reinstall if the
supervisor isn't registered.
- Deep mode dumps both the supervisor task name + script path.
* fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes
Earlier in this branch we added a per-minute schtasks-based supervisor to
respawn the gateway after crashes / GUI-update SIGTERMs. The implementation
flashed a brief console window on every firing, which stole window focus.
We tried several variants:
- cmd.exe wrapper invoking pythonw -> flashes (cmd.exe is console-subsystem)
- schtasks /TR pointing at pythonw -> flashes (uv venv launcher pythonw is
actually subsystem=Console, not GUI; it respawns the real pythonw)
- schtasks /TR pointing at base uv -> still flashes (Task Scheduler-side
conhost preallocation; documented Windows quirk)
- XML registration with <Hidden>true> -> still flashes (<Hidden> only hides
the task in the Task Scheduler UI, not the spawned window)
Researched what leading projects do:
- Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor.
- Tailscale: real Windows Service via SCM. Session 0, no console possible.
- Syncthing: --no-console flag inside the binary + Startup folder.
- openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but
Super User Q971162 confirms focus-steal still occurs in some cases.
None of these use a per-minute polling scheduled task. The 'auto-restart on
crash' responsibility belongs INSIDE the daemon (Tailscale's in-process
recovery / Ollama's monitor+worker pair) OR is delegated to the Windows
Service Control Manager — not Task Scheduler.
So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB
fix in _subprocess_compat.py (from commit c1e5fa433) survives — that is the
*real* fix for problem #2 (GUI-update kills gateway): the post-update
watcher in launch_detached_profile_gateway_restart() now breaks out of
Electron's job object, so the gateway respawn watcher survives the GUI
quit and successfully respawns the gateway.
Surviving from c1e5fa433:
* CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2)
* Inlined breakaway flag in the watcher respawn snippet in gateway.py
* hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility)
* 'if not exist <target> exit /b 0' guard in _build_startup_launcher
(fixes #3 — silent no-op for stale Startup entries)
* tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause
of #3 — pytest WSL tests no longer leak Startup entries on Win hosts)
Removed in this commit:
* hermes_cli/gateway_supervisor.py (entire file)
* Supervisor section in hermes_cli/gateway_windows.py (~180 lines):
get_supervisor_task_name, get_supervisor_script_path,
_build_supervisor_cmd_script, _write_supervisor_script,
_install_supervisor_task, is_supervisor_task_registered,
_install_supervisor_best_effort
* _install_supervisor_best_effort() calls in install() (3 spots)
* supervisor cleanup block in uninstall()
* supervisor display lines in status() / status(deep=True)
Future direction (out of scope for this PR): the right place for Windows
'Restart=always' semantics is a real Windows Service installed via
pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM
auto-restart, no console window possible. That's a meaningful next-PR
project, not a band-aid.
Tests: 51 pass / 2 pre-existing failures in
tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are
TestSupportsSystemdServicesWSL cases that fail on origin/main too —
unrelated to this PR).
2026-06-06 19:53:58 -07:00
|
|
|
# Escape any Win32 job object the parent process belongs to. Without this,
|
|
|
|
|
# a detached child still inherits its parent's job object membership, and
|
|
|
|
|
# when that parent (Electron, Tauri, Windows Terminal, the Desktop GUI's
|
|
|
|
|
# bootstrap-installer) dies, the OS tears down the whole job — taking the
|
|
|
|
|
# "detached" child with it. Critical for the post-update gateway watcher:
|
|
|
|
|
# Electron spawns the Tauri updater inside its own job, the updater spawns
|
|
|
|
|
# the watcher subprocess; without BREAKAWAY the watcher dies the instant
|
|
|
|
|
# Electron exits, so the gateway never gets respawned after a `hermes
|
|
|
|
|
# update` triggered from the GUI. See fix/windows-gateway-reliability.
|
|
|
|
|
_CREATE_BREAKAWAY_FROM_JOB = 0x01000000
|
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
|
|
|
|
|
|
|
|
|
|
|
|
|
def windows_detach_flags() -> int:
|
|
|
|
|
"""Return Win32 creationflags that detach a child from the parent
|
|
|
|
|
console and process group. 0 on non-Windows.
|
|
|
|
|
|
|
|
|
|
Pair with ``start_new_session=False`` (default) when calling
|
|
|
|
|
subprocess.Popen — on POSIX use ``start_new_session=True`` instead,
|
|
|
|
|
which maps to ``os.setsid()`` in the child.
|
|
|
|
|
|
|
|
|
|
Rationale:
|
|
|
|
|
- ``CREATE_NEW_PROCESS_GROUP`` — child has its own process group so
|
|
|
|
|
Ctrl+C in the parent console doesn't propagate.
|
|
|
|
|
- ``DETACHED_PROCESS`` — child has no console at all. Necessary for
|
|
|
|
|
background daemons (gateway watchers, update respawners) because
|
|
|
|
|
without it, closing the console kills the child.
|
|
|
|
|
- ``CREATE_NO_WINDOW`` — suppress the brief cmd flash that would
|
|
|
|
|
otherwise appear when launching a console app. Redundant with
|
|
|
|
|
DETACHED_PROCESS but explicit for clarity.
|
fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909)
* fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep
Three coordinated fixes for the Windows gateway reliability story:
1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn
The 'hermes update' triggered from the Electron Desktop GUI ran inside
Electron's job object. Without breakaway, the post-update gateway
watcher spawned by update — already DETACHED_PROCESS — was still
reaped when Electron's job tore down, so the gateway never came back
after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000)
to:
- hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by
every helper that calls windows_detach_popen_kwargs(), including
launch_detached_profile_gateway_restart()
- The watcher subprocess's own respawn snippet in
hermes_cli/gateway.py (inlined flags so the watcher's child
respawn also breaks away)
_spawn_detached() in gateway_windows.py already had the flag; this
change brings the rest of the codebase to parity.
2. Per-minute supervisor Scheduled Task — Windows equivalent of
systemd Restart=always
Introduces hermes_cli/gateway_supervisor.py and registers it as a
second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1,
LIMITED rights) alongside the existing ONLOGON task. Every minute,
the supervisor uses the same gateway.status.get_running_pid() probe
as 'hermes gateway status' and, if no gateway is alive, calls
gateway_windows._spawn_detached() (which now includes BREAKAWAY) to
bring one back.
Covers every crash mode, not just 'machine rebooted': taskkill,
OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw
startup per minute when down, one PID-existence check per minute
when up.
Wired into both the schtasks-success and Startup-folder-fallback
install paths via _install_supervisor_best_effort(), and removed in
uninstall(). Best-effort: a failing supervisor install logs a
warning but doesn't roll back the primary install.
3. 'hermes gateway status --deep' shows per-probe PASS/FAIL
Replaces the existing terse '--deep' output (which only printed
paths) with an actual diagnostic table:
[1] PID file present
[2] Lock file held by a live process
[3] get_running_pid() result
[4] _pid_exists(pid) — OS-level liveness
[5] gateway_state.json (state + age)
[6] Last lifecycle event from gateway-exit-diag.log
When the high-level summary disagrees with reality, the user can
see exactly which signal is lying.
Test-leak fix
-------------
tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages
monkey-patched is_linux/is_wsl/supports_systemd_services to simulate
WSL but did NOT stub is_windows(). On a Windows host, the dispatcher
in _gateway_command_inner takes the is_windows() branch BEFORE the
WSL guidance branch, so the test invoked gateway_windows.install()
for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd
— the REAL user Startup folder, never sandboxed by tmp_path — pointing
at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper.
When pytest tore down the tmp_path, every subsequent Windows login
flashed a cmd.exe window that failed to find the missing target.
Stubs is_windows=False on all four affected tests:
test_install_wsl_no_systemd
test_start_wsl_no_systemd
test_status_wsl_running_manual
test_status_wsl_not_running
Defense-in-depth: _build_startup_launcher() now prefixes the launcher
with 'if not exist <target> exit /b 0', so any future stale Startup
entry silently no-ops instead of flashing a console window.
Status enhancements
-------------------
- status() now reports supervisor task presence alongside the existing
schtasks/Startup info, and nudges the user to reinstall if the
supervisor isn't registered.
- Deep mode dumps both the supervisor task name + script path.
* fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes
Earlier in this branch we added a per-minute schtasks-based supervisor to
respawn the gateway after crashes / GUI-update SIGTERMs. The implementation
flashed a brief console window on every firing, which stole window focus.
We tried several variants:
- cmd.exe wrapper invoking pythonw -> flashes (cmd.exe is console-subsystem)
- schtasks /TR pointing at pythonw -> flashes (uv venv launcher pythonw is
actually subsystem=Console, not GUI; it respawns the real pythonw)
- schtasks /TR pointing at base uv -> still flashes (Task Scheduler-side
conhost preallocation; documented Windows quirk)
- XML registration with <Hidden>true> -> still flashes (<Hidden> only hides
the task in the Task Scheduler UI, not the spawned window)
Researched what leading projects do:
- Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor.
- Tailscale: real Windows Service via SCM. Session 0, no console possible.
- Syncthing: --no-console flag inside the binary + Startup folder.
- openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but
Super User Q971162 confirms focus-steal still occurs in some cases.
None of these use a per-minute polling scheduled task. The 'auto-restart on
crash' responsibility belongs INSIDE the daemon (Tailscale's in-process
recovery / Ollama's monitor+worker pair) OR is delegated to the Windows
Service Control Manager — not Task Scheduler.
So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB
fix in _subprocess_compat.py (from commit c1e5fa433) survives — that is the
*real* fix for problem #2 (GUI-update kills gateway): the post-update
watcher in launch_detached_profile_gateway_restart() now breaks out of
Electron's job object, so the gateway respawn watcher survives the GUI
quit and successfully respawns the gateway.
Surviving from c1e5fa433:
* CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2)
* Inlined breakaway flag in the watcher respawn snippet in gateway.py
* hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility)
* 'if not exist <target> exit /b 0' guard in _build_startup_launcher
(fixes #3 — silent no-op for stale Startup entries)
* tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause
of #3 — pytest WSL tests no longer leak Startup entries on Win hosts)
Removed in this commit:
* hermes_cli/gateway_supervisor.py (entire file)
* Supervisor section in hermes_cli/gateway_windows.py (~180 lines):
get_supervisor_task_name, get_supervisor_script_path,
_build_supervisor_cmd_script, _write_supervisor_script,
_install_supervisor_task, is_supervisor_task_registered,
_install_supervisor_best_effort
* _install_supervisor_best_effort() calls in install() (3 spots)
* supervisor cleanup block in uninstall()
* supervisor display lines in status() / status(deep=True)
Future direction (out of scope for this PR): the right place for Windows
'Restart=always' semantics is a real Windows Service installed via
pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM
auto-restart, no console window possible. That's a meaningful next-PR
project, not a band-aid.
Tests: 51 pass / 2 pre-existing failures in
tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are
TestSupportsSystemdServicesWSL cases that fail on origin/main too —
unrelated to this PR).
2026-06-06 19:53:58 -07:00
|
|
|
- ``CREATE_BREAKAWAY_FROM_JOB`` — escape any job object the parent is
|
|
|
|
|
in. Electron (Desktop app) and Tauri (bootstrap installer) wrap
|
|
|
|
|
their children in job objects; without breakaway, those children
|
|
|
|
|
die when the parent process exits even if they were spawned with
|
|
|
|
|
DETACHED_PROCESS. This was the missing flag that made the
|
|
|
|
|
post-update gateway respawn watcher silently die alongside the
|
|
|
|
|
Tauri updater after the Electron Desktop's update flow finished.
|
|
|
|
|
|
|
|
|
|
If a process is in a job that disallows breakaway (rare —
|
|
|
|
|
JOB_OBJECT_LIMIT_BREAKAWAY_OK isn't set), CreateProcess returns
|
|
|
|
|
ERROR_ACCESS_DENIED. Python surfaces that as ``PermissionError``
|
|
|
|
|
on the ``subprocess.Popen`` call. Callers in this codebase already
|
|
|
|
|
wrap detached spawns in ``try/except OSError`` and fall back to a
|
|
|
|
|
cmd.exe wrapper, so the breakaway-denied case degrades gracefully
|
|
|
|
|
rather than crashing.
|
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
|
|
|
"""
|
|
|
|
|
if not IS_WINDOWS:
|
|
|
|
|
return 0
|
fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909)
* fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep
Three coordinated fixes for the Windows gateway reliability story:
1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn
The 'hermes update' triggered from the Electron Desktop GUI ran inside
Electron's job object. Without breakaway, the post-update gateway
watcher spawned by update — already DETACHED_PROCESS — was still
reaped when Electron's job tore down, so the gateway never came back
after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000)
to:
- hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by
every helper that calls windows_detach_popen_kwargs(), including
launch_detached_profile_gateway_restart()
- The watcher subprocess's own respawn snippet in
hermes_cli/gateway.py (inlined flags so the watcher's child
respawn also breaks away)
_spawn_detached() in gateway_windows.py already had the flag; this
change brings the rest of the codebase to parity.
2. Per-minute supervisor Scheduled Task — Windows equivalent of
systemd Restart=always
Introduces hermes_cli/gateway_supervisor.py and registers it as a
second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1,
LIMITED rights) alongside the existing ONLOGON task. Every minute,
the supervisor uses the same gateway.status.get_running_pid() probe
as 'hermes gateway status' and, if no gateway is alive, calls
gateway_windows._spawn_detached() (which now includes BREAKAWAY) to
bring one back.
Covers every crash mode, not just 'machine rebooted': taskkill,
OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw
startup per minute when down, one PID-existence check per minute
when up.
Wired into both the schtasks-success and Startup-folder-fallback
install paths via _install_supervisor_best_effort(), and removed in
uninstall(). Best-effort: a failing supervisor install logs a
warning but doesn't roll back the primary install.
3. 'hermes gateway status --deep' shows per-probe PASS/FAIL
Replaces the existing terse '--deep' output (which only printed
paths) with an actual diagnostic table:
[1] PID file present
[2] Lock file held by a live process
[3] get_running_pid() result
[4] _pid_exists(pid) — OS-level liveness
[5] gateway_state.json (state + age)
[6] Last lifecycle event from gateway-exit-diag.log
When the high-level summary disagrees with reality, the user can
see exactly which signal is lying.
Test-leak fix
-------------
tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages
monkey-patched is_linux/is_wsl/supports_systemd_services to simulate
WSL but did NOT stub is_windows(). On a Windows host, the dispatcher
in _gateway_command_inner takes the is_windows() branch BEFORE the
WSL guidance branch, so the test invoked gateway_windows.install()
for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd
— the REAL user Startup folder, never sandboxed by tmp_path — pointing
at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper.
When pytest tore down the tmp_path, every subsequent Windows login
flashed a cmd.exe window that failed to find the missing target.
Stubs is_windows=False on all four affected tests:
test_install_wsl_no_systemd
test_start_wsl_no_systemd
test_status_wsl_running_manual
test_status_wsl_not_running
Defense-in-depth: _build_startup_launcher() now prefixes the launcher
with 'if not exist <target> exit /b 0', so any future stale Startup
entry silently no-ops instead of flashing a console window.
Status enhancements
-------------------
- status() now reports supervisor task presence alongside the existing
schtasks/Startup info, and nudges the user to reinstall if the
supervisor isn't registered.
- Deep mode dumps both the supervisor task name + script path.
* fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes
Earlier in this branch we added a per-minute schtasks-based supervisor to
respawn the gateway after crashes / GUI-update SIGTERMs. The implementation
flashed a brief console window on every firing, which stole window focus.
We tried several variants:
- cmd.exe wrapper invoking pythonw -> flashes (cmd.exe is console-subsystem)
- schtasks /TR pointing at pythonw -> flashes (uv venv launcher pythonw is
actually subsystem=Console, not GUI; it respawns the real pythonw)
- schtasks /TR pointing at base uv -> still flashes (Task Scheduler-side
conhost preallocation; documented Windows quirk)
- XML registration with <Hidden>true> -> still flashes (<Hidden> only hides
the task in the Task Scheduler UI, not the spawned window)
Researched what leading projects do:
- Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor.
- Tailscale: real Windows Service via SCM. Session 0, no console possible.
- Syncthing: --no-console flag inside the binary + Startup folder.
- openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but
Super User Q971162 confirms focus-steal still occurs in some cases.
None of these use a per-minute polling scheduled task. The 'auto-restart on
crash' responsibility belongs INSIDE the daemon (Tailscale's in-process
recovery / Ollama's monitor+worker pair) OR is delegated to the Windows
Service Control Manager — not Task Scheduler.
So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB
fix in _subprocess_compat.py (from commit c1e5fa433) survives — that is the
*real* fix for problem #2 (GUI-update kills gateway): the post-update
watcher in launch_detached_profile_gateway_restart() now breaks out of
Electron's job object, so the gateway respawn watcher survives the GUI
quit and successfully respawns the gateway.
Surviving from c1e5fa433:
* CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2)
* Inlined breakaway flag in the watcher respawn snippet in gateway.py
* hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility)
* 'if not exist <target> exit /b 0' guard in _build_startup_launcher
(fixes #3 — silent no-op for stale Startup entries)
* tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause
of #3 — pytest WSL tests no longer leak Startup entries on Win hosts)
Removed in this commit:
* hermes_cli/gateway_supervisor.py (entire file)
* Supervisor section in hermes_cli/gateway_windows.py (~180 lines):
get_supervisor_task_name, get_supervisor_script_path,
_build_supervisor_cmd_script, _write_supervisor_script,
_install_supervisor_task, is_supervisor_task_registered,
_install_supervisor_best_effort
* _install_supervisor_best_effort() calls in install() (3 spots)
* supervisor cleanup block in uninstall()
* supervisor display lines in status() / status(deep=True)
Future direction (out of scope for this PR): the right place for Windows
'Restart=always' semantics is a real Windows Service installed via
pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM
auto-restart, no console window possible. That's a meaningful next-PR
project, not a band-aid.
Tests: 51 pass / 2 pre-existing failures in
tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are
TestSupportsSystemdServicesWSL cases that fail on origin/main too —
unrelated to this PR).
2026-06-06 19:53:58 -07:00
|
|
|
return (
|
|
|
|
|
_CREATE_NEW_PROCESS_GROUP
|
|
|
|
|
| _DETACHED_PROCESS
|
|
|
|
|
| _CREATE_NO_WINDOW
|
|
|
|
|
| _CREATE_BREAKAWAY_FROM_JOB
|
|
|
|
|
)
|
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
|
|
|
|
|
|
|
|
|
fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit (#40956)
#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`,
which fixed the headline bug (gateway dies after Desktop GUI update
and never comes back). The flag's own docstring acknowledges that
restrictive parent job objects can still refuse breakaway with
`ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen`
call:
"Callers in this codebase already wrap detached spawns in
try/except OSError and fall back to a cmd.exe wrapper, so the
breakaway-denied case degrades gracefully rather than crashing."
That's true for `_spawn_detached` in `gateway_windows.py` (the
`hermes gateway start` path), which has both the breakaway bit AND a
retry-without-breakaway fallback. It's NOT true for the post-update
watcher path in `launch_detached_profile_gateway_restart`
(`hermes_cli/gateway.py`), which only has `except OSError: return
False` and gives up entirely. If a user's shell/terminal/container
wraps Hermes in a breakaway-denying job, the gateway-respawn watcher
silently fails to launch instead of trying again without breakaway.
This PR closes that gap and adds the regression tests that were
missing from the original fix.
## Changes
### `hermes_cli/_subprocess_compat.py`
Adds a sibling helper `windows_detach_flags_without_breakaway()` so
callers can express the fallback symbolically (via the helper) rather
than coding the magic `& ~0x01000000` mask at every site. Documented
on `windows_detach_flags` and `windows_detach_flags_without_breakaway`
with the recommended try/except pattern.
### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart`
Two changes, both aligned with the canonical pattern in
`gateway_windows._spawn_detached`:
1. The outer watcher Popen now wraps in `try/except OSError`, and on
failure retries with `windows_detach_flags_without_breakaway()`
(POSIX never reaches this branch — `start_new_session=True` can't
raise OSError).
2. The inlined respawn payload (the `python -c` watcher) also
wraps its CreateProcess in try/except OSError and retries with
`_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters
because the watcher's job-object inheritance is independent of the
outer process's — even if the outer Popen succeeds with breakaway,
the respawned gateway might inherit a job that doesn't.
### Regression tests in `tests/tools/test_windows_native_support.py`
#40909 shipped the fix without any test that the breakaway bit is
present (the existing `test_windows_detach_flags_has_expected_win32_bits`
asserted only the three legacy bits). Four new tests close that:
- `test_windows_detach_flags_includes_breakaway_from_job` — explicit
assertion that the breakaway bit is in the default bundle, with the
rationale spelled out in the docstring so a future maintainer
staring at this test understands why removing it would resurrect
the gateway-dies-after-GUI-update bug.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
— fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
— static-text check on the stringified watcher payload. The inlined
Python program isn't reachable via normal import-time inspection
because it lives in a `textwrap.dedent("""...""")` literal that
gets passed to a separate `python -c` interpreter. Asserting that
both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex
literal) appear inside the dedent block is a sufficient regression
guard against accidental refactors.
- `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback`
— static check that this PR's fallback retry is wired up
symbolically. Without standing up a real Windows job object that
refuses breakaway, we can't trigger the OSError in a unit test;
the text guard catches the case where a future refactor removes
the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry.
Also extends `test_windows_detach_flags_has_expected_win32_bits` to
include the breakaway bit assertion and updates
`test_windows_flags_zero_on_posix` to cover the new helper.
## Tests
Locally on Windows: 8/8 in the `-k "detach or breakaway or
popen_kwargs or launch_detached or gateway_run_update or
hermes_cli_gateway"` slice pass.
Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`:
172 passed, 10 failed. All 10 failures are pre-existing POSIX-only
tests running on a Windows host (os.geteuid, SIGKILL fallback,
is_linux fixture mismatches). Stashing this PR and re-running on bare
post-#40909 main reproduces all 10 identically — none are regressions.
POSIX paths unchanged: `windows_detach_flags()` and
`windows_detach_flags_without_breakaway()` both return 0 off Windows,
`windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`.
## Out of scope
- The other detached-spawn site in `hermes_cli/gateway.py` (around
line 3068) also uses `windows_detach_popen_kwargs()` + `except
OSError`. It deserves the same fallback treatment but the codepath
is different enough (not the update-flow watcher) that it warrants
a separate PR with its own scrutiny.
- `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs`
too — same reasoning.
## Context
Follow-up to #40909 (merged). I had a parallel PR (#40934, closed)
that duplicated the core breakaway fix; the bits unique to that PR
that #40909 didn't cover are the contents of this one. Closing #40934
and opening this slimmed-down version as the focused follow-up.
2026-06-07 01:21:58 -07:00
|
|
|
def windows_detach_flags_without_breakaway() -> int:
|
|
|
|
|
"""Same as :func:`windows_detach_flags` minus ``CREATE_BREAKAWAY_FROM_JOB``.
|
|
|
|
|
|
|
|
|
|
The docstring on :func:`windows_detach_flags` notes that a process in
|
|
|
|
|
a job which disallows breakaway (no ``JOB_OBJECT_LIMIT_BREAKAWAY_OK``)
|
|
|
|
|
will see ``ERROR_ACCESS_DENIED`` from CreateProcess, surfacing as
|
|
|
|
|
``OSError`` (``PermissionError``) on the ``subprocess.Popen`` call.
|
|
|
|
|
Callers that want to recover — by retrying without the breakaway
|
|
|
|
|
bit — can pair the two helpers symbolically rather than coding the
|
|
|
|
|
``& ~0x01000000`` magic at every site:
|
|
|
|
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
|
|
|
|
|
|
try:
|
|
|
|
|
subprocess.Popen(argv, creationflags=windows_detach_flags(), …)
|
|
|
|
|
except OSError:
|
|
|
|
|
subprocess.Popen(
|
|
|
|
|
argv,
|
|
|
|
|
creationflags=windows_detach_flags_without_breakaway(),
|
|
|
|
|
…,
|
|
|
|
|
)
|
|
|
|
|
|
|
|
|
|
See ``gateway_windows.py::_spawn_detached`` for the canonical
|
|
|
|
|
implementation of this pattern. Returns 0 on non-Windows.
|
|
|
|
|
"""
|
|
|
|
|
if not IS_WINDOWS:
|
|
|
|
|
return 0
|
|
|
|
|
return _CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW
|
|
|
|
|
|
|
|
|
|
|
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
2026-05-07 17:29:31 -07:00
|
|
|
def windows_hide_flags() -> int:
|
|
|
|
|
"""Return Win32 creationflags that merely hide the child's console
|
|
|
|
|
window without detaching the child. 0 on non-Windows.
|
|
|
|
|
|
|
|
|
|
Use for short-lived console apps spawned as part of a larger
|
|
|
|
|
operation (``taskkill``, ``where``, version probes) where we want no
|
|
|
|
|
flash but also want to collect stdout/exit code synchronously.
|
|
|
|
|
|
|
|
|
|
The key difference from :func:`windows_detach_flags`: NO
|
|
|
|
|
``DETACHED_PROCESS`` — the child still inherits stdio handles so
|
|
|
|
|
``capture_output=True`` works. ``DETACHED_PROCESS`` would sever
|
|
|
|
|
stdio and break stdout capture.
|
|
|
|
|
"""
|
|
|
|
|
if not IS_WINDOWS:
|
|
|
|
|
return 0
|
|
|
|
|
return _CREATE_NO_WINDOW
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
def windows_detach_popen_kwargs() -> dict:
|
|
|
|
|
"""Return a dict of Popen kwargs that detach a child on Windows and
|
|
|
|
|
fall back to the POSIX equivalent (``start_new_session=True``) on
|
|
|
|
|
Linux/macOS.
|
|
|
|
|
|
|
|
|
|
Usage pattern:
|
|
|
|
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
|
|
|
|
|
|
subprocess.Popen(
|
|
|
|
|
argv,
|
|
|
|
|
stdout=subprocess.DEVNULL,
|
|
|
|
|
stderr=subprocess.DEVNULL,
|
|
|
|
|
stdin=subprocess.DEVNULL,
|
|
|
|
|
close_fds=True,
|
|
|
|
|
**windows_detach_popen_kwargs(),
|
|
|
|
|
)
|
|
|
|
|
|
|
|
|
|
This replaces the unsafe-on-Windows pattern:
|
|
|
|
|
|
|
|
|
|
.. code-block:: python
|
|
|
|
|
|
|
|
|
|
subprocess.Popen(..., start_new_session=True)
|
|
|
|
|
|
|
|
|
|
which silently fails to detach on Windows (the flag is accepted but
|
|
|
|
|
has no effect — the child stays attached to the parent's console
|
|
|
|
|
and dies when the console closes).
|
|
|
|
|
"""
|
|
|
|
|
if IS_WINDOWS:
|
|
|
|
|
return {"creationflags": windows_detach_flags()}
|
|
|
|
|
return {"start_new_session": True}
|