feat(kanban): max-runtime timeouts, worker heartbeats, assignees picker, event vocab cleanup

Ports four items from the Multica audit (https://github.com/multica-ai/multica).
Dropped their cross-host server/daemon architecture and their Postgres+pgvector
skill search — both the wrong shape for our single-host SQLite kernel.

1. Per-task max-runtime (`max_runtime_seconds` column)
   - New kernel function `enforce_max_runtime(conn)` runs in every dispatch
     tick. When a running task's elapsed time exceeds the cap, we SIGTERM
     the worker, wait out a 5 s grace period (polling _pid_alive), then
     SIGKILL. The
     task goes back to 'ready' with a `timed_out` event and re-queues
     on the next tick (unless the spawn-failure circuit breaker has
     already parked it).
   - Host-local only: lock prefix must match this host's claimer_id so we
     never signal a PID on another machine.
   - CLI: `hermes kanban create --max-runtime 30m | 2h | 1d | <seconds>`.
     New `_parse_duration` helper accepts s/m/h/d suffixes or bare
     integers.
   - Dashboard POST body + the card's `max_runtime_seconds` field.

2. Worker heartbeat (`last_heartbeat_at` column, `heartbeat` event)
   - `heartbeat_worker(conn, task_id, note=None)` emits the event and
     touches last_heartbeat_at. Refused when the task isn't running.
   - CLI: `hermes kanban heartbeat <id> [--note "..."]`.
   - kanban-worker skill instructs workers to heartbeat during long
     loops (training runs, encodes, crawls, batch uploads).
   - Separate signal from PID crash detection: a worker's Python can
     still be alive while the actual work process is stuck. Heartbeat
     absence is diagnostic; future work can auto-block on stale
     heartbeats but v1 just surfaces the signal.

3. Assignee enumeration (`known_assignees`, `list_profiles_on_disk`)
   - Scans ~/.hermes/profiles/ for dirs containing config.yaml + unions
     with current assignees on the board. Each entry returns
     {name, on_disk, counts: {status: n}}.
   - CLI: `hermes kanban assignees [--json]`. Also hooked into
     `hermes kanban init` which now prints discovered profiles so new
     installs see 'these are the assignees you can target' immediately.
   - Dashboard: GET /api/plugins/kanban/assignees for the picker.
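A sketch of the discovery + merge, assuming the return shape described above; the board-side per-status counts query is elided and passed in as a plain dict:

```python
from pathlib import Path

def list_profiles_on_disk(root=None):
    """Names of dirs under ~/.hermes/profiles/ that contain a config.yaml."""
    root = Path(root) if root else Path.home() / ".hermes" / "profiles"
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if (p / "config.yaml").is_file())

def known_assignees(board_counts, profiles):
    """Union of on-disk profiles and current board assignees.

    board_counts maps assignee -> {status: n}, e.g. {"alice": {"running": 2}}."""
    names = sorted(set(profiles) | set(board_counts))
    return [
        {"name": n, "on_disk": n in profiles, "counts": board_counts.get(n, {})}
        for n in names
    ]
```

A board assignee with no profile on disk still shows up (with `on_disk: False`), so the picker never hides tasks that were assigned before the profile was deleted.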

4. Event vocab cleanup (three renames + three new kinds)
   - `ready` → `promoted` (fires when deps clear; clearer semantics).
   - `priority` → `reprioritized` (past-tense verb, matches others).
   - `spawn_auto_blocked` → `gave_up` (short, memorable; the circuit
     breaker gave up on this task).
   - New: `spawned` (emitted with {pid} on successful spawn),
     `heartbeat` ({note?}), `timed_out`
     ({pid, elapsed_seconds, limit_seconds, sigkill}).
   - One-shot migration in `_migrate_add_optional_columns` renames
     legacy rows in-place on init_db(), so existing DBs upgrade cleanly.
   - Gateway notifier's TERMINAL_KINDS set updated; timed_out gets its
     own ⏱ message template, gave_up renamed from 'auto-blocked'.
   - plugin_api.py's two 'priority' emit sites renamed to
     'reprioritized'.
   - Documented in a new 'Event reference' section in kanban.md,
     grouped into three clusters (lifecycle / edits / worker
     telemetry) with payload shapes.

Tests (+18 in tests/hermes_cli/test_kanban_core_functionality.py,
136/136 pass):
  - max_runtime_terminates_overrun_worker: real SIGTERM flow with
    _pid_alive stub, verifies event payload + state reset.
  - max_runtime_none_means_no_cap: unbounded tasks aren't timed out.
  - create_task_persists_max_runtime.
  - enforce_max_runtime_integrates_with_dispatch: kernel-level +
    dispatch_once chaining.
  - heartbeat_on_running_task + heartbeat_refused_when_not_running.
  - cli_heartbeat_verb with --note round-trip.
  - recompute_ready_emits_promoted_not_ready.
  - spawn_failure_circuit_breaker_emits_gave_up.
  - spawned_event_emitted_with_pid.
  - migration_renames_legacy_event_kinds (injects old rows, re-runs
    init_db, asserts rename).
  - list_profiles_on_disk (tmp_path + config.yaml filter).
  - known_assignees_merges_disk_and_board (profiles on disk + board
    assignees + per-status counts).
  - cli_assignees_json.
  - parse_duration_accepts_formats (s/m/h/d/float).
  - parse_duration_rejects_garbage.
  - cli_create_max_runtime_via_duration (2h → 7200).
  - cli_create_max_runtime_bad_format_exits_nonzero.

Live smoke: POST /tasks with max_runtime_seconds round-trips;
/assignees returns the union of on-disk + board-assigned names;
PATCH priority produces 'reprioritized' events (not 'priority');
board cards expose max_runtime_seconds + last_heartbeat_at.

Docs (website/docs/user-guide/features/kanban.md):
  - New 'Event reference' section with three-cluster table
    (lifecycle / edits / worker telemetry) + payload shapes.
  - CLI reference updated for --max-runtime, heartbeat, assignees.
  - Gateway notifications section updated for the new TERMINAL_KINDS.

Not ported from Multica (deliberate, documented in the out-of-scope
section already): Postgres+pgvector skill search (heavy deps conflict
with SQLite kernel), server+daemon cross-host model (we're
single-host on purpose), first-class agent identity with threaded
comments (we keep the board profile-agnostic).
Teknium
2026-04-27 06:32:17 -07:00
parent af8d43dbbb
commit da7d09c3b6
7 changed files with 848 additions and 31 deletions


@@ -263,6 +263,7 @@ hermes kanban create "<title>" [--body ...] [--assignee <profile>]
[--parent <id>]... [--tenant <name>]
[--workspace scratch|worktree|dir:<path>]
[--priority N] [--triage] [--idempotency-key KEY]
[--max-runtime 30m|2h|1d|<seconds>]
[--json]
hermes kanban list [--mine] [--assignee P] [--status S] [--tenant T] [--archived] [--json]
hermes kanban show <id> [--json]
@@ -281,6 +282,8 @@ hermes kanban archive <id>...
hermes kanban tail <id> # follow a single task's event stream
hermes kanban watch [--assignee P] [--tenant T] # live stream ALL events to the terminal
[--kinds completed,blocked,…] [--interval SECS]
hermes kanban heartbeat <id> [--note "..."] # worker liveness signal for long ops
hermes kanban assignees [--json] # profiles on disk + per-assignee task counts
hermes kanban dispatch [--dry-run] [--max N] # one-shot pass
[--failure-limit N] [--json]
hermes kanban daemon [--interval SECS] [--max N] # long-lived loop
@@ -332,7 +335,7 @@ Workers receive `$HERMES_TENANT` and namespace their memory writes by prefix. Th
## Gateway notifications
When you run `/kanban create …` from the gateway (Telegram, Discord, Slack, etc.), the originating chat is automatically subscribed to the new task. The gateway's background notifier polls `task_events` every few seconds and delivers one message per terminal event (`completed`, `blocked`, `spawn_auto_blocked`, `crashed`) to that chat. Completed tasks also send the first line of the worker's `--result` so you see the outcome without having to `/kanban show`.
When you run `/kanban create …` from the gateway (Telegram, Discord, Slack, etc.), the originating chat is automatically subscribed to the new task. The gateway's background notifier polls `task_events` every few seconds and delivers one message per terminal event (`completed`, `blocked`, `gave_up`, `crashed`, `timed_out`) to that chat. Completed tasks also send the first line of the worker's `--result` so you see the outcome without having to `/kanban show`.
You can manage subscriptions explicitly from the CLI — useful when a script / cron job wants to notify a chat it didn't originate from:
@@ -346,6 +349,45 @@ hermes kanban notify-unsubscribe t_abcd \
A subscription removes itself automatically once the task reaches `done` or `archived`; no cleanup needed.
## Event reference
Every transition appends a row to `task_events`. The kinds group into three clusters so filtering is easy (`hermes kanban watch --kinds completed,gave_up,timed_out`):
**Lifecycle** (what changed about the task as a logical unit):
| Kind | When |
|---|---|
| `created` | Task inserted. |
| `promoted` | `todo → ready` because all parents hit `done`. |
| `claimed` | Dispatcher atomically claimed a `ready` task for spawn. |
| `completed` | Worker wrote `--result` and task hit `done`. |
| `blocked` | Worker or human flipped the task to `blocked`. |
| `unblocked` | `blocked → ready`, either manually or via `/unblock`. |
| `archived` | Hidden from the default board. |
**Edits** (human-driven changes that aren't transitions):
| Kind | When |
|---|---|
| `assigned` | Assignee changed (including unassignment). |
| `edited` | Title or body updated. |
| `reprioritized` | Priority changed. |
| `status` | Dashboard drag-drop wrote a status directly (e.g. `todo → ready`). |
**Worker telemetry** (about the execution process, not the logical task):
| Kind | Payload | When |
|---|---|---|
| `spawned` | `{pid}` | Dispatcher successfully started a worker process. |
| `heartbeat` | `{note?}` | Worker called `hermes kanban heartbeat $TASK` to signal liveness during long operations. |
| `reclaimed` | `{stale_lock}` | Claim TTL expired without a completion; task goes back to `ready`. |
| `crashed` | `{pid, claimer}` | Worker PID no longer alive but TTL hadn't expired yet. |
| `timed_out` | `{pid, elapsed_seconds, limit_seconds, sigkill}` | `max_runtime_seconds` exceeded; dispatcher SIGTERM'd (then SIGKILL'd after 5 s grace) and re-queued. |
| `spawn_failed` | `{error, failures}` | One spawn attempt failed (missing PATH, workspace unmountable, …). Counter increments; task returns to `ready` for retry. |
| `gave_up` | `{failures, error}` | Circuit breaker fired after N consecutive `spawn_failed`. Task auto-blocks with the last error. Default N = 5; override via `--failure-limit`. |
`hermes kanban tail <id>` shows these for a single task. `hermes kanban watch` streams them board-wide.
## Out of scope
Kanban is deliberately single-host. `~/.hermes/kanban.db` is a local SQLite file and the dispatcher spawns workers on the same machine. Running a shared board across two hosts is not supported — there's no coordination primitive for "worker X on host A, worker Y on host B," and the crash-detection path assumes PIDs are host-local. If you need multi-host, run an independent board per host and use `delegate_task` / a message queue to bridge them.