mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-05 02:07:34 +08:00
feat(kanban): max-runtime timeouts, worker heartbeats, assignees picker, event vocab cleanup
Ports four items from the Multica audit (https://github.com/multica-ai/multica). Dropped their cross-host server/daemon architecture and their Postgres+pgvector skill search; both are the wrong shape for our single-host SQLite kernel.

1. Per-task max-runtime (`max_runtime_seconds` column)
   - New kernel function `enforce_max_runtime(conn)` runs in every dispatch tick. When a running task's elapsed time exceeds the cap, we SIGTERM the worker, wait a 5 s grace (polling _pid_alive), then SIGKILL. The task goes back to 'ready' with a `timed_out` event and re-queues on the next tick (unless the spawn-failure circuit breaker has already parked it).
   - Host-local only: lock prefix must match this host's claimer_id so we never signal a PID on another machine.
   - CLI: `hermes kanban create --max-runtime 30m | 2h | 1d | <seconds>`. New `_parse_duration` helper accepts s/m/h/d suffixes or bare integers.
   - Dashboard POST body + the card's `max_runtime_seconds` field.

2. Worker heartbeat (`last_heartbeat_at` column, `heartbeat` event)
   - `heartbeat_worker(conn, task_id, note=None)` emits the event and touches last_heartbeat_at. Refused when the task isn't running.
   - CLI: `hermes kanban heartbeat <id> [--note "..."]`.
   - kanban-worker skill instructs workers to heartbeat during long loops (training runs, encodes, crawls, batch uploads).
   - Separate signal from PID crash detection: a worker's Python can still be alive while the actual work process is stuck. Heartbeat absence is diagnostic; future work can auto-block on stale heartbeats but v1 just surfaces the signal.

3. Assignee enumeration (`known_assignees`, `list_profiles_on_disk`)
   - Scans ~/.hermes/profiles/ for dirs containing config.yaml + unions with current assignees on the board. Each entry returns {name, on_disk, counts: {status: n}}.
   - CLI: `hermes kanban assignees [--json]`. Also hooked into `hermes kanban init` which now prints discovered profiles so new installs see 'these are the assignees you can target' immediately.
   - Dashboard: GET /api/plugins/kanban/assignees for the picker.

4. Event vocab cleanup (three renames + three new kinds)
   - `ready` → `promoted` (fires when deps clear; clearer semantic).
   - `priority` → `reprioritized` (past-tense verb, matches others).
   - `spawn_auto_blocked` → `gave_up` (short, memorable; the circuit breaker gave up on this task).
   - New: `spawned` (emitted with {pid} on successful spawn), `heartbeat` ({note?}), `timed_out` ({pid, elapsed_seconds, limit_seconds, sigkill}).
   - One-shot migration in `_migrate_add_optional_columns` renames legacy rows in-place on init_db(), so existing DBs upgrade cleanly.
   - Gateway notifier's TERMINAL_KINDS set updated; timed_out gets its own ⏱ message template, gave_up renamed from 'auto-blocked'.
   - Plugin_api.py's two 'priority' emit sites renamed to 'reprioritized'.
   - Documented in a new 'Event reference' section in kanban.md, grouped into three clusters (lifecycle / edits / worker telemetry) with payload shapes.

Tests (+18 in tests/hermes_cli/test_kanban_core_functionality.py, 136/136 pass):
- max_runtime_terminates_overrun_worker: real SIGTERM flow with _pid_alive stub, verifies event payload + state reset.
- max_runtime_none_means_no_cap: unbounded tasks aren't timed out.
- create_task_persists_max_runtime.
- enforce_max_runtime_integrates_with_dispatch: kernel-level + dispatch_once chaining.
- heartbeat_on_running_task + heartbeat_refused_when_not_running.
- cli_heartbeat_verb with --note round-trip.
- recompute_ready_emits_promoted_not_ready.
- spawn_failure_circuit_breaker_emits_gave_up.
- spawned_event_emitted_with_pid.
- migration_renames_legacy_event_kinds (injects old rows, re-runs init_db, asserts rename).
- list_profiles_on_disk (tmp_path + config.yaml filter).
- known_assignees_merges_disk_and_board (profiles on disk + board assignees + per-status counts).
- cli_assignees_json.
- parse_duration_accepts_formats (s/m/h/d/float).
- parse_duration_rejects_garbage.
- cli_create_max_runtime_via_duration (2h → 7200).
- cli_create_max_runtime_bad_format_exits_nonzero.

Live smoke: POST /tasks with max_runtime_seconds round-trips; /assignees returns the union of on-disk + board-assigned names; PATCH priority produces 'reprioritized' events (not 'priority'); board cards expose max_runtime_seconds + last_heartbeat_at.

Docs (website/docs/user-guide/features/kanban.md):
- New 'Event reference' section with three-cluster table (lifecycle / edits / worker telemetry) + payload shapes.
- CLI reference updated for --max-runtime, heartbeat, assignees.
- Gateway notifications section updated for the new TERMINAL_KINDS.

Not ported from Multica (deliberate, documented in the out-of-scope section already): Postgres+pgvector skill search (heavy deps conflict with SQLite kernel), server+daemon cross-host model (we're single-host on purpose), first-class agent identity with threaded comments (we keep the board profile-agnostic).
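The SIGTERM, grace, SIGKILL sequence described in item 1 can be sketched roughly as below. This is an illustration only, not the kernel's code: `terminate_with_grace` and `pid_alive` are hypothetical names (the commit only mentions `enforce_max_runtime(conn)` and `_pid_alive`), and the injectable callables are an assumption made so the flow can be exercised with stubs, much as the test list describes.

```python
import os
import signal
import time
from typing import Callable


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX-only check)."""
    try:
        os.kill(pid, 0)  # signal 0 probes for existence without delivering anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_with_grace(
    pid: int,
    grace_seconds: float = 5.0,
    poll_interval: float = 0.1,
    send_signal: Callable[[int, int], None] = os.kill,
    is_alive: Callable[[int], bool] = pid_alive,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """SIGTERM the process, poll for exit, then SIGKILL if it survives.

    Returns True when SIGKILL was needed (assumed to correspond to the
    `sigkill` flag in the `timed_out` payload).
    """
    try:
        send_signal(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False  # the worker had already exited

    waited = 0.0
    while waited < grace_seconds:
        if not is_alive(pid):
            return False  # exited cleanly within the grace period
        sleep(poll_interval)
        waited += poll_interval

    if is_alive(pid):
        send_signal(pid, signal.SIGKILL)
        return True
    return False
```

Counting slept time instead of reading the clock keeps the sketch deterministic under a stubbed `sleep`; a real implementation might prefer `time.monotonic()`.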
@@ -263,6 +263,7 @@ hermes kanban create "<title>" [--body ...] [--assignee <profile>]
                     [--parent <id>]... [--tenant <name>]
                     [--workspace scratch|worktree|dir:<path>]
                     [--priority N] [--triage] [--idempotency-key KEY]
                     [--max-runtime 30m|2h|1d|<seconds>]
                     [--json]
hermes kanban list [--mine] [--assignee P] [--status S] [--tenant T] [--archived] [--json]
hermes kanban show <id> [--json]
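For reference, the `--max-runtime` values shown above (`30m`, `2h`, `1d`, bare seconds) could be parsed along these lines. This is a minimal sketch, not the project's `_parse_duration`: the function name `parse_duration`, the truncation of fractional results to whole seconds, and the use of `ValueError` for bad input are all assumptions.

```python
import re

# Seconds per supported suffix; a bare number is treated as seconds.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_DURATION_RE = re.compile(r"(\d+(?:\.\d+)?)([smhd]?)")


def parse_duration(text: str) -> int:
    """Convert strings such as '30m', '2h', '1d', or '90' into whole seconds."""
    match = _DURATION_RE.fullmatch(text.strip().lower())
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    value, unit = match.groups()
    # Fractional inputs such as '1.5h' are accepted and truncated to an int.
    return int(float(value) * _UNIT_SECONDS[unit or "s"])
```

Under these assumptions, `parse_duration("2h")` yields 7200, matching the `2h → 7200` case named in the test list.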
@@ -281,6 +282,8 @@ hermes kanban archive <id>...
hermes kanban tail <id>  # follow a single task's event stream
hermes kanban watch [--assignee P] [--tenant T]  # live stream ALL events to the terminal
                    [--kinds completed,blocked,…] [--interval SECS]
hermes kanban heartbeat <id> [--note "..."]  # worker liveness signal for long ops
hermes kanban assignees [--json]  # profiles on disk + per-assignee task counts
hermes kanban dispatch [--dry-run] [--max N]  # one-shot pass
                       [--failure-limit N] [--json]
hermes kanban daemon [--interval SECS] [--max N]  # long-lived loop
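The data behind `hermes kanban assignees` (profiles on disk merged with assignees already on the board) might be assembled as in the sketch below. The helper names `profiles_on_disk` and `merge_assignees` are hypothetical stand-ins for `list_profiles_on_disk` and `known_assignees`, whose real signatures are not shown in this commit; passing board tasks in as `(assignee, status)` pairs is likewise an assumption, chosen to avoid guessing at the SQLite schema.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable


def profiles_on_disk(profiles_dir: Path) -> list[str]:
    """Names of subdirectories that contain a config.yaml file."""
    if not profiles_dir.is_dir():
        return []
    return sorted(
        entry.name
        for entry in profiles_dir.iterdir()
        if entry.is_dir() and (entry / "config.yaml").is_file()
    )


def merge_assignees(
    profiles_dir: Path, board_tasks: Iterable[tuple[str, str]]
) -> list[dict]:
    """Union on-disk profiles with board assignees, with per-status counts."""
    on_disk = set(profiles_on_disk(profiles_dir))
    counts: dict[str, Counter] = {}
    for assignee, status in board_tasks:
        counts.setdefault(assignee, Counter())[status] += 1
    return [
        {
            "name": name,
            "on_disk": name in on_disk,
            "counts": dict(counts.get(name, {})),
        }
        for name in sorted(on_disk | set(counts))
    ]
```

Each entry has the `{name, on_disk, counts: {status: n}}` shape given in the commit message; an assignee that appears only on the board is reported with `on_disk` set to `False`.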
@@ -332,7 +335,7 @@ Workers receive `$HERMES_TENANT` and namespace their memory writes by prefix. Th

## Gateway notifications

-When you run `/kanban create …` from the gateway (Telegram, Discord, Slack, etc.), the originating chat is automatically subscribed to the new task. The gateway's background notifier polls `task_events` every few seconds and delivers one message per terminal event (`completed`, `blocked`, `spawn_auto_blocked`, `crashed`) to that chat. Completed tasks also send the first line of the worker's `--result` so you see the outcome without having to `/kanban show`.
+When you run `/kanban create …` from the gateway (Telegram, Discord, Slack, etc.), the originating chat is automatically subscribed to the new task. The gateway's background notifier polls `task_events` every few seconds and delivers one message per terminal event (`completed`, `blocked`, `gave_up`, `crashed`, `timed_out`) to that chat. Completed tasks also send the first line of the worker's `--result` so you see the outcome without having to `/kanban show`.

You can manage subscriptions explicitly from the CLI, which is useful when a script or cron job wants to notify a chat it didn't originate from:

@@ -346,6 +349,45 @@ hermes kanban notify-unsubscribe t_abcd \

A subscription removes itself automatically once the task reaches `done` or `archived`; no cleanup needed.

## Event reference

Every transition appends a row to `task_events`. The kinds group into three clusters so filtering is easy (`hermes kanban watch --kinds completed,gave_up,timed_out`):

**Lifecycle** (what changed about the task as a logical unit):

| Kind | When |
|---|---|
| `created` | Task inserted. |
| `promoted` | `todo → ready` because all parents hit `done`. |
| `claimed` | Dispatcher atomically claimed a `ready` task for spawn. |
| `completed` | Worker wrote `--result` and task hit `done`. |
| `blocked` | Worker or human flipped the task to `blocked`. |
| `unblocked` | `blocked → ready`, either manually or via `/unblock`. |
| `archived` | Hidden from the default board. |

**Edits** (human-driven changes that aren't transitions):

| Kind | When |
|---|---|
| `assigned` | Assignee changed (including unassignment). |
| `edited` | Title or body updated. |
| `reprioritized` | Priority changed. |
| `status` | Dashboard drag-drop wrote a status directly (e.g. `todo → ready`). |

**Worker telemetry** (about the execution process, not the logical task):

| Kind | Payload | When |
|---|---|---|
| `spawned` | `{pid}` | Dispatcher successfully started a worker process. |
| `heartbeat` | `{note?}` | Worker called `hermes kanban heartbeat $TASK` to signal liveness during long operations. |
| `reclaimed` | `{stale_lock}` | Claim TTL expired without a completion; task goes back to `ready`. |
| `crashed` | `{pid, claimer}` | Worker PID no longer alive but TTL hadn't expired yet. |
| `timed_out` | `{pid, elapsed_seconds, limit_seconds, sigkill}` | `max_runtime_seconds` exceeded; dispatcher SIGTERM'd (then SIGKILL'd after 5 s grace) and re-queued. |
| `spawn_failed` | `{error, failures}` | One spawn attempt failed (missing PATH, workspace unmountable, …). Counter increments; task returns to `ready` for retry. |
| `gave_up` | `{failures, error}` | Circuit breaker fired after N consecutive `spawn_failed`. Task auto-blocks with the last error. Default N = 5; override via `--failure-limit`. |

`hermes kanban tail <id>` shows these for a single task. `hermes kanban watch` streams them board-wide.
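
Databases created before this vocabulary was introduced may still contain rows with the three legacy kinds (`ready`, `priority`, `spawn_auto_blocked`). The sketch below shows one way such rows could be renamed in place. It is illustrative only: the `task_events` table name comes from this document, but the `kind` column name and the function name `rename_legacy_event_kinds` are assumptions, and the shipped migration may be structured differently.

```python
import sqlite3

# Legacy kind -> current kind, per the renames listed in the commit message.
LEGACY_KIND_RENAMES = {
    "ready": "promoted",
    "priority": "reprioritized",
    "spawn_auto_blocked": "gave_up",
}


def rename_legacy_event_kinds(conn: sqlite3.Connection) -> int:
    """Rewrite legacy event kinds in place and return the number of rows changed.

    Safe to run repeatedly: once no legacy rows remain, it changes nothing.
    """
    renamed = 0
    with conn:  # commits on success, rolls back if an UPDATE fails
        for old, new in LEGACY_KIND_RENAMES.items():
            cursor = conn.execute(
                "UPDATE task_events SET kind = ? WHERE kind = ?", (new, old)
            )
            renamed += cursor.rowcount
    return renamed
```

Because the update is keyed on the old names, running it on every database open costs one cheap scan per rename and leaves already-migrated boards untouched.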

## Out of scope

Kanban is deliberately single-host. `~/.hermes/kanban.db` is a local SQLite file and the dispatcher spawns workers on the same machine. Running a shared board across two hosts is not supported: there's no coordination primitive for "worker X on host A, worker Y on host B," and the crash-detection path assumes PIDs are host-local. If you need multi-host, run an independent board per host and use `delegate_task` / a message queue to bridge them.