Files
hermes-agent/skills/devops
Teknium da7d09c3b6 feat(kanban): max-runtime timeouts, worker heartbeats, assignees picker, event vocab cleanup
Ports four items from the Multica audit (https://github.com/multica-ai/multica).
Dropped their cross-host server/daemon architecture and their Postgres+pgvector
skill search — both the wrong shape for our single-host SQLite kernel.

1. Per-task max-runtime (`max_runtime_seconds` column)
   - New kernel function `enforce_max_runtime(conn)` runs in every dispatch
     tick. When a running task's elapsed time exceeds the cap, we SIGTERM
     the worker, wait a 5 s grace (polling _pid_alive), then SIGKILL. The
     task goes back to 'ready' with a `timed_out` event and re-queues
     on the next tick (unless the spawn-failure circuit breaker has
     already parked it).
   - Host-local only: lock prefix must match this host's claimer_id so we
     never signal a PID on another machine.
   - CLI: `hermes kanban create --max-runtime 30m | 2h | 1d | <seconds>`.
     New `_parse_duration` helper accepts s/m/h/d suffixes or bare
     integers.
   - Dashboard POST body + the card's `max_runtime_seconds` field.

2. Worker heartbeat (`last_heartbeat_at` column, `heartbeat` event)
   - `heartbeat_worker(conn, task_id, note=None)` emits the event and
     touches last_heartbeat_at. Refused when the task isn't running.
   - CLI: `hermes kanban heartbeat <id> [--note "..."]`.
   - kanban-worker skill instructs workers to heartbeat during long
     loops (training runs, encodes, crawls, batch uploads).
   - Separate signal from PID crash detection: a worker's Python can
     still be alive while the actual work process is stuck. Heartbeat
     absence is diagnostic; future work can auto-block on stale
     heartbeats but v1 just surfaces the signal.

3. Assignee enumeration (`known_assignees`, `list_profiles_on_disk`)
   - Scans ~/.hermes/profiles/ for dirs containing config.yaml + unions
     with current assignees on the board. Each entry returns
     {name, on_disk, counts: {status: n}}.
   - CLI: `hermes kanban assignees [--json]`. Also hooked into
     `hermes kanban init` which now prints discovered profiles so new
     installs see 'these are the assignees you can target' immediately.
   - Dashboard: GET /api/plugins/kanban/assignees for the picker.

4. Event vocab cleanup (three renames + three new kinds)
   - `ready` → `promoted` (fires when deps clear; clearer semantic).
   - `priority` → `reprioritized` (past-tense verb, matches others).
   - `spawn_auto_blocked` → `gave_up` (short, memorable; the circuit
     breaker gave up on this task).
   - New: `spawned` (emitted with {pid} on successful spawn),
     `heartbeat` ({note?}), `timed_out`
     ({pid, elapsed_seconds, limit_seconds, sigkill}).
   - One-shot migration in `_migrate_add_optional_columns` renames
     legacy rows in-place on init_db(), so existing DBs upgrade cleanly.
   - Gateway notifier's TERMINAL_KINDS set updated; timed_out gets its
     own ⏱ message template, gave_up renamed from 'auto-blocked'.
   - Plugin_api.py's two 'priority' emit sites renamed to
     'reprioritized'.
   - Documented in a new 'Event reference' section in kanban.md,
     grouped into three clusters (lifecycle / edits / worker
     telemetry) with payload shapes.

Tests (+18 in tests/hermes_cli/test_kanban_core_functionality.py,
136/136 pass):
  - max_runtime_terminates_overrun_worker: real SIGTERM flow with
    _pid_alive stub, verifies event payload + state reset.
  - max_runtime_none_means_no_cap: unbounded tasks aren't timed out.
  - create_task_persists_max_runtime.
  - enforce_max_runtime_integrates_with_dispatch: kernel-level +
    dispatch_once chaining.
  - heartbeat_on_running_task + heartbeat_refused_when_not_running.
  - cli_heartbeat_verb with --note round-trip.
  - recompute_ready_emits_promoted_not_ready.
  - spawn_failure_circuit_breaker_emits_gave_up.
  - spawned_event_emitted_with_pid.
  - migration_renames_legacy_event_kinds (injects old rows, re-runs
    init_db, asserts rename).
  - list_profiles_on_disk (tmp_path + config.yaml filter).
  - known_assignees_merges_disk_and_board (profiles on disk + board
    assignees + per-status counts).
  - cli_assignees_json.
  - parse_duration_accepts_formats (s/m/h/d/float).
  - parse_duration_rejects_garbage.
  - cli_create_max_runtime_via_duration (2h → 7200).
  - cli_create_max_runtime_bad_format_exits_nonzero.

Live smoke: POST /tasks with max_runtime_seconds round-trips;
/assignees returns the union of on-disk + board-assigned names;
PATCH priority produces 'reprioritized' events (not 'priority');
board cards expose max_runtime_seconds + last_heartbeat_at.

Docs (website/docs/user-guide/features/kanban.md):
  - New 'Event reference' section with three-cluster table
    (lifecycle / edits / worker telemetry) + payload shapes.
  - CLI reference updated for --max-runtime, heartbeat, assignees.
  - Gateway notifications section updated for the new TERMINAL_KINDS.

Not ported from Multica (deliberate, documented in the out-of-scope
section already): Postgres+pgvector skill search (heavy deps conflict
with SQLite kernel), server+daemon cross-host model (we're
single-host on purpose), first-class agent identity with threaded
comments (we keep the board profile-agnostic).
2026-04-27 06:32:17 -07:00
..