Mirror of https://github.com/NousResearch/hermes-agent.git, synced 2026-04-29 15:31:38 +08:00

feat(kanban): core hardening — daemon, circuit breaker, crash detect, logs, notify, bulk, stats

Eliminates every 'known broken on day one' item in the core functionality
audit. The board is now self-driving (daemon, not cron), self-healing
(crash detection, spawn-failure circuit breaker), and self-reporting
(logs, stats, gateway notifications).

Dispatcher
- New `hermes kanban daemon` long-lived loop with --interval, --max,
  --failure-limit, --pidfile, --verbose, signal-clean shutdown
  (SIGINT/SIGTERM via threading.Event). A kb.run_daemon() entry point
  lets tests drive it inline without subprocess.
- `hermes kanban init` now prints the dispatcher setup hint so users
  don't leave the board off-by-default. Ships a systemd user unit at
  plugins/kanban/systemd/hermes-kanban-dispatcher.service.
- Removed the old 'add this to cron' doc path. Cron runs agent
  prompts (LLM cost per tick) — unacceptable for a per-minute
  coordination loop.
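
The daemon loop described above can be sketched as follows. This is a minimal illustration, not the project's code: `run_daemon` here is a hypothetical stand-in for `kb.run_daemon()`, and its parameters (`tick`, `interval`, `stop_event`, `install_signals`) are assumptions. It shows the two properties the commit names: shutdown through a `threading.Event`, and a loop that survives an exception in a single tick.

```python
import signal
import threading


def run_daemon(tick, interval=60.0, stop_event=None, install_signals=False):
    """Call tick() every `interval` seconds until stop_event is set.

    Hypothetical sketch; the real kb.run_daemon() signature may differ.
    Returns the number of ticks attempted.
    """
    if stop_event is None:
        stop_event = threading.Event()
    if install_signals:
        # Signal handlers can only be installed from the main thread.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, lambda signum, frame: stop_event.set())
    ticks = 0
    while not stop_event.is_set():
        ticks += 1
        try:
            tick()
        except Exception as exc:
            # One bad tick must not kill the loop.
            print(f"tick {ticks} failed: {exc!r}")
        # wait() returns as soon as the event is set, so shutdown does
        # not have to sit out the rest of the interval.
        stop_event.wait(interval)
    return ticks
```

Passing a `stop_event` is what lets a test drive the loop inline: the test sets the event from inside `tick` (or from another thread) and the loop exits after the current pass.
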

Worker aliveness / safety
- Spawn returns the child's PID; dispatcher stores it on the task row
  and calls detect_crashed_workers() every tick. If the PID is gone
  but the claim TTL hasn't expired, the task drops back to ready with
  a 'crashed' event. Host-local only — cross-host PIDs are ignored
  per the single-host design.
- Spawn-failure circuit breaker: after N consecutive spawn_failed
  events on the same task (default 5), the dispatcher auto-blocks
  with the last error as the reason. Success resets the counter.
  Workspace-resolution failures count against the same budget.
- Log rotation: _rotate_worker_log trims at 2 MiB, keeps one
  generation (.log.1), bounds per-task disk usage at ~4 MiB.
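
The spawn-failure circuit breaker above comes down to a per-task counter. The sketch below is an in-memory illustration under stated assumptions: the `Task` dataclass and `record_spawn_result` are hypothetical names, and only the `spawn_failures` and `last_spawn_error` fields mirror columns named in this commit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    # Hypothetical in-memory stand-in for a row in the tasks table.
    id: str
    status: str = "ready"
    spawn_failures: int = 0
    last_spawn_error: Optional[str] = None
    block_reason: Optional[str] = None


def record_spawn_result(task, error=None, failure_limit=5):
    """Update the counter after one spawn attempt.

    error=None means the spawn succeeded. Returns True when this
    attempt tripped the breaker and the task was auto-blocked.
    """
    if error is None:
        # Success resets the counter.
        task.spawn_failures = 0
        task.last_spawn_error = None
        return False
    task.spawn_failures += 1
    task.last_spawn_error = error
    if task.spawn_failures >= failure_limit:
        # Auto-block with the last error as the reason.
        task.status = "blocked"
        task.block_reason = error
        return True
    return False
```

Because only consecutive failures count, a task that fails four times, spawns once, and then fails again starts from one rather than five.
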

Idempotency / dedup
- create_task(idempotency_key=...) returns the existing non-archived
  task id for retried webhooks. --idempotency-key on the CLI, json
  body field on the dashboard plugin. Archived tasks don't block a
  fresh create with the same key.
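
A minimal sketch of the dedup rule above, using an in-memory SQLite table. The table layout, the `open_board` helper, and the id format are assumptions for illustration; only the `idempotency_key` column name and the "archived tasks don't block" rule come from this commit.

```python
import sqlite3
import uuid


def open_board():
    # Hypothetical, reduced schema; the real tasks table has more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, title TEXT, "
        "status TEXT NOT NULL DEFAULT 'todo', idempotency_key TEXT)"
    )
    return conn


def create_task(conn, title, idempotency_key=None):
    """Return the id of a new task, or of a live task with the same key."""
    with conn:  # commit on success, roll back on error
        if idempotency_key is not None:
            row = conn.execute(
                "SELECT id FROM tasks WHERE idempotency_key = ? "
                "AND status != 'archived' LIMIT 1",
                (idempotency_key,),
            ).fetchone()
            if row:
                return row[0]
        task_id = "t_" + uuid.uuid4().hex[:8]
        conn.execute(
            "INSERT INTO tasks (id, title, idempotency_key) VALUES (?, ?, ?)",
            (task_id, title, idempotency_key),
        )
        return task_id
```

A retried webhook that reuses its key gets the original id back; a call without a key always creates a new row.
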

CLI surface
- Bulk verbs: complete, unblock, archive accept multiple ids;
  block accepts --ids for sibling blocks with the same reason.
- New verbs: daemon, watch (live event tail filtered by
  assignee/tenant/kinds), stats, log, notify-subscribe,
  notify-list, notify-unsubscribe.
- dispatch gains --failure-limit + crashed/auto_blocked columns in
  JSON output and human-readable output.
- gc accepts --event-retention-days / --log-retention-days; prunes
  task_events for terminal tasks and old log files.

Gateway integration
- New GatewayRunner._kanban_notifier_watcher: polls
  kanban_notify_subs every 5s, pushes ✔/⏸/✖ messages to subscribed
  chats for completed/blocked/spawn_auto_blocked/crashed events.
  Cursor-advanced per-sub; auto-removed when the task reaches
  done/archived. Runs alongside the session expiry and platform
  reconnect watchers — SQLite work in asyncio.to_thread so the
  event loop never blocks.
- /kanban create in the gateway auto-subscribes the originating
  chat (platform + chat_id + thread_id). Users see
  '(subscribed — you'll be notified when t_abcd completes or
  blocks)' appended to the response.
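
The notifier pattern above (a per-subscription cursor, with blocking SQLite work pushed off the event loop) might look like the sketch below. The schema, the `last_event_id` cursor column, and the `collect_pending` / `poll_once` names are assumptions for illustration; only the table names, the four event kinds, and the use of `asyncio.to_thread` come from this commit.

```python
import asyncio
import sqlite3

NOTIFY_KINDS = ("completed", "blocked", "spawn_auto_blocked", "crashed")

# Hypothetical, reduced schema for the two tables involved.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT, task_id TEXT, kind TEXT);
CREATE TABLE IF NOT EXISTS kanban_notify_subs (
    task_id TEXT, chat_id TEXT, last_event_id INTEGER NOT NULL DEFAULT 0);
"""


def collect_pending(db_path):
    """Blocking: return [(chat_id, task_id, kind)] and advance each cursor."""
    # The connection is opened inside the worker thread because sqlite3
    # connections are bound to the thread that created them by default.
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(SCHEMA)
        pending = []
        marks = ",".join("?" * len(NOTIFY_KINDS))
        subs = conn.execute(
            "SELECT rowid, task_id, chat_id, last_event_id "
            "FROM kanban_notify_subs"
        ).fetchall()
        for rowid, task_id, chat_id, cursor in subs:
            events = conn.execute(
                "SELECT id, kind FROM task_events "
                f"WHERE task_id = ? AND id > ? AND kind IN ({marks}) "
                "ORDER BY id",
                (task_id, cursor, *NOTIFY_KINDS),
            ).fetchall()
            for event_id, kind in events:
                pending.append((chat_id, task_id, kind))
                cursor = event_id
            conn.execute(
                "UPDATE kanban_notify_subs SET last_event_id = ? "
                "WHERE rowid = ?",
                (cursor, rowid),
            )
        conn.commit()
        return pending
    finally:
        conn.close()


async def poll_once(db_path, send):
    """Run one poll; `send` is an async callable (chat_id, text)."""
    pending = await asyncio.to_thread(collect_pending, db_path)
    for chat_id, task_id, kind in pending:
        await send(chat_id, f"{task_id}: {kind}")
    return len(pending)
```

Advancing the cursor in the same transaction that reads the events is what keeps a second poll from delivering the same message twice.
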

Dashboard plugin
- GET /stats returns board_stats (by_status, by_assignee,
  oldest_ready_age_seconds).
- GET /tasks/:id/log returns the worker log with optional ?tail=N
  cap. 404 on unknown task, exists=false when the task has never
  spawned.
- POST /tasks accepts idempotency_key; both Pydantic body and the
  create_task kwarg now round-trip.
- /board attaches task.age (created/started/time_to_complete in
  seconds) so the UI can colour stale cards without recomputing.
- Card CSS: amber border after N minutes, red border when clearly
  stuck (tier per status: running 10m/60m, ready 1h/24h, todo
  7d/30d, blocked 1h/24h).
- Drawer: new Worker log section, auto-loads on mount, last 100 KB
  cap with on-disk path surfaced when truncated.
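
The per-status staleness tiers listed above can be written down as a small lookup. The thresholds are taken from the bullet; everything else is an assumption: the dashboard applies them in CSS/JS, the `staleness` function and `STALE_TIERS` table are hypothetical, and which age (created or started) feeds each status is not specified here.

```python
# (amber_after_seconds, red_after_seconds) per status, from the commit text.
STALE_TIERS = {
    "running": (10 * 60, 60 * 60),        # 10m / 60m
    "ready": (60 * 60, 24 * 3600),        # 1h / 24h
    "todo": (7 * 86400, 30 * 86400),      # 7d / 30d
    "blocked": (60 * 60, 24 * 3600),      # 1h / 24h
}


def staleness(status, age_seconds):
    """Classify a card as 'fresh', 'amber', or 'red' for its status."""
    tiers = STALE_TIERS.get(status)
    if tiers is None:
        # Statuses without a tier (done, archived, ...) are never stale.
        return "fresh"
    amber_after, red_after = tiers
    if age_seconds >= red_after:
        return "red"
    if age_seconds >= amber_after:
        return "amber"
    return "fresh"
```

For example, a card that has been `running` for 15 minutes is amber, while a `todo` card of the same age is still fresh.
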

Kernel
- Schema additions: tasks.idempotency_key, tasks.spawn_failures,
  tasks.worker_pid, tasks.last_spawn_error; new
  kanban_notify_subs table. All gated by _migrate_add_optional_columns
  so legacy DBs upgrade cleanly.
- release_stale_claims / complete_task / block_task now all clear
  worker_pid so crash detection doesn't false-positive on reclaimed
  tasks.
- read_worker_log fixed: tail-skip no longer eats one-giant-line
  logs (common with child processes that don't flush newlines
  before dying).
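
The read_worker_log fix above concerns a common tail-reading pitfall: after seeking into the middle of a file, the first line is usually partial and gets skipped, but if the window contains no newline at all, skipping it discards everything. The sketch below shows one way to guard against that; `read_log_tail` is a hypothetical function, not the project's implementation.

```python
import os


def read_log_tail(path, tail_bytes):
    """Return (text, truncated) holding at most the last tail_bytes bytes."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        if size <= tail_bytes:
            return fh.read().decode("utf-8", errors="replace"), False
        fh.seek(size - tail_bytes)
        chunk = fh.read()
    newline = chunk.find(b"\n")
    # Drop the leading partial line only when something follows it.
    # With no newline in the window (one giant line), keep the chunk.
    if newline != -1 and newline + 1 < len(chunk):
        chunk = chunk[newline + 1:]
    return chunk.decode("utf-8", errors="replace"), True
```

The `truncated` flag gives a caller what it needs to surface the on-disk path when the returned text is not the whole log.
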

Tests (tests/hermes_cli/test_kanban_core_functionality.py, 28 new)
- Idempotency: same key returns existing, archived doesn't block,
  no key never collides
- Circuit breaker: auto-blocks after limit, success resets counter,
  workspace-resolution failure counts against budget
- Aliveness: _pid_alive helper, detect_crashed_workers reclaims
  exited child
- Daemon: runs and stops cleanly via stop_event, survives a tick
  exception
- Stats + task_age helpers
- Notify subs: CRUD, cursor advances, distinct-thread is a separate row
- GC: events-only-for-terminal-tasks, old worker logs deleted
- Log: rotation keeps one generation, read_worker_log tail
- CLI: bulk complete/archive/unblock/block, create with
  --idempotency-key, stats --json, notify-subscribe+list, log
  missing task, gc reports counts
- run_slash parity: smoke-tests every registered verb (23
  invocations); none may raise or return empty string

Full kanban test suite: 234/234 pass under scripts/run_tests.sh
(60 original + 30 dashboard plugin + 28 new core + 116 command
registry). Live smoke covers /stats, idempotency, age, log endpoint
with and without content, log?tail= truncation signal, 404 on unknown
task.
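
The aliveness check exercised by the tests above is typically done on POSIX by sending signal 0, which performs the permission and existence checks without delivering a signal. The sketch below assumes a POSIX host (signal 0 does not behave this way on Windows); `pid_alive` and `find_crashed` are hypothetical names, not the project's `_pid_alive` or `detect_crashed_workers`.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID exists on this host (POSIX)."""
    if pid is None or pid <= 0:
        # 0 and negative values address process groups, not one process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def find_crashed(running):
    """running maps task id -> worker PID; return ids whose worker is gone."""
    return [task_id for task_id, pid in running.items() if not pid_alive(pid)]
```

Two caveats apply to any check of this kind: a child that has exited but has not been reaped still counts as alive, and PIDs can be reused, which is one reason to clear the stored PID whenever a task is completed, blocked, or reclaimed.
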

Docs (website/docs/user-guide/features/kanban.md)
- 'Core concepts' rewritten: new statuses (triage), idempotency key,
  dispatcher-as-daemon-not-cron with circuit breaker behaviour
  documented.
- Quick start swapped to daemon. New systemd section covers user
  service install.
- New sections: idempotent create, bulk verbs, gateway
  notifications, out-of-scope single-host note (kanban.db is local;
  don't expect multi-host).
- CLI reference updated for every new verb, every new flag.

@@ -43,14 +43,14 @@ They coexist: a kanban worker may call `delegate_task` internally during its run

 ## Core concepts

-- **Task** — a row with title, optional body, one assignee (a profile name), status (`todo | ready | running | blocked | done | archived`), optional tenant namespace.
+- **Task** — a row with title, optional body, one assignee (a profile name), status (`triage | todo | ready | running | blocked | done | archived`), optional tenant namespace, optional idempotency key (dedup for retried automation).
 - **Link** — `task_links` row recording a parent → child dependency. The dispatcher promotes `todo → ready` when all parents are `done`.
 - **Comment** — the inter-agent protocol. Agents and humans append comments; when a worker is (re-)spawned it reads the full comment thread as part of its context.
 - **Workspace** — the directory a worker operates in. Three kinds:
   - `scratch` (default) — fresh tmp dir under `~/.hermes/kanban/workspaces/<id>/`.
   - `dir:<path>` — an existing shared directory (Obsidian vault, mail ops dir, per-account folder).
   - `worktree` — a git worktree under `.worktrees/<id>/` for coding tasks.
-- **Dispatcher** — `hermes kanban dispatch` runs a one-shot pass: reclaim stale claims, promote ready tasks, atomically claim, spawn assigned profiles. Runs via cron every 60 seconds.
+- **Dispatcher** — a long-lived loop that, every N seconds (default 60): reclaims stale claims, reclaims crashed workers (PID gone but TTL not yet expired), promotes ready tasks, atomically claims, spawns assigned profiles. Runs as `hermes kanban daemon` (foreground) or as a systemd user service. After ~5 consecutive spawn failures on the same task the dispatcher auto-blocks it with the last error as the reason — prevents thrashing on tasks whose profile doesn't exist, workspace can't mount, etc.
 - **Tenant** — optional string namespace. One specialist fleet can serve multiple businesses (`--tenant business-a`) with data isolation by workspace path and memory key prefix.

 ## Quick start
@@ -59,23 +59,58 @@ They coexist: a kanban worker may call `delegate_task` internally during its run
 # 1. Create the board
 hermes kanban init

-# 2. Create a task
+# 2. Start the dispatcher (the trailing & backgrounds it; drop the & to run in the foreground and stop with Ctrl-C)
+hermes kanban daemon &
+
+# 3. Create a task
 hermes kanban create "research AI funding landscape" --assignee researcher

-# 3. List what's on the board
-hermes kanban list
+# 4. Watch activity live
+hermes kanban watch

-# 4. Run a dispatcher pass (dry-run to preview, real to spawn workers)
-hermes kanban dispatch --dry-run
-hermes kanban dispatch
+# 5. See the board
+hermes kanban list
+hermes kanban stats
 ```

-To have the board run continuously, schedule the dispatcher:
+### Running the dispatcher as a service
+
+For production, install the systemd user unit shipped at
+`plugins/kanban/systemd/hermes-kanban-dispatcher.service`:

 ```bash
-hermes cron add --schedule "*/1 * * * *" \
-    --name kanban-dispatch \
-    hermes kanban dispatch
+mkdir -p ~/.config/systemd/user
+cp plugins/kanban/systemd/hermes-kanban-dispatcher.service \
+    ~/.config/systemd/user/
+systemctl --user daemon-reload
+systemctl --user enable --now hermes-kanban-dispatcher.service
+systemctl --user status hermes-kanban-dispatcher
+journalctl --user -u hermes-kanban-dispatcher -f   # follow logs
 ```
+
+Without a running dispatcher `ready` tasks stay where they are — `hermes kanban init` will remind you of this on first run.
+
+### Idempotent create (for automation / webhooks)
+
+```bash
+# First call creates the task. Any subsequent call with the same key
+# returns the existing task id instead of duplicating.
+hermes kanban create "nightly ops review" \
+    --assignee ops \
+    --idempotency-key "nightly-ops-$(date -u +%Y-%m-%d)" \
+    --json
+```
+
+### Bulk CLI verbs
+
+All the lifecycle verbs accept multiple ids so you can clean up a batch
+in one command:
+
+```bash
+hermes kanban complete t_abc t_def t_hij --result "batch wrap"
+hermes kanban archive t_abc t_def t_hij
+hermes kanban unblock t_abc t_def
+hermes kanban block t_abc "need input" --ids t_def t_hij
+```

 ## The worker skill
@@ -223,11 +258,12 @@ The GUI is deliberately thin. Everything the plugin does is reachable from the C
 ## CLI command reference

 ```
-hermes kanban init  # create kanban.db
+hermes kanban init  # create kanban.db + print daemon hint
 hermes kanban create "<title>" [--body ...] [--assignee <profile>]
         [--parent <id>]... [--tenant <name>]
         [--workspace scratch|worktree|dir:<path>]
-        [--priority N] [--triage] [--json]
+        [--priority N] [--triage] [--idempotency-key KEY]
+        [--json]
 hermes kanban list [--mine] [--assignee P] [--status S] [--tenant T] [--archived] [--json]
 hermes kanban show <id> [--json]
 hermes kanban assign <id> <profile>  # or 'none' to unassign
@@ -235,14 +271,30 @@ hermes kanban link <parent_id> <child_id>
 hermes kanban unlink <parent_id> <child_id>
 hermes kanban claim <id> [--ttl SECONDS]
 hermes kanban comment <id> "<text>" [--author NAME]
-hermes kanban complete <id> [--result "..."]
-hermes kanban block <id> "<reason>"
-hermes kanban unblock <id>
-hermes kanban archive <id>
-hermes kanban tail <id>  # follow event stream
-hermes kanban dispatch [--dry-run] [--max N] [--json]
+
+# Bulk verbs — accept multiple ids:
+hermes kanban complete <id>... [--result "..."]
+hermes kanban block <id> "<reason>" [--ids <id>...]
+hermes kanban unblock <id>...
+hermes kanban archive <id>...
+
+hermes kanban tail <id>  # follow a single task's event stream
+hermes kanban watch [--assignee P] [--tenant T]  # live stream ALL events to the terminal
+        [--kinds completed,blocked,…] [--interval SECS]
+hermes kanban dispatch [--dry-run] [--max N]  # one-shot pass
+        [--failure-limit N] [--json]
+hermes kanban daemon [--interval SECS] [--max N]  # long-lived loop
+        [--failure-limit N] [--pidfile PATH] [-v]
+hermes kanban stats [--json]  # per-status + per-assignee counts
+hermes kanban log <id> [--tail BYTES]  # worker log from ~/.hermes/kanban/logs/
+hermes kanban notify-subscribe <id>  # gateway bridge hook (used by /kanban in the gateway)
+        --platform <name> --chat-id <id> [--thread-id <id>] [--user-id <id>]
+hermes kanban notify-list [<id>] [--json]
+hermes kanban notify-unsubscribe <id>
+        --platform <name> --chat-id <id> [--thread-id <id>]
 hermes kanban context <id>  # what a worker sees
-hermes kanban gc  # remove scratch dirs of archived tasks
+hermes kanban gc [--event-retention-days N]  # workspaces + old events + old logs
+        [--log-retention-days N]
 ```

 All commands are also available as a slash command in the gateway (`/kanban list`, `/kanban comment t_abc "need docs"`, etc.). The slash command bypasses the running-agent guard, so you can `/kanban unblock` a stuck worker while the main agent is still chatting.
@@ -278,6 +330,26 @@ hermes kanban create "monthly report" \

 Workers receive `$HERMES_TENANT` and namespace their memory writes by prefix. The board, the dispatcher, and the profile definitions are all shared; only the data is scoped.

+## Gateway notifications
+
+When you run `/kanban create …` from the gateway (Telegram, Discord, Slack, etc.), the originating chat is automatically subscribed to the new task. The gateway's background notifier polls `task_events` every few seconds and delivers one message per terminal event (`completed`, `blocked`, `spawn_auto_blocked`, `crashed`) to that chat. Completed tasks also send the first line of the worker's `--result` so you see the outcome without having to `/kanban show`.
+
+You can manage subscriptions explicitly from the CLI — useful when a script / cron job wants to notify a chat it didn't originate from:
+
+```bash
+hermes kanban notify-subscribe t_abcd \
+    --platform telegram --chat-id 12345678 --thread-id 7
+hermes kanban notify-list
+hermes kanban notify-unsubscribe t_abcd \
+    --platform telegram --chat-id 12345678 --thread-id 7
+```
+
+A subscription removes itself automatically once the task reaches `done` or `archived`; no cleanup needed.
+
+## Out of scope
+
+Kanban is deliberately single-host. `~/.hermes/kanban.db` is a local SQLite file and the dispatcher spawns workers on the same machine. Running a shared board across two hosts is not supported — there's no coordination primitive for "worker X on host A, worker Y on host B," and the crash-detection path assumes PIDs are host-local. If you need multi-host, run an independent board per host and use `delegate_task` / a message queue to bridge them.
+
 ## Design spec

 The complete design — architecture, concurrency correctness, comparison with other systems, implementation plan, risks, open questions — lives in `docs/hermes-kanban-v1-spec.pdf`. Read that before filing any behavior-change PR.