Compare commits

...

8 Commits

Author SHA1 Message Date
Teknium
9a19cd6cf3 feat: persist reasoning across gateway session turns (schema v6)
Add reasoning TEXT, reasoning_details TEXT, and codex_reasoning_items
TEXT columns to the messages table (schema v5->v6). This preserves
assistant reasoning chains across gateway session reloads so all
provider-specific reasoning formats survive the round-trip.

Three reasoning formats are now persisted:
- reasoning: plain text (DeepSeek, Qwen, Moonshot, Novita, OpenRouter)
- reasoning_details: structured array (OpenRouter multi-turn continuity)
- codex_reasoning_items: encrypted blobs (OpenAI Codex Responses API)

Previously, all three existed in-memory during a single session but
were lost on gateway reload.
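
The intended round-trip can be sketched with a toy table (a minimal sketch assuming a simplified schema, not the project's actual hermes_state.py):

```python
import json
import sqlite3

# Simplified sketch of the v6 idea: persist reasoning alongside each
# assistant message so it survives a "reload" (a fresh read of the DB).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (role TEXT, content TEXT, "
    "reasoning TEXT, reasoning_details TEXT)"
)

def append_message(role, content, reasoning=None, reasoning_details=None):
    # Structured fields are JSON-serialized for storage; plain text as-is.
    conn.execute(
        "INSERT INTO messages VALUES (?, ?, ?, ?)",
        (role, content, reasoning,
         json.dumps(reasoning_details) if reasoning_details else None),
    )

def load_conversation():
    msgs = []
    for role, content, reasoning, details in conn.execute(
        "SELECT * FROM messages ORDER BY rowid"
    ):
        msg = {"role": role, "content": content}
        # Only assistant messages carry reasoning fields back out.
        if role == "assistant":
            if reasoning:
                msg["reasoning"] = reasoning
            if details:
                msg["reasoning_details"] = json.loads(details)
        msgs.append(msg)
    return msgs

append_message("user", "hi")
append_message("assistant", "hello", reasoning="greet the user",
               reasoning_details=[{"type": "reasoning.summary", "summary": "greet"}])
conv = load_conversation()
```

The restore step is the key: without re-attaching the stored fields on read, the data would survive in the DB but never reach the provider again.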

Changes:
- hermes_state.py: schema v6 migration, append_message() accepts all
  three fields, get_messages_as_conversation() restores them on
  assistant messages
- run_agent.py: _flush_messages_to_session_db() passes all reasoning
  fields through for assistant messages
- gateway/run.py: agent_history builder preserves reasoning fields
  on non-tool-calling assistant messages
- gateway/session.py: append_to_transcript() and rewrite_transcript()
  pass all reasoning fields to the DB
- Tests: 5 new tests for round-trip persistence

Verified against:
- OpenAI Codex direct (codex_reasoning_items round-trip: 868 enc chars)
- OpenRouter -> Anthropic, Google, DeepSeek, Meta, Qwen, Mistral
- Anthropic adapter (strips extra fields by construction)
- Codex Responses API path (replays codex_reasoning_items correctly)
2026-03-25 09:21:51 -07:00
Teknium
0f3c191ef1 fix(cli): enhance real-time reasoning output by forcing flush of long partial lines
Updated the reasoning output mechanism to emit complete lines and to force-flush long partial lines, so reasoning stays visible in real time even when the model emits no newlines. This improves responsiveness during long reasoning sessions.
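
The buffering rule can be sketched as a pure function (illustrative logic only; the real cli.py prints via `_cprint` inside the class):

```python
def feed(buf: str, text: str, max_partial: int = 80):
    """Append streamed text to a buffer; return (lines_to_emit, new_buffer).

    Complete lines are always emitted; a long partial line is force-flushed
    so reasoning stays visible even when the model emits no newlines.
    """
    buf += text
    out = []
    # Emit every complete line first.
    while "\n" in buf:
        line, buf = buf.split("\n", 1)
        out.append(line)
    # Force-flush a long partial line rather than waiting for a newline.
    if len(buf) > max_partial:
        out.append(buf)
        buf = ""
    return out, buf

lines, buf = feed("", "thinking about step one\npartial")
# the short remainder "partial" stays buffered until more text arrives
```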
2026-03-24 19:56:30 -07:00
Teknium
7cdf4efe05 fix(skills): agent-created skills were incorrectly treated as untrusted community content
_resolve_trust_level() didn't handle 'agent-created' source, so it
fell through to the 'community' trust level. Community policy blocks on
any caution or dangerous findings, which meant common patterns (curl with
env vars, systemctl, crontab, cloudflared references, etc.) would block
skill creation/patching.

The agent-created policy row already existed in INSTALL_POLICY with
permissive settings (allow caution, ask on dangerous) but was never
reached. Now it is.

Fixes reports of skill_manage being blocked by security scanner.
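
The fallthrough bug and its fix can be illustrated with a toy resolver and policy table (names mirror the commit, but the policy values here are assumptions):

```python
# Toy policy table: community blocks on any caution finding,
# agent-created is permissive (allow caution, ask on dangerous).
INSTALL_POLICY = {
    "builtin": {"allow_caution": True, "allow_dangerous": True},
    "agent-created": {"allow_caution": True, "allow_dangerous": "ask"},
    "community": {"allow_caution": False, "allow_dangerous": False},
}

def resolve_trust_level(source: str) -> str:
    # The fix: handle 'agent-created' explicitly so it no longer
    # falls through to the restrictive 'community' default.
    if source == "agent-created":
        return "agent-created"
    if source == "official" or source.startswith("official/"):
        return "builtin"
    return "community"  # everything else stays untrusted

def scan_allows(source: str, finding: str):
    policy = INSTALL_POLICY[resolve_trust_level(source)]
    return policy["allow_caution"] if finding == "caution" else policy["allow_dangerous"]
```

Before the fix, `resolve_trust_level("agent-created")` returned `"community"`, so the permissive policy row was dead code.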
2026-03-24 19:56:30 -07:00
Teknium
adee8d1b5f fix: browser_vision ignores auxiliary.vision.timeout config (#2901)
* docs: unify hooks documentation — add plugin hooks to hooks page, add session:end event

The hooks page only documented gateway event hooks (HOOK.yaml system).
The plugins page listed plugin hooks (pre_tool_call, etc.) that weren't
referenced from the hooks page, which was confusing.

Changes:
- hooks.md: Add overview table showing both hook systems
- hooks.md: Add Plugin Hooks section with available hooks, callback
  signatures, and example
- hooks.md: Add missing session:end gateway event (emitted but undocumented)
- hooks.md: Mark pre_llm_call, post_llm_call, on_session_start,
  on_session_end as planned (defined in VALID_HOOKS but not yet invoked)
- hooks.md: Update info box to cross-reference plugin hooks
- hooks.md: Fix heading hierarchy (gateway content as subsections)
- plugins.md: Add cross-reference to hooks page for full details
- plugins.md: Mark planned hooks as (planned)

* fix: browser_vision ignores auxiliary.vision.timeout config

browser_vision called call_llm() without passing a timeout parameter,
so it always used the 30-second default in auxiliary_client.py. This
made vision analysis with local models (llama.cpp, ollama) impossible
since they typically need more than 30s for screenshot analysis.

Now browser_vision reads auxiliary.vision.timeout from config.yaml
(same config key that vision_analyze already uses) and passes it
through to call_llm().
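
Assuming the commit's description of the config key, an override would look roughly like this in config.yaml (the 300s value is an arbitrary example):

```yaml
auxiliary:
  vision:
    timeout: 300   # seconds; raise for slow local vision models
```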

Also bumped the default vision timeout from 30s to 120s in both
browser_vision and vision_analyze — 30s is too aggressive for local
models and the previous default silently failed for anyone running
vision locally.

Fixes user report from GamerGB1988.
2026-03-24 19:56:30 -07:00
Teknium
f5b84dddfd fix(compression): restore sane defaults and cap summary at 12K tokens
- threshold: 0.80 → 0.50 (compress at 50%, not 80%)
- target_ratio: 0.40 → 0.20, now relative to threshold not total context
  (20% of 50% = 10% of context as tail budget)
- summary ceiling: 32K → 12K (Gemini can't output more than ~12K)
- Updated DEFAULT_CONFIG, config display, example config, and tests
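
The arithmetic behind the new defaults, as a worked sketch (not the actual compressor code):

```python
context_length = 200_000   # e.g. a 200K-context model
threshold = 0.50           # compress once usage hits 50% of context
target_ratio = 0.20        # fraction of the THRESHOLD, not total context

threshold_tokens = int(context_length * threshold)     # 100_000
tail_budget = int(threshold_tokens * target_ratio)     # 20_000 = 10% of context
summary_cap = min(int(context_length * 0.05), 12_000)  # 5% of context, 12K ceiling
```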
2026-03-24 19:56:30 -07:00
Teknium
4549a2f51a docs: clarify two-mode behavior in session_search schema description 2026-03-24 19:56:30 -07:00
Teknium
466720c2f3 feat(session_search): add recent sessions mode when query is omitted
When session_search is called without a query (or with an empty query),
it now returns metadata for the most recent sessions instead of erroring.
This lets the agent quickly see what was worked on recently without
needing specific keywords.

Returns for each session: session_id, title, source, started_at,
last_active, message_count, preview (first user message).
Zero LLM cost — pure DB query. Current session lineage and child
delegation sessions are excluded.

The agent can then keyword-search specific sessions if it needs
deeper context from any of them.
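
The two modes can be sketched as a simple dispatch (a hypothetical simplification; `FakeDB` and `list_recent` are illustrative stand-ins, not the real DB API):

```python
import json

def session_search(query: str = "", limit: int = 3, db=None) -> str:
    # Mode 1: no query -> cheap DB-only listing of recent sessions.
    if not query or not query.strip():
        recent = db.list_recent(limit)  # assumed helper: pure DB query, no LLM
        return json.dumps({"success": True, "mode": "recent", "results": recent})
    # Mode 2: keyword search -> FTS lookup plus LLM summaries (elided here).
    return json.dumps({"success": True, "mode": "search", "query": query.strip()})

class FakeDB:
    def list_recent(self, limit):
        return [{"session_id": "s1", "title": "demo", "message_count": 4}][:limit]

out = json.loads(session_search(db=FakeDB()))
```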
2026-03-24 19:56:30 -07:00
Teknium
fccd7a2ab4 docs: unify hooks documentation — add plugin hooks to hooks page, add session:end event
The hooks page only documented gateway event hooks (HOOK.yaml system).
The plugins page listed plugin hooks (pre_tool_call, etc.) that weren't
referenced from the hooks page, which was confusing.

Changes:
- hooks.md: Add overview table showing both hook systems
- hooks.md: Add Plugin Hooks section with available hooks, callback
  signatures, and example
- hooks.md: Add missing session:end gateway event (emitted but undocumented)
- hooks.md: Mark pre_llm_call, post_llm_call, on_session_start,
  on_session_end as planned (defined in VALID_HOOKS but not yet invoked)
- hooks.md: Update info box to cross-reference plugin hooks
- hooks.md: Fix heading hierarchy (gateway content as subsections)
- plugins.md: Add cross-reference to hooks page for full details
- plugins.md: Mark planned hooks as (planned)
2026-03-24 18:34:14 -07:00
16 changed files with 374 additions and 65 deletions

View File

@@ -40,7 +40,7 @@ _MIN_SUMMARY_TOKENS = 2000
 # Proportion of compressed content to allocate for summary
 _SUMMARY_RATIO = 0.20
 # Absolute ceiling for summary tokens (even on very large context windows)
-_SUMMARY_TOKENS_CEILING = 32_000
+_SUMMARY_TOKENS_CEILING = 12_000
 # Placeholder used when pruning old tool results
 _PRUNED_TOOL_PLACEHOLDER = "[Old tool output cleared to save context space]"
@@ -63,10 +63,10 @@ class ContextCompressor:
     def __init__(
         self,
         model: str,
-        threshold_percent: float = 0.80,
+        threshold_percent: float = 0.50,
         protect_first_n: int = 3,
         protect_last_n: int = 20,
-        summary_target_ratio: float = 0.40,
+        summary_target_ratio: float = 0.20,
         quiet_mode: bool = False,
         summary_model_override: str = None,
         base_url: str = "",
@@ -92,8 +92,8 @@ class ContextCompressor:
         self.threshold_tokens = int(self.context_length * threshold_percent)
         self.compression_count = 0
-        # Derive token budgets from the target ratio and context length
-        target_tokens = int(self.context_length * self.summary_target_ratio)
+        # Derive token budgets: ratio is relative to the threshold, not total context
+        target_tokens = int(self.threshold_tokens * self.summary_target_ratio)
         self.tail_token_budget = target_tokens
         self.max_summary_tokens = min(
             int(self.context_length * 0.05), _SUMMARY_TOKENS_CEILING,

View File

@@ -236,23 +236,24 @@ browser:
 # 5. Summarizes middle turns using a fast/cheap model
 # 6. Inserts summary as a user message, continues conversation seamlessly
 #
-# Post-compression size scales with the model's context window via target_ratio:
-#   MiniMax 200K context → ~80K post-compression (at 0.40 ratio)
-#   GPT-5 1M context → ~400K post-compression (at 0.40 ratio)
+# Post-compression tail budget is target_ratio × threshold × context_length:
+#   200K context, threshold 0.50, ratio 0.20 → 20K tokens of recent tail preserved
+#   1M context, threshold 0.50, ratio 0.20 → 100K tokens of recent tail preserved
 #
 compression:
   # Enable automatic context compression (default: true)
   # Set to false if you prefer to manage context manually or want errors on overflow
   enabled: true
-  # Trigger compression at this % of model's context limit (default: 0.80 = 80%)
+  # Trigger compression at this % of model's context limit (default: 0.50 = 50%)
   # Lower values = more aggressive compression, higher values = compress later
-  threshold: 0.80
+  threshold: 0.50
-  # Target post-compression size as a fraction of context window (default: 0.40 = 40%)
-  # Controls how much context survives compression. Tail token budget and summary
-  # cap scale with this value. Range: 0.10 - 0.80
-  target_ratio: 0.40
+  # Fraction of the threshold to preserve as recent tail (default: 0.20 = 20%)
+  # e.g. 20% of 50% threshold = 10% of total context kept as recent messages.
+  # Summary output is separately capped at 12K tokens (Gemini output limit).
+  # Range: 0.10 - 0.80
+  target_ratio: 0.20
   # Number of most-recent messages to always preserve (default: 20 ≈ 10 full turns)
   # Higher values keep more recent conversation intact at the cost of more aggressive

cli.py (6 changed lines)
View File

@@ -1509,10 +1509,14 @@ class HermesCLI:
         self._reasoning_buf = getattr(self, "_reasoning_buf", "") + text
-        # Emit complete lines
+        # Emit complete lines, and force-flush long partial lines so
+        # reasoning is visible in real-time even without newlines.
         while "\n" in self._reasoning_buf:
             line, self._reasoning_buf = self._reasoning_buf.split("\n", 1)
             _cprint(f"{_DIM}{line}{_RST}")
+        if len(self._reasoning_buf) > 80:
+            _cprint(f"{_DIM}{self._reasoning_buf}{_RST}")
+            self._reasoning_buf = ""

     def _close_reasoning_box(self) -> None:
         """Close the live reasoning box if it's open."""

View File

@@ -5288,7 +5288,18 @@ class GatewayRunner:
                 if msg.get("mirror"):
                     mirror_src = msg.get("mirror_source", "another session")
                     content = f"[Delivered from {mirror_src}] {content}"
-                agent_history.append({"role": role, "content": content})
+                entry = {"role": role, "content": content}
+                # Preserve reasoning fields on assistant messages so
+                # multi-turn reasoning context survives session reload.
+                # The agent's _build_api_kwargs converts these to the
+                # provider-specific format (reasoning_content, etc.).
+                if role == "assistant":
+                    for _rkey in ("reasoning", "reasoning_details",
+                                  "codex_reasoning_items"):
+                        _rval = msg.get(_rkey)
+                        if _rval:
+                            entry[_rkey] = _rval
+                agent_history.append(entry)
             # Collect MEDIA paths already in history so we can exclude them
             # from the current turn's extraction. This is compression-safe:

View File

@@ -891,13 +891,17 @@ class SessionStore:
         # Write to SQLite (unless the agent already handled it)
         if self._db and not skip_db:
             try:
+                _role = message.get("role", "unknown")
                 self._db.append_message(
                     session_id=session_id,
-                    role=message.get("role", "unknown"),
+                    role=_role,
                     content=message.get("content"),
                     tool_name=message.get("tool_name"),
                     tool_calls=message.get("tool_calls"),
                     tool_call_id=message.get("tool_call_id"),
+                    reasoning=message.get("reasoning") if _role == "assistant" else None,
+                    reasoning_details=message.get("reasoning_details") if _role == "assistant" else None,
+                    codex_reasoning_items=message.get("codex_reasoning_items") if _role == "assistant" else None,
                 )
             except Exception as e:
                 logger.debug("Session DB operation failed: %s", e)
@@ -918,13 +922,17 @@ class SessionStore:
             try:
                 self._db.clear_messages(session_id)
                 for msg in messages:
+                    _role = msg.get("role", "unknown")
                     self._db.append_message(
                         session_id=session_id,
-                        role=msg.get("role", "unknown"),
+                        role=_role,
                         content=msg.get("content"),
                         tool_name=msg.get("tool_name"),
                         tool_calls=msg.get("tool_calls"),
                         tool_call_id=msg.get("tool_call_id"),
+                        reasoning=msg.get("reasoning") if _role == "assistant" else None,
+                        reasoning_details=msg.get("reasoning_details") if _role == "assistant" else None,
+                        codex_reasoning_items=msg.get("codex_reasoning_items") if _role == "assistant" else None,
                     )
             except Exception as e:
                 logger.debug("Failed to rewrite transcript in DB: %s", e)

View File

@@ -163,8 +163,8 @@ DEFAULT_CONFIG = {
     "compression": {
         "enabled": True,
-        "threshold": 0.80,     # compress when context usage exceeds this ratio
-        "target_ratio": 0.40,  # fraction of context to preserve as recent tail
+        "threshold": 0.50,     # compress when context usage exceeds this ratio
+        "target_ratio": 0.20,  # fraction of threshold to preserve as recent tail
         "protect_last_n": 20,  # minimum recent messages to keep uncompressed
         "summary_model": "",   # empty = use main configured model
         "summary_provider": "auto",
@@ -1686,8 +1686,8 @@ def show_config():
     enabled = compression.get('enabled', True)
     print(f"  Enabled: {'yes' if enabled else 'no'}")
     if enabled:
-        print(f"  Threshold: {compression.get('threshold', 0.80) * 100:.0f}%")
-        print(f"  Target ratio: {compression.get('target_ratio', 0.40) * 100:.0f}% of context preserved")
+        print(f"  Threshold: {compression.get('threshold', 0.50) * 100:.0f}%")
+        print(f"  Target ratio: {compression.get('target_ratio', 0.20) * 100:.0f}% of threshold preserved")
         print(f"  Protect last: {compression.get('protect_last_n', 20)} messages")
         _sm = compression.get('summary_model', '') or '(main model)'
         print(f"  Model: {_sm}")

View File

@@ -26,7 +26,7 @@ from typing import Dict, Any, List, Optional
 DEFAULT_DB_PATH = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes")) / "state.db"

-SCHEMA_VERSION = 5
+SCHEMA_VERSION = 6

 SCHEMA_SQL = """
 CREATE TABLE IF NOT EXISTS schema_version (
@@ -73,7 +73,10 @@ CREATE TABLE IF NOT EXISTS messages (
     tool_name TEXT,
     timestamp REAL NOT NULL,
     token_count INTEGER,
-    finish_reason TEXT
+    finish_reason TEXT,
+    reasoning TEXT,
+    reasoning_details TEXT,
+    codex_reasoning_items TEXT
 );

 CREATE INDEX IF NOT EXISTS idx_sessions_source ON sessions(source);
@@ -189,6 +192,25 @@ class SessionDB:
             except sqlite3.OperationalError:
                 pass
             cursor.execute("UPDATE schema_version SET version = 5")
+        if current_version < 6:
+            # v6: add reasoning columns to messages table — preserves assistant
+            # reasoning text and structured reasoning_details across gateway
+            # session turns. Without these, reasoning chains are lost on
+            # session reload, breaking multi-turn reasoning continuity for
+            # providers that replay reasoning (OpenRouter, OpenAI, Nous).
+            for col_name, col_type in [
+                ("reasoning", "TEXT"),
+                ("reasoning_details", "TEXT"),
+                ("codex_reasoning_items", "TEXT"),
+            ]:
+                try:
+                    safe = col_name.replace('"', '""')
+                    cursor.execute(
+                        f'ALTER TABLE messages ADD COLUMN "{safe}" {col_type}'
+                    )
+                except sqlite3.OperationalError:
+                    pass  # Column already exists
+            cursor.execute("UPDATE schema_version SET version = 6")
         # Unique title index — always ensure it exists (safe to run after migrations
         # since the title column is guaranteed to exist at this point)
@@ -587,6 +609,9 @@ class SessionDB:
         tool_call_id: str = None,
         token_count: int = None,
         finish_reason: str = None,
+        reasoning: str = None,
+        reasoning_details: Any = None,
+        codex_reasoning_items: Any = None,
     ) -> int:
         """
         Append a message to a session. Returns the message row ID.
@@ -595,10 +620,20 @@ class SessionDB:
         if role is 'tool' or tool_calls is present).
         """
         with self._lock:
+            # Serialize structured fields to JSON for storage
+            reasoning_details_json = (
+                json.dumps(reasoning_details)
+                if reasoning_details else None
+            )
+            codex_items_json = (
+                json.dumps(codex_reasoning_items)
+                if codex_reasoning_items else None
+            )
             cursor = self._conn.execute(
-                """INSERT INTO messages (session_id, role, content, tool_call_id,
-                tool_calls, tool_name, timestamp, token_count, finish_reason)
-                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+                """INSERT INTO messages (session_id, role, content, tool_call_id,
+                tool_calls, tool_name, timestamp, token_count, finish_reason,
+                reasoning, reasoning_details, codex_reasoning_items)
+                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
                 (
                     session_id,
                     role,
@@ -609,6 +644,9 @@ class SessionDB:
                     time.time(),
                     token_count,
                     finish_reason,
+                    reasoning,
+                    reasoning_details_json,
+                    codex_items_json,
                 ),
             )
             msg_id = cursor.lastrowid
@@ -660,7 +698,8 @@ class SessionDB:
         """
         with self._lock:
             cursor = self._conn.execute(
-                "SELECT role, content, tool_call_id, tool_calls, tool_name "
+                "SELECT role, content, tool_call_id, tool_calls, tool_name, "
+                "reasoning, reasoning_details, codex_reasoning_items "
                 "FROM messages WHERE session_id = ? ORDER BY timestamp, id",
                 (session_id,),
             )
@@ -677,6 +716,22 @@ class SessionDB:
                     msg["tool_calls"] = json.loads(row["tool_calls"])
                 except (json.JSONDecodeError, TypeError):
                     pass
+                # Restore reasoning fields on assistant messages so providers
+                # that replay reasoning (OpenRouter, OpenAI, Nous) receive
+                # coherent multi-turn reasoning context.
+                if row["role"] == "assistant":
+                    if row["reasoning"]:
+                        msg["reasoning"] = row["reasoning"]
+                    if row["reasoning_details"]:
+                        try:
+                            msg["reasoning_details"] = json.loads(row["reasoning_details"])
+                        except (json.JSONDecodeError, TypeError):
+                            pass
+                    if row["codex_reasoning_items"]:
+                        try:
+                            msg["codex_reasoning_items"] = json.loads(row["codex_reasoning_items"])
+                        except (json.JSONDecodeError, TypeError):
+                            pass
                 messages.append(msg)
             return messages

View File

@@ -1009,10 +1009,10 @@ class AIAgent:
         _compression_cfg = _agent_cfg.get("compression", {})
         if not isinstance(_compression_cfg, dict):
             _compression_cfg = {}
-        compression_threshold = float(_compression_cfg.get("threshold", 0.80))
+        compression_threshold = float(_compression_cfg.get("threshold", 0.50))
         compression_enabled = str(_compression_cfg.get("enabled", True)).lower() in ("true", "1", "yes")
         compression_summary_model = _compression_cfg.get("summary_model") or None
-        compression_target_ratio = float(_compression_cfg.get("target_ratio", 0.40))
+        compression_target_ratio = float(_compression_cfg.get("target_ratio", 0.20))
         compression_protect_last = int(_compression_cfg.get("protect_last_n", 20))
         # Read explicit context_length override from model config
@@ -1540,6 +1540,9 @@ class AIAgent:
                     tool_calls=tool_calls_data,
                     tool_call_id=msg.get("tool_call_id"),
                     finish_reason=msg.get("finish_reason"),
+                    reasoning=msg.get("reasoning") if role == "assistant" else None,
+                    reasoning_details=msg.get("reasoning_details") if role == "assistant" else None,
+                    codex_reasoning_items=msg.get("codex_reasoning_items") if role == "assistant" else None,
                 )
             self._last_flushed_db_idx = len(messages)
         except Exception as e:

View File

@@ -519,24 +519,26 @@ class TestSummaryTargetRatio:
     """Verify that summary_target_ratio properly scales budgets with context window."""

     def test_tail_budget_scales_with_context(self):
-        """Tail token budget should be context_length * summary_target_ratio."""
+        """Tail token budget should be threshold_tokens * summary_target_ratio."""
         with patch("agent.context_compressor.get_model_context_length", return_value=200_000):
             c = ContextCompressor(model="test", quiet_mode=True, summary_target_ratio=0.40)
-            assert c.tail_token_budget == 80_000
+            # 200K * 0.50 threshold * 0.40 ratio = 40K
+            assert c.tail_token_budget == 40_000
         with patch("agent.context_compressor.get_model_context_length", return_value=1_000_000):
             c = ContextCompressor(model="test", quiet_mode=True, summary_target_ratio=0.40)
-            assert c.tail_token_budget == 400_000
+            # 1M * 0.50 threshold * 0.40 ratio = 200K
+            assert c.tail_token_budget == 200_000

     def test_summary_cap_scales_with_context(self):
-        """Max summary tokens should be 5% of context, capped at 32K."""
+        """Max summary tokens should be 5% of context, capped at 12K."""
         with patch("agent.context_compressor.get_model_context_length", return_value=200_000):
             c = ContextCompressor(model="test", quiet_mode=True)
             assert c.max_summary_tokens == 10_000  # 200K * 0.05
         with patch("agent.context_compressor.get_model_context_length", return_value=1_000_000):
             c = ContextCompressor(model="test", quiet_mode=True)
-            assert c.max_summary_tokens == 32_000  # capped at ceiling
+            assert c.max_summary_tokens == 12_000  # capped at 12K ceiling

     def test_ratio_clamped(self):
         """Ratio should be clamped to [0.10, 0.80]."""
@@ -548,12 +550,12 @@ class TestSummaryTargetRatio:
             c = ContextCompressor(model="test", quiet_mode=True, summary_target_ratio=0.95)
             assert c.summary_target_ratio == 0.80

-    def test_default_threshold_is_80_percent(self):
-        """Default compression threshold should be 80%."""
+    def test_default_threshold_is_50_percent(self):
+        """Default compression threshold should be 50%."""
         with patch("agent.context_compressor.get_model_context_length", return_value=100_000):
             c = ContextCompressor(model="test", quiet_mode=True)
-            assert c.threshold_percent == 0.80
-            assert c.threshold_tokens == 80_000
+            assert c.threshold_percent == 0.50
+            assert c.threshold_tokens == 50_000

     def test_default_protect_last_n_is_20(self):
         """Default protect_last_n should be 20."""

View File

@@ -177,6 +177,91 @@ class TestMessageStorage:
         messages = db.get_messages("s1")
         assert messages[0]["finish_reason"] == "stop"

+    def test_reasoning_persisted_and_restored(self, db):
+        """Reasoning text is stored for assistant messages and restored by
+        get_messages_as_conversation() so providers receive coherent multi-turn
+        reasoning context."""
+        db.create_session(session_id="s1", source="telegram")
+        db.append_message("s1", role="user", content="create a cron job")
+        db.append_message(
+            "s1",
+            role="assistant",
+            content=None,
+            tool_calls=[{"function": {"name": "cronjob", "arguments": "{}"}, "id": "c1", "type": "function"}],
+            reasoning="I should call the cronjob tool to schedule this.",
+        )
+        db.append_message("s1", role="tool", content='{"job_id": "abc"}', tool_call_id="c1")
+        conv = db.get_messages_as_conversation("s1")
+        assert len(conv) == 3
+        # reasoning must be present on the assistant message
+        assistant = conv[1]
+        assert assistant["role"] == "assistant"
+        assert assistant.get("reasoning") == "I should call the cronjob tool to schedule this."
+        # user and tool messages must NOT carry reasoning
+        assert "reasoning" not in conv[0]
+        assert "reasoning" not in conv[2]
+
+    def test_reasoning_details_persisted_and_restored(self, db):
+        """reasoning_details (structured array) is round-tripped through JSON
+        serialization in the DB."""
+        db.create_session(session_id="s1", source="telegram")
+        details = [
+            {"type": "reasoning.summary", "summary": "Thinking about tools"},
+            {"type": "reasoning.encrypted_content", "encrypted_content": "abc123"},
+        ]
+        db.append_message(
+            "s1",
+            role="assistant",
+            content="Hello",
+            reasoning="Thinking about what to say",
+            reasoning_details=details,
+        )
+        conv = db.get_messages_as_conversation("s1")
+        assert len(conv) == 1
+        msg = conv[0]
+        assert msg["reasoning"] == "Thinking about what to say"
+        assert msg["reasoning_details"] == details
+
+    def test_reasoning_not_set_for_non_assistant(self, db):
+        """reasoning is never leaked onto user or tool messages."""
+        db.create_session(session_id="s1", source="telegram")
+        db.append_message("s1", role="user", content="hi")
+        db.append_message("s1", role="assistant", content="hello", reasoning=None)
+        conv = db.get_messages_as_conversation("s1")
+        assert "reasoning" not in conv[0]
+        assert "reasoning" not in conv[1]
+
+    def test_reasoning_empty_string_not_restored(self, db):
+        """Empty string reasoning is treated as absent."""
+        db.create_session(session_id="s1", source="cli")
+        db.append_message("s1", role="assistant", content="hi", reasoning="")
+        conv = db.get_messages_as_conversation("s1")
+        assert "reasoning" not in conv[0]
+
+    def test_codex_reasoning_items_persisted_and_restored(self, db):
+        """codex_reasoning_items (encrypted blobs for Codex Responses API) are
+        round-tripped through JSON serialization in the DB."""
+        db.create_session(session_id="s1", source="cli")
+        codex_items = [
+            {"type": "reasoning", "id": "rs_abc", "encrypted_content": "enc_blob_123"},
+            {"type": "reasoning", "id": "rs_def", "encrypted_content": "enc_blob_456"},
+        ]
+        db.append_message(
+            "s1",
+            role="assistant",
+            content="Done",
+            codex_reasoning_items=codex_items,
+        )
+        conv = db.get_messages_as_conversation("s1")
+        assert len(conv) == 1
+        assert conv[0]["codex_reasoning_items"] == codex_items
+        assert conv[0]["codex_reasoning_items"][0]["encrypted_content"] == "enc_blob_123"

 # =========================================================================
 # FTS5 search
@@ -737,7 +822,7 @@ class TestSchemaInit:
     def test_schema_version(self, db):
         cursor = db._conn.execute("SELECT version FROM schema_version")
         version = cursor.fetchone()[0]
-        assert version == 5
+        assert version == 6

     def test_title_column_exists(self, db):
         """Verify the title column was created in the sessions table."""
@@ -793,12 +878,12 @@ class TestSchemaInit:
         conn.commit()
         conn.close()
-        # Open with SessionDB — should migrate to v5
+        # Open with SessionDB — should migrate to v6
         migrated_db = SessionDB(db_path=db_path)
         # Verify migration
         cursor = migrated_db._conn.execute("SELECT version FROM schema_version")
-        assert cursor.fetchone()[0] == 5
+        assert cursor.fetchone()[0] == 6
         # Verify title column exists and is NULL for existing sessions
         session = migrated_db.get_session("existing")

View File

@@ -1567,6 +1567,20 @@ def browser_vision(question: str, annotate: bool = False, task_id: Optional[str]
     vision_model = _get_vision_model()
     logger.debug("browser_vision: analysing screenshot (%d bytes)",
                  len(image_data))
+    # Read vision timeout from config (auxiliary.vision.timeout), default 120s.
+    # Local vision models (llama.cpp, ollama) can take well over 30s for
+    # screenshot analysis, so the default must be generous.
+    vision_timeout = 120.0
+    try:
+        from hermes_cli.config import load_config
+        _cfg = load_config()
+        _vt = _cfg.get("auxiliary", {}).get("vision", {}).get("timeout")
+        if _vt is not None:
+            vision_timeout = float(_vt)
+    except Exception:
+        pass
     call_kwargs = {
         "task": "vision",
         "messages": [
@@ -1580,6 +1594,7 @@ def browser_vision(question: str, annotate: bool = False, task_id: Optional[str]
         ],
         "max_tokens": 2000,
         "temperature": 0.1,
+        "timeout": vision_timeout,
     }
     if vision_model:
         call_kwargs["model"] = vision_model

View File

@@ -179,6 +179,58 @@ async def _summarize_session(
     return None

+def _list_recent_sessions(db, limit: int, current_session_id: str = None) -> str:
+    """Return metadata for the most recent sessions (no LLM calls)."""
+    try:
+        sessions = db.list_sessions_rich(limit=limit + 5)  # fetch extra to skip current
+        # Resolve current session lineage to exclude it
+        current_root = None
+        if current_session_id:
+            try:
+                sid = current_session_id
+                visited = set()
+                while sid and sid not in visited:
+                    visited.add(sid)
+                    s = db.get_session(sid)
+                    parent = s.get("parent_session_id") if s else None
+                    sid = parent if parent else None
+                current_root = max(visited, key=len) if visited else current_session_id
+            except Exception:
+                current_root = current_session_id
+        results = []
+        for s in sessions:
+            sid = s.get("id", "")
+            if current_root and (sid == current_root or sid == current_session_id):
+                continue
+            # Skip child/delegation sessions (they have parent_session_id)
+            if s.get("parent_session_id"):
+                continue
+            results.append({
+                "session_id": sid,
+                "title": s.get("title") or None,
+                "source": s.get("source", ""),
+                "started_at": s.get("started_at", ""),
+                "last_active": s.get("last_active", ""),
+                "message_count": s.get("message_count", 0),
+                "preview": s.get("preview", ""),
+            })
+            if len(results) >= limit:
+                break
+        return json.dumps({
+            "success": True,
+            "mode": "recent",
+            "results": results,
+            "count": len(results),
+            "message": f"Showing {len(results)} most recent sessions. Use a keyword query to search specific topics.",
+        }, ensure_ascii=False)
+    except Exception as e:
+        logging.error("Error listing recent sessions: %s", e, exc_info=True)
+        return json.dumps({"success": False, "error": f"Failed to list recent sessions: {e}"}, ensure_ascii=False)

 def session_search(
     query: str,
     role_filter: str = None,
@@ -195,11 +247,14 @@ def session_search(
     if db is None:
         return json.dumps({"success": False, "error": "Session database not available."}, ensure_ascii=False)
-    limit = min(limit, 5)  # Cap at 5 sessions to avoid excessive LLM calls
+    # Recent sessions mode: when query is empty, return metadata for recent sessions.
+    # No LLM calls — just DB queries for titles, previews, timestamps.
     if not query or not query.strip():
-        return json.dumps({"success": False, "error": "Query cannot be empty."}, ensure_ascii=False)
+        return _list_recent_sessions(db, limit, current_session_id)
     query = query.strip()
+    limit = min(limit, 5)  # Cap at 5 sessions to avoid excessive LLM calls
     try:
         # Parse role filter
@@ -364,8 +419,14 @@ def check_session_search_requirements() -> bool:
 SESSION_SEARCH_SCHEMA = {
     "name": "session_search",
     "description": (
-        "Search your long-term memory of past conversations. This is your recall -- "
+        "Search your long-term memory of past conversations, or browse recent sessions. This is your recall -- "
         "every past session is searchable, and this tool summarizes what happened.\n\n"
+        "TWO MODES:\n"
+        "1. Recent sessions (no query): Call with no arguments to see what was worked on recently. "
+        "Returns titles, previews, and timestamps. Zero LLM cost, instant. "
+        "Start here when the user asks what were we working on or what did we do recently.\n"
+        "2. Keyword search (with query): Search for specific topics across all past sessions. "
+        "Returns LLM-generated summaries of matching sessions.\n\n"
         "USE THIS PROACTIVELY when:\n"
         "- The user says 'we did this before', 'remember when', 'last time', 'as I mentioned'\n"
         "- The user asks about a topic you worked on before but don't have in current context\n"
@@ -385,7 +446,7 @@ SESSION_SEARCH_SCHEMA = {
     "properties": {
         "query": {
             "type": "string",
-            "description": "Search query — keywords, phrases, or boolean expressions to find in past sessions.",
+            "description": "Search query — keywords, phrases, or boolean expressions to find in past sessions. Omit this parameter entirely to browse recent sessions instead (returns titles, previews, timestamps with no LLM cost).",
         },
         "role_filter": {
             "type": "string",
@@ -397,7 +458,7 @@ SESSION_SEARCH_SCHEMA = {
             "default": 3,
         },
     },
-    "required": ["query"],
+    "required": [],
     },
 }
@@ -410,7 +471,7 @@ registry.register(
     toolset="session_search",
     schema=SESSION_SEARCH_SCHEMA,
     handler=lambda args, **kw: session_search(
-        query=args.get("query", ""),
+        query=args.get("query") or "",
         role_filter=args.get("role_filter"),
         limit=args.get("limit", 3),
         db=kw.get("db"),

View File

@@ -1050,6 +1050,9 @@ def _get_configured_model() -> str:
 def _resolve_trust_level(source: str) -> str:
     """Map a source identifier to a trust level."""
+    # Agent-created skills get their own permissive trust level
+    if source == "agent-created":
+        return "agent-created"
     # Official optional skills shipped with the repo
     if source.startswith("official/") or source == "official":
         return "builtin"

View File

@@ -325,8 +325,9 @@ async def vision_analyze_tool(
     logger.info("Processing image with vision model...")
     # Call the vision API via centralized router.
-    # Read timeout from config.yaml (auxiliary.vision.timeout), default 30s.
-    vision_timeout = 30.0
+    # Read timeout from config.yaml (auxiliary.vision.timeout), default 120s.
+    # Local vision models (llama.cpp, ollama) can take well over 30s.
+    vision_timeout = 120.0
     try:
         from hermes_cli.config import load_config
         _cfg = load_config()

View File

@@ -6,9 +6,20 @@ description: "Run custom code at key lifecycle points — log activity, send ale
 # Event Hooks

-The hooks system lets you run custom code at key points in the agent lifecycle — session creation, slash commands, each tool-calling step, and more. Hooks fire automatically during gateway operation without blocking the main agent pipeline.
-
-## Creating a Hook
+Hermes has two hook systems that run custom code at key lifecycle points:
+
+| System | Registered via | Runs in | Use case |
+|--------|---------------|---------|----------|
+| **[Gateway hooks](#gateway-event-hooks)** | `HOOK.yaml` + `handler.py` in `~/.hermes/hooks/` | Gateway only | Logging, alerts, webhooks |
+| **[Plugin hooks](#plugin-hooks)** | `ctx.register_hook()` in a [plugin](/docs/user-guide/features/plugins) | CLI + Gateway | Tool interception, metrics, guardrails |
+
+Both systems are non-blocking — errors in any hook are caught and logged, never crashing the agent.
+
+## Gateway Event Hooks
+
+Gateway hooks fire automatically during gateway operation (Telegram, Discord, Slack, WhatsApp) without blocking the main agent pipeline.
+
+### Creating a Hook

 Each hook is a directory under `~/.hermes/hooks/` containing two files:
@@ -19,7 +30,7 @@ Each hook is a directory under `~/.hermes/hooks/` containing two files:
 └── handler.py # Python handler function
 ```

-### HOOK.yaml
+#### HOOK.yaml

 ```yaml
 name: my-hook
@@ -32,7 +43,7 @@ events:
 The `events` list determines which events trigger your handler. You can subscribe to any combination of events, including wildcards like `command:*`.

-### handler.py
+#### handler.py

 ```python
 import json
@@ -58,25 +69,26 @@ async def handle(event_type: str, context: dict):
 - Can be `async def` or regular `def` — both work
 - Errors are caught and logged, never crashing the agent

-## Available Events
+### Available Events

 | Event | When it fires | Context keys |
 |-------|---------------|--------------|
 | `gateway:startup` | Gateway process starts | `platforms` (list of active platform names) |
 | `session:start` | New messaging session created | `platform`, `user_id`, `session_id`, `session_key` |
 | `session:end` | Session ended (before reset) | `platform`, `user_id`, `session_key` |
 | `session:reset` | User ran `/new` or `/reset` | `platform`, `user_id`, `session_key` |
 | `agent:start` | Agent begins processing a message | `platform`, `user_id`, `session_id`, `message` |
 | `agent:step` | Each iteration of the tool-calling loop | `platform`, `user_id`, `session_id`, `iteration`, `tool_names` |
 | `agent:end` | Agent finishes processing | `platform`, `user_id`, `session_id`, `message`, `response` |
 | `command:*` | Any slash command executed | `platform`, `user_id`, `command`, `args` |

-### Wildcard Matching
+#### Wildcard Matching

 Handlers registered for `command:*` fire for any `command:` event (`command:model`, `command:reset`, etc.). Monitor all slash commands with a single subscription.

-## Examples
+### Examples

-### Telegram Alert on Long Tasks
+#### Telegram Alert on Long Tasks

 Send yourself a message when the agent takes more than 10 steps:
@@ -109,7 +121,7 @@ async def handle(event_type: str, context: dict):
     )
 ```

-### Command Usage Logger
+#### Command Usage Logger

 Track which slash commands are used:
@@ -142,7 +154,7 @@ def handle(event_type: str, context: dict):
         f.write(json.dumps(entry) + "\n")
 ```

-### Session Start Webhook
+#### Session Start Webhook

 POST to an external service on new sessions:
@@ -169,7 +181,7 @@ async def handle(event_type: str, context: dict):
     }, timeout=5)
 ```

-## How It Works
+### How It Works

 1. On gateway startup, `HookRegistry.discover_and_load()` scans `~/.hermes/hooks/`
 2. Each subdirectory with `HOOK.yaml` + `handler.py` is loaded dynamically
@@ -178,5 +190,51 @@ async def handle(event_type: str, context: dict):
 5. Errors in any handler are caught and logged — a broken hook never crashes the agent

 :::info
-Hooks only fire in the **gateway** (Telegram, Discord, Slack, WhatsApp). The CLI does not currently load hooks.
+Gateway hooks only fire in the **gateway** (Telegram, Discord, Slack, WhatsApp). The CLI does not load gateway hooks. For hooks that work everywhere, use [plugin hooks](#plugin-hooks).
 :::
+
+## Plugin Hooks
+
+[Plugins](/docs/user-guide/features/plugins) can register hooks that fire in **both CLI and gateway** sessions. These are registered programmatically via `ctx.register_hook()` in your plugin's `register()` function.
+
+```python
+def register(ctx):
+    ctx.register_hook("pre_tool_call", my_callback)
+    ctx.register_hook("post_tool_call", my_callback)
+```
+
+### Available Plugin Hooks
+
+| Hook | Fires when | Callback receives |
+|------|-----------|-------------------|
+| `pre_tool_call` | Before any tool executes | `tool_name`, `args`, `task_id` |
+| `post_tool_call` | After any tool returns | `tool_name`, `args`, `result`, `task_id` |
+| `pre_llm_call` | Before LLM API request | *(planned — not yet wired)* |
+| `post_llm_call` | After LLM API response | *(planned — not yet wired)* |
+| `on_session_start` | Session begins | *(planned — not yet wired)* |
+| `on_session_end` | Session ends | *(planned — not yet wired)* |
+
+Callbacks receive keyword arguments matching the columns above:
+
+```python
+def my_callback(**kwargs):
+    tool = kwargs["tool_name"]
+    args = kwargs["args"]
+    # ...
+```
+
+### Example: Block Dangerous Tools
+
+```python
+# ~/.hermes/plugins/tool-guard/__init__.py
+BLOCKED = {"terminal", "write_file"}
+
+def guard(**kwargs):
+    if kwargs["tool_name"] in BLOCKED:
+        print(f"⚠ Blocked tool call: {kwargs['tool_name']}")
+
+def register(ctx):
+    ctx.register_hook("pre_tool_call", guard)
+```
+
+See the **[Plugins guide](/docs/user-guide/features/plugins)** for full details on creating plugins.
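The `post_tool_call` hook documented in this hunk lends itself to a simple metrics logger, complementing the blocking example above. A sketch under the same callback signature (`tool_name`, `args`, `result`, `task_id`); the metrics file path is chosen here purely as an illustration:

```python
# Hypothetical plugin: append one JSONL metrics entry per tool call.
import json
import time
from pathlib import Path

METRICS_FILE = Path.home() / ".hermes" / "tool-metrics.jsonl"  # hypothetical path

def make_entry(**kwargs) -> dict:
    # post_tool_call callbacks receive tool_name, args, result, task_id
    return {
        "ts": time.time(),
        "tool": kwargs["tool_name"],
        "result_chars": len(str(kwargs.get("result", ""))),
    }

def record(**kwargs):
    METRICS_FILE.parent.mkdir(parents=True, exist_ok=True)
    with METRICS_FILE.open("a") as f:
        f.write(json.dumps(make_entry(**kwargs)) + "\n")

def register(ctx):
    ctx.register_hook("post_tool_call", record)
```

Because hook errors are caught and logged, a failed write here would not interrupt the agent.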


@@ -46,14 +46,16 @@ Project-local plugins under `./.hermes/plugins/` are disabled by default. Enable
 ## Available hooks

 Plugins can register callbacks for these lifecycle events. See the **[Event Hooks page](/docs/user-guide/features/hooks#plugin-hooks)** for full details, callback signatures, and examples.

 | Hook | Fires when |
 |------|-----------|
 | `pre_tool_call` | Before any tool executes |
 | `post_tool_call` | After any tool returns |
-| `pre_llm_call` | Before LLM API request |
-| `post_llm_call` | After LLM API response |
-| `on_session_start` | Session begins |
-| `on_session_end` | Session ends |
+| `pre_llm_call` | Before LLM API request *(planned)* |
+| `post_llm_call` | After LLM API response *(planned)* |
+| `on_session_start` | Session begins *(planned)* |
+| `on_session_end` | Session ends *(planned)* |

 ## Slash commands