mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-28 15:01:34 +08:00
When the streaming connection dropped AFTER user-visible text was delivered but a tool call was in flight, we stubbed the turn with a '⚠ Stream stalled mid tool-call; Ask me to retry' warning — costing an iteration and breaking the flow. Users report this happening increasingly often on long SSE streams through flaky provider routes. Fix: in the existing inner stream-retry loop, relax the deltas_were_sent short-circuit. If a tool call was in flight (partial_tool_names populated) AND the error is a transient connection error (timeout, RemoteProtocolError, SSE 'connection lost', etc.), silently retry instead of bailing out. Fire a brief 'Connection dropped mid tool-call; reconnecting…' marker so the user understands the preamble is about to be re-streamed. Researched how Claude Code (tombstone + non-streaming fallback), OpenCode (blind Effect.retry wrapping whole stream), and Clawdbot (4-way gate: stopReason==error + output==0 + !hadPotentialSideEffects) handle this. Chose the narrow Clawdbot-style gate: retry only when (a) a tool call was actually in flight (otherwise the existing stub-with-recovered-text is correct for pure-text stalls) and (b) the error is transient. Side-effect safety is automatic — no tool has been dispatched within this single API call yet. UX trade-off: user sees preamble text twice on retry (OpenCode-style). Strictly better than a lost action with a 'retry manually' message. If retries exhaust, falls through to the existing stub-with-warning path so the user isn't left with zero signal. Tests: 3 new tests in TestSilentRetryMidToolCall covering (1) silent retry recovers tool call; (2) exhausted retries fall back to stub; (3) text-only stalls don't trigger retry. 30/30 pass.