Same layering concern as the persisted-assistant scrub already removed:
_emit_interim_assistant_message and the final_response return path were
mutating model output broadly. Streaming scrubber covers real leaks
delta-by-delta; these post-stream scrubs were redundant.
Reviewer pushback on the original boundary-hardening commits — three
overreach points pulled plugin-specific policy into shared core paths:
1. gateway/run.py hardcoded a '## Honcho Context' literal split for
vision-LLM output. Plugin-format heading in framework code; could
truncate legitimate output naturally containing that header.
Drop the literal split; keep generic sanitize_context (the wrapper
strip is plugin-agnostic). Plugin-specific cleanup belongs at the
provider boundary, not the shared gateway path.
2. run_agent.run_conversation scrubbed user_message and
persist_user_message before the conversation loop. User text is
sacred — if a user types a literal <memory-context> tag we must
not silently delete it. The producer (build_memory_context_block)
is the only legitimate emitter; user input should never need the
reverse op.
3. _build_assistant_message scrubbed model output before persistence.
Same hazard: would silently mutate legitimate documentation/code
the model emits containing the literal markers. The streaming
scrubber catches real leaks delta-by-delta before content is
concatenated; persist-time scrub was redundant belt-and-suspenders.
4. _fire_stream_delta stripped leading newlines from every delta unless
a paragraph break flag was set. Mid-stream '\n' is legitimate
markdown — lists, code fences, paragraph breaks — and chunk
boundaries are arbitrary. Narrow lstrip to the very first delta
of the stream only (so stale provider preamble still gets cleaned
on turn start, but mid-stream formatting survives).
Plus: build_memory_context_block now logs a warning when its defensive
sanitize_context strips something — surfaces buggy providers returning
pre-wrapped text instead of silently double-fencing.
Net architectural change: scrub surface collapses from 8 sites to 3
(StreamingContextScrubber on output deltas, plugin→backend send,
build_memory_context_block input-validation). Plugin-specific strings
stay out of shared runtime paths. User input and persisted assistant
output are no longer mutated.
Tests: rescoped TestMemoryContextSanitization (helper-correctness only,
no source-inspection of removed call sites), updated vision tests to
drop '## Honcho Context' literal-split assertions, updated
_build_assistant_message persistence test to assert preservation.
Added: cross-turn scrubber reset, build_memory_context_block warn-on-
violation, mid-stream newline preservation (plain + code fence).
fixes#5719
The auxiliary vision LLM called by gateway._enrich_message_with_vision
can echo its injected Honcho system prompt back into the image
description. That description gets embedded verbatim into the enriched
user message, so recalled memory (personal facts, dialectic output)
surfaces into a user-visible bubble.
Strips both forms of leak before embedding:
- <memory-context>...</memory-context> fenced blocks (sanitize_context)
- trailing '## Honcho Context' sections (header + everything after)
Plus regression tests:
- tests/agent/test_streaming_context_scrubber.py — 13 tests on the
stateful scrubber (whole block, split tags, false-positive partial
tags, unterminated span, reset, case-insensitivity)
- tests/run_agent/test_run_agent_codex_responses.py — 2 new tests on
_fire_stream_delta covering the realistic 7-chunk leak scenario and
the cross-turn scrubber reset
- tests/gateway/test_vision_memory_leak.py — 4 tests covering the
vision auto-analysis boundary (clean pass-through, '## Honcho Context'
header, fenced block, both patterns together)
Thread a vision-request flag through auxiliary provider resolution so Copilot clients can include Copilot-Vision-Request only for vision tasks. This preserves normal text requests while ensuring Copilot vision payloads reach the vision-capable route.
Add regression coverage for Copilot vision routing and keep cached text and vision clients separate so a text client without the header is not reused for vision.
Co-authored-by: dhabibi <9087935+dhabibi@users.noreply.github.com>
PR #13734 fixed the concurrent-tool-executor vector (ThreadPoolExecutor
workers didn't inherit the CLI's TLS approval callback). Two vectors
remained that could still land in the deadlocking input() fallback:
1. _spawn_background_review spawns a raw threading.Thread with no
approval callback installed, so any dangerous-command guard the
review agent trips falls back to input() -> deadlock against the
parent's prompt_toolkit TUI (same class as delegate_task subagents,
fixed in 023b1bff1 / #15491). Install a _bg_review_auto_deny
callback at thread start, clear on finally.
2. prompt_dangerous_approval's fallback unconditionally spawned a
daemon thread calling input() when approval_callback was None.
That fallback can never succeed under prompt_toolkit because the
user's Enter goes to pt's raw-mode stdin capture. Detect an active
pt Application via get_app_or_none() and fail closed (deny + log)
instead, so future threads that forget to install a callback
degrade gracefully instead of hanging 60s invisibly.
Regression guards:
- tests/run_agent/test_background_review.py verifies the review
worker thread sees a callable auto-deny callback mid-run and that
the slot is cleared in the finally block.
- tests/tools/test_approval.py TestFailClosedUnderPromptToolkit
verifies prompt_dangerous_approval returns 'deny' fast under a
mocked pt Application, and that a real callback still wins over
the guard.
The background skill/memory review agent was created without toolset
restrictions, inheriting the full default tool set. This allowed it to
use terminal, send_message, delegate_task, and other tools outside its
intended scope, potentially performing unrelated side effects after
skill creation.
Restrict the review agent to only memory and skills toolsets by passing
enabled_toolsets=['memory', 'skills'] during AIAgent construction.
Fixes#15204
* feat(image-input): native multimodal routing based on model vision capability
Attach user-sent images as OpenAI-style content parts on the user turn when
the active model supports native vision, so vision-capable models see real
pixels instead of a lossy text description from vision_analyze.
Routing decision (agent/image_routing.py::decide_image_input_mode):
agent.image_input_mode = auto | native | text (default: auto)
In auto mode:
- If auxiliary.vision.provider/model is explicitly configured, keep the
text pipeline (user paid for a dedicated vision backend).
- Else if models.dev reports supports_vision=True for the active
provider/model, attach natively.
- Else fall back to text (current behaviour).
Call sites updated: gateway/run.py (all messaging platforms), tui_gateway
(dashboard/Ink), cli.py (interactive /attach + drag-drop).
run_agent.py changes:
- _prepare_anthropic_messages_for_api now passes image parts through
unchanged when the model supports vision — the Anthropic adapter
translates them to native image blocks. Previous behaviour
(vision_analyze → text) only runs for non-vision Anthropic models.
- New _prepare_messages_for_non_vision_model mirrors the same contract
for chat.completions and codex_responses paths, so non-vision models
on any provider get text-fallback instead of failing at the provider.
- New _model_supports_vision() helper reads models.dev caps.
vision_analyze description rewritten: positions it as a tool for images
NOT already visible in the conversation (URLs, tool output, deeper
inspection). Prevents the model from redundantly calling it on images
already attached natively.
Config default: agent.image_input_mode = auto.
Tests: 35 new (test_image_routing.py + test_vision_aware_preprocessing.py),
all existing tests that reference _prepare_anthropic_messages_for_api
still pass (198 targeted + new tests green).
* feat(image-input): size-cap + resize oversized images, charge image tokens in compressor
Two follow-ups that make the native image routing safer for long / heavy
sessions:
1) Oversize handling in build_native_content_parts:
- 20 MB ceiling per image (matches vision_tools._MAX_BASE64_BYTES,
the most restrictive provider — Gemini inline data).
- Delegates to vision_tools._resize_image_for_vision (Pillow-based,
already battle-tested) to downscale to 5 MB first-try.
- If Pillow is missing or resize still overshoots, the image is
dropped and reported back in skipped[]; caller falls back to text
enrichment for that image.
2) Image-token accounting in context_compressor:
- New _IMAGE_TOKEN_ESTIMATE = 1600 (matches Claude Code's constant;
within the realistic range for Anthropic/GPT-4o/Gemini billing).
- _content_length_for_budget() helper: sums text-part lengths and
charges _IMAGE_CHAR_EQUIVALENT (1600 * 4 chars) per image/image_url/
input_image part. Base64 payload inside image_url is NOT counted
as chars — dimensions don't matter, only image-presence.
- Both tail-cut sites (_prune_old_tool_results L527 and
_find_tail_cut_by_tokens L1126) now call the helper so multi-image
conversations don't slip past compression budget.
Tests: 9 new in test_image_routing.py (oversize triggers resize,
resize-fails-returns-None, oversize-skipped-reported), 11 new in
test_compressor_image_tokens.py (flat charge per image, multiple images,
Responses-API / Anthropic-native / OpenAI-chat shapes, no-inflation on
raw base64, bounds-check on the constant, integration test that an
image-heavy tail actually gets trimmed).
* fix(image-input): replace blanket 20MB ceiling with empirically-verified per-provider limits
The previous commit imposed a hardcoded 20 MB base64 ceiling on all
providers, triggering auto-resize on anything larger. This was wrong in
both directions:
* Too loose for Anthropic — actual limit is 5 MB (returns HTTP 400
'image exceeds 5 MB maximum' above that).
* Too strict for OpenAI / Codex / OpenRouter — accept 49 MB+ without
complaint (empirically verified April 2026 with progressive PNG
sizes).
New behaviour:
* _PROVIDER_BASE64_CEILING table: only anthropic and bedrock have a
ceiling (5 MB, since bedrock-on-Claude shares Anthropic's decoder).
* Providers NOT in the table get no ceiling — images attach at native
size and we trust the provider to return its own error if it
disagrees. A provider-specific 400 message is clearer than us
guessing wrong and silently degrading image quality.
* build_native_content_parts() gains a keyword-only provider arg;
gateway/CLI/TUI pass the active provider so Anthropic users get
auto-resize protection while OpenAI users don't pay it.
* Resize target dropped from 5 MB to 4 MB to slide safely under
Anthropic's boundary with header overhead.
Empirical measurements (direct API, no Hermes in the loop):
image b64 anthropic openrouter/gpt5.5 codex-oauth/gpt5.5
0.19 MB ✓ ✓ ✓
12.37 MB ✗ 400 5MB ✓ ✓
23.85 MB ✗ 400 5MB ✓ ✓
49.46 MB ✗ 413 ✓ ✓
Tests: rewrote TestOversizeHandling (5 tests): no-ceiling pass-through,
Anthropic resize fires, Anthropic skip on resize-fail, build_native_parts
routes ceiling by provider, unknown provider gets no ceiling. All 52
targeted tests pass.
* refactor(image-input): attempt native, shrink-and-retry on provider reject
Replace proactive per-provider size ceilings with a reactive shrink path
on the provider's actual rejection. All providers now attempt native
full-size attachment first; if the provider returns an image-too-large
error, the agent silently shrinks and retries once.
Why the previous design was wrong: hardcoding provider ceilings
(anthropic=5MB, others=unlimited) meant OpenAI users on a 10MB image
paid no tax, but Anthropic users lost quality on anything >5MB even
though the empirical behaviour at provider-reject time is the same
(shrink + retry). Baking the table into the routing layer also
requires updating Hermes every time a provider's limit changes.
Reactive design:
- image_routing.py: _file_to_data_url encodes native size, no ceiling.
build_native_content_parts drops its provider kwarg.
- error_classifier.py: new FailoverReason.image_too_large + pattern
match ("image exceeds", "image too large", etc.) checked BEFORE
context_overflow so Anthropic's 5MB rejection lands in the right
bucket.
- run_agent.py: new _try_shrink_image_parts_in_messages walks api
messages in-place, re-encodes oversized data: URL image parts
through vision_tools._resize_image_for_vision to fit under 4MB,
handles both chat.completions (dict image_url) and Responses
(string image_url) shapes, ignores http URLs (provider-fetched).
New image_shrink_retry_attempted flag in the retry loop fires the
shrink exactly once per turn after credential-pool recovery but
before auth retries.
E2E verified live against Anthropic claude-sonnet-4-6:
- 17.9MB PNG (23.9MB b64) attached at native size
- Anthropic returns 400 "image exceeds 5 MB maximum"
- Agent logs '📐 Image(s) exceeded provider size limit — shrank and
retrying...'
- Retry succeeds, correct response delivered in 6.8s total.
Tests: 12 new (8 shrink-helper shapes + 4 classifier signals),
replaces 5 proactive-ceiling tests with 3 simpler 'native attach works'
tests. 181 targeted tests pass. test_enum_members_exist in
test_error_classifier.py updated for the new enum value.
On provider switches mid-session (e.g. MiniMax -> DeepSeek), the source
assistant turn carries a 'reasoning' field written by the prior provider
but no 'reasoning_content' key. _copy_reasoning_content_for_api would
promote that foreign 'reasoning' to 'reasoning_content' on the outbound
DeepSeek request, leaking a cross-provider chain of thought and in
practice causing HTTP 400.
DeepSeek's own _build_assistant_message always pins reasoning_content=''
at creation time for tool-call turns, so the shape (reasoning set,
reasoning_content absent, tool_calls present) is unreachable from
same-provider DeepSeek history — it can only come from a prior provider.
Pad with '' in that case instead of promoting.
Healthy same-provider 'reasoning' promotion (no tool_calls, or on
providers that do not require the empty-string pin) is unchanged.
When _compress_context rotates session_id (compression split), fire
on_session_start(new_sid, boundary_reason="compression",
old_session_id=<old>) on the active context engine. Plugin engines
(e.g. hermes-lcm) use this to preserve DAG lineage across the rollover
instead of re-initializing fresh per-session state.
Built-in ContextCompressor.on_session_start accepts **kwargs and ignores
them — no behavior change for default users.
Closes hermes-lcm#68 symptom: after Hermes compressed and minted a new
physical session, LCM was treating the split as a fresh /new and losing
continuity (compression_count: 1, store_messages: 0, dag_nodes: 0).
Credit: @Tosko4 (PR #13370) — minimized scope to the boundary_reason
signal only; the broader session-lifecycle refactor will be taken in
separate PRs if justified by concrete plugin need.
Background review fork now inherits session_id, credential_pool, and
status_callback from the parent (added in #16099 after this PR was
written). Extend the bare-agent helper so the regression test keeps
reaching the cleanup assertions instead of failing in the runtime
resolver.
Signed-off-by: Teknium <8425893+teknium1@users.noreply.github.com>
Temporary background review agents can initialize Hindsight-backed memory clients, but close() alone skips provider teardown. Shut the memory provider down before closing so aiohttp sessions do not leak at process exit.
Made-with: Cursor
The background skill-review prompt (spawned after N user turns) now instructs
the reviewer to SURVEY existing skills first, identify the CLASS of task, and
PREFER updating/generalizing an existing skill over creating a new narrow one.
This reduces near-duplicate skill accumulation at the source. Catches the
common failure mode where repeated tasks of the same class each spawn their
own specific skill ("fix-my-tauri-error", "fix-my-electron-error") instead
of a single class-level skill ("desktop-app-build-troubleshooting").
Applied to both _SKILL_REVIEW_PROMPT and the **Skills** half of
_COMBINED_REVIEW_PROMPT. Memory-only review prompt unchanged.
Groundwork for the Curator feature (issue #7816) — the creation-side fix.
Curator handles the retirement/consolidation side in a follow-up PR.
Tests assert the behavioral instructions are present (survey, class, update-
over-create, overlap-flagging, opt-out clause) rather than snapshotting the
full prompt text.
Azure OpenAI exposes an OpenAI-compatible endpoint at
`{resource}.openai.azure.com/openai/v1` that accepts the standard
`openai` Python client. Two issues prevented gpt-5.x models from working:
1. `_max_tokens_param()` only sent `max_completion_tokens` for
`api.openai.com` URLs. Azure also requires `max_completion_tokens`
for gpt-5.x models.
2. The `codex_responses` upgrade gate unconditionally upgraded gpt-5.x
to Responses API. Azure does NOT support the Responses API — it serves
gpt-5.x on the regular `/chat/completions` path, causing a 404.
Fix: add `_is_azure_openai_url()` that matches `openai.azure.com` URLs.
- `_max_tokens_param()` now returns `max_completion_tokens` for Azure.
- The `codex_responses` upgrade gate skips Azure so gpt-5.x stays on
`chat_completions` where Azure actually serves it.
- The fallback-provider api_mode picker also recognises Azure and stays
on chat_completions.
- Tests cover max_tokens routing, api_mode behaviour, and URL detection.
gpt-4.x models on Azure are unaffected (already used chat_completions +
max_tokens, which Azure accepts for those models).
Salvage of PR #10086 — rewritten against current main where the
codex_responses upgrade gate gained copilot-acp / explicit-api_mode
exclusions.
Previously _copy_reasoning_content_for_api only padded reasoning_content
when the assistant message had tool_calls. DeepSeek V4 thinking mode
requires the field on every assistant turn, including plain text replies
without tool_calls.
- Remove the 'source_msg.get("tool_calls") and' guard
- Update test: plain assistant turns now get padded for DeepSeek/Kimi
Fixes#15213
The Codex Responses API rejects input_text inside assistant messages —
only output_text and refusal are valid content types for assistant role.
_chat_content_to_responses_parts() previously hardcoded all text content
to input_text regardless of the message role. When an assistant message
had list-format content (multimodal or structured), this produced invalid
input_text parts that the API rejected with:
Invalid value: 'input_text'. Supported values are: 'output_text' and 'refusal'.
Fix: add a role parameter to _chat_content_to_responses_parts() that
selects output_text for assistant messages and input_text for user
messages. Thread this through _chat_messages_to_responses_input() and
_preflight_codex_input_items().
Fixes#15687
When a user sends /stop during a streaming API call, the outer poll loop
detects _interrupt_requested and closes the HTTP connection. However, the
inner _call() thread catches the connection error and enters its retry
loop — opening a FRESH connection without checking the interrupt flag.
On slow providers like ollama-cloud, each retry attempt blocks for the
full stream-read timeout (120s+). With 3 retry attempts this caused
510+ second delays between /stop and actual response — the agent appeared
completely unresponsive despite the stop being acknowledged.
Fix: add an _interrupt_requested check at the top of the streaming retry
loop so the agent exits immediately instead of retrying.
Also fix log truncation: all session key logging in gateway/run.py used
[:20] or [:30] slices, which truncated 'agent:main:telegram:dm:5690190437'
(33 chars) to 'agent:main:telegram:' — losing the identifying chat type
and user ID. Replace with full keys to make logs debuggable.
Reported by user Sidharth Pulipaka via Telegram on ollama-cloud provider.
The AIAgent.flush_memories pre-compression save, the gateway
_flush_memories_for_session, and everything feeding them are
obsolete now that the background memory/skill review handles
persistent memory extraction.
Problems with flush_memories:
- Pre-dates the background review loop. It was the only memory-save
path when introduced; the background review now fires every 10 user
turns on CLI and gateway alike, which is far more frequent than
compression or session reset ever triggered flush.
- Blocking and synchronous. Pre-compression flush ran on the live agent
before compression, blocking the user-visible response.
- Cache-breaking. Flush built a temporary conversation prefix
(system prompt + memory-only tool list) that diverged from the live
conversation's cached prefix, invalidating prompt caching. The
gateway variant spawned a fresh AIAgent with its own clean prompt
for each finalized session — still cache-breaking, just in a
different process.
- Redundant. Background review runs in the live conversation's
session context, gets the same content, writes to the same memory
store, and doesn't break the cache. Everything flush_memories
claimed to preserve is already covered.
What this removes:
- AIAgent.flush_memories() method (~248 LOC in run_agent.py)
- Pre-compression flush call in _compress_context
- flush_memories call sites in cli.py (/new + exit)
- GatewayRunner._flush_memories_for_session + _async_flush_memories
(and the 3 call sites: session expiry watcher, /new, /resume)
- 'flush_memories' entry from DEFAULT_CONFIG auxiliary tasks,
hermes tools UI task list, auxiliary_client docstrings
- _memory_flush_min_turns config + init
- #15631's headroom-deduction math in
_check_compression_model_feasibility (headroom was only needed
because flush dragged the full main-agent system prompt along;
the compression summariser sends a single user-role prompt so
new_threshold = aux_context is safe again)
- The dedicated test files and assertions that exercised
flush-specific paths
What this renames (with read-time backcompat on sessions.json):
- SessionEntry.memory_flushed -> SessionEntry.expiry_finalized.
The session-expiry watcher still uses the flag to avoid re-running
finalize/eviction on the same expired session; the new name
reflects what it now actually gates. from_dict() reads
'expiry_finalized' first, falls back to the legacy 'memory_flushed'
key so existing sessions.json files upgrade seamlessly.
Supersedes #15631 and #15638.
Tested: 383 targeted tests pass across run_agent/, agent/, cli/,
and gateway/ session-boundary suites. No behavior regressions —
background memory review continues to handle persistent memory
extraction on both CLI and gateway.
_check_compression_model_feasibility calls get_model_context_length
without provider=, so Codex OAuth users get 1,050,000 (from models.dev
for 'openai') instead of the actual 272,000 limit. This happens because
_infer_provider_from_url maps chatgpt.com → 'openai' (not 'openai-codex'),
skipping the Codex-specific resolution branch entirely.
Result: compression threshold set at 85% of 1.05M = 892K — conversations
never trigger compression, the context grows unbounded, and when gateway
hygiene eventually forces compression, the Codex endpoint drops the
oversized streaming request ('peer closed connection without sending
complete message body').
Fix: forward self.provider to get_model_context_length so provider-
specific resolution branches (Codex OAuth 272K, Copilot live /models,
Nous suffix-match) fire correctly.
Reported by user on GPT 5.5 via Codex OAuth Pro (paste.rs/vsra3).
When the auxiliary compression model's context is smaller than the main
model's compression threshold, _check_compression_model_feasibility
auto-lowers the session threshold. Previously it set:
new_threshold = aux_context
This let the raw message list grow to exactly aux_context tokens. But
compression and flush_memories actually send system_prompt + tool_schemas
+ messages to the aux model. With 50+ tools that overhead is 25-30K
tokens, so the full request overflowed aux with HTTP 400.
Subtract a headroom estimate from aux_context before setting the new
threshold: the actual tool-schema token count (from
estimate_request_tokens_rough) plus a 12K allowance for the system
prompt (not yet built at __init__ time) and flush-instruction overhead.
Clamp to MINIMUM_CONTEXT_LENGTH so the session still starts even with
an unusually heavy tool schema.
This fixes the 'flush_memories overflow on busy toolsets' path that
Teknium flagged — where main and aux can be nominally the same model
but still 400 because the threshold left no room for the request
overhead. Same fix also protects the normal compression summarisation
request on the same binding aux.
Tests: two new regression tests cover the headroom reservation and the
MINIMUM_CONTEXT_LENGTH floor. Two existing tests updated for the new
(lower) threshold values now that empty-tools still produces a 12K
static headroom deduction.
The memory-flush fallback for api_mode='codex_responses' was unconditionally
adding `temperature` to codex_kwargs before calling _run_codex_stream. The
Responses API does not accept temperature on any supported backend:
- chatgpt.com/backend-api/codex rejects it outright
- api.openai.com + gpt-5/o-series reasoning models reject it
- Copilot Responses rejects it on reasoning models
The CodexAuxiliaryClient adapter and the codex_responses transport both
correctly omit temperature — the flush fallback was the only path putting
it back. On errors from the primary aux path (e.g. expired OAuth token),
users saw `⚠ Auxiliary memory flush failed: HTTP 400: Unsupported parameter:
temperature`.
Reported by Garik [NOUS] on GPT-5.5 via Codex OAuth Pro.
Extracts _needs_kimi_tool_reasoning() for symmetry with the existing
_needs_deepseek_tool_reasoning() helper, so _copy_reasoning_content_for_api
uses the same detection logic as _build_assistant_message. Future changes
to either provider's signals now only touch one function.
Adds tests/run_agent/test_deepseek_reasoning_content_echo.py covering:
- All 3 DeepSeek detection signals (provider, model, host)
- Poisoned history replay (empty string fallback)
- Plain assistant turns NOT padded
- Explicit reasoning_content preserved
- Reasoning field promoted to reasoning_content
- Existing Kimi/Moonshot detection intact
- Non-thinking providers left alone
21 tests, all pass.
``run_conversation`` was calling ``memory_manager.sync_all(
original_user_message, final_response)`` at the end of every turn
where both args were present. That gate didn't consider the
``interrupted`` local flag, so an external memory backend received
partial assistant output, aborted tool chains, or mid-stream resets as
durable conversational truth. Downstream recall then treated the
not-yet-real state as if the user had seen it complete, poisoning the
trust boundary between "what the user took away from the turn" and
"what Hermes was in the middle of producing when the interrupt hit".
Extracted the inline sync block into a new private method
``AIAgent._sync_external_memory_for_turn(original_user_message,
final_response, interrupted)`` so the interrupt guard is a single
visible check at the top of the method instead of hidden in a
boolean-and at the call site. That also gives tests a clean seam to
assert on — the pre-fix layout buried the logic inside the 3,000-line
``run_conversation`` function where no focused test could reach it.
The new method encodes three independent skip conditions:
1. ``interrupted`` → skip entirely (the #15218 fix). Applies even
when ``final_response`` and ``original_user_message`` happen to
be populated — an interrupt may have landed between a streamed
reply and the next tool call, so the strings on disk are not
actually the turn the user took away.
2. No memory manager / no final_response / no user message →
preserve existing skip behaviour (nothing new for providerless
sessions, system-initiated refreshes, tool-only turns that never
resolved, etc.).
3. Sync_all / queue_prefetch_all exceptions → swallow. External
memory providers are strictly best-effort; a misconfigured or
offline backend must never block the user from seeing their
response.
The prefetch side-effect is gated on the same interrupt flag: the
user's next message is almost certainly a retry of the same intent,
and a prefetch keyed on the interrupted turn would fire against stale
context.
### Tests (16 new, all passing on py3.11 venv)
``tests/run_agent/test_memory_sync_interrupted.py`` exercises the
helper directly on a bare ``AIAgent`` (``__new__`` pattern that the
interrupt-propagation tests already use). Coverage:
- Interrupted turn with full-looking response → no sync (the fix)
- Interrupted turn with long assistant output → no sync (the interrupt
could have landed mid-stream; strings-on-disk lie)
- Normal completed turn → sync_all + queue_prefetch_all both called
with the right args (regression guard for the positive path)
- No final_response / no user_message / no memory manager → existing
pre-fix skip paths still apply
- sync_all raises → exception swallowed, prefetch still attempted
- queue_prefetch_all raises → exception swallowed after sync succeeded
- 8-case parametrised matrix across (interrupted × final_response ×
original_user_message) asserts sync fires iff interrupted=False AND
both strings are non-empty
Closes#15218
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends _repair_tool_call_arguments() to cover the most common local-model
JSON corruption pattern: llama.cpp/Ollama backends emit literal tabs and
newlines inside JSON string values (memory save summaries, file contents,
etc.). Previously fell through to '{}' replacement, losing the call.
Adds two repair passes:
- Pass 0: json.loads(strict=False) + re-serialise to canonical wire form
- Pass 4: escape 0x00-0x1F control chars inside string values, then retry
Ports the core utility from #12068 / PR #12093 without the larger plumbing
change (that PR also replaced json.loads at 8 call sites; current main's
_repair_tool_call_arguments is already the single chokepoint, so the
upgrade happens transparently for every existing caller).
Credit: @truenorth-lj for the original utility design.
4 new regression tests covering literal newlines, tabs, re-serialisation
to strict=True-valid output, and the trailing-comma + control-char
combination case.
When the streaming path (chat completions) assembled tool call deltas and
detected malformed JSON arguments, it set has_truncated_tool_args=True but
passed the broken args through unchanged. This triggered the truncation
handler which returned a partial result and killed the session (/new required).
_many_ malformations are repairable: trailing commas, unclosed brackets,
Python None, empty strings. _repair_tool_call_arguments() already existed
for the pre-API-request path but wasn't called during streaming assembly.
Now when JSON parsing fails during streaming assembly, we attempt repair
via _repair_tool_call_arguments() before flagging as truncated. If repair
succeeds (returns valid JSON), the tool call proceeds normally. Only truly
unrepairable args fall through to the truncation handler.
This prevents the most common session-killing failure mode for models like
GLM-5.1 that produce trailing commas or unclosed brackets.
Tests: 12 new streaming assembly repair tests, all 29 existing repair
tests still passing.
When a session is split by context compression mid-tool-call, an assistant
message may end up with truncated/invalid JSON in tool_calls[*].function.arguments.
On the next turn this is replayed verbatim and providers reject the entire request
with HTTP 400 invalid_tool_call_format, bricking the conversation in a loop that
cannot recover without manual session quarantine.
This patch adds a defensive sanitizer that runs immediately before
client.chat.completions.create() in AIAgent.run_conversation():
- Validates each assistant tool_calls[*].function.arguments via json.loads
- Replaces invalid/empty arguments with '{}'
- Injects a synthetic tool response (or prepends a marker to the existing one)
so downstream messages keep valid tool_call_id pairing
- Logs each repair with session_id / message_index / preview for observability
Defense in depth: corruption can originate from compression splits, manual edits,
or plugin bugs. Sanitizing at the send chokepoint catches all sources.
Adds 7 unit tests covering: truncated JSON, empty string, None, non-string args,
existing matching tool response (no duplicate injection), non-assistant messages
ignored, multiple repairs.
Fixes#15236
gpt-5.x on the Codex Responses API sometimes degenerates and emits
Harmony-style `to=functions.<name> {json}` serialization as plain
assistant-message text instead of a structured `function_call` item.
The intent never makes it into `response.output` as a function_call,
so `tool_calls` is empty and `_normalize_codex_response()` returns
the leaked text as the final content. Downstream (e.g. delegate_task),
this surfaces as a confident-looking summary with `tool_trace: []`
because no tools actually ran — the Taiwan-embassy-email bug report.
Detect the pattern, scrub the content, and return finish_reason=
'incomplete' so the existing Codex-incomplete continuation path
(run_agent.py:11331, 3 retries) gets a chance to re-elicit a proper
function_call item. Encrypted reasoning items are preserved so the
model keeps its chain-of-thought on the retry.
Regression tests: leaked text triggers incomplete, real tool calls
alongside leak-looking text are preserved, clean responses pass
through unchanged.
Reported on Discord (gpt-5.4 / openai-codex).
Claude-style and some Anthropic-tuned models occasionally emit tool
names as class-like identifiers: TodoTool_tool, Patch_tool,
BrowserClick_tool, PatchTool. These failed strict-dict lookup in
valid_tool_names and triggered the 'Unknown tool' self-correction
loop, wasting a full turn of iteration and tokens.
_repair_tool_call already handled lowercase / separator / fuzzy
matches but couldn't bridge the CamelCase-to-snake_case gap or the
trailing '_tool' suffix that Claude sometimes tacks on. Extend it
with two bounded normalization passes:
1. CamelCase -> snake_case (via regex lookbehind).
2. Strip trailing _tool / -tool / tool suffix (case-insensitive,
applied twice so TodoTool_tool reduces all the way: strip
_tool -> TodoTool, snake -> todo_tool, strip 'tool' -> todo).
Cheap fast-paths (lowercase / separator-normalized) still run first
so the common case stays zero-cost. Fuzzy match remains the last
resort unchanged.
Tests: tests/run_agent/test_repair_tool_call_name.py covers the
three original reports (TodoTool_tool, Patch_tool, BrowserClick_tool),
plus PatchTool, WriteFileTool, ReadFile_tool, write-file_Tool,
patch-tool, and edge cases (empty, None, '_tool' alone, genuinely
unknown names).
18 new tests + 17 existing arg-repair tests = 35/35 pass.
Closes#14784
Extracts pool-rotation-room logic into `_pool_may_recover_from_rate_limit`
so single-credential pools no longer block the eager-fallback path on 429.
The existing check `pool is not None and pool.has_available()` lets
fallback fire only after the pool marks every entry as exhausted. With
exactly one credential in the pool (the common shape for Gemini OAuth,
Vertex service accounts, and any personal-key setup), `has_available()`
flips back to True as soon as the cooldown expires — Hermes retries
against the same entry, hits the same daily-quota 429, and burns the
retry budget in a tight loop before ever reaching the configured
`fallback_model`. Observed in the wild as 4+ hours of 429 noise on a
single Gemini key instead of falling through to Vertex as configured.
Rotation is only meaningful with more than one credential — gate on
`len(pool.entries()) > 1`. Multi-credential pools keep the current
wait-for-rotation behaviour unchanged.
Fixes#11314. Related to #8947, #10210, #7230. Narrower scope than
open PRs #8023 (classifier change) and #11492 (503/529 credential-pool
bypass) — this addresses the single-credential 429 case specifically
and does not conflict with either.
Tests: 6 new unit tests in tests/run_agent/test_provider_fallback.py
covering (a) None pool, (b) single-cred available, (c) single-cred in
cooldown, (d) 2-cred available rotates, (e) multi-cred all cooling-down
falls back, (f) many-cred available rotates. All 18 tests in the file
pass.
When using GitHub Copilot as provider, HTTP 401 errors could cause
Hermes to silently fall back to the next model in the chain instead
of recovering. This adds a one-shot retry mechanism that:
1. Re-resolves the Copilot token via the standard priority chain
(COPILOT_GITHUB_TOKEN -> GH_TOKEN -> GITHUB_TOKEN -> gh auth token)
2. Rebuilds the OpenAI client with fresh credentials and Copilot headers
3. Retries the failed request before falling back
The fix handles the common case where the gho_* OAuth token remains
valid but the httpx client state becomes stale (e.g. after startup
race conditions or long-lived sessions).
Key design decisions:
- Always rebuild client even if token string unchanged (recovers stale state)
- Uses _apply_client_headers_for_base_url() for canonical header management
- One-shot flag guard prevents infinite 401 loops (matches existing pattern
used by Codex/Nous/Anthropic providers)
- No token exchange via /copilot_internal/v2/token (returns 404 for some
account types; direct gho_* auth works reliably)
Tests: 3 new test cases covering end-to-end 401->refresh->retry,
client rebuild verification, and same-token rebuild scenarios.
Docs: Updated providers.md with Copilot auth behavior section.
json.JSONDecodeError inherits from ValueError. The agent loop's
non-retryable classifier at run_agent.py ~L10782 treated any
ValueError/TypeError as a local programming bug and short-circuited
retry. Without a carve-out, a transient JSONDecodeError from a
provider that returned a malformed response body, a truncated stream,
or a router-layer corruption would fail the turn immediately.
Add JSONDecodeError to the existing UnicodeEncodeError exclusion
tuple so the classified-retry logic (which already handles 429/529/
context-overflow/etc.) gets to run on bad-JSON errors.
Tests (tests/run_agent/test_jsondecodeerror_retryable.py):
- JSONDecodeError: NOT local validation
- UnicodeEncodeError: NOT local validation (existing carve-out)
- bare ValueError: IS local validation (programming bug)
- bare TypeError: IS local validation (programming bug)
- source-level assertion that run_agent.py still carries the carve-out
(guards against accidental revert)
Closes#14782
Make the main-branch test suite pass again. Most failures were tests
still asserting old shapes after recent refactors; two were real source
bugs.
Source fixes:
- tools/mcp_tool.py: _kill_orphaned_mcp_children() slept 2s on every
shutdown even when no tracked PIDs existed, making test_shutdown_is_parallel
measure ~3s for 3 parallel 1s shutdowns. Early-return when pids is empty.
- hermes_cli/tips.py: tip 105 was 157 chars; corpus max is 150.
Test fixes (mostly stale mock targets / missing fixture fields):
- test_zombie_process_cleanup, test_agent_cache: patch run_agent.cleanup_vm
(the local name bound at import), not tools.terminal_tool.cleanup_vm.
- test_browser_camofox: patch tools.browser_camofox.load_config, not
hermes_cli.config.load_config (the source module, not the resolved one).
- test_flush_memories_codex._chat_response_with_memory_call: add
finish_reason, tool_call.id, tool_call.type so the chat_completions
transport normalizer doesn't AttributeError.
- test_concurrent_interrupt: polling_tool signature now accepts
messages= kwarg that _invoke_tool() passes through.
- test_minimax_provider: add _fallback_chain=[] to the __new__'d agent
so switch_model() doesn't AttributeError.
- test_skills_config: SKILLS_DIR MagicMock + .rglob stopped working
after the scanner switched to agent.skill_utils.iter_skill_index_files
(os.walk-based). Point SKILLS_DIR at a real tmp_path and patch
agent.skill_utils.get_external_skills_dirs.
- test_browser_cdp_tool: browser_cdp toolset was intentionally split into
'browser-cdp' (commit 96b0f3700) so its stricter check_fn doesn't gate
the whole browser toolset; test now expects 'browser-cdp'.
- test_registry: add tools.browser_dialog_tool to the expected
builtin-discovery set (PR #14540 added it).
- test_file_tools TestPatchHints: patch_tool surfaces hints as a '_hint'
key on the JSON payload, not inline '[Hint: ...' text.
- test_write_deny test_hermes_env: resolve .env via get_hermes_home() so
the path matches the profile-aware denylist under hermetic HERMES_HOME.
- test_checkpoint_manager test_falls_back_to_parent: guard the walk-up
so a stray /tmp/pyproject.toml on the host doesn't pick up /tmp as the
project root.
- test_quick_commands: set cli.session_id in the __new__'d CLI so the
alias-args path doesn't trip AttributeError when fuzzy-matching leaks
a skill command across xdist test distribution.
- Load prompt_caching.cache_ttl in AIAgent (5m default, 1h opt-in)
- Document DEFAULT_CONFIG and developer guide example
- Add unit tests for default, 1h, and invalid TTL fallback
Made-with: Cursor
Pin the behaviour added in the preceding commit — `_get_proxy_for_base_url()`
must return None for hosts covered by NO_PROXY and the HTTPS_PROXY otherwise,
and the full `_create_openai_client()` path must NOT mount HTTPProxy for a
NO_PROXY host.
Refs: #14966
Manual /compress crashed with 'LCMEngine' object has no attribute
'_align_boundary_forward' when any context-engine plugin was active.
The gateway handler reached into _align_boundary_forward and
_find_tail_cut_by_tokens on tmp_agent.context_compressor, but those
are ContextCompressor-specific — not part of the generic ContextEngine
ABC — so every plugin engine (LCM, etc.) raised AttributeError.
- Add optional has_content_to_compress(messages) to ContextEngine ABC
with a safe default of True (always attempt).
- Override it in the built-in ContextCompressor using the existing
private helpers — preserves exact prior behavior for 'compressor'.
- Rewrite gateway /compress preflight to call the ABC method, deleting
the private-helper reach-in.
- Add focus_topic to the ABC compress() signature. Make _compress_context
retry without focus_topic on TypeError so older strict-sig plugins
don't crash on manual /compress <focus>.
- Regression test with a fake ContextEngine subclass that only
implements the ABC (mirrors LCM's surface).
Reported by @selfhostedsoul (Discord, Apr 22).
Closes#11616.
The agent's API retry loop hardcoded max_retries = 3, so users with
fallback providers on flaky primaries burned through ~3 × provider
timeout (e.g. 3 × 180s = 9 minutes) before their fallback chain got a
chance to kick in.
Expose a new config key:
agent:
api_max_retries: 3 # default unchanged
Set it to 1 for fast failover when you have fallback providers, or
raise it if you prefer longer tolerance on a single provider. Values
< 1 are clamped to 1 (single attempt, no retry); non-integer values
fall back to the default.
This wraps the Hermes-level retry loop only — the OpenAI SDK's own
low-level retries (max_retries=2 default) still run beneath this for
transient network errors.
Changes:
- hermes_cli/config.py: add agent.api_max_retries default 3 with comment.
- run_agent.py: read self._api_max_retries in AIAgent.__init__; replace
hardcoded max_retries = 3 in the retry loop with self._api_max_retries.
- cli-config.yaml.example: documented example entry.
- hermes_cli/tips.py: discoverable tip line.
- tests/run_agent/test_api_max_retries_config.py: 4 tests covering
default, override, clamp-to-one, and invalid-value fallback.
NormalizedResponse and ToolCall now have backward-compat properties
so the agent loop can read them directly without the shim:
ToolCall: .type, .function (returns self), .call_id, .response_item_id
NormalizedResponse: .reasoning_content, .reasoning_details,
.codex_reasoning_items
This eliminates the 35-line shim and its 4 call sites in run_agent.py.
Also changes flush_memories guard from hasattr(response, 'choices')
to self.api_mode in ('chat_completions', 'bedrock_converse') so it
works with raw boto3 dicts too.
WS1 items 3+4 of Cycle 2 (#14418).
3-layer chain (transport → v2 → v1) was collapsed to 2-layer in PR 7.
This collapses the remaining 2-layer (transport → v1 → NR mapping in
transport) to 1-layer: v1 now returns NormalizedResponse directly.
Before: adapter returns (SimpleNamespace, finish_reason) tuple,
transport unpacks and maps to NormalizedResponse (22 lines).
After: adapter returns NormalizedResponse, transport is a
1-line passthrough.
Also updates ToolCall construction — adapter now creates ToolCall
dataclass directly instead of SimpleNamespace(id, type, function).
WS1 item 1 of Cycle 2 (#14418).