Files
hermes-agent/tests
Teknium 9e236df3f8 feat: per-turn primary runtime restoration and transport recovery
Make provider fallback turn-scoped in long-lived CLI sessions. Previously,
a single transient failure pinned the session to the fallback provider for
every subsequent turn. Now the primary model/provider is restored at the
start of each run_conversation() call.

Three pieces:

1. _primary_runtime dict snapshot at __init__ — captures model, provider,
   base_url, api_mode, api_key, client_kwargs, prompt caching flag, and
   context compressor state. Single dict is easy to extend; avoids the
   brittleness of N individual _primary_* attributes.

2. _restore_primary_runtime() — called at the top of run_conversation().
   Restores all primary state, rebuilds the client, resets the fallback
   chain index (so all fallbacks are available again), and restores the
   context compressor's model/context_length/threshold that
   _try_activate_fallback() overwrites. No-op in the gateway (fresh
   agent per message).

3. _try_recover_primary_transport() — after max_retries exhaust, rebuilds
   the primary client (clearing stale connection pools) and gives one more
   attempt before falling through to fallback. Only for transient transport
   errors (ReadTimeout, ConnectTimeout, PoolTimeout, ConnectError,
   RemoteProtocolError). Skipped for aggregator providers (OpenRouter,
   Nous) which manage their own retry infrastructure.

Inspired by PR #4612 (betamod) which identified this gap.

Includes 25 tests covering snapshot creation, restore correctness,
fallback index reset, compressor state restoration, transport recovery
scoping, wait time behavior, and error resilience.
2026-04-02 09:56:05 -07:00
..