mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-28 06:51:16 +08:00
Re-land of #10933, now guarded by the tests in #11266. When a provider drops a TCP connection mid-stream, the socket can enter CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal arrives, so the httpx read timeout never triggers and the agent hangs indefinitely. The other defenses (''_force_close_tcp_sockets'', stale stream detector) all ride on the socket layer reporting the dead connection, which it never does without probes. Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT'' into the httpx transport. Kernel probes after 30s idle, retries every 10s, gives up after 3 → dead peer detected within ~60s instead of hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux, ''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere the socket options aren't available. The original land (#10933) mutated ''client_kwargs'' in place when it injected the ''httpx.Client''. Since callers pass ''self._client_kwargs'' by reference, the injected client leaked into the instance state. After the first request, the OpenAI SDK closed its ''http_client'' — including the injected one. The next ''_create_openai_client'' call re-read the now-closed ''httpx.Client'' from ''self._client_kwargs'' and every subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError: Cannot send a request, as the client has been closed'' (AlexKucera's Discord report, 2026-04-16). The defensive ''client_kwargs = dict(client_kwargs)'' copy already on main (taeuk178's #10978) means this injection only lands in the per-call local copy. Each ''_create_openai_client'' invocation gets its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired ''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild, teardown, credential rotation), its ''httpx.Client'' closes with it and the next call constructs a fresh one — no stale closed transport can be reused. Full 4-test matrix all green (unit + live with real OpenRouter round trips, HERMES_LIVE_TESTS=1): tests/run_agent/test_create_openai_client_kwargs_isolation.py PASS tests/run_agent/test_create_openai_client_reuse.py PASS (2) tests/run_agent/test_sequential_chats_live.py PASS Socket options verified on the live httpx transport: _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)] = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3) Sequential-chat reproduction of the #10933 failure was explicitly run against this patch — the defensive copy on main prevents the closed transport from leaking back into ''self._client_kwargs'', so every rebuild constructs a fresh transport. Closes #10324
This commit is contained in:
35
run_agent.py
35
run_agent.py
@@ -4395,6 +4395,41 @@ class AIAgent:
|
||||
self._client_log_context(),
|
||||
)
|
||||
return client
|
||||
# Inject TCP keepalives so the kernel detects dead provider connections
|
||||
# instead of letting them sit silently in CLOSE-WAIT (#10324). Without
|
||||
# this, a peer that drops mid-stream leaves the socket in a state where
|
||||
# epoll_wait never fires, ``httpx`` read timeout may not trigger, and
|
||||
# the agent hangs until manually killed. Probes after 30s idle, retry
|
||||
# every 10s, give up after 3 → dead peer detected within ~60s.
|
||||
#
|
||||
# Safety against #10933: the ``client_kwargs = dict(client_kwargs)``
|
||||
# above means this injection only lands in the local per-call copy,
|
||||
# never back into ``self._client_kwargs``. Each ``_create_openai_client``
|
||||
# invocation therefore gets its OWN fresh ``httpx.Client`` whose
|
||||
# lifetime is tied to the OpenAI client it is passed to. When the
|
||||
# OpenAI client is closed (rebuild, teardown, credential rotation),
|
||||
# the paired ``httpx.Client`` closes with it, and the next call
|
||||
# constructs a fresh one — no stale closed transport can be reused.
|
||||
# Tests in ``tests/run_agent/test_create_openai_client_reuse.py`` and
|
||||
# ``tests/run_agent/test_sequential_chats_live.py`` pin this invariant.
|
||||
if "http_client" not in client_kwargs:
|
||||
try:
|
||||
import httpx as _httpx
|
||||
import socket as _socket
|
||||
_sock_opts = [(_socket.SOL_SOCKET, _socket.SO_KEEPALIVE, 1)]
|
||||
if hasattr(_socket, "TCP_KEEPIDLE"):
|
||||
# Linux
|
||||
_sock_opts.append((_socket.IPPROTO_TCP, _socket.TCP_KEEPIDLE, 30))
|
||||
_sock_opts.append((_socket.IPPROTO_TCP, _socket.TCP_KEEPINTVL, 10))
|
||||
_sock_opts.append((_socket.IPPROTO_TCP, _socket.TCP_KEEPCNT, 3))
|
||||
elif hasattr(_socket, "TCP_KEEPALIVE"):
|
||||
# macOS (uses TCP_KEEPALIVE instead of TCP_KEEPIDLE)
|
||||
_sock_opts.append((_socket.IPPROTO_TCP, _socket.TCP_KEEPALIVE, 30))
|
||||
client_kwargs["http_client"] = _httpx.Client(
|
||||
transport=_httpx.HTTPTransport(socket_options=_sock_opts),
|
||||
)
|
||||
except Exception:
|
||||
pass # Fall through to default transport if socket opts fail
|
||||
client = OpenAI(**client_kwargs)
|
||||
logger.info(
|
||||
"OpenAI client created (%s, shared=%s) %s",
|
||||
|
||||
Reference in New Issue
Block a user