fix: enable TCP keepalives to detect dead provider connections (#10324) (#11277)

Re-land of #10933, now guarded by the tests in #11266.

When a provider drops a TCP connection mid-stream, the socket can enter
CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal
arrives, so the httpx read timeout never triggers and the agent hangs
indefinitely. The other defenses (''_force_close_tcp_sockets'', stale
stream detector) all ride on the socket layer reporting the dead
connection, which it never does without probes.

Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT''
into the httpx transport. Kernel probes after 30s idle, retries every
10s, gives up after 3 → dead peer detected within ~60s instead of
hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux,
''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere
the socket options aren't available.

The original land (#10933) mutated ''client_kwargs'' in place when it
injected the ''httpx.Client''. Since callers pass ''self._client_kwargs''
by reference, the injected client leaked into the instance state. After
the first request, the OpenAI SDK closed its ''http_client'' — including
the injected one. The next ''_create_openai_client'' call re-read the
now-closed ''httpx.Client'' from ''self._client_kwargs'' and every
subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError:
Cannot send a request, as the client has been closed'' (AlexKucera's
Discord report, 2026-04-16).

The defensive ''client_kwargs = dict(client_kwargs)'' copy already on
main (taeuk178's #10978) means this injection only lands in the
per-call local copy. Each ''_create_openai_client'' invocation gets
its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired
''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild,
teardown, credential rotation), its ''httpx.Client'' closes with it
and the next call constructs a fresh one — no stale closed transport
can be reused.

Full 4-test matrix all green (unit + live with real OpenRouter round
trips, HERMES_LIVE_TESTS=1):

    tests/run_agent/test_create_openai_client_kwargs_isolation.py      PASS
    tests/run_agent/test_create_openai_client_reuse.py                 PASS (2)
    tests/run_agent/test_sequential_chats_live.py                      PASS

Socket options verified on the live httpx transport:

    _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)]
    = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3)

Sequential-chat reproduction of the #10933 failure was explicitly
run against this patch — the defensive copy on main prevents the
closed transport from leaking back into ''self._client_kwargs'', so
every rebuild constructs a fresh transport.

Closes #10324
This commit is contained in:
Teknium
2026-04-16 20:04:54 -07:00
committed by GitHub
parent ab33ce1c86
commit 8c478983ed

View File

@@ -4395,6 +4395,41 @@ class AIAgent:
self._client_log_context(),
)
return client
# Inject TCP keepalives so the kernel detects dead provider connections
# instead of letting them sit silently in CLOSE-WAIT (#10324). Without
# this, a peer that drops mid-stream leaves the socket in a state where
# epoll_wait never fires, ``httpx`` read timeout may not trigger, and
# the agent hangs until manually killed. Probes after 30s idle, retry
# every 10s, give up after 3 → dead peer detected within ~60s.
#
# Safety against #10933: the ``client_kwargs = dict(client_kwargs)``
# above means this injection only lands in the local per-call copy,
# never back into ``self._client_kwargs``. Each ``_create_openai_client``
# invocation therefore gets its OWN fresh ``httpx.Client`` whose
# lifetime is tied to the OpenAI client it is passed to. When the
# OpenAI client is closed (rebuild, teardown, credential rotation),
# the paired ``httpx.Client`` closes with it, and the next call
# constructs a fresh one — no stale closed transport can be reused.
# Tests in ``tests/run_agent/test_create_openai_client_reuse.py`` and
# ``tests/run_agent/test_sequential_chats_live.py`` pin this invariant.
if "http_client" not in client_kwargs:
try:
import httpx as _httpx
import socket as _socket
_sock_opts = [(_socket.SOL_SOCKET, _socket.SO_KEEPALIVE, 1)]
if hasattr(_socket, "TCP_KEEPIDLE"):
# Linux
_sock_opts.append((_socket.IPPROTO_TCP, _socket.TCP_KEEPIDLE, 30))
_sock_opts.append((_socket.IPPROTO_TCP, _socket.TCP_KEEPINTVL, 10))
_sock_opts.append((_socket.IPPROTO_TCP, _socket.TCP_KEEPCNT, 3))
elif hasattr(_socket, "TCP_KEEPALIVE"):
# macOS (uses TCP_KEEPALIVE instead of TCP_KEEPIDLE)
_sock_opts.append((_socket.IPPROTO_TCP, _socket.TCP_KEEPALIVE, 30))
client_kwargs["http_client"] = _httpx.Client(
transport=_httpx.HTTPTransport(socket_options=_sock_opts),
)
except Exception:
pass # Fall through to default transport if socket opts fail
client = OpenAI(**client_kwargs)
logger.info(
"OpenAI client created (%s, shared=%s) %s",