feat: warn at session start when compression model context is too small (#7894)
Two-phase design so the warning fires before the user's first message
on every platform:
Phase 1 (__init__):
_check_compression_model_feasibility() runs during agent construction.
Resolves the auxiliary compression model (same chain as call_llm with
task='compression'), compares its context length to the main model's
compression threshold. If too small, emits via _emit_status() (prints
for CLI) and stores the warning in _compression_warning.
Phase 2 (run_conversation, first call):
_replay_compression_warning() re-sends the stored warning through
status_callback — which the gateway wires AFTER construction. The
warning is then cleared so it only fires once.
This ensures:
- CLI users see the warning immediately at startup (right after the
context limit line)
- Gateway users (Telegram, Discord, Slack, WhatsApp, Signal, Matrix,
Mattermost, Home Assistant, DingTalk, etc.) receive it via
status_callback('lifecycle', ...) on their first message
- logger.warning() always hits agent.log regardless of platform
Also warns when no auxiliary LLM provider is configured at all.
Entire check wrapped in try/except — never blocks startup.
11 tests covering: core warning logic, boundary conditions, exception
safety, two-phase store+replay, gateway callback wiring, and
single-delivery guarantee.
2026-04-11 12:01:30 -07:00
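The two-phase store-and-replay flow described above can be sketched as a minimal stand-alone example. This is illustrative only: the `Agent` class, the stubbed context numbers, and the warning text are hypothetical stand-ins for the real `AIAgent`, whose auxiliary-model resolution is far more involved.

```python
# Hypothetical sketch of the two-phase warning delivery; NOT the real AIAgent.
class Agent:
    def __init__(self):
        # Gateways wire status_callback AFTER construction, so phase 1
        # cannot rely on it being present yet.
        self.status_callback = None
        self._compression_warning = None
        self._check_compression_model_feasibility()

    def _check_compression_model_feasibility(self):
        try:
            aux_context = 80_000   # stub: resolved aux model context length
            threshold = 100_000    # stub: main model compression threshold
            if aux_context < threshold:
                warning = (
                    f"Compression model context ({aux_context:,}) is below the "
                    f"main model's compression threshold ({threshold:,})."
                )
                self._compression_warning = warning  # stored for phase 2 replay
        except Exception:
            pass  # never block startup

    def _replay_compression_warning(self):
        # Phase 2: the first run_conversation call re-sends the stored warning
        # through the (now-wired) status_callback, then clears it so the
        # warning is delivered at most once.
        if self._compression_warning and self.status_callback:
            self.status_callback("lifecycle", self._compression_warning)
        self._compression_warning = None


agent = Agent()                      # phase 1 runs during construction
seen = []
agent.status_callback = lambda kind, msg: seen.append(kind)  # gateway wiring
agent._replay_compression_warning()  # phase 2: delivers once
agent._replay_compression_warning()  # no-op: warning already cleared
print(len(seen))                     # prints 1
```

Clearing the stored warning inside the replay method is what gives the single-delivery guarantee the tests below exercise.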

"""Tests for _check_compression_model_feasibility() — warns when the
auxiliary compression model's context is smaller than the main model's
compression threshold.

Two-phase design:

1. __init__ → runs the check, prints via _vprint (CLI), stores warning
2. run_conversation (first call) → replays stored warning through
   status_callback (gateway platforms)
"""

from unittest.mock import MagicMock, patch

import pytest

from run_agent import AIAgent
from agent.context_compressor import ContextCompressor


def _make_agent(
    *,
    compression_enabled: bool = True,
    threshold_percent: float = 0.50,
    main_context: int = 200_000,
) -> AIAgent:
    """Build a minimal AIAgent with a compressor, skipping __init__."""
    agent = AIAgent.__new__(AIAgent)
    agent.model = "test-main-model"
    agent.provider = "openrouter"
    agent.base_url = "https://openrouter.ai/api/v1"
    agent.api_key = "sk-test"
    agent.api_mode = "chat_completions"
    agent.quiet_mode = True
    agent.log_prefix = ""
    agent.compression_enabled = compression_enabled
    agent._print_fn = None
    agent.suppress_status_output = False
    agent._stream_consumers = []
    agent._executing_tools = False
    agent._mute_post_response = False
    agent.status_callback = None
    agent.tool_progress_callback = None
    agent._compression_warning = None
    agent._aux_compression_context_length_config = None
    agent.tools = []

    compressor = MagicMock(spec=ContextCompressor)
    compressor.context_length = main_context
    compressor.threshold_tokens = int(main_context * threshold_percent)
    agent.context_compressor = compressor

    return agent


# ── Core warning logic ──────────────────────────────────────────────


@patch("agent.model_metadata.get_model_context_length", return_value=80_000)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_auto_corrects_threshold_when_aux_context_below_threshold(mock_get_client, mock_ctx_len):
    """Auto-correction: aux >= 64K floor but < threshold → lower threshold
    to aux_context so compression still works this session."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    # threshold = 100,000 — aux has 80,000 (above 64K floor, below threshold)
    mock_client = MagicMock()
    mock_client.base_url = "https://openrouter.ai/api/v1"
    mock_client.api_key = "sk-aux"
    mock_get_client.return_value = (mock_client, "google/gemini-3-flash-preview")

    messages = []
    agent._emit_status = lambda msg: messages.append(msg)

    agent._check_compression_model_feasibility()

    assert len(messages) == 1
    assert "Compression model" in messages[0]
    assert "80,000" in messages[0]  # aux context
    assert "100,000" in messages[0]  # old threshold
    assert "Auto-lowered" in messages[0]
    # Actionable persistence guidance included
    assert "config.yaml" in messages[0]
    assert "auxiliary:" in messages[0]
    assert "compression:" in messages[0]
    assert "threshold:" in messages[0]
    # Warning stored for gateway replay
    assert agent._compression_warning is not None
refactor(memory): remove flush_memories entirely (#15696)
The AIAgent.flush_memories pre-compression save, the gateway
_flush_memories_for_session, and everything feeding them are
obsolete now that the background memory/skill review handles
persistent memory extraction.
Problems with flush_memories:
- Pre-dates the background review loop. It was the only memory-save
path when introduced; the background review now fires every 10 user
turns on CLI and gateway alike, which is far more frequent than
compression or session reset ever triggered flush.
- Blocking and synchronous. Pre-compression flush ran on the live agent
before compression, blocking the user-visible response.
- Cache-breaking. Flush built a temporary conversation prefix
(system prompt + memory-only tool list) that diverged from the live
conversation's cached prefix, invalidating prompt caching. The
gateway variant spawned a fresh AIAgent with its own clean prompt
for each finalized session — still cache-breaking, just in a
different process.
- Redundant. Background review runs in the live conversation's
session context, gets the same content, writes to the same memory
store, and doesn't break the cache. Everything flush_memories
claimed to preserve is already covered.
What this removes:
- AIAgent.flush_memories() method (~248 LOC in run_agent.py)
- Pre-compression flush call in _compress_context
- flush_memories call sites in cli.py (/new + exit)
- GatewayRunner._flush_memories_for_session + _async_flush_memories
(and the 3 call sites: session expiry watcher, /new, /resume)
- 'flush_memories' entry from DEFAULT_CONFIG auxiliary tasks,
hermes tools UI task list, auxiliary_client docstrings
- _memory_flush_min_turns config + init
- #15631's headroom-deduction math in
_check_compression_model_feasibility (headroom was only needed
because flush dragged the full main-agent system prompt along;
the compression summariser sends a single user-role prompt so
new_threshold = aux_context is safe again)
- The dedicated test files and assertions that exercised
flush-specific paths
What this renames (with read-time backcompat on sessions.json):
- SessionEntry.memory_flushed -> SessionEntry.expiry_finalized.
The session-expiry watcher still uses the flag to avoid re-running
finalize/eviction on the same expired session; the new name
reflects what it now actually gates. from_dict() reads
'expiry_finalized' first, falls back to the legacy 'memory_flushed'
key so existing sessions.json files upgrade seamlessly.
Supersedes #15631 and #15638.
Tested: 383 targeted tests pass across run_agent/, agent/, cli/,
and gateway/ session-boundary suites. No behavior regressions —
background memory review continues to handle persistent memory
extraction on both CLI and gateway.
2026-04-25 08:21:14 -07:00
    # Threshold on the live compressor was actually lowered to aux_context.
    assert agent.context_compressor.threshold_tokens == 80_000


@patch("agent.model_metadata.get_model_context_length", return_value=32_768)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_rejects_aux_below_minimum_context(mock_get_client, mock_ctx_len):
    """Hard floor: aux context < MINIMUM_CONTEXT_LENGTH (64K) → session
    refuses to start (ValueError), mirroring the main-model rejection."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    mock_client = MagicMock()
    mock_client.base_url = "https://openrouter.ai/api/v1"
    mock_client.api_key = "sk-aux"
    mock_get_client.return_value = (mock_client, "tiny-aux-model")

    agent._emit_status = lambda msg: None

    with pytest.raises(ValueError) as exc_info:
        agent._check_compression_model_feasibility()

    err = str(exc_info.value)
    assert "tiny-aux-model" in err
    assert "32,768" in err
    assert "64,000" in err
    assert "below the minimum" in err


@patch("agent.model_metadata.get_model_context_length", return_value=200_000)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_no_warning_when_aux_context_sufficient(mock_get_client, mock_ctx_len):
    """No warning when aux model context >= main model threshold."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    # threshold = 100,000 — aux has 200,000 (sufficient)
    mock_client = MagicMock()
    mock_client.base_url = "https://openrouter.ai/api/v1"
    mock_client.api_key = "sk-aux"
    mock_get_client.return_value = (mock_client, "google/gemini-2.5-flash")

    messages = []
    agent._emit_status = lambda msg: messages.append(msg)

    agent._check_compression_model_feasibility()

    assert len(messages) == 0
    assert agent._compression_warning is None


def test_feasibility_check_passes_live_main_runtime():
    """Compression feasibility should probe using the live session runtime."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    agent.model = "gpt-5.4"
    agent.provider = "openai-codex"
    agent.base_url = "https://chatgpt.com/backend-api/codex"
    agent.api_key = "codex-token"
    agent.api_mode = "codex_responses"

    mock_client = MagicMock()
    mock_client.base_url = "https://chatgpt.com/backend-api/codex"
    mock_client.api_key = "codex-token"

    with patch("agent.auxiliary_client.get_text_auxiliary_client", return_value=(mock_client, "gpt-5.4")) as mock_get_client, \
         patch("agent.model_metadata.get_model_context_length", return_value=200_000):
        agent._emit_status = lambda msg: None
        agent._check_compression_model_feasibility()

    mock_get_client.assert_called_once_with(
        "compression",
        main_runtime={
            "model": "gpt-5.4",
            "provider": "openai-codex",
            "base_url": "https://chatgpt.com/backend-api/codex",
            "api_key": "codex-token",
            "api_mode": "codex_responses",
        },
    )


@patch("agent.model_metadata.get_model_context_length", return_value=1_000_000)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_feasibility_check_passes_config_context_length(mock_get_client, mock_ctx_len):
    """auxiliary.compression.context_length from config is forwarded to
    get_model_context_length so custom endpoints that lack /models still
    report the correct context window (fixes #8499)."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.85)
    agent._aux_compression_context_length_config = 1_000_000
    mock_client = MagicMock()
    mock_client.base_url = "http://custom-endpoint:8080/v1"
    mock_client.api_key = "sk-custom"
    mock_get_client.return_value = (mock_client, "custom/big-model")

    agent._emit_status = lambda msg: None
    agent._check_compression_model_feasibility()

    mock_ctx_len.assert_called_once_with(
        "custom/big-model",
        base_url="http://custom-endpoint:8080/v1",
        api_key="sk-custom",
        config_context_length=1_000_000,
fix(compression): pass provider to context length resolver in feasibility check
_check_compression_model_feasibility calls get_model_context_length
without provider=, so Codex OAuth users get 1,050,000 (from models.dev
for 'openai') instead of the actual 272,000 limit. This happens because
_infer_provider_from_url maps chatgpt.com → 'openai' (not 'openai-codex'),
skipping the Codex-specific resolution branch entirely.
Result: compression threshold set at 85% of 1.05M = 892K — conversations
never trigger compression, the context grows unbounded, and when gateway
hygiene eventually forces compression, the Codex endpoint drops the
oversized streaming request ('peer closed connection without sending
complete message body').
Fix: forward self.provider to get_model_context_length so provider-
specific resolution branches (Codex OAuth 272K, Copilot live /models,
Nous suffix-match) fire correctly.
Reported by user on GPT 5.5 via Codex OAuth Pro (paste.rs/vsra3).
2026-04-25 19:26:26 +05:30
        provider="openrouter",
    )


@patch("agent.model_metadata.get_model_context_length", return_value=128_000)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_feasibility_check_ignores_invalid_context_length(mock_get_client, mock_ctx_len):
    """Non-integer context_length in config is silently ignored."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    agent._aux_compression_context_length_config = None
    mock_client = MagicMock()
    mock_client.base_url = "http://custom:8080/v1"
    mock_client.api_key = "sk-test"
    mock_get_client.return_value = (mock_client, "custom/model")

    agent._emit_status = lambda msg: None
    agent._check_compression_model_feasibility()

    mock_ctx_len.assert_called_once_with(
        "custom/model",
        base_url="http://custom:8080/v1",
        api_key="sk-test",
        config_context_length=None,
        provider="openrouter",
    )


def test_init_feasibility_check_uses_aux_context_override_from_config():
    """Real AIAgent init should cache and forward auxiliary.compression.context_length."""

    class _StubCompressor:
        def __init__(self, *args, **kwargs):
            self.context_length = 200_000
            self.threshold_tokens = 100_000
            self.threshold_percent = 0.50

        def get_tool_schemas(self):
            return []

        def on_session_start(self, *args, **kwargs):
            return None

    cfg = {
        "auxiliary": {
            "compression": {
                "context_length": 1_000_000,
            },
        },
    }
    mock_client = MagicMock()
    mock_client.base_url = "http://custom-endpoint:8080/v1"
    mock_client.api_key = "sk-custom"

    with (
        patch("hermes_cli.config.load_config", return_value=cfg),
        patch("run_agent.get_tool_definitions", return_value=[]),
        patch("run_agent.check_toolset_requirements", return_value={}),
        patch("run_agent.OpenAI"),
        patch("run_agent.ContextCompressor", new=_StubCompressor),
        patch("agent.auxiliary_client.get_text_auxiliary_client", return_value=(mock_client, "custom/big-model")),
        patch("agent.model_metadata.get_model_context_length", return_value=1_000_000) as mock_ctx_len,
    ):
        agent = AIAgent(
            api_key="test-key-1234567890",
            base_url="https://openrouter.ai/api/v1",
            quiet_mode=True,
            skip_context_files=True,
            skip_memory=True,
        )

    assert agent._aux_compression_context_length_config == 1_000_000
    mock_ctx_len.assert_called_once_with(
        "custom/big-model",
        base_url="http://custom-endpoint:8080/v1",
        api_key="sk-custom",
        config_context_length=1_000_000,
        provider="",
    )


@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_warns_when_no_auxiliary_provider(mock_get_client):
    """Warning emitted when no auxiliary provider is configured."""
    agent = _make_agent()
    mock_get_client.return_value = (None, None)

    messages = []
    agent._emit_status = lambda msg: messages.append(msg)

    agent._check_compression_model_feasibility()

    assert len(messages) == 1
    assert "No auxiliary LLM provider" in messages[0]
    assert agent._compression_warning is not None
|
|
|
|
|
|
|
|
|
|
|
|
|
|


def test_skips_check_when_compression_disabled():
    """No check performed when compression is disabled."""
    agent = _make_agent(compression_enabled=False)

    messages = []
    agent._emit_status = lambda msg: messages.append(msg)

    agent._check_compression_model_feasibility()

    assert len(messages) == 0
    assert agent._compression_warning is None


@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_exception_does_not_crash(mock_get_client):
    """Exceptions in the check are caught — never blocks startup."""
    agent = _make_agent()
    mock_get_client.side_effect = RuntimeError("boom")

    messages = []
    agent._emit_status = lambda msg: messages.append(msg)

    # Should not raise
    agent._check_compression_model_feasibility()

    # No user-facing message (error is debug-logged)
    assert len(messages) == 0


@patch("agent.model_metadata.get_model_context_length", return_value=100_000)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_exact_threshold_boundary_no_warning(mock_get_client, mock_ctx_len):
    """No warning when aux context exactly equals the threshold."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    mock_client = MagicMock()
    mock_client.base_url = "https://openrouter.ai/api/v1"
    mock_client.api_key = "sk-aux"
    mock_get_client.return_value = (mock_client, "test-model")

    messages = []
    agent._emit_status = lambda msg: messages.append(msg)

    agent._check_compression_model_feasibility()

    assert len(messages) == 0


@patch("agent.model_metadata.get_model_context_length", return_value=99_999)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_just_below_threshold_auto_corrects(mock_get_client, mock_ctx_len):
    """Auto-correct fires when aux context is one token below the threshold
    (and above the 64K hard floor)."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    mock_client = MagicMock()
    mock_client.base_url = "https://openrouter.ai/api/v1"
    mock_client.api_key = "sk-aux"
    mock_get_client.return_value = (mock_client, "small-model")

    messages = []
    agent._emit_status = lambda msg: messages.append(msg)

    agent._check_compression_model_feasibility()

    assert len(messages) == 1
    assert "small-model" in messages[0]
    assert "Auto-lowered" in messages[0]
refactor(memory): remove flush_memories entirely (#15696)
The AIAgent.flush_memories pre-compression save, the gateway
_flush_memories_for_session, and everything feeding them are
obsolete now that the background memory/skill review handles
persistent memory extraction.
Problems with flush_memories:
- Pre-dates the background review loop. It was the only memory-save
path when introduced; the background review now fires every 10 user
turns on CLI and gateway alike, which is far more frequent than
compression or session reset ever triggered flush.
- Blocking and synchronous. Pre-compression flush ran on the live agent
before compression, blocking the user-visible response.
- Cache-breaking. Flush built a temporary conversation prefix
(system prompt + memory-only tool list) that diverged from the live
conversation's cached prefix, invalidating prompt caching. The
gateway variant spawned a fresh AIAgent with its own clean prompt
for each finalized session — still cache-breaking, just in a
different process.
- Redundant. Background review runs in the live conversation's
session context, gets the same content, writes to the same memory
store, and doesn't break the cache. Everything flush_memories
claimed to preserve is already covered.
What this removes:
- AIAgent.flush_memories() method (~248 LOC in run_agent.py)
- Pre-compression flush call in _compress_context
- flush_memories call sites in cli.py (/new + exit)
- GatewayRunner._flush_memories_for_session + _async_flush_memories
(and the 3 call sites: session expiry watcher, /new, /resume)
- 'flush_memories' entry from DEFAULT_CONFIG auxiliary tasks,
hermes tools UI task list, auxiliary_client docstrings
- _memory_flush_min_turns config + init
- #15631's headroom-deduction math in
_check_compression_model_feasibility (headroom was only needed
because flush dragged the full main-agent system prompt along;
the compression summariser sends a single user-role prompt so
new_threshold = aux_context is safe again)
- The dedicated test files and assertions that exercised
flush-specific paths
What this renames (with read-time backcompat on sessions.json):
- SessionEntry.memory_flushed -> SessionEntry.expiry_finalized.
The session-expiry watcher still uses the flag to avoid re-running
finalize/eviction on the same expired session; the new name
reflects what it now actually gates. from_dict() reads
'expiry_finalized' first, falls back to the legacy 'memory_flushed'
key so existing sessions.json files upgrade seamlessly.
Supersedes #15631 and #15638.
Tested: 383 targeted tests pass across run_agent/, agent/, cli/,
and gateway/ session-boundary suites. No behavior regressions —
background memory review continues to handle persistent memory
extraction on both CLI and gateway.
2026-04-25 08:21:14 -07:00
    assert agent.context_compressor.threshold_tokens == 99_999
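The refactor message above reduces the auto-lower rule to `new_threshold = aux_context`. A hedged sketch of that rule, standalone for illustration: the 64K hard floor and the clamping behavior are assumptions inferred from this file's test names, not confirmed source.

```python
HARD_FLOOR = 64_000  # assumed constant; the tests only imply a "64K hard floor"


def auto_lower_threshold(threshold_tokens: int, aux_context: int) -> int:
    """Return the (possibly lowered) compression threshold.

    With flush_memories gone, the summariser sends a single user-role
    prompt, so no headroom deduction is needed: the new threshold can
    simply be the auxiliary model's full context length.
    """
    if aux_context >= threshold_tokens:
        return threshold_tokens  # aux model is big enough; keep as-is
    return max(aux_context, HARD_FLOOR)
```

Under these assumptions, a 200K main model with a 50% threshold (100,000 tokens) paired with a 99,999-token auxiliary model auto-lowers to 99,999, matching `test_just_below_threshold_auto_corrects` above.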
# ── Two-phase: __init__ + run_conversation replay ───────────────────
@patch("agent.model_metadata.get_model_context_length", return_value=80_000)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_warning_stored_for_gateway_replay(mock_get_client, mock_ctx_len):
    """__init__ stores the warning; _replay sends it through status_callback."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    mock_client = MagicMock()
    mock_client.base_url = "https://openrouter.ai/api/v1"
    mock_client.api_key = "sk-aux"
    mock_get_client.return_value = (mock_client, "google/gemini-3-flash-preview")

    # Phase 1: __init__ — _emit_status prints (CLI) but callback is None
    vprint_messages = []
    agent._emit_status = lambda msg: vprint_messages.append(msg)
    agent._check_compression_model_feasibility()

    assert len(vprint_messages) == 1  # CLI got it
    assert agent._compression_warning is not None  # stored for replay

    # Phase 2: gateway wires callback post-init, then run_conversation replays
    callback_events = []
    agent.status_callback = lambda ev, msg: callback_events.append((ev, msg))
    agent._replay_compression_warning()

    assert any(
        ev == "lifecycle" and "Auto-lowered" in msg
        for ev, msg in callback_events
    )


@patch("agent.model_metadata.get_model_context_length", return_value=200_000)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_no_replay_when_no_warning(mock_get_client, mock_ctx_len):
    """_replay_compression_warning is a no-op when there's no stored warning."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    mock_client = MagicMock()
    mock_client.base_url = "https://openrouter.ai/api/v1"
    mock_client.api_key = "sk-aux"
    mock_get_client.return_value = (mock_client, "big-model")

    agent._emit_status = lambda msg: None
    agent._check_compression_model_feasibility()

    assert agent._compression_warning is None

    callback_events = []
    agent.status_callback = lambda ev, msg: callback_events.append((ev, msg))
    agent._replay_compression_warning()

    assert len(callback_events) == 0
def test_replay_without_callback_is_noop():
    """_replay_compression_warning doesn't crash when status_callback is None."""
    agent = _make_agent()
    agent._compression_warning = "some warning"
    agent.status_callback = None

    # Should not raise
    agent._replay_compression_warning()

@patch("agent.model_metadata.get_model_context_length", return_value=80_000)
@patch("agent.auxiliary_client.get_text_auxiliary_client")
def test_run_conversation_clears_warning_after_replay(mock_get_client, mock_ctx_len):
    """After replay in run_conversation, _compression_warning is cleared
    so the warning is not sent again on subsequent turns."""
    agent = _make_agent(main_context=200_000, threshold_percent=0.50)
    mock_client = MagicMock()
    mock_client.base_url = "https://openrouter.ai/api/v1"
    mock_client.api_key = "sk-aux"
    mock_get_client.return_value = (mock_client, "small-model")

    agent._emit_status = lambda msg: None
    agent._check_compression_model_feasibility()

    assert agent._compression_warning is not None

    # Simulate what run_conversation does
    callback_events = []
    agent.status_callback = lambda ev, msg: callback_events.append((ev, msg))
    if agent._compression_warning:
        agent._replay_compression_warning()
        agent._compression_warning = None  # as in run_conversation

    assert len(callback_events) == 1

    # Second turn — nothing replayed
    callback_events.clear()
    if agent._compression_warning:
        agent._replay_compression_warning()
        agent._compression_warning = None

    assert len(callback_events) == 0