hermes-agent/tests/run_agent/test_image_shrink_recovery.py
Teknium ec671c4154 feat(image-input): native multimodal routing based on model vision capability (#16506)
* feat(image-input): native multimodal routing based on model vision capability

Attach user-sent images as OpenAI-style content parts on the user turn when
the active model supports native vision, so vision-capable models see real
pixels instead of a lossy text description from vision_analyze.

Routing decision (agent/image_routing.py::decide_image_input_mode):

  agent.image_input_mode = auto | native | text  (default: auto)

In auto mode:
  - If auxiliary.vision.provider/model is explicitly configured, keep the
    text pipeline (user paid for a dedicated vision backend).
  - Else if models.dev reports supports_vision=True for the active
    provider/model, attach natively.
  - Else fall back to text (current behaviour).
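
The auto-mode rules above can be sketched as a small pure function. This is only an illustration: the function name, its parameters, and the string return values are assumptions, since the real signature of decide_image_input_mode in agent/image_routing.py is not shown here.

```python
from typing import Optional


def decide_image_input_mode_sketch(
    configured_mode: str,
    aux_vision_configured: bool,
    supports_vision: Optional[bool],
) -> str:
    """Return "native" or "text" for a user-sent image.

    Hypothetical helper: the three arguments stand in for
    agent.image_input_mode, an explicit auxiliary.vision.provider/model
    setting, and the models.dev supports_vision capability.
    """
    # Explicit native/text settings win over any detection.
    if configured_mode in ("native", "text"):
        return configured_mode
    # auto: a dedicated vision backend keeps the text pipeline.
    if aux_vision_configured:
        return "text"
    # auto: attach natively only when vision support is positively reported.
    if supports_vision is True:
        return "native"
    # auto: unknown or unsupported falls back to text.
    return "text"
```

Treating an unknown capability (None) the same as False keeps the fallback conservative, which matches the "else fall back to text" rule.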

Call sites updated: gateway/run.py (all messaging platforms), tui_gateway
(dashboard/Ink), cli.py (interactive /attach + drag-drop).

run_agent.py changes:
  - _prepare_anthropic_messages_for_api now passes image parts through
    unchanged when the model supports vision — the Anthropic adapter
    translates them to native image blocks. Previous behaviour
    (vision_analyze → text) only runs for non-vision Anthropic models.
  - New _prepare_messages_for_non_vision_model mirrors the same contract
    for chat.completions and codex_responses paths, so non-vision models
    on any provider get text-fallback instead of failing at the provider.
  - New _model_supports_vision() helper reads models.dev caps.

vision_analyze description rewritten: positions it as a tool for images
NOT already visible in the conversation (URLs, tool output, deeper
inspection). Prevents the model from redundantly calling it on images
already attached natively.

Config default: agent.image_input_mode = auto.

Tests: 35 new (test_image_routing.py + test_vision_aware_preprocessing.py),
all existing tests that reference _prepare_anthropic_messages_for_api
still pass (198 targeted + new tests green).

* feat(image-input): size-cap + resize oversized images, charge image tokens in compressor

Two follow-ups that make the native image routing safer for long / heavy
sessions:

1) Oversize handling in build_native_content_parts:
   - 20 MB ceiling per image (matches vision_tools._MAX_BASE64_BYTES,
     the most restrictive provider — Gemini inline data).
   - Delegates to vision_tools._resize_image_for_vision (Pillow-based,
     already battle-tested) to downscale to 5 MB first-try.
   - If Pillow is missing or resize still overshoots, the image is
     dropped and reported back in skipped[]; caller falls back to text
     enrichment for that image.

2) Image-token accounting in context_compressor:
   - New _IMAGE_TOKEN_ESTIMATE = 1600 (matches Claude Code's constant;
     within the realistic range for Anthropic/GPT-4o/Gemini billing).
   - _content_length_for_budget() helper: sums text-part lengths and
     charges _IMAGE_CHAR_EQUIVALENT (1600 * 4 chars) per image/image_url/
     input_image part.  Base64 payload inside image_url is NOT counted
     as chars — dimensions don't matter, only image-presence.
   - Both tail-cut sites (_prune_old_tool_results L527 and
     _find_tail_cut_by_tokens L1126) now call the helper so multi-image
     conversations don't slip past compression budget.
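
A minimal sketch of the flat per-image charge described above. The constant values come from this commit message; the function body, the set of part types, and the handling of non-list content are assumptions rather than the code in context_compressor.

```python
_IMAGE_TOKEN_ESTIMATE = 1600
_CHARS_PER_TOKEN = 4  # from the "1600 * 4 chars" figure above
_IMAGE_CHAR_EQUIVALENT = _IMAGE_TOKEN_ESTIMATE * _CHARS_PER_TOKEN
_IMAGE_PART_TYPES = {"image", "image_url", "input_image"}


def content_length_for_budget_sketch(content) -> int:
    """Character-equivalent length of one message's content.

    Text parts count by length; each image part costs a flat amount,
    so the size of any base64 payload never inflates the total.
    """
    if isinstance(content, str):
        return len(content)
    if not isinstance(content, list):
        return 0
    total = 0
    for part in content:
        if not isinstance(part, dict):
            continue
        if part.get("type") in _IMAGE_PART_TYPES:
            total += _IMAGE_CHAR_EQUIVALENT
            continue
        text = part.get("text")
        if isinstance(text, str):
            total += len(text)
    return total
```

With these constants, a message holding the text "hi" and one image is charged 2 + 6400 characters regardless of how large the embedded data URL is.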

Tests: 9 new in test_image_routing.py (oversize triggers resize,
resize-fails-returns-None, oversize-skipped-reported), 11 new in
test_compressor_image_tokens.py (flat charge per image, multiple images,
Responses-API / Anthropic-native / OpenAI-chat shapes, no-inflation on
raw base64, bounds-check on the constant, integration test that an
image-heavy tail actually gets trimmed).

* fix(image-input): replace blanket 20MB ceiling with empirically-verified per-provider limits

The previous commit imposed a hardcoded 20 MB base64 ceiling on all
providers, triggering auto-resize on anything larger. This was wrong in
both directions:

  * Too loose for Anthropic — actual limit is 5 MB (returns HTTP 400
    'image exceeds 5 MB maximum' above that).
  * Too strict for OpenAI / Codex / OpenRouter — accept 49 MB+ without
    complaint (empirically verified April 2026 with progressive PNG
    sizes).

New behaviour:

  * _PROVIDER_BASE64_CEILING table: only anthropic and bedrock have a
    ceiling (5 MB, since Claude on Bedrock shares Anthropic's decoder).
  * Providers NOT in the table get no ceiling — images attach at native
    size and we trust the provider to return its own error if it
    disagrees. A provider-specific 400 message is clearer than us
    guessing wrong and silently degrading image quality.
  * build_native_content_parts() gains a keyword-only provider arg;
    gateway/CLI/TUI pass the active provider so Anthropic users get
    auto-resize protection while OpenAI users don't pay it.
  * Resize target dropped from 5 MB to 4 MB to slide safely under
    Anthropic's boundary with header overhead.

Empirical measurements (direct API, no Hermes in the loop):

    image b64     anthropic   openrouter/gpt5.5   codex-oauth/gpt5.5
    0.19 MB       ✓           ✓                   ✓
    12.37 MB      ✗ 400 5MB   ✓                   ✓
    23.85 MB      ✗ 400 5MB   ✓                   ✓
    49.46 MB      ✗ 413       ✓                   ✓

Tests: rewrote TestOversizeHandling (5 tests): no-ceiling pass-through,
Anthropic resize fires, Anthropic skip on resize-fail, build_native_parts
routes ceiling by provider, unknown provider gets no ceiling. All 52
targeted tests pass.

* refactor(image-input): attempt native, shrink-and-retry on provider reject

Replace proactive per-provider size ceilings with a reactive shrink path
on the provider's actual rejection. All providers now attempt native
full-size attachment first; if the provider returns an image-too-large
error, the agent silently shrinks and retries once.

Why the previous design was wrong: hardcoding provider ceilings
(anthropic=5MB, others=unlimited) meant Anthropic users lost quality up
front on anything >5MB while OpenAI users on a 10MB image paid no tax,
even though handling the rejection at provider-reject time leads to the
same outcome (shrink + retry). Baking the table into the routing layer
also requires updating Hermes every time a provider's limit changes.

Reactive design:
  - image_routing.py: _file_to_data_url encodes native size, no ceiling.
    build_native_content_parts drops its provider kwarg.
  - error_classifier.py: new FailoverReason.image_too_large + pattern
    match ("image exceeds", "image too large", etc.) checked BEFORE
    context_overflow so Anthropic's 5MB rejection lands in the right
    bucket.
  - run_agent.py: new _try_shrink_image_parts_in_messages walks api
    messages in-place, re-encodes oversized data: URL image parts
    through vision_tools._resize_image_for_vision to fit under 4MB,
    handles both chat.completions (dict image_url) and Responses
    (string image_url) shapes, ignores http URLs (provider-fetched).
    New image_shrink_retry_attempted flag in the retry loop fires the
    shrink exactly once per turn after credential-pool recovery but
    before auth retries.
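
The classifier ordering and the payload rewriter described above might look roughly like this. It is a sketch under stated assumptions: the pattern tuples are illustrative, the function names are hypothetical, and the resize step is passed in as a callable that takes and returns a data URL, whereas the real helper delegates to vision_tools._resize_image_for_vision.

```python
from typing import Callable, Optional

# Assumption: illustrative patterns only, not the lists in error_classifier.py.
_IMAGE_TOO_LARGE_PATTERNS = ("image exceeds", "image too large")
_CONTEXT_OVERFLOW_PATTERNS = ("exceeds the limit", "prompt is too long")


def classify_sketch(message: str) -> str:
    """Check image patterns first so "image exceeds the limit" is not
    captured by the broader context-overflow patterns."""
    text = message.lower()
    if any(p in text for p in _IMAGE_TOO_LARGE_PATTERNS):
        return "image_too_large"
    if any(p in text for p in _CONTEXT_OVERFLOW_PATTERNS):
        return "context_overflow"
    return "unknown"


def shrink_image_parts_sketch(
    messages: Optional[list],
    resize: Callable[[str], Optional[str]],
    max_len: int = 4 * 1024 * 1024,
) -> bool:
    """Rewrite oversized data: URL image parts in place.

    Returns True if at least one part was replaced.
    """
    changed = False
    for msg in messages or []:
        content = msg.get("content") if isinstance(msg, dict) else None
        if not isinstance(content, list):
            continue
        for part in content:
            if not isinstance(part, dict):
                continue
            if part.get("type") not in ("image_url", "input_image"):
                continue
            ref = part.get("image_url")
            # chat.completions nests the URL in a dict; Responses uses a bare string.
            url = ref.get("url") if isinstance(ref, dict) else ref
            if not isinstance(url, str) or not url.startswith("data:"):
                continue  # http(s) URLs are fetched by the provider
            if len(url) <= max_len:
                continue  # already small enough
            smaller = resize(url)
            if not smaller or len(smaller) >= len(url):
                continue  # resize failed or grew the payload: keep the original
            if isinstance(ref, dict):
                ref["url"] = smaller
            else:
                part["image_url"] = smaller
            changed = True
    return changed
```

A caller in the retry loop would invoke the rewriter only when the classifier reports image_too_large, and only once per turn, guarded by a flag such as image_shrink_retry_attempted.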

E2E verified live against Anthropic claude-sonnet-4-6:
  - 17.9MB PNG (23.9MB b64) attached at native size
  - Anthropic returns 400 "image exceeds 5 MB maximum"
  - Agent logs '📐 Image(s) exceeded provider size limit — shrank and
    retrying...'
  - Retry succeeds, correct response delivered in 6.8s total.
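
The 17.9MB to 23.9MB figure above follows from base64's 4-bytes-out-per-3-bytes-in expansion. A short check of that arithmetic, with a hypothetical helper name:

```python
import base64


def base64_length(raw_bytes: int) -> int:
    """Length of standard padded base64 output for raw_bytes of input."""
    # Every 3 input bytes (rounded up) become 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


MIB = 1024 * 1024

# Sanity check against the standard library encoder.
assert base64_length(1000) == len(base64.b64encode(b"x" * 1000))

# A 17.9 MB file grows by a factor of about 4/3 once encoded.
encoded_mib = round(base64_length(int(17.9 * MIB)) / MIB, 1)
print(encoded_mib)
```

The printed value is 23.9, so a file well under 20 MB on disk is already far past a 5 MB limit once it is embedded as a data URL.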

Tests: 12 new (8 shrink-helper shapes + 4 classifier signals),
replaces 5 proactive-ceiling tests with 3 simpler 'native attach works'
tests. 181 targeted tests pass. test_enum_members_exist in
test_error_classifier.py updated for the new enum value.
2026-04-27 06:27:59 -07:00

278 lines
11 KiB
Python

"""Tests for reactive image-shrink recovery.
Covers the full chain for Anthropic's 5 MB per-image ceiling (and any
future provider that returns an image-too-large error):
1. agent/error_classifier.py: 400 with "image exceeds 5 MB maximum"
gets FailoverReason.image_too_large, not context_overflow.
2. run_agent._try_shrink_image_parts_in_messages mutates the API
payload in-place, re-encoding native data: URL image parts to fit
under 4 MB using vision_tools._resize_image_for_vision.
The end-to-end wiring in the retry loop is not unit-tested here — it's
covered by the live E2E in the PR description. These tests lock in the
two pieces that matter independently: the classifier signal and the
payload rewriter.
"""

from __future__ import annotations

import base64
from pathlib import Path

import pytest

from agent.error_classifier import FailoverReason, classify_api_error


class _FakeApiError(Exception):
    """Stand-in for an openai.BadRequestError with status_code + body."""

    def __init__(self, status_code: int, message: str, body: dict | None = None):
        super().__init__(message)
        self.status_code = status_code
        self.body = body or {"error": {"message": message}}
        self.response = None  # required by some code paths


# ─── Classifier ──────────────────────────────────────────────────────────────


class TestImageTooLargeClassification:
    def test_anthropic_400_image_exceeds_message(self):
        """Anthropic's exact wording must classify as image_too_large, not context."""
        err = _FakeApiError(
            status_code=400,
            message=(
                "messages.0.content.1.image.source.base64: image exceeds 5 MB "
                "maximum: 12966600 bytes > 5242880 bytes"
            ),
        )
        result = classify_api_error(err, provider="anthropic", model="claude-sonnet-4-6")
        assert result.reason == FailoverReason.image_too_large
        assert result.retryable is True

    def test_generic_image_too_large_no_status(self):
        """No status_code path: message text alone triggers classification."""
        err = Exception("image too large for this endpoint")
        result = classify_api_error(err, provider="some-provider", model="some-model")
        assert result.reason == FailoverReason.image_too_large
        assert result.retryable is True

    def test_image_too_large_not_confused_with_context_overflow(self):
        """'image exceeds' must NOT be mis-classified as context_overflow.

        The context_overflow patterns include 'exceeds the limit' which is a
        superstring risk: verify the image-too-large check fires first.
        """
        err = _FakeApiError(
            status_code=400,
            message="image exceeds the limit for this model",
        )
        result = classify_api_error(err, provider="anthropic", model="claude-sonnet-4-6")
        assert result.reason == FailoverReason.image_too_large

    def test_regular_context_overflow_unaffected(self):
        """Context-overflow errors without image keywords still classify correctly."""
        err = _FakeApiError(
            status_code=400,
            message="prompt is too long: context length 300000 exceeds max of 200000",
        )
        result = classify_api_error(err, provider="anthropic", model="claude-sonnet-4-6")
        assert result.reason == FailoverReason.context_overflow


# ─── Shrink helper ───────────────────────────────────────────────────────────


def _big_png_data_url(size_kb: int) -> str:
    """Build a data URL with a plausible large base64 payload."""
    # Use real PNG header so MIME detection works; fill to target size.
    raw = b"\x89PNG\r\n\x1a\n" + b"X" * (size_kb * 1024)
    return "data:image/png;base64," + base64.b64encode(raw).decode("ascii")


def _make_agent():
    """Build a bare AIAgent for method-level testing, no provider setup."""
    from run_agent import AIAgent

    agent = object.__new__(AIAgent)
    agent.provider = "anthropic"
    agent.model = "claude-sonnet-4-6"
    return agent


class TestShrinkImagePartsHelper:
    def test_no_messages_returns_false(self):
        agent = _make_agent()
        assert agent._try_shrink_image_parts_in_messages([]) is False
        assert agent._try_shrink_image_parts_in_messages(None) is False

    def test_no_image_parts_returns_false(self):
        agent = _make_agent()
        msgs = [
            {"role": "user", "content": "plain text"},
            {"role": "assistant", "content": "ack"},
        ]
        assert agent._try_shrink_image_parts_in_messages(msgs) is False

    def test_small_image_part_not_shrunk(self, monkeypatch):
        """An image under 4 MB is left alone; the shrink helper only touches oversized ones."""
        agent = _make_agent()
        small_url = _big_png_data_url(100)  # ~100 KB + b64 overhead
        resize_hits = {"count": 0}
        monkeypatch.setattr(
            "tools.vision_tools._resize_image_for_vision",
            lambda *a, **kw: resize_hits.__setitem__("count", resize_hits["count"] + 1) or small_url,
            raising=False,
        )
        msgs = [{
            "role": "user",
            "content": [
                {"type": "text", "text": "hi"},
                {"type": "image_url", "image_url": {"url": small_url}},
            ],
        }]
        assert agent._try_shrink_image_parts_in_messages(msgs) is False
        assert resize_hits["count"] == 0
        # URL unchanged.
        assert msgs[0]["content"][1]["image_url"]["url"] == small_url

    def test_oversized_image_url_dict_shape_rewritten(self, monkeypatch):
        """OpenAI chat.completions shape: {image_url: {url: data:...}}."""
        agent = _make_agent()
        oversized_url = _big_png_data_url(5000)  # ~5 MB raw → ~6.7 MB b64
        shrunk = "data:image/jpeg;base64," + "A" * 1000  # small

        def _fake_resize(path, mime_type=None, max_base64_bytes=None):
            return shrunk

        monkeypatch.setattr(
            "tools.vision_tools._resize_image_for_vision",
            _fake_resize,
            raising=False,
        )
        msgs = [{
            "role": "user",
            "content": [
                {"type": "text", "text": "look"},
                {"type": "image_url", "image_url": {"url": oversized_url}},
            ],
        }]
        changed = agent._try_shrink_image_parts_in_messages(msgs)
        assert changed is True
        assert msgs[0]["content"][1]["image_url"]["url"] == shrunk

    def test_oversized_input_image_string_shape_rewritten(self, monkeypatch):
        """OpenAI Responses shape: {type: input_image, image_url: "data:..."}."""
        agent = _make_agent()
        oversized_url = _big_png_data_url(5000)
        shrunk = "data:image/jpeg;base64," + "B" * 1000
        monkeypatch.setattr(
            "tools.vision_tools._resize_image_for_vision",
            lambda *a, **kw: shrunk,
            raising=False,
        )
        msgs = [{
            "role": "user",
            "content": [
                {"type": "input_text", "text": "look"},
                {"type": "input_image", "image_url": oversized_url},
            ],
        }]
        changed = agent._try_shrink_image_parts_in_messages(msgs)
        assert changed is True
        assert msgs[0]["content"][1]["image_url"] == shrunk

    def test_multiple_images_all_shrunk(self, monkeypatch):
        agent = _make_agent()
        big1 = _big_png_data_url(5000)
        big2 = _big_png_data_url(6000)
        shrunk = "data:image/jpeg;base64," + "C" * 500
        monkeypatch.setattr(
            "tools.vision_tools._resize_image_for_vision",
            lambda *a, **kw: shrunk,
            raising=False,
        )
        msgs = [{
            "role": "user",
            "content": [
                {"type": "text", "text": "compare"},
                {"type": "image_url", "image_url": {"url": big1}},
                {"type": "image_url", "image_url": {"url": big2}},
            ],
        }]
        changed = agent._try_shrink_image_parts_in_messages(msgs)
        assert changed is True
        assert msgs[0]["content"][1]["image_url"]["url"] == shrunk
        assert msgs[0]["content"][2]["image_url"]["url"] == shrunk

    def test_http_url_images_not_touched(self, monkeypatch):
        """Only data: URLs are candidates; http URLs are server-fetched."""
        agent = _make_agent()
        resize_hits = {"count": 0}
        monkeypatch.setattr(
            "tools.vision_tools._resize_image_for_vision",
            lambda *a, **kw: resize_hits.__setitem__("count", resize_hits["count"] + 1) or "shrunk",
            raising=False,
        )
        msgs = [{
            "role": "user",
            "content": [
                {"type": "text", "text": "at this url"},
                {"type": "image_url", "image_url": {"url": "https://example.com/big.png"}},
            ],
        }]
        assert agent._try_shrink_image_parts_in_messages(msgs) is False
        assert resize_hits["count"] == 0

    def test_shrink_failure_returns_false_and_leaves_url_intact(self, monkeypatch):
        """If re-encode fails, leave the URL alone so the caller surfaces the original error."""
        agent = _make_agent()
        oversized_url = _big_png_data_url(5000)
        monkeypatch.setattr(
            "tools.vision_tools._resize_image_for_vision",
            lambda *a, **kw: None,  # resize returned nothing usable
            raising=False,
        )
        msgs = [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": oversized_url}},
            ],
        }]
        assert agent._try_shrink_image_parts_in_messages(msgs) is False
        assert msgs[0]["content"][0]["image_url"]["url"] == oversized_url

    def test_shrink_that_makes_it_bigger_rejected(self, monkeypatch):
        """If the 'shrink' somehow produces a larger payload, skip it."""
        agent = _make_agent()
        oversized_url = _big_png_data_url(5000)
        even_bigger = "data:image/png;base64," + "Z" * (10 * 1024 * 1024)
        monkeypatch.setattr(
            "tools.vision_tools._resize_image_for_vision",
            lambda *a, **kw: even_bigger,
            raising=False,
        )
        msgs = [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": oversized_url}},
            ],
        }]
        assert agent._try_shrink_image_parts_in_messages(msgs) is False
        # Original URL still in place, not replaced by the bigger one.
        assert msgs[0]["content"][0]["image_url"]["url"] == oversized_url