Files
hermes-agent/scripts/out/E_no_tool_needed__enabled.json
teknium1 cd65d1d287 test(tool-search): add live end-to-end harness
Adds a real-model live test for the tool_search feature. Spins up a real
AIAgent against Claude Haiku 4.5 via OpenRouter, registers 20 fake MCP
tools with realistic shapes, runs 5 scenarios twice each (tool_search ON
and OFF), and records the full transcript per run.

Captures both the bridge call sequence the model emitted (tool_search /
tool_describe / tool_call) and the underlying tool calls that actually
executed through the registry. Records iteration count, elapsed time,
and final response for an A/B comparison.

Scenarios cover:
  A. Obvious single tool — direct keyword match
  B. Vague paraphrased intent — stress retrieval quality
  C. Multi-step chain — two deferred tools in sequence
  D. Mixed core + deferred — verify core tools (read_file) get called
     directly, not through tool_call
  E. No tool needed — verify no spurious tool_search invocations

Baseline run included in scripts/out/ for reference. All 10 runs
(5 scenarios x 2 modes) pass — every expected underlying tool was
invoked, no core tool was incorrectly routed through tool_call, no
tool name was hallucinated.

Round-trip cost observed: tool_search enabled added +3 to +4 model
round trips per task vs disabled. Single-tool tasks completed in ~16-20s
vs ~10-11s direct. Multi-tool tasks ~20s vs ~14s. The bridge overhead
is real and measurable but the task completion rate is identical.
2026-05-28 23:36:17 -07:00

15 lines
494 B
JSON

{
"scenario_id": "E_no_tool_needed",
"scenario_description": "Question doesn't need any tool \u2014 model should just answer",
"tool_search_enabled": true,
"model": "anthropic/claude-haiku-4.5 (via openrouter)",
"prompt": "What's 7 times 8? Answer with just the number.",
"expected_underlying_tools": [],
"n_fake_tools_registered": 20,
"elapsed_seconds": 8.25,
"bridge_calls": [],
"underlying_tool_calls": [],
"final_response": "56",
"n_iterations": 1,
"error": null
}