mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-19 08:30:48 +08:00
Adds a real-model live test for the tool_search feature. Spins up a real
AIAgent against Claude Haiku 4.5 via OpenRouter, registers 20 fake MCP
tools with realistic shapes, runs 5 scenarios twice each (tool_search ON
and OFF), and records the full transcript per run.
Captures both the bridge call sequence the model emitted (tool_search /
tool_describe / tool_call) and the underlying tool calls that actually
executed through the registry. Records iteration count, elapsed time,
and final response for an A/B comparison.
Scenarios cover:
  A. Obvious single tool: direct keyword match
  B. Vague paraphrased intent: stress retrieval quality
  C. Multi-step chain: two deferred tools in sequence
  D. Mixed core + deferred: verify core tools (read_file) get called
     directly, not through tool_call
  E. No tool needed: verify no spurious tool_search invocations
Baseline run included in scripts/out/ for reference. All 10 runs
(5 scenarios x 2 modes) pass — every expected underlying tool was
invoked, no core tool was incorrectly routed through tool_call, no
tool name was hallucinated.
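The three pass criteria above could be checked per run record roughly as
follows. This is only a sketch, not the script's actual code: the
top-level field names come from the recorded JSON, but the shape of each
bridge_calls / underlying_tool_calls entry (the "name", "bridge" and
"target" keys) and the CORE_TOOLS set are assumptions made for
illustration.

```python
from typing import Any

# Assumption: read_file is the only core tool named in the commit message.
CORE_TOOLS = {"read_file"}


def check_run(record: dict[str, Any], registered: set[str]) -> list[str]:
    """Return failure messages for one run; an empty list means it passed."""
    failures: list[str] = []
    # Assumed entry shape: {"name": "<tool name>", ...}
    executed = [call["name"] for call in record["underlying_tool_calls"]]

    # 1. Every expected underlying tool was invoked.
    for tool in record["expected_underlying_tools"]:
        if tool not in executed:
            failures.append(f"expected tool not invoked: {tool}")

    # 2. No core tool was routed through the tool_call bridge.
    # Assumed entry shape: {"bridge": "tool_call", "target": "<tool name>"}
    for call in record["bridge_calls"]:
        if call.get("bridge") == "tool_call" and call.get("target") in CORE_TOOLS:
            failures.append(f"core tool routed through tool_call: {call['target']}")

    # 3. No executed tool name falls outside the registered or core sets.
    known = registered | CORE_TOOLS
    for name in executed:
        if name not in known:
            failures.append(f"unknown tool name: {name}")
    return failures
```

A run passes when check_run returns an empty list; collecting messages
rather than raising on the first failure keeps all problems of a run
visible at once.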
Round-trip cost observed: enabling tool_search added 3 to 4 model round
trips per task compared with running it disabled. Single-tool tasks
completed in ~16-20s with the bridge vs ~10-11s direct; multi-tool tasks
took ~20s vs ~14s. The bridge overhead is real and measurable, but the
task completion rate is identical.
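The per-scenario overhead can be derived from a pair of run records with
a helper along these lines. A minimal sketch: the field names are taken
from the recorded JSON, while ab_overhead itself is a hypothetical helper
and treating n_iterations as the model round-trip count is an assumption.

```python
from typing import Any


def ab_overhead(on: dict[str, Any], off: dict[str, Any]) -> dict[str, Any]:
    """Compare one scenario's tool_search ON run against its OFF run."""
    if on["scenario_id"] != off["scenario_id"]:
        raise ValueError("records belong to different scenarios")
    if not on["tool_search_enabled"] or off["tool_search_enabled"]:
        raise ValueError("expected (ON, OFF) record order")
    return {
        "scenario_id": on["scenario_id"],
        # Assumption: n_iterations counts model round trips.
        "extra_iterations": on["n_iterations"] - off["n_iterations"],
        "extra_seconds": round(on["elapsed_seconds"] - off["elapsed_seconds"], 2),
    }
```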
15 lines
494 B
JSON
{
  "scenario_id": "E_no_tool_needed",
  "scenario_description": "Question doesn't need any tool \u2014 model should just answer",
  "tool_search_enabled": true,
  "model": "anthropic/claude-haiku-4.5 (via openrouter)",
  "prompt": "What's 7 times 8? Answer with just the number.",
  "expected_underlying_tools": [],
  "n_fake_tools_registered": 20,
  "elapsed_seconds": 8.25,
  "bridge_calls": [],
  "underlying_tool_calls": [],
  "final_response": "56",
  "n_iterations": 1,
  "error": null
}
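Records like the one above can be loaded and indexed for an ON/OFF
comparison with something like the following. This is a sketch under
assumptions: it supposes one .json file per run in the output directory,
and load_runs is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path
from typing import Any


def load_runs(out_dir: str) -> dict[tuple[str, bool], dict[str, Any]]:
    """Index every run record in out_dir by (scenario_id, tool_search_enabled)."""
    runs: dict[tuple[str, bool], dict[str, Any]] = {}
    # Assumption: each run is stored as its own .json file in out_dir.
    for path in sorted(Path(out_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        runs[(record["scenario_id"], record["tool_search_enabled"])] = record
    return runs
```

Keying on the (scenario_id, tool_search_enabled) pair makes it easy to
look up both halves of each A/B comparison.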