fix: wire _ephemeral_max_output_tokens into chat_completions and add NVIDIA NIM default

Based on #12152 by @LVT382009.

Three fixes to run_agent.py:

1. _ephemeral_max_output_tokens consumption in the chat_completions path:
   The error-recovery ephemeral override was only consumed in the
   anthropic_messages branch of _build_api_kwargs.  All chat_completions
   providers (OpenRouter, NVIDIA NIM, Qwen, Alibaba, custom, etc.)
   silently ignored it.  It is now consumed at highest priority,
   matching the anthropic pattern (first sketch after this list).

2. NVIDIA NIM max_tokens default (16384):
   NVIDIA NIM falls back to a very low internal default when max_tokens
   is omitted, causing models like GLM-4.7 to truncate immediately
   (thinking tokens exhaust the budget before the response starts).
   See the second sketch after this list.

3. Progressive length-continuation boost:
   When finish_reason='length' triggers a continuation retry, the output
   budget now grows progressively (2x base on retry 1, 3x on retry 2,
   capped at 32768) via _ephemeral_max_output_tokens (third sketch
   after this list).  Previously the retry loop just re-sent the same
   token limit on all 3 attempts.
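
Sketch 1, for fix 1: how the one-shot override is consumed at highest
priority when building chat_completions kwargs.  A minimal sketch, not
the actual code; the helper name and the configured_max_tokens
parameter are hypothetical, only _ephemeral_max_output_tokens is from
this change:

    def apply_output_limit(agent, kwargs, configured_max_tokens):
        # One-shot error-recovery override wins over the configured
        # limit and is cleared on use, matching the anthropic_messages
        # branch.
        override = getattr(agent, "_ephemeral_max_output_tokens", None)
        if override is not None:
            kwargs["max_tokens"] = override
            agent._ephemeral_max_output_tokens = None
        elif configured_max_tokens is not None:
            kwargs["max_tokens"] = configured_max_tokens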
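
Sketch 2, for fix 2: the NVIDIA NIM fallback.  The provider key
"nvidia_nim" and the function name are assumptions for illustration;
the point is that the default only fills a gap and never clobbers an
explicit or ephemeral limit:

    NVIDIA_NIM_DEFAULT_MAX_TOKENS = 16384

    def apply_nim_default(provider, kwargs):
        # Without an explicit max_tokens, NIM applies a very low
        # internal default and thinking-heavy models truncate before
        # the response starts.
        if provider == "nvidia_nim" and "max_tokens" not in kwargs:
            kwargs["max_tokens"] = NVIDIA_NIM_DEFAULT_MAX_TOKENS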
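
Sketch 3, for fix 3: the progressive continuation boost.  Assumes a
retry_count that starts at 1 on the first length-continuation retry;
the function and constant names are hypothetical:

    CONTINUATION_CAP = 32768

    def boost_output_budget(agent, finish_reason, retry_count,
                            base_max_tokens):
        # Grow the budget for the next attempt: 2x base on retry 1,
        # 3x on retry 2, capped at 32768.  Consumed by the next
        # _build_api_kwargs call via the ephemeral override.
        if finish_reason == "length":
            agent._ephemeral_max_output_tokens = min(
                base_max_tokens * (retry_count + 1), CONTINUATION_CAP
            )
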
LVT382009
2026-04-18 22:49:30 +05:30
committed by Teknium
parent 0f778f7768
commit f7af90e2da
2 changed files with 20 additions and 1 deletion


@@ -267,6 +267,7 @@ AUTHOR_MAP = {
"aviralarora002@gmail.com": "AviArora02-commits",
"junminliu@gmail.com": "JimLiu",
"jarvischer@gmail.com": "maxchernin",
"levantam.98.2324@gmail.com": "LVT382009",
}