| sidebar_position | title | description |
|---|---|---|
| 15 | Streaming | Token-by-token live response display across all platforms |
Streaming Responses
When enabled, hermes-agent streams LLM responses token-by-token instead of waiting for the full generation. Users see the response typing out live — the same experience as ChatGPT, Claude, or Gemini.
Streaming is disabled by default and can be enabled globally or per-platform.
How It Works
LLM generates tokens → callback fires per token → queue → consumer displays

Telegram/Discord/Slack:
  Token arrives → Accumulate → Every 1.5s, edit the message with new text + ▌ cursor
  Done → Final edit removes cursor

API Server:
  Token arrives → SSE event sent to client immediately
  Done → finish chunk + [DONE]
The agent's internal operation doesn't change — tools still execute normally, memory and skills work as before. Streaming only affects how the final text response is delivered to the user.
Enable Streaming
Option 1: Environment variable
# Enable for all platforms
export HERMES_STREAMING_ENABLED=true
hermes gateway
Option 2: config.yaml
streaming:
  enabled: true    # Master switch
Option 3: Per-platform
streaming:
  enabled: false     # Off by default
  telegram: true     # But on for Telegram
  discord: true      # And Discord
  api_server: true   # And the API server
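The way these switches combine can be sketched in a few lines of Python. This is only an illustration, not the project's actual resolution code: the function name `streaming_enabled` is hypothetical, and the precedence shown (a per-platform key wins, otherwise either master switch turns streaming on) is an assumption based on the options above.

```python
import os


def streaming_enabled(config, platform, env=None):
    """Return True if streaming should be used for `platform`.

    Assumed precedence (not confirmed by the source):
      1. streaming.<platform>, if set to a boolean
      2. otherwise HERMES_STREAMING_ENABLED=true or streaming.enabled
    """
    env = os.environ if env is None else env
    section = config.get("streaming") or {}

    # A per-platform key overrides the master switch in either direction.
    override = section.get(platform)
    if isinstance(override, bool):
        return override

    env_on = env.get("HERMES_STREAMING_ENABLED", "").strip().lower() == "true"
    return env_on or bool(section.get("enabled", False))
```

With the Option 3 config above, this sketch returns True for telegram and False for slack, since slack has no override and the master switch is off.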
Platform Support
| Platform | Streaming Method | Rate Limit | Notes |
|---|---|---|---|
| Telegram | Progressive message editing | ~20 edits/min | 1.5s edit interval, ▌ cursor |
| Discord | Progressive message editing | 5 edits/5s | 1.5s edit interval |
| Slack | Progressive message editing | ~50 calls/min | 1.5s edit interval |
| API Server | SSE (Server-Sent Events) | No limit | Real token-by-token events |
| WhatsApp | ❌ Not supported | — | No message editing API |
| Home Assistant | ❌ Not supported | — | No message editing API |
| CLI | ❌ Not yet implemented | — | KawaiiSpinner provides feedback |
Platforms without message editing support automatically fall back to non-streaming (the response appears all at once, as before).
What Users See
Telegram/Discord/Slack
- Agent starts working (typing indicator shows)
- After ~20 tokens, a message appears with partial text and a ▌ cursor
- Every 1.5 seconds, the message is edited with more accumulated text
- When the response is complete, the cursor disappears
Tool progress messages still work alongside streaming — tool names/previews appear as before, and the streamed response is shown in a separate message.
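The display behaviour described in this section (wait for a minimum number of tokens, then edit at most once per interval, then drop the cursor) can be sketched as a small throttle. The class and method names are hypothetical, and the sketch decides on each token arrival whether an edit is due, which may differ from how the gateway actually schedules edits.

```python
CURSOR = "▌"


class EditThrottle:
    """Decides when accumulated text should be pushed as a message edit."""

    def __init__(self, edit_interval=1.5, min_tokens=20):
        self.edit_interval = edit_interval
        self.min_tokens = min_tokens
        self.text = ""
        self.tokens = 0
        self.last_edit = None  # None until the first message has been shown

    def feed(self, token, now):
        """Add a token; return text to display, or None if no edit is due."""
        self.text += token
        self.tokens += 1
        if self.last_edit is None:
            # First display waits for min_tokens.
            if self.tokens < self.min_tokens:
                return None
        elif now - self.last_edit < self.edit_interval:
            # Too soon since the previous edit.
            return None
        self.last_edit = now
        return self.text + CURSOR

    def finish(self):
        """Final edit: full text without the cursor."""
        return self.text
```

Passing the clock in as `now` keeps the sketch deterministic; a caller would typically supply `time.monotonic()`.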
API Server (frontends like Open WebUI)
When stream: true is set in the request, the API server returns Server-Sent Events:
data: {"choices":[{"delta":{"role":"assistant"}}]}
data: {"choices":[{"delta":{"content":"Here"}}]}
data: {"choices":[{"delta":{"content":" is"}}]}
data: {"choices":[{"delta":{"content":" the"}}]}
data: {"choices":[{"delta":{"content":" answer"}}]}
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Frontends like Open WebUI display this as live typing.
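For readers writing their own client, the event stream above can be reassembled into text with a few lines of Python. This is a minimal sketch that assumes the lines have already been read from the HTTP response; the helper name `collect_sse_text` is hypothetical.

```python
import json


def collect_sse_text(lines):
    """Join the content deltas from an iterable of SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

Fed the seven example lines above, the sketch yields the string "Here is the answer".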
How It Works Internally
Architecture
┌─────────────┐   stream_callback(delta)    ┌──────────────────┐
│   LLM API   │ ──────────────────────────► │  queue.Queue()   │
│  (stream)   │   (runs in agent thread)    │  (thread-safe)   │
└─────────────┘                             └────────┬─────────┘
                                                     │
                                      ┌──────────────┼──────────┐
                                      │              │          │
                                ┌─────▼─────┐  ┌─────▼────┐  ┌──▼──────┐
                                │  Gateway  │  │ API Svr  │  │   CLI   │
                                │ edit msg  │  │ SSE evt  │  │ (TODO)  │
                                └───────────┘  └──────────┘  └─────────┘
- AIAgent.__init__ accepts an optional stream_callback function
- When set, _interruptible_api_call() routes to _run_streaming_chat_completion() instead of the normal non-streaming path
- The streaming method calls the OpenAI API with stream=True, iterates chunks, and calls stream_callback(delta_text) for each text token
- Tool call deltas are accumulated silently (no streaming for tool arguments)
- When the stream ends, stream_callback(None) signals completion
- The method returns a fake response object compatible with the existing code path
- If streaming fails for any reason, it falls back to a normal non-streaming API call
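The callback contract in the list above can be illustrated with a small loop over chunks. This is a sketch under stated assumptions, not the body of _run_streaming_chat_completion(): chunks are modeled as plain dicts shaped like the SSE payloads shown earlier, and the function name `consume_stream` and its return shape are hypothetical.

```python
def consume_stream(chunks, stream_callback):
    """Forward text deltas to the callback; collect tool arguments silently."""
    text_parts = []
    tool_args = {}  # tool call index -> accumulated argument string

    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})

        content = delta.get("content")
        if content:
            text_parts.append(content)
            stream_callback(content)  # one call per text delta

        # Tool call fragments are stitched together but never streamed.
        for call in delta.get("tool_calls") or []:
            idx = call.get("index", 0)
            fragment = call.get("function", {}).get("arguments", "")
            tool_args[idx] = tool_args.get(idx, "") + fragment

    stream_callback(None)  # signals completion
    return {"content": "".join(text_parts), "tool_arguments": tool_args}
```

The returned dict stands in for the "fake response object" mentioned above; the real object presumably mirrors whatever the non-streaming path returns.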
Thread Safety
The agent runs in a background thread (via _interruptible_api_call). The consumer (gateway async task, API server SSE writer) runs in the main event loop. A queue.Queue bridges them — it's thread-safe by design.
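One way to bridge a producer thread and an asyncio consumer through queue.Queue is sketched below. The function names are hypothetical, and using `run_in_executor` for the blocking `get` is just one reasonable choice; the project's consumer may poll or use a different mechanism.

```python
import asyncio
import queue
import threading


def produce(tokens, q):
    """Stand-in for the agent thread: push tokens, then a None sentinel."""
    for tok in tokens:
        q.put(tok)
    q.put(None)


async def consume(q):
    """Stand-in for the gateway/SSE consumer running in the event loop."""
    loop = asyncio.get_running_loop()
    parts = []
    while True:
        # Run the blocking get() in a worker thread so the loop stays free.
        tok = await loop.run_in_executor(None, q.get)
        if tok is None:
            break
        parts.append(tok)
    return "".join(parts)


def run_demo(tokens):
    q = queue.Queue()
    worker = threading.Thread(target=produce, args=(tokens, q))
    worker.start()
    try:
        return asyncio.run(consume(q))
    finally:
        worker.join()
```

Because queue.Queue does its own locking, neither side needs an explicit lock around `put` or `get`.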
Graceful Fallback
If the LLM provider doesn't support stream=True or the streaming connection fails, the agent automatically falls back to a non-streaming API call. The user gets the response normally, just without the live typing effect. No error is shown.
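The fallback pattern can be sketched as a thin wrapper around two callables. The names `call_with_fallback`, `streaming_call`, and `plain_call` are hypothetical; only the log message text is taken from this document.

```python
import logging

logger = logging.getLogger("streaming_sketch")


def call_with_fallback(streaming_call, plain_call):
    """Try the streaming path first; on any error, use the plain path."""
    try:
        return streaming_call()
    except Exception as exc:
        # The failure is logged but not surfaced to the user.
        logger.warning("Streaming failed, falling back: %s", exc)
        return plain_call()
```

A fuller implementation would also need to decide what to do with text already displayed before the failure; this sketch does not address that.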
Configuration Reference
streaming:
  enabled: false        # Master switch (default: off)
  # Per-platform overrides (optional):
  telegram: true        # Enable for Telegram
  discord: true         # Enable for Discord
  slack: true           # Enable for Slack
  api_server: true      # Enable for API server
  # Tuning (optional):
  edit_interval: 1.5    # Seconds between message edits (default: 1.5)
  min_tokens: 20        # Tokens before first display (default: 20)
| Variable | Default | Description |
|---|---|---|
| HERMES_STREAMING_ENABLED | false | Master switch via env var |
| streaming.enabled | false | Master switch via config |
| streaming.<platform> | (unset) | Per-platform override |
| streaming.edit_interval | 1.5 | Seconds between Telegram/Discord edits |
| streaming.min_tokens | 20 | Minimum tokens before first message |
Interaction with Other Features
Tool Execution
When the agent calls tools (terminal, file operations, web search, etc.), no text tokens are generated — tool arguments are accumulated silently. Tool progress messages continue to work as before. After tools finish, the next LLM call may produce the final text response, which streams normally.
Context Compression
Compression happens between API calls, not during streaming. No interaction.
Interrupts
If the user sends a new message while streaming, the agent is interrupted. The HTTP connection is closed (stopping token generation), accumulated text is shown as-is, and the new message is processed.
Prompt Caching
Streaming doesn't affect prompt caching — the request is identical, just with stream=True added.
Responses API (Codex)
The Codex/Responses API streaming path also supports the stream_callback. Token deltas from response.output_text.delta events are emitted via the callback.
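A rough sketch of that routing follows, with events modeled as plain dicts. The helper name `forward_text_deltas` is hypothetical, and reading the text from a `delta` field is an assumption about the event shape rather than something this document specifies.

```python
def forward_text_deltas(events, stream_callback):
    """Pass text from response.output_text.delta events to the callback."""
    for event in events:
        if event.get("type") != "response.output_text.delta":
            continue  # ignore lifecycle and non-text events
        text = event.get("delta", "")
        if text:
            stream_callback(text)
    stream_callback(None)  # same completion signal as the chat path
```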
Troubleshooting
Streaming isn't working
- Check the config: streaming.enabled: true in config.yaml or HERMES_STREAMING_ENABLED=true
- Check per-platform: streaming.telegram: true overrides the master switch
- Restart the gateway after changing config
- Check logs for "Streaming failed, falling back", which indicates the provider may not support streaming
Response appears twice
If you see the response both in a progressively-edited message AND as a separate final message, this is a bug. The streaming system should suppress the normal send when tokens were delivered via streaming. Please file an issue.
Messages update too slowly
The default edit interval is 1.5 seconds (to respect platform rate limits). You can lower it in config:
streaming:
  edit_interval: 1.0    # Faster updates (may hit rate limits)
Going below 1.0s risks Telegram rate limiting (429 errors).
No streaming on WhatsApp/Home Assistant
These platforms don't support message editing, so streaming automatically falls back to non-streaming. This is expected behavior.