diff --git a/website/docs/user-guide/features/api-server.md b/website/docs/user-guide/features/api-server.md new file mode 100644 index 00000000000..bab2f72fdb3 --- /dev/null +++ b/website/docs/user-guide/features/api-server.md @@ -0,0 +1,217 @@ +--- +sidebar_position: 14 +title: "API Server" +description: "Expose hermes-agent as an OpenAI-compatible API for any frontend" +--- + +# API Server + +The API server exposes hermes-agent as an OpenAI-compatible HTTP endpoint. Any frontend that speaks the OpenAI format — Open WebUI, LobeChat, LibreChat, NextChat, ChatBox, and hundreds more — can connect to hermes-agent and use it as a backend. + +Your agent handles requests with its full toolset (terminal, file operations, web search, memory, skills) and returns the final response. Tool calls execute invisibly server-side. + +## Quick Start + +### 1. Enable the API server + +Add to `~/.hermes/.env`: + +```bash +API_SERVER_ENABLED=true +``` + +### 2. Start the gateway + +```bash +hermes gateway +``` + +You'll see: + +``` +[API Server] API server listening on http://127.0.0.1:8642 +``` + +### 3. Connect a frontend + +Point any OpenAI-compatible client at `http://localhost:8642/v1`: + +```bash +# Test with curl +curl http://localhost:8642/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"model": "hermes-agent", "messages": [{"role": "user", "content": "Hello!"}]}' +``` + +Or connect Open WebUI, LobeChat, or any other frontend — see the [Open WebUI integration guide](/docs/user-guide/messaging/open-webui) for step-by-step instructions. + +## Endpoints + +### POST /v1/chat/completions + +Standard OpenAI Chat Completions format. Stateless — the full conversation is included in each request via the `messages` array. + +**Request:** +```json +{ + "model": "hermes-agent", + "messages": [ + {"role": "system", "content": "You are a Python expert."}, + {"role": "user", "content": "Write a fibonacci function"} + ], + "stream": false +} +``` + +**Response:** +```json +{ + "id": "chatcmpl-abc123", + "object": "chat.completion", + "created": 1710000000, + "model": "hermes-agent", + "choices": [{ + "index": 0, + "message": {"role": "assistant", "content": "Here's a fibonacci function..."}, + "finish_reason": "stop" + }], + "usage": {"prompt_tokens": 50, "completion_tokens": 200, "total_tokens": 250} +} +``` + +**Streaming** (`"stream": true`): Returns Server-Sent Events (SSE) with token-by-token response chunks. When streaming is enabled in config, tokens are emitted live as the LLM generates them. When disabled, the full response is sent as a single SSE chunk. + +### POST /v1/responses + +OpenAI Responses API format. Supports server-side conversation state via `previous_response_id` — the server stores full conversation history (including tool calls and results) so multi-turn context is preserved without the client managing it. 
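+
+As a quick client-side preview (the raw request/response shapes follow below), here is a minimal sketch using the official OpenAI Python SDK; it assumes a release with Responses API support (`openai>=1.66`) and a gateway running on the default port:
+
+```python
+from openai import OpenAI
+
+# Any placeholder api_key works when API_SERVER_KEY is unset
+client = OpenAI(base_url="http://localhost:8642/v1", api_key="your-secret-key")
+
+# First turn: store=True keeps the response server-side for later chaining
+first = client.responses.create(
+    model="hermes-agent",  # cosmetic; the real model is configured server-side
+    input="What files are in my project?",
+    instructions="You are a helpful coding assistant.",
+    store=True,
+)
+print(first.output_text)
+
+# Follow-up turn: chain to the stored response; the server replays the history
+followup = client.responses.create(
+    model="hermes-agent",
+    input="Now show me the README",
+    previous_response_id=first.id,
+)
+print(followup.output_text)
+```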
+
+**Request:**
+```json
+{
+  "model": "hermes-agent",
+  "input": "What files are in my project?",
+  "instructions": "You are a helpful coding assistant.",
+  "store": true
+}
+```
+
+**Response:**
+```json
+{
+  "id": "resp_abc123",
+  "object": "response",
+  "status": "completed",
+  "model": "hermes-agent",
+  "output": [
+    {"type": "function_call", "name": "terminal", "arguments": "{\"command\": \"ls\"}", "call_id": "call_1"},
+    {"type": "function_call_output", "call_id": "call_1", "output": "README.md src/ tests/"},
+    {"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Your project has..."}]}
+  ],
+  "usage": {"input_tokens": 50, "output_tokens": 200, "total_tokens": 250}
+}
+```
+
+#### Multi-turn with previous_response_id
+
+Chain responses to maintain full context (including tool calls) across turns:
+
+```json
+{
+  "input": "Now show me the README",
+  "previous_response_id": "resp_abc123"
+}
+```
+
+The server reconstructs the full conversation from the stored response chain — all previous tool calls and results are preserved.
+
+#### Named conversations
+
+Use the `conversation` parameter instead of tracking response IDs:
+
+```json
+{"input": "Hello", "conversation": "my-project"}
+{"input": "What's in src/?", "conversation": "my-project"}
+{"input": "Run the tests", "conversation": "my-project"}
+```
+
+The server automatically chains to the latest response in that conversation, similar to naming gateway sessions with the `/title` command.
+
+### GET /v1/responses/{id}
+
+Retrieve a previously stored response by ID.
+
+### DELETE /v1/responses/{id}
+
+Delete a stored response.
+
+### GET /v1/models
+
+Lists `hermes-agent` as an available model. Most frontends require this endpoint for model discovery.
+
+### GET /health
+
+Health check. Returns `{"status": "ok"}`.
+
+## System Prompt Handling
+
+When a frontend sends a `system` message (Chat Completions) or an `instructions` field (Responses API), hermes-agent **layers it on top** of its core system prompt. Your agent keeps all its tools, memory, and skills — the frontend's system prompt adds extra instructions.
+
+This means you can customize behavior per frontend without losing capabilities:
+- Open WebUI system prompt: "You are a Python expert. Always include type hints."
+- The agent still has terminal, file tools, web search, memory, etc.
+
+## Authentication
+
+The server accepts bearer token authentication via the `Authorization` header:
+
+```
+Authorization: Bearer your-secret-key
+```
+
+Configure the key with the `API_SERVER_KEY` environment variable. If no key is set, all requests are allowed (intended for local-only use).
+
+## Configuration
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `API_SERVER_ENABLED` | `false` | Enable the API server |
+| `API_SERVER_PORT` | `8642` | HTTP server port |
+| `API_SERVER_HOST` | `127.0.0.1` | Bind address (localhost only by default) |
+| `API_SERVER_KEY` | _(none)_ | Bearer token for auth |
+
+### config.yaml
+
+```yaml
+# Not yet supported — use environment variables.
+# config.yaml support coming in a future release.
+```
+
+## CORS
+
+The API server includes CORS headers on all responses (`Access-Control-Allow-Origin: *`), so browser-based frontends can connect directly.
+
+## Compatible Frontends
+
+Any frontend that supports the OpenAI API format works.
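+
+For example, the official OpenAI Python SDK connects with nothing more than a `base_url` override (a minimal sketch; the key must match `API_SERVER_KEY` if one is configured, otherwise any placeholder works):
+
+```python
+from openai import OpenAI
+
+# Point the SDK at the local hermes-agent gateway instead of api.openai.com
+client = OpenAI(base_url="http://localhost:8642/v1", api_key="your-secret-key")
+
+resp = client.chat.completions.create(
+    model="hermes-agent",
+    messages=[{"role": "user", "content": "Hello!"}],
+)
+# Tool calls already ran server-side; this is the final answer
+print(resp.choices[0].message.content)
+```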
+
+Tested and documented integrations:
+
+| Frontend | Stars | Connection |
+|----------|-------|------------|
+| [Open WebUI](/docs/user-guide/messaging/open-webui) | 126k | Full guide available |
+| LobeChat | 73k | Custom provider endpoint |
+| LibreChat | 34k | Custom endpoint in librechat.yaml |
+| AnythingLLM | 56k | Generic OpenAI provider |
+| NextChat | 87k | BASE_URL env var |
+| ChatBox | 39k | API Host setting |
+| Jan | 26k | Remote model config |
+| HF Chat-UI | 8k | OPENAI_BASE_URL |
+| big-AGI | 7k | Custom endpoint |
+| OpenAI Python SDK | — | `OpenAI(base_url="http://localhost:8642/v1")` |
+| curl | — | Direct HTTP requests |
+
+## Limitations
+
+- **Response storage is in-memory** — stored responses (for `previous_response_id`) are lost on gateway restart. At most 100 responses are kept (LRU eviction).
+- **No file upload** — vision/document analysis via uploaded files is not yet supported through the API.
+- **Model field is cosmetic** — the `model` field in requests is accepted, but the actual LLM model used is configured server-side in config.yaml.
diff --git a/website/docs/user-guide/features/streaming.md b/website/docs/user-guide/features/streaming.md
new file mode 100644
index 00000000000..624f3913901
--- /dev/null
+++ b/website/docs/user-guide/features/streaming.md
@@ -0,0 +1,210 @@
+---
+sidebar_position: 15
+title: "Streaming"
+description: "Token-by-token live response display across all platforms"
+---
+
+# Streaming Responses
+
+When enabled, hermes-agent streams LLM responses token-by-token instead of waiting for the full generation. Users see the response typing out live — the same experience as ChatGPT, Claude, or Gemini.
+
+Streaming is **disabled by default** and can be enabled globally or per-platform.
+
+## How It Works
+
+```
+LLM generates tokens → callback fires per token → queue → consumer displays
+
+Telegram/Discord/Slack:
+  Token arrives → Accumulate → Every 1.5s, edit the message with new text
+                               ▌ cursor
+  Done → Final edit removes cursor
+
+API Server:
+  Token arrives → SSE event sent to client immediately
+  Done → finish chunk + [DONE]
+```
+
+The agent's internal operation doesn't change — tools still execute normally, and memory and skills work as before. Streaming only affects how the **final text response** is delivered to the user.
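+
+On the API server side, any standard OpenAI streaming client can consume this. A minimal sketch, assuming the official OpenAI Python SDK and streaming enabled for `api_server`:
+
+```python
+from openai import OpenAI
+
+# Any placeholder api_key works when API_SERVER_KEY is unset
+client = OpenAI(base_url="http://localhost:8642/v1", api_key="your-secret-key")
+
+# stream=True asks the server for SSE chunks as tokens arrive
+stream = client.chat.completions.create(
+    model="hermes-agent",
+    messages=[{"role": "user", "content": "Summarize my TODO list"}],
+    stream=True,
+)
+for chunk in stream:
+    if chunk.choices and chunk.choices[0].delta.content:
+        print(chunk.choices[0].delta.content, end="", flush=True)
+print()
+```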
+
+## Enable Streaming
+
+### Option 1: Environment variable
+
+```bash
+# Enable for all platforms
+export HERMES_STREAMING_ENABLED=true
+hermes gateway
+```
+
+### Option 2: config.yaml
+
+```yaml
+streaming:
+  enabled: true  # Master switch
+```
+
+### Option 3: Per-platform
+
+```yaml
+streaming:
+  enabled: false     # Off by default
+  telegram: true     # But on for Telegram
+  discord: true      # And Discord
+  api_server: true   # And the API server
+```
+
+## Platform Support
+
+| Platform | Streaming Method | Rate Limit | Notes |
+|----------|-----------------|------------|-------|
+| **Telegram** | Progressive message editing | ~20 edits/min | 1.5s edit interval, ▌ cursor |
+| **Discord** | Progressive message editing | 5 edits/5s | 1.5s edit interval |
+| **Slack** | Progressive message editing | ~50 calls/min | 1.5s edit interval |
+| **API Server** | SSE (Server-Sent Events) | No limit | Real token-by-token events |
+| **WhatsApp** | ❌ Not supported | — | No message editing API |
+| **Home Assistant** | ❌ Not supported | — | No message editing API |
+| **CLI** | ❌ Not yet implemented | — | KawaiiSpinner provides feedback |
+
+Platforms without message-editing support automatically fall back to non-streaming (the response appears all at once, as before).
+
+## What Users See
+
+### Telegram/Discord/Slack
+
+1. Agent starts working (typing indicator shows)
+2. After ~20 tokens, a message appears with partial text and a ▌ cursor
+3. Every 1.5 seconds, the message is edited with more accumulated text
+4. When the response is complete, the cursor disappears
+
+Tool progress messages still work alongside streaming — tool names/previews appear as before, and the streamed response is shown in a separate message.
+
+### API Server (frontends like Open WebUI)
+
+When `stream: true` is set in the request, the API server returns Server-Sent Events:
+
+```
+data: {"choices":[{"delta":{"role":"assistant"}}]}
+
+data: {"choices":[{"delta":{"content":"Here"}}]}
+
+data: {"choices":[{"delta":{"content":" is"}}]}
+
+data: {"choices":[{"delta":{"content":" the"}}]}
+
+data: {"choices":[{"delta":{"content":" answer"}}]}
+
+data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
+
+data: [DONE]
+```
+
+Frontends like Open WebUI display this as live typing.
+
+## How It Works Internally
+
+### Architecture
+
+```
+┌─────────────┐  stream_callback(delta)     ┌──────────────────┐
+│  LLM API    │ ──────────────────────────► │  queue.Queue()   │
+│  (stream)   │   (runs in agent thread)    │  (thread-safe)   │
+└─────────────┘                             └────────┬─────────┘
+                                                     │
+                                      ┌──────────────┼──────────┐
+                                      │              │          │
+                                ┌─────▼─────┐  ┌─────▼────┐  ┌──▼──────┐
+                                │  Gateway  │  │ API Svr  │  │  CLI    │
+                                │  edit msg │  │ SSE evt  │  │ (TODO)  │
+                                └───────────┘  └──────────┘  └─────────┘
+```
+
+1. `AIAgent.__init__` accepts an optional `stream_callback` function
+2. When set, `_interruptible_api_call()` routes to `_run_streaming_chat_completion()` instead of the normal non-streaming path
+3. The streaming method calls the OpenAI API with `stream=True`, iterates over the chunks, and calls `stream_callback(delta_text)` for each text token
+4. Tool call deltas are accumulated silently (no streaming for tool arguments)
+5. When the stream ends, `stream_callback(None)` signals completion
+6. The method returns a fake response object compatible with the existing code path
+7. If streaming fails for any reason, the agent falls back to a normal non-streaming API call
+
+### Thread Safety
+
+The agent runs in a background thread (via `_interruptible_api_call`), while the consumer (gateway async task, API server SSE writer) runs in the main event loop. A `queue.Queue` bridges them — it's thread-safe by design.
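+
+A rough sketch of that bridge (illustrative only; `stream_callback` and `queue.Queue` are the pieces named above, while the helper names here are hypothetical):
+
+```python
+import asyncio
+import queue
+import threading
+
+token_queue: queue.Queue = queue.Queue()
+
+def stream_callback(delta):
+    """Runs in the agent thread; receives each text delta, then None at end-of-stream."""
+    token_queue.put(delta)
+
+async def consume_tokens():
+    """Runs in the main event loop; drains the thread-safe queue without blocking it."""
+    loop = asyncio.get_running_loop()
+    while True:
+        delta = await loop.run_in_executor(None, token_queue.get)
+        if delta is None:
+            break  # stream finished
+        print(delta, end="", flush=True)
+    print()
+
+def fake_agent():
+    """Stands in for the LLM streaming loop on the agent thread."""
+    for token in ("Hello", " from", " the", " agent"):
+        stream_callback(token)
+    stream_callback(None)  # completion signal, as described above
+
+threading.Thread(target=fake_agent).start()
+asyncio.run(consume_tokens())
+```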
+
+### Graceful Fallback
+
+If the LLM provider doesn't support `stream=True` or the streaming connection fails, the agent automatically falls back to a non-streaming API call. The user gets the response normally, just without the live typing effect. No error is shown.
+
+## Configuration Reference
+
+```yaml
+streaming:
+  enabled: false       # Master switch (default: off)
+
+  # Per-platform overrides (optional):
+  telegram: true       # Enable for Telegram
+  discord: true        # Enable for Discord
+  slack: true          # Enable for Slack
+  api_server: true     # Enable for API server
+
+  # Tuning (optional):
+  edit_interval: 1.5   # Seconds between message edits (default: 1.5)
+  min_tokens: 20       # Tokens before first display (default: 20)
+```
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `HERMES_STREAMING_ENABLED` | `false` | Master switch via env var |
+| `streaming.enabled` | `false` | Master switch via config |
+| `streaming.<platform>` | _(unset)_ | Per-platform override (e.g. `streaming.telegram`) |
+| `streaming.edit_interval` | `1.5` | Seconds between Telegram/Discord edits |
+| `streaming.min_tokens` | `20` | Minimum tokens before first message |
+
+## Interaction with Other Features
+
+### Tool Execution
+
+When the agent calls tools (terminal, file operations, web search, etc.), no text tokens are generated — tool arguments are accumulated silently. Tool progress messages continue to work as before. After tools finish, the next LLM call may produce the final text response, which streams normally.
+
+### Context Compression
+
+Compression happens between API calls, not during streaming. No interaction.
+
+### Interrupts
+
+If the user sends a new message while streaming, the agent is interrupted. The HTTP connection is closed (stopping token generation), accumulated text is shown as-is, and the new message is processed.
+
+### Prompt Caching
+
+Streaming doesn't affect prompt caching — the request is identical, just with `stream=True` added.
+
+### Responses API (Codex)
+
+The Codex/Responses API streaming path also supports the `stream_callback`. Token deltas from `response.output_text.delta` events are emitted via the callback.
+
+## Troubleshooting
+
+### Streaming isn't working
+
+1. Check the config: `streaming.enabled: true` in config.yaml or `HERMES_STREAMING_ENABLED=true`
+2. Check per-platform settings: `streaming.telegram: true` overrides the master switch
+3. Restart the gateway after changing config
+4. Check the logs for "Streaming failed, falling back", which indicates the provider may not support streaming
+
+### Response appears twice
+
+If you see the response both in a progressively edited message and as a separate final message, this is a bug. The streaming system should suppress the normal send when tokens were delivered via streaming. Please file an issue.
+
+### Messages update too slowly
+
+The default edit interval is 1.5 seconds (to respect platform rate limits). You can lower it in config:
+
+```yaml
+streaming:
+  edit_interval: 1.0  # Faster updates (may hit rate limits)
+```
+
+Going below 1.0s risks Telegram rate limiting (429 errors).
+
+### No streaming on WhatsApp/Home Assistant
+
+These platforms don't support message editing, so streaming automatically falls back to non-streaming. This is expected behavior.
diff --git a/website/docs/user-guide/messaging/index.md b/website/docs/user-guide/messaging/index.md index 8ff3a49e7bf..74b03361d48 100644 --- a/website/docs/user-guide/messaging/index.md +++ b/website/docs/user-guide/messaging/index.md @@ -1,12 +1,12 @@ --- sidebar_position: 1 title: "Messaging Gateway" -description: "Chat with Hermes from Telegram, Discord, Slack, WhatsApp, Signal, or Email — architecture and setup overview" +description: "Chat with Hermes from Telegram, Discord, Slack, WhatsApp, Signal, Email, or any OpenAI-compatible frontend — architecture and setup overview" --- # Messaging Gateway -Chat with Hermes from Telegram, Discord, Slack, WhatsApp, Signal, or Email. The gateway is a single background process that connects to all your configured platforms, handles sessions, runs cron jobs, and delivers voice messages. +Chat with Hermes from Telegram, Discord, Slack, WhatsApp, Signal, Email, or any OpenAI-compatible frontend (Open WebUI, LobeChat, etc.). The gateway is a single background process that connects to all your configured platforms, handles sessions, runs cron jobs, and delivers voice messages. ## Architecture @@ -15,12 +15,12 @@ Chat with Hermes from Telegram, Discord, Slack, WhatsApp, Signal, or Email. The │ Hermes Gateway │ ├─────────────────────────────────────────────────────────────────┤ │ │ -│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ ┌────────┐ ┌───────┐│ -│ │ Telegram │ │ Discord │ │ WhatsApp │ │ Slack │ │ Signal │ │ Email ││ -│ │ Adapter │ │ Adapter │ │ Adapter │ │Adapter │ │Adapter │ │Adapter││ -│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ └───┬────┘ └──┬────┘│ -│ │ │ │ │ │ │ │ -│ └─────────────┼────────────┼────────────┼──────────┼─────────┘ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ ┌────────┐ ┌───────┐ ┌──────────┐│ +│ │ Telegram │ │ Discord │ │ WhatsApp │ │ Slack │ │ Signal │ │ Email │ │API Server││ +│ │ Adapter │ │ Adapter │ │ Adapter │ │Adapter │ │Adapter │ │Adapter│ │ (OpenAI) ││ +│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ └───┬────┘ └──┬────┘ └────┬─────┘│ +│ │ │ │ │ │ │ │ │ +│ └─────────────┼────────────┼────────────┼──────────┼─────────┼───────────┘ │ │ │ │ │ ┌────────▼────────┐ │ │ │ Session Store │ │ @@ -204,6 +204,7 @@ Each platform has its own toolset: | Slack | `hermes-slack` | Full tools including terminal | | Signal | `hermes-signal` | Full tools including terminal | | Email | `hermes-email` | Full tools including terminal | +| API Server | `hermes-api_server` | Full tools including terminal | ## Next Steps @@ -213,3 +214,5 @@ Each platform has its own toolset: - [WhatsApp Setup](whatsapp.md) - [Signal Setup](signal.md) - [Email Setup](email.md) +- [Open WebUI Setup](open-webui.md) +- [API Server](../features/api-server.md)