---
sidebar_position: 9
title: Voice & TTS
description: Text-to-speech and voice message transcription across all platforms
---
# Voice & TTS
Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.
:::tip Nous Subscribers
If you have a paid Nous Portal subscription, OpenAI TTS is available through the Tool Gateway without a separate OpenAI API key. Run `hermes model` or `hermes tools` to enable it.
:::
## Text-to-Speech
Convert text to speech with nine providers:
| Provider | Quality | Cost | API Key |
|---|---|---|---|
| Edge TTS (default) | Good | Free | None needed |
| ElevenLabs | Excellent | Paid | `ELEVENLABS_API_KEY` |
| OpenAI TTS | Good | Paid | `VOICE_TOOLS_OPENAI_KEY` |
| MiniMax TTS | Excellent | Paid | `MINIMAX_API_KEY` |
| Mistral (Voxtral TTS) | Excellent | Paid | `MISTRAL_API_KEY` |
| Google Gemini TTS | Excellent | Free tier | `GEMINI_API_KEY` |
| xAI TTS | Excellent | Paid | `XAI_API_KEY` |
| NeuTTS | Good | Free (local) | None needed |
| KittenTTS | Good | Free (local) | None needed |
### Platform Delivery
| Platform | Delivery | Format |
|---|---|---|
| Telegram | Voice bubble (plays inline) | Opus `.ogg` |
| Discord | Voice bubble (Opus/OGG), falls back to file attachment | Opus/MP3 |
| WhatsApp / Slack / Signal | Audio file attachment | MP3 |
| CLI | Saved to `~/.hermes/audio_cache/` | MP3 |
### Configuration
```yaml
# In ~/.hermes/config.yaml
tts:
  provider: "edge"  # "edge" | "elevenlabs" | "openai" | "minimax" | "mistral" | "gemini" | "xai" | "neutts" | "kittentts"
  speed: 1.0  # Global speed multiplier (provider-specific settings override this)
  edge:
    voice: "en-US-AriaNeural"  # 322 voices, 74 languages
    speed: 1.0  # Converted to rate percentage (+/-%)
  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"  # Adam
    model_id: "eleven_multilingual_v2"
  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy"  # alloy, echo, fable, onyx, nova, shimmer
    base_url: "https://api.openai.com/v1"  # Override for OpenAI-compatible TTS endpoints
    speed: 1.0  # 0.25 - 4.0
  minimax:
    model: "speech-2.8-hd"  # speech-2.8-hd (default), speech-2.8-turbo
    voice_id: "English_Graceful_Lady"  # See https://platform.minimax.io/faq/system-voice-id
    speed: 1  # 0.5 - 2.0
    vol: 1  # 0 - 10
    pitch: 0  # -12 - 12
  mistral:
    model: "voxtral-mini-tts-2603"
    voice_id: "c69964a6-ab8b-4f8a-9465-ec0925096ec8"  # Paul - Neutral (default)
  gemini:
    model: "gemini-2.5-flash-preview-tts"  # or gemini-2.5-pro-preview-tts
    voice: "Kore"  # 30 prebuilt voices: Zephyr, Puck, Kore, Enceladus, Gacrux, etc.
  xai:
    voice_id: "eve"  # xAI TTS voice (see https://docs.x.ai/docs/api-reference#tts)
    language: "en"  # ISO 639-1 code
    sample_rate: 24000  # 22050 / 24000 (default) / 44100 / 48000
    bit_rate: 128000  # MP3 bitrate; only applies when codec=mp3
    # base_url: "https://api.x.ai/v1"  # Override via XAI_BASE_URL env var
  neutts:
    ref_audio: ''
    ref_text: ''
    model: neuphonic/neutts-air-q4-gguf
    device: cpu
  kittentts:
    model: KittenML/kitten-tts-nano-0.8-int8  # 25MB int8; also: kitten-tts-micro-0.8 (41MB), kitten-tts-mini-0.8 (80MB)
    voice: Jasper  # Jasper, Bella, Luna, Bruno, Rosie, Hugo, Kiki, Leo
    speed: 1.0  # 0.5 - 2.0
    clean_text: true  # Expand numbers, currencies, units
```
**Speed control:** The global `tts.speed` value applies to all providers by default. Each provider can override it with its own `speed` setting (e.g., `tts.openai.speed: 1.5`), which takes precedence over the global value. Default is `1.0` (normal speed).
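For example, to slow all providers down slightly while keeping OpenAI faster (values here are illustrative):

```yaml
tts:
  speed: 0.9        # applies to every provider by default
  openai:
    speed: 1.5      # overrides the global value for OpenAI only
```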
### Telegram Voice Bubbles & ffmpeg
Telegram voice bubbles require Opus/OGG audio:
- OpenAI, ElevenLabs, and Mistral produce Opus natively — no extra setup
- Edge TTS (default), MiniMax TTS, and xAI TTS output MP3 and need ffmpeg to convert
- NeuTTS and KittenTTS output WAV and need ffmpeg to convert
- Google Gemini TTS outputs raw PCM and uses ffmpeg to encode Opus directly
```bash
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Fedora
sudo dnf install ffmpeg
```
Without ffmpeg, Edge TTS, MiniMax TTS, xAI TTS, NeuTTS, and KittenTTS audio is sent as regular audio files (playable, but shown as a rectangular player instead of a voice bubble).
:::tip
If you want voice bubbles without installing ffmpeg, switch to the OpenAI, ElevenLabs, or Mistral provider.
:::
### Custom Command Providers
If a TTS engine you want isn't natively supported (Piper, VoxCPM, MLX-Kokoro, XTTS CLI, a voice-cloning script, anything else that exposes a CLI), you can wire it in as a command-type provider without writing any Python. Hermes writes the input text to a temp UTF-8 file, runs your shell command, and reads the audio file the command produced.
Declare one or more providers under `tts.providers.<name>` and switch between them with `tts.provider: <name>` — the same way you switch between built-ins like `edge` and `openai`.
```yaml
tts:
  provider: piper-en  # pick any name under tts.providers
  providers:
    piper-en:
      type: command
      command: "piper -m ~/models/en_US-amy.onnx -f {output_path} < {input_path}"
      output_format: wav
    voxcpm:
      type: command
      command: "voxcpm --ref ~/voice.wav --text-file {input_path} --out {output_path}"
      output_format: mp3
      timeout: 180
      voice_compatible: true  # try to deliver as a Telegram voice bubble
    mlx-kokoro:
      type: command
      command: "python -m mlx_kokoro --in {input_path} --out {output_path} --voice {voice}"
      voice: af_sky
      output_format: wav
```
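The contract a command provider must satisfy is simple: read text from `{input_path}`, write audio to `{output_path}`, and exit zero. You can dry-run that contract outside Hermes by filling in the placeholders yourself; this sketch uses a stand-in text transformation in place of a real engine (swap the `tr` line for your actual TTS invocation, e.g. the `piper` call above):

```shell
# Simulate what Hermes does around a command provider:
# 1. write the input text to a temp file, 2. run the command,
# 3. check the output file exists and is non-empty.
in=/tmp/hermes-tts-in.txt
out=/tmp/hermes-tts-out.wav
printf 'hello from hermes' > "$in"

# Stand-in "engine"; a real provider would be e.g.:
#   piper -m ~/models/en_US-amy.onnx -f "$out" < "$in"
tr 'a-z' 'A-Z' < "$in" > "$out"

test -s "$out" && echo "provider produced output"
```

Running your command by hand like this is the quickest way to debug a template before pointing `tts.provider` at it.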
#### Placeholders
Your command template can reference these placeholders. Hermes substitutes them at render time and shell-quotes each value for the surrounding context (bare / single-quoted / double-quoted), so paths with spaces and other shell-sensitive characters are safe.
| Placeholder | Meaning |
|---|---|
| `{input_path}` | Path to the temp UTF-8 text file Hermes wrote |
| `{text_path}` | Alias for `{input_path}` |
| `{output_path}` | Path the command must write audio to |
| `{format}` | `mp3` / `wav` / `ogg` / `flac` |
| `{voice}` | `tts.providers.<name>.voice`, empty when unset |
| `{model}` | `tts.providers.<name>.model` |
| `{speed}` | Resolved speed multiplier (provider or global) |
Use `{{` and `}}` for literal braces.
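For instance, a hypothetical `mytts` CLI that takes a voice, a speed, and a format string containing literal braces could be templated like this (every flag and value here is illustrative):

```yaml
tts:
  providers:
    demo:
      type: command
      # {voice} and {speed} are substituted and shell-quoted;
      # {{s16}} survives substitution as the literal text {s16}
      command: "mytts --voice {voice} --speed {speed} --fmt 'pcm{{s16}}' < {input_path} > {output_path}"
      voice: narrator
      output_format: wav
```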
#### Optional keys
| Key | Default | Meaning |
|---|---|---|
| `timeout` | `120` | Seconds; the process tree is killed on expiry (Unix `killpg`, Windows `taskkill /T`). |
| `output_format` | `mp3` | One of `mp3` / `wav` / `ogg` / `flac`. Auto-inferred from the output extension if Hermes picks a path. |
| `voice_compatible` | `false` | When `true`, Hermes converts MP3/WAV output to Opus/OGG via ffmpeg so Telegram renders a voice bubble. |
| `max_text_length` | `5000` | Input is truncated to this length before rendering the command. |
| `voice` / `model` | empty | Passed to the command as placeholder values only. |
#### Behavior notes
- **Built-in names always win.** A `tts.providers.openai` entry never shadows the native OpenAI provider, so no user config can silently replace a built-in.
- **Default delivery is a regular audio attachment.** Command providers deliver as regular audio attachments on every platform. Opt in to voice-bubble delivery per-provider with `voice_compatible: true`.
- **Command failures surface to the agent.** Non-zero exit, empty output, or timeout all return an error with the command's stderr/stdout included so you can debug the provider from the conversation.
- **`type: command` is the default when `command:` is set.** Writing `type: command` explicitly is good practice but not required; an entry with a non-empty `command` string is treated as a command provider.
- **`{input_path}` / `{text_path}` are interchangeable.** Use whichever reads better in your command.
#### Security
Command-type providers run whatever shell command you configure, with your user's permissions. Hermes quotes placeholder values and enforces the configured timeout, but the command template itself is trusted local input — treat it the same way you would a shell script on your `PATH`.
## Voice Message Transcription (STT)
Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.
| Provider | Quality | Cost | API Key |
|---|---|---|---|
| Local Whisper (default) | Good | Free | None needed |
| Groq Whisper API | Good–Best | Free tier | `GROQ_API_KEY` |
| OpenAI Whisper API | Good–Best | Paid | `VOICE_TOOLS_OPENAI_KEY` or `OPENAI_API_KEY` |
:::info Zero Config
Local transcription works out of the box when `faster-whisper` is installed. If that's unavailable, Hermes can also use a local `whisper` CLI from common install locations (like `/opt/homebrew/bin`) or a custom command via `HERMES_LOCAL_STT_COMMAND`.
:::
### Configuration
```yaml
# In ~/.hermes/config.yaml
stt:
  provider: "local"  # "local" | "groq" | "openai" | "mistral" | "xai"
  local:
    model: "base"  # tiny, base, small, medium, large-v3
  openai:
    model: "whisper-1"  # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
  mistral:
    model: "voxtral-mini-latest"  # voxtral-mini-latest, voxtral-mini-2602
  xai:
    model: "grok-stt"  # xAI Grok STT
```
### Provider Details
**Local (faster-whisper)** — Runs Whisper locally via `faster-whisper`. Uses CPU by default, GPU if available. Model sizes:
| Model | Size | Speed | Quality |
|---|---|---|---|
| `tiny` | ~75 MB | Fastest | Basic |
| `base` | ~150 MB | Fast | Good (default) |
| `small` | ~500 MB | Medium | Better |
| `medium` | ~1.5 GB | Slower | Great |
| `large-v3` | ~3 GB | Slowest | Best |
**Groq API** — Requires `GROQ_API_KEY`. Good cloud fallback when you want a free hosted STT option.

**OpenAI API** — Accepts `VOICE_TOOLS_OPENAI_KEY` first and falls back to `OPENAI_API_KEY`. Supports `whisper-1`, `gpt-4o-mini-transcribe`, and `gpt-4o-transcribe`.

**Mistral API (Voxtral Transcribe)** — Requires `MISTRAL_API_KEY`. Uses Mistral's Voxtral Transcribe models. Supports 13 languages, speaker diarization, and word-level timestamps. Install with `pip install hermes-agent[mistral]`.

**xAI Grok STT** — Requires `XAI_API_KEY`. Posts to `https://api.x.ai/v1/stt` as `multipart/form-data`. Good choice if you're already using xAI for chat or TTS and want one API key for everything. Auto-detection order puts it after Groq — explicitly set `stt.provider: xai` to force it.

**Custom local CLI fallback** — Set `HERMES_LOCAL_STT_COMMAND` if you want Hermes to call a local transcription command directly. The command template supports `{input_path}`, `{output_dir}`, `{language}`, and `{model}` placeholders.
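For example, wiring in whisper.cpp's CLI (the binary name, model path, and flags are assumptions about a local whisper.cpp install, not Hermes defaults):

```shell
# Hermes substitutes {input_path} and {output_dir} at transcription time.
export HERMES_LOCAL_STT_COMMAND='whisper-cli -m ~/models/ggml-base.bin -f {input_path} -otxt -of {output_dir}/transcript'
```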
### Fallback Behavior
If your configured provider isn't available, Hermes automatically falls back:
- Local faster-whisper unavailable → Tries a local `whisper` CLI or `HERMES_LOCAL_STT_COMMAND` before cloud providers
- Groq key not set → Falls back to local transcription, then OpenAI
- OpenAI key not set → Falls back to local transcription, then Groq
- Mistral key/SDK not set → Skipped in auto-detect; falls through to the next available provider
- Nothing available → Voice messages pass through with an accurate note to the user