| sidebar_position | title | description |
|---|---|---|
| 9 | Voice & TTS | Text-to-speech and voice message transcription across all platforms |
Voice & TTS
Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.
Text-to-Speech
Convert text to speech with three providers:
| Provider | Quality | Cost | API Key |
|---|---|---|---|
| Edge TTS (default) | Good | Free | None needed |
| ElevenLabs | Excellent | Paid | ELEVENLABS_API_KEY |
| OpenAI TTS | Good | Paid | VOICE_TOOLS_OPENAI_KEY |
Platform Delivery
| Platform | Delivery | Format |
|---|---|---|
| Telegram | Voice bubble (plays inline) | Opus .ogg |
| Discord | Audio file attachment | MP3 |
| | Audio file attachment | MP3 |
| CLI | Saved to ~/.hermes/audio_cache/ | MP3 |
Configuration
# In ~/.hermes/config.yaml
tts:
  provider: "edge"              # "edge" | "elevenlabs" | "openai"
  edge:
    voice: "en-US-AriaNeural"   # 322 voices, 74 languages
  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"  # Adam
    model_id: "eleven_multilingual_v2"
  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy"              # alloy, echo, fable, onyx, nova, shimmer
Telegram Voice Bubbles & ffmpeg
Telegram voice bubbles require Opus/OGG audio format:
- OpenAI and ElevenLabs produce Opus natively — no extra setup
- Edge TTS (default) outputs MP3 and needs ffmpeg to convert:
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Fedora
sudo dnf install ffmpeg
Without ffmpeg, Edge TTS audio is sent as a regular audio file (playable, but shows as a rectangular player instead of a voice bubble).
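The conversion step and its fallback can be sketched roughly as follows. This is an illustrative sketch, not Hermes' actual implementation: the function names `build_opus_command` and `to_voice_bubble` are hypothetical, and the 64k bitrate is an assumed value. Only standard ffmpeg options (`-i`, `-c:a libopus`, `-b:a`, `-y`) are used.

```python
import shutil
import subprocess
from pathlib import Path


def build_opus_command(src: Path, dst: Path) -> list[str]:
    """Return ffmpeg arguments that re-encode src as Opus in an OGG container."""
    return [
        "ffmpeg",
        "-y",                # overwrite dst if it already exists
        "-i", str(src),      # input file, e.g. the MP3 produced by Edge TTS
        "-c:a", "libopus",   # encode the audio stream with the Opus codec
        "-b:a", "64k",       # assumed bitrate (not taken from Hermes)
        str(dst),
    ]


def to_voice_bubble(src: Path) -> Path:
    """Convert src to .ogg when ffmpeg is on PATH; otherwise return src unchanged."""
    if shutil.which("ffmpeg") is None:
        # No ffmpeg: the caller sends the MP3 as a regular audio file.
        return src
    dst = src.with_suffix(".ogg")
    subprocess.run(build_opus_command(src, dst), check=True, capture_output=True)
    return dst
```

Keeping the command construction separate from the subprocess call makes the arguments easy to inspect without ffmpeg installed.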
:::tip
If you want voice bubbles without installing ffmpeg, switch to the OpenAI or ElevenLabs provider.
:::
Voice Message Transcription (STT)
Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.
| Provider | Quality | Cost | API Key |
|---|---|---|---|
| Local Whisper (default) | Good | Free | None needed |
| OpenAI Whisper API | Good–Best | Paid | VOICE_TOOLS_OPENAI_KEY |
:::info Zero Config
Local transcription works out of the box, with no API key needed. The faster-whisper model (~150 MB for base) is downloaded automatically on the first voice message.
:::
Configuration
# In ~/.hermes/config.yaml
stt:
  provider: "local"           # "local" (free, faster-whisper) | "openai" (API)
  local:
    model: "base"             # tiny, base, small, medium, large-v3
  openai:
    model: "whisper-1"        # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
Provider Details
Local (faster-whisper) — Runs Whisper locally via faster-whisper. Uses CPU by default, GPU if available. Model sizes:
| Model | Size | Speed | Quality |
|---|---|---|---|
| tiny | ~75 MB | Fastest | Basic |
| base | ~150 MB | Fast | Good (default) |
| small | ~500 MB | Medium | Better |
| medium | ~1.5 GB | Slower | Great |
| large-v3 | ~3 GB | Slowest | Best |
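For readers who want to see what local transcription looks like in code, here is a minimal sketch using the faster-whisper library directly. The helper names `get_model` and `transcribe` are hypothetical and not necessarily how Hermes structures its backend; the idea shown is loading the model once and reusing it, so the download and load cost is paid only on the first call.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_model(size: str = "base"):
    """Load the Whisper model once and reuse the same instance on later calls."""
    # Imported lazily so this module still imports when faster-whisper is absent.
    from faster_whisper import WhisperModel

    # device="auto" lets the library pick a GPU when one is usable, else the CPU.
    return WhisperModel(size, device="auto")


def transcribe(path: str, size: str = "base") -> str:
    """Return the transcript of the audio file at path as a single string."""
    # transcribe() yields segments lazily; joining them runs the actual decoding.
    segments, _info = get_model(size).transcribe(path)
    return " ".join(segment.text.strip() for segment in segments)
```

With `maxsize=1`, switching to a different model size evicts the previously loaded model, which keeps memory use bounded at the cost of a reload.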
OpenAI API — Requires VOICE_TOOLS_OPENAI_KEY. Supports whisper-1, gpt-4o-mini-transcribe, and gpt-4o-transcribe.
Fallback Behavior
If your configured provider isn't available, Hermes automatically falls back:
- Local not installed → Falls back to OpenAI API (if key is set)
- OpenAI key not set → Falls back to local Whisper (if installed)
- Neither available → Voice messages pass through with a note to the user
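The fallback rules above amount to "try the configured provider first, then the other one". A minimal sketch, assuming availability is checked by looking for the `faster_whisper` package and the `VOICE_TOOLS_OPENAI_KEY` environment variable; the function names are hypothetical and the real dispatch logic may differ:

```python
import importlib.util
import os
from typing import Optional


def local_available() -> bool:
    """True when the faster-whisper package can be imported."""
    return importlib.util.find_spec("faster_whisper") is not None


def openai_available() -> bool:
    """True when an OpenAI key for voice tools is set and non-empty."""
    return bool(os.environ.get("VOICE_TOOLS_OPENAI_KEY"))


def resolve_provider(configured: str, has_local: bool, has_openai: bool) -> Optional[str]:
    """Pick the provider to use, or None when neither is available."""
    # The configured provider is tried first, then the other one.
    order = ["local", "openai"] if configured == "local" else ["openai", "local"]
    available = {"local": has_local, "openai": has_openai}
    for name in order:
        if available[name]:
            return name
    return None  # caller passes the voice message through with a note to the user
```

A caller would invoke it as `resolve_provider("local", local_available(), openai_available())`; passing the availability flags in as arguments keeps the decision logic testable without touching the environment.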