2026-03-05 05:24:55 -08:00
---
sidebar_position: 9
title: "Voice & TTS"
description: "Text-to-speech and voice message transcription across all platforms"
---
# Voice & TTS
Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.
## Text-to-Speech
Convert text to speech with three providers:
| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
| **Edge TTS ** (default) | Good | Free | None needed |
| **ElevenLabs ** | Excellent | Paid | `ELEVENLABS_API_KEY` |
| **OpenAI TTS ** | Good | Paid | `VOICE_TOOLS_OPENAI_KEY` |
### Platform Delivery
| Platform | Delivery | Format |
|----------|----------|--------|
| Telegram | Voice bubble (plays inline) | Opus `.ogg` |
| Discord | Audio file attachment | MP3 |
| WhatsApp | Audio file attachment | MP3 |
docs: comprehensive accuracy audit fixes (35+ corrections)
CRITICAL fixes:
- Installation: Remove false prerequisites (installer auto-installs everything except git)
- Tools: Remove non-existent 'web_crawl' tool from tools table
- Memory: Remove non-existent 'read' action (only add/replace/remove exist)
- Code execution: Fix 'search' to 'search_files' in sandbox tools list
- CLI commands: Fix --model/--provider/--toolsets/--verbose as chat subcommand flags
IMPORTANT fixes:
- Installation: Add missing installer features (Node.js, ripgrep, ffmpeg, skills seeding)
- Installation: Add 6 missing package extras to table (mcp, honcho, tts-premium, etc)
- Installation: Fix mkdir to include all directories the installer creates
- Quickstart: Add OpenAI Codex to provider table
- CLI: Fix all 'hermes --flag' to 'hermes chat --flag' across all docs
- Configuration: Remove non-existent --max-turns CLI flag
- Tools: Fix 'search' to 'search_files', add missing 'process' tool
- Skills: Remove skills_categories() (not a registered tool)
- Cron: Remove unsupported 'daily at 9am' schedule format
- TTS: Fix output directory to ~/.hermes/audio_cache/
- Delegation: Clarify depth limit wording
- Architecture: Fix default model, chat() signature, file names
- Contributing: Fix Python requirement from 3.11+ to 3.10+
- CLI reference: Add missing commands (login, tools, sessions subcommands)
- Env vars: Fix TERMINAL_DOCKER_IMAGE default, add HERMES_MODEL
2026-03-05 06:50:22 -08:00
| CLI | Saved to `~/.hermes/audio_cache/` | MP3 |
2026-03-05 05:24:55 -08:00
### Configuration
```yaml
# In ~/.hermes/config.yaml
tts:
provider: "edge" # "edge" | "elevenlabs" | "openai"
edge:
voice: "en-US-AriaNeural" # 322 voices, 74 languages
elevenlabs:
voice_id: "pNInz6obpgDQGcFmaJgB" # Adam
model_id: "eleven_multilingual_v2"
openai:
model: "gpt-4o-mini-tts"
voice: "alloy" # alloy, echo, fable, onyx, nova, shimmer
```
### Telegram Voice Bubbles & ffmpeg
Telegram voice bubbles require Opus/OGG audio format:
- **OpenAI and ElevenLabs** produce Opus natively — no extra setup
- **Edge TTS** (default) outputs MP3 and needs **ffmpeg ** to convert:
```bash
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Fedora
sudo dnf install ffmpeg
```
Without ffmpeg, Edge TTS audio is sent as a regular audio file (playable, but shows as a rectangular player instead of a voice bubble).
:::tip
If you want voice bubbles without installing ffmpeg, switch to the OpenAI or ElevenLabs provider.
:::
2026-03-13 09:07:09 -07:00
## Voice Message Transcription (STT)
2026-03-05 05:24:55 -08:00
2026-03-13 09:07:09 -07:00
Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.
2026-03-05 05:24:55 -08:00
2026-03-13 09:07:09 -07:00
| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
| **Local Whisper ** (default) | Good | Free | None needed |
| **OpenAI Whisper API ** | Good– Best | Paid | `VOICE_TOOLS_OPENAI_KEY` |
2026-03-05 05:24:55 -08:00
2026-03-13 09:07:09 -07:00
:::info Zero Config
Local transcription works out of the box — no API key needed. The `faster-whisper` model (~150 MB for `base` ) is auto-downloaded on first voice message.
:::
2026-03-05 05:24:55 -08:00
### Configuration
```yaml
# In ~/.hermes/config.yaml
stt:
2026-03-13 09:07:09 -07:00
provider: "local" # "local" (free, faster-whisper) | "openai" (API)
local:
model: "base" # tiny, base, small, medium, large-v3
openai:
model: "whisper-1" # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
2026-03-05 05:24:55 -08:00
```
2026-03-13 09:07:09 -07:00
### Provider Details
**Local (faster-whisper)** — Runs Whisper locally via [faster-whisper ](https://github.com/SYSTRAN/faster-whisper ). Uses CPU by default, GPU if available. Model sizes:
| Model | Size | Speed | Quality |
|-------|------|-------|---------|
| `tiny` | ~75 MB | Fastest | Basic |
| `base` | ~150 MB | Fast | Good (default) |
| `small` | ~500 MB | Medium | Better |
| `medium` | ~1.5 GB | Slower | Great |
| `large-v3` | ~3 GB | Slowest | Best |
**OpenAI API** — Requires `VOICE_TOOLS_OPENAI_KEY` . Supports `whisper-1` , `gpt-4o-mini-transcribe` , and `gpt-4o-transcribe` .
### Fallback Behavior
If your configured provider isn't available, Hermes automatically falls back:
- **Local not installed** → Falls back to OpenAI API (if key is set)
- **OpenAI key not set** → Falls back to local Whisper (if installed)
- **Neither available** → Voice messages pass through with a note to the user