feat(stt): add free local whisper transcription via faster-whisper

Replace OpenAI-only STT with a dual-provider system mirroring the TTS
architecture (Edge TTS free / ElevenLabs paid):

  STT: faster-whisper local (free, default) / OpenAI Whisper API (paid)

Changes:
- tools/transcription_tools.py: Full rewrite with provider dispatch,
  config loading, local faster-whisper backend, and OpenAI API backend.
  Auto-downloads model (~150MB for 'base') on first voice message.
  Singleton model instance reused across calls.
- pyproject.toml: Add faster-whisper>=1.0.0 as core dependency
- hermes_cli/config.py: Expand stt config to match TTS pattern with
  provider selection and per-provider model settings
- agent/context_compressor.py: Fix .strip() crash when LLM returns
  non-string content (dict from llama.cpp, None). Fixes #1100 partially.
- tests/: 23 new tests for STT providers + 2 for compressor fix
- docs/: Updated Voice & TTS page with STT provider table, model sizes,
  config examples, and fallback behavior

Fallback behavior:
- Local not installed → OpenAI API (if key set)
- OpenAI key not set → local whisper (if installed)
- Neither → graceful error message to user

Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>
This commit is contained in:
teknium1
2026-03-13 09:07:09 -07:00
parent 2001b88c23
commit b430b5acfe
7 changed files with 532 additions and 137 deletions

View File

@@ -67,23 +67,48 @@ Without ffmpeg, Edge TTS audio is sent as a regular audio file (playable, but sh
If you want voice bubbles without installing ffmpeg, switch to the OpenAI or ElevenLabs provider.
:::
## Voice Message Transcription
## Voice Message Transcription (STT)
Voice messages sent on Telegram, Discord, WhatsApp, or Slack are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.
Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.
| Provider | Model | Quality | Cost |
|----------|-------|---------|------|
| **OpenAI Whisper** | `whisper-1` (default) | Good | Low |
| **OpenAI GPT-4o** | `gpt-4o-mini-transcribe` | Better | Medium |
| **OpenAI GPT-4o** | `gpt-4o-transcribe` | Best | Higher |
| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
| **Local Whisper** (default) | Good | Free | None needed |
| **OpenAI Whisper API** | GoodBest | Paid | `VOICE_TOOLS_OPENAI_KEY` |
Requires `VOICE_TOOLS_OPENAI_KEY` in `~/.hermes/.env`.
:::info Zero Config
Local transcription works out of the box — no API key needed. The `faster-whisper` model (~150 MB for `base`) is auto-downloaded on first voice message.
:::
### Configuration
```yaml
# In ~/.hermes/config.yaml
stt:
enabled: true
model: "whisper-1"
provider: "local" # "local" (free, faster-whisper) | "openai" (API)
local:
model: "base" # tiny, base, small, medium, large-v3
openai:
model: "whisper-1" # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
```
### Provider Details
**Local (faster-whisper)** — Runs Whisper locally via [faster-whisper](https://github.com/SYSTRAN/faster-whisper). Uses CPU by default, GPU if available. Model sizes:
| Model | Size | Speed | Quality |
|-------|------|-------|---------|
| `tiny` | ~75 MB | Fastest | Basic |
| `base` | ~150 MB | Fast | Good (default) |
| `small` | ~500 MB | Medium | Better |
| `medium` | ~1.5 GB | Slower | Great |
| `large-v3` | ~3 GB | Slowest | Best |
**OpenAI API** — Requires `VOICE_TOOLS_OPENAI_KEY`. Supports `whisper-1`, `gpt-4o-mini-transcribe`, and `gpt-4o-transcribe`.
### Fallback Behavior
If your configured provider isn't available, Hermes automatically falls back:
- **Local not installed** → Falls back to OpenAI API (if key is set)
- **OpenAI key not set** → Falls back to local Whisper (if installed)
- **Neither available** → Voice messages pass through with a note to the user