diff --git a/VISION.md b/VISION.md new file mode 100644 index 0000000000..a32a118ca6 --- /dev/null +++ b/VISION.md @@ -0,0 +1,75 @@ +# Hermes Agent — Vision Board & Roadmap + +A living brainstorming doc for features, ideas, and strategic direction. +Last updated: March 2, 2026 + +--- + +## Voice Mode + +**Inspiration:** Claude Code's /voice rollout (March 2026) — lets users talk +to the coding agent instead of typing, toggled with a slash command. + +### CLI UX (primary target) + +The voice mode lives inside the existing CLI terminal experience: + +1. **Activation:** User types `/voice` in the Hermes CLI to toggle voice on/off +2. **Status indicator:** A persistent banner appears at the top of the prompt + area: `Voice mode enabled — hold Space to speak` +3. **Push-to-talk:** User holds the Space bar to record. Releasing sends the + audio for transcription. The input prompt placeholder changes to guide: + `> hold space bar to speak` +4. **Transcription:** Speech is transcribed to text and submitted as a normal + user message — the agent processes it identically to typed input +5. **Agent response:** Text response streams to the terminal as usual. + Optionally, TTS can read the response aloud (we already have + text_to_speech). Could be a `/voice tts` sub-toggle. +6. **Deactivation:** `/voice` again to toggle off, returns to normal typing + +**Implementation notes:** +- Push-to-talk needs raw terminal/keyboard input (prompt_toolkit has key + binding support — we already use it for the CLI input) +- Audio capture via PyAudio or sounddevice, stream to STT provider +- Visual feedback while recording: waveform animation or pulsing indicator + in the terminal (could use rich/textual for this) +- Space bar hold must NOT conflict with normal typing when voice is off + +### Gateway Platforms + +- **Telegram:** Already receives voice messages natively — transcribe them + automatically with STT and process as text. Users already send voice + notes; we just need to handle the audio file. +- **Discord:** Similar — voice messages come as attachments, transcribe and + process +- **WhatsApp:** Voice notes are a primary interaction mode, same approach + +### Ideas + +- Agent can already do TTS output (text_to_speech tool exists) — pair with + voice input for a full conversational loop +- Latency matters — voice conversations feel bad above ~2s response time +- Could adjust system prompt in voice mode to be more concise/conversational +- Audio cues for tool call confirmations, errors, completion +- Streaming STT (transcribe while user is still speaking) for lower latency + +### Open Questions + +- Which STT provider? (Whisper local, Deepgram, AssemblyAI, etc.) + - Local Whisper = no API dependency but needs GPU for speed + - Deepgram/AssemblyAI = fast streaming, but adds a service dependency +- Should voice mode change the system prompt to be more conversational/concise? +- How to handle tool call confirmations in voice — audio cues? +- Do we want full duplex (agent can interrupt/be interrupted) or half-duplex? + +--- + +## Ideas Backlog + +*(New ideas get added here, then organized into sections as they mature)* + +--- + +## Shipped + +*(Track completed vision items here for posterity)*