Hi everyone,
I’d like to request official support or community development for integrating Microsoft VibeVoice into Home Assistant. VibeVoice is an open-source, real-time voice model that could dramatically improve local voice interactions in HA.
What is VibeVoice?
Microsoft VibeVoice is an open-source real-time voice model offering:
- Streaming voice input and output
- Fast, low-latency inference
- Support for wake-word workflows
- Multiple audio codecs
- On-device or server inference
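To make the streaming and low-latency points concrete, here is the basic chunking arithmetic for streaming PCM audio. The sample rate, bit depth, and chunk size below are illustrative assumptions, not VibeVoice specifics — check the model docs for its actual audio format:

```python
# Illustration only: the arithmetic behind low-latency streaming audio.
# Assumes 16 kHz mono 16-bit PCM and 20 ms chunks (assumed values,
# not taken from the VibeVoice documentation).

SAMPLE_RATE = 16_000      # samples per second
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHUNK_MS = 20             # chunk duration in milliseconds

samples_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000      # 320 samples
bytes_per_chunk = samples_per_chunk * BYTES_PER_SAMPLE  # 640 bytes

def real_time_factor(inference_ms_per_chunk: float) -> float:
    """RTF < 1 means each chunk is produced faster than it plays back,
    which is what keeps a streaming pipeline from falling behind."""
    return inference_ms_per_chunk / CHUNK_MS

print(samples_per_chunk, bytes_per_chunk, real_time_factor(5.0))
```

A model that synthesizes a 20 ms chunk in 5 ms has an RTF of 0.25 and can stream comfortably; a traditional batch STT + TTS pipeline only responds after whole utterances, which is where the latency gap comes from.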
Links:
- GitHub repo: [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)
- Realtime 0.5B docs: [vibevoice-realtime-0.5b.md](https://github.com/microsoft/VibeVoice/blob/d295d1e1d0fff1ad42bc0450d5b593f8e59356b9/docs/vibevoice-realtime-0.5b.md)
- Hugging Face model card: [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B)
Why this matters for Home Assistant
Home Assistant is pushing strongly into local voice (Assist, Piper, Whisper, etc.).
VibeVoice brings several advantages:
- Real-time voice-to-voice agent capability
- Very low latency compared to traditional STT + TTS pipelines
- Fully local, open-source, self-hostable
- Could work as a backend for Assist or as a new voice pipeline
- Supports streaming duplex audio, ideal for natural conversations
- Runs even on modest hardware (0.5B-parameter model)
This could massively improve responsiveness for HA Assist and give users a privacy-preserving alternative to cloud assistants.
Requested Integration Features
- Native integration as a Speech Pipeline backend
- Local inference through container / addon (Docker)
- Use as:
  - STT provider
  - TTS provider
  - Full duplex voice agent
- WebSocket or gRPC streaming support
- Configurable model path
- Wake-word compatibility
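As a sketch of what the streaming-support bullet could look like on the wire, here is a minimal duplex framing scheme. This is a hypothetical protocol for illustration only — VibeVoice's actual streaming API (and whatever HA would adopt) may look quite different:

```python
# Hypothetical wire framing for duplex audio between HA and a VibeVoice
# server -- a sketch, NOT the actual VibeVoice protocol.
# Each frame: 1-byte type, 4-byte big-endian payload length, payload.
import struct

FRAME_AUDIO = 0x01   # raw PCM chunk
FRAME_TEXT = 0x02    # UTF-8 transcript or synthesis text
HEADER = struct.Struct(">BI")

def pack_frame(frame_type: int, payload: bytes) -> bytes:
    """Prefix a payload with its type and length for stream framing."""
    return HEADER.pack(frame_type, len(payload)) + payload

def unpack_frame(buf: bytes) -> tuple[int, bytes, bytes]:
    """Return (type, payload, remaining bytes) parsed from buf."""
    frame_type, length = HEADER.unpack_from(buf)
    start = HEADER.size
    return frame_type, buf[start:start + length], buf[start + length:]

# Round-trip one PCM chunk followed by one text frame over a single buffer,
# as a duplex session would interleave them.
stream = (pack_frame(FRAME_AUDIO, b"\x00\x01" * 320)
          + pack_frame(FRAME_TEXT, "hello".encode()))
ftype, payload, rest = unpack_frame(stream)
```

Length-prefixed frames like this let audio and text messages interleave on one WebSocket connection without ambiguity, which is the property a full-duplex voice agent needs.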
Potential Implementation Ideas
- Wrap the VibeVoice server in a Home Assistant Add-on
- Expose STT and TTS endpoints through Assist pipeline interfaces
- Use the Realtime API for interactive conversations
- Optionally integrate with HA Voice Satellite devices (ESPHome, Raspberry Pi, etc.)
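For the add-on idea, a manifest along these lines would be the starting point. The slug, port, and option names here are placeholders I made up for illustration, not an official add-on:

```yaml
# Hypothetical config.yaml for a VibeVoice server add-on.
# Slug, port, and option names are placeholders, not an official add-on.
name: VibeVoice Server
version: "0.1.0"
slug: vibevoice_server
description: Local VibeVoice realtime voice backend for Assist
arch:
  - amd64
  - aarch64
ports:
  "8765/tcp": 8765        # streaming endpoint exposed to HA
options:
  model_path: /data/models/VibeVoice-Realtime-0.5B
schema:
  model_path: str
```

The add-on would run the VibeVoice server in its container and expose the streaming port, while the HA integration side registers the STT/TTS providers that point at it.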
Why I think this fits HA’s vision
VibeVoice aligns perfectly with HA’s mission of local-first, privacy-preserving voice control.
It could become a core building block to elevate Assist from “voice commands” to natural spoken interaction.
Call for Contributors
If anyone is interested in exploring this integration, I’d be happy to help test, benchmark, or document the setup.
Thanks for considering it!