Feature Request: Microsoft VibeVoice Integration

Hi everyone,
I’d like to request official support or community development for integrating Microsoft VibeVoice into Home Assistant. This model is an open-source, real-time speech-to-speech and speech-to-text system that could dramatically improve local voice interactions in HA.


:pushpin: What is VibeVoice?

Microsoft VibeVoice is an open-source real-time voice model offering:

  • Streaming voice input and output
  • Fast, low-latency inference
  • Support for wake-word workflows
  • Multiple audio codecs
  • On-device or server inference

Links:


:jigsaw: Why this matters for Home Assistant

Home Assistant is pushing strongly into local voice (Assist, Piper, Whisper, etc.).
VibeVoice brings several advantages:

  • Real-time voice-to-voice agent capability
  • Potentially lower latency than a chained STT + TTS pipeline, since audio is streamed instead of being handed off stage by stage
  • Fully local, open-source, self-hostable
  • Could work as a backend for Assist or as a new voice pipeline
  • Supports streaming duplex audio, ideal for natural conversations
  • Should run on modest hardware (there is a 0.5B-parameter model)

This could massively improve responsiveness for HA Assist and give users a privacy-preserving alternative to cloud assistants.


:wrench: Requested Integration Features

  • Native integration as a Speech Pipeline backend
  • Local inference through a container or add-on (Docker)
  • Use as:
    • STT provider
    • TTS provider
    • Full duplex voice agent
  • WebSocket or gRPC streaming support
  • Configurable model path
  • Wake-word compatibility
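
To make the list above more concrete, here is a rough sketch of what the configuration side might look like. Everything in it is an assumption for illustration: `VibeVoiceConfig`, `Role`, `Transport`, `parse_config`, and the option names are hypothetical and do not exist in Home Assistant or VibeVoice today.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class Role(Enum):
    """Ways the backend could be used (hypothetical names)."""
    STT = "stt"
    TTS = "tts"
    DUPLEX = "duplex"


class Transport(Enum):
    """Streaming transports mentioned in the request (hypothetical names)."""
    WEBSOCKET = "websocket"
    GRPC = "grpc"


@dataclass(frozen=True)
class VibeVoiceConfig:
    """Hypothetical settings for a VibeVoice backend; not an existing HA schema."""
    model_path: Path
    roles: frozenset[Role]
    transport: Transport = Transport.WEBSOCKET
    wake_word_enabled: bool = True

    def __post_init__(self) -> None:
        # A backend with no role enabled would do nothing, so reject it early.
        if not self.roles:
            raise ValueError("at least one role must be enabled")


def parse_config(raw: dict) -> VibeVoiceConfig:
    """Turn a plain options dict (e.g. from a config UI) into a typed config."""
    return VibeVoiceConfig(
        model_path=Path(raw["model_path"]),
        roles=frozenset(Role(r) for r in raw.get("roles", ["stt", "tts"])),
        transport=Transport(raw.get("transport", "websocket")),
        wake_word_enabled=bool(raw.get("wake_word_enabled", True)),
    )
```

For example, `parse_config({"model_path": "/models/vibevoice", "roles": ["duplex"]})` would describe a full duplex agent over the default WebSocket transport, while an unknown role or transport string raises `ValueError` instead of being silently ignored.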

:triangular_ruler: Potential Implementation Ideas

  • Wrap the VibeVoice server in a Home Assistant Add-on
  • Expose STT and TTS endpoints through Assist pipeline interfaces
  • Use the Realtime API for interactive conversations
  • Optionally integrate with HA voice satellite devices (ESPHome, Raspberry Pi, etc.)
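
As a very rough sketch of the streaming side of these ideas: the code below cuts raw PCM audio into small frames and feeds them through a backend one frame at a time, which is the general shape a duplex adapter would need. The audio format (16 kHz, 16-bit mono) is an assumption, and `EchoBackend` is only a stand-in that returns the audio it receives. It does not use any real VibeVoice or Home Assistant API.

```python
import asyncio
from typing import AsyncIterator

SAMPLE_RATE_HZ = 16_000  # assumed input format: 16 kHz mono
BYTES_PER_SAMPLE = 2     # assumed input format: 16-bit PCM


def frame_size_bytes(frame_ms: int) -> int:
    """Number of bytes in one frame of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000


async def pcm_frames(pcm: bytes, frame_ms: int = 20) -> AsyncIterator[bytes]:
    """Yield the audio in fixed-duration frames; the last one may be shorter."""
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
        await asyncio.sleep(0)  # hand control back to the event loop between frames


class EchoBackend:
    """Hypothetical stand-in for a connection to a VibeVoice server."""

    async def converse(self, audio_in: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
        # A real backend would send each frame over WebSocket/gRPC and yield
        # synthesized reply audio; this one just echoes the input frames.
        async for frame in audio_in:
            yield frame


async def run_session(backend: EchoBackend, pcm: bytes) -> bytes:
    """Stream audio into the backend and collect whatever audio comes back."""
    out = bytearray()
    async for frame in backend.converse(pcm_frames(pcm)):
        out.extend(frame)
    return bytes(out)
```

The point of the frame-by-frame shape is that the backend can start replying before the user has finished speaking, which is where any latency gain over a chained pipeline would come from. Swapping `EchoBackend` for a real client would only require an object with the same `converse` method.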

:speech_balloon: Why I think this fits HA’s vision

VibeVoice aligns perfectly with HA’s mission of local-first, privacy-preserving voice control.
It could become a core building block to elevate Assist from “voice commands” to natural spoken interaction.


:pray: Call for Contributors

If anyone is interested in exploring this integration, I’d be happy to help test, benchmark, or document the setup.

Thanks for considering it!

Feature requests are not considered here at all. Go here:

Also, the AI formatting tends to put people off; you may want to personalize it a bit more.