One of the biggest pain points in Home Assistant voice control is false triggers from TV audio, other people talking, and background noise.
This project aims to solve that by adding speaker-aware gating in front of STT, so transcription is fed mainly by the target speaker’s voice.
The pipeline has three steps: audio is enhanced first, then speaker similarity is checked per window, and only accepted segments are sent to ASR.
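To make the per-window check concrete, here is a minimal sketch of the gating step. It assumes each audio window has already been turned into a speaker embedding (a plain list of floats) by some embedding model, and that similarity is measured as cosine similarity against an enrolled target-speaker embedding. The function names and the 0.5 threshold are illustrative assumptions, not taken from the project.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gate_windows(
    window_embeddings: Sequence[Sequence[float]],
    enrolled_embedding: Sequence[float],
    threshold: float = 0.5,  # assumed value; needs tuning per mic and room
) -> List[bool]:
    """Return one accept/reject flag per window (True = looks like the target speaker)."""
    return [
        cosine_similarity(emb, enrolled_embedding) >= threshold
        for emb in window_embeddings
    ]
```

Only the windows flagged `True` would be forwarded to ASR. In practice the threshold is the part that needs the most tuning.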
Why This Matters (Background)
In real home environments, voice assistants often struggle with:
TV/music speech in the background
Multiple people speaking nearby
Far-field noise and room echo
ASR alone cannot reliably determine who is speaking.
This project pushes speaker filtering before ASR to improve command reliability for the intended user.
Current Limitations (Community Help Welcome)
Post-gate segment reconstruction still needs better algorithms
With fixed windows, some middle segments may be rejected while head/tail segments pass. Reconstructing stable ASR input without dropping words is still an open area.
The HA side currently cannot be force-stopped by the STT server
The server can reject segments, but the client microphone usually keeps recording until the session ends.
This is a capability limit of the HA/Wyoming client.
More real-world tuning data is needed
Thresholds and behavior vary by mic array, room acoustics, and speaking distance.
Default ASR model: sherpa-onnx-qwen3-asr-0.6B-int8
Speaker gate default: SPEAKER_GATE=false (enable when needed)
Denoise default: DENOISE=true
Required models are auto-downloaded on first run into ${HOME}/data/models
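For reference, the defaults above could be read roughly like this. This is only a sketch: it assumes `SPEAKER_GATE` and `DENOISE` are environment variables, and the accepted truthy spellings, the `Settings` class, and `load_settings` are hypothetical names for illustration, not the project's actual code.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings (an assumption, not the project's parser)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    speaker_gate: bool
    denoise: bool
    model_dir: Path


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from an environment mapping, using the defaults listed above."""
    home = env.get("HOME", str(Path.home()))
    return Settings(
        speaker_gate=_as_bool(env.get("SPEAKER_GATE", "false")),  # off by default
        denoise=_as_bool(env.get("DENOISE", "true")),  # on by default
        model_dir=Path(home) / "data" / "models",  # ${HOME}/data/models
    )
```

With nothing set, this gives the speaker gate off, denoise on, and models under `${HOME}/data/models`.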
If you are also working on “assistant listens only to the owner,” contributions are very welcome (issues, PRs, field-test logs).
Repository again: xiasi0/wyoming-sherpa-onnx on GitHub
HA side currently cannot be force-stopped by STT server
I have a client and a server that implement this. You can add a stop signal to your project (the modification is backward compatible, since the standard client simply ignores it) and test it with this custom client.
Thanks a lot for the pointers and for sharing your approach.
I want to highlight one ecosystem issue here: Home Assistant currently does not let an STT server reliably stop the microphone capture from the server side, which is not very reasonable for this kind of pipeline. It would be great if someone could contribute a proper HA PR for this, instead of every user reinventing custom client-side workarounds.
In my setup, I use Qwen3-ASR. It is relatively robust when background speech comes from music and still gives stable command text in many cases. But TV voice and other real speakers are much harder.
I added a simple gating algorithm to preserve the full target-speaker utterance more reliably:
keep 1 block before the first accepted block
keep all blocks between first accepted and last accepted
keep 1 block after the last accepted block
This significantly reduces missing beginning/ending words and dropped middle content.
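The three rules above can be sketched as a small function. This is a minimal illustration, not the exact code from the repository: it assumes blocks arrive as a sequence with one accept/reject flag each, and `select_blocks` and `pad` are hypothetical names.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_blocks(blocks: Sequence[T], accepted: Sequence[bool], pad: int = 1) -> List[T]:
    """Keep everything from `pad` blocks before the first accepted block
    to `pad` blocks after the last accepted block, inclusive."""
    if len(blocks) != len(accepted):
        raise ValueError("blocks and accepted must have the same length")

    hits = [i for i, ok in enumerate(accepted) if ok]
    if not hits:
        return []  # nothing matched the target speaker

    start = max(0, hits[0] - pad)  # clamp at the start of the utterance
    end = min(len(blocks), hits[-1] + pad + 1)  # exclusive end, clamped
    return list(blocks[start:end])
```

Note that rejected blocks between the first and last accepted block are kept on purpose. That is what avoids dropped middle content, at the cost of letting some non-target audio through.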
Next, I plan to filter false-positive blocks with extra processing (for example, separation then re-verification). But I’m hitting practical limits:
with 2 speakers, separation works reasonably well
with 3+ speakers, separation often fails (my current model is not designed for >2 speakers)
separation is computationally expensive and very slow on CPU
I do not want to require CUDA/GPU, because most users don’t have that hardware
So for now, CPU-friendly robustness is still the key constraint.
This isn’t implemented at the system level because it doesn’t make much sense right now, especially with offline (non-streaming) ASR, which is what 99.9 percent of users run.
In this case, the custom server developer literally duplicates the functionality of the system VAD, but based on their own criteria. For example, in the matching project I linked to, you could add an interruption to listening if the target voice is no longer detected.
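As a rough illustration of that kind of interruption, the decision logic could look like the sketch below. It assumes the server already has a per-window accept/reject flag for the target voice. `stop_index` and `patience` are hypothetical names, and how the stop signal is actually sent to the client is not shown here.

```python
from typing import Iterable, Optional


def stop_index(accepted: Iterable[bool], patience: int = 3) -> Optional[int]:
    """Return the index of the window at which listening could be interrupted:
    the first point where the target voice has been heard at least once and
    then missed for `patience` consecutive windows. None if that never happens."""
    seen_target = False
    misses = 0
    for i, ok in enumerate(accepted):
        if ok:
            seen_target = True
            misses = 0  # target voice is back, reset the counter
        elif seen_target:
            misses += 1
            if misses >= patience:
                return i
    return None
```

Windows before the first accepted one are ignored, so leading background audio does not trigger an early stop.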
I wouldn’t call this reinventing anything, as everything remains within the protocol; we simply added logic for handling a premature TranscriptStop event. All new ideas are tested on custom components first, so this is normal practice. It’s still too early to release something like this.
Noise and music (without voice) aren’t a problem for most ASR. I’ve experimented quite extensively with DeepFilterNet and decided there’s no point in wasting resources on it.
As for voice separation, improvements should start with the microphone module—beamforming is standard in the smart speaker world. Processing data when the energy levels of voice and background are inherently different is much more convenient.
Until that improves, we won’t get good results with reasonable resources, no matter what approach we use.
It’s also not entirely clear why your project is tied to Qwen; it’s far from the best engine in terms of either quality or speed.
Thanks for your thoughtful feedback — I actually agree with many of your points.
Parakeet is a very strong model, but unfortunately I can’t use it for my current setup because Chinese support is a hard requirement.
Before this, I mainly used Whisper large-v3 on GPU, which is also excellent, but in my tests it struggles more when there is speech mixed into background music/TV audio. Qwen helps reduce that issue in my environment.
I also agree that with single-channel mixed audio streams, backend-only processing gives limited gains.
Beamforming on the edge microphone side is likely the highest-ROI path.
At the moment, apart from XMOS XVF3800-class solutions, I honestly don’t see many practical options for consumer-grade setups yet.