Make Your Assistant Listen Only to Your Voice

One of the biggest pain points in Home Assistant voice control is false triggers from TV audio, other people talking, and background noise.
This project aims to solve that by adding speaker-aware gating in front of STT, so transcription is fed mainly by the target speaker’s voice.

Repository: https://github.com/xiasi0/wyoming-sherpa-onnx


Privacy First: Fully Local

This stack runs fully on your local machine/network.
Your voice samples and audio streams are not uploaded to any cloud service by this project.


What This Project Does

wyoming-sherpa-onnx is a Wyoming-protocol offline ASR server built around Qwen3-ASR (supports 0.6B / 1.7B).
Before ASR, it adds two processing stages:

  1. GTCRN denoise (switchable)
  2. Fixed-window speaker gate (switchable)

Pipeline:

Audio -> GTCRN denoise -> fixed-window speaker gate -> Qwen3-ASR

So audio is enhanced first, then speaker similarity is checked per window, and only accepted segments are sent to ASR.
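
A minimal sketch of what a fixed-window speaker gate can look like, not the project's actual code. It assumes each window is turned into a speaker embedding and compared with a reference embedding by cosine similarity. The names `split_windows`, `gate_windows`, the `embed` callable, and the 0.5 threshold are all hypothetical placeholders; a real setup would plug in a speaker-embedding model and a tuned threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_windows(samples, window_size):
    """Cut the audio samples into fixed-size windows; the last one may be shorter."""
    return [samples[i:i + window_size] for i in range(0, len(samples), window_size)]


def gate_windows(windows, embed, reference, threshold=0.5):
    """Return one accept/reject flag per window.

    `embed` is a hypothetical callable mapping a window to a speaker embedding;
    `reference` is the embedding of the enrolled target speaker.
    """
    return [cosine_similarity(embed(w), reference) >= threshold for w in windows]
```

Only windows flagged `True` would be forwarded to ASR. The threshold is the main tuning knob and, as noted under the limitations, is likely to depend on the microphone and the room.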


Why This Matters (Background)

In real home environments, voice assistants often struggle with:

  • TV/music speech in the background
  • Multiple people speaking nearby
  • Far-field noise and room echo

Pure ASR alone cannot reliably solve “who is speaking.”
This project pushes speaker filtering before ASR to improve command reliability for the intended user.


Current Limitations (Community Help Welcome)

  1. Post-gate segment reconstruction still needs better algorithms
    With fixed windows, some middle segments may be rejected while head/tail segments pass. Reconstructing stable ASR input without dropping words is still an open area.

  2. HA side currently cannot be force-stopped by STT server
    The server can reject segments, but the client microphone usually keeps recording until session end.
    This is a limitation of the HA/Wyoming client, not something the STT server can work around by itself.

  3. More real-world tuning data is needed
    Thresholds and behavior vary by mic array, room acoustics, and speaking distance.


Where Community Contributions Are Most Valuable

  • Better gate post-processing (temporal smoothing, robust segment merge, anti-jitter logic)
  • Standardized real-home evaluation datasets and benchmarks
  • HA/Wyoming protocol improvements for server-driven mic stop
  • Cross-device testing and tuning recommendations

Docker Quick Start

1) Clone the project

git clone https://github.com/xiasi0/wyoming-sherpa-onnx.git
cd wyoming-sherpa-onnx

2) Start the service

docker compose up -d --build
docker compose logs -f

3) Default host directories

  • Models: ${HOME}/data/models
  • Speaker reference audio: ${HOME}/data/speaker_refs

Container paths:

  • /app/data/models
  • /data/speaker_refs

4) Default behavior

  • Default ASR model: sherpa-onnx-qwen3-asr-0.6B-int8
  • Speaker gate default: SPEAKER_GATE=false (enable when needed)
  • Denoise default: DENOISE=true
  • Required models are auto-downloaded on first run into ${HOME}/data/models

If you are also working on “assistant listens only to the owner,” contributions are very welcome (issues, PRs, field-test logs).
Repository again: https://github.com/xiasi0/wyoming-sherpa-onnx

“HA side currently cannot be force-stopped by STT server”

I have a client and a server that implement this. You can add a stop signal to your project (the modification is backward compatible, since the standard client simply ignores it) and test it with this custom client.

Thanks a lot for the pointers and for sharing your approach.

I want to highlight one ecosystem issue here: Home Assistant currently does not let an STT server reliably stop microphone capture from the server side, which is a poor fit for this kind of pipeline. It would be great if someone could contribute a proper HA PR for this, instead of every user reinventing custom client-side workarounds.

In my setup, I use Qwen3-ASR. It is relatively robust when the background speech comes from music, and it still gives stable command text in many cases. TV speech and other real speakers in the room are much harder.

I added a simple gating algorithm to preserve the full target-speaker utterance more reliably:

  • keep 1 block before the first accepted block
  • keep all blocks between first accepted and last accepted
  • keep 1 block after the last accepted block

This significantly reduces missing beginning/ending words and dropped middle content.
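
The three rules above can be sketched as follows. This is an illustrative reading of the description, not the code from the repository; the function name `select_blocks` and the `pad` parameter are hypothetical, and `accepted` is assumed to be the per-block accept/reject result from the speaker gate.

```python
def select_blocks(accepted, pad=1):
    """Return indices of blocks to keep for ASR.

    Keeps everything from `pad` blocks before the first accepted block
    to `pad` blocks after the last accepted block, including rejected
    blocks in between, so the utterance stays contiguous.
    """
    hits = [i for i, ok in enumerate(accepted) if ok]
    if not hits:
        return []  # no block matched the target speaker
    start = max(hits[0] - pad, 0)
    end = min(hits[-1] + pad, len(accepted) - 1)
    return list(range(start, end + 1))
```

For example, with gate results `[False, False, True, False, True, False, False]` the kept block indices are 1 through 5: the rejected middle block is retained, plus one block of padding on each side.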

Next, I plan to filter false-positive blocks with extra processing (for example, separation then re-verification). But I’m hitting practical limits:

  • with 2 speakers, separation works reasonably well
  • with 3+ speakers, separation often fails (my current model is not designed for >2 speakers)
  • separation is computationally expensive and very slow on CPU
  • I do not want to require CUDA/GPU, because most users don’t have that hardware

So for now, CPU-friendly robustness is still the key constraint.

This isn’t implemented at the system level because it doesn’t make much sense right now, especially with offline (non-streaming) ASR, which is what 99.9 percent of users run.

In this case, the custom server developer literally duplicates the functionality of the system VAD, but based on their own criteria. For example, in the matching project I linked to, you could stop listening once the target voice is no longer detected.

I wouldn’t call this reinventing anything, as everything remains within the protocol; we simply added logic for handling a premature TranscriptStop event. All new ideas are tested on custom components first, so this is normal practice. It’s still too early to release something like this.

Noise and music (without voice) aren’t a problem for most ASR engines. I’ve experimented quite extensively with DeepFilterNet and decided there’s no point in wasting resources on it.

As for voice separation, improvements should start with the microphone module: beamforming is standard in the smart speaker world. Processing the data is much easier when the voice and the background already differ in energy level.

Until that improves, we won’t get good results with reasonable resources, no matter what approach we use.

It’s also not entirely clear why your project is tied to Qwen; it’s far from the best engine in terms of either quality or speed.

Thanks for your thoughtful feedback. I actually agree with many of your points.

Parakeet is a very strong model, but unfortunately I can’t use it in my current setup because Chinese support is a hard requirement.
Before this, I mainly used Whisper large-v3 on GPU, which is also excellent, but in my tests it struggles more when speech is mixed into background music or TV audio. Qwen reduces that issue in my environment.

I also agree that with single-channel mixed audio streams, backend-only processing gives limited gains.
Beamforming on the edge microphone side is likely the highest-ROI path.

At the moment, apart from XMOS XVF3800-class solutions, I honestly don’t see many practical options for consumer-grade setups yet.