Ability to custom train a wake word

Hey HA tinkerers. I've been playing with Voice PE and the Satellite1 a lot lately, and I feel the weakest point is the wake word detection success rate. Everything after the wake word comes down to the hardware's horsepower, but wake word detection lives on the device and depends on microWakeWord, and honestly it misses far too often. I know we can take control of the ESPHome config and set a custom wake word, but my post here is about something else: can we really "train" a wake word using our own voice and our own pronunciation? This could then be extended to every household member, so the assistant "knows" who is speaking, and we could use that context for really powerful per-person actions. Is this even possible on today's VA hardware?

Thanks!

Please search next time…

It's actually one of the coolest features, imho. If you don't like theirs, make your own.

Thanks, I've searched, and as I said, I know that approach. I'm asking about something else here: training a model with MY OWN voice and pronunciation. Please see the original post, where I elaborated on the idea with other household members and per-person context. Thanks.

Did you read it? That's exactly what it does.

I can use it to cancel out my South Texas twang or a German pronunciation of Jarvis…


You can easily train your own wake word using personal voice samples. We did this with GitHub - TaterTotterson/microWakeWord-Trainer-AppleSilicon: Train microWakeWord models on Apple Silicon Macs (M1, M2, M3...) with full GPU acceleration via Metal (MPS). For use on Home Assistant Voice.

Per-person context within the ASR is something entirely separate, like saying "turn on the lights in my room" and having it know who is speaking. Some ASR models technically support speaker identification, but I don't believe HA supports that in any way yet.

IMO they have quite a few things to improve before working on something like that.

The code in the container performs fairly simple steps:

  • It starts sample generation in Piper using a special English model. It also lets you manually record your own samples in the UI, and you can easily add your own data to the sample directory.
  • After this, the data is modified (various noises are added) and converted to a training format.
  • Training starts, and you receive a finished model.
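The augmentation step above (mixing noise into the generated or recorded samples) can be sketched in a few lines. This is not the container's actual code, just a minimal illustration of the core idea: scale background noise so it sits at a chosen signal-to-noise ratio under the clean sample before training. The function name and parameters are my own.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into a clean sample at a target SNR (in dB).

    A simplified stand-in for the augmentation step; real pipelines also
    apply reverb, gain jitter, and time shifts before feature extraction.
    """
    # Loop and trim the noise so it matches the clean sample's length
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so clean_power / noise_power hits the target SNR
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + noise

# Example: a 1 s sine "sample" at 16 kHz plus white noise at 10 dB SNR
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = mix_at_snr(clean, rng.normal(size=8000), snr_db=10.0)
```

Training on many such noisy variants of each sample is what makes the resulting model hold up in a real room rather than only in quiet recordings.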

The advantage of containers is that you won’t encounter dependency hell.

The disadvantage is that they are not optimized for disk space and are only suitable for English.

Also keep in mind that it's easy to train a model that activates reliably on your own voice. Achieving a low false positive rate, however, is the hard part.
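One simple way to quantify that false positive problem: stream hours of audio that does not contain your wake word (podcasts, TV) through the model and count activations per hour. A sketch of that metric, with illustrative scores and a made-up threshold:

```python
def false_accepts_per_hour(scores, threshold, audio_hours):
    """Count detections at or above the threshold on negative
    (non-wake-word) audio, normalized to activations per hour —
    a common way to report wake word false positive rates."""
    return sum(s >= threshold for s in scores) / audio_hours

# Hypothetical per-window scores from 2 h of podcast audio
neg_scores = [0.12, 0.95, 0.40, 0.97, 0.30, 0.85]
rate = false_accepts_per_hour(neg_scores, threshold=0.9, audio_hours=2.0)
print(rate)  # → 1.0 false accept per hour
```

Raising the threshold lowers this rate but also makes the model harder to trigger on purpose, so you end up tuning a trade-off rather than eliminating false accepts outright.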