Home Assistant Voice PE - Wake Word

Having now received our Home Assistant Voice PE and configured it to work with Home Assistant Cloud, I find it’s not quite as responsive to regional accents as I was hoping.

We’ve (collectively as a household) chosen to use “Hey Jarvis” as our wake word. I’d heard that it might respond best to my broad Scottish accent. However, it has been hit and miss. It responds every time to our kids, who both have milder accents than me, and to my wife, who has an English accent. I also expected it to perform better in a noisy room than it does. I’ve not tried it with music playing or other device noises, but it seems to struggle if too many people are speaking at the same time. I guess that’s more of a learning process for us as we continue to use it.

I did see this on the HA blog

Does anyone know if this website has been created yet, or is it the same as the open wake word site?


I’ve got a regular, boring south of England accent and it’s not very responsive for me at all so far either.


I have a very mild North East Scottish accent. I’ve only had my device since this afternoon, but I’ve found that “okay nabu” seems to work better than “hey Jarvis.” However, I don’t find the device particularly responsive, and I don’t think it’s related to my accent. Even if I use my posh telephone voice while sitting in a silent room directly in front of the device, it sometimes still doesn’t register the wake word. On the other hand, if I whisper “hey Google,” a Google Mini on the other side of the room picks it up without any trouble. It’s likely just hardware limitations or the fact that it’s still relatively early in development.


Yes, we all have one of those :laughing:

Glad it’s not just me then.

Must admit, some of our Amazon Echo devices have become a bit cloth-eared recently. Perseverance is the answer I think. I tried the link I posted above to add my voice to the “collective” but it just wants us to say “Okay Nabu” as we walk around the room :man_shrugging:

Interestingly, it’s the same here for me. I wanted to use the “Hey Jarvis” wake word, and straight out of the box after setup it would not respond to the wake word maybe 9 times out of 10. However, when changing the wake word to “Okay Nabu”, it worked 99% of the time.

Multiple attempts to use the Jarvis wake word have all resulted in it being completely unusable, so I am currently stuck using the Nabu wake word.

Yeah, I’ve switched to using “Okay Nabu” and it works 99% of the time.

I switched to “okay mycheck” or whatever it is, haha, and that is flawless.

Interesting to see people’s comments. For us, the wake word is recognized extremely reliably (American English), but the device takes ages to process the rest of the command: 50% of the time it doesn’t come back after spinning forever, 30% of the time it does the wrong thing, and 20% of the time it does what we asked. I just made up those numbers on the spot, but that feels roughly right.

Everything is using Nabu Casa cloud.

It’s a lot more brittle than I imagined, but I’m still happy we are supporting Nabu Casa (both with a cloud subscription and through the hardware purchase), which hopefully helps subsidize future development.

Is it possible to see what words it actually parses and how those are processed by the “brains” part of the pipeline?


From my understanding, and I could be wrong, I think you can see this if you go to Settings > Voice Assistants, click the three dots next to the assistant you’re using, and select Debug.

I now have 2 HA Voice PEs and 2 ReSpeaker Lite HA voice kits running, and they all respond similarly to each wake word. We’ve switched to “Hey Mycroft” and that seems as good as “Okay Nabu”. So it must just be the way us Scots pronounce Jarvis. I suspect it’s the hard “r” sound, as most English speakers will pronounce it Jah-vis.


Thanks, and your guess about the “r” sound seems super plausible to me! Though Mycroft also has it, these models are so darn opaque that it’s super tough to reason about why they do or don’t work, including for the people who design them.
If you are feeling generous and comfortable donating your voice, you could look into participating in that initiative. I’m sure it’ll help improve the accuracy of the model for Scots all around :grinning:.
Happy New Year :tada:

I’ve actually had the site sitting on my desktop for a while meaning to contribute.

I’ll get round to doing it now that the seasonal festivities are all but done.

Happy New Year :clinking_glasses:


Try saying it in an American accent, with the emphasis on the “r” of “Jar”. Works every time for me. Alas, the models seem to be very American-English specific in their pronunciation. The solution for Englanders with flat “r” sounds is probably a custom wake word made from “hey_jahvis”, but although you can create it, there’s seemingly no way of using it on the Voice PE at the moment.

I think this is why “OK Nabu” seems to work better for a lot of people, and “Hey Mycroft” too, since these have a lot more similarity of pronunciation on both sides of the Pond.

“Hey Mycroft” is working well for us, but we get quite a few false “wakes” from the TV.

As for emphasising the “r” in Jar-vis, I’m sure I mentioned I was a Scot and we probably emphasise the “r” more than most :wink:

Ah but your Gaelic “r” is more refined than the American “r”, and with less effect on the overall pronunciation of the word, particularly how long you hover around the vowel sounds. Try putting on an American accent and it may work for you. I tried it in my best Highlands/Inverness accent and it was a miss most times, only slightly less than with flat Southern English vowels. :slight_smile:


It’s a bit like the English trying to say “Whale Oil Beef Hooked” in the usual bad Irish accent… :slight_smile:


Okay I’ll have a go at a US accent :grin: any particular state?

I might have to watch a few US tv shows to practise.

Search up old recordings of Walter Cronkite and try his perfect GenAm accent! :crazy_face:


The dataset they create uses Piper, which generates 1,000 synthetic keyword clips with very little variation. Piper is a low-compute embedded TTS, whereas, as per this post and my subsequent one, there are many newer TTS systems offering many voices, emotions, languages and cloning that deliver state-of-the-art output.
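For anyone curious, here’s roughly what that generation step looks like with the Piper CLI, jittering its sampling knobs for a bit more variation. The voice files and flag names are my assumptions (they vary between Piper builds), and this is a sketch, not the actual HA pipeline:

```python
# Rough sketch: generating synthetic wake-word clips with the Piper
# CLI. Voice files and flag names are assumptions; adjust for your
# Piper build.
import subprocess

PHRASE = "hey jarvis"
VOICES = ["en_US-lessac-medium.onnx", "en_GB-alba-medium.onnx"]  # placeholders

clip = 0
for voice in VOICES:
    # Jitter the sampling parameters so the clips are not a thousand
    # near-identical copies of one synthetic speaker.
    for length_scale in (0.85, 1.0, 1.15):     # speaking rate
        for noise_scale in (0.4, 0.667, 0.9):  # phoneme variability
            subprocess.run(
                ["piper", "--model", voice,
                 "--length_scale", str(length_scale),
                 "--noise_scale", str(noise_scale),
                 "--output_file", f"hey_jarvis_{clip:04d}.wav"],
                input=PHRASE.encode(), check=True)
            clip += 1
```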

I think kahrendt inherited the dataset creation from openWakeWord, and for a classification model it’s not good, so the results you are getting are not surprising.
A classification model with the correct dataset is both lighter and more accurate than an embedding model trained on synthetic data, such as openWakeWord.
Big Data has farmed tons of usage data with highly detailed user profiles, has gold-standard datasets, and can create very light, accurate models.
Open source, meanwhile, is stuck in that catch-22: it doesn’t have an option to opt in to capturing device usage data and supplying good metadata. So it’s using synthetic data, and rather bad synthetic data at that, with some 101-level errors in the audio processing of the dataset, even though they have bought in silicon for speech enhancement.
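To make the audio-processing point concrete, the two standard augmentations are mixing in background noise at a controlled SNR and adding reverberation by convolving with a room impulse response. A minimal sketch (file names are placeholders, mono audio assumed; this isn’t the actual pipeline’s code):

```python
# Minimal sketch of the two standard dataset augmentations: noise at a
# target SNR and reverberation via room-impulse-response convolution.
# File names are placeholders; a real pipeline randomises noise clips,
# SNRs and RIRs per sample.
import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to clip length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve dry speech with a room impulse response."""
    wet = np.convolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12)  # renormalise

speech, sr = sf.read("hey_jarvis_0000.wav")
noise, _ = sf.read("kitchen_noise.wav")     # placeholder noise clip
rir, _ = sf.read("living_room_rir.wav")     # placeholder impulse response

augmented = mix_at_snr(add_reverb(speech, rir), noise, snr_db=5.0)
sf.write("hey_jarvis_0000_aug.wav", augmented, sr)
```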

Hopefully the massive leaps forward in inference engines, generative AI, ML and local compute power will enable some clever people to change all this at some point soon, and we can all have better wake words.

It’s not needed, as a KWS with a good dataset is accurate and light. The datasets are what’s needed, and that is just on-device recordings with metadata, so you can filter and balance for the keyword of choice.
You can also do something called on-device training (not sure if tf4micro supports it), where you capture keywords from actual usage so a KWS would learn your voice; a sketch of the idea is just below.
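Purely to illustrate that idea, here’s a hypothetical fine-tuning sketch: freeze a pretrained model’s feature layers and retrain only a small head on clips of your own voice. The model and file names are placeholders, and this isn’t an existing HA feature:

```python
# Hypothetical sketch of a KWS "learning your voice": freeze the
# feature extractor of a pretrained Keras model and fine-tune only a
# small classification head on clips captured from actual usage.
import numpy as np
import tensorflow as tf

base = tf.keras.models.load_model("pretrained_kws.keras")  # placeholder
base.trainable = False  # keep the acoustic feature layers frozen

# Everything except the final layer becomes the feature extractor.
features = tf.keras.Model(base.input, base.layers[-2].output)
model = tf.keras.Sequential([
    features,
    tf.keras.layers.Dense(2, activation="softmax"),  # wake word vs. not
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder data: 1-second 16 kHz clips of your voice saying the
# wake word (label 1) plus negatives (label 0), captured on-device.
x = np.load("captured_clips.npy")   # shape (N, 16000), float32
y = np.load("captured_labels.npy")  # shape (N,), int

model.fit(x, y, epochs=5, batch_size=8)
```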
We are not waiting for tech, but maybe the ESP32 was a bad choice, as a Pi Zero 2 runs native TFLite and can definitely use fine-tuned models.
Currently the keyword dataset produced by Piper, the manner in which they apply reverberation and noise, and the not-keyword (!kw) dataset are all pretty bad.
They could be improved but that is down to the devs.
A classification model does nothing clever; as the old saying goes, “garbage in, garbage out”: the quality of the dataset sets the overall accuracy.
The Google Speech Commands set is a benchmark dataset that is deliberately bad, to stress KWS, and the top KWS models manage 97-98% accuracy on it. With a good dataset and on-device dataset capture, very low-compute KWS models are extremely accurate.
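If anyone wants to sanity-check a model against that benchmark themselves, the Speech Commands set ships with TensorFlow Datasets. A sketch, assuming you already have a trained Keras model that takes raw 1-second 16 kHz waveforms (the model file name is a placeholder):

```python
# Sketch: measuring accuracy on the Google Speech Commands benchmark
# via TensorFlow Datasets. The model file is a placeholder; the
# dataset name is as published in TFDS.
import tensorflow as tf
import tensorflow_datasets as tfds

test_ds = tfds.load("speech_commands", split="test", as_supervised=True)
model = tf.keras.models.load_model("my_kws_model.keras")  # placeholder

correct = total = 0
for audio, label in test_ds:
    # Clips are 16 kHz integer PCM; normalise and pad/trim to 1 second.
    wav = tf.cast(audio, tf.float32) / 32768.0
    wav = wav[:16000]                                   # trim to 1 s
    wav = tf.pad(wav, [[0, 16000 - tf.shape(wav)[0]]])  # pad to 1 s
    pred = tf.argmax(model(wav[tf.newaxis, :], training=False), axis=-1)[0]
    correct += int(pred == label)
    total += 1
print(f"test accuracy: {correct / total:.3f}")
```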