Very low wake word detection and speech recognition rate on Voice PE

Got my Voice PE a few days ago, and my biggest issue with it is the very low wake word recognition (I am using the German language, btw). I know comparing the Voice PE with Alexa is not fair, but that is what I have here to test. With the PE I have to be at most 1.5 meters away, and I have to raise my voice quite a bit for it to even catch the wake word. Understanding the command is a 1 in 10 chance. On the Alexa device, on the other hand, I can be 5+ meters away and successfully give commands without having to raise my voice. The difference between the two devices in regards to recognition and command execution is pretty hefty.

I did enable the debug option so it saves the WAV files to disk. I can personally understand the command in the WAV, though there is a base noise floor across the whole recording that could prevent the STT from recognising the speech.
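If you want to put a number on that noise floor, a quick sketch like this can measure the RMS level of one of the saved debug recordings (the file path is a placeholder; it assumes 16-bit PCM WAV, which is what the pipeline debug output typically produces):

```python
# Estimate the RMS level of a voice assistant debug recording in dBFS.
# Assumes 16-bit PCM; the path below is a hypothetical example.
import wave
import array
import math

def rms_dbfs(path: str) -> float:
    """Return the overall RMS level of a 16-bit PCM WAV, in dB full scale."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        samples = array.array("h", w.readframes(w.getnframes()))
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / 32768)  # 0 dBFS = full scale

# Example (path is illustrative):
# print(rms_dbfs("/config/assist_debug/recording.wav"))
```

Comparing the dBFS of a silent stretch against a stretch with speech gives a rough signal-to-noise figure, which is more useful to report than a room dB(A) reading.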

Now, is there an option to tweak the noise suppression and the mic gain without having to take over the device in ESPHome? I don’t really fancy that process, given that I could run into all sorts of trouble or incompatibilities later down the road.

You’ll have to take over the device.
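If you do adopt it into ESPHome, the relevant knobs live on the `voice_assistant` component. A minimal sketch of the options involved (the `microphone` id is a placeholder — use the id from the device's stock config, and check the ESPHome docs for the valid ranges):

```yaml
voice_assistant:
  microphone: mic_id           # placeholder; take the id from the stock Voice PE config
  noise_suppression_level: 2   # 0 (off) to 4 (maximum suppression)
  auto_gain: 31dBFS            # automatic gain control target, 0-31 dBFS
  volume_multiplier: 2.0       # software gain applied to the mic signal
```

Higher noise suppression can clip quiet speech, so it is worth testing one change at a time against the saved debug WAVs.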

Hi
I have had my Voice PE for roughly a week. I think it does a great job.

I also have 5 home-made Wyoming satellites. They detect false wake words so often that I had to disable the microphone on the one in the bedroom, because even my snoring makes it think I say “Hej Jarvis”.

I am for the moment still using “Okay Nabu” on the Voice PE. Which wake word are you using?
The first hour I had problems because I kept on saying “Hey Nabu”, and that worked 1 in 10. Then I realized that it is “Okay Nabu”, and it reacted 100% reliably as long as I am in the same room.

One thing the Amazon Echos do much better is detecting anything while the TV is talking. But that is also a really difficult task, and I am deeply impressed with the Amazon devices for that.

I wish it was that sensitive here. Funnily enough, “Hey Jarvis” works even worse than “Okay Nabu”. I also tried it without background music or anything; it’s not really any better. The ambient noise floor in my place is between 38 and 42 dB, so pretty average I guess. It’s a bit of a letdown.

I have noticed that some IT devices with multiple microphones and echo cancelling can have problems when placed in front of surfaces that reflect a lot of sound. You may want to try a placement away from the wall. Even rotating the device so the two microphones pick up reflected sound differently can make a difference.

Try the stand I made and posted upstream. Really improves detection.

I just changed my wake word to “Hey Jarvis”. I could not make it work until I tried saying “Hey Jarvis” without much pause between “Hey” and “Jarvis”. I think Germans and Danes naturally put a long gap between the two words when we try to speak clearly, and the wake word was obviously trained mostly on native English speakers. Try saying the wake word faster, with very little pause between the two words.

Yup, it has a speaker built in. You should be able to hear the chime as a confirmation sound to the initial “Hey Nabu” when you first set up the device.

I have the completely opposite experience: it reacts to everything! I’m honestly very impressed with the microphones for even being able to hear me some of the times it decides that I said “Hey Jarvis”.

I’m a native English (British) speaker with a very neutral accent, and the wake word detection is about 70-80%. Even the tiniest change in tone and inflection affects it. I have to say the wake word (“Okay Nabu”) almost ‘perfectly’ for it to work.

Twice it’s woken up to someone on the TV saying ‘OK man’ in a German accent (but in English), yet I can’t get it to trigger that way myself.

It seems much better (80%+) if I am within a meter of it when speaking. Normally it’s about 2.5-3 meters away in a quiet room.

It’s not microWakeWord itself — GitHub - kahrendt/microWakeWord (a TensorFlow-based wake word detection training framework using synthetic sample generation, suitable for certain microcontrollers) does a great job of implementing the fairly hefty and complex C microcontroller code.
The training and dataset are the weak point, though: the output of Piper gives only two English voices (one per gender) with very little variation, so the resulting model is hugely overfitted.