I have a question. Is there a “stop word” for the voice assistant? I always have a lot of background noise in my environment, but when I wake Nabu up and give it a command, it usually recognizes it OK. The issue is that it keeps listening because of the background noise and never stops. The way I deal with this now is to cover the microphone with my fingers after I have said my command, and then it goes off and does what I want.
What I would like is the ability to say “Hey Nabu, add red onion to shopping list. stop” and have it stop listening and process the input. Is this doable? Maybe the stop word could be something like “stop listening”?
There is a trained wake word for stop (maybe cancel?), but regardless, it doesn’t work the way you’d want. It interrupts whatever the device is doing and kills the action. So if I read you right, the mic is staying open and sending audio. But if you uttered the cancel word, it’d just stop the action, so it doesn’t do you any good. Same net effect as hitting the button.
That said, no, the device is not listening for a stop. It’s listening for the absence of active speech. There is no trailer word, or even the concept of one.
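To make the “absence of active speech” point concrete, here is a toy sketch in Python of silence-based endpointing. It is only an illustration, not the device’s actual logic: the function name, the per-frame speech flags, and the three-frame silence threshold are all assumptions. It shows why a radio in the background can keep the session open indefinitely.

```python
def end_of_speech_index(frame_is_speech, silence_frames_needed=3):
    """Return the frame index at which listening would stop, or None.

    frame_is_speech: iterable of booleans, one per audio frame, as a
    hypothetical voice activity detector might produce them.
    """
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i            # enough silence after speech: stop here
    return None                     # never saw a long enough pause


# Quiet room: the command is followed by a clean pause.
quiet_room = [True, True, True, False, False, False, False]
# Noisy room: radio chatter keeps being classified as speech.
noisy_room = [True, True, True, True, False, True, True, False, True]

print(end_of_speech_index(quiet_room))  # 5
print(end_of_speech_index(noisy_room))  # None
```

In the noisy case the silence counter keeps getting reset, so the session only ends on a timeout, which matches what people describe in this thread.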
This is potentially possible, since the system supports streaming for ASR. But in fact, I only know of one Whisper modification that implemented streaming.
It would also require changes to the STT component.
Alternatively, you can create your own speech recognition component that will wait for “stop words” and interrupt command reception when it receives them. A proof of concept can be created in one evening. ASR with streaming support is still the main difficulty.
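A rough sketch of what such a stop-word check could look like, in Python. It assumes a streaming ASR that emits cumulative partial transcripts as plain strings; the phrase list and all function names here are made up for illustration and are not part of any existing component.

```python
import re

# Hypothetical stop phrases; the longer phrase is listed first so it wins.
STOP_PHRASES = ("stop listening", "stop")


def _normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def split_on_stop_phrase(partial_text, stop_phrases=STOP_PHRASES):
    """If the text ends with a stop phrase, return the command before it.

    Returns None when no stop phrase ends the text, or when nothing
    precedes the stop phrase.
    """
    words = _normalize(partial_text)
    for phrase in stop_phrases:
        phrase_words = phrase.split()
        n = len(phrase_words)
        if len(words) > n and words[-n:] == phrase_words:
            return " ".join(words[:-n])
    return None


def listen_until_stop(partials):
    """Consume partial transcripts and stop at the first stop phrase."""
    for partial in partials:
        command = split_on_stop_phrase(partial)
        if command is not None:
            return command          # session would end here
    return None


stream = [
    "add red",
    "add red onion to shopping list",
    "add red onion to shopping list. stop",
    "add red onion to shopping list. stop and now the weather on radio",
]
print(listen_until_stop(stream))  # add red onion to shopping list
```

One obvious weakness: a command that legitimately ends in “stop” would be cut short, so a longer phrase like “stop listening” is probably the safer choice.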
I am trying prompts that attempt to get the AI to look for incorrect wake detection. I got the idea from a post in the forums but cannot find it again, so I made my own. I haven’t had a chance to test it, however.
You are a voice assistant located in a noisy environment. You often mistake sound from the TV as your wake word. If you believe you mistakenly detected the wake word you should respond with “_”.
Was looking for this as it drives us crazy and we are disabling our HA Voice PAs since they won’t stop listening if there is a radio or movie playing in the background.
After 30 seconds we get HA Voice replying “Excuse me, I don’t understand Turn on the lighs in living room <26 seconds of transcribed radio chatter>”. I understand the challenges to pick up the actual command while a radio is playing in the background, but both Sonos and Siri have nailed this.
Let me know if it works for you. I use a 4b model on what I believe is underpowered hardware, and I think that affects my results. I was told 8b is about the minimum for good results. Curious how this works for others.
Sonos, Siri and Alexa are not a good comparison. These are financially backed companies with dedicated development resources, not really comparable to an open source project. It took them several years to get as good as they are. People forget how bad they were in the beginning. The hardware is also vastly superior in some cases, if not all.
As I already said, it’s all about streaming ASR and data processing. For example, the server receives a stream of recognized text and at each step checks it against a dictionary of expected commands. If the command is spoken, then the session ends.
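A minimal sketch of that loop in Python, assuming the server sees cumulative partial transcripts as strings. The command dictionary, the action strings, and the function names are hypothetical placeholders, not a real Home Assistant API.

```python
import re

# Hypothetical dictionary of expected commands mapped to made-up action names.
EXPECTED_COMMANDS = {
    "turn on the lights in living room": "light.turn_on/living_room",
    "turn off the lights in living room": "light.turn_off/living_room",
}


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def run_session(partials, commands=EXPECTED_COMMANDS):
    """Check each partial transcript against the dictionary.

    Returns (action, steps_consumed). The session ends at the first
    partial that exactly matches an expected command.
    """
    steps = 0
    for partial in partials:
        steps += 1
        action = commands.get(normalize(partial))
        if action is not None:
            return action, steps    # command spoken: end the session
    return None, steps              # stream ended without a match


stream = [
    "turn on",
    "turn on the lights",
    "Turn on the lights in living room",
    "Turn on the lights in living room and in other news today",
]
print(run_session(stream))  # ('light.turn_on/living_room', 3)
```

Because the match happens as soon as the command is complete, the radio chatter in the fourth partial is never processed. Exact matching against a fixed dictionary would not cover open-ended commands such as arbitrary shopping list items; those would need pattern matching or something like an explicit stop phrase.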