How to Achieve Offline Voice Control for a Smart Home Without a Wake Word Using Home Assistant

I am looking for a solution to enable offline voice control for my smart home devices without the use of a wake word. The system should allow me to control devices by simple voice commands, such as “turn on the light,” “turn off the air conditioner,” or “open/close the curtains,” without needing a specific wake word.

Requirements:

  1. No Wake Word: I want the system to operate without the need for a wake word (e.g., “Hey Siri,” “Alexa,” etc.). Voice commands should be recognized without needing any specific activation phrase.
  2. Offline Functionality: The system should work without an internet connection. However, if an offline solution is not feasible, an online solution would be acceptable, but it must still not require a wake word.
  3. Microphone-only Setup: I only want to use a microphone device for voice input, with no speakers or any form of audible feedback.

Hardware and Software Setup:

  • Home Assistant Green running HAOS
  • ReSpeaker 4-Mic Array as the only audio device (no speakers)

Previous Attempts:

I have tried using HAOS combined with the Rhasspy add-on, but encountered several issues:

  1. The ReSpeaker 4-Mic Array is not recognized correctly by HAOS or the Rhasspy add-on. There seems to be a driver issue, as I couldn’t install the driver properly.
  2. When I attempt to record audio over SSH, the recording commands are missing, so I can’t even verify that the microphone works (see the sketch after this list for the kind of test I mean).
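
For context, this is the kind of microphone test I would like to run. It’s only a sketch and assumes a Python environment with the sounddevice and soundfile packages available (which I may need to install separately, since the SSH add-on doesn’t seem to ship any recording tools):

```python
# Minimal mic sanity check. Assumes Python with the "sounddevice" and
# "soundfile" packages; these are not part of HAOS or Rhasspy.
import sounddevice as sd
import soundfile as sf

# List capture devices; the ReSpeaker 4-Mic Array should appear here
# if its driver loaded correctly.
print(sd.query_devices())

SAMPLE_RATE = 16000
SECONDS = 5

# Record five seconds of mono audio and block until capture finishes.
recording = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="int16")
sd.wait()

# Write a WAV file that can be copied off and played back elsewhere,
# since my setup has no speakers.
sf.write("mic_test.wav", recording, SAMPLE_RATE)
```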

Request:

Could anyone provide a new solution or technical approach to achieve this setup, either using Home Assistant or other compatible systems? I would greatly appreciate any suggestions or guidance.

Thank you very much!

How exactly do you expect this to work?

There’s a reason wake words are used. Otherwise you have an always-open audio channel (read: very bad from a system-utilization point of view).

Honestly it’s not at all practical.


Totally agree with @NathanCu

I would be confident in saying this could never happen on an RPi. It would need a hefty server system to ‘constantly listen’, run speech-to-text, and decide whether you really meant it to ‘do something’ (NEVER GONNA HAPPEN).


Thank you both for the feedback! I understand the concerns about not using a wake word, especially in terms of system resource usage and the need for continuous listening. I also recognize that a Raspberry Pi or similar device may struggle with this, but my setup is based on Home Assistant Green, so I’m looking for solutions that work within the constraints of that hardware.

My goal is to achieve voice control for devices like lights, AC, curtains, TV, and music, without relying on a wake word. While I understand this might not be a typical use case, I’m open to utilizing Home Assistant’s built-in capabilities or leveraging networked solutions to make this work.

I’m not necessarily looking for an entirely offline solution, but I want to find a way to make it efficient within the hardware I’m using. If you have any thoughts on how to achieve this with Home Assistant Green or any ways to minimize resource usage while keeping the functionality, I’d really appreciate your insights!

The constraints exist even on that hardware.

The problem is the open channel and continuous processing.

I wouldn’t even want it on my NUC 14.

It’s simply impractical on ANY hardware.

You need the wake word.


Yeah. This ain’t happening. As already stated, you’ll need a supercomputer for your request. You want to run it on a subcomputer.

For the audio part without a wake word, I can think of two solutions:

  • use some streaming STT (Kaldi, if you want to run on CPU) which will transcribe everything, and when you detect an intent in the transcription, execute the command (see the sketch after this list)
  • train multiple wake words, one per needed command
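
For the first option, here is a rough sketch of what I mean, using the vosk Python package as the Kaldi-based streaming recognizer. The model path and the INTENTS table are placeholders, and the actions just print instead of calling Home Assistant:

```python
# Sketch of option 1: stream the microphone through a Kaldi-based STT
# engine (the "vosk" package) and fire a command when a known phrase
# appears in the transcription.
import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

MODEL_PATH = "vosk-model-small-en-us-0.15"  # any downloaded Vosk model
SAMPLE_RATE = 16000

# Placeholder phrase -> action table; a real setup would call
# Home Assistant services instead of printing.
INTENTS = {
    "turn on the light": lambda: print("would call light.turn_on"),
    "turn off the air conditioner": lambda: print("would call climate.turn_off"),
}

audio_q: "queue.Queue[bytes]" = queue.Queue()

def on_audio(indata, frames, time, status):
    # sounddevice callback: push raw PCM chunks into a queue
    audio_q.put(bytes(indata))

recognizer = KaldiRecognizer(Model(MODEL_PATH), SAMPLE_RATE)

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000,
                       dtype="int16", channels=1, callback=on_audio):
    while True:
        # AcceptWaveform returns True once a full utterance is recognized
        if recognizer.AcceptWaveform(audio_q.get()):
            text = json.loads(recognizer.Result()).get("text", "")
            for phrase, action in INTENTS.items():
                if phrase in text:
                    action()
```

Note that even this naive substring matcher runs straight into the ambiguity problem raised below: it would happily fire on “don’t turn on the light”.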

Assuming you did have the processing capacity to run constant STT, without a wake word you don’t have a clear start marker. Consider these two conversations:

Bob wants to borrow our pressure washer, I’ll go grab it. At least they can get some use out of it since we don’t. Turn on the lights.

Can you carry this washing? Remember the kids are asleep so don’t turn on the lights.

So not only do you need to extract the text, you also need to reliably detect punctuation based on inflection and context. And since humans are good at making inferences from incomplete data, it’s common for people to talk like this:

Can you carry this washing? Remember the kids are asleep so don’t …um… turn on the lights.

And of course there might be multiple speakers, each adding context:

“What do I need to do once the guests arrive?”
“Turn off the TV and play something relaxing on spotify”

Add in multiple people talking in the same room…

“What do I need to do once the guests arrive?”
“Take the steaks out to the barbeque and offer them a drink”
“This show is stressing me out. Turn off the TV and play something relaxing on spotify”

It’s not impossible that you might solve some of these problems by tagging text with metadata to identify individual speakers (based on voice recognition and direction detection) and then feeding it into an LLM to attempt to parse out meanings. But even building the logic to determine when a chunk of text is complete, so that you can start building the prompt, is going to be a massive undertaking. And you can forget about running all this on a Green: you’re going to need multiple high-end GPUs to do it all locally.
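
To make that concrete, here is a purely illustrative sketch of the speaker-tagging idea. The Utterance type, the prompt wording, and the speaker labels are all my inventions, and deciding when the window is “complete” is exactly the hard part left out here:

```python
# Purely illustrative: utterances arrive already labelled by a
# hypothetical diarization stage, and we assemble an LLM prompt asking
# whether the last utterance is actually a command to the smart home.
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str  # e.g. from voice recognition / direction detection
    text: str

def build_prompt(window: list[Utterance]) -> str:
    transcript = "\n".join(f"[{u.speaker}] {u.text}" for u in window)
    return (
        "Below is a live household transcript. Decide whether the final\n"
        "utterance is a command addressed to the smart home. If it is,\n"
        "name the device action; otherwise answer NONE.\n\n" + transcript
    )

window = [
    Utterance("speaker_1", "What do I need to do once the guests arrive?"),
    Utterance("speaker_2", "Take the steaks out to the barbeque and offer them a drink"),
    Utterance("speaker_3", "This show is stressing me out. Turn off the TV "
                           "and play something relaxing on spotify"),
]
print(build_prompt(window))  # this string would be sent to a local LLM
```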

The wake word isn’t just there to save processing capacity. It clearly differentiates instructions to the system from normal conversation, it gives a definite starting point for parsing what the system receives, and (depending on your hardware) it lets the device adjust gain and direction so that it focuses on a specific person’s speech. All of that is critical to building an efficient, reliable, and user-friendly voice assistant.
