For me it’s currently mostly the overall reliability and speed of wake word and command detection. I don’t mind improvised-looking hardware or extra setup effort, but my experiments so far were nowhere near as day-to-day usable as a physical button in the hallway or a widget on my phone screen. I have to admit: I do use the fully local version on a Pi 4, so speed is limited by design. But the fully local approach, and the fact that it can run on small hardware, is what brought me to HA in the first place.
Anyway, curious to see/hear what everyone else bumps up against the most. Thanks for your time.
I’m running HA in a Docker container on my Synology. It looks like to get voice control working I either have to run a bunch of other Docker containers doing other things, set up some hardware to do certain things but still have HA in Docker, or move all things HA to dedicated hardware. And that’s just the complication to get the backend going. I haven’t even looked at the front-end hardware setups yet.
HA is a hobby project for me, so I’m not afraid to dig in, but every time I start planning, it looks like months of work with no clear idea how well it will work at the end anyway. I think I’ll wait until year 2 or 3 of voice and check again.
Speech recognition in Spanish (I think other languages have the same problem).
It does not work. Not a single sentence recognised. I have tried multiple mics: mobile phone app, webcam mic via browser and esp32.
I haven’t even been able to test wake word because of this.
I am speaking about local, as my goal is being cloud free in this topic.
Finnish is recognized very well, but there is a very long delay that makes it unusable (about 15 seconds, with Nabu Casa cloud). Maybe the delay comes from a wait/timeout in ending the recognition, because on some very rare occasions the delay is just one second.
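To illustrate how an end-of-speech wait could produce exactly this pattern, here is a minimal sketch of silence-based endpointing. It is not HA's actual code: the function name, the 20 ms frame size, the 700 ms silence window and the 15 s timeout are all assumptions chosen to mirror the numbers reported above. The idea is that listening stops either after a short run of silence or, if the detector never sees silence (e.g. background noise), only when the hard timeout fires.

```python
def end_of_command_ms(frames, frame_ms=20, silence_ms=700, timeout_ms=15000):
    """Return the time (in ms) at which listening would stop.

    frames: iterable of bools, True when a (hypothetical) voice activity
    detector judged that frame to contain speech.
    All default values are illustrative assumptions.
    """
    heard_speech = False
    silent_run = 0
    elapsed = 0
    for is_speech in frames:
        elapsed += frame_ms
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Count silence only once some speech has been heard.
            silent_run += frame_ms
            if silent_run >= silence_ms:
                return elapsed
        if elapsed >= timeout_ms:
            # Silence was never detected: fall back to the hard timeout.
            return elapsed
    return elapsed


# One second of speech followed by clean silence.
quiet_room = [True] * 50 + [False] * 1000
# Background noise that the detector keeps classifying as speech.
noisy_room = [True] * 1000

print(end_of_command_ms(quiet_room))
print(end_of_command_ms(noisy_room))
```

If something like this is what happens, the rare one-second cases would be the ones where silence was detected cleanly, and the 15-second cases the ones that ran into the timeout.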
I have an ESP32-S3-BOX-3. I trained a custom wake word with extended parameters (with GPU support), and I use Nabu Casa for TTS/STT. It looks like I can NOT yet “release it in my house” to my wife/kids because of the wake word: it’s hard to trigger. I will also try the snowboy add-on, but I don’t have high hopes that 10…20…50 recordings of my voice will work better than the “VERY trained” openWakeWord model I created. I hope for some improvements somehow…
First gen box or box3? I have the first gen box (ESP32-S3-Box) and I find the voice detection in general (both wake and command) to be pretty bad. Apparently the Box3 is better.
I’ve corrected my post now: I have the Box 3 (with the blue base).
I suspect the wake word difficulty is because I’m not a native English speaker. But I’ve generated a custom wake word which sounds extremely similar in English and Romanian (from my point of view).
Yep, reliability. Unless I speak (English) with nearly perfect intonation in a noise-free environment, it very often fails to produce a proper recognition. I’ve tried other languages that I can speak with little accent and it’s about the same.
Maybe a standalone solution would be the way to go, like some users have done with good results.
Okay, wow. It’s just a few replies so far, but it’s interesting to see that I’m not alone with the reliability issues. And I guess that’s then not just a case of “get a better microphone”.
@synesthesiam, with your detailed knowledge of the “nuts and bolts” and the complete picture of how everything works together for voice control in HA: is that something where we as a community could help, e.g. by agreeing to provide sample sounds? Or is it rather something where we need hardware manufacturers to provide better drivers for their sound hardware?
Thanks to everyone for the feedback! It’s been a great year, and I’m looking forward to polishing the voice experience for HA in the coming year.
There are many different “little” problems that are spread across all of the unique use cases people have. Some have solutions already, while others need more expertise than I can provide (or more brute force testing). For example:
S3-BOX Audio
@sparkydave mentions that Willow works better on the S3-BOX. This is the case because Willow uses Espressif’s frameworks to do audio cleanup and wake word detection locally. We use ESPHome as our foundation instead of Espressif’s code samples, and we fought for months before deciding to try again later.
Those frameworks assume they have exclusive control of the entire device, including being able to patch the OS. So for now, we have poorer audio quality than what the hardware is capable of providing. Additionally, the local wake word detection is not customizable without paying Espressif a significant amount of money. But things will get better (see below)!
Command Reliability
There are many possible causes for this, including:
Poor audio quality resulting in bad speech-to-text transcripts
This could be from a poor quality mic, or just not having noise suppression, etc. enabled
Transcripts with small errors that HA fails to recognize (e.g., “bed light” → “bad light”)
Commands spoken with unexpected wording – HA is fairly rigid right now (but this will change)
Poor language support by Whisper (if using local)
For local speech-to-text, it’s very hard to tell when things will work and when they won’t. Whisper is generally great for English, but it can depend heavily on which size of model you’re using and what sorts of entity/area names you have.
Roadmap
Some things on the roadmap for the coming year that will help:
User-tuned models for openWakeWord – provide a few audio samples and have it tune itself to recognize just your voice
Better Assist error messages and debugging tools
Using the S3-BOX-3 local audio libraries for cleanup/wake word recognition
Fuzzier (text) command recognition in HA – can correct a handful of text errors in a command
Alternative intent recognizers, including LLMs like GPT
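As a rough illustration of what fuzzier text matching could look like (not how HA implements or will implement it), here is a minimal sketch using Python's standard `difflib` module. The name list, the function name `correct_name` and the 0.8 similarity cutoff are assumptions made up for the example; it simply maps a slightly wrong transcript such as “bad light” back to the closest known name.

```python
import difflib

# Hypothetical names; a real setup would take these from the entity registry.
KNOWN_NAMES = ["bed light", "desk lamp", "kitchen main light"]


def correct_name(heard, known=KNOWN_NAMES, cutoff=0.8):
    """Return the known name closest to `heard`, or None if nothing is close.

    `cutoff` is an assumed similarity threshold (0..1); higher is stricter.
    """
    matches = difflib.get_close_matches(heard.lower(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(correct_name("bad light"))
print(correct_name("garage door"))
```

The cutoff is the trade-off knob here: too low and unrelated phrases start matching real devices, too high and small transcription slips are no longer corrected.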
Alternative Solutions
Here are a few add-ons to check out if you’re having problems and want to experiment:
Wake word detection reliability/speed
porcupine1 - limited selection of wake words, but lightweight
Awesome, thanks a lot for the detailed insights, and for all of the hard work leading up to it!
Might try out an old pi3 as a satellite (with local wakeword) over the holidays, curious to see the difference to the atom. Will make it clearer where in my setup things work better or worse.
This!! This!! This!! This makes me veeery enthusiastic!!! I hope there will be a possibility to tune it for a couple of voices, e.g. the entire family (man/woman/kid voices).
What sounds ideal from my point of view: to be able to record the wake word spoken by me, my wife and my kids multiple times (e.g. 10…20), and then to be able to do the magic so that it works at least as well as “ok google” does.
Is there something like a best practice around naming conventions you would recommend, also with the future in mind?
Example: if I have a main light in every room, HA itself doesn’t allow naming them all just “main light” (duplicate names) and using the area to specify the location. So “Switch on the living room main light” and “Switch on the kitchen main light” would only work if the room is part of the light name itself, instead of relying just on the area setting.
This is the hope! I’ve started my implementation, but haven’t been able to test it with multiple people yet. David already has this implemented in openWakeWord as “custom verifier models”, and his tests show a significant improvement in accuracy.
You can have entities with the same name, just not the same entity id, so you should be able to have “main light” in different areas.
Something missing from HA right now is using area context to disambiguate these “duplicate” names. So saying “turn on main light” should always prefer to turn on the entity/device named “main light” in the satellite’s area. This is a minor change, and is on my to-do list.
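A minimal sketch of that disambiguation rule, just to make the idea concrete. This is not HA's code: the `Entity` class, the `resolve` function and the example entity ids are all hypothetical. When several entities share a spoken name, the one in the satellite's own area wins; if that still doesn't single one out, the command stays ambiguous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an HA entity with a friendly name and area."""
    entity_id: str
    name: str
    area: str


def resolve(name, satellite_area, entities):
    """Pick the entity called `name`, preferring the satellite's area."""
    candidates = [e for e in entities if e.name.lower() == name.lower()]
    if len(candidates) <= 1:
        # Zero or one match: nothing to disambiguate.
        return candidates[0] if candidates else None
    in_area = [e for e in candidates if e.area == satellite_area]
    # Only resolve if the area narrows it down to exactly one entity.
    return in_area[0] if len(in_area) == 1 else None


lights = [
    Entity("light.kitchen_main", "main light", "kitchen"),
    Entity("light.living_room_main", "main light", "living room"),
]

print(resolve("main light", "kitchen", lights).entity_id)
```

With a rule like this, “turn on main light” spoken to the kitchen satellite would pick the kitchen light, while the same sentence from a satellite in an area without a “main light” would still need the room name.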
So far the Box 3 hasn’t quite lived up to expectations. Wake word isn’t reliable enough. It often locks up on the last step, so I have to reboot it. And not once has it played the full response at the end; it’s always cut off at the beginning, or end, or both.
I can’t say I have many issues with wake word (porcupine1). The detection is good and works from about as far as the Echo 4; it will usually falsely trigger once (maybe twice) for the duration of a movie (I don’t think that’s too bad at this point, considering it’s sitting right next to the speaker). I’m surprised this is currently #1 in the poll…
Whisper is slow, however. Smaller models are fast but unreliable (past the simple on/off command). I’ve settled on small-int8 (Intel NUC 6th gen i5 with HA OS); it’s better on reliability, but already much slower (~4 sec from end of speech to start of TTS).
It also seems to have a mind of its own: whenever it doesn’t understand a word properly, if you repeat the same thing over and over, it will give you the same result, but if you say something else and then come back to it, it might just work… (little green men in the wires, no doubt…)
And finally Piper, no problem here either. It lacks customization, that’s all. Personalizing all responses requires hijacking the text in the on_tts_start event; voice identification will be nice to help with that too. A simple append+prepend template in the config would be nice. But first, reliability and performance everywhere, then customization…
The poll doesn’t mention ESPHome per se, but I thought I’d mention the “messy” pipeline setup too. It needs a simpler declaration (like most other components) that just calls the proper functions at the proper time, without leaving the user to deal with those functions and the accompanying state/error detection logic… There seem to be issues related to audio and sleep too in quite a few setups, similar to when a jack is plugged directly into HA (on the NUC): audio being cut off, or not playing at all, if the speaker isn’t first “woken up” by a loud-enough sound or the jack unplugged/replugged. I’ve only experienced the second on my custom boards, but definitely the first one also, on the NUC itself.
Of course, all my stuff is on a breadboard for easy debugging, so I can’t comment on the aesthetics (unless you like the spaghetti look…). I haven’t looked into 3D printing an audio-optimized box yet.
There is a lack of “plug and play” options for sure, and a lack of supported day-to-day non-device-related intents. Both of these are/will be a major factor against adoption…