Poll: What's your biggest struggle with voice control right now?

Okay, wow. It’s only a few replies so far, but it’s interesting to see that I’m not alone with the reliability issues. And I guess it’s not just a case of “get a better microphone”, then.

@synesthesiam, with your detailed knowledge of the “nuts and bolts” and the complete picture of how everything works together for voice control in HA: is that something where we as a community could help, e.g. by agreeing to provide sample sounds? Or is it rather something where we need hardware manufacturers to provide better drivers for their sound hardware? …?

Thanks to everyone for the feedback! It’s been a great year, and I’m looking forward to polishing the voice experience for HA in the coming year :slightly_smiling_face:

There are many different “little” problems that are spread across all of the unique use cases people have. Some have solutions already, while others need more expertise than I can provide (or more brute force testing :smile: ). For example:

S3-BOX Audio

@sparkydave mentions that Willow works better on the S3-BOX. This is the case because Willow uses Espressif’s frameworks to do audio cleanup and wake word detection locally. We use ESPHome as our foundation instead of Espressif’s code samples, and we fought for months before deciding to try again later.

Those frameworks assume they have exclusive control of the entire device, including being able to patch the OS. So for now, we have poorer audio quality than what the hardware is capable of providing. Additionally, the local wake word detection is not customizable without paying Espressif a significant amount of money. But things will get better (see below)!

Command Reliability

There are many possible causes for this, including:

  • Poor audio quality resulting in bad speech-to-text transcripts
    • This could be from a poor quality mic, or just not having noise suppression, etc. enabled
  • Transcripts with small errors that HA fails to recognize (e.g., “bed light” → “bad light”)
  • Commands spoken with unexpected wording – HA is fairly rigid right now (but this will change; see the custom sentences sketch below)
  • Poor language support by Whisper (if using local)

For local speech-to-text, it’s very hard to tell when things will work and when they won’t. Whisper is generally great for English, but it can depend heavily on which size of model you’re using and what sorts of entity/area names you have.
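
On the wording side, you can already teach HA extra phrasings yourself with custom sentences. A minimal sketch (the file name and the phrasings are just examples) that adds alternative wordings for the built-in turn-on intent:

# config/custom_sentences/en/extra_wordings.yaml (sketch)
language: "en"
intents:
  HassTurnOn:
    data:
      - sentences:
          - "switch on [the] {name}"
          - "power up [the] {name}"

Here {name} matches the names of your exposed entities.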

Roadmap

Some things on the roadmap for the coming year that will help:

  • User-tuned models for openWakeWord – provide a few audio samples and have it tune itself to recognize just your voice
  • Better Assist error messages and debugging tools
  • Using the S3-BOX-3 local audio libraries for cleanup/wake word recognition
  • Fuzzier (text) command recognition in HA – can correct a handful of text errors in a command
  • Alternative intent recognizers, including LLMs like GPT

Alternative Solutions

Here are a few add-ons to check out if you’re having problems and want to experiment:

  • Wake word detection reliability/speed
  • Command reliability
    • vosk - local speech-to-text add-on where you can specify exactly which sentences can be spoken; very fast, but limited

Thanks again to everyone for the feedback; keep it coming! I’m happy to answer questions here over the holidays :christmas_tree:

6 Likes

Awesome, thanks a lot for the detailed insights :slight_smile: And for all of the hard work leading up to it!

Might try out an old Pi 3 as a satellite (with local wake word) over the holidays; curious to see the difference compared to the Atom. That will make it clearer where in my setup things work better or worse.

1 Like

This!! This!! This!! This makes me very enthusiastic!!! :slight_smile: I hope there will be a possibility to tune it for a couple of voices - e.g. the entire family (man/woman/kid voices) :slight_smile:
What sounds ideal from my point of view: to be able to record the wake word spoken by me, my wife, and my kids multiple times (e.g. 10…20), and then have the magic happen so that it works at least as well as “ok google” does :smiley:

1 Like

Is there something like a best practice around naming conventions that you would recommend, also with the future in mind?

Example: if I have a main light in every room, HA itself doesn’t allow naming them all just “main light” (duplicate names) and using the area to specify the location. So “switch on the living room main light” and “switch on the kitchen main light” would only work if the room is part of the light name itself, instead of relying just on the area setting.

This is the hope :slightly_smiling_face: I’ve started my implementation, but haven’t been able to test it with multiple people yet. David already has this implemented in openWakeWord as “custom verifier models”, and his tests show a significant improvement in accuracy.

You can have entities with the same name, just not the same entity id, so you should be able to have “main light” in different areas.

Something missing from HA right now is using area context to disambiguate these “duplicate” names. So saying “turn on main light” should always prefer to turn on the entity/device named “main light” in the satellite’s area. This is a minor change, and is on my to-do list.
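
In the meantime, a sketch of a workaround (the entity_ids here are just examples): keep unique entity_ids but give both lights the same friendly name via customize, assign each one to its area in the UI, and include the area when you speak:

# configuration.yaml (sketch)
homeassistant:
  customize:
    light.living_room_main:   # hypothetical entity_id
      friendly_name: Main light
    light.kitchen_main:       # hypothetical entity_id
      friendly_name: Main light

With that, “turn on the main light in the kitchen” should target the right entity today, and the shorter “turn on the main light” will pick the satellite’s own area once the context change lands.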

2 Likes

Lack of functionality in the Android HA app for wake word detection and for use as an always-on wall panel display.

1 Like

So far the BOX-3 hasn’t quite lived up to expectations. Wake word detection isn’t reliable enough. It often locks up on the last step, so I have to reboot it. And not once has it played the full response at the end; it’s always cut off at the beginning, or the end, or both.

My hardware (x2):

  • ESP32-S3-WROOM-1-N16R8
  • INMP441 x2
  • Jack output to standalone speakers with PCM5102a

I can’t say I have many issues with wake word detection (porcupine1): detection is good and works from about as far away as the Echo 4; it will usually falsely trigger once (maybe twice) over the duration of a movie (I don’t think that’s too bad at this point, considering it’s sitting right next to the speaker). I’m surprised this is currently #1 in the poll…

Whisper is slow, however. Smaller models are fast but unreliable (beyond simple on/off commands), so I’ve settled on small-int8 (Intel NUC 6th-gen i5 with HA OS); it’s better on reliability, but already much slower (~4 s from end of speech to start of TTS).
It also seems to have a mind of its own: whenever it doesn’t understand a word properly, repeating the same thing over and over gives the same result, but if you say something else and then come back to it, it might just work… (little green men in the wires, no doubt… :laughing:)
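
For reference, the knobs I’m playing with in the Whisper add-on configuration look roughly like this (a sketch; the exact option names and accepted values depend on the add-on version):

# Whisper add-on → Configuration (sketch)
model: small-int8   # tiny-int8/base-int8 are faster but less reliable
language: en
beam_size: 1        # raising this trades speed for accuracy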

And finally Piper: no problem here either, it just lacks customization. Personalizing all responses requires hijacking the text in the on_tts_start event; voice identification will be nice to help with that too. A simple append+prepend template in the config would be nice. But first reliability and performance everywhere, then customization…

The poll doesn’t mention ESPHome per se, but I thought I’d mention the “messy” pipeline setup too. It needs a simpler declaration (like most other components) that just calls the proper functions at the proper time, without leaving the user to deal with those functions and the accompanying state/error-detection logic. There also seem to be audio and sleep related issues in quite a few setups, similar to when a jack is plugged directly into HA (on the NUC): audio being cut off, or not playing at all, if the speaker isn’t first “woken up” by a loud-enough sound or the jack unplugged/replugged. I’ve only experienced the second on my custom boards, but definitely the first one as well, on the NUC itself.

Of course, my stuff is all on a breadboard for easy debugging, so I can’t comment on the aesthetics (unless you like the spaghetti look…). I haven’t looked into 3D printing an audio-optimized box yet.
There is a lack of “plug and play” options for sure, and a lack of supported day-to-day non-device-related intents. Both of these are/will be a major factor against adoption…

2 Likes

Hi All,

First of all, it is really great to have voice control in HA! But there are some things I do not like:
First is the absence of audible feedback on wake word detection. If my Atom Echo is out of sight, or on a sunny day, it is hard to tell whether the wake word has been detected. It would be a really important feature! Sending a response triggered by wake word detection to a continuously running media player is not an option: not everyone has one, and the confirmation would be picked up by the satellite (Atom Echo in my case) as a command.
Second is the occasional instability of the voice system in general. Even though I try to speak clearly (in Hungarian), the wake word is not always detected. Also, I get the “I did not understand that” response too frequently. I have two satellites. The first one (installed first) usually works much better than the second. The second one frequently does not work at all: after wake word detection (several tries), whatever I say, it will not understand. Then it starts to understand, but flashes slowly for 10-15 seconds before the command is actually executed. So the second satellite, identical to the first, is very unstable despite using HA Cloud.
The third thing is about custom trigger sentences for automations. These usually work, but after detecting the command and starting the requested automation, the Atom Echo flashes rapidly for 10-15 seconds or more. Only after that does the confirmation (“Done”) come and the satellite return to the listening state.
In general, I like this voice assistant very much, but there is still a lot to improve. (In Hungarian it still does not understand anything except “turn on” and “turn off”.)

It seems that I managed to solve the second problem I mentioned:
The second Atom Echo was placed where it picked up my voice too quietly. Because of that, there was no sharp, clean boundary between speech and silence, so it did not detect the end of voice activity properly and waited until the STT cycle timed out (~15 sec). My flat is a quiet place and I wanted my voice to be detected more precisely, so I added some tweaked parameters to the configuration of the Atom Echo devices using the ESPHome add-on. Those parameters override the ones downloaded from the GitHub project (Line #5).
I changed the noise suppression level to 1 (quiet place) and the volume multiplier to 5 (stronger recorded voice). With these parameters the timeout is avoided and the assistant acts and replies almost promptly.
My configuration:

substitutions:
  name: m5stack-atom-echo-0f8d14
  friendly_name: Háló asszisztens
packages:
  m5stack.atom-echo-voice-assistant: github://esphome/firmware/voice-assistant/m5stack-atom-echo.yaml@main
esphome:
  name: ${name}
  name_add_mac_suffix: false
  friendly_name: ${friendly_name}
api:
  encryption:
    key: <my-key-replaced>


wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

voice_assistant:
  noise_suppression_level: 1  # low suppression for a quiet room
  auto_gain: 31dBFS
  volume_multiplier: 5.0      # boost the recorded voice
  vad_threshold: 3


1 Like

Thank you very much. Vosk works great in Spanish. The small model is really fast on the RPi 4 and quite accurate, but the big model still works fine on the RPi 4 (1.5 s for STT) and is really accurate. And it basically just consumes RAM (around 4 GB, which is fine on my 8 GB RPi) but not much CPU.
At least I get to test the Year of the Voice.

1 Like

An issue I have is trying to get the example from the release video to work.

I had posted a comment about it here but it wasn’t getting any traction, so I thought I’d spam it here too… The compiler errors out wanting esp-adf instead of esp-idf, despite esp-adf not being a valid option (and it also fails if tried).

My code is straight from the example given on GitHub… no idea what else to do.

EDIT: found out I had a couple of lines of code missing. All good now.

1 Like

I think the lack of follow-up ability. When you start using an LLM (GPT-3/4) and it asks for more info or a follow-up, you’re out of luck… (I did hear someone say to just say the wake word and answer, but I’m not sure that works.)

We really need local processing for what HA can do, with a fallback option if HA doesn’t know, letting another assistant pipeline take over… Meaning, I want to control everything in HA locally, but also use it for conversational things.

Saying the wake word after the follow-up works.
What also works: “Is the hallway light on?” (answer is yes), then (wake word) “Turn it off please”.

If you don’t mind custom components, there are solutions for this.

I made one for example:

As for the point of this thread: my second most used Alexa feature, setting timers and reminders by voice, isn’t available out of the box, and that’s a problem for me.

I have implemented timers manually, but it was a hassle to set up, and having it built in would be nice.
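
For anyone curious, the core of what I hacked together is a sentence-triggered automation along these lines (a stripped-down sketch; the phrasing, slot name, and notify target are made up, and it assumes STT returns the number as digits):

automation:
  - alias: "Voice timer (sketch)"
    trigger:
      - platform: conversation
        command:
          - "set a timer for {minutes} minutes"
    action:
      # the wildcard value arrives as text, e.g. "5"
      - delay:
          minutes: "{{ trigger.slots.minutes | int }}"
      - service: notify.mobile_app_my_phone  # hypothetical notify target
        data:
          message: "Timer finished"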

1 Like

My voice assistant doesn’t recognize my daughter’s or my wife’s voice, only mine! It seems that female voices are a challenge for it.

1 Like

Using the voice assistant on Wear OS is very hit or miss. It rarely gets anything wrong, but sometimes picks up background noise like a TV. Also, it does very poorly at recognizing when the phrase has ended and continues to listen way longer than needed.

It’s still a very impressive feat, and I’m looking forward to it being more polished in the future!

Check out the voice reminders blueprint I have just posted :wink:

Anyone know how to add an audio acknowledgement on successful wake word detection?

I’m not always able to look at the LED to see if it’s listening.

For example, a “yes sir?” after triggering the wake word.
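
One approach that might work (an untested sketch, not something I have verified on the Atom Echo): if the satellite’s ESPHome config has an rtttl component attached to its speaker, you can play a short tone from the voice assistant’s wake word trigger. The id and the tune below are made up, and it assumes your ESPHome version supports rtttl over a speaker and the on_wake_word_detected trigger:

# added to the satellite's ESPHome YAML (sketch)
rtttl:
  speaker: echo_speaker   # hypothetical speaker id

voice_assistant:
  on_wake_word_detected:
    - rtttl.play: "ack:d=16,o=6,b=140:c,e,g"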