I’d like to document my recent experiences and tinkering with voice assistants, Home Assistant, and ESPHome.
The “Wife Factor”
In order to completely replace Google Assistant / Alexa, there has to be a certain level of polish, on both the hardware and the software side.
The hardware needs to be discreet, compact, and somewhat pleasant to look at.
The software needs to work 99% of the time - the wake word needs to be recognized as reliably as it is on consumer devices, or better. And for it to be a viable replacement for consumer devices, it has to understand intent better than the competition. (LLM + device control has been a game changer in this respect.)
I think we’re getting really close to matching the overall quality / acceptance factor that a consumer device has.
What Works Well
- Voice Control of Exposed Home Devices
- LLM Integration (This is AWESOME)
- Built-in Wake Words
- Cloud-based Speech to Text
- Cloud-based Text to Speech
- Overall speed of wake word detection and LLM / TTS response
What Needs Work
Custom TTS servers (OpenAI API-compatible or otherwise)
I want to be able to control and generate my own TTS locally using xTTS or some other API-based server. Currently, there are only limited options (Piper and MaryTTS), and neither has sufficiently good quality. I was able to get this to work by hacking the OpenAI TTS component to let it connect to any local server, but I’d like to see official support for this.
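For what it’s worth, the request that an “OpenAI API-compatible” server expects is tiny. Here’s a rough sketch of it as a Home Assistant rest_command, purely to illustrate the request shape - it won’t play the returned audio, and the host, port, and voice are placeholders for whatever local xTTS-style server you run:

```yaml
# configuration.yaml - illustrative only; the URL and voice are placeholders
rest_command:
  local_tts_probe:
    url: "http://my-tts-box.local:8020/v1/audio/speech"
    method: post
    content_type: "application/json"
    # OpenAI-style speech request: model, voice, and the text to synthesize
    payload: '{"model": "tts-1", "voice": "alloy", "input": "{{ message }}"}'
```

The interesting part is just that /v1/audio/speech path and payload - any server that accepts it could, in theory, slot in behind an official integration.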
Custom Intents / Responses with LLMs
Script support has been awesome for LLMs - natural phrases like “I’m leaving” or “I’m going to bed” actually do what I want now! But there are certain cases where I want to return specific data to the LLM and have it interpret that data and speak the response to me. For example, if I say “Where is my wife?”, I want it to use my BLE room sensors if she’s at home, and otherwise fall back to her primary device location. I can return that from a script to make it easier for the LLM to understand.
Yes, I could create a helper entity for this - but I want to guide the LLM to say something specific, like “She’s at home, in her office.”
I didn’t see anything about this in the documentation, so if it is possible to return data from a script back to an LLM, let me know.
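For reference, here’s roughly the kind of script I have in mind - a minimal sketch that returns data via a `stop` action with `response_variable` (the entity IDs are placeholders, and whether the LLM integration actually picks this response up is exactly the open question above):

```yaml
script:
  locate_wife:
    alias: "Locate my wife"
    description: "Returns where my wife currently is"
    sequence:
      - variables:
          result:
            location: >-
              {% if is_state('device_tracker.wife_phone', 'home') %}
                at home, in the {{ states('sensor.wife_ble_room') }}
              {% else %}
                {{ states('device_tracker.wife_phone') }}
              {% endif %}
      # Scripts can return response data via a stop action
      - stop: "Location resolved"
        response_variable: result
```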
Granular Permissions on Satellites
Exposing different entities to different satellites / voice assistants would be beneficial. For example, I may want full control over the home in my office, but for my house guests, I may only want a subset of entities to be controlled from the kitchen.
Again, if this is possible today, please let me know how!
Wyoming Satellites
Getting this up and running on an RPi is painful. Between fighting with the audio drivers for the various audio HATs and having to manually clone the repo and install the Satellite and openWakeWord as Linux services by hand, this was not fun.
There have been cases where the service has been unreliable - it will work, but if I ask it 2 or 3 things in a row, it breaks internally and stops responding to the wake word. After about 5-10 minutes, something seems to crash or reset, and it starts working again.
TL;DR - The Wyoming Satellite (and openWakeWord) projects need some love.
Custom Wake Words
Training custom wake words is difficult. Google Colab is great and all, but if I have a dedicated GPU at home, I should be able to use it to train a custom wake word in a fraction of the time. I tried running the Colab notebook locally, but it’s really difficult. Even just having a Linux script or a small Python repository that I could git pull, install requirements for, and run locally would be good.
microWakeWord
This is awesome, and has worked well, but I would love to train my own custom micro wake words for my devices. Currently, there are only three (more or less “hard-coded”) supported models. I don’t want to have to write custom Python code to train a microWakeWord. Better instructions or a tutorial here would be awesome.
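For context, selecting one of the built-in models in ESPHome is already trivial - something like the sketch below (from memory, so double-check the schema and model names against the current ESPHome docs). What I’m after is being able to point that `model` field at something I trained myself:

```yaml
# ESPHome device config - sketch only, verify against the current micro_wake_word docs
micro_wake_word:
  models:
    - model: okay_nabu   # built-ins today: okay_nabu, hey_jarvis, hey_mycroft
  on_wake_word_detected:
    # Hand off to the voice assistant once the on-device wake word fires
    - voice_assistant.start:
```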
Whisper / Local STT
Whisper is easy enough to set up and get working, but it’s very, very slow - or, if you configure it to not be slow, it’s not accurate. Using a cloud-based STT service is an order of magnitude faster and more accurate, but I’d like to not rely on the cloud if possible - isn’t that the whole point?
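For reference, the speed/accuracy trade-off mostly comes down to which model you hand the Wyoming Whisper container when it starts. Something like this - a docker-compose sketch; the image name and flags are from memory of the rhasspy/wyoming-whisper README, so verify before copying:

```yaml
services:
  whisper:
    image: rhasspy/wyoming-whisper
    # tiny-int8 is fast but sloppy; small-int8 / medium-int8 are far more accurate and far slower on CPU
    command: --model small-int8 --language en
    ports:
      - "10300:10300"
    volumes:
      - ./whisper-data:/data
    restart: unless-stopped
```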
Pain Points / Missing Features
Playing Audio / TTS Via Wyoming Satellites
I have a device, it’s connected to Home Assistant, it has speakers, it can already play TTS audio… so why in the world can’t I send it TTS audio directly from Home Assistant? Like, when I walk into a room first thing in the morning - “Good morning!” Or providing audible home announcements like “The back door’s been left open!” This is critical. I don’t want to have separate speakers for announcements vs. voice assistants!
EDIT: See below - this works if you jump through hoops. Installing Mopidy + Snapcast is a solution, but it’s a ton of work. Wyoming should still support this natively.
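For anyone wondering what the workaround looks like in practice: once there’s a real media_player entity in the mix (the Mopidy/Snapcast endpoint, in my case), a plain tts.speak action does the job. A rough sketch, with placeholder entity IDs:

```yaml
automation:
  - alias: "Announce back door left open"
    trigger:
      - platform: state
        entity_id: binary_sensor.back_door
        to: "on"
        for: "00:05:00"
    action:
      # Speak through the Mopidy/Snapcast media_player, using whatever TTS entity is configured
      - service: tts.speak
        target:
          entity_id: tts.piper
        data:
          media_player_entity_id: media_player.kitchen_snapcast
          message: "The back door's been left open!"
```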
Voice Hardware
Voice hardware is… to be blunt: not good. I’ve tried the M5Stack Atom Echo, and I’ve tried the ESP-32-BOX / ESP32-S3-BOX. Neither meets the “Wife Factor” qualifications - they are way too quiet (there aren’t even volume controls?!), and they don’t have enough microphones to catch far-away wake words.
One of the Atom Echos that I have is completely dead (no power, no LEDs, nothing). That’s disappointing.
The ESP-32-BOX that I originally ordered was a “Gen1” - right before the S3-BOX came out. So none of the tutorials that I was following were working - very confusing. I ended up having to order an S3-BOX, wait for it to arrive, and then I was able to finally get it to work.
Hardware should be generally available - the fact that the S3-BOX is going in and out of stock means that people are going to be hesitant to adopt it.
Setting up an RPi with an audio HAT (I used the RaspiAudio MIC+ v2 with a Pi 4B) was a pain - driver installation was flaky, and volume control was also flaky (and of course, by default, it’s set to 100% - too loud, actually!). But once I got everything set up, it works really well! If there were better support/docs for specific hardware combinations (e.g. buy this RPi and this audio HAT, follow this tutorial to get everything set up, with instructions that are kept up to date), the experience would have been much better.
I know that community members have contributed forum posts, blog posts, etc. with tutorials on this stuff, but having something more official from the HA team itself would be an improvement - that’s where most people are going to start.
I’d love to hear about your own experiences - do you agree with any of the points above? Did I miss something (documentation, new feature) that I should be aware of? Let me know!