The Current State of Voice - July 2024

I’d like to document my recent experiences and tinkerings with regards to voice assistants, Home Assistant, and ESPHome.

The “Wife Factor”

In order to completely replace Google Assistant / Alexa, there has to be a certain level of polish involved, both on the hardware and software side.

The hardware needs to be discreet, compact, and reasonably pleasant to look at.

The software needs to work 99% of the time - the wake word needs to be recognized as reliably as it is on consumer devices, or better. And for it to be a viable replacement for consumer devices, it has to understand intent better than the competition. (LLM + device control has been a game changer in this respect. :muscle:)

I think we’re getting really close to matching the overall quality / acceptance factor that a consumer device has.

What Works Well

  • :white_check_mark: Voice Control of Exposed Home Devices
  • :white_check_mark: LLM Integration (This is AWESOME :star:)
  • :white_check_mark: Built-in Wake Words
  • :white_check_mark: Cloud-based Speech to Text
  • :white_check_mark: Cloud-based Text to Speech
  • :white_check_mark: Overall speed of wake word detection and LLM / TTS response

What Needs Work

:warning: Custom TTS servers (either OpenAI API-compatible or not)

I want to be able to control and generate my own TTS locally using xTTS or some other API-based server. Currently there are only limited options (Piper and MaryTTS), and neither has sufficiently good quality for me. I was able to get this working by hacking the OpenAI TTS component so it can connect to any local server, but I’d like to see official support for this.
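For reference, the kind of endpoint I mean is the standard OpenAI speech route - any local server that answers a request like this (an xTTS wrapper, for example) would work. The host, port, voice, and output file below are just placeholders:

curl -X POST http://localhost:8020/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "voice": "alloy", "input": "The back door has been left open.", "response_format": "wav"}' \
  --output announcement.wav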

:warning: Custom Intents / Responses with LLMs

Script support has been awesome for LLMs - natural phrases like “I’m leaving” or “I’m going to bed” actually do what I want now! But there are certain cases where I want to return specific data to the LLM and have it interpret that data and speak the response to me. For example, if I say “Where is my wife?” and she’s at home, I want it to use my BLE room sensors; otherwise, it should fall back to her primary device’s location. I could return that from a script to make it easier for the LLM to understand.

Yes, I could create a helper entity for this - but I want to guide the LLM to say something specific, like “She’s at home, in her office.”

I didn’t see anything about this in the documentation, so if it is possible to return data from a script back to an LLM, let me know. :slight_smile:
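For concreteness, here’s roughly what I have in mind - just a sketch, assuming a script’s response data (the stop action’s response_variable) can actually be surfaced to the conversation agent, and with made-up entity IDs for my wife’s person entity and BLE room sensor:

script:
  where_is_my_wife:
    alias: "Where is my wife"
    description: "Return my wife's location so the LLM can phrase the answer"
    sequence:
      # build a small dict the LLM can interpret
      - variables:
          result:
            at_home: "{{ is_state('person.wife', 'home') }}"
            location: >-
              {% if is_state('person.wife', 'home') %}
                {{ states('sensor.wife_ble_room') }}
              {% else %}
                {{ states('person.wife') }}
              {% endif %}
      # return the dict as the script's response data
      - stop: "Location looked up"
        response_variable: result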

:warning: Granular Permissions on Satellites

Exposing different entities to different satellites / voice assistants would be beneficial. For example, I may want full control over the home in my office, but for my house guests, I may only want a subset of entities to be controlled from the kitchen.

Again, if this is possible today, please let me know how!

:warning: Wyoming Satellites

Getting this up and running on an RPi is painful. Between fighting with the audio drivers for all of the various audio HATs and having to manually clone the repo and install the Satellite and openWakeWord as Linux services by hand, this was not fun.
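For anyone curious, the manual install boils down to something like this - a rough sketch from memory, so the flags and paths may differ slightly from the official tutorial:

# clone and set up the satellite
git clone https://github.com/rhasspy/wyoming-satellite.git
cd wyoming-satellite
script/setup

# /etc/systemd/system/wyoming-satellite.service (simplified)
[Unit]
Description=Wyoming Satellite
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/home/pi/wyoming-satellite/script/run \
  --name 'kitchen-satellite' \
  --uri 'tcp://0.0.0.0:10700' \
  --mic-command 'arecord -r 16000 -c 1 -f S16_LE -t raw' \
  --snd-command 'aplay -r 22050 -c 1 -f S16_LE -t raw'
WorkingDirectory=/home/pi/wyoming-satellite
Restart=always

[Install]
WantedBy=default.target

…and then a second service for wyoming-openwakeword on top of that, plus whatever the audio HAT drivers need.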

There have been cases where the service has been unreliable - it will work, but if I ask it two or three things in a row, it breaks internally and stops accepting the wake word or responding. After about 5-10 minutes, something seems to crash and reset itself, and it starts working again.

TL;DR - The Wyoming Satellite (and openWakeWord) projects need some love. :heart:

:warning: Custom Wake Words

Training custom wake words is difficult. Google Colab is great and all, but if I have a dedicated GPU at home, I should be able to use it to train a custom wake word in a fraction of the time. I tried running the Colab notebook locally, but it’s really difficult. Even just having a Linux script or a small Python repository to git pull, install requirements, and run locally would be good.
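Something like this is all I’m really asking for - a guess at what a local workflow could look like, not an official procedure (the notebook location is an assumption, and the real training flow needs more dependencies than this):

git clone https://github.com/dscripka/openWakeWord.git
cd openWakeWord
python3 -m venv .venv && source .venv/bin/activate
pip install -e . jupyterlab
# run the same training notebook the Colab is based on, against the local GPU
jupyter lab notebooks/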

:warning: microWakeWord

This is awesome and has worked well, but I would love to train my own custom micro wake words for my devices. Currently, only three (more or less “hard-coded”) wake words are supported. I don’t want to have to write custom Python code to train a microWakeWord. :upside_down_face: Better instructions or a tutorial here would be awesome.

:warning: Whisper / Local STT

Whisper is easy enough to set up and get working, but it’s very very slow - or, if you configure it to not be slow, it’s not accurate. Using a cloud-based STT service is an order of magnitude faster and more accurate, but I’d like to not rely on the cloud if possible - isn’t that the whole point? :wink:
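To be concrete, the speed/accuracy knob is basically the model size. This is roughly how the Docker image gets run (options from memory - double-check against the wyoming-addons docs):

# fast but not very accurate
docker run -it -p 10300:10300 -v whisper-data:/data \
  rhasspy/wyoming-whisper --model tiny-int8 --language en

# noticeably better, but painfully slow on a Pi or a small CPU
docker run -it -p 10300:10300 -v whisper-data:/data \
  rhasspy/wyoming-whisper --model medium-int8 --language en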

Pain Points / Missing Features

:x: Playing Audio / TTS Via Wyoming Satellites

I have a device, it’s connected to Home Assistant, it has speakers, it can already play TTS audio… so why in the world can’t I send it TTS audio directly from Home Assistant? Like, when I walk into a room first thing in the morning - “Good morning!” Or providing audible home announcements like “The back door’s been left open!” This is critical. I don’t want to have separate speakers for announcements vs. voice assistants!

EDIT: See below - this works if you jump through hoops - installing Mopidy+Snapcast is a solution, but a ton of work. Wyoming should still support this natively.

:x: Voice Hardware

Voice hardware is… to be blunt, not good. I’ve tried the M5Stack Atom Echo, and I’ve tried the ESP32-BOX / ESP32-S3-BOX. Neither meets the “Wife Factor” qualifications - they are way too quiet (there aren’t even volume controls?!), and they don’t have enough microphones to catch far-away wake words.

One of the Atom Echos that I have is completely dead (no power, no LEDs, nothing). That’s disappointing.

The ESP32-BOX that I originally ordered was a “Gen 1” - right before the S3-BOX came out. So none of the tutorials I was following worked - very confusing. I ended up having to order an S3-BOX, wait for it to arrive, and only then was I finally able to get it working.

Hardware should be generally available - the fact that the S3-BOX is going in and out of stock means that people are going to be hesitant to adopt it.

Setting up an RPi with an audio HAT (I used the RaspiAudio MIC+ v2 with a Pi 4B) was a pain - driver installation was flaky, and volume control was also flaky (and of course, by default, it’s set to 100% - too loud, actually!). But once I got everything set up, it works - really well, actually! :tada: If there were better support/docs for hardware combinations (e.g. buy this RPi and this audio HAT, follow this tutorial to get everything set up, with instructions that are kept up to date), the experience would have been much better.

I know that community members have contributed forum posts, blog posts, etc. with tutorials on this stuff, but having something more official from the HA team itself would be an improvement - that’s where most people are going to start.


I’d love to hear about your own experiences - do you agree with any of the points above? Did I miss something (documentation, new feature) that I should be aware of? Let me know!


My experience with the Pi / 2-mic HAT has been good. I have five running now, connected to amplifiers and in-ceiling speakers. There is a tutorial in the official docs for using the Pi / 2-mic HAT combo.
I also have PulseAudio and Snapcast installed so I can get whole-home audio and TTS announcements.
I no longer have any Amazon devices installed since I got these up and running.
The ESP32 options just don’t compare.

Interesting! So, you can run Wyoming Satellite / openWakeWord along with Snapcast as a snapclient and stream TTS announcements to it, and they don’t interfere with one another? I will definitely have to give this a try!

I personally chose the Raspiaudio MIC+ because it has a built-in amplifier and speakers, in addition to a microphone - making a great all-in-one HAT. But for external speakers, I would definitely choose the ReSpeaker mic HATs.

Yes. Works very well.

Note that the version of Snapcast referenced in the tutorial no longer works with the Wyoming Satellite. The tutorial hasn’t been updated yet. (This is not my repository.)
Replace those steps with the following:

# fetch the newer snapclient build for Debian bookworm and install it
cd wyoming-enhancements/snapcast/
wget https://github.com/badaix/snapcast/releases/download/v0.28.0/snapclient_0.28.0-1_armhf-bookworm.deb
sudo apt install ./snapclient_0.28.0-1_armhf-bookworm.deb

I really like using the ESP32-S3-BOX. I am using “bubbas” firmware, which adds the volume control that was greatly needed.

I spent some time and installed Mopidy (because it’s a media_player) and connected it to a Snapcast server in a Linux VM, then I installed the Snapcast clients on my RPi 4s, and you’re right - it works like a charm!

It’s quite convoluted (HA > Mopidy > Snapserver > Snapclient > RPi Speakers) which is why I hope Wyoming will implement media_player support in a future release, but this is definitely a good start!
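For anyone wanting to replicate it, the announcement side ends up being a normal tts.speak call pointed at the Mopidy media_player (the entity IDs below are whatever your TTS engine and Mopidy instance registered as):

service: tts.speak
target:
  entity_id: tts.piper              # or whichever TTS engine you use
data:
  media_player_entity_id: media_player.mopidy
  message: "The back door has been left open!"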


The satellite has the ability to play audio, even without installing MPD.
It requires some small modifications, though. Community members suggested a solution at the beginning of the year - check out this topic.
Unfortunately, the existing solutions have not been updated and conflict with the latest update - the timer functionality stops working. I think it’s easily fixable, but no one is working on it right now. You can still try it out, though.
Hopefully it will land in the main branch someday.

Regarding the installation, it is sufficient to create one instance of the device and then duplicate the SD card for the remainder. You will only need to configure the name.

Yes! I am looking forward to the “Bypass wake word” feature to make it into the main branch for Satellite. Been tracking this for some time.

And yes, I definitely duplicated my SD card for my other duplicate hardware - that is a good trick to know for sure. :sunglasses:

For local Whisper, you can run the wyoming-faster-whisper server on a dedicated GPU. It’s quite fast, and you can run the larger models, which results in good accuracy in my opinion.
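Roughly, the setup looks like this (from memory - double-check the flags against the wyoming-faster-whisper README, especially --device):

git clone https://github.com/rhasspy/wyoming-faster-whisper.git
cd wyoming-faster-whisper
script/setup
script/run --uri 'tcp://0.0.0.0:10300' --data-dir ./data \
  --model medium-int8 --language en --device cuda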

If you prefer “plain” Whisper that can also be used in other projects, there is also a community server that forwards Wyoming requests to the Whisper server. I haven’t tried that myself, though.

I’ve been working with the Assist features a lot recently and have made a fork of Stream Assist (Stream Assist CC) for experimentation. I’m not a particularly experienced developer, so I wouldn’t be comfortable (yet) contributing directly, but at the very least I can implement these features for myself and then offer to add them to the main repo once the kinks are worked out.

With that said, has anyone been able to figure out a way to reliably know when TTS playback has finished on any given media player?

I’ve managed to add support for short-term memory so follow-up questions maintain context for a few minutes after the last interaction, but this feature isn’t as intuitive or useful without the ability to skip wake words on follow-up interactions. I think I can implement wake-word skipping in the integration, but it can’t be done unless we have some way of knowing exactly when TTS playback ends.

Aside from that, I’m going to work on adding globally consistent conversation threads, so the assistant remembers context across all devices (assuming all of them are tied to the same conversation agent).

Wyoming satellites know when the TTS audio has finished playing - there is a command-line option that runs a command when that event fires. I currently have it set to change the status of an RGB LED, and it works great. Maybe take a look at how they do it? The option is --tts-played-command.
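As a rough example, my run command looks something like this (the LED scripts are obviously specific to my hardware, and I believe there’s a matching --detection-command for when the wake word fires):

script/run \
  --name 'office-satellite' \
  --uri 'tcp://0.0.0.0:10700' \
  --detection-command '/home/pi/led_wake.sh' \
  --tts-played-command '/home/pi/led_idle.sh'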

If only it could be so easy for other devices :smiling_face_with_tear:

I’m using several of the ThinkSmart View tablets in my case, so TTS playback is actually just a WAV file being cast to an app exposed as a media player. For some reason, media playback reporting for anything in Home Assistant is slow and inconsistent (I sanity-checked this with a local install of VLC, thinking it would have the lowest possible latency, but it was no different).

So technically I could use the state of the media player entity, but it would have way too much latency. Since Stream Assist can cover basically any combination of input and output devices, it would be extremely valuable to find a way to do this. So is there something about the way HA polls media players that makes this happen? If so, is there anything that could be done to correct it?

Bro, how do you set up an RGB LED for Wyoming satellites while TTS is playing? I’m looking for a solution for this.

One feature that we really use a lot is Amazon Music with Alexa. Is there any chance of getting that running with the HA voice assistant?