The Current State of Voice - July 2024

I’d like to document my recent experiences and tinkering with regard to voice assistants, Home Assistant, and ESPHome.

The “Wife Factor”

In order to completely replace Google Assistant / Alexa, there has to be a certain level of polish involved, both on the hardware and software side.

The hardware needs to be discreet, compact, and somewhat pleasant to look at.

The software needs to work 99% of the time - the wake word needs to be recognized as well as, or better than, on consumer devices. And for it to be a viable replacement for consumer devices, it has to understand intent better than the competition. (LLM + Device Control has been a game changer in this respect. :muscle:)

I think we’re getting really close to matching the overall quality / acceptance factor that a consumer device has.

What Works Well

  • :white_check_mark: Voice Control of Exposed Home Devices
  • :white_check_mark: LLM Integration (This is AWESOME :star:)
  • :white_check_mark: Built-in Wake Words
  • :white_check_mark: Cloud-based Speech to Text
  • :white_check_mark: Cloud-based Text to Speech
  • :white_check_mark: Overall speed of wake word detection and LLM / TTS response

What Needs Work

:warning: Custom TTS servers (whether OpenAI API-compatible or not)

I want to be able to control and generate my own TTS locally using xTTS or some other API-based server. Currently, the only options (Piper and MaryTTS) don’t offer sufficiently good quality. I was able to get this working by hacking the OpenAI TTS component to let it connect to any local server, but I’d like to see official support for this.
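
For illustration, here is the shape of the request such a server would handle - a minimal Python sketch against the OpenAI-style /v1/audio/speech endpoint, assuming a hypothetical local server on port 8000 (the model and voice names depend entirely on the server you run):

import requests  # pip install requests

# Hypothetical local server that mimics OpenAI's /v1/audio/speech endpoint
URL = "http://localhost:8000/v1/audio/speech"

resp = requests.post(
    URL,
    headers={"Authorization": "Bearer not-needed-locally"},
    json={"model": "tts-1", "voice": "alloy", "input": "The back door has been left open!"},
    timeout=30,
)
resp.raise_for_status()

# The response body is raw audio (MP3 by default on OpenAI-compatible servers)
with open("announcement.mp3", "wb") as f:
    f.write(resp.content)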

:warning: Custom Intents / Responses with LLMs

Script support has been awesome for LLMs - natural phrases like “I’m leaving” or “I’m going to bed” actually do what I want now! But there are certain cases where I want to return specific data to the LLM and have it interpret that data and speak the response to me. For example, if I say “Where is my wife?” and she’s at home, I want it to use my BLE room sensors; otherwise, use her primary device location. I can return that from a script to make it easier for the LLM to understand.

Yes, I could create a helper entity for this - but I want to guide the LLM to say something specific, like “She’s at home, in her office.”
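
To make the logic concrete, here is a rough Python sketch of the decision tree I would want such a script to return, written against Home Assistant’s REST API (the host, token, and entity IDs are hypothetical):

import requests  # pip install requests

HA = "http://homeassistant.local:8123"            # hypothetical host
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}  # long-lived access token

def get_state(entity_id):
    # GET /api/states/<entity_id> returns the entity's current state
    r = requests.get(f"{HA}/api/states/{entity_id}", headers=HEADERS, timeout=10)
    r.raise_for_status()
    return r.json()["state"]

def where_is_wife():
    # At home: prefer the (hypothetical) BLE room-presence sensor
    if get_state("person.wife") == "home":
        return f"She's at home, in her {get_state('sensor.wife_ble_room')}."
    # Away: fall back to the person tracker's zone, e.g. "work"
    return f"She's at {get_state('person.wife')}."

print(where_is_wife())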

I didn’t see anything about this in the documentation, so if it is possible to return data from a script back to an LLM, let me know. :slight_smile:

:warning: Granular Permissions on Satellites

Exposing different entities to different satellites / voice assistants would be beneficial. For example, I may want full control over the home in my office, but for my house guests, I may only want a subset of entities to be controlled from the kitchen.

Again, if this is possible today, please let me know how!

:warning: Wyoming Satellites

Getting this up and running on an RPi is painful. Between fighting with the audio drivers for all of the various audio HATs and having to manually clone the repo and install the Satellite and openWakeWord as Linux services by hand, this was not fun.

There have been cases where the service has been unreliable - it will work, but if I ask it 2 or 3 things in a row, it breaks internally and stops accepting the wake word / responding. After about 5-10 minutes - I suspect something crashes and resets - it starts working again.

TL;DR - The Wyoming Satellite (and openWakeWord) projects need some love. :heart:

:warning: Custom Wake Words

Training custom wake words is difficult. Google Colab is great and all, but if I have a dedicated GPU at home, I should be able to use it to train a custom wake word in a fraction of the time. I tried running the Colab notebook locally, but it’s really difficult. Even just having a Linux script or a small Python repository to git pull, install requirements, and run locally would be good.

:warning: microWakeWord

This is awesome and has worked well, but I would love to train my own custom micro wake words for my devices. Currently, there are only 3 (more or less “hard-coded”) supported ones. I don’t want to have to write custom Python code to train a microWakeWord. :upside_down_face: Better instructions or a tutorial here would be awesome.

:warning: Whisper / Local STT

Whisper is easy enough to set up and get working, but it’s very, very slow - or, if you configure it not to be slow, it’s not accurate. Using a cloud-based STT service is an order of magnitude faster and more accurate, but I’d like to not rely on the cloud if possible - isn’t that the whole point? :wink:
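
The tradeoff comes down to two knobs: model size (accuracy) and compute device (speed). With a GPU, the faster-whisper library (which the Wyoming Whisper server wraps) can run a larger model quickly - a minimal sketch, assuming a CUDA-capable card and a local WAV file:

from faster_whisper import WhisperModel  # pip install faster-whisper

# A bigger model buys accuracy; GPU + float16 buys back the speed.
# On CPU, you're stuck choosing between "tiny" (fast, sloppy) and waiting.
model = WhisperModel("medium.en", device="cuda", compute_type="float16")

segments, info = model.transcribe("command.wav", beam_size=5)
print("".join(segment.text for segment in segments))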

Pain Points / Missing Features

:x: Playing Audio / TTS Via Wyoming Satellites

I have a device, it’s connected to Home Assistant, it has speakers, it can already play TTS audio… so why in the world can’t I send it TTS audio directly from Home Assistant? Like, when I walk into a room first thing in the morning - “Good morning!” Or providing audible home announcements like “The back door’s been left open!” This is critical. I don’t want to have separate speakers for announcements vs. voice assistants!

EDIT: See below - this works if you jump through hoops - installing Mopidy+Snapcast is a solution, but a ton of work. Wyoming should still support this natively.
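
For reference, once something like Mopidy exposes a media_player entity, an announcement can be pushed to it through Home Assistant’s standard tts.speak service - a sketch via the REST API, with a hypothetical host, token, and entity IDs:

import requests  # pip install requests

HA = "http://homeassistant.local:8123"  # hypothetical host
HEADERS = {"Authorization": "Bearer YOUR_LONG_LIVED_TOKEN"}

requests.post(
    f"{HA}/api/services/tts/speak",
    headers=HEADERS,
    json={
        "entity_id": "tts.piper",                                 # your TTS entity
        "media_player_entity_id": "media_player.mopidy_kitchen",  # hypothetical
        "message": "The back door has been left open!",
    },
    timeout=10,
).raise_for_status()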

:x: Voice Hardware

Voice hardware is… to be blunt: not good. I’ve tried the M5Stack Atom Echo, and I’ve tried the ESP-32-BOX / ESP32-S3-BOX. Neither meets the “Wife Factor” qualifications - they are way too quiet (there aren’t even volume controls?!), and they don’t have enough microphones to catch far-away wake words.

One of the Atom Echos that I have is completely dead (no power, no LEDs, nothing). That’s disappointing.

The ESP-32-BOX that I originally ordered was a “Gen1” - right before the S3-BOX came out - so none of the tutorials I was following worked, which was very confusing. I ended up having to order an S3-BOX, wait for it to arrive, and then I was finally able to get it working.

Hardware should be generally available - the fact that the S3-BOX is going in and out of stock means that people are going to be hesitant to adopt it.

Setting up an RPi with an audio HAT (I used the RaspiAudio MIC+ v2 with a Pi 4B) was a pain - driver installation was flaky, and volume control was also flaky (and of course, by default, it’s set to 100% - too loud, actually!). But once I got everything set up, it works really well! :tada: If there were better support/docs for hardware combinations (e.g. buy this RPi and this audio HAT, follow this tutorial to get everything set up, with the instructions kept up to date), the experience would have been much better.

I know that community members have contributed forum posts, blog posts, etc. with tutorials on this stuff, but having something more official from the HA team itself would be an improvement - that’s where most people are going to start.


I’d love to hear about your own experiences - do you agree with any of the points above? Did I miss something (documentation, new feature) that I should be aware of? Let me know!


My experience with the Pi / 2-mic HAT has been good. I have 5 running now, connected to amplifiers and in-ceiling speakers. There is a tutorial in the official docs for the Pi / 2-mic HAT combo.
I also have PulseAudio and Snapcast installed, so I can get whole-home audio and TTS announcements.
I no longer have any Amazon devices installed since I got these up and running.
The ESP32 options just don’t compare.

Interesting! So, you can run Wyoming Satellite / openWakeWord along with Snapcast as a snapclient and stream TTS announcements to it, and they don’t interfere with one another? I will definitely have to give this a try!

I personally chose the RaspiAudio MIC+ because it has a built-in amplifier and speakers, in addition to a microphone - making it a great all-in-one HAT. But for external speakers, I would definitely choose the ReSpeaker mic HATs.

Yes. Works very well.

Note that the version of Snapcast referenced in the tutorial no longer works with the Wyoming Satellite. The tutorial hasn’t been updated yet. (This is not my repository.)
Replace those steps with the following:

cd wyoming-enhancements/snapcast/
wget https://github.com/badaix/snapcast/releases/download/v0.28.0/snapclient_0.28.0-1_armhf-bookworm.deb
sudo apt install ./snapclient_0.28.0-1_armhf-bookworm.deb

I really like using the ESP32-S3-BOX. I am using “bubbas” firmware, which allows volume control - that was greatly needed.

I spent some time and installed Mopidy (because it’s a media_player) and connected it to a Snapcast server in a Linux VM, then I installed the Snapcast clients on my RPi 4s, and you’re right! It works like a charm.

It’s quite convoluted (HA > Mopidy > Snapserver > Snapclient > RPi Speakers) which is why I hope Wyoming will implement media_player support in a future release, but this is definitely a good start!


The satellite has the ability to play audio, even without installing MPD - but this requires small modifications. Community members suggested a solution at the beginning of the year; check out this topic.
Unfortunately, the existing solutions have not been updated and conflict with the latest update - the timer functionality stops working. I think it’s easily fixable, but no one is working on it right now. You can still try it out, though.
Hopefully it will be merged into the main branch someday.

Regarding the installation, it is sufficient to set up one device and then duplicate the SD card for the rest. You will only need to change the name.

Yes! I am looking forward to the “Bypass wake word” feature to make it into the main branch for Satellite. Been tracking this for some time.

And yes, I definitely duplicated my SD card for my other duplicate hardware - that is a good trick to know for sure. :sunglasses:

For local Whisper, you can run the wyoming-faster-whisper server on a dedicated GPU. It’s quite fast, and you can run the larger models, which results in good accuracy in my opinion.

If you prefer “plain” Whisper that can also be used in other projects, there is also a community project that forwards Wyoming requests to a Whisper server. I haven’t tried that myself, though.

I’ve been working with the Assist features a lot recently and have made a fork of Stream Assist (Stream Assist CC) for experimentation. I’m not a particularly experienced developer, so I wouldn’t be comfortable (yet) contributing directly, but at the very least I can implement these features for myself and then offer to add them to the main repo once the kinks are worked out.

With that said, has anyone been able to figure out a way to reliably know when TTS playback has finished on any given media player?

I’ve managed to add support for short-term memory so follow-up questions maintain context for a few minutes after the last interaction, but this feature isn’t as intuitive/useful without the ability to skip wake words on follow-up interactions. I think I can implement the ability to skip wake words in the integration, but it can’t be done unless we have some way of knowing exactly when TTS playback ends.

Aside from that, I’m going to work on adding globally consistent conversation threads, so the assistant remembers context across all devices (assuming all of them are tied to the same conversation agent).

Wyoming satellites know when the TTS audio has finished playing - there is a command-line event hook you can use. I currently have it set to change the status of an RGB LED, and it works great. Maybe take a look at how they do it? The command-line option is --tts-played-command.

If only it could be so easy for other devices :smiling_face_with_tear:

I’m using several of the ThinkSmart View tablets in my case, so TTS playback is actually just a WAV file being cast to an app exposed as a media player. For some reason, media playback reporting for anything in Home Assistant is slow and inconsistent (I sanity-checked this with a local install of VLC, thinking it would have the lowest possible latency, but it was no different).

So technically I could use the state of the media player entity, but it would have way too much latency. Since Stream Assist can cover basically any combination of input and output devices, it would be extremely valuable to find a way to do this. So is there something about the way HA polls media players that makes this happen? If so, is there anything that could be done to correct it?
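
For anyone experimenting with this, the state changes themselves are easy to observe - a small sketch that watches a media_player over Home Assistant’s WebSocket API and reports when playback stops (host, token, and entity ID are hypothetical; note it surfaces the same latency described above, since the event only fires once HA learns of the state change):

import asyncio
import json
import websockets  # pip install websockets

HA_WS = "ws://homeassistant.local:8123/api/websocket"  # hypothetical host
TOKEN = "YOUR_LONG_LIVED_TOKEN"
PLAYER = "media_player.kitchen_tablet"  # hypothetical entity

async def watch_playback():
    async with websockets.connect(HA_WS) as ws:
        await ws.recv()  # server greets with auth_required
        await ws.send(json.dumps({"type": "auth", "access_token": TOKEN}))
        await ws.recv()  # auth_ok (assuming the token is valid)
        await ws.send(json.dumps(
            {"id": 1, "type": "subscribe_events", "event_type": "state_changed"}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") != "event":
                continue
            data = msg["event"]["data"]
            if data["entity_id"] != PLAYER:
                continue
            old = (data.get("old_state") or {}).get("state")
            new = (data.get("new_state") or {}).get("state")
            if old == "playing" and new in ("idle", "paused"):
                print("Playback finished (as far as HA can tell)")

asyncio.run(watch_playback())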

Bro, how do you set up an RGB LED for Wyoming satellites when TTS is playing? I’m looking for a solution for this.

One feature that we really use a lot is Amazon Music with Alexa. Is there any chance of getting that running with the HA voice assistant?

Just a simple Python script:

import RPi.GPIO as GPIO
import sys

# Pin definitions
RED_PIN = 22
GREEN_PIN = 27
BLUE_PIN = 17

def set_color(red, green, blue):
    GPIO.output(RED_PIN, red)
    GPIO.output(GREEN_PIN, green)
    GPIO.output(BLUE_PIN, blue)

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python3 rgb.py <Red 0 or 1> <Green 0 or 1> <Blue 0 or 1>")
        sys.exit(1)

    try:
        red = int(sys.argv[1])
        green = int(sys.argv[2])
        blue = int(sys.argv[3])
    except ValueError:
        print("Invalid input. Please ensure RGB values are 0 or 1.")
        sys.exit(1)

    if not (red in [0, 1] and green in [0, 1] and blue in [0, 1]):
        print("Invalid input. Please ensure RGB values are 0 or 1.")
        sys.exit(1)

    GPIO.setmode(GPIO.BCM)
    GPIO.setwarnings(False)
    GPIO.setup(RED_PIN, GPIO.OUT)
    GPIO.setup(GREEN_PIN, GPIO.OUT)
    GPIO.setup(BLUE_PIN, GPIO.OUT)

    try:
        set_color(red, green, blue)
    except KeyboardInterrupt:
        pass

Which you can invoke with command-line options to the satellite binary:

[Unit]
Description=Wyoming Satellite
Wants=network-online.target
After=network-online.target
Requires=wyoming-openwakeword.service

# --vad doesn't work :(

[Service]
Type=simple
User=voice
ExecStart=/home/voice/wyoming-satellite/script/run \
--name MyVoice \
--uri tcp://0.0.0.0:10700 \
--mic-command 'arecord -r 16000 -c 1 -f S16_LE -t raw' \
--snd-command 'aplay -r 22050 -c 1 -f S16_LE -t raw' \
--wake-uri 'tcp://127.0.0.1:10400' \
--wake-word-name 'alexa' \
--mic-auto-gain 5 \
--mic-noise-suppression 2 \
--vad-buffer-seconds 1 \
--vad-wake-word-timeout 2 \
--connected-command    '/usr/bin/python3 /home/voice/scripts/rgb.py 1 1 1' \
--detect-command       '/usr/bin/python3 /home/voice/scripts/rgb.py 0 0 1' \
--detection-command    '/usr/bin/python3 /home/voice/scripts/rgb.py 0 1 0' \
--stt-start-command    '/usr/bin/python3 /home/voice/scripts/rgb.py 0 1 1' \
--synthesize-command   '/usr/bin/python3 /home/voice/scripts/rgb.py 1 1 0' \
--tts-start-command    '/usr/bin/python3 /home/voice/scripts/rgb.py 1 0 1' \
--tts-stop-command     '/usr/bin/python3 /home/voice/scripts/rgb.py 1 0 1' \
--tts-played-command   '/usr/bin/python3 /home/voice/scripts/rgb.py 0 0 1' \
--disconnected-command '/usr/bin/python3 /home/voice/scripts/rgb.py 1 0 0' \
--error-command        '/usr/bin/python3 /home/voice/scripts/rgb.py 1 0 0'
WorkingDirectory=/home/voice/wyoming-satellite
Restart=always
RestartSec=1

[Install]
WantedBy=default.target

I have the RGB LED (which is R/G/B/Gnd) plugged into the pins defined at the top of the script (BCM 22, 27, and 17), but you can plug it into any open pins and modify the script.

And now your voice assistant will glow when it recognizes your voice, when it’s thinking, and when it’s speaking. :slight_smile:


Thank you for helping me, it’s very helpful. I have a question: does this mean we won’t need to use a separate service to control the LED if we already have a script for each event?

Would you please post your info with code in the Community Guides section of the forum, or maybe a GitHub repository?

Personally I use reSpeaker and Voice Bonnet HATs which include their own LEDs (so the code to activate my LEDs will be different) - but yours is the first demonstration of using the wyoming-satellite event commands that I have seen … and so will provide a straightforward example for others (including me) to adapt to their own requirement.

With Rhasspy I used Hermes LED Control, which allows patterns with the LEDs … so when I get the basics working (voice assist and media_player on the one RasPi, and removing the 15-second delay after I stop speaking the command), I hope to try my hand at something similar for wyoming-satellite. A good stand-alone exercise for someone (me) still getting my head around Python :wink:

Thank you for sharing :grin:

I do not properly understand your question, but I will try to answer what I think you may be asking…

  • There is no new service for rgb.py - it is run from the wyoming-satellite service each time a wyoming-satellite status change triggers one of the Event Commands.
  • Because each of these command options executes a separate CLI command, you could have different programs (or call shell scripts) for each Event Command.

Please excuse if you already understand what qJake is doing here, but there will surely be other new users who can do with a detailed explanation…

  1. qJake has one python program (script) called rgb.py which changes the state of one LED according to Red, Green and Blue parameters passed on the command line. This is a simple independent program.

  2. qJake has modified his wyoming-satellite.service unit to call rgb.py with different colour parameters for many of the events which wyoming-satellite provides hooks for.

For example according to the documentation for Event Commands (at the bottom of GitHub - rhasspy/wyoming-satellite: Remote voice satellite using Wyoming protocol) the --detect-command option is called “when wake word detection has started, but not detected yet (no stdin)”.

When qJake’s device hears a sound, and starts the wakeword detection trying to decide if someone is saying the wakeword, the
--detect-command '/usr/bin/python3 /home/voice/scripts/rgb.py 0 0 1'
line in his wyoming-satellite.service file runs the rgb.py script with no Red, no Green and 1 Blue, causing his RGB LED to glow Blue.

If the sound is detected as a wakeword, the --detection-command will turn the RGB LED Green to show that he can continue speaking his command.

And so on…

@qJake I note --tts-played-command goes back to Blue after the response has finished playing. Should this go back to White (R+G+B), or is it because listening for the wake word is the default behaviour?

I think that would be “Spouse Factor” these days.

Good thread though, thank you.

Thank you for your detailed explanation, it is very helpful for a newbie like me.

Back to my question: I’m installing wyoming-satellite following the guide at rhasspy/wyoming-satellite, as you mentioned above. They have an example script at wyoming-satellite/examples/2mic_service.py at a2bb7c8f57162a2ea5a10b56eb67334f92ff5b8e · rhasspy/wyoming-satellite · GitHub, and it runs via a separate service called 2mic_leds.service:

[Unit]
Description=2Mic LEDs

[Service]
Type=simple
ExecStart=/home/pi/wyoming-satellite/examples/.venv/bin/python3 2mic_service.py --uri 'tcp://127.0.0.1:10500'
WorkingDirectory=/home/pi/wyoming-satellite/examples
Restart=always
RestartSec=1

[Install]
WantedBy=default.target

I think we don’t need this 2mic_leds.service anymore if we have a script for the RGB LED and define an LED status for each event, like what @qJake is doing above.

Just want to ask again here to make sure that what I think is correct.