An Example of a Successful Implementation

Wanted to share a success story for Voice Assistant utilizing Wyoming Satellite and the Assist pipeline. I’ve got a pretty nice little satellite setup now around the house that’s improving daily as I add additional intents and sentences. Thought it would be useful to note some of the hardware and software choices and pitfalls I ran into throughout the process in the hope it saves somebody else some time.

First, the final product:


The speaker, mic array, and Pi are held together with 3M Scotch fasteners, so I can pick the whole unit up by the speaker but also detach the three individual parts for easier setup/access.

Second, the pieces of the puzzle:


  • rPi 4B with power supply and case (~$40 USD total)
  • SD card ($5)
  • Jeecoo Speaker A10 ($12)
  • Ground loop isolator ($8)
  • ReSpeaker 4 Mic Array with case (~$70)

Obviously this isn't a "low cost" hardware option. I did set up the ATOM Echo and try a few other lower-cost solutions, including an rPi Zero 2 and other mic options, but found a variety of challenges that eventually led me to this stack. I'll also note that while I had a couple of Pis floating around the house from other projects, it was MUCH easier to start with a fresh SD card. The Pi 4 is also overkill for this job, but it kept things running smoothly, and I could actually get one in stock, unlike the slightly cheaper (<$10 difference) Pi 3, which has been harder to acquire. The speaker works great and has a volume knob for manual adjustment, which has come in handy.

The ReSpeaker mic array was the big spend, but for me it was absolutely worth it. The hardware has been great; the audio quality/pickup is the closest I've seen to the commercial products, and honestly I haven't seen anything else even in the vicinity.

I utilized this tutorial, but with the hardware mentioned above.

To start, I used the rPi Imager from here to load the 64-bit OS Lite onto the SD card. This is critical: the instructions are ambiguous about whether 64-bit is required only for VAD or for anything to work properly. I found I couldn't get Wyoming Satellite running cleanly at all without the 64-bit setup, though I could run the (soon to be deprecated?) HomeAssistant-Satellite on the 32-bit OS.
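If you're not sure which image you ended up with, a quick check over SSH:

```bash
# Should print "aarch64" on the 64-bit OS; "armv7l" means you got the 32-bit image
uname -m
```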

After loading the OS, I plugged in the Pi and booted up. Important note here: I did attempt the rPi Zero 2 W per the tutorial, but found the 2.4 GHz Wi-Fi band in my house to be noisy enough that I could barely sustain an SSH session. Given my STT options and how I was expecting to use the satellite, this was a non-starter. The better Wi-Fi experience from the beefier hardware eliminated so many potential issues that I felt the extra $20 or so was worth it.

Following the tutorial instructions, I installed Wyoming Satellite and openWakeWord and plugged in the ReSpeaker array and speaker. Important note: make sure the speaker plugs into the mic array, as that lets the onboard ReSpeaker hardware deal with feedback and the like. I tested both with aplay/arecord and they were entirely plug-and-play, no additional configuration required. I set up the services per the instructions to ensure they would be available on boot.
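For reference, a condensed sketch of what that boiled down to (flags are from the wyoming-satellite README; the ALSA device name is a placeholder, so check what `arecord -L` actually reports for your array):

```bash
# Install wyoming-satellite per the tutorial
sudo apt-get update
sudo apt-get install --no-install-recommends git python3-venv
git clone https://github.com/rhasspy/wyoming-satellite.git
cd wyoming-satellite
script/setup

# Sanity-check capture and playback (device name will vary per system)
arecord -L
arecord -D plughw:CARD=ArrayUAC10,DEV=0 -r 16000 -c 1 -f S16_LE -t wav -d 5 test.wav
aplay -D plughw:CARD=ArrayUAC10,DEV=0 test.wav

# Run the satellite itself
script/run \
  --name 'living-room' \
  --uri 'tcp://0.0.0.0:10700' \
  --mic-command 'arecord -D plughw:CARD=ArrayUAC10,DEV=0 -r 16000 -c 1 -f S16_LE -t raw' \
  --snd-command 'aplay -D plughw:CARD=ArrayUAC10,DEV=0 -r 22050 -c 1 -f S16_LE -t raw'
```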

The Wyoming Protocol integration in HA immediately detected the new satellite and provided a device and some very nice entities for it out of the box. I didn't need to mess with that configuration significantly, as the ReSpeaker array also has some built-in noise suppression capabilities. I intend to continue to tune it over time, but the OOB experience was good enough to put it in my living room and not get yelled at by my wife for false positives or negatives from the unit (the ultimate test).

For my Assist pipelines, I have two configured: one wired to an OpenAI integration per this tutorial, and the other a standard HA pipeline for controlling the home. Obviously the eventual goal is to merge them, but I haven't found a solution I personally like for that yet (I know there are a few out there, and they look great; I just haven't spent enough time exploring them). With the OpenAI pipeline, I did add some custom instructions to keep replies to 2-3 sentences so long responses don't time out Piper TTS; a paraphrased example is below.
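The instructions are nothing fancy; they just go in the prompt field of the OpenAI integration's options. Mine are a variation on something like this (paraphrased, not my exact wording):

```text
You are a voice assistant for our home. Answer in plain spoken
language with no markdown, lists, or code. Keep every reply to
two or three short sentences.
```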

As for Piper: I found it to be really solid. Not absolutely perfect, but for a completely local TTS model, it's been more than sufficient. Plus my younger kid likes playing with the various voice configurations, so that's a nice little bonus.

I tried like crazy to get an acceptable Whisper model working locally and just couldn't make it work. It was either too slow or, in the case of the OpenAI pipeline, just couldn't understand enough of my speech to be useful. I decided to just go with a cloud-based STT solution and am unbelievably happy I did; honestly, I wish I had saved myself the hassle and started with this approach, but you live and learn. The performance is excellent, it handles everything I've thrown at it, and despite hours and hours of testing this month already, I've racked up a grand total of $1.73 in charges. I'm using Google's STT and get <1 s latency and near-perfect results. I used this integration on HACS.

Now it's just a fun software challenge. Happy to share more on my sentence implementations or configurations if useful, but I've been having a blast "teaching" my assistant new tricks regularly. I still need to tweak the wake word parameters (a few too many false activations throughout the day; not crazy, but not perfect), tune the overall audio better (some sentences are occasionally misinterpreted), play with the LEDs on the speaker unit (they do automatic VAD and DOA visuals OOB, but I'd love to have them show wake/thinking as well), and improve my cable management behind the unit.
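For the wake word tweaks specifically, the knobs I'm planning to experiment with are on the wyoming-openwakeword side (the values here are illustrative starting points, not recommendations):

```bash
# From the wyoming-openwakeword checkout; raising --threshold and/or
# --trigger-level trades false activations for occasional missed wakes
script/run \
  --uri 'tcp://127.0.0.1:10400' \
  --preload-model 'hey_mycroft' \
  --threshold 0.7 \
  --trigger-level 2
# (the satellite points at this service with --wake-uri 'tcp://127.0.0.1:10400'
#  and selects the model with --wake-word-name 'hey_mycroft')
```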

Some things I’m hopeful for from future releases:

  • A 'failover' mode for pipelines: if the standard HA pipeline doesn't understand the command, kick it over to the OpenAI pipeline for further interpretation or response
  • Ability to "hold" a conversation, i.e. receive responses and continue in the same conversation context

Those two alone would give me a clean path to a solution fully competitive with the commercial units (Google Home/Alexa) that I refuse to have in my house.

Huge thank you to @synesthesiam and team. I had been following Rhasspy for a couple of years (Mike helped me with some troubleshooting almost two years ago on that front!) and was over the moon when he joined the HA team. The future is very bright indeed!


This is a great write-up, thank you! I've ordered some of these parts to do my own testing 🙂

Both failover and multiple conversation turns are in the works, though I can't say exactly when they'll be ready. Would you be scripting the multi-turn conversations in automations, or externally via something like Node-RED?


Also, for anyone reading this who may not realize it: if you have an HA Cloud subscription, you’ll have speech-to-text (STT) and text-to-speech (TTS) services available out of the box with no extra $$$.

If not, you’ll need to install something extra like the Whisper add-on or the mentioned Google STT integration. I’ve also got a Vosk STT add-on available that is much faster (but often less accurate) than Whisper.


Nice! I'm looking around to see different implementations and hardware. I'm currently awaiting an Onju Voice board to mod a Google Nest Mini, and I also have a Muse Luxe and an M5. I knew the M5 was just for testing purposes, but the Luxe also has trouble getting my speech right. So I might try something like this!


Definitely multi-turn conversations via automation for me. It feels like a set of Assist pipeline services that let you retain the conversation context, the way you can via the WebSocket API, would be most useful, but admittedly I haven't fully thought it all the way through!
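To illustrate the kind of round-trip I mean, here's a rough sketch against the REST /api/conversation/process endpoint (whether it honors a conversation_id in the request the way the WebSocket command does is exactly the part I haven't verified):

```bash
# $HA_TOKEN is a long-lived access token from your HA user profile
curl -X POST "http://homeassistant.local:8123/api/conversation/process" \
  -H "Authorization: Bearer $HA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"text": "turn on the kitchen lights", "language": "en"}'

# The JSON response includes a conversation_id; the idea would be to pass
# it back on the next call to stay in the same conversation context
curl -X POST "http://homeassistant.local:8123/api/conversation/process" \
  -H "Authorization: Bearer $HA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"text": "and the living room too", "language": "en", "conversation_id": "PASTE_ID_HERE"}'
```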

Just as a follow-on, I'll note that I'm still tinkering with the right satellite audio adjustments, but the ability to listen to the raw audio input is super helpful; thanks for that debug feature. Finding the gain/suppression balance in each environment is a bit of trial and error.
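For anyone else tuning theirs, the knobs in question are wyoming-satellite's audio-processing flags (values here are illustrative starting points only, and the ALSA device name is a placeholder for whatever your array reports; I believe these flags also need the webrtc-noise-gain Python package in the satellite's venv, so check the README):

```bash
# --mic-auto-gain N          automatic gain target in dbFS (0-31)
# --mic-noise-suppression N  suppression strength, 0 (off) to 4 (max)
script/run \
  --name 'living-room' \
  --uri 'tcp://0.0.0.0:10700' \
  --mic-command 'arecord -D plughw:CARD=ArrayUAC10,DEV=0 -r 16000 -c 1 -f S16_LE -t raw' \
  --snd-command 'aplay -D plughw:CARD=ArrayUAC10,DEV=0 -r 22050 -c 1 -f S16_LE -t raw' \
  --mic-auto-gain 5 \
  --mic-noise-suppression 2
```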


Last minor thing (more for the community): for the wake words, it's definitely hard to gauge what "good" is in each of the models, community-provided or default. After reading through the core openWakeWord repo (what a wealth of contextual knowledge!) I have a better sense, but I ended up going with "hey mycroft" based solely on the assumption that it's one of the better-tuned models. I'd love to eventually see some consistent benchmarking, but if there's another resource/asset I should be utilizing to determine this, it would be great to know about it.


Just a quick note for anyone coming across this: the ReSpeaker array above just got a 50% price cut to $30 for the unit (lucky me, since I already bought mine 🙄), so this setup should be buildable for <$100 USD/unit.

If the ReSpeaker isn't available this cheap, would a Jabra Speak be a good choice for a speaker plus mic array? It behaves as a USB headset but is a hands-free speaker/voice-conferencing device.
https://www.amazon.nl/Jabra-Speak-Speaker-Phone-Plug/dp/B004MOWGZ2/ref=asc_df_B004MOWGZ2/?tag=nlshogostdde-21&linkCode=df0&hvadid=430562379523&hvpos=&hvnetw=g&hvrand=4986425300227540499&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9065100&hvtargid=pla-338189074546&psc=1&mcid=6e06fa3053d73dc08666b8ebdd47f5cc

Thank you for this thread! I have almost the same setup, just need to find a speaker since I have only earbuds for now…

Could you share how it's possible to listen to the audio after the gain and mic-volume processing? I couldn't find whether it was a setting or some kind of parameter to use.

I saw there's even a feature request that was opened recently:

It's the "--debug-recording-dir" option for the wyoming-satellite script/run. It saves both the wake.wav and stt.wav files, so you can listen to them with aplay.
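For example (paths and the exact output layout are illustrative; add the flag alongside whatever flags you normally run with):

```bash
# Run the satellite with debug recording enabled
script/run \
  --name 'living-room' \
  --uri 'tcp://0.0.0.0:10700' \
  --mic-command 'arecord -r 16000 -c 1 -f S16_LE -t raw' \
  --snd-command 'aplay -r 22050 -c 1 -f S16_LE -t raw' \
  --debug-recording-dir /home/pi/wyoming-debug

# After a wake word + command, find and play back what each stage heard
find /home/pi/wyoming-debug -name '*.wav'
aplay "$(find /home/pi/wyoming-debug -name 'wake.wav' | head -n 1)"
```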


Thanks a lot Don!

@kbromer I have a similar setup, with a speaker connected to the ReSpeaker, but I'm not able to use the speaker for answers, only the short sounds already on the device. Did you manage to use the satellite as a media player? Thank you!