8 Years of Voice Journey

So this is a short article about my journey with voice control over the past 8 years - a summary of how voice has improved in general over that time. Maybe some people will find it an interesting read.

I started relatively early with voice, around 2017. My goal was to turn the lights off via voice, since my bed was far away from the light switch. I am often driven by laziness. Sometimes you have to get up from the couch to stay on the couch…

CMU Sphinx

The first setup was CMU Sphinx on a Pi with a cheap Logitech webcam as the mic, combined with some C++ magic to send 433 MHz signals. There were wall switches that could be controlled with those cheap 433 MHz remotes.

It did not work well and I had to scream a bit, but sometimes it actually switched the lights.
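
For reference, the 433 MHz side of that setup looked roughly like the sketch below. This is not the original C++ code, just a Python approximation using the rpi-rf library; the GPIO pin and the switch codes are placeholders, as the real codes had to be sniffed from the cheap remote first.

    # Python approximation of the old C++ 433 MHz sender (illustrative only).
    # Assumes a cheap 433 MHz transmitter wired to GPIO 17 and the rpi-rf library.
    import argparse
    from rpi_rf import RFDevice

    # Placeholder codes - the real ones come from sniffing the original remote.
    CODES = {
        ("bedroom", "on"): 1381717,
        ("bedroom", "off"): 1381716,
    }

    def switch(room: str, state: str, gpio: int = 17) -> None:
        rf = RFDevice(gpio)
        rf.enable_tx()
        try:
            rf.tx_code(CODES[(room, state)], tx_proto=1, tx_pulselength=350)
        finally:
            rf.cleanup()

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("room")
        parser.add_argument("state", choices=["on", "off"])
        args = parser.parse_args()
        switch(args.room, args.state)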

Snips.ai

A bit later, around 2018, Snips.ai came along and worked surprisingly well - the Hermes MQTT protocol made it easy to hook in a bunch of intents that I wrote in Python to tell jokes and switch my 433 MHz controller for the lights.
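
The Hermes part boiled down to subscribing to intent topics on MQTT. A minimal sketch of such a Python intent handler, assuming a local broker, paho-mqtt 1.x and a made-up intent name turnOnLight:

    # Minimal Hermes-style intent handler (sketch; the intent name is made up).
    # Snips (and later Rhasspy) publish recognized intents as JSON messages
    # on hermes/intent/<intentName>.
    import json
    import paho.mqtt.client as mqtt

    def on_connect(client, userdata, flags, rc):
        client.subscribe("hermes/intent/#")

    def on_message(client, userdata, msg):
        payload = json.loads(msg.payload)
        intent = payload["intent"]["intentName"]
        if intent == "turnOnLight":
            # this is where the 433 MHz sender was called
            print("switching the lights on")

    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect("localhost", 1883)
    client.loop_forever()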

“Hey Snips, tell a joke…”. The joke was that Snips was bought by Sonos, and I experienced cloud lock-in for the first time. While you could run Snips fully offline, the training had to run in their cloud. Sonos did not take long to shut down the Snips cloud training - so my setup became frozen and I could not add new intents. I kept it alive and hoped to find a replacement.

Rhasspy

In 2019 I found a project called Rhasspy 2.4, which was essentially a drop-in replacement for Snips as it followed the same Hermes MQTT protocol. That’s also when I first learned about @synesthesiam, the mastermind behind Rhasspy.
It did not take long until I had it running, and all my intents and light control were working again - and I could finally add new intents as well. Everything was still implemented by listening directly on MQTT for the parsed intents. On the hardware side, I was still limited to switching lights and outlets via 433 MHz from the Pi.

Home Assistant and Node-RED

In 2020 I found the Home Assistant project, which allowed me to build a kind of hardware abstraction layer. With all these integrations, I started buying more appliances, threw out the 433 MHz devices and replaced them with Shellys, and essentially got my hands on a lot of other fun things.
Home Assistant did not have voice support yet, though. Rhasspy was available as an add-on, but I was running Home Assistant via Docker, so getting the two to work together without running Home Assistant OS (HAOS) directly was painful.

I managed to write a lot of glue code in Node-RED to bridge Rhasspy into Home Assistant. Node-RED collected all entities, filtered them, and used Rhasspy’s API endpoints for training to make Rhasspy aware of them. I even built some magic to be area-aware: if the satellite was in the “living room” and there was a “living room light” entity, it was enough to say “turn on the light” to only turn on the living room light. As Rhasspy also provided a lot of other API endpoints, I added Signal Messenger as well. I could write intents directly via chat or send a voice message that would be passed to Rhasspy for intent handling. My Node-RED-Signal-Rhasspy-Home Assistant glue code bridge would then take over and call the right services in Home Assistant.

My Rhasspy Glue Code Magic:

Handling Messages from and to Signal Messenger:

Integration of Signal with Rhasspy/Hermes
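
The captions above refer to Node-RED flows that are hard to reproduce here, so here is a rough Python sketch of what the entity sync part boiled down to (not the actual flow). It assumes Home Assistant’s REST /api/states endpoint and Rhasspy’s /api/slots and /api/train endpoints; hostnames, the token and the slot name are placeholders.

    # Rough sketch of the Node-RED entity sync, rewritten in Python for illustration.
    # Pull light entities from Home Assistant, push their friendly names into a
    # Rhasspy slot, then retrain so new entities become recognizable.
    import requests

    HA_URL = "http://homeassistant.local:8123"
    HA_TOKEN = "LONG_LIVED_ACCESS_TOKEN"  # placeholder
    RHASSPY_URL = "http://rhasspy.local:12101"

    def sync_lights_to_rhasspy() -> dict:
        headers = {"Authorization": f"Bearer {HA_TOKEN}"}
        states = requests.get(f"{HA_URL}/api/states", headers=headers).json()

        # keep only light entities, remember friendly name -> entity_id
        lights = {
            s["attributes"].get("friendly_name", s["entity_id"]): s["entity_id"]
            for s in states
            if s["entity_id"].startswith("light.")
        }

        # write the spoken names into a slot that the sentences reference as $lights
        requests.post(
            f"{RHASSPY_URL}/api/slots",
            params={"overwrite_all": "true"},
            json={"lights": sorted(lights)},
        )

        # retrain so Rhasspy picks up the new names
        requests.post(f"{RHASSPY_URL}/api/train")
        return lights

    if __name__ == "__main__":
        print(sync_lights_to_rhasspy())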

Year of the Voice

In 2023 Home Assistant announced the Year of the Voice. That’s also when I learned that @synesthesiam, the same person behind Rhasspy, which I had been using for 4 years, was involved. I had an Atom Echo to try, but that thing was just useless for getting anything going properly. Wyoming satellite together with Whisper was also very far from usable, and my Rhasspy setup was still way better at understanding commands and executing them. However, I read every blog post about new voice features and got my hands on an ESP32-S3-BOX-3, which was really challenging back then. It was definitely better than the Atom Echo and integrated better than my messy Wyoming satellite setup, but still not on par with my Rhasspy setup.

Speech-to-Phrase

Fast forward to early 2025: @synesthesiam released Speech-to-Phrase. It was a bit hard to set up, as the docs for running it in Docker were a bit confusing, but I managed. This finally had detection rates similar to my Rhasspy setup. Then I also got my hands on a few Voice PE devices, which had some issues in the beginning but are now working reliably. Over the past few months, I migrated most of my intents from Node-RED directly into Home Assistant.

Getting Signal Messenger running again

I still wanted to use Signal Messenger to send voice commands to open the door while waiting outside. As I wanted to get away from my Node-RED glue code magic, I rewrote my Signal bot so that it now talks directly to Home Assistant’s Assist web API endpoints.
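
The core of that rewritten bot is just a call to Home Assistant’s conversation endpoint. A minimal sketch of that part, with the URL and token as placeholders and the Signal side (e.g. signal-cli) left out; the real bot may well use the Assist pipeline WebSocket API instead of the REST endpoint shown here.

    # Minimal sketch: forward a text command to Home Assistant's Assist/conversation
    # REST endpoint (POST /api/conversation/process) and return the spoken answer.
    # URL and token are placeholders; receiving the message from Signal is omitted.
    import requests

    HA_URL = "http://homeassistant.local:8123"
    HA_TOKEN = "LONG_LIVED_ACCESS_TOKEN"  # placeholder

    def assist(text: str, language: str = "en") -> str:
        resp = requests.post(
            f"{HA_URL}/api/conversation/process",
            headers={"Authorization": f"Bearer {HA_TOKEN}"},
            json={"text": text, "language": language},
            timeout=10,
        )
        resp.raise_for_status()
        # Assist's reply, as it would be spoken back
        return resp.json()["response"]["speech"]["plain"]["speech"]

    if __name__ == "__main__":
        print(assist("open the front door"))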

So now, a bit less than two years after the “Year of the Voice” announcement, it finally feels like the Year of the Voice for me…

I want to thank the whole Home Assistant team, and especially @synesthesiam, for working on this and bringing us local control, so I never have to care about another cloud going down…

The next step is fully local LLMs that do what we want.

So the last thing to do for now…

pete@homeassistant:~/Programming/docker/rhasspy$ docker compose down
[+] Running 2/2
 ✔ Container rhasspy        Removed  0.2s
 ✔ Network rhasspy_default  Removed  0.2s

This is the kind of stuff that makes me happy to read :smiling_face_with_tear:
Thanks for sharing @async!

In many ways, I feel I’m still catching up to what Snips was able to do. It was sad what happened with Sonos, but it also allowed Rhasspy to grow.
Now with LLMs everywhere, it seems like the next logical step for open source voice will be integrating with them instead of separate modules for each part of a pipeline. The only problem is needing such a huge amount of compute power :smile:

Yesterday Voxtral was released. It feels like this could already replace half of the pipeline - from speech-to-text to intent recognition to the actual tool calling - in one step.

If this could run on AMD mobile GPU chipsets, I think we are not that far away from reasonable performance at something close to a hobbyist budget.

My journey started not that long ago.

I wanted to control my lights and other stuff locally, similar to Google Assistant, and to set timers and alarms too.

I even got a speakerphone, but in the end a standalone USB mic and speaker turned out to be the better solution.

The problems with the speakerphone were:

  • voice detection is about the same as or worse than with the USB mic, despite it having multiple mics and a noise-cancelling feature
  • its speaker sounds bad because the speakerphone works in conference mode

So my current setup is a USB mic and a good old speaker connected directly to the device where HA is running. Everything runs in Docker containers:

  • wyoming-microwakeword
  • wyoming-satellite
  • Speech-to-Phrase
  • piper

I tried a local LLM; it’s not reliable or predictable.

Speech-to-Phrase is pretty good, but still makes many mistakes.

I look forward to microWakeWord and Speech-to-Phrase improvements, and also to a way to add noise cancelling to the USB mic, because a running TV or loud music makes everything almost unusable.

Imagine if you could whisper voice commands while loud music is playing and Assist could understand them every single time! :face_holding_back_tears:

Huge thanks to the entire team, and Mike in particular, for the tremendous work done. After adding streaming, the entire ecosystem feels complete and works really well, even on low-end hardware.
Not all languages have good TTS or STT, but users can contribute by creating high-quality voice datasets - for example, by working on espeak-ng dictionaries and then training a new voice; with that, Piper can synthesize a good voice. For higher-quality results, there are plenty of cloud-based solutions.
STT is not so easy for some languages; there need to be local stakeholders in those countries ready to create a quality product (or at least to train models for Speech-to-Phrase). Otherwise we can only use Whisper from OpenAI.

One of the challenging tasks that needs to be addressed is implementing a fully continuous dialogue. To achieve this, it will be necessary to determine at the STT level (or immediately after) whether new audio data is a continuation of the dialogue or just background noise from a TV or other people’s speech. Even in closed ecosystems with cloud-based voice processing, this can be problematic. Local implementation will require unconventional solutions. Perhaps LLMs can help tackle this task as well.

My first “smart” devices were a couple of Bauhn smart power plugs from the Aldi supermarket… I used them to automate turning our electric blankets on according to a schedule. The Bauhn devices became more and more unreliable - I assumed the cloud server was massively overloaded - and the day the electric blanket turned on at 10 AM instead of 9 PM, I just unplugged them.

I noticed the rise of Alexa and Google Home, but I have a music collection and radio, and didn’t see the benefit of paying each time the device played a song. There just didn’t seem to be any other real use case. Then of course there were numerous reports of advertising based on what you had talked about last night. No thanks.

It must have been about 4 years ago that I succumbed and bought a couple of TP-Link HS110 energy monitoring power plugs (which curiously are still working while all the later models have died). I was getting fed up with having to get my phone out, log in, and start the app just to turn a light on - is that really better than standing up and walking 2 steps to the wall switch? I don’t think so… but when I found Rhasspy I became an instant convert.

I built 2 Rhasspy satellites (based on a RasPi 3A and a RasPi Zero with reSpeaker boards) and integrated them with Home Assistant and Node-RED (HA automations were shite then). They are still running now. I have been astounded at how much HA Voice Assist has developed in a few short years… but I had difficulty making the switch, and put it on hold.

I’m getting to the end of my current greenhouse project, and looking forward to purchasing a Voice Assistant PE. Given all the changes I have made to my HA over time, I think it will be wise to rebuild my HAOS server from scratch around the current state of Voice Assistant, and then try to bring my Raspberry Pi satellites up to date.


Yesterday I experimented with two wyoming-satellite instances running on the same device as HA.
One satellite uses the USB mic and speaker, and the second one the USB speakerphone. I used the same speaker as output for both, as it’s better than the speakerphone’s internal speaker.

Everything is running in Docker containers. Now in HA I have 2 Assist devices and can set different Piper voices for them. It was fun to test which one of them would trigger first.

The USB mic worked better in my tests.