The Current State of Voice - July 2024

Ah yes … that “2mic_leds.service” is for the Seeed reSpeaker 2-Mic HAT boards (and the many clones and copies), using the 3 APA102 LEDs on that board.

You are correct that the 2mic_leds.service is not required.

qJake has wired an RGB LED directly to GPIO pins on his RasPi, so his method can be used fairly easily with any microphone and speaker connected (including a reSpeaker, provided the reSpeaker doesn’t use GPIOs 17, 27 and 22).

Actually the Adafruit Voice Bonnet (which I am currently using for testing) is similar to the reSpeaker but includes DotStar LEDs, so I simply did not do that part of the tutorial. I have used the --awake-wav sounds/awake.wav --done-wav sounds/done.wav options to give audio feedback to the user.
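For anyone wanting the same audio feedback, here is roughly the run command I use - a sketch based on the official wyoming-satellite tutorial, where the name, URI and arecord/aplay parameters are just my setup and should be adjusted to suit:

```sh
# Sketch of a wyoming-satellite run command with feedback sounds.
# Flags are from the official tutorial; name/device values are examples only.
script/run \
  --name 'my-satellite' \
  --uri 'tcp://0.0.0.0:10700' \
  --mic-command 'arecord -r 16000 -c 1 -f S16_LE -t raw' \
  --snd-command 'aplay -r 22050 -c 1 -f S16_LE -t raw' \
  --awake-wav sounds/awake.wav \
  --done-wav sounds/done.wav
```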

I think I should start by saying that I have been using Rhasspy with Home Assistant for a few years now on 3 Raspberry Pis with reSpeaker HATs. I’m looking forward to getting the last few wrinkles out of wyoming-satellite so I can swap all my satellites over.

User expectations
I have not used Alexa or Google devices, so I don’t have a particular standard to compare Rhasspy/Voice Assist against. Similarly, we have not got used to AI. I doubt that any open source project can ever come close to the quality or cost that the big boys can subsidise - and unfortunately Voice Assist will always be measured against that impossible standard.

Granular Permissions

There is a clear use case for recognising which person is giving a command. This has already been requested, and the response was that there are other projects working on it, so it may be possible to integrate … but it is not a priority for the near future.

Wyoming-satellite

I actually found it pretty straightforward … but maybe that’s because of my previous experience with Mike’s Rhasspy system. There certainly were challenges trying to get my head around how Rhasspy worked; and since I had something working well, I didn’t start into Voice Assist until chapter 4, when the functionality came close.

Playing Audio / TTS Via Wyoming Satellites

I hear you, brother!! RasPis already make good media players, so this seems obvious. At least expose a media_player interface so we can send TTS audio messages. Two devices both doing audio output in the same room seems overkill … so possibly make the mic a tiny separate ESP32 device (or multiple “ears” around the room) and send audio output to a separate media_player device? But if we have audio out and audio in on the same device, we could subtract the audio out (acoustic echo cancellation) and hear what else is happening in the room.

After experimentation trying to get back to minimal components, I have concluded that Voice Assist and a media_player are mutually exclusive on the same device. Squeezelite blocks wyoming-satellite; mpd coexists with it, playing music or processing voice commands … until wyoming-satellite tries to make a sound while music is playing, because both want the sound device at once.
@synesthesiam seems to keep dodging this issue, so I guess there isn’t a straightforward way to do it.
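For anyone who wants to experiment further: one workaround I’ve seen suggested (though I haven’t tested it to my satisfaction) is ALSA’s dmix plugin, which mixes multiple playback streams in software so two programs can share one card. A minimal sketch, where “hw:0,0” and the sample rate are assumptions you’d adjust for your own hardware:

```
# /etc/asound.conf — sketch only; check your card with `aplay -l`
pcm.!default {
    type plug
    slave.pcm "dmixed"
}

pcm.dmixed {
    type dmix
    ipc_key 1024        # any unique integer
    slave {
        pcm "hw:0,0"    # physical playback device
        rate 44100
    }
}
```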

Voice Hardware is a concern
RasPi with a reSpeaker HAT used to be the recommended hardware for Rhasspy … but while the Seeed reSpeaker page talks about multiple microphones, Voice Activity Detection, Direction of Arrival and Key Word Spotting (and even has a video demonstrating them), they never incorporated these features in the device driver or released any source code before they stopped supporting the reSpeaker in 2018. A simple USB mic does just as well. Add to that the supply and cost issues with RasPi during and since COVID, and that combination no longer looks attractive.

I am awaiting a Nabu Casa voice assistant product (like the ESP32-S3-BOX, but optimised for voice assistant use rather than being a technology demo), and this appears to be in development … but apparently it is waiting for some DSP (Digital Signal Processing) algorithms to make it into the public domain.

User-level documentation

This is a real “soapbox” issue for me.

  • I understand that FOSS projects start out as one developer writing something for his own use, then releasing it in case other developers find it useful.
  • I understand they want to get on with adding the next great feature instead of wasting all their time trying to explain the bleeding obvious to stupid “users”!
  • I understand that Home Assistant is a moving target, growing exponentially (even without considering all the extra components in HACS); and that it is virtually impossible to regulate.
  • I understand them hanging onto the belief that the “users” are like them - tech savvy (if not professional programmers) who enjoy the challenge of figuring things out for themselves from minimal, oblique hints.
  • I do not understand how they think that 1 million users are all experienced programmers. The user base has gradually changed - face it. I commend Nabu Casa management for focussing on improving the User Interface … but they have so far ignored that documentation is also an important part of the whole “User eXperience”.

I do acknowledge that some parts of the official documentation are better than others … but generally non-technical users have to turn to Google and YouTube videos to try to understand what the official documentation is saying. At least the Community Guides are within the home-assistant.io umbrella, even if their content is unregulated.

Sorry to rant on
</soapbox>

2 Likes

@donburch888 Regarding your soapbox comments, I’ve found several things over time that seem to prevent really good voice experiences.

The first by far is the vast difference between what groups of users want and what they are capable of running (or willing to run). Assist’s default intent recognizer is notoriously rigid, partially because of the target hardware (Pi 3/4) but mostly because any dependencies on machine learning libraries are a nightmare to maintain in HA core. Pushing better stuff into add-ons/Docker containers is the only realistic way around this, but that makes the installation process quite a bit more complex :frowning:
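To make the rigidity concrete: the recognizer matches transcripts against sentence templates, so a command only works if it fits a template. A minimal sketch of a custom template file (the file name and sentences are just examples):

```yaml
# config/custom_sentences/en/lights.yaml — illustrative sketch
language: "en"
intents:
  HassTurnOn:              # built-in intent
    data:
      - sentences:
          # [the] is optional; {name} matches a known entity name.
          # "please turn the lamp on" would NOT match either template.
          - "turn on [the] {name}"
          - "switch [the] {name} on"
```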

The biggest pain point is GPUs. Many different users want fully local voice, and you can get amazing results, but only if you have the right hardware. Even better results are possible with additional training (fine-tuning), but creating an add-on/Docker container for a training environment with a good user interface, and keeping it up to date, would be a full-time job in itself!

Hardware is the second issue, which we are fortunately working on in the form of our VoiceKit. It will have a good audio chip on board for noise filtering/echo cancellation (XMOS), run ESPHome, and be capable of playing media (an external 3.5mm port is available to attach a better speaker than the one it comes with). This is what I’ve been focused on lately, and probably will be for most of the rest of the year.

Lastly, I’ve found that I simply can’t keep up :smile: I struggle with the deluge of questions, issues, and PRs for all of the things I’ve been part of over the years. I honestly want to help, but I don’t know how best to do it. Many people have suggested giving certain contributors more rights to specific projects. I’d be happy to do this, but not many people have volunteered yet :confused:

5 Likes

Hi @synesthesiam

I want to express my deep appreciation and gratitude for all the work you and your team have been doing. Your efforts to improve voice experiences and develop new hardware solutions are truly commendable. I am particularly excited about the upcoming VoiceKit with its advanced audio capabilities.

However, I have a question regarding the VoiceKit. Do you think it will be a good replacement for a Wyoming Satellite on a Pi Zero 2 W with the ReSpeaker 2-Mics HAT? I am planning to implement an audio system in the rooms of my house, and I am unsure whether I should proceed now with the existing setup or wait for the new VoiceKit hardware.

Thank you once again for all your hard work!

1 Like

Thank you for the kind words, @vunhtun :slight_smile:

I think the VoiceKit would be a better replacement for the RPi Zero 2 W / ReSpeaker 2-Mic HAT class of Wyoming satellite, as long as you use it mostly for processing voice commands and playing media (likely with an external speaker). The ESP32-S3 chip on the VoiceKit isn’t as powerful as the RPi CPU, but it’s got plenty of power for these things. The XMOS echo cancellation is also quite good (much better than what the Mycroft Mark II had).

If you want to do more advanced things, like interact with external USB devices or a screen, the VoiceKit probably isn’t what you need. It just has LEDs, a speaker, and some expansion pins that can be accessed through ESPHome.

1 Like

Awesome that you decided to add an audio output jack to your VoiceKit to allow connecting an external speaker for playing music.

Will you please also consider adding an audio input jack to the same VoiceKit, to allow connecting an external music source (such as an analog turntable / vinyl record player) and streaming that audio to other speakers using Music Assistant?

If such an audio input jack were added, could it maybe also be used to experiment with other external microphones as well?

First, as others have said, thank you for your work on voice. We’ve definitely come a long way! :slight_smile:

Regarding hardware - yes, a pre-packaged hardware device (VoiceKit) validated to work with Home Assistant, with a (relatively) good onboarding/install experience, would likely attract people who don’t want to muck around with installing audio drivers onto Raspbian, among other things. :smiley: This would drive adoption.

Long-term, I would absolutely LOVE to see an all-in-one package that competes with the likes of A**** and G*****, that combines a decent speaker, a mic array, the necessary hardware, and pre-packaged ESP or Linux (w/ Wyoming) firmware that connects to HA seamlessly.

I know we’re a ways from that, but I think a decent chunk of the Home Assistant community is in the mindset of “I don’t like my Big Tech cloud-enabled voice assistant, but there aren’t any alternatives that offer the same experience.” So if we had this pre-packaged device that just works out of the box, a good share of that market would probably convert fully away from Big Tech solutions to HA / fully local.

One day… one day we’ll get there. :grin:

1 Like

I don’t believe this is possible with the current design directly, but I’d bet it could be achieved through the expansion ports and the right ADC.

This is the goal of the VoiceKit :slightly_smiling_face: We’re working on a “wizard” for the HA side so that users can be guided through the setup process.

There’s still a lot more to do on my end to get the intent recognizer to be less rigid. I just need the time to explore potential solutions :smile:

2 Likes

Then please consider making an official ADC expansion module with an audio input jack for the VoiceKit :pray:

Is it maybe possible to use the same type of ADC that HiFiBerry uses on some of their ADC models?

Will the VoiceKit be stackable, so that several expansion modules can be used at the same time? Then one could also add an Ethernet port (with or without PoE, Power-over-Ethernet) as well as other modules, for example an ADC audio input and/or a display.

Actually Mike, my soapbox is about HA’s documentation still being largely ‘technical notes’ while I believe the user base is now mostly non-technical users … an issue you didn’t touch on.

As far as the rest goes … thank you for confirming / clarifying most of what I had already picked up re user expectations and voice hardware.

It seems to me that there are a fair few hardware and back-end options available, and that some users seem confused. I have been thinking lately that it may be worth me writing a summary of the various options … different use cases; pros and cons; what works with what … basically collecting and summarising what you have said in previous posts … particularly for the benefit of new users who don’t read back through all the previous blog and community posts.

I am delighted to learn that VoiceKit is an official project, and especially that your level of involvement will mean that (when it comes to market) it will be the best compromise between functionality and cost. Can I place my order now? :wink:

In the meantime, I assume that wyoming-satellite is as good as it will get - at least for now - and so I need to investigate (and document for others with the same issue) how to improve the microphone audio quality so that voice detection ends cleanly instead of running into the 15-second timeout. Before installing wyoming-satellite, arecord/aplay worked well; but afterwards there is a lot of popping in the audio recorded by wyoming-satellite. Maybe that’s because of the Adafruit Voice Bonnet, or electromagnetic interference from the RasPi 4B.
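For anyone wanting to sanity-check their own microphone before blaming wyoming-satellite, this is the kind of test I mean - the device name and parameters are assumptions for a typical 16 kHz mono setup:

```sh
# Record 5 seconds from the capture device, then play it back.
# Replace "plughw:1,0" with your card (list devices with `arecord -l`).
arecord -D plughw:1,0 -r 16000 -c 1 -f S16_LE -d 5 test.wav
aplay test.wav
```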

I used to be an applications programmer … but I now see my main skill as translating fairly technical speak into user-friendly language … and I have written a couple of tutorials. I do try to do my bit in community threads and GitHub issues.

How else can I help?

1 Like

I’ve resorted to just using OpenAI’s LLM API to do this. Of course, I would prefer if everything were local, but as you have mentioned before … getting your average user to run a local LLM that’s powerful enough to parse voice intents (and on a range of hardware, no less!) is definitely a challenge!
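For anyone curious, “using the LLM API to do this” looks roughly like the sketch below - not my actual setup; the model name, prompt and schema are illustrative only (and HA’s own OpenAI Conversation integration wires this up for you):

```python
# Sketch: asking an LLM to turn a voice transcript into a structured intent.
# Requires the `openai` package and OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

def parse_intent(transcript: str) -> dict:
    """Map a transcript onto a simple {intent, entity, value} schema."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": (
                "Convert the user's command into JSON with keys "
                "'intent', 'entity' and 'value'. Respond with JSON only."
            )},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(parse_intent("turn the kitchen lights down to 30 percent"))
# e.g. {"intent": "set_brightness", "entity": "kitchen lights", "value": 30}
```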

I wonder if it would be worth it for Nabu Casa to host some private/secure LLMs in their own cloud that could be included in a subscription for HA users? I know this may not be received very well, but that could potentially be an easier path forward than local LLMs.

1 Like

It was kind of indirect, but the “technical notes” documentation is (I think) partially a reflection of the wide breadth of different use cases for Home Assistant. Nabu Casa has an official documentation writer now, and she certainly has her hands full trying to include many different variations in the tutorials to accommodate different types of users :smile:

With products like the HA Green and the upcoming VoiceKit, I think it will be easier to steer non-technical users towards a “happy path” that works without a lot of tinkering. It’s a fine line to walk, though, because Nabu Casa pushing its own products and services too much may upset the more technical folks :man_shrugging:

I don’t think so :smile: but the release window has been specifically chosen to ensure that there will be plenty of units available. They definitely did not want a situation where all the YouTubers have one, but all the other users are on back order.

The timeout issues may be due to the poor VAD currently in HA. I’m fixing that, and I just need to allow Wyoming satellites to use the local VAD alongside local wake word detection.
As for the popping, I’d guess there’s some warm-up happening with the audio chip. Using SDL2 may help, since it continuously outputs audio even if it’s silence. It didn’t work well with PulseAudio last I tested it, though :confused:
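A cruder workaround, if it is warm-up, is to keep the playback device permanently busy with silence so it never powers down - a sketch, assuming ALSA and a 16-bit device:

```sh
# Feed the DAC a continuous stream of silence in the background
# (device and format are assumptions; adjust for your card).
aplay -D plughw:0,0 -t raw -r 22050 -c 1 -f S16_LE /dev/zero &
```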

Thank you! I need to do better with the Wyoming satellite installation process. My cheesy terminal installer should really be replaced by something more robust and accessible via HTTP. Any thoughts on this?

We’ve talked about this, and it comes down to a matter of privacy and cost. The privacy aspect is huge: you basically have to send the state of your house every single time. As for cost, everyone else currently uses a credit system, which is not how HA Cloud works.
I’m still hopeful that the cost of running local LLMs will come down in the next year or two, and we will have everything in place to take advantage of it :robot:

Yes, me too. I’m hoping Nvidia is going to come up with a dedicated LLM SBC in the Jetson series for a reasonable price.

Actually I think the hurdle is more the setup than the price. A fairly decent gaming PC will do; it doesn’t have to be high end (I got mine for 800€ with a 3060 graphics card and it works very well). But the setup of the drivers, Docker containers with GPU support, etc. is not really something you can ask the average user to do.

Yeah, it’s going to be a while, although they have been making progress. You can run ha-core on the Orin (though I’m not sure if that’s the super-expensive model or not). Going by an Nvidia developer thread I was reading the other day, a lot of stuff still has to be ported to run on the GPU. Everyone there appears to be using a Wyoming satellite or assist microphone, I believe. The 8GB model isn’t exactly cheap, and it will probably be a good year or so yet, but they have made a lot of progress. We will still need to see the final hardware costs, though.

Everything running on Orin:

homeassistant-core
wyoming-faster-whisper (using faster-whisper container as dependency)
wyoming-openwakeword
wyoming-assist-microphone
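For reference, running the Wyoming parts of that stack as containers looks roughly like this - a sketch using the rhasspy Docker Hub images and their default ports; the model choices are examples, and the Orin-specific GPU passthrough is omitted:

```yaml
# docker-compose sketch for the Wyoming services listed above
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model tiny-int8 --language en
    ports:
      - "10300:10300"
    volumes:
      - ./whisper-data:/data
  openwakeword:
    image: rhasspy/wyoming-openwakeword
    command: --preload-model 'ok_nabu'
    ports:
      - "10400:10400"
```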

YES!!! She certainly does - HA is an ever-expanding moving target.
Thank you for that piece of info. After the frustration of trying to use open source without already being an expert, it is a real relief to know that Nabu Casa have acknowledged the problem and started the project.
I assume her first phase is establishing a documentation standard and guidelines to help make future documentation more consistent and user-friendly. Maybe some of us semi-technical users could help with documentation, and let gurus like yourself get back to the interesting work sooner :wink:

FYI … I have given up trying to find a media_player that can share a RasPi with wyoming-satellite, and am now setting up a separate RasPi in the same room to be a media_player. So I’m looking forward to VoiceKit, and to again having a hardware device which is easy to recommend.

That was a vote of confidence … I am a pensioner on a very limited income; but whatever it costs and whenever it’s available, I am sure VoiceKit will be worth the wait to replace my RasPis. Your mention of a “release window” is also encouraging, given how quickly the Atom Echo and ESP32-S3-BOX sold out after you mentioned them.

The popping doesn’t go away with time. Curiously, the audio files on the HA server had none of the popping recorded on the RasPi, so Whisper is doing an excellent job of tidying the audio.

I wasn’t happy with the Adafruit Voice Bonnet back when getting my Rhasspy satellites working, so my next step will be to test with a RasPi 3B, then go back to using an external USB microphone (which looks silly with a RasPi); and I have ordered an Adafruit I2S MEMS Microphone Breakout to try.

Hopefully I can work out some guidelines to help those other users still having an issue with recorded audio quality stopping VAD from working (though it will probably take me a month or two to write them up to my satisfaction).

I did notice your improved reSpeaker install script - thank you. Unfortunately I struggle with Linux and FOSS - a long way from the accounting programs I used to write under Windows or PICK.

Mike, is it worth putting much effort into the RasPi installation, given your comment above that VoiceKit should pretty much replace the RasPi Zero 2 / reSpeaker class of device as the preferred route for new users? That will leave RasPi for experienced users who want more functionality, and who are more comfortable with the current installer. Whenever I start a new test system I still follow your tutorial.

I’ve just noticed a RasPi add-on AI module, which I assume you are already aware of. The Raspberry Pi AI Kit for the RasPi 5 adds a Hailo-8L chip in M.2 2242 form factor, and comes pre-installed on the RasPi M.2 HAT+ … though the Tom’s Hardware review from 2 months back suggests that the software needs to catch up with the hardware.

FYI, Seeed Studio has announced “ReSpeaker Lite” and “ReSpeaker Lite Voice Assistant Kit” products: one 2-Mic Array board model that combines an XMOS XU-316 with an ESP32-S3 for advanced audio processing with ESPHome support, and a DIY variant of the 2-Mic Array board which you can use with your own compute solution (another MCU, or an SBC/computer such as a Raspberry Pi) via an I2S or USB connection.

Really awesome news!

I’m reading that list and seeing a lot of awesome things… and then I’m wondering, how many of those things are software vs. hardware, and how many of them can we expect to come to Wyoming Satellites as well?

For people who want to leverage existing hardware (like any name-brand assistant hardware or other castable devices), is there any internal effort supporting any-to-any I/O functionality? As in, any incoming audio stream for Assist input, routed to any castable endpoint for output? The Stream Assist HACS integration does this, and I’ve made a fork of it in an attempt to integrate this functionality more deeply with the Assist pipeline. I’ve made some progress, but it feels like there isn’t much love for this approach vs. the dedicated endpoints, so I’m stuck reverse engineering a lot of this stuff :smiling_face_with_tear:

Unless I’m thinking about this from the wrong angle, there are loads of advantages to this approach. In this example let’s assume a somewhat reliable incoming stream from an ESP32 (Android tablets are spotty at best, but also an option). With the way an integration like Stream Assist works, we can currently:

  • Have the assistant respond wherever we want
  • Better leverage existing common hardware
  • Have multiple wake words assigned to one stream, so you can call on different assistants from a single device (benefits of this include assigning one wake word to a local only assistant and another to a smarter cloud based assistant, or have different unique personalities with their own voices, or have tailored assistants for different users)

Beyond that, we also have the potential to:

  • Easily add unique wake word confirmation sounds, VAD error sounds, and STT end sounds
  • Have the assistant retain context between interactions with a timeout (I’m guessing the dedicated endpoints can do this but I haven’t worked with them much so I’m not sure)
  • Have the assistant retain context between different endpoints (I believe this is possible but I haven’t figured out how to do it yet)
  • Automatically start listening for follow-up requests for more natural interaction (Also pretty sure satellites can do this, but I’ve found a way to make it possible with non-satellite devices by capturing the duration of TTS playback, so the integration knows when playback is complete on the cast device. Unfortunately, telling the pipeline to skip the wake word after this has been extremely difficult, so it isn’t working properly yet)

There should also be ways to better leverage the more powerful hardware available on the HASS server, like more advanced noise cancellation for wake words and STT, but I haven’t seen any options to adjust this yet. It may also be possible to evaluate the decibel levels of multiple streams and only respond on the device that is likely closest to the user - see the sketch below.
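To make that last idea concrete, a rough sketch of the loudness comparison (nothing like this exists in the pipeline today as far as I can tell; the helper names are mine):

```python
# Rough sketch: given one audio chunk per satellite, guess which device is
# closest by comparing RMS loudness in dBFS. All names are hypothetical.
import math
import struct

def rms_dbfs(chunk: bytes) -> float:
    """RMS level of a 16-bit little-endian mono PCM chunk, in dBFS."""
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / 32768) if rms else float("-inf")

def closest_device(chunks: dict[str, bytes]) -> str:
    """Pick the device whose wake-word chunk is loudest."""
    return max(chunks, key=lambda device: rms_dbfs(chunks[device]))

# Two satellites heard the wake word at once; answer only on the louder one.
print(closest_device({
    "kitchen": struct.pack("<4h", 4000, -4000, 4000, -4000),
    "office": struct.pack("<4h", 500, -500, 500, -500),
}))  # -> "kitchen"
```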

All of this is to say there seem to be options on the table that have huge potential but not much traction (that I can see), so I hope listing some of these possibilities can help to bring more attention to it.