On-device wake word on ESP32-S3 is here - Voice: Chapter 6

HAOS in an ESXi VM on an Intel Atom C3758, wondering if, once again, the lack of AVX instructions is the problem.

Edit: Just took a look at the console; apparently it was hitting an out-of-memory condition with 4 GB. Trying again with 6 GB of RAM did the trick.

My daughter was not cooperating and letting me watch the entire stream. Is it possible to edit the screen to include something like a scene selector?

Our bad, it was broken yesterday but we have fixed it now. It works again.


We had initially pushed an update to the existing boxes. However, because compilation requires a lot more memory and time (42 minutes on a Green!), we decided to revert that. People need to either update their package in the YAML or do a fresh install + adopt.

We are looking into being able to ship pre-compiled libraries, which should reduce update time and memory needs.
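For anyone updating the package reference by hand in the meantime, the relevant bit of the device YAML looks roughly like the snippet below. The repository path and file name here are from memory and may have changed, so verify them against the esphome/firmware repository first:

```yaml
# Sketch only - confirm the exact file path in the esphome/firmware repo.
packages:
  voice_assistant: github://esphome/firmware/wake-word-voice-assistant/esp32-s3-box-3.yaml@main
```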

microWakeWord depends on TensorFlow Lite for Microcontrollers, which requires 2 GB+ of free memory for compilation. We’re working on reducing that need. In the meantime, if your system can’t finish the compile, consider using the browser installer.

I watched the live stream. It has to be an ESP32-S3 and requires PSRAM, as the previous poster already stated. The Atom Echo is a vanilla ESP32. The developer who came up with this said any ESP32-S3 with PSRAM should work now, but I haven’t seen anyone posting to say they got it working. I’ve got an S3 but no mic to wire to it. It seems like removing the display stuff and changing the GPIO pins is all that’s needed, but I’m sure someone will be posting some YAML here soon for some board.
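In that spirit, here is a very rough, untested sketch of what a display-less config might look like for a generic S3 dev board with an I2S MEMS mic. The board name, GPIO numbers and option values are placeholders/assumptions on my part, and the micro_wake_word schema has been changing between releases, so check the current ESPHome docs before using any of it:

```yaml
# Rough, untested sketch for a plain ESP32-S3 board with an I2S MEMS mic.
# Board name and GPIO numbers are placeholders - adjust for your hardware.
esphome:
  name: s3-wake-word-test
# ...plus the usual wifi/api/ota sections.

esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: esp-idf

psram:
  mode: octal      # "quad" on boards with quad-SPI PSRAM
  speed: 80MHz

i2s_audio:
  i2s_lrclk_pin: GPIO4   # placeholder pins
  i2s_bclk_pin: GPIO5

microphone:
  - platform: i2s_audio
    id: mic
    adc_type: external
    i2s_din_pin: GPIO6   # placeholder pin
    pdm: false

micro_wake_word:
  model: okay_nabu
  on_wake_word_detected:
    - voice_assistant.start:

voice_assistant:
  microphone: mic
```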

After using on-device wake word (S3-BOX-3) for a couple of days, it feels snappier and recognizes the wake word even from 3 meters away, which was not the case using the HA wake word detection. It may be a placebo, but I like it! I will be trying to reuse my old Snips satellites (Pi Zero with 2-mic Pi HAT) for HA Assist, and hope for a better recognition ratio. HA Voice with wake word is amazing. Thanks for all the hard work people have been putting into this!


@synesthesiam Firstly, THANKS MIKE! With each step we are getting closer to replacing Alexa and Google devices.

  1. I assume that those of us using Rhasspy should have swapped, or be swapping, to HA Voice Assist - and seeking support here - and that the Rhasspy project (forum, GitHub and docs) will focus on the non-HA uses for Rhasspy 3 as a generic toolkit?

  2. Chapter 5 positioned the ESP32-S3-BOX-3 in the middle between the cheap M5 Atom Echo and a RasPi - both price-wise and in CPU power … but chapter 6 adds microWakeWord, which really narrows the gap between the ESP32-S3-BOX-3 and a RasPi Zero 2W with reSpeaker 2-mic HAT. Is there much benefit to the RasPi Zero to justify the extra cost?
    There is still, of course, the RasPi 4/5 as a higher-end satellite option for those wanting to do more who don’t mind a bigger budget.
    Which is your recommendation for anyone setting up a new satellite?

  3. Personally I still think Nabu Casa should produce their own voice satellite (as they did with Home Assistant Blue, Yellow and Green) … though the ESP32-S3-BOX-3 comes pretty close for now. In particular I think combining a voice assistant and a media player makes sense.

From my experience, the 2-mic HAT is superior to the S3-BOX-3, at least as long as the S3-BOX-3 within HA uses only one of the two built-in microphones. The Willow project (https://heywillow.io/) has fully integrated the S3-BOX-3 and the voice recognition is quite impressive. But for now I am quite happy with the wake word detection on the S3-BOX-3.

Regarding the general STT part, I favor the approach the Snips voice platform took of generating/training an individual model based on the user’s needs. Only the intents actually in use were trained, which helped keep false positives low, especially in a multi-language environment.
I am thinking of something like a (Nabu Casa) service that generates/trains an individual model based on the exposed entities and custom sentences/automations in your local HA instance. Although, to be honest, this sounds more like a deprecated approach that will mainly be useful for low-end devices. With AI and the increasing need for local AI processing power, the way to go may be a dedicated GPU (e.g. CUDA) at home (e.g. Pocket AI | Portable GPU | NVIDIA RTX A500 | ADLINK, https://coral.ai/).

Finding the ‘best box’ for a home automation server has been something I’ve noodled on (and spent far too much coin and brain cycles on :wink: ). The post linked below from five years ago gives an insight into my dogged clinging…

The Thunderbolt-connected device you cite is interesting. In my wasted days I’ve had a couple of Thunderbolt 2 and 3 external PCIe NVIDIA GPUs attached to HA servers in the past. Newer motherboards with Thunderbolt on them are now becoming more available in cost-effective form factors. Back to the NVIDIA device you cite: while the amazing work to decrease the memory requirements for AI models is going on at a very rapid pace, I’m not sure that a device with 4 GB of memory will be in the realm of possibilities to support the pipelines of, IMHO, the 3 to 5 models (STT, TTS, video feed processing, and the overall ‘Jarvis’ smarts) that make for a truly useful 100% local AI assistant (though I believe you will want at least your ‘Jarvis’ model to have public-internet-based knowledge via RAG). Today, and IMHO for the near future (a couple of years at least), these will need on the order of 16 GB of memory in an AI MCU.

The ‘other’ factor for a home automation AI box is power efficiency, IMHO. I’m still in the power realm of 150 to 500 watts for an Intel/AMD/NVIDIA home automation box. As I said, the new mini motherboards with Intel 13th-gen-plus and AMD’s newer laptop CPUs with Thunderbolt bring the non-GPU power down significantly. Unfortunately, from my experience so far, the NVIDIA GPU on the other end of the Thunderbolt wire with enough RAM and cores, as well as CUDA- and AMD-based AI MCUs, still idles in the high teens of watts and easily hits 200 watts during processing. I will skip the significant ‘significant other’ factor of fan noise in this discussion :wink: .

All this blab, coin expenditure and my ‘way back’ post below (which was looking for an ultimate home automation server with AI smarts, MCU virtualization and the possibility of 100% local-only processing) finally bring me to my ‘point’, AKA what I am recommending that folks keep a sharp eye on (and adopt now if you are a bit of a ‘leading edge’ experimenter):

The current Apple M-silicon-based Mac Minis (I recommend the M2, M3 or upcoming M-series MCUs over the M1, due to the 16-bit floating point hardware support of the M2 and above) are the machines that meet the objectives I cite: 100% local AI control, multi-VM capable, and all at a power consumption well under 100 watts continuous.

The MCU architecture of the Apple M silicon (ARM-based) today lets you run multiple Linux VMs with extremely high VM efficiency using the UTM hypervisor. The open source work to date, and upcoming announcements, to allow AI LLM models to run efficiently on the Apple ‘Metal’ (read this as Apple’s CUDA) GPU and MLPU layers is as ground-breaking as NVIDIA’s CUDA was 6 to 7 years ago. Unless you are ‘off the grid’ powering your home automation and can ignore power cost (how many years till those panels are ‘paid off’?), the Apple hardware’s power abilities take ‘the win’.

With UTM I can virtualize to the same level as Proxmox can manage multiple VMs and LXCs (and an argument can be made that it is ‘better’ today, due to the Apple MCU’s Rosetta software ability to emulate many CPUs, including Intel/AMD; yes, it is slow today, but you really do not need it, as Linux is moving to 100% ARM code faster than most any code work today).

As I said, give it a look; the price/power point of a 32 GB Apple Mac M2 or M3 Mini is ‘it’ for your next home automation Home Assistant server :studio_microphone: :droplet: .

My understanding is that, while the demo for the reSpeaker 2-mic HAT uses all the fancy features, the driver and source code supplied by Seeed do not. I recall someone commenting in particular that the reSpeaker driver uses only one mic.

This disconnect between the advertised “features” and what is actually supported, and the fact that Seeed stopped updating their drivers over 5 years ago, have led me to stop recommending any Seeed product. Unfortunately the several clone products (such as the Adafruit one) also use the Seeed driver.

I note that the ESP32-S3-BOX demo also oversells the device. Consider that these companies are not in the business of making and selling end products - these are demos to show the potential to system integrators. For this reason I am doubtful that Espressif will be keen to ramp up production of the ESP32-S3-BOX sufficiently to supply the demand from HA Voice Assist users.

Has anyone got this to work on a device other than the S3 box? During the livestream, at the 25-minute mark, the ESPHome contributor who came up with this said it will work with any ESP32-S3 with PSRAM and a microphone at a minimum, BUT towards the very end one of the other devs said one of the main ESPHome devs had been having “fun” with the S3 because of the various models (not sure exactly what he meant) and to post any boards or devices that work. I will just wait and see. I was thinking about the M5Stack S3 camera, which does have a mic, but I’m sure it’s terrible. At least it could be used for something else if it didn’t work out.

I also see that Espressif came out with a new ESP32-S3 Korvo development board, but it’s hard to spend that much on something we don’t even know works. It seems like it would work great IF it works; you would think it would, since it has a 3.5mm output and 3 microphones, but I’m not dealing with returns to AliExpress if it doesn’t work with HA… I was just wondering because the S3 box variants are made by Espressif too, you just can’t get one (at retail at least). I think 2 of the 3 models are discontinued anyway.

ESP32-S3-Korvo-1 Development Board Espressif Systems AIoT
https://a.aliexpress.com/_mLpKRj8

Watching the livestream, they showed the microWakeWord architecture, and based on their conversation it really seems like the GPU is used the most in the last step, or at least that is as much as I understood. That last step used audio of dinner parties recorded by someone, along with a lot of other random real-world audio without the wake word. I believe the only requirement is that the computer you generate the wake word on has 2 GB of RAM, but it probably takes longer with less hardware. When not using a GPU it took 2 days, and sticking a mid-level Nvidia card in shaved half a day off, so about a 12-hour difference roughly. I have no idea what the developer’s computer specs or OS were, though.

Regarding AI, it really seems to depend on what you’re doing as to how much a GPU helps. Things change at a rapid pace, although Nvidia has an advantage. The guy in the YouTube link below ran every LLM available on a Raspberry Pi 5 with 8 GB of RAM, with the OS on an external SSD via USB 3.0. Its response times are pretty ridiculous, but 8 GB of memory is about right for 7 billion parameters - roughly 1 GB of RAM per 1 billion parameters according to the documentation. If you don’t have enough RAM for the model size it won’t open: he tried 13 billion parameters and it wouldn’t work, but the same model at 7 billion parameters did. Regarding the Google Coral, I know it helps a ton if you use Frigate for face detection, but I’m not sure what else uses it; I’m sure some integrations I don’t use do. It doesn’t help with LLMs, at least not the Coral, due to lack of VRAM. The guy in the video below tested it and said that was the reason it didn’t make any difference. That’s not to say a future model won’t improve LLM performance significantly, though.
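To put rough numbers on that rule of thumb (my own back-of-the-envelope math, not from the video): at 8-bit quantization each weight is 1 byte, so a 7B-parameter model needs about 7 GB for weights alone, which is why it only just fits in 8 GB once the OS and context buffers are counted, and why a 13B model (roughly 13 GB at 8-bit) refuses to load. At 4-bit quantization the same 7B model shrinks to roughly 3.5 to 4 GB, while at fp16 it balloons to about 14 GB, so the “1 GB per billion parameters” guideline really describes 8-bit models.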

I built a voice pipeline using a custom integration that has to be added via HACS called “Extended OpenAI Conversation”, and it allows you to define what you want to expose to OpenAI or any LLM based on the OpenAI API. By default it only includes exposed entities (by exposed I mean exposed to a voice assistant), so you can fine-tune the results and tell it how to behave and answer. The nice thing is you can run 2 commands at once and the syntax doesn’t have to be 100 percent correct. One example was me saying “un pause shield” and my Shield TV started playing; I don’t have any scripts, switches or aliases with “un pause” in them, it just worked. I’ve mainly been using Assist Microphone (a supported add-on), which works with any USB microphone. I’m using a dedicated speakerphone and it works amazingly well; HA doesn’t have to listen 24/7, but obviously it has to plug into the HA server. Then there are my Android phone and some M5Stack Atom Echos with TailBATs (a small battery made by M5Stack), but I only turn those on when needed, or use the button instead of openWakeWord because it hogs resources. I’m looking into a local LLM because API calls are only free to a certain point, and that way everything stays local.

microWakeWord architecture

Raspberry Pi 5 running and testing every LLM (according to the guy who posted the video).

OpenAI pipeline

I’ve spent a lot of time struggling to get this board to work. My issue is that it isn’t recognising its own onboard PSRAM… I’ve contacted the seller, but… China.

Once I get it to see the PSRAM (which I think is simply an ESPHome code/config thing) it should work.
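For what it’s worth, getting ESPHome to use PSRAM under the ESP-IDF framework is usually just the psram: component; a rough sketch is below. The mode and speed values are assumptions that depend on the exact PSRAM chip on the module (quad for most 2MB parts such as the R2 variants, octal for the 8MB R8 parts), so check the board’s datasheet and the ESPHome psram docs:

```yaml
# Sketch - mode/speed must match the PSRAM chip actually on the board.
esp32:
  board: esp32-s3-devkitc-1   # placeholder, set your own board
  framework:
    type: esp-idf

psram:
  mode: quad     # use "octal" for octal-SPI PSRAM (e.g. N16R8 modules)
  speed: 80MHz
```

If the settings match the chip, the detected PSRAM size should show up in the boot log.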

In the meantime I’ve ordered some of these which apparently do work. I’ll be able to confirm once they arrive.

That does look pretty cool. Who is going to risk the cost to test it out…?

Over on the ESPHome discord in the #machine-learning channel, a user got it working with the esp32-s3-devkitc-1-wroom-1-N16R8 (link to message).


I have it working on a waveshare S3 mini

with the microphone-only version of the firmware from here.

I have tried the firmware with the speaker, and it appears not to recognise the wake word with that firmware. I have not had a lot of time to play with the speaker yet. I have also ordered some N16R8 S3s, as the memory size is apparently important. This is not my field of expertise but I am making progress.

That is very similar to the one I’m trying to get working (ESP32-S3FH4R2), but with no luck. BigBobbas has been helping me, but the ESPHome log shows the device not seeing that it has PSRAM…

UPDATE: I re-tried using the same GPIO as this example and it works now. There was obviously some strange conflict with the GPIO I had selected. So now I can safely say that the boards I linked earlier do work.

Thanks! The developer who said it wouldn’t necessarily work on just any S3 with PSRAM did mention memory differences between models as one of the issues. I thought it might be that different versions had memory from different manufacturers or something (not my area of expertise either), but it sounds like he meant it requires a specific amount of PSRAM, since that board only has 2MB, with 8MB being ideal and possibly 4MB workable. I’ll stick to 8MB, as the price difference is maybe 2 dollars, if there is even an option to choose the same model with a different amount.

Also, thanks to the other posters and links; it’s good to know the devkit and WROOM-1 appear to work as long as they have enough PSRAM. It really sounds like that’s the deciding factor, but obviously more boards need to be tested. They did mention posting any boards/models users get working. I’m guessing Discord is the place to post that information if you do get it working on a board that hasn’t already been confirmed. Thanks again!

I’ve no idea if this is an only-my-device thing, but the wake word appears to stop responding over time. This isn’t a new issue; it was happening on the old build without local wake word.

It appears to happen over time and I have to restart the device. I’ve not seen it reported on the issue trackers, so I’m leaning towards it being an issue with my S3-BOX-3.

There’s also an audible “pop” every now and then. I assume this is the microphone becoming active and is normal, but it may or may not be related.