So I’ve updated HA to 2024.2.2 (Core), ESPHome 2024.2.0 and the firmware on my S3 Box. Unfortunately I don’t have the option to process the wake word on the device (S3 Box 3).
What am I missing? Any hints?
No, I just used the UI to update. So I thought I’d give it a try, start from scratch and reinstall my S3 Box. I used the web tool on the “ESP32-S3-BOX voice assistant” page and got this:
Installation got stuck on “Preparing Installation”.
As for the yaml file, do I need to replace it manually on the server?
No, you don’t have to. I just did mine manually because I couldn’t find any other way to get the device to show up in the ESPHome add-on dashboard. Doing it from the ESPHome Projects site should use the latest version with the on-board wake word… I would expect…
Ok, I’ve validated the install via the ESPHome UI, and it seems to be using the up-to-date yaml file.
Tried to install it via ESPHome Projects Site and got the same “undefined install” message.
Cleaned build files, reinstalled, still no option to set the wake word on device in the UI…
Actually the original firmware is missing the switch for the on-device wake word, and I cannot see any part of the config that references microWakeWord.
That’s interesting… It’s been updated / reverted since I copied the code this morning (Western Australian time).
Here is my version for the Gen 1 box (ESP32-S3-Box, not Box3). In this you can see the Wake Word stuff has been added. You should be able to transfer those bits to the Box3 version.
The on-device wake word seems to be included only in “firmware/wake-word-voice-assistant/esp32-s3-box-3.yaml”. You need to change the path to the correct firmware path (this is for the S3-BOX-3):
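As a sketch, that swap can be done with ESPHome’s remote-package mechanism; the file path matches the one quoted above, while the `@main` ref is my assumption (pin a release tag for reproducible builds):

```yaml
# Hedged sketch: pull the wake-word variant of the Box-3 firmware
# as a remote package instead of the plain voice-assistant yaml.
packages:
  voice_assistant: github://esphome/firmware/wake-word-voice-assistant/esp32-s3-box-3.yaml@main
```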
In the video covering Voice chapter 6, an idea was presented to use the on-device wake word for some on-device logic.
I really liked the idea, so the question is:
Can you detect multiple (custom) wake words on one device to toggle different peripherals connected to the same ESPHome node?
For simple logic this could save us from needing the commands interpreted via Whisper.
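A rough sketch of what that on-device branching could look like, assuming a micro_wake_word version that accepts multiple models and exposes the detected phrase to the trigger (the model names and the `desk_lamp_relay` switch are made-up placeholders, not verified against the 2024.2 component):

```yaml
# Hypothetical sketch: branch on which wake word fired, without any STT round-trip.
micro_wake_word:
  models:
    - okay_nabu
    - hey_jarvis
  on_wake_word_detected:
    # `wake_word` holds the phrase that fired, so simple on-device
    # logic can toggle a local peripheral directly.
    - if:
        condition:
          lambda: 'return wake_word == "hey_jarvis";'
        then:
          - switch.toggle: desk_lamp_relay  # placeholder peripheral
        else:
          - voice_assistant.start:
```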
Really great to see this added, all we need now is some wake word stuff built in to the HA companion app for android and I could viably ditch my Alexa devices.
It will also be interesting to see how these perform with multiple devices in the vicinity when speaking - they mentioned on the stream that they were looking at adding a “fastest wins” kind of logic, but not sure how useful that would be if you say something like “turn on the light” (using room awareness) and a device in another room hears and responds a fraction faster…
My daughter wasn’t cooperating and wouldn’t let me watch the entire stream. Is it possible to edit the screen to include something like a scene selector?
We initially had pushed an update to the existing boxes. However, because compilation requires a lot more memory and time (42 minutes on a Green!), we decided to revert that. People need to either update their package in the YAML or do a fresh install + adopt.
We are looking into being able to ship pre-compiled libraries, which should reduce update time and memory needs.
microWakeWord depends on TensorFlow Lite for Microcontrollers, which requires 2GB+ of free memory for compilation. We’re working on reducing that need. In the meantime, if your system can’t finish the compile, consider using the browser installer.
I watched the live stream. It has to be an ESP32-S3 and requires PSRAM, as the previous poster already stated. The Atom Echo is a vanilla ESP32. The developer who came up with this said any ESP32-S3 with PSRAM should work now, but I haven’t seen anyone posting saying they got it working. I’ve got an S3 but no mic to wire to it. It seems like removing the display stuff and changing the GPIO pins is all that’s needed, but I’m sure someone will be posting some yaml here soon for some board.
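For anyone wanting to try a bare S3 board, a minimal mic-only sketch might look like this (the GPIO numbers are placeholders for whatever your wiring uses; a standard I2S MEMS mic is assumed):

```yaml
# Hedged sketch for a bare ESP32-S3 + I2S microphone; GPIOs are placeholders.
esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: esp-idf

i2s_audio:
  i2s_lrclk_pin: GPIO2   # WS / LRCLK
  i2s_bclk_pin: GPIO3    # BCLK

microphone:
  - platform: i2s_audio
    id: board_mic
    adc_type: external
    i2s_din_pin: GPIO4   # mic data out
    pdm: false
```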
After using on device wake word (S3-Box3) for a couple days, it feels snappier and recognizes the wake word even from 3 meters away. Which was not the case using the HA wake word detection. It may be a placebo, but I like it! Will be trying to reuse my old snips satellites (pi zero with 2-mics-pi-hat) for HA Assistant, and hope for a better recognition ratio. HA Voice with wake word is amazing. Thanks for all the hard work people have been putting into this!
@synesthesiam Firstly, THANKS MIKE ! Each step we are getting closer to replacing Alexa and Google devices.
I assume that those of us using Rhasspy should be swapping (or have swapped) to HA Voice Assist and seeking support here, and that the Rhasspy project (forum, GitHub and docs) will focus on the non-HA uses of Rhasspy 3 as a generic toolkit?
Chapter 5 positioned the ESP32-S3-BOX-3 in the middle between the cheap M5 Atom Echo and a RasPi - both price wise and CPU power … but chapter 6 adds microWakeWord which really narrows the gap between ESP32-S3-BOX-3 and RasPi Zero 2W with reSpeaker 2-mic HAT. Is there much benefit to RasPi Zero to justify the extra cost ?
There is still, of course, RasPi 4/5 as a higher end satellite option for those wanting to do more, and don’t mind a bigger budget.
Which is your recommendation for anyone setting up a new satellite ?
Personally I still think Nabu Casa should produce their own voice satellite (as they did to produce Home Assistant Blue, Yellow and Green) … though ESP32-S3-BOX-3 comes pretty close for now. In particular I think combining voice assistant and media player makes sense.
From my experience, the 2-mic HAT is superior to the S3 Box 3, at least as long as the S3 Box 3 within HA uses only one of its two built-in microphones. The Willow project (https://heywillow.io/) has fully integrated the S3 Box 3 and the voice recognition is quite impressive. But for now I am quite happy with the wake word detection on the S3 Box 3.
Regarding the general STT part, I favor the approach the Snips voice platform took: generating/training an individual model based on the user’s needs. Only the intents actually in use were trained, which helped keep false positives low, especially in a multi-language environment.
Thinking of something like a (Nabu Casa) service that generates/trains an individual model based on the exposed entities and the custom sentences / automations of your local HA instance. Although, to be honest, this sounds more like a deprecated approach that will only be useful for low-end devices. With AI and the increasing need for local AI processing power, the way to go may be a dedicated GPU (e.g. CUDA) at home (e.g. Pocket AI | Portable GPU | NVIDIA RTX A500 | ADLINK , https://coral.ai/).
Finding the ‘best box’ for a home automation server has been something I’ve noodled on (and spent far too much coin and brain cycles on). The post linked below from five years ago gives an insight into my dogged clinging…
The Thunderbolt-connected device you cite is interesting; in my wasted days I’ve had a couple of Thunderbolt 2 and 3 external PCIe NVIDIA GPUs attached to HA servers. Newer motherboards with Thunderbolt are now becoming available in cost-effective form factors. Back to the NVIDIA device you cite: while the work to decrease the memory requirements of AI models is advancing at a very rapid pace, I’m not sure a device with 4 GB of memory will be able to support the pipelines of the 3 to 5 models that, IMHO, make up a truly useful 100% local AI assistant (STT, TTS, video feed processing, and the ‘Jarvis’ overall smarts; though I believe you will want at least your ‘Jarvis’ model to have public-internet-based knowledge via RAG). Today, and IMHO for the near future (a couple of years at least), these will need on the order of 16 GB of memory in an AI MCU.
The ‘other’ factor for a home automation AI box is power efficiency, IMHO. I’m still in the realm of 150 to 500 watts for an Intel/AMD/NVIDIA home automation box. As I said, the new mini motherboards with 13th-gen+ Intel and newer AMD laptop CPUs with Thunderbolt bring the non-GPU power down significantly. Unfortunately, in my experience so far, an NVIDIA GPU on the far side of the Thunderbolt wire with enough RAM and cores, like CUDA- and AMD-based AI MCUs generally, still idles in the high teens and easily hits 200 watts during processing. I will skip the significant ‘significant other’ factor of fan noise in this discussion.
All this blab, coin expenditure and my ‘way back’ post below, which was looking for an ultimate home automation server with AI smarts, MCU virtualization and the possibility of 100% local-only processing, finally bring me to my ‘point’, AKA what I recommend folks keep a sharp eye on (and adopt now if you are a bit of a leading-edge experimenter):
The current Apple M-silicon Mac Minis (I recommend the M2 or newer MCUs over the M1, due to the 16-bit floating-point hardware support of the M2 and above) are the machines that meet the objectives I cite: 100% local AI control, multi-VM capability, all at a power consumption well under 100 watts continuous.
The MCU architecture of Apple M silicon (ARM-based) today lets you run multiple Linux VMs with extremely high efficiency using the available UTM hypervisors. The open-source work to date, and the upcoming announcements, letting AI LLMs run efficiently on Apple’s ‘Metal’ (read: Apple’s CUDA) GPU and ML layers is as groundbreaking as NVIDIA’s CUDA was 6 to 7 years ago. Unless you are ‘off the grid’ powering your home automation and can ignore power cost (how many years till those panels are ‘paid off’?), the Apple hardware’s power abilities take the win.
I can virtualize with UTM to the same level as Proxmox can manage multiple VMs and LXCs (and an argument can be made it is ‘better’ today, thanks to Apple’s Rosetta software ability to emulate other CPUs, including Intel/AMD; yes it is slow today, but you rarely need it, as Linux is moving to 100% ARM code faster than almost any other code effort).
As I said, give it a look: the price/power point of a 32 GB Apple M2 or M3 Mac Mini is ‘it’ for your next Home Assistant home automation server.
My understanding is that, while the demo for the reSpeaker 2-mic HAT uses all the fancy features, the driver and source code supplied by seeed do not. I recall someone commenting in particular that the reSpeaker driver uses only one mic.
This disconnect between the advertised “features” and what is actually supported, and the fact that seeed stopped updating their drivers over 5 years ago, have led me to stop recommending any seeed product. Unfortunately the several clone products (such as the Adafruit one) also use the seeed driver.
I note that the ESP32-S3-BOX demo also oversells the device. Consider that these companies are not in the business of making and selling end products: these are demos to show the potential to system integrators. For this reason I am doubtful that Espressif will be keen to ramp up production of the ESP32-S3-BOX sufficiently to supply the demand from HA Voice Assist users.
Has anyone got this to work on a device other than the S3 box? During the livestream at the 25-minute mark, the ESPHome contributor who came up with this said it will work with any ESP32-S3 with PSRAM and a microphone at a minimum, BUT towards the very end one of the other devs said one of the main ESPHome devs had been having “fun” with the S3 because of the various models (not sure exactly what he meant) and asked people to post any boards or devices that work. I will just wait and see. I was thinking about the M5Stack S3 camera, which does have a mic, but I’m sure it’s terrible. At least it could be used for something else if it didn’t work.
I also see that Espressif came out with a new ESP32-S3 Korvo development board, but it’s hard to spend that much on something we don’t even know works. It seems like it would work great IF it works; you would think it would, as it has a 3.5mm output and 3 microphones, but I’m not dealing with returns to AliExpress if it doesn’t work with HA… I was just wondering because the S3 box variants are made by Espressif too; you just can’t get one (at retail at least). I think 2 of the 3 models are discontinued anyway.
When watching the livestream they showed the microWakeWord architecture, and it really seems like the GPU is used most in the last step, based on their conversation, or at least as much as I understood. That last step used audio of dinner parties recorded by someone, along with a lot of other random real-world audio without the wake word. I believe the only requirement is that the computer you generate the wake word on needs 2GB of RAM, though it probably takes longer with less. Without a GPU it took 2 days, and sticking a mid-level Nvidia card in shaved half a day off, so roughly a 12-hour difference. I have no idea what the developer’s computer specs or OS were though.
Regarding AI, the GPU performance gain really seems to depend on what you’re doing. Things change at a rapid pace, although Nvidia has an advantage. The guy in the YouTube link below ran every LLM available on a Raspberry Pi 5 with 8GB of RAM, with the OS on an external SSD via USB 3.0. Its response times are pretty ridiculous, but 8GB of memory is ideal for 7 billion parameters, so roughly 1GB of RAM per 1 billion parameters according to the documentation. If you don’t have enough RAM for the model size, it won’t load: he tried a 13-billion-parameter model and it wouldn’t work, while the same 7-billion-parameter model did. Regarding the Google Coral, I know it helps a ton if you use Frigate for facial detection, but I’m not sure what else uses it; I’m sure some integrations I don’t use do. It doesn’t help with LLMs, at least not the Coral, due to lack of VRAM. The guy in the video below tested it and said that was why it didn’t make any difference. That’s not to say a future model won’t improve LLM performance significantly though.
I built a voice pipeline using a custom integration added via HACS called “Extended OpenAI Conversation”; it allows you to define what you want to expose to OpenAI or any OpenAI-based LLM. The default allows only exposed entities (by exposed I mean exposed to a voice assistant), so you can fine-tune the results and tell it how to behave and answer. The nice thing is you can run 2 commands at once and the syntax doesn’t have to be 100 percent correct. One example was me saying “un pause shield” and my Shield TV started playing; I don’t have any scripts, switches or aliases with “un pause” in them, it just worked. I’ve mainly been using Assist Microphone (a supported add-on), which works with any USB microphone. I’m using a dedicated speakerphone and it works amazingly well; HA doesn’t have to listen 24/7, but obviously it has to plug into the HA server. Then there’s my Android phone and some M5Stack Atom Echos with TailBats (a small battery made by M5Stack), but I only turn those on when needed, or use the button instead of openWakeWord because it hogs resources. I’m looking into a local LLM, because API calls are only free up to a certain point, and it would be all local that way.
microWakeWord architecture
Raspberry pi 5 running and testing every LLM (according to the guy who posted the video).
I’ve spent a lot of time struggling to get this board to work. My issue is that it isn’t recognising its own onboard PSRAM… I’ve contacted the seller but…China.
Once I get it to see the PSRAM (which I think is simply an ESPhome code / config thing) it should work.
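For reference, this is the kind of explicit PSRAM declaration I mean (a sketch; `mode` and `speed` depend on the actual chip fitted, and octal is only a guess appropriate for 8MB R8-style parts):

```yaml
# Hedged sketch: declare the PSRAM explicitly so ESPHome/ESP-IDF probes it.
esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: esp-idf

psram:
  mode: octal   # quad on 2MB/4MB parts; octal is typical for 8MB (R8) parts
  speed: 80MHz
```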
In the meantime I’ve ordered some of these which apparently do work. I’ll be able to confirm once they arrive.
with Microphone only version of the firmware from here.
I have tried the firmware with the speaker, and it appears not to recognise the wake word with that firmware. I have not had a lot of time to play with the speaker yet. I have also ordered some N16R8 S3s, as the memory size is apparently important. This is not my field of expertise but I am making progress.
That is very similar to the one I’m trying to get working (ESP32-S3FH4R2) but with no luck. BigBobbas has been helping me but the ESPhome log shows the device not seeing that it has PSRAM…
UPDATE: I re-tried using the same GPIO as this example and it works now. There was obviously some strange conflict with the GPIO I had selected. So now I can safely say that the boards I linked earlier do work.
Thanks! The developer who mentioned that it didn’t work on every S3 with PSRAM did cite memory differences between models as one of the issues. I thought it might be that different versions had memory from different manufacturers or something (not my area of expertise either), but it sounds like he meant it requires a specific amount of PSRAM, since that board only has 2MB, with 8MB being ideal and possibly 4MB workable. I’ll stick to 8MB, as the price difference is maybe 2 dollars, if there is even an option to choose the same model with a different amount.
Also thanks to the other posters and links; it’s good to know the devkit and WROOM-1 appear to work as long as they have enough PSRAM. It really sounds like that’s the deciding factor, but obviously more boards need to be tested. They did mention to post any boards/models users get working. I’m guessing Discord is the place to post that information if you do get it working on a board that hasn’t already been confirmed. Thanks again!
I’ve no idea if this is an only-my-device thing, but the wake word appears to stop responding over time. This isn’t a new issue; it was happening on the old build without local wake word.
It seems to happen gradually and I have to restart the device. I’ve not seen it reported on the issue trackers, so I’m hedging more towards it being an issue with my S3 Box 3.
There’s also an audible “pop” every now and then. I assume this is the microphone becoming active and is normal, but it may or may not be related.
Well, I just ordered one. I’ll let everyone know how it works out. On paper it should work, but we all know that doesn’t always pan out. I just happened to search Amazon and they have them in the US store for the same price. My main issue was ordering from AliExpress and having to deal with a return if it didn’t work, but Amazon will take anything back, so if it doesn’t work I’ll just send it back for a refund. Only 7 left in stock. Not sure about the UK store.
If you live in the UK then you can buy an ESP32-S3-BOX-3 from my store.
I only have a limited amount and no idea how popular they are going to be. I hope people will understand it’s the best price I can do with all the effort that’s gone into the site etc.
I attempted to compile the code for the s3-box-3 and got the following compile termination, any ideas?
...
Compiling .pioenvs/esp32-voice-node-5a9788/src/esphome/components/micro_wake_word/micro_wake_word.o
Compiling .pioenvs/esp32-voice-node-5a9788/src/esphome/components/network/util.o
In file included from src/esphome/components/micro_wake_word/micro_wake_word.cpp:1:
src/esphome/components/micro_wake_word/micro_wake_word.h:19:10: fatal error: tensorflow/lite/core/c/common.h: No such file or directory
#include <tensorflow/lite/core/c/common.h>
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
*** [.pioenvs/esp32-voice-node-5a9788/src/esphome/components/micro_wake_word/micro_wake_word.o] Error 1
========================= [FAILED] Took 288.57 seconds =========================
Removing all the yaml and trying again fixed it, and it compiled. After adding my ESP32-S3 to HA, should I expect to be able to add it to my Assist pipeline at the bottom under wake word? It just says I don’t have a wake word engine set up yet.
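If I understand the on-device flow correctly, HA doesn’t need a wake word engine in the Assist pipeline at all when microWakeWord runs on the ESP: the device triggers the pipeline itself, roughly like this (a sketch; the model name is the stock “Okay Nabu” one and may differ in your config):

```yaml
# Hedged sketch: with on-device detection the ESP starts the pipeline,
# so the wake word engine slot on the HA side stays empty.
micro_wake_word:
  model: okay_nabu
  on_wake_word_detected:
    - voice_assistant.start:
```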
This is mostly not my work, and it could use some attention to detail regarding the included .h files used for Arduino by Espressif specifically for the Korvo-1. It works for the S3 Korvo-1 though.
I just got my Box 3; when it’s working I like it, though it seems to cut off or misunderstand words. For instance, when I say “turn on the den” I get an error saying something like “can’t find the device Din”.
My main concern, however, is that the wake word “Okay Nabu” only works around 20% of the time on the first try; I usually have to say it 3-5 times to get it to wake.
Still surprised about all the satellite talks when basically everyone has a mobile phone and most have tablets. So why spend cash on some satellites if all we really need is wake word support in the companion app.
Would even work on Android TVs, Android cars etc. etc.
No cost, lots of processing power and readily available everywhere.
I guess you do not have kids or other family members who do not always carry their phone everywhere (if they even have one, which most small kids surely do not). Even I, who carry my phone almost all the time, still use our existing Google Nest / Google smart speakers a lot for hands-free voice control.
I believe the most common use cases are when and where hands-free operation makes sense, for example in the kitchen while your hands are busy: controlling lights (brightness or on/off), setting and operating timers, reminders or alarms, adding items to shopping or to-do lists, and music controls.
Regardless, there are several use cases that appeal to mainstream users, and that is why Google and Amazon have each sold more than 500,000 Google Nest / Google Home and Amazon Echo / Alexa smart speakers so far.
Check out the results in this wish-list poll once you’ve done it yourself:
Not sure what you are trying to show me with that poll. It is about features, not hardware. And wake word support is one of the top priorities there.
I do not have kids but I do not need kids to know that I have multiple Android and iOS devices in my household. And I do not need to be the owner of the iOS devices to be able to use them to use voice control.
So the point is, that if the companion app supported wake words, anybody in my home could enter any room that has any Android or iOS device lying around and could give voice commands.
The device just needs to be in hearing distance.
And you are quoting sales for Echo, Alexa etc. with 500 k. 500 k units is not that much. And very few people here want to share all their data with big tech.
So a lot of people are buying more or less expensive satellite hardware. They are fun to play with but they have little future. They will lie around somewhere in a year or two because they are too bulky or too slow. Or because people realize that they need to buy one per room because they are not as mobile as all our Android and iOS phones and tablets.
So, sure, you can buy lots of dedicated hardware for a task that really just needs a mic and speaker. Or you could use what everybody owns; most people even have old devices lying around (I still have my Samsung S2 and S6 Edge). So my wife and I currently own more than 6 perfectly fine “satellites” in the form of phones and tablets, just waiting to be used for local voice control.
Cost for 6 ESP S3 boxes? A couple of hundred euros. For what? A big, bulky satellite with an inferior screen and sound compared to a mobile phone or tablet.
Alex, you do what works best for you. It’s great if you prefer to use the Android app rather than setup satellite devices. That is why there are several options.
Personally I find that getting my phone out, unlocking it, and starting the app before I can turn a light on or off is more painful than getting out of my chair and walking to the light switch. But speaking a command seems so easy… all I need is for it to work reliably.
The idea is to use wake words on the locked phone.
Or have devices like old phones or tablets remain unlocked.
Or speaking to my TV while I am watching.
You can already control all your devices with the locked phone by using the tiles. Now it is “just” necessary to add wake words
And the computing power of a 50 € tablet is much higher than that of a 50 € ESP device. The mic and speaker are also better. Imagine just hanging a bunch of Fire HD tablets on your walls and speaking to them; it would look much nicer than ESP devices and offer nice big screens and great touch control.
Does anyone know why the Home Assistant Voice Preview Edition hardware doesn’t seem to support microWakeWord?
I have both the Home Assistant Voice Preview Edition and an M5Stack Atom Echo. The M5 seems far more accurate at responding to wake words when using local wake word processing; it performs much worse using HA for wake word processing, on par with the Voice Preview Edition.
Both devices seem to have the same ESP32-S3 chip.
Actually, you can build your own ESP32-S3 with microphones and an audio amplifier for about 10€ (but you have to know how to solder), and the result is good. Spending more than 40€ is not worth it…
I’ve had a lot of luck with Wyoming Satellite on Android using Termux for on-device wake words. It works great on my Pixel 8a running a beta version of Android 16, and on the NSPanel Pro 120, which is lower-power ARM running Android 8, I believe. About the only issue was getting it to start at boot, so I just used a Tasker task to launch it on boot. You get to control the mic gain, which is nice, and phone mics are already meant for far-field use; most people have an older model sitting around somewhere. You don’t need the companion app installed either.